This paper from FCCM 2007 is written by Nicholas Moore (Northeastern), Albert Conti (Northeastern, now with Mitre Corporation), Miriam Leeser (Northeastern) and Laurie Smith King (Holy Cross).
It discusses Vforce (VSIPL++ FOr Reconfigurable Computing). Vforce is designed to allow the same application code to run on different reconfigurable computing platforms, and to permit the runtime binding of applications to hardware. The authors believe that application-level code needs to be separated from platform-specific code, so that no hardware-specific code is required in the application code. Vforce is based on VSIPL++, the Vector, Signal and Image Processing Library, which provides an object-oriented library of commonly used signal and image processing algorithms via a C++ API. VSIPL++ is a standard API designed to provide application-level portability between microprocessor-based platforms. The diagram below shows the abstraction layers of Vforce.
Under the VSIPL++ layer, we can see the software stack that supports the VSIPL++ software API. Some implementations are built on an ANSI C VSIPL implementation, which is in turn built on an implementation of the standard C library libc (and other standard libraries). In this case the libc implementation seems to be optimised for the PowerPC architecture. VSIPL++ may also sit atop libraries such as Mercury’s Scientific Algorithm Library (SAL). Vforce is above the VSIPL++ stack, but it also leverages a separate special purpose processor (SPP) stack. An SPP in this case is an FPGA, though it could be taken to mean any exotic processing technology. The FPGA-oriented SPP stack would normally consist of the RC platform vendor’s own API for accessing the hardware. To achieve their goal of portability and runtime binding, the authors have further divided up the SPP stack as shown below:
Depending on the availability of hardware, Vforce-implemented functions may run either in software (as Vforce or as VSIPL++ functions) or in hardware. For hardware implementations, the authors have implemented an Internal Programming Interface (IPI), which is hidden from the end user. This gives them a single standard interface that is translated to the underlying vendor APIs for different RC hardware. The application code is portable because it only contains calls to a generic hardware object. The dynamically linked shared object (DLSO) library holds the platform-specific API code. Hardware implementations of functions are provided in a processing kernel library for each hardware platform. The run-time resource manager (RTRM) hides hardware-specific information, manages hardware resources and binds application tasks to physical hardware. The RTRM also arbitrates when multiple Vforce applications run in parallel.
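To make the layering concrete, here is a minimal C++ sketch of the idea of a generic hardware object that is bound to a platform at run time. It is not Vforce code: the names HardwareBackend, MockPlatform and ResourceManager are hypothetical, the "platforms" are host-side mocks rather than vendor APIs, and a factory table stands in for the dynamic loading of a shared object.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the internal programming interface:
// the only hardware-facing interface that application-side code sees.
class HardwareBackend {
public:
    virtual ~HardwareBackend() = default;
    virtual std::string platform() const = 0;
    virtual void load_kernel(const std::string& kernel) = 0;
    virtual std::vector<float> run(const std::vector<float>& in) = 0;
};

// Mock platform library. A real one would wrap a vendor API; this one
// only records the kernel name and doubles the samples on the host.
class MockPlatform : public HardwareBackend {
public:
    explicit MockPlatform(std::string name) : name_(std::move(name)) {}
    std::string platform() const override { return name_; }
    void load_kernel(const std::string& kernel) override { kernel_ = kernel; }
    std::vector<float> run(const std::vector<float>& in) override {
        assert(!kernel_.empty());  // a kernel must be loaded before running
        std::vector<float> out(in);
        for (float& x : out) x *= 2.0f;
        return out;
    }
private:
    std::string name_;
    std::string kernel_;
};

// Hypothetical resource manager: knows which platform was detected and
// hands back a backend with the requested kernel already loaded.
class ResourceManager {
public:
    using Factory = std::function<std::unique_ptr<HardwareBackend>()>;
    explicit ResourceManager(std::string detected) : detected_(std::move(detected)) {}
    void add_platform(const std::string& name, Factory make) {
        factories_[name] = std::move(make);
    }
    // Returns nullptr when no library is registered for the detected platform.
    std::unique_ptr<HardwareBackend> bind(const std::string& kernel) const {
        auto it = factories_.find(detected_);
        if (it == factories_.end()) return nullptr;
        auto backend = it->second();
        backend->load_kernel(kernel);
        return backend;
    }
private:
    std::string detected_;
    std::map<std::string, Factory> factories_;
};
```

Code that calls bind("scale2") and then run() never names a platform, so the same call sequence works whichever mock the manager selects. That is the property the paper is after, shown here in miniature.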
Users are likely to wish to accelerate functions at a coarser level than the single VSIPL++ function level, to avoid the repeated initialisation and data transfer costs that come with the chaining of calls to VSIPL++ functions.
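A toy cost model shows why the coarser granularity helps. The sketch below is only illustrative: the StageCost struct, the two functions and the simplifying assumption that a fused kernel pays one setup, one inbound transfer and one outbound transfer (with intermediate data staying on the device) are mine, not figures or formulas from the paper.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-stage costs in microseconds.
struct StageCost {
    double setup_us;    // initialisation, e.g. configuring the device
    double in_us;       // host-to-device transfer
    double compute_us;  // time spent in the kernel itself
    double out_us;      // device-to-host transfer
};

// Each function is accelerated separately: every stage pays the full
// setup and both transfers.
double chained_cost(const std::vector<StageCost>& stages) {
    double total = 0.0;
    for (const StageCost& s : stages) {
        total += s.setup_us + s.in_us + s.compute_us + s.out_us;
    }
    return total;
}

// The stages are fused into one coarse kernel. Assumption: one setup
// (that of the first stage), one inbound transfer, one outbound
// transfer, and the compute times simply add up.
double fused_cost(const std::vector<StageCost>& stages) {
    assert(!stages.empty());
    double compute = 0.0;
    for (const StageCost& s : stages) compute += s.compute_us;
    return stages.front().setup_us + stages.front().in_us
         + compute + stages.back().out_us;
}
```

With three identical stages of {100, 40, 10, 40} microseconds, the chained version costs 570 and the fused version 210 under this model; the saving comes entirely from setup and transfer, not from compute.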
A nice feature is that Vforce processing objects catch exceptions coming from the hardware and invoke the software version, so that application code doesn’t need to catch hardware errors.
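The fallback behaviour can be sketched in a few lines of C++. This is an assumption-laden illustration rather than the Vforce implementation: ProcessingObject, Kernel, negate_sw and failing_hw are hypothetical names, and a std::runtime_error stands in for whatever the hardware layer actually throws.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

using Kernel = std::function<std::vector<float>(const std::vector<float>&)>;

// Hypothetical processing object: tries the hardware path first and
// falls back to software if the hardware path is missing or throws.
class ProcessingObject {
public:
    ProcessingObject(Kernel hardware, Kernel software)
        : hardware_(std::move(hardware)), software_(std::move(software)) {
        assert(static_cast<bool>(software_));  // a software version is mandatory
    }

    std::vector<float> operator()(const std::vector<float>& in) {
        if (hardware_) {
            try {
                return hardware_(in);
            } catch (const std::runtime_error&) {
                ++fallbacks_;  // swallow the hardware error, use software below
            }
        }
        return software_(in);
    }

    std::size_t fallbacks() const { return fallbacks_; }

private:
    Kernel hardware_;
    Kernel software_;
    std::size_t fallbacks_ = 0;
};

// Software version of a trivial kernel: negate every sample.
inline std::vector<float> negate_sw(const std::vector<float>& in) {
    std::vector<float> out(in);
    for (float& x : out) x = -x;
    return out;
}

// Mock hardware path that always fails, to exercise the fallback.
inline std::vector<float> failing_hw(const std::vector<float>&) {
    throw std::runtime_error("device unavailable");
}
```

The caller just invokes the object and gets a result either way, so no try/catch appears in application code. The fallbacks() counter is an extra added here so that the silent recovery can still be observed.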
To add support for more platforms, developers need an RTRM that will run on their RC’s operating system (there are POSIX and MCOE versions at present). They will also need a new hardware DLSO for their API that communicates with the FPGA-based element of the platform. A library of kernel bitstreams must also be produced for that specific platform.
The authors acknowledge two limitations of Vforce. One is that the overhead of Vforce adds to the computation time. However, this overhead is negligible in comparison to a bitstream load, and Vforce can in many cases avoid unnecessary bitstream loads. The second limitation is the reliance on a pre-built library of hardware implementations and processing objects.
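One simple way to avoid redundant bitstream loads is to remember what is already on the device. The summary above does not say how Vforce does this, so the following is only a guess at the general technique: BitstreamCache and ensure_loaded are hypothetical names, and the "load" is simulated by a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical tracker for a single reconfigurable device: skips the
// (expensive) load when the requested kernel is already configured.
class BitstreamCache {
public:
    // Returns true if a simulated load was performed, false if skipped.
    bool ensure_loaded(const std::string& kernel) {
        assert(!kernel.empty());  // an empty name means "nothing loaded"
        if (kernel == loaded_) return false;
        loaded_ = kernel;  // a real implementation would program the FPGA here
        ++loads_;
        return true;
    }

    std::size_t loads() const { return loads_; }

private:
    std::string loaded_;
    std::size_t loads_ = 0;
};
```

Repeated requests for the same kernel then cost one string comparison instead of a reconfiguration, which is the kind of trade that makes a small software overhead worthwhile.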
The paper presents two example applications. One is a 1D FFT implemented on a Cray XD1. The other is an adaptive beamforming application implemented on a Mercury 6U VME system.
The FFT implementation was limited by data transfer times and communication latencies, and did not outperform the software version, whether it was driven through the native Cray API or through Vforce. The results indicated that Vforce adds very little overhead compared to the native API, while providing ease of programming.
The beamforming application, where computation was partitioned between the host microprocessor and the FPGA compute node, showed speedups of between 2× and 200×, depending on the configuration of the application. Hardware processing time was again dominated by data transfer.
The authors stated they were implementing the same beamforming application on the Cray XD1 to illustrate portability. The authors also intend to tackle a wider range of RC platforms and applications, and to port existing VSIPL++ applications to Vforce.



