FPGAs for Accelerating HPC Engineering Workloads – the Why and the How

Type

Whitepaper

Description

Running high-performance workloads on Field Programmable Gate Arrays (FPGAs) has been explored but has yet to demonstrate widespread success. Software developers have traditionally felt a significant disconnect from the knowledge required to exploit FPGAs effectively, which involved esoteric programming technologies, long build times, and a lack of familiar software tooling. Furthermore, for the few developers who did invest time and effort in FPGAs, the hardware historically struggled to compete with the latest generation of CPUs and GPUs in terms of Floating Point Operations per Second (FLOPS).

In addition to significant developments in FPGA hardware over the past several years, there have also been large improvements in the software ecosystem. High-Level Synthesis (HLS) is a key aspect: it not only enables developers to write code in C or C++ that is synthesised down to the underlying Hardware Description Language (HDL), but also allows programmers to reason about their FPGA code at the algorithmic level. Moreover, AMD Xilinx and Intel, the two major FPGA vendors, have each built a software environment around HLS, Vitis and Quartus Prime respectively. These environments not only automate lower-level aspects but also provide some profiling and debugger support. Whilst these are significant advances, they are not a silver bullet for easy exploitation of the technology, not least because, although the tooling reliably generates correct code, programs that are still structured like the original CPU code are seldom fast.

Consequently, programmers must adopt a different style of programming for FPGAs. An important question is whether FPGAs can follow the same trajectory in the HPC community as GPUs have over the past 15 years, from highly specialist hardware that shows promise for a small number of applications to widespread use. If this is to occur, then not only are support and leadership required from the FPGA community, but two general questions must also be answered convincingly. Firstly, why would one ever choose to accelerate an HPC application on an FPGA rather than on other hardware? Secondly, how can we best design our high-performance FPGA codes algorithmically so that they are fast by construction?

In this white paper, we use the lessons learnt during the EXCELLERAT CoE to explore these two questions and use five diverse HPC kernels to drive our discussion. These are:

  • MONC PW advection kernel, which calculates the movement of quantities through the air due to kinetic forces. MONC is a popular atmospheric model in the UK and advection represents around 40% of the runtime. It is stencil based and was previously ported to the AlphaData ADM8K5 in [1] and [2], and to an Alveo U280 in [3].
  • Nekbone AX kernel, which applies the Poisson operator as part of the Conjugate Gradient (CG) iterative method. Nekbone is a mini-app which represents the principal computational structure of the highly popular Nek5000 application [4], and is also representative of many other Navier-Stokes-based workloads. This has been ported to both the Xilinx Alveo [5] and Intel Stratix FPGAs [6].
  • Alya incompressible flow matrix assembly engine, which accounts for around 64% of the model runtime for Alya benchmarks. Alya itself is a high-performance computational mechanics code used to solve complex coupled multi-physics, multi-scale, and multi-domain problems, and is used extensively in industrial engineering simulation. This incompressible flow engine was ported to the Alveo U280 in [7].
  • Himeno benchmark, which measures the performance of a linear solve of the Poisson equation using a point-Jacobi iterative method. A popular benchmark, it has been ported to Xilinx Alveo U280 [8] and Intel [9] FPGAs.
  • Credit Default Swap (CDS) engine, used for calculating financial credit risk. Based on the industry-standard QuantLib CPU library, we have ported this engine to the FPGA [10]. Furthermore, Xilinx have developed an open-source version in their Vitis library [11].

Web-URL

Access the whitepaper on this link. 

 

Acknowledgement 

The authors of this white paper would like to thank the ExCALIBUR H&ES FPGA testbed for access to computing resources used in this work. This work was funded under the EU EXCELLERAT CoE, grant agreement number 823691.

License

Public