Exploring Field-Programmable Gate Arrays and High-Level Synthesis to perform numerical linear algebra operations for High Performance Computing.

Bibliographic Details
Main Author: Favaro, Federico (author)
Format: doctoralThesis
Language: English
Published: 2025
Subjects:
Online Access: https://hdl.handle.net/20.500.12008/52959
Description
Summary: Since the end of Dennard scaling, power consumption has become a primary constraint in computing hardware design because it no longer scales proportionally with transistor size. The push for increased performance has led manufacturers to develop multi-core devices, revolutionizing the field over the past few decades. Recently, the High Performance Computing field has focused on energy-efficient heterogeneous platforms, with Field-Programmable Gate Arrays (FPGAs) gaining attention due to their superior energy efficiency. Traditionally, FPGAs were excluded from mainstream software development due to the complex design process, but this is changing with the introduction of High-Level Synthesis (HLS) tools that generate hardware descriptions from standard programming languages like C++. Numerical Linear Algebra (NLA) spans several disciplines of science and engineering and is crucial in scientific computing. Solving problems in these areas often involves basic NLA operations, which are typically the most computationally expensive steps. This thesis focuses on developing and evaluating NLA kernels using HLS tools on modern FPGA-based platforms. To support this, we set up a laboratory for measuring the performance and energy consumption of heterogeneous platforms. This step involves the acquisition and installation of FPGA-based acceleration platforms, along with the necessary development tools and infrastructure. The thesis evaluates the performance of FPGAs for both dense and sparse NLA operations using floating-point arithmetic, comparing the results against multi-core CPUs. The study begins with a review of the state of the art in HLS implementations on FPGAs. It then explores developments in two of the most prominent dense linear algebra kernels: matrix-vector multiplication (GEMV) and matrix-matrix multiplication (GEMM). Additionally, the thesis focuses on one of the key kernels in sparse linear algebra, sparse matrix-vector multiplication (SpMV).
The findings indicate that while CPUs often outperform FPGAs in dense operations, FPGAs offer competitive energy efficiency. For sparse operations, FPGAs not only deliver competitive performance but also consistently demonstrate superior energy efficiency. A key aspect of maximizing kernel efficiency is exploring analytical tools to understand performance and guide the optimization process. The roofline model (RLM) is a well-known graphical tool that facilitates the analysis of computational performance and identifies primary bottlenecks. We propose a novel extension of the RLM for FPGAs to assist in the optimization of runtime and energy consumption for NLA kernels based on sparse blocked storage formats. We apply this extended model to our SpMV kernel with a block-sparse storage format. The final effort in this thesis focuses on the analysis and extension of AMD’s Vitis library for the SpMV operation, recognized as state of the art in the field. Several optimizations and matrix preprocessing techniques were explored, leading to significant performance gains. Among these extensions, techniques were introduced that combine smaller block sizes (to minimize padding) with the concurrent processing of multiple blocks (to compensate for the reduced data processed per block). For single precision, the optimizations achieved an average speedup of 1.5× and a maximum of 3.7×. Furthermore, a custom SpMV kernel optimized for band matrices was developed, delivering substantial performance improvements over the Vitis kernel for this kind of matrix.
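
As a rough illustration of why padding matters in a roofline analysis, the sketch below applies the classic roofline bound (attainable performance is the smaller of the compute peak and memory bandwidth times arithmetic intensity) to a blocked SpMV. It is not the extended model proposed in the thesis. The function names, the traffic model (one value plus one index per stored entry, vector traffic ignored), and the hardware figures are all assumptions chosen for illustration.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Classic roofline bound: min(compute roof, memory roof)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def blocked_spmv_intensity(nnz, padding, value_bytes=4, index_bytes=4):
    """Arithmetic intensity (FLOP/byte) of SpMV on a padded blocked format.

    Assumed traffic model: every stored entry (true nonzero or padding zero)
    moves one value and one index; traffic for the x and y vectors is ignored.
    Only the 2 * nnz useful floating-point operations are counted.
    """
    useful_flops = 2 * nnz                       # one multiply and one add per nonzero
    bytes_moved = (nnz + padding) * (value_bytes + index_bytes)
    return useful_flops / bytes_moved


# Hypothetical platform figures, not measurements from the thesis.
PEAK_GFLOPS = 500.0
BANDWIDTH_GBS = 100.0

no_padding = attainable_gflops(
    PEAK_GFLOPS, BANDWIDTH_GBS, blocked_spmv_intensity(nnz=1000, padding=0))
half_padding = attainable_gflops(
    PEAK_GFLOPS, BANDWIDTH_GBS, blocked_spmv_intensity(nnz=1000, padding=1000))

print(no_padding, half_padding)
```

Under these assumptions the kernel is memory-bound in both cases, and doubling the stored entries through padding halves the attainable useful performance. That is consistent with the motivation given above for reducing block sizes to minimize padding.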