Introduction
In this article, we introduce a novel framework that allows the seamless integration of FPGAs with the Apache Arrow development platform. Integrating FPGAs with Apache Arrow-compatible frameworks enables the acceleration of data science applications without any prior FPGA experience.
We present a prototype in Java that allows Apache Arrow-enabled frameworks to communicate seamlessly with FPGAs. First, we briefly explain the objectives of our implementation and the specific technical obstacles we had to address. Then, we describe our pipeline for sharing and transferring data between the CPU and FPGAs.
Motivation
Over the last few years, machine learning has gained unprecedented attention across multiple industries. Initially, machine learning jobs were executed mainly on general-purpose processors. However, as ML is extremely computation-intensive, specialized hardware is required to process the huge amounts of data efficiently. Several tech giants are shifting towards specialized hardware solutions (a recent example is Google's Tensor Processing Unit).
Maintaining an on-premise FPGA data center is a tedious process, which stymied the wide adoption of FPGAs until major cloud providers decided to incorporate them into their services. Hyperscale public cloud providers like AWS, Alibaba Cloud, Baidu, and Huawei have recently deployed FPGAs in their data centers. Because this change is so recent, most frameworks are unprepared for interoperability with FPGAs. Our vision is to bridge such frameworks with the world of FPGAs to enable further machine learning innovation.
What is Apache Arrow?
Recently, Apache Arrow has gained significant attention in the domain of in-memory data frameworks.
Apache Arrow specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. The Arrow memory format supports zero-copy reads for efficient data access without serialization overhead. (Apache Software Foundation)
In simple terms, Arrow defines an efficient way to store data in memory, which enables significant optimizations while processing that data (I will not go into detail on the benefits of columnar storage, but if you are interested you can check this detailed article). Additionally, its strictly defined memory format enables interoperability between different systems and the state-of-the-art frameworks widely used by the data science community.
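To make this concrete, here is a minimal example using Arrow's Java API (the language of our prototype): it builds a small column of doubles whose values end up in a single contiguous buffer. The column name and values are, of course, just for illustration.

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.Float8Vector;

public class ArrowColumnExample {
    public static void main(String[] args) {
        // Every Arrow buffer is tracked by an allocator.
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             Float8Vector features = new Float8Vector("features", allocator)) {
            features.allocateNew(4);
            features.set(0, 1.5);
            features.set(1, 2.5);
            features.set(2, 3.5);
            features.set(3, 4.5);
            features.setValueCount(4);
            // The values live in one contiguous, columnar buffer, so any
            // consumer that knows its address can read them without
            // serialization.
            System.out.println("column: " + features);
        }
    }
}
```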
Why Apache Arrow + FPGAs?
Currently, Apache Arrow has focused primarily on CPUs and GPUs. Our goal is to take a step towards enabling FPGAs to utilize and access vectorized data, considering that FPGAs and SIMD processing arrays share many architectural features. FPGAs (or Xilinx's Adaptive Compute Acceleration Platforms, ACAPs) offer the advantage of tailor-made architectures for specific applications, providing much better performance and higher performance per watt.
InAccel, a leader in FPGA acceleration-as-a-service, was the first company to offer acceleration of Apache Spark ML on AWS using Xilinx's F1 FPGA instances, delivering more than a 15x speedup for machine learning applications. Following our recent integration with Apache Spark, we introduce our prototype integration with Apache Arrow in Java.
Problem Description
Let’s assume that we want to train a machine learning model using an FPGA cluster. We know that FPGAs perform better than general-purpose hardware when they are assigned a well-defined task to execute. Data scientists, however, process their data in memory: when they schedule a computational task, the CPU simply executes it while reading the data from its native memory. An FPGA kernel, on the other hand, has to bring the data into its own memory bank before the execution can proceed.
In the best-case scenario, a DMA operation between the FPGA and the memory where the data reside suffices for the FPGA execution to take place. But that is not always the case. We must make sure to minimize any copy overheads incurred during the process while still being able to access Arrow-backed data frames from FPGAs in an efficient manner.
System Design Decisions
The core of our acceleration pipeline relies heavily on InAccel’s Coral FPGA manager, which is responsible for handling acceleration requests and scheduling them to the optimal hardware resources available. The FPGA manager tells the FPGAs which task they need to execute and where the required data reside. Specifically, each request is serialized and then transmitted via TCP sockets.
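For illustration, the sketch below shows what such an exchange could look like in Java. The port, the field layout, and the kernel name are assumptions made for this example; they are not Coral's actual wire protocol.

```java
import java.io.DataOutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class AccelerationRequestSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical request: which kernel to run and where its input
        // data live in shared memory. Field layout and port are
        // illustrative, not Coral's actual protocol.
        String kernel = "LogisticRegression";
        String sharedMemoryFile = "/dev/shm/inaccel_column_0";
        long length = 4096;

        try (Socket socket = new Socket("localhost", 55555);
             DataOutputStream out = new DataOutputStream(socket.getOutputStream())) {
            byte[] kernelBytes = kernel.getBytes(StandardCharsets.UTF_8);
            byte[] fileBytes = sharedMemoryFile.getBytes(StandardCharsets.UTF_8);
            out.writeInt(kernelBytes.length);  // length-prefixed strings
            out.write(kernelBytes);
            out.writeInt(fileBytes.length);
            out.write(fileBytes);
            out.writeLong(length);             // size of the shared buffer
            out.flush();
        }
    }
}
```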
Ideally, we would like to communicate data addresses between the user’s process, InAccel’s FPGA manager, and the actual FPGAs. Shared memory gave us the zero-copy data communication we were after: we memory-map such addresses to files under Linux’s well-known “/dev/shm” directory, so that all of the above components have access to the same memory area where the data reside.
Therefore, to communicate Arrow-backed data to the FPGAs, we need to find the address where each Arrow column resides. We can then memory-map this address to a uniquely identified file in /dev/shm and notify our manager accordingly.
We tweaked Apache Arrow’s implementation to behave differently when InAccel metadata are passed. In that case, allocating a column memory-maps its starting address to a shared-memory file, which the FPGAs subsequently access.
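The sketch below illustrates the underlying mechanism in plain Java: backing a buffer with a memory-mapped file under /dev/shm, so that every process mapping the same file sees the same physical pages. The file name is a hypothetical placeholder; the actual implementation hooks into Arrow's allocation path rather than creating buffers by hand.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SharedMemoryColumn {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; the real implementation derives a
        // unique identifier per Arrow buffer.
        Path shm = Path.of("/dev/shm/inaccel_column_0");
        long size = 4096; // one page, for illustration

        try (FileChannel channel = FileChannel.open(shm,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Backing the buffer with a file under /dev/shm means the same
            // physical pages are visible to the user process, the FPGA
            // manager, and the FPGA runtime -- no copies needed.
            MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buffer.putDouble(0, 1.5); // column data lands directly in shared memory
        }
    }
}
```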
Page-alignment Obstacle
To force the memory mapping of a file to a specific address, Linux requires that the supplied address be page-aligned. Arrow’s implementation did not support such an option, which led us to pad the memory space until we had page-aligned data buffers. That way, we ensured that all data buffers with InAccel metadata were page-aligned. Naturally, non-InAccel Arrow columns are not affected in any way.
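The padding computation itself is simple bit arithmetic; here is a small sketch, assuming the typical 4 KiB Linux page size.

```java
public class PageAlignment {
    // Typical Linux page size; a real implementation would query it at runtime.
    private static final long PAGE_SIZE = 4096;

    // Round an address up to the next page boundary; the bytes in between
    // are the padding the tweaked allocator inserts.
    static long alignUp(long address) {
        return (address + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
    }

    public static void main(String[] args) {
        long raw = 123_456_789L;
        long aligned = alignUp(raw);
        System.out.printf("raw=%d aligned=%d padding=%d%n",
                raw, aligned, aligned - raw);
    }
}
```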
You can find a Logistic Regression use case, with InAccel’s FPGA manager communicating with Apache Arrow, here.
Future directions
The majority of Arrow use cases pertain to submitting PySpark jobs, in order to forgo the serialization overheads between the user’s Python environment and the Spark executor running on the JVM. Therefore, we are currently working on extending our integration to the C++ implementation, as well as to trending data science conversions (e.g., from pandas to Arrow). Luckily, the memory allocation implemented in C++ supports page-alignment of data, which will significantly facilitate our integration.
Finally, in the next part of our series we will demonstrate a use case of submitting Spark jobs through PySpark with Arrow-backed pandas DataFrames, with FPGAs handling the workload under the hood.