Machine learning algorithms are extremely computationally intensive and time-consuming when they must be trained on large amounts of data. Typical processors are not optimized for machine learning applications and therefore offer limited performance. As a result, both academia and industry are focused on developing specialized architectures for the efficient acceleration of machine learning applications.
One of the most efficient and widely used ML algorithms of the last few years is XGBoost, an open-source software library that provides a gradient boosting framework.
FPGAs are programmable chips that can be configured with tailor-made architectures optimized for specific applications. Because they are tailored to specific tasks, FPGAs offer higher performance and lower energy consumption than general-purpose CPUs or GPUs. They are widely used in domains such as image processing, telecommunications, networking, automotive, and machine learning.
Recently, major cloud and HPC providers like Amazon AWS, Alibaba, Huawei and Nimbix have started deploying FPGAs in their data centers. However, there are currently few cases of wide FPGA adoption in the domain of machine learning.
The FPGA-accelerated solution for the XGBoost algorithm is based on the Exact (Greedy) algorithm for tree creation. It provides up to 26x speedup compared to single-threaded CPU execution and up to 5x compared to 8-threaded CPU execution. The acceleration is attained by exposing parallelism and reusing data across the feature dimension of the dataset.
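For context, the Exact (Greedy) algorithm enumerates every candidate split of every feature and keeps the one with the highest gain. The sketch below illustrates the per-feature scan in plain Python; the function name, the `lam` regularization parameter, and the simplified gain formula are illustrative assumptions, not the accelerator's implementation.

```python
def best_split(values, grads, hess, lam=1.0):
    """Exact greedy scan over one feature.

    Sorts the samples by feature value, then sweeps left-to-right,
    accumulating left-side gradient/hessian sums and scoring each
    candidate split. Returns (best_gain, split_value).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(grads), sum(hess)          # totals for the whole node
    gl = hl = 0.0                          # running left-side sums
    best_gain, best_value = 0.0, None
    for k in order[:-1]:                   # last position cannot split
        gl += grads[k]
        hl += hess[k]
        gr, hr = G - gl, H - hl
        # Simplified structure-score gain (constant factors omitted)
        gain = gl * gl / (hl + lam) + gr * gr / (hr + lam) - G * G / (H + lam)
        if gain > best_gain:
            best_gain, best_value = gain, values[k]
    return best_gain, best_value
```

The accelerator performs this scan for many features in parallel, whereas the sketch above processes a single feature sequentially.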
The accelerator accumulates the gradients for each feature, calculates the possible splits, and keeps the best split for each node. To avoid frequent accesses to the FPGA's DDR RAM, up to 65,536 entries are loaded into BRAM inside the accelerator. Likewise, to hold the accumulated gradients and the best calculated split of each node, up to 2,048 nodes are kept in BRAM. To accumulate values with a minimal initiation interval, the floating-point values are converted to fixed-point arithmetic, with negligible change to the results.
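The fixed-point trick can be illustrated in plain Python. The 16 fractional bits below are an assumed width for illustration, not necessarily what the IP core uses; the point is that accumulation in the integer domain is exact and avoids the multi-cycle latency of a floating-point adder.

```python
FRAC_BITS = 16          # assumed fractional width; the core's format may differ
SCALE = 1 << FRAC_BITS

def to_fixed(x: float) -> int:
    """Quantize a float to a fixed-point integer."""
    return int(round(x * SCALE))

def from_fixed(v: int) -> float:
    """Convert a fixed-point integer back to a float."""
    return v / SCALE

# Accumulating in the integer domain is exact and order-independent,
# which lets the hardware add one value per clock cycle.
grads = [0.125, -0.5, 0.0625]
total = from_fixed(sum(to_fixed(g) for g in grads))
```

Values that are not exactly representable in the chosen width are rounded at conversion time, which is the source of the negligible difference in results mentioned above.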
The necessary software that integrates the accelerator with the XGBoost library is also provided. A new tree method called fpga_exact is added, which uses our updater together with the pruner.
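With that integration in place, selecting the accelerator is just a parameter change. The snippet below is a sketch that assumes InAccel's XGBoost fork is installed; apart from `tree_method`, the parameters shown are ordinary XGBoost options chosen for illustration.

```python
params = {
    "tree_method": "fpga_exact",  # InAccel's FPGA-backed updater + pruner
    "max_depth": 6,               # ordinary XGBoost parameters are unchanged
    "eta": 0.1,
}
# Training is otherwise identical to stock XGBoost, e.g.:
# booster = xgboost.train(params, dtrain, num_boost_round=100)
```

No other code changes are needed; the rest of the training pipeline stays as it is.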
The XGBoost IP core leverages the processing power of Xilinx FPGAs. It is optimized for Xilinx FPGAs such as the Alveo U200 and U250 cards, as well as the FPGAs available as instances on cloud providers (f1 on AWS and f3 on Alibaba Cloud).
The release of the XGBoost IP core will help demonstrate the advantages of FPGAs in the domain of machine learning, and it offers the data science community the chance to experiment with, deploy, and utilize FPGAs to speed up their machine learning applications.
InAccel offers all the required APIs for seamless integration with Python, Java and Scala, which means that data scientists and data engineers do not need to change their code at all. In addition, through its unique FPGA Resource Manager, it allows instant scaling to multiple FPGA boards.
The IP core is available on: https://github.com/InAccel/xgboost