A new ecology of computing power, with insights into the opportunities and challenges of heterogeneous computing

2022-11-05 14:30:00
Wan Jia
Summary: While computing power is becoming increasingly important, its development faces a contradiction between supply and demand. The industry has put forward various ideas and methods to solve the computing power bottleneck, among which heterogeneous computing has gradually stood out and is highly anticipated by enterprises and the industry.

Computing power is fuelling economic growth and becoming a new engine for the development of the digital economy. In April this year, the "2021-2022 Global Computing Power Index Assessment Report", jointly released by IDC, Inspur Information, and Tsinghua University's Global Industry Research Institute, showed that every 1-point increase in the Computing Power Index lifts the digital economy and GDP by 3.5‰ and 1.8‰ respectively. The White Paper on China's Computing Power Development Index, published by the China Academy of Information and Communications Technology, shows that every 1 RMB invested in computing power drives 3-4 RMB of economic output, and that every 1-point increase in the Computing Power Development Index raises GDP by approximately 129.3 billion RMB.


While computing power is becoming increasingly important, its development faces a contradiction between supply and demand. On the one hand, the demand for computing power is growing rapidly: the digital transformation of enterprises and the growing consumption of smart terminals and mobile data traffic continue to unleash demand. On the other hand, traditional single computing architectures face performance and power bottlenecks and cannot meet the rising demand. In short, computing power has hit a bottleneck, and this has become a pressing problem for enterprises and the industry.

I. Heterogeneous computing stands out

The industry has put forward various ideas and methods to solve the computing power bottleneck, among which heterogeneous computing has gradually stood out and is highly anticipated by enterprises and the industry.


Heterogeneous computing refers to a mode of computing in which a system is composed of computing units with different instruction sets and architectures. It is widely used in cloud data centers and edge computing scenarios.
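To make the definition concrete, here is a minimal sketch, using the SYCL/DPC++ API that this article discusses later, that enumerates the heterogeneous computing units visible on a machine. The output depends entirely on the installed hardware and runtimes.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // List every compute device the oneAPI runtime can see: CPUs, GPUs,
    // FPGA (emulation) devices, and other accelerators all show up here.
    for (const auto& dev : sycl::device::get_devices()) {
        std::cout << dev.get_info<sycl::info::device::name>() << " ["
                  << (dev.is_gpu() ? "GPU" : dev.is_cpu() ? "CPU" : "accelerator")
                  << "]\n";
    }
    return 0;
}
```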


The rise of heterogeneous computing is superficially related to power bottlenecks, but at a deeper level it is closely tied to workloads. Although general-purpose CPUs are widely used, after more than 30 years of development, the traditional approach of increasing computing power by raising CPU clock frequencies and core counts has run into heat dissipation and energy consumption bottlenecks.


Moreover, since 2020, applications such as telecommuting, online learning, and home entertainment have grown further under the impact of the pandemic, stimulating diverse demand for technologies such as big data, cloud computing, and artificial intelligence and accelerating digital transformation across industries. Application scenarios such as high-performance computing, cloud computing and virtualization, and big data analysis bring highly complex workloads, which require strong computing power to support them.


"Workloads drive heterogeneous computing," said Gao Ming, Technical Director of the Internet Industry, Industry Solutions Division at Intel. With data volumes growing ever larger, multiple heterogeneous computing units are needed to accelerate data processing and deliver higher throughput and lower latency at lower cost.


Compared with traditional single computing architectures, heterogeneous computing not only improves computing power and performance while reducing power consumption and cost, but can also handle multiple types of tasks, giving it great development potential. Specifically, heterogeneous computing can exploit the flexibility of CPUs and GPUs for general-purpose computing, responding to data processing needs in a timely manner, while combining them with specialized devices such as FPGAs and ASICs to exploit the performance of co-processors and allocate computing resources according to specific needs. Moreover, as neural network algorithms and their corresponding computing architectures continue to multiply, constantly replacing the ASIC architectures that ultimately reach users and enterprises would incur high usage and replacement costs. By contrast, heterogeneous computing is cheaper and has greater advantages in industrial implementation.
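As an illustration of "allocating computing resources according to specific needs", the following hypothetical dispatcher (the workload names are invented for this sketch) routes each class of work to the device that suits it best and falls back to the general-purpose CPU when no accelerator is present:

```cpp
#include <sycl/sycl.hpp>
#include <string>

// Hypothetical dispatcher: route each workload class to the unit that
// handles it best, falling back to the CPU when no accelerator exists.
sycl::queue queue_for(const std::string& workload) {
    try {
        if (workload == "dense-matrix")    // e.g. deep-learning kernels
            return sycl::queue{sycl::gpu_selector_v};
        if (workload == "fixed-pipeline")  // e.g. lookups, transcoding
            return sycl::queue{sycl::accelerator_selector_v};
    } catch (const sycl::exception&) {
        // Requested device is not present on this machine.
    }
    return sycl::queue{sycl::cpu_selector_v};  // general-purpose fallback
}
```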


Building on these advantages, Gao Ming summarized six application scenarios for heterogeneous computing:

  • First, HPC (high-performance computing) scenarios, including automotive and aerospace modeling and simulation, electronic design automation and verification, life sciences, etc.
  • Second, artificial intelligence scenarios. Both deep learning training and deep learning inference require large numbers of matrix operations, especially in large-scale Internet applications such as recommendation and advertising.
  • Third, IoT and edge computing scenarios, where massive amounts of data must be processed at the edge or in the cloud, so online inference tasks require large amounts of edge and cloud computing power for acceleration.
  • Fourth, 5G and communication scenarios. Although some network functions run on CPUs as software (NFV), some algorithms still require heterogeneous accelerators (e.g., FPGAs or ASICs) for acceleration.
  • Fifth, multimedia processing and cloud gaming scenarios. In HD video transcoding, video image rendering, and image super-resolution, heterogeneous computing power is indispensable for achieving high throughput and low latency.
  • Sixth, cloud computing scenarios, where cloud providers are gradually deploying more heterogeneous accelerators to accelerate computing, networking, and storage, enabling cloud platforms to deliver higher performance at lower cost and meet infrastructure management needs.


To advance the implementation of heterogeneous computing, architectures such as CPU+GPU, CPU+FPGA, and SVMS have emerged, making full use of the computing power of both CPUs and GPUs to improve processing performance and effectively reduce power consumption. The SVMS architecture was introduced by Intel, which in 2018 presented its XPU vision: using multiple computing architectures to fully meet complex computing needs. Specifically, SVMS comprises Scalar, Vector, Matrix, and Spatial architectures, allowing combinations of heterogeneous processors to achieve high performance across multiple workloads.

II. Boosting performance, reducing costs, and increasing efficiency, how Kwai implements heterogeneous computing

Whether CPU+GPU or CPU+FPGA, the value of heterogeneous computing can only be realized when it is implemented in real business scenarios. As a short-video app with over 300 million daily active users, Kwai's heterogeneous computing practice is quite representative.


It is understood that Kwai's recommendation system faces huge performance challenges in a large-scale, complex business. As a short-video content platform, content production, understanding, distribution, consumption, and user interaction together constitute a large-scale, complex business, creating increasingly diverse demands on computing power. To break through the computing power bottleneck, Kwai launched the LaoFe NDP (Latency-oriented FPGA engine for Near-Data Processing) architecture, which enables heterogeneous computing, accelerates computation across different scenarios, and achieves optimal execution performance on Intel hardware.


Take the recommendation business scenario as an example: it needs to recommend content of interest to users based on their profiles. The first step is to select results related to the user's characteristics from a large amount of information, and then prioritize the content by ranking. How can this task be completed efficiently and accurately? The parameter server is crucial, as it is responsible for storing and processing vast numbers of data features and the parameters of the ranking model.
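A minimal sketch of this retrieve-then-rank flow follows; the structures and the dot-product scoring below stand in for Kwai's real models, which the article does not detail.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Illustrative two-stage flow only; the names and the scoring model are
// invented for this sketch, not Kwai's actual system.
struct Video { long id; std::vector<float> features; };

std::vector<Video> recommend(const std::vector<float>& user_profile,
                             const std::vector<Video>& candidates,
                             size_t top_k) {
    // Stage 1 (retrieval) is collapsed here: every candidate is scored.
    // Stage 2 (ranking): a dot product stands in for the ranking model
    // whose parameters live on the parameter server.
    std::vector<std::pair<float, Video>> scored;
    for (const auto& v : candidates) {
        float score = 0.0f;
        for (size_t i = 0; i < user_profile.size() && i < v.features.size(); ++i)
            score += user_profile[i] * v.features[i];
        scored.push_back({score, v});
    }
    size_t k = std::min(top_k, scored.size());
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                      [](const auto& a, const auto& b) { return a.first > b.first; });
    std::vector<Video> result;
    for (size_t i = 0; i < k; ++i)
        result.push_back(scored[i].second);
    return result;
}
```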

To cope with the impact of massive data, Kwai's recommendation system adopts an architecture that separates computation from storage. The parameter server is a storage service that stores and updates hundreds of millions of user profiles, billions of short-video features, and hundreds of billions of ranking model parameters in real time. Limited by capacity and bandwidth, it has to support hundreds of millions of KV requests per second, which consumes a great deal of CPU resources and is the main bottleneck in its performance.


The best solution to this problem is heterogeneous computing: using different computing devices to handle different loads. Built on Intel® Xeon® Scalable processors, Intel® Agilex™ FPGAs, and Intel® Optane™ persistent memory, Kwai's LaoFe NDP near-data processing architecture innovates with an integrated hardware-software computing design and domain-specific accelerators, enabling triple acceleration of network, storage, and compute and providing low latency, high concurrency, high throughput, and low total cost of ownership for each business system.

[Figure] Kwai LaoFe NDP heterogeneous computing architecture

In my opinion, the triple acceleration of network, storage, and computing truly demonstrates the value that heterogeneous computing brings.


At the network level, the LaoFe NDP architecture offloads network data operations from the CPU to the FPGA, and request packets sent by the client go directly to the FPGA. gRPC runs on the TCP/IP network protocol stack, which is too heavyweight to guarantee the required performance and latency. With an FPGA-based implementation of the SD-RDMA protocol, the application layer adds fields to provide gRPC-like reliable transfer while significantly reducing request latency.
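The article does not publish the SD-RDMA wire format, but a hypothetical header like the one below illustrates the kind of application-layer fields (all invented for this sketch) that can restore TCP-like reliability on top of a leaner transport:

```cpp
#include <cstdint>

// Hypothetical wire header for an RDMA-style transport that re-adds the
// reliability bookkeeping gRPC gets from TCP. The article only says the
// application layer "adds fields" for reliable transfer; these particular
// fields are an illustrative guess, not the SD-RDMA specification.
#pragma pack(push, 1)
struct SdRdmaHeader {
    uint32_t magic;        // protocol identifier / version
    uint64_t request_id;   // matches responses to outstanding requests
    uint32_t seq_no;       // detects loss and reordering
    uint32_t payload_len;  // length of the KV request that follows
    uint32_t checksum;     // integrity check over the payload
};
#pragma pack(pop)
```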


At the storage level, the LaoFe NDP architecture likewise offloads storage operations from the CPU to the FPGA. To maximize the FPGA's capabilities, Kwai customized a KV (key-value) engine that is easy for the FPGA to access, based on a generic KV storage scenario. It supports SSD, Intel® Optane™ persistent memory, and DRAM, and uses a hash-based key-value storage engine to accelerate storage performance. The throughput of KV lookups has proven to be more than five times that of CPU-based solutions.
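As a rough illustration of why a hash-based KV engine suits FPGA offload, here is a minimal open-addressing table in plain C++; the fixed probe loop and flat memory layout are the kind of simple, regular pipeline that maps well onto hardware. Types, sizing, and the probing scheme are assumptions for the sketch, not Kwai's engine.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Minimal open-addressing hash KV store. The key doubles as its own hash
// here to keep the sketch short; a real engine would hash properly.
class HashKvEngine {
    struct Slot { uint64_t key = 0; uint64_t value = 0; bool used = false; };
    std::vector<Slot> slots_;
public:
    explicit HashKvEngine(size_t capacity) : slots_(capacity) {}

    bool put(uint64_t key, uint64_t value) {
        for (size_t i = 0; i < slots_.size(); ++i) {  // linear probing
            Slot& s = slots_[(key + i) % slots_.size()];
            if (!s.used || s.key == key) { s = {key, value, true}; return true; }
        }
        return false;  // table is full
    }

    std::optional<uint64_t> get(uint64_t key) const {
        for (size_t i = 0; i < slots_.size(); ++i) {
            const Slot& s = slots_[(key + i) % slots_.size()];
            if (!s.used) return std::nullopt;  // empty slot ends the probe
            if (s.key == key) return s.value;
        }
        return std::nullopt;
    }
};
```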


On the compute side, LaoFe NDP's acceleration relies on FPGAs as domain-specific processors that process data in parallel more efficiently, providing a more efficient memory hierarchy and customized execution units to support scenarios such as machine learning, deep learning, and big data. Intel® FPGAs offer flexible programmable hardware, low and precisely controllable latency, low power consumption per unit of computing power, and large on-chip memory, making them well suited to Kwai's applications, which demand low latency, relatively small batch sizes, high concurrency, and high repetitiveness.


With Intel's hardware and software optimization, the Kwai LaoFe NDP architecture achieves the following. First, system throughput is significantly improved and latency significantly reduced: the throughput of the parameter server improves by 5-6 times, and overall request latency drops by 70%-80%, providing a better user experience. Second, TCO is better controlled: the FPGA's powerful performance provides far more throughput than the traditional scheme, so only a small number of servers need to be deployed to meet the same performance requirements, with a replacement ratio of up to 1:5, effectively reducing TCO. Third, performance jitter is reduced: CPU-based software solutions often suffer performance jitter due to the need for high-frequency updates, whereas offloading to the FPGA makes latency more stable.

III. The dilemma of heterogeneous computing

Kwai's practice shows that heterogeneous computing has great potential and room for future development. However, before adopting heterogeneous computing, enterprises also need to recognize the technical challenges that come with it.

  • First, heterogeneous computing products must deal with different system architectures, instruction sets, and programming models, and must reduce the difficulty that this diversity brings to software developers.
  • Second, beyond breakthroughs in chip design, heterogeneous computing chips must also address adaptation and upgrade issues across different chip manufacturing and packaging architectures.
  • Third, heterogeneous computing should achieve unity across its performance diversity to meet needs as varied as artificial intelligence training, inference, and image and video processing.

In particular, the hardware complexity of heterogeneous computing poses a demanding challenge for programmers. Performance and compatibility gaps between different development frameworks, along with learning costs, have been among the main factors affecting development efficiency. Complex development environments and frameworks that cannot be updated in step force developers to spend a great deal of effort solving problems on their own. All of this depends on building an ecosystem: developing and promoting standards and support for languages, compilers, frameworks, runtime libraries, and so on is no easy task.


It's not easy, but vendors have taken action and introduced various solutions, among which Intel's oneAPI is worth mentioning. As a unified software programming architecture, oneAPI supports a wide range of heterogeneous computing units, including not only Intel hardware but also hardware from other vendors. It offers an open, unified programming language, DPC++, as well as high-performance API-based libraries that run across heterogeneous platforms and deliver extremely high performance, many of which are being open-sourced, opening the way to further extensions and new functionality.
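For a flavor of what DPC++ code looks like, here is a classic vector-addition kernel; the same source can run on a CPU, GPU, or FPGA device depending on what the runtime selects.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    // Pick whatever accelerator is available (GPU, FPGA emulator, or CPU).
    sycl::queue q{sycl::default_selector_v};
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";

    constexpr size_t N = 1024;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);
    {
        sycl::buffer<float> ba(a.data(), sycl::range<1>(N));
        sycl::buffer<float> bb(b.data(), sycl::range<1>(N));
        sycl::buffer<float> bc(c.data(), sycl::range<1>(N));
        q.submit([&](sycl::handler& h) {
            sycl::accessor xa(ba, h, sycl::read_only);
            sycl::accessor xb(bb, h, sycl::read_only);
            sycl::accessor xc(bc, h, sycl::write_only, sycl::no_init);
            h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
                xc[i] = xa[i] + xb[i];  // runs in parallel on the device
            });
        });
    }  // buffers go out of scope here, copying results back to the host
    std::cout << "c[0] = " << c[0] << "\n";
    return 0;
}
```

The buffer and accessor objects let the runtime manage data movement between host and device, which is what makes the same kernel portable across such different hardware.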

Today, oneAPI is being adopted by many independent software vendors, operating system vendors, end users, and academic institutions. The cross-architecture compatibility it offers has greatly increased developer productivity and innovation.

IV. At the end

Previously, computing was computation-centric, with command and control flows driving computation; in the future, it will be data-centric, with data flows driving computation. In the data-centric era, no single type of processor, whether CPU, GPU, or FPGA, can carry the load alone, and traditional general-purpose architectures are far from adequate for today's needs. Only a combination of architectures can cope with the demands of massive, data-intensive workloads.


Recently, the industry has been moving towards a new ecosystem built on heterogeneous technologies, and heterogeneous computing has become a new arena of global competition. Across the industry, mainstream chip vendors are working hard to build a complete ecosystem for heterogeneous computing.


Heterogeneous computing will first be divided ever more finely into workloads with different characteristics and requirements, and then gradually unified and standardized. In the future, heterogeneous systems will be designed according to different scenarios, data types, processing latencies, and bandwidth requirements. Under this trend, there will be many more kinds of "PUs" beyond CPUs and GPUs. Intel's XPU strategy is increasingly advantageous in this context: its ever-improving product line spans CPUs, GPUs, FPGAs, and IPUs, and it upholds a "software-first" philosophy, providing a unified and scalable programming model for heterogeneous computing through oneAPI and advancing software and hardware in tandem. In addition, under its new IDM 2.0 strategy, Intel is accelerating iteration in architecture and process and working with partners to better meet the massive and ever-changing heterogeneous computing needs of the future.
