ByteDance's Path to Golang Development - Weekly Sharing

Summary : This article is compiled from a talk by Ma Chunhui, a senior engineer at ByteDance, at DIVE Global Infrastructure Software Innovation Conference 2022, titled "ByteDance's Path to Golang Development".

At DIVE Global Infrastructure Software Innovation Conference 2022, Li Sanhong, the head of the AliCloud programming language and compiler team, presented a special session on "New Trends in Programming Languages at DIVE".


The Status of Golang

Golang (the Go language) has become one of the most popular programming languages in just over a decade since it was open-sourced in 2009. According to JetBrains, there were more than two million Go developers worldwide in 2019, and the number continues to grow. Many overseas companies, such as Google and Uber, and Chinese companies, such as ByteDance and Tencent, use Go on a relatively large scale, so much so that many people call Go the best language for cloud-native development.

At ByteDance, Golang is the most widely used language for microservices, but that was not the case from the beginning. The earliest language was Python. In 2014, we went through a large-scale move to microservices and switched from Python to Golang. It is said that one of the most important reasons was Python's low utilization of CPU resources at the time. The colleagues responsible for language selection chose Golang after investigating the options. In hindsight, that choice was very forward-looking.


Why didn't they choose Java at that time? Java has many advantages and remains a dominant force today. But from a microservices perspective, it has inherent disadvantages, such as resource overhead: lower resource overhead means higher deployment density and lower computing costs. Java spends extra resources on JIT compilation at runtime. The JVM itself takes up about 50 to 60 megabytes of memory, and because memory in microservices cannot be oversold (overcommitted), that footprint is relatively costly. The JVM also occupies about one to two hundred megabytes of disk, which slows distribution and deployment for microservices in a distributed architecture. Furthermore, Java's startup speed has long been criticized. For microservices that require fast iteration and rollback, slow startup hurts delivery efficiency and rollback speed and may also cause latency that users notice. Of course, Java has been optimized as well: CDS and static compilation, for example, have been gaining ground for a few years now. Unfortunately, these were not available when ByteDance made its language selection. Another reason Java was not chosen was that the colleagues in charge of language selection at the time did not really like Java.

Advantages of Golang

Professor Cui Huimin of the Chinese Academy of Sciences said, "There have always been two goals in designing programming languages, one is to make programming easier, and the other is to take full advantage of hardware qualities to perform at a higher level when new hardware architectures emerge."


Golang is one of those languages that make programming easier, and it strikes a good balance between development efficiency and performance.

Golang has many advantages. First, it supports high concurrency at the language level. It comes with goroutines, a form of coroutine, which make it easier for programmers to write concurrent code and take full advantage of multiple cores. Second, it is easy to learn and efficient to develop in: Go has only 25 keywords, compared to about 40 in C11. Although Go has fewer keywords, it is expressive and supports most of the useful features found in other languages. It also compiles very fast.
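As a rough illustration of language-level concurrency, the sketch below splits a slice across several goroutines and sums the pieces in parallel. The function name `parallelSum` and the chunking scheme are illustrative choices, not something taken from the talk.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelSum splits nums into up to `workers` chunks and sums each chunk in
// its own goroutine. The name and signature are hypothetical.
func parallelSum(nums []int, workers int) int {
	if workers < 1 {
		workers = 1
	}
	// One result slot per goroutine, so no locking is needed.
	partial := make([]int, workers)
	chunk := (len(nums) + workers - 1) / workers

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start := w * chunk
		if start >= len(nums) {
			break
		}
		end := start + chunk
		if end > len(nums) {
			end = len(nums)
		}
		wg.Add(1)
		go func(w, start, end int) {
			defer wg.Done()
			for _, n := range nums[start:end] {
				partial[w] += n
			}
		}(w, start, end)
	}
	wg.Wait() // block until every worker goroutine has finished

	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	nums := make([]int, 1000)
	for i := range nums {
		nums[i] = i + 1
	}
	fmt.Println(parallelSum(nums, 4)) // prints 500500
}
```

Starting a goroutine costs one `go` statement, and `sync.WaitGroup` is enough to wait for the workers; no thread pool or executor has to be configured.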

The Problems with Golang


Golang is an open-source language, and the core members of the Go team have publicly stated that Go is completely open source and actively embraces the community. Still, there is a persistent narrative within the community: "Go is Google's Go, not the Go of the community". One of the more typical stories is the history of how Go modules came to be adopted. In general, Go's development has been firmly controlled by the core Go team at Google, and outside voices, including community voices, seem to carry less weight in the development of the language. In other words, it is difficult for outsiders to take the lead in designing a complete feature. Of course, we are actively participating in the community and hope to get good feedback from it.

Another problem is that as our microservices scale up, individual microservices grow larger and the number of containers deploying them also increases. Beyond a certain scale, many performance problems appear, which we will focus on later.

In addition, once the number of microservices increases, we also face observability problems.

The Performance Issue

As mentioned earlier, as the size of a single microservice increases and the number of machines deploying microservices grows, we encounter more and more performance problems. These problems fall into two areas: GC, that is, memory management, and the quality of compiled code. In addition, we also have colleagues working on optimization analysis in the scheduling area.


1. Performance issues of GC

First, let's talk about GC, or memory management.

Memory management includes both memory allocation and garbage collection. Go's GC is a concurrent mark-sweep (CMS) collector. However, it is important to note that Go's GC implementation focuses heavily on pause time, or Stop the World (STW) time, at the expense of other GC characteristics.

As we know, many aspects of a GC need to be considered. One is throughput: GC will definitely slow down the program, so how much does it affect throughput, and how much garbage can be reclaimed in a fixed amount of CPU time? Others include the timing and frequency of Stop the World pauses, the speed of allocating newly requested memory, the space wasted when allocating memory, whether the GC can make full use of multiple cores on a multi-core machine, and many more.

Unfortunately, Golang was designed and implemented with an overemphasis on short pause times. This has other implications: for example, the heap cannot be compacted during execution, which means that objects cannot be moved, and the GC is non-generational. As a result, memory allocation and GC usually take up more CPU resources.

Some of our colleagues collected statistics and found that, in many microservices, memory allocation and GC can take up more than 30% of CPU resources during the evening peak. There are two reasons for such high resource consumption. One is that Go performs memory allocation operations frequently. The other is that Go's heap allocation is relatively heavy in implementation and therefore consumes more CPU resources: for example, it acquires M and GC locks along the way, its code path is long, its instruction count is high, and the locality of memory allocation is poor. Therefore, one of the first things our colleagues did was to reduce the overhead of memory management, especially of memory allocation, and thus the GC overhead.

After some research, our colleagues found that when microservices allocate memory, most of the allocated objects are relatively small. Based on this observation, we designed the GAB (Goroutine allocation buffer) mechanism to optimize memory allocation for small objects. Go's memory allocator is based on the tcmalloc design. Our GAB preallocates a larger buffer for each goroutine and then uses a bump pointer to quickly allocate small objects that fit into the GAB. Our algorithm is fully compatible with the tcmalloc-based allocation algorithm, and its allocation operation can be interrupted by Stop the World at any point. Although the GAB optimization may waste some space, it has been tested on many microservices and saves about 5% to 12% of CPU.
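The bump-pointer idea can be sketched in ordinary Go. This is only a user-space model under stated assumptions: the real GAB lives inside ByteDance's modified runtime, and the type `bumpBuffer`, its methods, and the 8-byte alignment are all hypothetical names and choices made for illustration.

```go
package main

import "fmt"

// align is an assumed alignment for small objects.
const align = 8

// bumpBuffer is a hypothetical stand-in for a per-goroutine allocation
// buffer. It only models the bump-pointer technique, not the real GAB.
type bumpBuffer struct {
	buf  []byte
	next int // offset of the first free byte
}

func newBumpBuffer(size int) *bumpBuffer {
	return &bumpBuffer{buf: make([]byte, size)}
}

// alloc hands out n bytes by advancing the bump pointer. It returns nil when
// the request does not fit; a real allocator would then fall back to its
// general (slower) allocation path.
func (b *bumpBuffer) alloc(n int) []byte {
	if n <= 0 {
		return nil
	}
	size := (n + align - 1) &^ (align - 1) // round up to the alignment
	if size > len(b.buf)-b.next {
		return nil
	}
	p := b.buf[b.next : b.next+n : b.next+size]
	b.next += size // the whole fast path is one bounds check and one add
	return p
}

// remaining reports how many bytes are still free in the buffer.
func (b *bumpBuffer) remaining() int { return len(b.buf) - b.next }

func main() {
	b := newBumpBuffer(64)
	fmt.Println(len(b.alloc(10)), b.remaining()) // prints 10 48
	fmt.Println(len(b.alloc(24)), b.remaining()) // prints 24 24
	fmt.Println(b.alloc(32) == nil)              // prints true
}
```

The last call shows where the space wastage mentioned above comes from: 24 bytes are still free, but a 32-byte request cannot use them and has to be served elsewhere.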

2. Performance issues of generated code

Another issue is the quality of the code generated by Golang. Go's compiler is a relatively rudimentary implementation compared to traditional compilers, with fewer optimizations. Go has a little over 40 passes in the compilation phase, while LLVM, for comparison, has over 200 optimization passes at O2. When Go optimizes code, it mostly chooses algorithms that are less precise but faster. In other words, Go focuses heavily on compilation time, which results in less efficient generated code.

For some of our microservices scenarios, we can be less concerned about compilation speed. Many of our microservices are compiled once and deployed to run on tens or even hundreds of thousands of cores, and they usually run for a long time. In this case, spending a little more compilation time to save CPU resources is acceptable.

We have added optimizations on top of the Golang compiler at the cost of compilation speed and binary size. Of course, we still keep those costs under control. For example, our binary size usually grows by 5% to 15%, and compilation time increases by around 50% to 100%.

We currently have about five optimizations in the compiler, and I'll pick two or three to highlight.

The first optimization is inlining. Inlining is the basis for the other optimizations, and it works by replacing a function call with the body of the called function at compile time. Function calls themselves have overhead. Before Go 1.17, Go passed arguments on the stack, so there was overhead in pushing them onto the stack and popping them off. A function call is also a jump, which may incur instruction cache misses.

Golang's native inlining optimization is fairly limited. For example, some language features prevent inlining. If a function contains a defer, inlining the function at its call site may cause the deferred function to run at a time that is inconsistent with the original semantics, so Go cannot inline in this case. In addition, if a function is called through an interface type, that call will not be inlined either.

In addition, Go's compiler has only supported the inlining of non-leaf functions since 1.9. Although the inlining of non-leaf functions is turned on by default, the policy is very conservative. For example, if a non-leaf function contains two function calls, it will not be inlined during inline evaluation. The inlining strategy is also conservative from an implementation point of view. We have modified the inlining policy in ByteDance's Go compiler to allow more functions to be inlined. The most immediate benefit is a significant reduction in function call overhead. Although the overhead of a single function call may not be particularly large, it adds up to a lot.
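The three cases discussed above can be written down as a small sketch. All names here (`add`, `addDeferred`, `viaInterface`, `intAdder`) are hypothetical, and whether a given function is actually inlined depends on the Go version and its cost budget, so the comments describe expectations rather than guarantees.

```go
package main

import "fmt"

// calls counts completed addDeferred calls; it exists only so that the
// deferred function has an observable effect.
var calls int

// add is a small leaf function and a typical inlining candidate.
func add(a, b int) int { return a + b }

// addDeferred contains a defer, the construct named above as blocking
// inlining because the deferred call must run when this function returns.
func addDeferred(a, b int) int {
	defer func() { calls++ }()
	return a + b
}

type adder interface {
	Add(a, b int) int
}

type intAdder struct{}

func (intAdder) Add(a, b int) int { return a + b }

// viaInterface calls Add through an interface value. A dynamic call like
// this is generally not inlined unless the compiler can prove the concrete
// type behind the interface.
func viaInterface(x adder, a, b int) int { return x.Add(a, b) }

func main() {
	fmt.Println(add(2, 3), addDeferred(2, 3), viaInterface(intAdder{}, 2, 3)) // prints 5 5 5
	fmt.Println(calls)                                                         // prints 1
}
```

Building this with `go build -gcflags=-m` makes the stock compiler print its inlining decisions; the exact messages differ between Go releases.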

More importantly, inlining increases the opportunities for other optimizations, such as escape analysis and common subexpression elimination. Since most compiler optimizations are local optimizations within a function, inlining effectively expands the scope that these optimizations can analyze, which makes the later analysis and optimization more effective.

Of course, although inlining is good, it cannot be applied without limit because it also has costs. For example, after our inlining optimization, binary size increased by 5% to 10%, and compilation time also increased. There is also a more important runtime cost: more inlining leads to greater stack usage, which significantly increases the runtime overhead of stack expansion. To reduce that overhead, we have also adjusted Golang's initial stack size.

To keep memory requirements low, the stack of each goroutine cannot be as large as thread stacks in other languages, which can be two to eight megabytes; otherwise it would be easy to run out of memory (OOM). Go therefore checks at the beginning of a function whether the remaining space on the current stack meets the function's needs, which means the compiler inserts a stack check instruction in the function prologue. If the space is insufficient, a stack expansion is triggered: the runtime requests a new block of memory, copies the current stack into it, and then walks the stack frame by frame, adjusting pointers so that none still points into the old stack. This overhead is significant, and the adjusted inlining strategy aggravates it by placing more data on the stack, so we adjusted Go's initial stack size.
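A minimal sketch of code that exercises this mechanism: a fresh goroutine starts with a small stack (a few kilobytes in the stock runtime), and deep recursion forces the runtime to grow it, possibly several times. The function names and the 256-byte local array are illustrative assumptions, and the actual frame size is up to the compiler.

```go
package main

import "fmt"

// deep recurses depth times. Each frame declares a local array so that the
// recursion consumes stack space quickly. It returns depth+1.
func deep(depth int) int {
	var pad [256]byte
	pad[depth%len(pad)] = 1
	if depth == 0 {
		return int(pad[0])
	}
	return deep(depth-1) + int(pad[depth%len(pad)])
}

// runInGoroutine runs deep on a brand-new goroutine, whose stack starts at
// the runtime's initial size and must be grown as the recursion deepens.
func runInGoroutine(depth int) int {
	done := make(chan int)
	go func() { done <- deep(depth) }()
	return <-done
}

func main() {
	// The stack check in deep's prologue triggers growth transparently;
	// the program only observes the result.
	fmt.Println(runInGoroutine(10000)) // prints 10001
}
```

Nothing in the source code asks for a larger stack. The cost described above is paid inside the runtime each time the prologue check fails, which is why a larger initial stack can help call-heavy services.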

One of our biggest gains came from tuning the inlining policy. We also made other optimizations, such as the GAB optimization mentioned earlier: at compile time, the compiler emits GAB's fast allocation path directly into the generated code, which speeds up memory allocation for objects allocated in the GAB.

Since Go's memory allocation overhead is still relatively high, one of our optimization priorities is to find ways to reduce allocation on the heap. Whether an object is allocated on the heap or on the stack in Golang is decided by escape analysis, so we have also made some escape analysis optimizations.
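The effect of escape analysis can be observed with the standard `testing.AllocsPerRun` helper. The sketch below is an assumption-laden illustration: the type `point` and both functions are hypothetical, and `//go:noinline` is used only to keep the stock compiler from inlining the calls and changing the outcome.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// newPoint returns a pointer to a local value. The value outlives the call,
// so escape analysis has to place it on the heap.
//
//go:noinline
func newPoint(x, y int) *point { return &point{x, y} }

// sumPoint keeps its point entirely local, so the value can stay on the
// stack and no heap allocation is needed.
//
//go:noinline
func sumPoint(x, y int) int {
	p := point{x, y}
	return p.x + p.y
}

// sink keeps the result of newPoint reachable so the call is not discarded.
var sink *point

func main() {
	heapAllocs := testing.AllocsPerRun(1000, func() { sink = newPoint(1, 2) })
	stackAllocs := testing.AllocsPerRun(1000, func() { _ = sumPoint(1, 2) })
	fmt.Println("allocations per call, escaping:", heapAllocs)
	fmt.Println("allocations per call, non-escaping:", stackAllocs)
}
```

With the stock compiler this typically reports one allocation per call for the escaping version and none for the other. `go build -gcflags=-m` also prints which values escape to the heap.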

As you can see, most of the optimizations we have implemented in the compiler so far are generic. In theory, all microservices can benefit from them, as evidenced by the microservices we have in production so far.

Let's look at the benefits of these optimizations. On microbenchmarks, namely the Go1 benchmark suite that ships with Go, many benchmarks improve by close to 20% and others by a little over 10%.

We found that basically all microservices in production save CPU resources to a greater or lesser extent. Latency and memory usage are also reduced to varying degrees. For some of the microservices we have put into production, we have saved about a hundred thousand cores of peak CPU.

Because we've been able to get such significant optimization results with a few simple optimizations on the compiler, we have two strategies we're currently pursuing. The first is to continue introducing more compiler optimizations into the Go native compiler, hoping to improve Go's native compiler performance further. The other is to consider leveraging LLVM's powerful optimization capabilities to compile Go's source code into LLVM IR and generate executable code for performance optimization.

There is already such a project in the community, Gollvm. It is usable but does not support many important features. For example, it does not support assembly: if a microservice or a third-party library it references contains Plan 9 assembly, Gollvm cannot build it at present. Its GC also does not support precise stack scanning at the moment and uses a conservative stack scanning strategy instead. In addition, there is still a considerable performance gap between Gollvm and the native Go compiler. But we are investigating and studying it now.

3. Performance observation issues

In addition, while running Go in production, we found a rather obvious problem with performance observation, specifically, inaccurate measurement.

The pprof tool that comes with Go is not very accurate. There was some discussion within the Go community about how pprof uses itimer to generate the signals that trigger sampling, and about how these signals may be unreliable on Linux, especially on some versions of Linux. According to our pprof results, about 20% or even 50% of the samples are dropped in some containers. There is also the problem that a sampling signal triggered on one M may end up collecting data on another M.
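For context, this is roughly how the built-in, signal-driven CPU profiler is switched on from code, using the standard `runtime/pprof` package. The helper names `busy` and `profileBusy` are hypothetical, and the sketch does not reproduce the accuracy problem itself; it only shows the entry points being discussed.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// busy burns some CPU so that the profiler has something to sample.
func busy(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// profileBusy runs busy(n) under the built-in CPU profiler and returns the
// result together with the size of the encoded profile in bytes.
func profileBusy(n int) (int, int, error) {
	var buf bytes.Buffer
	// StartCPUProfile enables periodic, signal-based sampling and streams
	// the profile to the given writer.
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return 0, 0, err
	}
	result := busy(n)
	pprof.StopCPUProfile() // flushes the remaining profile data to buf
	return result, buf.Len(), nil
}

func main() {
	result, size, err := profileBusy(50_000_000)
	if err != nil {
		fmt.Println("could not start CPU profile:", err)
		return
	}
	fmt.Println("result:", result)
	fmt.Println("profile bytes:", size)
}
```

In a real service the buffer would be a file or an HTTP response, and the output would be inspected with `go tool pprof`. Because each sample depends on a signal being delivered to a running thread, lost or misdirected signals translate directly into missing or misattributed samples.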

As for perf, unfortunately, many of our production containers do not support perf internally, and for security policy reasons, we do not allow perf tools to be installed in production.

As you may have heard, Uber has developed pprof++ for Go, a tool similar to pprof that also calls some of pprof's interfaces and uses the hardware PMU to trigger sampling. But one of the problems with Uber's pprof++ is its very high performance penalty. After some verification, we found that on some small examples there was a performance loss of about 3% after applying Uber's pprof++ patch, just from applying the patch and without even turning pprof on. The compiler and memory allocation optimizations described earlier improved performance by only 5% to 10%, yet simply applying this patch cost 3%. We cannot accept this performance loss.

Fortunately, Go 1.18 introduced per-M pprof sampling, which samples each M, and the results are relatively accurate.

In Go 1.16 and 1.17, we followed Uber's example and adopted a PMU-based pprof. This approach has been verified to have a relatively low performance loss, and we have managed to avoid the performance loss problem of Uber's implementation. In addition, it provides more powerful CPU sampling and analysis capabilities, such as branch-miss and icache events.

Summary


Golang is relatively promising and is growing rapidly.

In the long run, software systems based on more advanced programming languages will gradually gain a competitive advantage. As the price of hardware resources such as CPUs falls further, development costs, labour costs, project risk, and system stability and security may become more important factors in decision-making. At present, Go offers very good development efficiency and performance comparable to C, and it provides various useful features for server-side development in the Internet environment. Many people refer to Go as the best language for cloud-native development.

Of course, there may be a new language with better efficiency, higher performance, and a more open design and development environment that will eventually replace Go. I'm very hopeful that such a new language will emerge.


Author bio :

Ma Chunhui currently works on programming language-related software development support at ByteDance. He has more than ten years of experience in compiler and virtual machine development, has participated in projects including the HP ACC compiler, the IBM JDK, and the Huawei Bison compiler and virtual machine, and has rich experience in compilers, runtimes, performance optimization, and related fields.