Performance Optimization

Introduction

Performance matters. Speed may not be the top design priority for every language, but nearly every language benefits from being faster.

Higher performance can lead to better user experience, lower server costs, and more efficient use of resources. It can also give your application a competitive edge.

However, achieving high performance is not easy: it requires a deep understanding of the language, the runtime, and the hardware, as well as a great deal of time and effort. In this blog post, we will discuss some techniques for optimizing the performance of a programming language and how we applied them to Pivot Lang.

Main Factors Affecting Performance

1. LLVM Optimization

LLVM is a powerful compiler infrastructure that can be used to generate highly optimized machine code. However, it cannot do much if the IR generated by the frontend does not follow best practices. It is not enough to generate correct code; you also need to attach various attributes and metadata to help LLVM optimize it. You can find more information about this in the LLVM documentation.
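As an illustration, here is a minimal sketch using the LLVM C++ API (roughly LLVM 14 and newer); the function name and the particular choice of attributes are assumptions for the example, not Pivot Lang's actual codegen:

```cpp
// A minimal sketch: telling LLVM facts about a frontend-generated function
// that it cannot infer on its own. Names are illustrative only.
#include "llvm/IR/Attributes.h"
#include "llvm/IR/Function.h"

using namespace llvm;

// Assume the first parameter is a by-reference struct that is never null,
// never written through, and never stored anywhere, and that the function
// itself never unwinds and always returns.
void annotateGeneratedFunction(Function &F) {
  F.addParamAttr(0, Attribute::NonNull);
  F.addParamAttr(0, Attribute::ReadOnly);
  F.addParamAttr(0, Attribute::NoCapture);
  F.addFnAttr(Attribute::NoUnwind);
  F.addFnAttr(Attribute::WillReturn);
}
```

Each of these attributes unlocks optimizations (dead-argument and store elimination, better alias analysis, simpler exception handling) that LLVM would otherwise have to conservatively forgo.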

2. Garbage Collection

If you are using a garbage collector, it is important to make sure that it integrates well with the rest of the system. Even the best garbage collector can cause performance issues if it is not used correctly.

Garbage collectors come in two basic flavors: conservative and precise. There is not much we can do when using a third-party, fully conservative collector (like bdwgc), but with a custom precise collector we can make a large difference by changing the stack-scanning mechanism, adjusting the collection frequency, and so on.

Stack Scanning

There are many different ways to scan the stack. If you are not using LLVM's GC APIs, you can scan the stack conservatively, much like bdwgc does; this is fast enough in most cases. Otherwise, you can choose between LLVM's gc.root and gc.statepoint to perform precise stack scanning. Even with the same LLVM scanning method, you can still make a difference by changing the way you walk the stack: for example, you can use libunwind to walk the whole stack, or you can use LLVM's inline assembly to get the stack pointer and walk the stack manually with the help of the stackmap. These mechanisms are discussed in detail in the LLVM and GC post.
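To make the conservative approach concrete, here is a minimal, single-threaded sketch; gc_is_heap_pointer, gc_mark, and gc_stack_top are hypothetical runtime pieces, a downward-growing stack is assumed, and this is not Pivot Lang's actual implementation:

```cpp
// A minimal sketch of a conservative stack scan.
#include <cstdint>

extern bool gc_is_heap_pointer(void *p);  // does p point into the GC heap?
extern void gc_mark(void *p);             // mark (and later trace) the object
extern void *gc_stack_top;                // recorded when the thread started

void scan_stack_conservatively() {
  // __builtin_frame_address(0) (GCC/Clang) gives an address inside the
  // current frame; a real scanner would also spill callee-saved registers
  // (e.g. via setjmp) before scanning so no root hides in a register.
  auto *sp = static_cast<uintptr_t *>(__builtin_frame_address(0));
  auto *top = static_cast<uintptr_t *>(gc_stack_top);
  // Treat every word between the current frame and the stack top as a
  // potential pointer, and mark the ones that land in the GC heap.
  for (uintptr_t *slot = sp; slot < top; ++slot) {
    void *candidate = reinterpret_cast<void *>(*slot);
    if (gc_is_heap_pointer(candidate))
      gc_mark(candidate);
  }
}
```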

Collection Frequency

The frequency of garbage collection can have a big impact on performance. Different GC algorithms have different optimal collection frequencies, and the notion may not apply at all to some concurrent collectors. As a rule of thumb, GC should take less than 10% of total execution time; if it takes more, you should consider optimizing it.
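One simple way to act on that rule of thumb is to let the heap grow further between collections when GC is taking too large a share of the run. The sketch below is illustrative only; the pacer structure, its fields, and the growth factors are assumptions, not Pivot Lang's actual policy:

```cpp
// A sketch of a pacing heuristic built around the 10% rule of thumb.
#include <cstddef>

struct GcPacer {
  double gc_seconds = 0.0;           // time spent inside the collector so far
  double total_seconds = 0.0;        // total elapsed program time
  size_t next_threshold = 4u << 20;  // trigger the next GC at this heap size

  // Called at the end of each collection cycle.
  void on_collection_end(double pause, double elapsed, size_t live_bytes) {
    gc_seconds += pause;
    total_seconds += elapsed;
    double share = total_seconds > 0 ? gc_seconds / total_seconds : 0.0;
    // If GC is eating more than ~10% of the run, collect less often by
    // letting the heap grow further before the next cycle.
    double growth = share > 0.10 ? 3.0 : 2.0;
    next_threshold = static_cast<size_t>(live_bytes * growth);
  }
};
```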

Allocation

The way you allocate memory can also have a big impact on performance. Memory allocation is a well-known hot path in many programs, especially in programs that allocate all objects on the heap. You should make sure your allocator is fast and efficient. A good allocator should allocate memory at least as fast as libc's malloc in single-threaded environments.
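For example, many GC'd runtimes beat malloc on small objects with a bump-pointer fast path, where allocation is just a compare and an add. The sketch below is illustrative; gc_alloc_slow and the 16-byte alignment are assumptions, not Pivot Lang's actual allocator:

```cpp
// A sketch of a bump-pointer allocation fast path.
#include <cstddef>
#include <cstdint>

extern void *gc_alloc_slow(size_t size);  // refill the block or trigger a GC

thread_local uintptr_t bump_cursor = 0;   // next free byte in the current block
thread_local uintptr_t bump_limit = 0;    // end of the current block

inline void *gc_alloc(size_t size) {
  size = (size + 15) & ~static_cast<size_t>(15);  // keep 16-byte alignment
  uintptr_t result = bump_cursor;
  if (result + size <= bump_limit) {  // fast path: a compare and an add
    bump_cursor = result + size;
    return reinterpret_cast<void *>(result);
  }
  return gc_alloc_slow(size);         // slow path: rare by design
}
```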

Although a good allocator can improve performance, it won't help much if you are allocating memory too frequently. You should try to minimize the number of heap allocations in your program. This can be done either by dead code elimination or by replacing some heap allocations with stack allocations.

As mentioned in the blog, you can add the allockind("alloc") attribute to your custom allocation function to allow LLVM to optimize away some allocations.
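Here is a hedged sketch using the LLVM C++ API (LLVM 15 or newer, where allockind was introduced); markAsAllocator and the "pl_malloc" family name are made up for the example:

```cpp
// A sketch: tagging a custom allocation function so LLVM may elide dead calls.
#include "llvm/IR/Attributes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/LLVMContext.h"

using namespace llvm;

void markAsAllocator(Function &F) {
  LLVMContext &Ctx = F.getContext();
  // Equivalent to writing allockind("alloc") on the function in textual IR:
  // the call returns newly allocated memory, so LLVM may remove calls whose
  // result it can prove is never used.
  F.addFnAttr(Attribute::getWithAllocKind(Ctx, AllocFnKind::Alloc));
  // A family name lets LLVM pair the allocator with its matching free function.
  F.addFnAttr("alloc-family", "pl_malloc");
}
```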

On the other hand, you can also use alloca to replace some heap allocations; the well-known Go language uses a similar mechanism (escape analysis) to reduce the number of heap allocations. However, there is no built-in pass in LLVM that does this, so you have to write your own. Fortunately, LLVM does provide an analysis called CaptureTracking which can help you determine whether a heap pointer is captured, and thus whether it can be safely replaced with an alloca.
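As a sketch of what such a pass might look like (using the classic PointerMayBeCaptured signature; recent LLVM releases have reworked this API), the helpers below check whether an allocation escapes and, if not, rewrite it into an entry-block alloca. The names are illustrative, not Pivot Lang's actual pass:

```cpp
// A sketch of a heap-to-stack rewrite built on LLVM's CaptureTracking analysis.
#include "llvm/Analysis/CaptureTracking.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// A pointer that is never captured cannot outlive the current function, which
// is a necessary (though not sufficient) condition for moving the object to
// the stack.
static bool neverEscapes(CallInst *Alloc) {
  return !PointerMayBeCaptured(Alloc, /*ReturnCaptures=*/true,
                               /*StoreCaptures=*/true);
}

// ObjTy is the allocated object's type, which the frontend knows for its own
// allocation calls.
static void replaceWithAlloca(CallInst *Alloc, Type *ObjTy) {
  Function *F = Alloc->getFunction();
  // Put the alloca in the entry block so it becomes a static stack slot.
  IRBuilder<> B(&F->getEntryBlock(), F->getEntryBlock().begin());
  AllocaInst *Slot = B.CreateAlloca(ObjTy, nullptr, Alloc->getName());
  Alloc->replaceAllUsesWith(Slot);
  Alloc->eraseFromParent();
}
```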

Please note that replacing heap allocations with stack allocations can be dangerous, especially if you are using a precise relocating garbage collector. Things get complicated in this case, and you may need to add extra metadata to help the pass determine whether a non-captured pointer can really be replaced with an alloca safely.

Take Pivot Lang as an example. We originally used a precise relocating garbage collector, and our GC relied on a visitor function stored in the object header to precisely mark complex objects such as structs and arrays. Now imagine we replace the heap allocation of such a complex object with an alloca: the header, and therefore the visitor function pointer, lives on the stack as well, and we can no longer use it to mark the object, because we cannot distinguish it from other stack data. This is not a problem if we scan the stack conservatively, as long as every heap pointer on the stack is ordinary, meaning the GC can work out exactly how to scan its contents from the pointer alone. In Pivot Lang, array types are not ordinary, so we added extra metadata to help the pass tell arrays apart from other types. In the end, our GC became a hybrid: it scans the stack conservatively and the heap precisely. We took this approach to achieve a substantial reduction in the number of heap allocations, even though it slightly slows down stack scanning.