Starting from Scratch: A Talk about LLVM

What is LLVM? I first thought about this question when I found myself lacking knowledge of O-LLVM (Obfuscator-LLVM). I needed to reuse the O-LLVM project to develop a tool for obfuscating Android native C/C++ code and to study the technology of virtual machine protection. In fact, O-LLVM is developed from the LLVM project: it uses the infrastructure of LLVM to obfuscate the Intermediate Representation before sending it to the back end, which generates machine code for the target.

Let's look at the definition of LLVM in Wikipedia:

The LLVM compiler infrastructure project is a "collection of modular and reusable compiler and toolchain technologies" used to develop compiler front ends and back ends.

1 History of LLVM

Pic. Chris Lattner

Chris Lattner went to the University of Illinois at Urbana–Champaign (UIUC) in 2000 to pursue a master's degree. He was a superb scholar: he earned a perfect 4.0 GPA, travelled to many places of interest across America, and read Compilers: Principles, Techniques, and Tools again and again. The LLVM project was started at that time under the direction of Vikram Adve and Lattner himself. His core research field was compilers, and his master's thesis, LLVM: An Infrastructure for Multi-Stage Optimisation, proposed a kind of virtual instruction set with a low-level representation but high-level type information. He called it the Low Level Virtual Machine, and this is the prototype of the LLVM project (LLVM was originally short for Low Level Virtual Machine).

He was acutely aware of several problems that compilers and interpreters had at that time, and he was determined to conquer them all.

1.1 Problems of existing compilers

From its beginning in December 2000, LLVM was designed as a set of reusable libraries with well-defined interfaces [LA04]. At the time, open source programming language implementations were designed as special-purpose tools which usually had monolithic executables. For example, it was very difficult to reuse the parser from a static compiler (e.g., GCC) for doing static analysis or refactoring. While scripting languages often provided a way to embed their runtime and interpreter into larger applications, this runtime was a single monolithic lump of code that was included or excluded. There was no way to reuse pieces, and very little sharing across language implementation projects.

Beyond the composition of the compiler itself, the communities surrounding popular language implementations were usually strongly polarized: an implementation usually provided either a traditional static compiler like GCC, Free Pascal, and FreeBASIC, or it provided a runtime compiler in the form of an interpreter or Just-In-Time (JIT) compiler. It was very uncommon to see a language implementation that supported both, and if one did, there was usually very little sharing of code.

| Existing problems | The optimal direction |
| --- | --- |
| The intermediate representation lacks broad applicability and expressive power | Redesign an IR language that carries both low-level and high-level information |
| Monolithic design structure | Modular design model |
| Hard to reuse as an SDK or as a library | Design a well-defined calling mechanism |

Table. The original optimisation directions of LLVM

1.2 What does LLVM do?

LLVM redesigns the Intermediate Representation. This novel LLVM virtual instruction set provides the benefits of a low-level representation (compact representation, wide variety of available transformations, etc.) while also carrying high-level information to support aggressive interprocedural optimisations at link time and post-link time. In particular, the system is designed to support optimisation in the field, both at run time and during otherwise unused idle time on the machine.
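To give a feel for this design, here is a minimal, hypothetical function in LLVM IR (the function name is mine, purely for illustration). The instructions are low level, close to machine operations, yet every value carries an explicit type such as i32, which is exactly the high-level information the thesis describes:

```llvm
; Illustrative only: a function adding two 32-bit integers.
; Low-level operations (add, ret) combined with explicit types (i32).
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add nsw i32 %a, %b   ; typed addition with no-signed-wrap flag
  ret i32 %sum
}
```

Because the types and control flow are explicit in the IR itself, optimisers can analyse and transform this code without consulting the original source language.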

In a word, LLVM was originally developed as a research infrastructure to investigate dynamic compilation techniques for static and dynamic programming languages.

During Chris Lattner's PhD, he went further, using GCC as a front end to do semantic analysis and generate an Intermediate Format (IF), then finishing optimisation and code generation with LLVM. He was beginning to make a name for himself as a compiler engineer.

1.3 LLVM's potential was explored by Apple

In 2005, Apple Inc. hired Lattner and formed a team to work on the LLVM system for various uses within Apple's development systems. LLVM is an integral part of Apple's latest development tools for macOS and iOS. Since 2013, Sony has been using LLVM's primary front end, the Clang compiler, in the software development kit (SDK) of its PlayStation 4 console.

The name LLVM was originally an initialism for Low Level Virtual Machine. This initialism has officially been removed to avoid confusion, as LLVM has evolved into an umbrella project that has little relationship to what most current developers think of as virtual machines. Now, LLVM is a brand that applies to the LLVM umbrella project, the LLVM intermediate representation (IR), the LLVM debugger, the LLVM implementation of the C++ Standard Library (with full support of C++11 and C++14), etc. LLVM is administered by the LLVM Foundation.

Today, LLVM is one of the best compiler toolchains available. Compared to GCC, it has the following advantages:

  1. It compiles much faster than GCC on certain platforms (e.g., about three times faster than GCC when compiling Objective-C in debug mode).
  2. It uses about one fifth as much RAM as GCC when generating the AST.
  3. Its diagnostic information is friendlier to developers.
  4. It is designed as a set of modular libraries, so it is more easily embedded in an IDE or reused for other purposes.
  5. It is nimble and can be extended more easily than GCC.

LLVM is written in C++ and is designed for compile-time, link-time, run-time, and "idle-time" optimization of programs written in arbitrary programming languages. Originally implemented for C and C++, the language-agnostic design of LLVM has since spawned a wide variety of front ends: languages with compilers that use LLVM include ActionScript, Ada, C#, Common Lisp, Crystal, CUDA, D, Delphi, Fortran, Graphical G Programming Language, Halide, Haskell, Java bytecode, Julia, Kotlin, Lua, Objective-C, OpenGL Shading Language, Pony, Python, R, Ruby, Rust, Scala, Swift, and Xojo.

I have read the LLVM introduction written by Chris Lattner in the book The Architecture of Open Source Applications. Next, I will make some essential notes and excerpt its contents to illustrate LLVM.

2 Three-Phase Compiler Design

The most popular design for a traditional static compiler (like most C compilers) is the three-phase design whose major components are the front end, the optimizer, and the back end. The front end parses source code, checking it for errors, and builds a language-specific Abstract Syntax Tree (AST) to represent the input code. The AST is optionally converted to a new representation for optimization, and the optimizer and back end are run on the code.

Pic. Three-Phase Compiler Design

The optimizer is responsible for doing a broad variety of transformations to try to improve the code's running time, such as eliminating redundant computations, and is usually more or less independent of language and target. The back end (also known as the code generator) then maps the code onto the target instruction set. In addition to making correct code, it is responsible for generating good code that takes advantage of unusual features of the supported architecture. Common parts of a compiler back end include instruction selection, register allocation, and instruction scheduling.

This model applies equally well to interpreters and JIT compilers. The Java Virtual Machine (JVM) is also an implementation of this model, which uses Java bytecode as the interface between the front end and optimizer.

2.1 Advantages of This Model

The most important win of this classical design comes when a compiler decides to support multiple source languages or target architectures. If the compiler uses a common code representation in its optimiser, then a front end can be written for any language that can compile to it, and a back end can be written for any target that can compile from it, as shown in the following picture.

Pic. RetargetableCompiler

There are three advantages when using this three-phase compiler design:

  1. Reuse. Because the front end and back end are separated, supporting a new source language (e.g. Algol or BASIC) only requires implementing a new front end; the existing optimiser and back end can be reused.
  2. A broader developer community. This design supports multiple source languages and targets (e.g. Intel CPUs, ARM, MIPS), so it serves a broader set of programmers. And the more developers are involved in the project, the more quality code is contributed, which naturally leads to more enhancements and improvements to the compiler.
  3. Specialisation. The skills required to implement a front end are different from those required for the optimiser and back end. Separating them makes it easier for a "front-end person" to enhance and maintain their part of the compiler. While this is a social issue, not a technical one, it matters a lot in practice, particularly for open source projects that want to reduce the barrier to contributing as much as possible.
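To make the retargetability point concrete, here is a minimal sketch in LLVM IR (the function and the specific triples are my own illustration, not from the original text). The function body mentions no machine registers or target instructions, so any back end can lower it; retargeting the module is, at the IR level, just a matter of pointing it at a different back end:

```llvm
; Illustrative module: target-independent IR.
; Swapping the triple (e.g. to an ARM or MIPS triple) selects a
; different back end, while the IR below stays unchanged.
target triple = "x86_64-unknown-linux-gnu"

define i32 @square(i32 %x) {
entry:
  %r = mul nsw i32 %x, %x
  ret i32 %r
}
```

This is the practical payoff of the shared middle layer: one optimiser and N back ends serve every front end that can emit this IR.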

2.2 Existing Language Implementations

While the benefits of a three-phase design are compelling and well-documented in compiler textbooks, in practice it is almost never fully realised. Looking across open source language implementations (back when LLVM was started), you'd find that the implementations of Perl, Python, Ruby and Java share no code. Further, projects like the Glasgow Haskell Compiler (GHC) and FreeBASIC are retargetable to multiple different CPUs, but their implementations are very specific to the one source language they support. There is also a broad variety of special purpose compiler technology deployed to implement JIT compilers for image processing, regular expressions, graphics card drivers, and other subdomains that require CPU intensive work.

That said, there are three major success stories for this model.

Java and .NET virtual machines. The first are the Java and .NET virtual machines. These systems provide a JIT compiler, runtime support, and a very well defined bytecode format. This means that any language that can compile to the bytecode format (and there are dozens of them) can take advantage of the effort put into the optimiser and JIT as well as the runtime. The tradeoff is that these implementations provide little flexibility in the choice of runtime: they both effectively force JIT compilation, garbage collection, and the use of a very particular object model. This leads to suboptimal performance when compiling languages that don't match this model closely, such as C (e.g., with the LLJVM project).

Translate to C Method. A second success story is perhaps the most unfortunate, but also most popular way to reuse compiler technology: translate the input source to C code (or some other language) and send it through existing C compilers. This allows reuse of the optimiser and code generator, gives good flexibility, control over the runtime, and is really easy for front-end implementers to understand, implement, and maintain. Unfortunately, doing this prevents efficient implementation of exception handling, provides a poor debugging experience, slows down compilation, and can be problematic for languages that require guaranteed tail calls (or other features not supported by C).

GCC Compiler. A final successful implementation of this model is GCC. GCC supports many front ends and back ends, and has an active and broad community of contributors. GCC has a long history of being a C compiler that supports multiple targets with hacky support for a few other languages bolted onto it. As the years go by, the GCC community is slowly evolving a cleaner design. As of GCC 4.4, it has a new representation for the optimizer (known as "GIMPLE Tuples") which is closer to being separate from the front-end representation than before. Also, its Fortran and Ada front ends use a clean AST.

While very successful, these three approaches have strong limitations to what they can be used for, because they are designed as monolithic applications. As one example, it is not realistically possible to embed GCC into other applications, to use GCC as a runtime/JIT compiler, or extract and reuse pieces of GCC without pulling in most of the compiler. People who have wanted to use GCC's C++ front end for documentation generation, code indexing, refactoring, and static analysis tools have had to use GCC as a monolithic application that emits interesting information as XML, or write plugins to inject foreign code into the GCC process.

There are multiple reasons why pieces of GCC cannot be reused as libraries, including rampant use of global variables, weakly enforced invariants, poorly-designed data structures, sprawling code base, and the use of macros that prevent the codebase from being compiled to support more than one front-end/target pair at a time. The hardest problems to fix, though, are the inherent architectural problems that stem from its early design and age. Specifically, GCC suffers from layering problems and leaky abstractions: the back end walks front-end ASTs to generate debug info, the front ends generate back-end data structures, and the entire compiler depends on global data structures set up by the command line interface.