Code Translation

Category	System

Overview

Computers can only execute machine code instructions.

Machine code instructions are part of an instruction set, specific to a type of processor.

Programs are written as source code in a low-level or high-level language and stored as text files.

The source code is translated into machine code.

The machine code is stored in an executable file that the operating system loads and the processor executes.

The machine code can also be stored in library files that the operating system loads to provide shared functionality across multiple programs.

The source code of some high-level languages is translated into bytecode, which a virtual machine either interprets or compiles just in time (JIT) into the machine code that is executed by the computer.
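To make the idea of interpreted bytecode concrete, here is a minimal sketch of a stack-based interpreter. The opcodes (OP_PUSH, OP_ADD, OP_MUL, OP_HALT) and the run_bytecode function are hypothetical names invented for this illustration; real bytecode formats such as Java bytecode or .NET IL are far richer, and the sketch only interprets, it does not generate machine code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for illustration only. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Interprets `code` on a small operand stack and returns the value
   on top of the stack when OP_HALT is reached. */
int run_bytecode(const int *code)
{
    int stack[64];
    size_t sp = 0;   /* number of values currently on the stack */
    size_t pc = 0;   /* index of the next instruction */

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:
            assert(sp < 64);
            stack[sp++] = code[pc++];   /* the operand follows the opcode */
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}
```

With this encoding, the expression (2 + 3) * 4 becomes the sequence PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT, and run_bytecode returns 20 for it.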

Translation

The translation of source code to machine code consists of several processes.

Compilation

Compilation is performed by the compiler.

The compiler translates source code from a high-level language to a lower-level language.

Assembly languages are not compiled: an assembler translates them into machine code.

Native languages like C and C++ are compiled into object code.

Managed languages like C# and Java are compiled into bytecode.

💡

Bytecode is also known as intermediate language (IL).

Compilation itself consists of several stages: the front end, the middle end, and the back end.

The front end translates the source code into an intermediate representation in several steps.

Preprocessing

Preprocessing is performed by the preprocessor.

The preprocessor performs textual substitutions and macro expansions before the rest of the compilation.

Tokenization

Tokenization (or lexical analysis) is performed by the lexer.

Tokenization converts a sequence of characters into a sequence of tokens.

Tokens are strings with an identified meaning, such as keywords, identifiers, literals, and punctuators.
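The following is a minimal sketch of a lexer, assuming only three kinds of tokens: words (keywords and identifiers), string literals, and single-character punctuators. The names TokenKind and next_token are hypothetical, and a real C lexer also handles numbers, comments, escape sequences, and multi-character operators.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token categories for this sketch only. */
typedef enum { TOK_END, TOK_WORD, TOK_STRING, TOK_PUNCT } TokenKind;

/* Reads the next token starting at *cursor, copies its text into buf
   (truncated to size - 1 characters), advances *cursor past the token,
   and returns the token's kind. */
TokenKind next_token(const char **cursor, char *buf, size_t size)
{
    const char *p = *cursor;
    const char *start;
    size_t len;
    TokenKind kind;

    assert(size > 0);

    while (isspace((unsigned char)*p))
        p++;                          /* whitespace only separates tokens */
    start = p;

    if (*p == '\0') {
        kind = TOK_END;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        kind = TOK_WORD;              /* keyword or identifier */
    } else if (*p == '"') {
        p++;
        while (*p != '\0' && *p != '"')
            p++;                      /* escape sequences are not handled */
        if (*p == '"')
            p++;
        kind = TOK_STRING;
    } else {
        p++;
        kind = TOK_PUNCT;             /* single-character punctuator */
    }

    len = (size_t)(p - start);
    if (len >= size)
        len = size - 1;
    memcpy(buf, start, len);
    buf[len] = '\0';
    *cursor = p;
    return kind;
}
```

Calling next_token repeatedly until it returns TOK_END walks through the input one token at a time; the caller keeps the cursor, so the lexer itself holds no state.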

Parsing

Parsing (or syntax analysis) is performed by the parser.

Parsing identifies the syntactic structure of the program from a sequence of tokens.

The parser builds an abstract syntax tree (AST).
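As a rough illustration of how a parser builds an AST, the sketch below parses arithmetic expressions made of integers, +, *, and parentheses using recursive descent. The Node type and the function names are hypothetical; to stay short, the sketch reads characters directly instead of consuming tokens from a separate lexer, and it does not skip whitespace or report syntax errors.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* AST node: a number (op == 0) or a binary operation ('+' or '*'). */
typedef struct Node {
    char op;
    int value;               /* used when op == 0 */
    struct Node *lhs, *rhs;  /* used when op != 0 */
} Node;

static Node *new_node(char op, int value, Node *lhs, Node *rhs)
{
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        abort();
    n->op = op;
    n->value = value;
    n->lhs = lhs;
    n->rhs = rhs;
    return n;
}

static Node *parse_sum(const char **p);

/* factor := number | '(' sum ')' */
static Node *parse_factor(const char **p)
{
    int value = 0;

    if (**p == '(') {
        (*p)++;
        Node *inner = parse_sum(p);
        if (**p == ')')
            (*p)++;
        return inner;
    }
    while (isdigit((unsigned char)**p)) {
        value = value * 10 + (**p - '0');
        (*p)++;
    }
    return new_node(0, value, NULL, NULL);
}

/* product := factor ('*' factor)*   binds tighter than '+' */
static Node *parse_product(const char **p)
{
    Node *lhs = parse_factor(p);
    while (**p == '*') {
        (*p)++;
        lhs = new_node('*', 0, lhs, parse_factor(p));
    }
    return lhs;
}

/* sum := product ('+' product)* */
static Node *parse_sum(const char **p)
{
    Node *lhs = parse_product(p);
    while (**p == '+') {
        (*p)++;
        lhs = new_node('+', 0, lhs, parse_product(p));
    }
    return lhs;
}

/* Parses `text` and returns the root of its AST. */
Node *parse_expression(const char *text)
{
    return parse_sum(&text);
}

/* Walks the tree to compute the value of the expression. */
int eval(const Node *n)
{
    assert(n != NULL);
    if (n->op == 0)
        return n->value;
    if (n->op == '+')
        return eval(n->lhs) + eval(n->rhs);
    return eval(n->lhs) * eval(n->rhs);
}

void free_ast(Node *n)
{
    if (n == NULL)
        return;
    free_ast(n->lhs);
    free_ast(n->rhs);
    free(n);
}
```

For the input 1+2*3, the root of the tree is the + node and its right child is the * node, so operator precedence is encoded in the shape of the tree rather than in the text.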

Assembly

Assembly is performed by the assembler.

The assembler translates the assembly code produced by the compiler into machine code, which is stored in an object file.
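As a toy illustration of what an assembler does, the sketch below encodes one line of assembly text into bytes. The mnemonics (PUSH, ADD, HALT), their opcode values, and the assemble_line function are all hypothetical and do not correspond to any real instruction set; a real assembler also resolves labels and emits relocation and symbol information into the object file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical one-byte opcodes; not a real instruction set. */
enum { OP_PUSH = 0x01, OP_ADD = 0x02, OP_HALT = 0xFF };

/* Encodes one line of assembly text into `out` (at least 2 bytes).
   Returns the number of bytes written, or 0 if the line is not recognized. */
size_t assemble_line(const char *line, unsigned char *out)
{
    char mnemonic[8];
    int operand = 0;
    int fields = sscanf(line, "%7s %d", mnemonic, &operand);

    if (fields >= 2 && strcmp(mnemonic, "PUSH") == 0) {
        out[0] = OP_PUSH;
        out[1] = (unsigned char)operand;   /* operand truncated to one byte */
        return 2;
    }
    if (fields >= 1 && strcmp(mnemonic, "ADD") == 0) {
        out[0] = OP_ADD;
        return 1;
    }
    if (fields >= 1 && strcmp(mnemonic, "HALT") == 0) {
        out[0] = OP_HALT;
        return 1;
    }
    return 0;
}
```

The mapping is a direct table lookup: each mnemonic corresponds to exactly one opcode, which is what distinguishes assembling from compiling.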

Linking

Linking is performed by the linker.

The linker combines the machine code from multiple object files and libraries into a single executable file.

Toolsets

Clang and LLVM together compile C-family languages (C, C++, Objective-C) to machine code.

Clang

Clang handles the front-end stage: it compiles C-family source code into an intermediate representation (IR).

LLVM

LLVM handles the middle-end and back-end stages: it optimizes the intermediate representation and generates machine code from it.

The LLVM Core libraries provide a modern optimizer and support code generation for many CPUs.
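To observe these stages separately, the clang driver can be asked to stop after each one. The sketch below is a small source file, with the relevant invocations listed in its header comment; the file name greet.c and the function greet are hypothetical, and the commands assume that clang is installed and on the PATH.

```c
/* greet.c (hypothetical file name)
 *
 * Stopping the clang driver after each stage:
 *   clang -E greet.c                          preprocess only, print the result
 *   clang -S -emit-llvm greet.c -o greet.ll   emit textual LLVM IR
 *   clang -S greet.c -o greet.s               emit assembly for the target CPU
 *   clang -c greet.c -o greet.o               emit an object file
 */
#include <stdio.h>

/* Prints a greeting; returns 0 on success and 1 on failure. */
int greet(void)
{
    return puts("Hello World!") < 0;
}
```

None of these commands link, so the file does not need a main function; producing an executable additionally requires linking greet.o with an object file that defines main.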

Example

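Let's review the translation of a simple C program. The snippet omits the #include <stdio.h> line that declares puts, which is covered under Preprocessing. Standard C specifies int as the return type of main, although many compilers accept void main().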

void main()
{
    puts("Hello World!");
}

Preprocessing

Include directive

Adds the contents of the file named stdio.h to the source code.

#include <stdio.h>

Macro directive

Substitutes each later occurrence of the token WIDTH with the integer constant 42.

#define WIDTH    42

Conditional directives

Controls the compilation of portions of a source file depending on whether the DEBUG preprocessor symbol is defined. If DEBUG is defined, the call that writes "Debug" is compiled; otherwise, the call that writes "Release" is compiled.

#if defined(DEBUG)
puts("Debug");
#else
puts("Release");
#endif
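The three kinds of directive above can be combined in one compilable file, sketched below. The function names print_build_mode and doubled_width are hypothetical and only exist to give the directives something to act on.

```c
#include <stdio.h>   /* include directive: brings in the declaration of puts */

#define WIDTH 42     /* macro directive: every later WIDTH becomes 42 */

/* Writes the build mode; returns 0 on success and 1 on failure.
   Only one of the two calls below reaches the compiler. */
int print_build_mode(void)
{
#if defined(DEBUG)
    return puts("Debug") < 0;
#else
    return puts("Release") < 0;
#endif
}

/* After preprocessing, the body reads: return 2 * 42; */
int doubled_width(void)
{
    return 2 * WIDTH;
}
```

With GCC or Clang, passing -DDEBUG on the command line defines the DEBUG symbol without editing the source, which switches the branch that is compiled.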

Tokenization

The program is converted into the following tokens:

void

main

(

)

{

puts

(

"Hello World!"

)

;

}