Code Translation
Category | System |
---|
Overview
Computers can only execute machine code instructions.
Machine code instructions are part of an instruction set, specific to a type of processor.
Programs are written in source code using a low-level or high-level language, and stored as text files.
The source code is translated into machine code.
The machine code is stored into an executable file that is run by the operating system and executed by the computer.
The machine code can also be stored into library files that are loaded by the operating system to provide shared functionality across multiple programs.
The source code of some high-level languages is instead translated into bytecode, which a virtual machine either interprets or compiles into machine code at run time.
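To make the bytecode path concrete, here is a minimal sketch of a stack-based virtual machine in C. The opcodes (OP_PUSH, OP_ADD, OP_MUL, OP_HALT) and the run_bytecode function are hypothetical names invented for this illustration; real virtual machines use far richer instruction sets and typically add verification and run-time compilation.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (not a real VM's instruction set). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Interprets the bytecode and returns the value on top of the stack at OP_HALT.
   No bounds checking: the sketch assumes well-formed bytecode. */
int run_bytecode(const int *code)
{
    int stack[64];
    size_t sp = 0;   /* number of values currently on the stack */
    size_t pc = 0;   /* index of the next instruction */

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:                 /* the operand follows the opcode */
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:                  /* pop two values, push their sum */
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:                  /* pop two values, push their product */
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
        default:                      /* unknown opcodes also stop the machine */
            return sp > 0 ? stack[sp - 1] : 0;
        }
    }
}
```

For instance, the sequence OP_PUSH 2, OP_PUSH 3, OP_ADD, OP_PUSH 4, OP_MUL, OP_HALT evaluates (2 + 3) * 4.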
Translation
The translation of source code to machine code consists of several steps: compilation, assembly, and linking.
Compilation
Compilation is performed by the compiler.
The compiler translates source code from a high-level language into a lower-level language.
Assembly languages are translated into machine code by an assembler rather than by a compiler.
Native languages like C and C++ are compiled into object code.
Managed languages like C# and Java are compiled into bytecode.
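Compilation itself consists of three stages: the front end, the middle end, and the back end.
The front end translates the source code into an intermediate representation in several steps.
The middle end optimizes the intermediate representation, and the back end generates code for the target processor.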
Preprocessing
Preprocessing is performed by the preprocessor.
The preprocessor performs file inclusion, macro expansion, and conditional selection of source text before compilation proper begins.
Tokenization
Tokenization (or lexical analysis) is performed by the lexer.
Tokenization is the process of converting a sequence of characters into a sequence of tokens.
A token is a string with an assigned meaning, such as a keyword, an identifier, a literal, or a punctuator.
Parsing
Parsing (or syntax analysis) is performed by the parser.
Parsing is the process of identifying the syntactic structure of the program from a sequence of tokens.
The parser builds an abstract syntax tree (AST), which represents that structure as nested nodes.
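As an illustration of how a parser turns flat input into a tree, the sketch below is a small recursive-descent parser for arithmetic expressions with +, * and parentheses. It is not how a C compiler parses C: the grammar is deliberately tiny, the names Node, parse, eval, and free_tree are hypothetical, and for brevity it reads characters directly (no spaces allowed) instead of consuming tokens from a separate lexer and does no error reporting.

```c
#include <ctype.h>
#include <stdlib.h>

/* AST node: either a number leaf or a binary operator with two children. */
typedef struct Node {
    char op;                 /* '+', '*', or 'n' for a number */
    int value;               /* used only when op == 'n' */
    struct Node *lhs, *rhs;  /* used only for operators */
} Node;

static Node *new_node(char op, int value, Node *lhs, Node *rhs)
{
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        abort();             /* keep the sketch simple: give up on allocation failure */
    n->op = op;
    n->value = value;
    n->lhs = lhs;
    n->rhs = rhs;
    return n;
}

static Node *parse_expr(const char **s);

/* factor := number | '(' expr ')' */
static Node *parse_factor(const char **s)
{
    if (**s == '(') {
        (*s)++;                          /* consume '(' */
        Node *inner = parse_expr(s);
        if (**s == ')')
            (*s)++;                      /* consume ')' */
        return inner;
    }
    int value = 0;
    while (isdigit((unsigned char)**s)) {
        value = value * 10 + (**s - '0');
        (*s)++;
    }
    return new_node('n', value, NULL, NULL);
}

/* term := factor ('*' factor)* */
static Node *parse_term(const char **s)
{
    Node *lhs = parse_factor(s);
    while (**s == '*') {
        (*s)++;
        lhs = new_node('*', 0, lhs, parse_factor(s));
    }
    return lhs;
}

/* expr := term ('+' term)* */
static Node *parse_expr(const char **s)
{
    Node *lhs = parse_term(s);
    while (**s == '+') {
        (*s)++;
        lhs = new_node('+', 0, lhs, parse_term(s));
    }
    return lhs;
}

/* Entry point: parses an expression string into an AST. */
Node *parse(const char *source)
{
    return parse_expr(&source);
}

/* Evaluates the tree, showing that its shape encodes operator precedence. */
int eval(const Node *n)
{
    if (n->op == 'n')
        return n->value;
    int l = eval(n->lhs), r = eval(n->rhs);
    return n->op == '+' ? l + r : l * r;
}

/* Releases every node of the tree. */
void free_tree(Node *n)
{
    if (n == NULL)
        return;
    free_tree(n->lhs);
    free_tree(n->rhs);
    free(n);
}
```

Parsing "2+3*4" yields a '+' node whose right child is a '*' node, so the multiplication is grouped first without any parentheses in the input.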
Assembly
Assembly is performed by the assembler.
The assembler translates assembly code, whether handwritten or emitted by a compiler, into machine code stored in an object file.
Linking
Linking is performed by the linker.
The linker combines the machine code from multiple object files and libraries into a single executable or library file, resolving references between them.
Toolsets
Clang and LLVM together compile C-family languages such as C, C++, and Objective-C to machine code.
Clang
Clang handles the front end stage: it translates C-family source code into LLVM intermediate representation (IR).
LLVM
LLVM handles the middle end and back end stages: it optimizes the intermediate representation and translates it to machine code.
The LLVM Core libraries provide a modern optimizer and support code generation for many CPUs.
Example
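Let's review an example of the translation of a simple C program. Standard C declares main as returning int; the void form used here is a shorthand that many compilers accept.
#include <stdio.h>

void main()
{
    puts("Hello World!");
}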
Preprocessing
Include directive
Adds the contents of the file named stdio.h to the source code.
#include <stdio.h>
Macro directive
Substitutes each occurrence of the token WIDTH with the integer constant 42.
#define WIDTH 42
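A short sketch of the substitution in use follows. The area function is a hypothetical example, not part of the program above; the preprocessor replaces WIDTH before the compiler ever sees the function body.

```c
#define WIDTH 42

/* After preprocessing, the compiler sees "return 42 * height;". */
int area(int height)
{
    return WIDTH * height;
}
```

Running clang with the -E option stops after preprocessing and prints the expanded source, which is a convenient way to inspect the result of the substitution.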
Conditional directives
Control the compilation of portions of a source file depending on whether the DEBUG preprocessor symbol is defined.
If DEBUG is defined, the statement that prints "Debug" is compiled; otherwise, the statement that prints "Release" is compiled.
#if defined(DEBUG)
puts("Debug");
#else
puts("Release");
#endif
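The same directives can be placed inside a complete function, as in the sketch below. The build_mode function is a hypothetical helper introduced for illustration; only one of the two return statements survives preprocessing, so the other never reaches the compiler.

```c
/* Returns the label selected at preprocessing time. */
const char *build_mode(void)
{
#if defined(DEBUG)
    return "Debug";      /* kept only when DEBUG is defined */
#else
    return "Release";    /* kept when DEBUG is not defined */
#endif
}
```

The symbol can be defined in the source with #define DEBUG or on the command line, for example with clang's -DDEBUG option.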
Tokenization
The main function of the program is converted into the following tokens:
- void
- main
- (
- )
- {
- puts
- (
- "Hello World!"
- )
- ;
- }
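The list above can be reproduced with a very small lexer. The following is a minimal sketch, not Clang's lexer: the tokenize function and the MAX_TOKEN_LEN limit are hypothetical, and it only recognizes identifiers or keywords, string literals without escape sequences, and single-character punctuators.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKEN_LEN 32   /* hypothetical limit chosen for this sketch */

/* Splits src into tokens: identifiers/keywords, string literals, and
   single-character punctuators. Returns the number of tokens stored.
   Tokens longer than MAX_TOKEN_LEN - 1 characters are truncated. */
size_t tokenize(const char *src, char tokens[][MAX_TOKEN_LEN], size_t max_tokens)
{
    size_t count = 0;

    while (*src != '\0' && count < max_tokens) {
        if (isspace((unsigned char)*src)) {      /* whitespace separates tokens */
            src++;
            continue;
        }

        const char *start = src;
        if (isalpha((unsigned char)*src) || *src == '_') {
            /* identifier or keyword: letters, digits, underscores */
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;
        } else if (*src == '"') {
            /* string literal: runs to the closing quote (escapes not handled) */
            src++;
            while (*src != '\0' && *src != '"')
                src++;
            if (*src == '"')
                src++;
        } else {
            src++;                               /* single-character punctuator */
        }

        size_t len = (size_t)(src - start);
        if (len >= MAX_TOKEN_LEN)
            len = MAX_TOKEN_LEN - 1;
        memcpy(tokens[count], start, len);
        tokens[count][len] = '\0';
        count++;
    }
    return count;
}
```

Called on the text of the main function with a buffer such as char tokens[16][MAX_TOKEN_LEN], it fills the buffer with the eleven tokens listed above, in the same order. A production lexer would also classify each token (keyword, identifier, literal, punctuator) and record its source position.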