Graphics Pipeline



To produce a scene, the 3D objects need several properties, such as geometry (vertices and triangles), materials, and transforms.

Retained Mode and Immediate Mode

A retained-mode API is declarative. The application constructs a scene from graphics primitives. Each time a new frame is drawn, the graphics library transforms the scene into a set of drawing commands.

An immediate-mode API is procedural. Each time a new frame is drawn, the application directly issues the drawing commands.

Modern graphics APIs are generally immediate. Game engines built on top of these APIs are generally retained.
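The contrast can be sketched in a few lines of Python. The names here are hypothetical; real APIs are far richer.

```python
commands = []  # stands in for the GPU command stream

def draw_triangle(tri):
    commands.append(("draw", tri))

# Immediate mode: the application issues the drawing commands itself,
# every frame.
def render_immediate(triangles):
    for tri in triangles:
        draw_triangle(tri)

# Retained mode: the application declares a scene once; the library
# transforms it into drawing commands each time a frame is drawn.
class Scene:
    def __init__(self):
        self.primitives = []

    def add(self, tri):
        self.primitives.append(tri)

    def render(self):  # done by the graphics library, not the application
        for tri in self.primitives:
            draw_triangle(tri)

scene = Scene()
scene.add("A")
scene.add("B")
scene.render()                # the library emits the commands
render_immediate(["A", "B"])  # the application emits the same commands
```

Either way the GPU sees the same command stream; the difference is who owns the scene description between frames.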



A CPU is optimized to handle a variety of data structures.

Modern CPUs have multiple cores and SIMD vector processors for parallel processing.

To reduce the latency of accessing data from the main memory, CPUs use a hierarchy of local caches.

CPUs use different techniques to improve performance, such as branch prediction and out-of-order execution.


A GPU is a stream processor that contains a set of processors called shader cores.

GPUs are optimized for throughput. They are able to process a massive amount of data very quickly in parallel.


Vertex programs are executed in threads, which have their own registers.

Threads that execute the same shader program are grouped into warps.

Within a warp, the shader program is executed in lock-step on all processors.

When a stall occurs in a warp (for example, a texture fetch in a pixel shader), the warp is swapped out for a different warp, which is then executed on all processors.

The cost of swapping is very low, and this is the main technique used by GPUs to hide the latency of accessing memory.
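A toy round-robin scheduler (made-up cycle counts and instruction names) shows how extra warps fill the cycles lost to a fetch stall:

```python
from collections import deque

def run(warps, fetch_latency=4):
    """Simulate warp swapping: when the active warp stalls on a memory
    fetch, another ready warp runs instead, keeping the cores busy."""
    ready = deque(range(len(warps)))
    stalled = {}  # warp id -> cycle when its fetch completes
    busy = idle = cycle = 0
    while ready or stalled:
        # Wake warps whose fetch has completed.
        for w, t in list(stalled.items()):
            if t <= cycle:
                del stalled[w]
                if warps[w]:
                    ready.append(w)
        if ready:
            w = ready.popleft()
            op = warps[w].pop(0)  # execute one instruction
            busy += 1
            if op == "fetch":     # stall: swap the warp out
                stalled[w] = cycle + fetch_latency
            elif warps[w]:
                ready.append(w)
        else:
            idle += 1             # nothing ready to run: cores idle
        cycle += 1
    return busy, idle
```

With a single warp, every fetch leaves the cores idle for the full latency; with four warps running the same program, the stall cycles are filled by the other warps and the idle time drops to zero.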


Frame Buffer

The frame buffer (or color buffer) contains the color data that is presented to the display.

Depth Buffer

The depth buffer (or z-buffer) contains the depth data for each pixel on the screen.

It has the same size as the frame buffer, but uses a different format.

The output merger stage of the pipeline evaluates each fragment using its z-value to determine whether the fragment is discarded or presented as a pixel on the display.

This operation is called depth testing.

When a fragment passes the depth test, the frame buffer is updated with its color data, and its z-value is written to the depth buffer.

Further processing of other fragments uses the latest z-value in the depth buffer to perform the depth test operation.

This technique allows primitives to be rendered in any order.

However, it does not work with transparent primitives.
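A minimal sketch of the depth test, assuming a smaller-is-closer depth convention:

```python
import math

W, H = 4, 4
frame_buffer = [["black"] * W for _ in range(H)]
depth_buffer = [[math.inf] * W for _ in range(H)]  # far plane = infinity

def output_merger(x, y, z, color):
    # Depth test: keep the fragment only if it is closer than the
    # z-value already stored in the depth buffer.
    if z < depth_buffer[y][x]:
        depth_buffer[y][x] = z          # update depth...
        frame_buffer[y][x] = color      # ...and color together

# Primitives can be submitted in any order.
output_merger(1, 1, 0.8, "red")    # far fragment drawn first
output_merger(1, 1, 0.3, "green")  # closer fragment wins
output_merger(1, 1, 0.5, "blue")   # behind green: discarded
```

Whatever the submission order, the pixel ends up with the color of the closest fragment, which is exactly why opaque geometry needs no sorting.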

Stencil Buffer

The stencil buffer usually contains 8-bit data associated with each pixel.

It has the same size as the frame buffer, and is usually stored with the depth buffer.

The stencil buffer is used to control how pixels are rendered to achieve different effects such as masking.

This operation is called stencil testing.
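A masking effect can be sketched as two passes, assuming a simple equal-to-reference stencil test:

```python
W, H = 4, 2
stencil = [[0] * W for _ in range(H)]       # 8-bit value per pixel
frame   = [["black"] * W for _ in range(H)]

# Pass 1: render the mask shape, writing 1 into the stencil buffer.
for x in (1, 2):
    stencil[0][x] = 1

# Pass 2: render the scene with a stencil test "equal to reference".
def stencil_test(x, y, color, ref=1):
    if stencil[y][x] == ref:
        frame[y][x] = color  # pixel passes the stencil test

for y in range(H):
    for x in range(W):
        stencil_test(x, y, "red")
```

Only the pixels inside the mask receive color; real APIs offer many more comparison functions and stencil update operations than this sketch.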


To avoid flickering when presenting the frame buffer to the display, applications use a technique known as double-buffering.

When using double-buffering, the scene is rendered into a back buffer while the front buffer is presented to the display; the two buffers are swapped during a vertical retrace of the display.

Additional back buffers can be added to form a swap chain.
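A minimal model of a swap chain, with the swap performed by rotating the buffer list (buffer names are hypothetical):

```python
from collections import deque

# Two buffers give classic double-buffering; appending more back
# buffers turns this into a longer swap chain.
swap_chain = deque(["buffer0", "buffer1"])

def front():
    return swap_chain[0]   # currently presented to the display

def back():
    return swap_chain[-1]  # currently being rendered into

def present():
    # Performed during the vertical retrace: the finished back buffer
    # becomes the new front buffer, so the display never shows a
    # half-drawn frame.
    swap_chain.rotate(1)

present()
```

After the call to `present()`, the buffer that was being drawn into is the one shown on screen, and rendering continues in the other buffer.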

Rendering Pipeline

The main function of the graphics rendering pipeline is to render a 2D image based on 3D objects, a camera, lights, and other graphics elements.

The rendering pipeline is composed of several stages.


Application

Prepare the geometry for the Geometry Processing stage.

This stage is executed by one or more CPU cores.

Some operations can be executed on the GPU using compute programs (GPGPU).

Geometry Processing

Process the rendering primitives (points, lines, triangles).

The stage is divided into several functional stages.

Vertex Shading



Optional Vertex Processing


Rasterization

Conversion of 2D vertices in screen space into colored pixels on the screen.

This stage is fixed and is not programmable.

Pixel Processing

Draw Calls

An optimized game will have a minimum number of draw calls to render a frame. Typically, there are about 1000 draw calls per frame.

Each draw call requires a set of render states (shaders, textures, render modes...).

Changing render states can be expensive for the CPU.

Frame Time

Typically, a game targets a frame rate of 30 FPS (33.3ms/frame) or 60 FPS (16.7ms/frame).
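The per-frame budget follows directly from the target frame rate:

```python
def frame_budget_ms(fps):
    # Time available to produce one frame at the target frame rate:
    # 1000 ms divided among `fps` frames.
    return 1000.0 / fps
```

Everything the CPU and GPU do for a frame (simulation, draw call submission, rendering) must fit inside this budget, or the frame rate drops.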

In a balanced timeline, the CPU and GPU utilization is 100% and their workloads are offset so that the CPU works on frame N+1 while the GPU works on frame N.

In a CPU-bound frame, the CPU will take more time than the GPU, and the GPU will be idle for some time. Giving more work to the GPU is not a simple solution, as it also requires more work from the CPU, and the frame rate is already fixed.

By optimizing the number of draw calls and changes of render states, we can reduce the CPU workload.
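One common CPU-side optimization is sorting draw calls by render state so that consecutive calls share state and fewer expensive state changes occur. A sketch with made-up state keys:

```python
# Each draw call carries the render states it requires.
draws = [
    {"shader": "lit",   "texture": "brick", "mesh": "wall"},
    {"shader": "unlit", "texture": "sky",   "mesh": "dome"},
    {"shader": "lit",   "texture": "brick", "mesh": "tower"},
]

def state_changes(draw_list):
    # Count how many times the (shader, texture) state must change
    # when the draws are submitted in order.
    changes, current = 0, None
    for d in draw_list:
        state = (d["shader"], d["texture"])
        if state != current:
            changes += 1
            current = state
    return changes

# Sorting by render state groups draws that share state together.
sorted_draws = sorted(draws, key=lambda d: (d["shader"], d["texture"]))
```

In this example the unsorted submission order requires three state changes, the sorted order only two; with thousands of draw calls per frame, the savings add up.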


Perspective Projection

In a perspective projection, objects further away from the camera are smaller than objects closer to the camera. Parallel lines may converge at the horizon.

The view volume of a perspective projection is a truncated pyramid with a rectangular base known as the view frustum.

The following parameters are used to generate the perspective transform of a camera.

The field of view is the extent of the scene that is seen on the screen at any given moment.

The aspect ratio is defined by the ratio between the screen width and the screen height.

The objects closer than the near clipping plane are not rendered.

The objects further away than the far clipping plane are not rendered.
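These four parameters are enough to build the perspective matrix. A sketch using the OpenGL-style convention (right-handed view space looking down -z, clip-space z in [-1, 1]; other APIs use different sign and depth-range conventions):

```python
import math

def perspective(fov_y_deg, aspect, near, far):
    """4x4 perspective projection matrix (row-major)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0,                         0.0],
        [0.0,        f,   0.0,                         0.0],
        [0.0,        0.0, (far + near) / (near - far),
                          2.0 * far * near / (near - far)],
        [0.0,        0.0, -1.0,                        0.0],
    ]
```

The bottom row copies -z into the w component, which is what makes distant objects shrink after the perspective divide.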

Orthographic Projection

In an orthographic projection (or parallel projection), parallel lines remain parallel after the transformation.

The view volume of an orthographic projection is a rectangular box.
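An orthographic matrix that maps such a box to the unit cube, again assuming the OpenGL-style convention (clip-space range [-1, 1]):

```python
def orthographic(left, right, bottom, top, near, far):
    """4x4 orthographic projection matrix (row-major): scales and
    translates the view box onto the unit cube."""
    return [
        [2.0 / (right - left), 0.0, 0.0,
         -(right + left) / (right - left)],
        [0.0, 2.0 / (top - bottom), 0.0,
         -(top + bottom) / (top - bottom)],
        [0.0, 0.0, -2.0 / (far - near),
         -(far + near) / (far - near)],
        [0.0, 0.0, 0.0, 1.0],
    ]
```

Note the last row is (0, 0, 0, 1): w stays 1 for every vertex, so there is no perspective divide effect and parallel lines remain parallel.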

Projection Transformation

The projection transformation is represented by a 4x4 matrix for a perspective projection or an orthographic projection; it converts the view volume into a unit cube (the canonical view volume).

Transformed vertices are in clip coordinates, which are homogeneous coordinates.
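The step from clip coordinates to normalized device coordinates is a division by the homogeneous w component, the perspective divide:

```python
def to_ndc(clip):
    # Perspective divide: homogeneous clip coordinates (x, y, z, w)
    # become normalized device coordinates inside the unit cube.
    x, y, z, w = clip
    return (x / w, y / w, z / w)

ndc = to_ndc((2.0, -1.0, 0.5, 2.0))
```

For an orthographic projection w is 1 and the divide changes nothing; for a perspective projection w grows with distance from the camera, which is what shrinks distant geometry.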