To produce a scene, the 3D objects need several properties.
- A shape, defined by 3D models (built from rendering primitives such as triangles).
- A transform, defined by a position and orientation in space (and a scale).
- A material that defines how the object appears when affected by light sources (from a simple color to a realistic physical description).
- Lights that affect the object.
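The properties above can be sketched as a minimal scene-object description. This is an illustrative data layout with hypothetical names, not the structure of any particular engine:

```python
from dataclasses import dataclass, field

@dataclass
class Transform:
    position: tuple = (0.0, 0.0, 0.0)   # position in space
    rotation: tuple = (0.0, 0.0, 0.0)   # orientation (Euler angles, radians)
    scale: tuple = (1.0, 1.0, 1.0)      # and a scale

@dataclass
class Material:
    base_color: tuple = (1.0, 1.0, 1.0)  # from a simple color...
    roughness: float = 0.5               # ...to physically based parameters

@dataclass
class SceneObject:
    mesh: list                  # shape: rendering primitives (e.g. triangles)
    transform: Transform
    material: Material
    lights: list = field(default_factory=list)  # lights affecting the object
```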
Retained Mode and Immediate Mode
A retained-mode API is declarative. The application constructs a scene from graphics primitives. Each time a new frame is drawn, the graphics library transforms the scene into a set of drawing commands.
An immediate-mode API is procedural. Each time a new frame is drawn, the application directly issues the drawing commands.
Modern graphics APIs are generally immediate-mode. Game engines built on top of these APIs are generally retained-mode.
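The two styles can be contrasted with a toy sketch (a hypothetical API, not any real library):

```python
# Immediate mode: the application issues the drawing commands itself,
# every frame.
def draw_frame_immediate(commands, issue):
    for cmd in commands:
        issue(cmd)  # e.g. "bind texture", "draw triangle"

# Retained mode: the application declares a scene once; the library
# translates it into drawing commands each time a frame is drawn.
class RetainedScene:
    def __init__(self):
        self.objects = []

    def add(self, obj):
        self.objects.append(obj)

    def render(self, issue):
        # the library, not the application, emits the commands
        for obj in self.objects:
            issue(f"draw {obj}")
```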
A CPU is optimized to handle a variety of data structures.
Modern CPUs have multiple cores and SIMD vector processors for parallel processing.
To improve the latency of accessing data from the main memory, CPUs use a hierarchy of local caches.
CPUs use different techniques to improve performance, such as branch prediction and out-of-order execution.
A GPU is a stream processor that contains a set of processors called shader cores.
GPUs are optimized for throughput. They are able to process a massive amount of data very quickly in parallel.
Vertex programs are executed in threads, which have their own registers.
Threads that execute the same shader program are grouped into warps.
Within a warp, the shader program is executed in lock-step on all processors.
When a stall occurs in a warp (for example, a texture fetch in a pixel shader), the warp is swapped out for a different warp, which is then executed on all processors.
The cost of swapping is very low, and this is the main technique used by GPUs to hide the latency of accessing memory.
The frame buffer (or color buffer) contains the color data that is presented to the display.
The depth buffer (or z-buffer) contains the depth data for each pixel on the screen.
It has the same size as the frame buffer but uses a different format.
The output merger stage of the pipeline evaluates each fragment with its z-value to determine whether the fragment is discarded or presented as a pixel on the display.
This operation is called depth testing.
When a pixel is presented to the display, the frame buffer is updated with the color data, and the z-value is written to the depth buffer.
Further processing of other fragments uses the latest z-value in the depth buffer to perform the depth test.
This technique allows primitives to be rendered in any order.
However, it does not work with transparent primitives.
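The depth-test logic described above can be sketched per fragment. This is a simplified model (one buffer cell per pixel, smaller z meaning closer), not how hardware implements it:

```python
def resolve_fragments(fragments, width, height, far=1.0):
    """Depth-test a stream of (x, y, z, color) fragments, in any order."""
    color_buffer = [[None] * width for _ in range(height)]
    depth_buffer = [[far] * width for _ in range(height)]
    for x, y, z, color in fragments:
        if z < depth_buffer[y][x]:      # fragment is closer: depth test passes
            depth_buffer[y][x] = z      # write the z-value to the depth buffer
            color_buffer[y][x] = color  # update the color buffer
        # otherwise the fragment is discarded
    return color_buffer
```

Submitting the near fragment before or after the far one produces the same image, which is why primitives can be rendered in any order.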
The stencil buffer usually contains 8-bit data associated with each pixel.
It has the same size as the frame buffer, and is usually stored with the depth buffer.
The stencil buffer is used to control how pixels are rendered to achieve different effects such as masking.
This operation is called stencil testing.
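A minimal masking example: a stencil value per pixel gates whether a fragment may be written. This is a sketch of one possible test (render only where the mask is non-zero), not tied to any specific API:

```python
def stencil_test(fragments, stencil_buffer, width, height):
    """Keep only (x, y, color) fragments whose pixel has a non-zero stencil value."""
    passed = []
    for x, y, color in fragments:
        if stencil_buffer[y][x] != 0:  # render only inside the mask
            passed.append((x, y, color))
    return passed
```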
To avoid flickering when presenting the frame buffer to the display, applications use a technique known as double buffering.
With double buffering, a front buffer and a back buffer are swapped during a vertical retrace of the display, so that each is presented in succession.
Additional back buffers can be created to create a swap chain.
The main function of the graphics rendering pipeline is to render a 2D image based on 3D objects, a camera, lights, and other graphics elements.
The rendering pipeline is composed of several stages.
Prepare the geometry for the Geometry Processing stage.
This stage is executed by one or more CPU cores.
Some operations can be executed on the GPU using compute programs (GPGPU).
Process the rendering primitives (points, lines, triangles).
This stage is divided into several functional stages.
- This stage is fully programmable.
- Perform per-vertex operations.
- The input vertices are expressed in model space, which means they are relative to the center of the model.
- The vertices are converted into view space, where the camera is at the origin. The view volume is then defined by either an orthographic or a perspective projection.
- This stage is fixed and is not programmable.
- The view volume is transformed into a unit cube.
- The primitives outside the unit cube are discarded.
- The primitives inside the unit cube are processed further.
- The primitives that are partially inside the view volume are clipped against the unit cube.
- Vertices outside the unit cube are discarded.
- New vertices are generated (from the intersection with the unit cube).
- Screen Mapping
- This stage is configurable.
- The 3D coordinates of the primitives are converted into screen coordinates.
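Screen mapping can be sketched as a remap of the unit cube's x/y range to pixel coordinates. This assumes normalized device coordinates in [-1, 1] and a screen origin at the top-left, which is one common convention:

```python
def screen_map(ndc_x, ndc_y, width, height):
    """Map normalized device coordinates ([-1, 1]) to screen coordinates."""
    screen_x = (ndc_x + 1.0) * 0.5 * width
    screen_y = (1.0 - ndc_y) * 0.5 * height  # y flipped: origin at top-left
    return screen_x, screen_y
```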
Optional Vertex Processing
- Generate new vertices and triangles to process more complex curved shapes expressed as patches.
- Consists of other stages: hull, tessellator, and domain.
- Geometry Shading
- This stage is fully programmable.
- Generate new vertices and triangles.
- Stream Output
- Output processed vertices to the CPU or GPU for further processing.
- Useful when performing simulations on the GPU (e.g. particle simulations).
Conversion of the 2D primitives in screen space into colored pixels on the screen.
This stage is fixed and is not programmable.
- Triangle Setup
- Interpolation of shading data.
- Triangle Traversal
- Generate fragments, with attributes interpolated among the three triangle vertices.
- Pixel Shading
- This stage is fully programmable.
- Per-pixel operations are performed such as texturing to produce one or more colors.
- This stage is fixed but configurable.
- The pixels are stored in the color buffer.
- The merging stage combines the fragment color with the color currently stored in the color buffer.
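For transparent fragments, the merging stage typically blends the incoming color with the stored color instead of replacing it. A common configuration is "over" alpha blending (sketch; real pipelines expose this as a configurable blend state):

```python
def alpha_blend(src, dst):
    """Blend source (r, g, b, a) over the stored destination color (r, g, b)."""
    r, g, b, a = src
    return (r * a + dst[0] * (1.0 - a),
            g * a + dst[1] * (1.0 - a),
            b * a + dst[2] * (1.0 - a))
```

Because this result depends on the order of operations, transparent primitives must be drawn back to front, unlike opaque ones.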
An optimized game will have a minimum number of draw calls to render a frame. Typically, there are about 1000 draw calls per frame.
Each draw call requires a set of render states (shaders, textures, render modes...).
Changing render states can be expensive for the CPU.
- translation to hardware commands
- shader compilation
- sending data to the GPU
Typically, a game targets a frame rate of 30 FPS (33.3 ms/frame) or 60 FPS (16.7 ms/frame).
In a balanced timeline, the CPU and GPU utilization is 100% and their workloads are offset so that the CPU works on frame N+1 while the GPU works on frame N.
In a CPU-bound frame, the CPU will take more time than the GPU, and the GPU will be idle for some time. Giving more work to the GPU is not a simple solution, as it also requires more work from the CPU, and the frame rate is already fixed.
By optimizing the number of draw calls and changes of render states, we can reduce the CPU workload.
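One common CPU-side optimization is to sort draw calls by render state, so that each expensive state change is applied once per group rather than once per call. A sketch with hypothetical state keys:

```python
from itertools import groupby

def issue_draw_calls(draw_calls, set_state, draw):
    """draw_calls: list of (state, mesh). Sorting by state minimizes changes."""
    changes = 0
    by_state = sorted(draw_calls, key=lambda call: call[0])
    for state, group in groupby(by_state, key=lambda call: call[0]):
        set_state(state)    # expensive: shaders, textures, render modes...
        changes += 1
        for _, mesh in group:
            draw(mesh)      # cheap once the state is bound
    return changes
```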
In a perspective projection, objects further away from the camera are smaller than objects closer to the camera. Parallel lines may converge at the horizon.
The view volume of a perspective projection is a truncated pyramid with a rectangular base known as the view frustum.
The following parameters are used to generate the perspective transform of a camera.
- Field of view (FOV)
- Aspect ratio
- Near clipping plane (Near Z)
- Far clipping plane (Far Z)
The field of view is the extent of the scene that is seen on the screen at any given moment.
The aspect ratio is defined by the ratio between the screen width and the screen height.
The objects closer than the near clipping plane are not rendered.
The objects further away than the far clipping plane are not rendered.
In an orthographic projection (or parallel projection), parallel lines remain parallel after the transformation.
The view volume of an orthographic projection is a rectangular box.
The projection transformation is represented by a 4x4 matrix (perspective or orthographic) that converts the view volume into a unit cube.
Transformed vertices are in clip coordinates, which are homogeneous coordinates.
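The four camera parameters above can be combined into a standard perspective matrix; multiplying a vertex by it yields homogeneous clip coordinates, and the perspective divide (by w) maps them into the unit cube. A sketch using the OpenGL-style convention (camera looking down -z, NDC z in [-1, 1]):

```python
import math

def perspective(fov_y, aspect, near, far):
    """Build a 4x4 perspective projection matrix (row-major, OpenGL convention)."""
    f = 1.0 / math.tan(fov_y / 2.0)  # cotangent of half the field of view
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def project(m, v):
    """Transform a view-space point v = (x, y, z) and apply the perspective divide."""
    vec = (v[0], v[1], v[2], 1.0)  # homogeneous coordinates
    x, y, z, w = (sum(m[r][c] * vec[c] for c in range(4)) for r in range(4))
    return (x / w, y / w, z / w)   # normalized device coordinates
```

Points on the near plane map to NDC z = -1 and points on the far plane to z = +1, so anything outside [-1, 1] after the divide lies outside the view frustum and is clipped.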