Shaders are small programs executed by the GPU.

A shader program is written in a shading language specific to a graphics API (for example, GLSL for OpenGL, HLSL for DirectX, or MSL for Metal).

Shader programs in a graphics pipeline are responsible for all transform, lighting, and shading effects.

Compute programs enable high-performance general-purpose computation on the GPU (GPGPU).

There are different kinds of shader programs, executed at different stages of the graphics pipeline.

A GPU has a unified shader architecture, which means that all shader stages share the same instruction set architecture (ISA).

Thanks to this architecture, a GPU can balance its workload by allocating its shader cores to different shader programs.

Shader Languages


Data Types

GPUs natively support 32-bit integers, 32-bit and 64-bit floating point scalars and vectors.

A vertex program has three types of input data:


Shader programs can perform common operations on their data types, such as additions and multiplications.

Other operations, such as transcendental math functions, are provided as intrinsic functions.

Flow control is supported using two methods: static flow control, which branches on constant (uniform) values, and dynamic flow control, which branches on values that can vary per vertex or fragment.

Dynamic flow branches are more costly than static flow branches, because each shader invocation can take a different path.
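A minimal GLSL fragment shader can illustrate both kinds of flow control and the use of intrinsic functions; the uniform and variable names here (useToonShading, lightDir) are illustrative, not from a specific API.

```glsl
#version 450

layout(location = 0) in vec3 vNormal;
layout(location = 0) out vec4 fragColor;

// Uniform values are the same for every invocation, so branching on
// them is static flow control.
uniform bool useToonShading;
uniform vec3 lightDir;

void main() {
    // Intrinsic functions such as normalize, dot, max, and floor are built in.
    float diffuse = max(dot(normalize(vNormal), -lightDir), 0.0);

    if (useToonShading) {   // static branch: uniform condition
        diffuse = floor(diffuse * 4.0) / 4.0;
    }

    if (diffuse > 0.5) {    // dynamic branch: the condition varies per fragment
        fragColor = vec4(vec3(diffuse), 1.0);
    } else {
        fragColor = vec4(vec3(diffuse * 0.5), 1.0);
    }
}
```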

Vertex Attributes

A vertex shader can process vertex data in an arbitrary layout. To describe how the data is laid out, vertex attributes are defined.

Each vertex attribute is defined by:

Vertex Data (Varying)

Vertex data can be stored in multiple arrays.

Each array is described by vertex attributes.


For example, one array can contain vertex positions and another vertex colors.

Several objects can be rendered using the same vertex positions but distinct vertex colors.
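In GLSL, this maps to one attribute location per array; the application binds each location to a buffer description (format, offset, stride), so positions and colors can come from separate arrays. The matrix name mvpMatrix is illustrative.

```glsl
#version 450

// Each attribute has a location; the application associates each
// location with its own buffer, so positions and colors can live
// in separate arrays and be mixed and matched per object.
layout(location = 0) in vec3 inPosition;
layout(location = 1) in vec3 inColor;

layout(location = 0) out vec3 vColor;

uniform mat4 mvpMatrix;

void main() {
    gl_Position = mvpMatrix * vec4(inPosition, 1.0);
    vColor = inColor;
}
```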

Constant Data (Uniform)

In addition to vertex input data, constant data can be provided to the shader programs using uniform buffers. Typically, information such as transform matrices, time values or other effects like fog parameters are passed to shader programs as constant data.

This data is constant during a single frame, and it is typically separated into scene data that is shared by every rendered object, such as the camera view transform matrix, and object-specific data that is updated before rendering each object, such as a model transform matrix.

When possible, data should be defined as SIMD types to match the memory layout and alignment of the GPU.
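A sketch of this split using GLSL uniform blocks, assuming an OpenGL-style std140 layout; the block and member names are illustrative. The std140 layout gives well-defined 16-byte alignment, which is why a vec4 is used where a vec3 would suffice.

```glsl
#version 450

// Scene-level constant data, shared by every object in the frame.
layout(std140, binding = 0) uniform SceneData {
    mat4 viewMatrix;
    mat4 projectionMatrix;
    vec4 fogColor;    // vec4 rather than vec3 to keep 16-byte alignment
    float time;
};

// Object-level constant data, updated before each draw call.
layout(std140, binding = 1) uniform ObjectData {
    mat4 modelMatrix;
};

layout(location = 0) in vec3 inPosition;

void main() {
    gl_Position = projectionMatrix * viewMatrix * modelMatrix
                * vec4(inPosition, 1.0);
}
```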

Shader Programs

Shader programs are written as functions.

Most shading languages follow C-style syntax rules.

Shader Stages

Depending on the graphics pipeline, different kinds of shaders are supported.

Vertex Shader

This is a fully programmable shader stage.

The main task of a vertex shader is to process incoming vertex data and map each vertex to a position in the viewport (clip-space coordinates).

A vertex shader must at least output a vertex position.

A vertex shader cannot add or remove vertices.

The output of a vertex shader can be sent to different stages, depending on the pipeline configuration (tessellation, geometry, or rasterization).


Tessellation Shader

The tessellation stage can be used to render curved surfaces.

This is an optional stage.

The level of detail can be controlled based on the distance of the object from the camera.

The tessellation stage itself consists of three stages: the control stage (hull shader in DirectX), the tessellator, and the evaluation stage (domain shader in DirectX).

The control stage and the evaluation stage are programmable stages.

The tessellator is a fixed-function stage.


The input of the control stage is a patch primitive.

A patch primitive consists of several control points defining a subdivision surface or Bézier curve.

The control stage has two functions: computing the tessellation factors and processing the control points of the patch.

The different types of tessellation surface are triangles, quads, and isolines.

The tessellation factors (known as tessellation levels in OpenGL and Vulkan) have two types: inner factors, which control the subdivision of the patch interior, and outer factors, which control the subdivision along each patch edge.

The tessellation factors and the type of the tessellation surface are sent to the tessellator and evaluation stage.

The control points of the transformed patch are sent to the evaluation stage.

The control shader can discard a patch by setting a tessellation factor to zero or a negative value.


The tessellator generates a set of vertices with their barycentric coordinates (relative locations on the surface).

The points are sent to the evaluation stage.


The evaluation stage processes the vertices from the tessellator using the control points to generate the output values for the vertices.

The generated triangles are sent to the rasterizer stage or the geometry shader stage.
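The two programmable tessellation stages can be sketched in GLSL as follows, for a pass-through triangle patch; the uniform name tessLevel is illustrative. The control shader runs once per output control point and writes the tessellation factors; the evaluation shader runs once per vertex generated by the tessellator and uses the barycentric coordinates in gl_TessCoord.

```glsl
#version 450

// ---- Tessellation control shader (one invocation per control point) ----
layout(vertices = 3) out;   // pass a triangle patch through unchanged

uniform float tessLevel;

void main() {
    // Copy the control point through to the evaluation stage.
    gl_out[gl_InvocationID].gl_Position = gl_in[gl_InvocationID].gl_Position;

    if (gl_InvocationID == 0) {
        // Outer factors: subdivision along each patch edge.
        gl_TessLevelOuter[0] = tessLevel;
        gl_TessLevelOuter[1] = tessLevel;
        gl_TessLevelOuter[2] = tessLevel;
        // Inner factor: subdivision of the patch interior.
        gl_TessLevelInner[0] = tessLevel;
        // Setting a factor to 0 would discard the patch.
    }
}
```

The matching evaluation shader lives in a separate file:

```glsl
#version 450

// ---- Tessellation evaluation shader (one invocation per new vertex) ----
layout(triangles, equal_spacing, ccw) in;

void main() {
    // gl_TessCoord holds the barycentric coordinates from the tessellator.
    gl_Position = gl_TessCoord.x * gl_in[0].gl_Position
                + gl_TessCoord.y * gl_in[1].gl_Position
                + gl_TessCoord.z * gl_in[2].gl_Position;
}
```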

Geometry Shader

The geometry shader can transform primitives into other types of primitives.

Geometry shaders modify input data and can duplicate it.

This is an optional and fully programmable shader stage.

The geometry shader processes points, lines, or triangles, and can also process extended primitives that contain adjacency information, such as the adjacent vertices on a polyline.

Geometry shaders can also process patches, but the tessellator is more efficient.

Geometry shaders support instancing.

The processed vertices of a geometry shader are sent to the rasterizer stage.

Optionally, the vertices can be written to an output stream (transform feedback in OpenGL) to be sent back through the pipeline.

This functionality can be used to simulate particles, but it is costly in memory, as the data is always stored as floating-point numbers.
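A common use of primitive transformation is expanding each point into a screen-aligned quad (a billboard). A minimal GLSL sketch, with the illustrative uniform halfSize:

```glsl
#version 450

// Geometry shader: expands each input point into a quad,
// emitted as a two-triangle strip.
layout(points) in;
layout(triangle_strip, max_vertices = 4) out;

uniform float halfSize;

void main() {
    vec4 center = gl_in[0].gl_Position;
    gl_Position = center + vec4(-halfSize, -halfSize, 0.0, 0.0); EmitVertex();
    gl_Position = center + vec4( halfSize, -halfSize, 0.0, 0.0); EmitVertex();
    gl_Position = center + vec4(-halfSize,  halfSize, 0.0, 0.0); EmitVertex();
    gl_Position = center + vec4( halfSize,  halfSize, 0.0, 0.0); EmitVertex();
    EndPrimitive();
}
```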


Fragment Shader

The fragment shader is known as the pixel shader in DirectX.

This is a fully programmable shader stage.

The rasterizer stage interpolates the vertex data and sends it to the fragment shader stage.

The default interpolation is a perspective-correct interpolation but the type of interpolation can be changed (for example, screen-space interpolation in which perspective projection is ignored).

A fragment is the portion of a triangle covering a single pixel, produced by the rasterizer.

The main task of a fragment shader is to process incoming fragment data and calculate a color value for the final pixels.

The fragment shader can also output an opacity value and a depth value.

The color value and depth value are then written to the color buffer and depth buffer.

The default depth value comes from the rasterizer, but can be overridden by a fragment shader.

In the merge stage, the output from the fragment shader can be used to produce different effects, by testing the current values in the depth buffer and stencil buffer.
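A minimal GLSL fragment shader showing these outputs; the uniform name opacity is illustrative. Color and opacity go out through the color attachment, and the rasterizer's depth value could be overridden by writing gl_FragDepth.

```glsl
#version 450

layout(location = 0) in vec3 vColor;
layout(location = 0) out vec4 fragColor;   // written to the color buffer

uniform float opacity;

void main() {
    // RGB color plus an opacity value in the alpha channel.
    fragColor = vec4(vColor, opacity);

    // The depth value from the rasterizer could be overridden with:
    // gl_FragDepth = someCustomDepth;
}
```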

Compute Shader

The GPU can be used for any kind of processing task and isn’t limited to graphics.

This is called general-purpose GPU (GPGPU) programming.

The compute processing pipeline is made up of a programmable kernel function that executes in a compute pass and reads from and writes to resources directly.

Thread groups

In order to execute in parallel, each workload must be broken apart into thread groups.

A compute pass must specify the number of times to execute a kernel function. Threads are organized into a 3D grid and this number corresponds to the grid size.

Each thread group has a small amount of memory that is shared among its threads.

For example, for processing a 2D image, each thread corresponds to a unique texel, and the grid size must be at least the size of the image.

The thread execution width is the number of threads that can be scheduled to run concurrently on the GPU (usually a power of two). Selecting an efficient thread group size depends on both the size of the data and the capabilities of a specific device. To make the most efficient use of the GPU, the total number of items in a thread group should be a multiple of the thread execution width.
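The image-processing example above can be sketched as a GLSL compute shader. The 16x16 group size (256 items) is a common choice that is a multiple of typical thread execution widths, but the ideal size is device-specific; the binding points and image names are illustrative.

```glsl
#version 450

// A 16x16 thread group: 256 items per group, intended to be a
// multiple of the device's thread execution width.
layout(local_size_x = 16, local_size_y = 16) in;

layout(binding = 0, rgba8) uniform readonly  image2D srcImage;
layout(binding = 1, rgba8) uniform writeonly image2D dstImage;

void main() {
    // One thread per texel; the global ID indexes into the 3D grid.
    ivec2 texel = ivec2(gl_GlobalInvocationID.xy);
    if (any(greaterThanEqual(texel, imageSize(srcImage)))) {
        return;   // guard threads past the edge when sizes don't divide evenly
    }
    vec4 color = imageLoad(srcImage, texel);
    imageStore(dstImage, texel, vec4(1.0 - color.rgb, color.a));
}
```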



Gouraud, Phong, and Flat Shading

Lighting can be computed once per vertex, or once per fragment.

Per vertex lighting is known as Gouraud shading.

Per fragment lighting is known as Phong shading.

With models of low vertex densities, Gouraud shading will produce visible artifacts such as angularly shaped highlights.

Per primitive shading produces a faceted appearance known as Flat shading.

Flat shading can be performed in a vertex shader. It is achieved by disabling interpolation of the vertex output, so that the value from the first (provoking) vertex is used for every fragment of the primitive.
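In GLSL, interpolation is disabled with the flat qualifier on a vertex output; the attribute and uniform names below are illustrative.

```glsl
#version 450

layout(location = 0) in vec3 inPosition;
layout(location = 1) in vec3 inColor;

// flat disables interpolation: the provoking vertex's value is used
// for the whole primitive, giving a faceted appearance.
layout(location = 0) flat out vec3 vFlatColor;

uniform mat4 mvpMatrix;

void main() {
    gl_Position = mvpMatrix * vec4(inPosition, 1.0);
    vFlatColor = inColor;
}
```

The fragment shader must declare the matching input with the same qualifier: `layout(location = 0) flat in vec3 vFlatColor;`.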


Lighting is generally calculated in fragment shaders.

To calculate lighting in fragment shaders, the vertex shader passes the transformed vertex normals to the fragment shader as varying data.

If the vertex normals are modified in the vertex shader, they need to be renormalized before they are sent to the rasterization stage.

After the interpolation in the rasterization stage, the normals need to be renormalized.

The lighting calculations can be performed either in world space, or in camera space.

If the light positions are expressed in world space, it is generally preferable to perform the lighting calculations in world space too.
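A world-space per-fragment lighting sketch in GLSL, showing the renormalization of the interpolated normal; the uniform names (lightDirWorld, baseColor) are illustrative, and lightDirWorld is assumed to point toward the light.

```glsl
#version 450

// World-space lighting. The vertex shader passes the transformed normal
// as varying data; after interpolation it is no longer unit length,
// so it must be renormalized here.
layout(location = 0) in vec3 vWorldNormal;
layout(location = 0) out vec4 fragColor;

uniform vec3 lightDirWorld;   // unit direction toward the light, world space
uniform vec3 baseColor;

void main() {
    vec3 n = normalize(vWorldNormal);   // renormalize after interpolation
    float diffuse = max(dot(n, lightDirWorld), 0.0);
    fragColor = vec4(baseColor * diffuse, 1.0);
}
```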