
SIMD

Category: Types

Overview

SIMD (Single instruction, multiple data)

SIMD exploits data-level parallelism: dedicated instructions perform the same operation on multiple floating-point values simultaneously.
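As a minimal sketch with SSE intrinsics (the function name AddFour is illustrative), a single instruction adds four floats at once:

```cpp
#include <xmmintrin.h>

// Add four pairs of floats with a single SIMD instruction.
inline void AddFour(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);            // load 4 floats from a
    __m128 vb = _mm_loadu_ps(b);            // load 4 floats from b
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); // 4 additions in one instruction
}
```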

Alignment

SIMD types should be 16-byte aligned. Some intrinsics support operations on unaligned data, but at a performance cost.
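A sketch of the difference with SSE (the type and function names are illustrative): aligned storage enables the aligned load intrinsic, while the unaligned variant works on any address.

```cpp
#include <xmmintrin.h>

// 16-byte aligned storage allows the aligned load/store intrinsics.
struct alignas(16) AlignedVec4
{
    float f[4];
};

inline __m128 LoadAligned(const AlignedVec4& v)
{
    return _mm_load_ps(v.f);  // requires a 16-byte aligned address
}

inline __m128 LoadUnaligned(const float* p)
{
    return _mm_loadu_ps(p);   // works on any address, potentially slower
}
```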

Calling Conventions

There are different calling requirements depending on the target platform.

On Xbox 360, Xbox One, and Windows, SIMD types require the __fastcall calling convention.

💡
The __fastcall calling convention specifies that arguments to functions are to be passed in registers.

Platforms

Microsoft

Xbox

Microsoft provides the Xbox Math Library on Xbox.

The library is implemented in Xboxmath.h.

Windows

In earlier versions of DirectX, Microsoft provided the D3DX 9 and D3DX 10 math libraries.

These libraries are also available on Xbox 360, but their use there is discouraged.

💡
The libraries are now deprecated.

The main types are D3DXVECTOR2, D3DXVECTOR3, D3DXVECTOR4, D3DXMATRIXA16, D3DXQUATERNION, and D3DXCOLOR.

Windows and Xbox 360

Microsoft provides the XNA Math library on Windows and Xbox 360.

💡
The library is deprecated on Windows.

The library supports:

The library is implemented in Xnamath.h.

The main types are XMVECTOR and XMMATRIX. They are 16-byte aligned.

XMVECTOR is an opaque data structure. The components cannot be accessed directly.
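For illustration, components are instead read and written through the library's accessor functions (a sketch; assumes the XNA Math header from the DirectX SDK is available):

```cpp
#include <xnamath.h>

// Components are read and written through accessor functions,
// never through direct member access.
XMVECTOR BuildVector()
{
    XMVECTOR v = XMVectorSet(1.0f, 2.0f, 3.0f, 0.0f);
    float x = XMVectorGetX(v);      // read the X component
    v = XMVectorSetY(v, x + 1.0f);  // write the Y component
    return v;
}
```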

Windows and Xbox One

Microsoft provides the DirectXMath library on Windows and Xbox One.

The library supports:

The library is implemented in DirectXMath.h and DirectXPackedVector.h.

The main types are XMVECTOR and XMMATRIX.

Conversion between scalar and vector forms is inefficient and should only be done when required.

⚠️
For portability, initializer lists should not be used with these types.
⚠️
The components cannot be accessed directly.

The library contains additional types in which the components can be accessed directly.

💡
They are implemented as C structs with float members.
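A sketch of how these directly accessible types convert to and from the SIMD form (assumes the DirectXMath headers are available; the function name is illustrative):

```cpp
#include <DirectXMath.h>
using namespace DirectX;

// XMFLOAT3 exposes x/y/z members directly; XMLoadFloat3 and
// XMStoreFloat3 convert to and from the SIMD form, and each
// conversion has a cost.
void DoubleInPlace(XMFLOAT3& value)
{
    XMVECTOR v = XMLoadFloat3(&value);  // scalar -> SIMD
    v = XMVectorAdd(v, v);
    XMStoreFloat3(&value, v);           // SIMD -> scalar
}
```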

The library uses row-major matrices, row vectors, and pre-multiplication.

⚠️
HLSL shaders default to column-major matrices.

Vectors

XMVECTOR wraps a SIMD register.

Matrices

XMMATRIX wraps four SIMD registers.

Matrix operations are suffixed with LH or RH to work with either left-handed or right-handed view coordinates.
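For illustration, the same projection exists in both variants (a sketch; assumes the DirectXMath headers are available):

```cpp
#include <DirectXMath.h>
using namespace DirectX;

// The same perspective projection, left-handed and right-handed.
XMMATRIX projLH = XMMatrixPerspectiveFovLH(XM_PIDIV4, 16.0f / 9.0f, 0.1f, 100.0f);
XMMATRIX projRH = XMMatrixPerspectiveFovRH(XM_PIDIV4, 16.0f / 9.0f, 0.1f, 100.0f);
```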

Best Practices

Intrinsics

SSE

Declaration

The main header is <xmmintrin.h>.

It provides the __m128 type and the _mm_XXX() functions.

To define a 3D vector, instead of using three X/Y/Z single-precision floating-point components, we store packed single-precision floating-point elements as a __m128 value.

⚠️
The structure must not have any other members; this ensures that the __vectorcall calling convention works and passes function arguments in as many registers as possible.

struct Vector3
{
    __m128 m;
};

In the constructor, we call _mm_set_ps to set the __m128 value from supplied values.

inline explicit Vector3(float x, float y, float z)
{
    // _mm_set_ps takes its arguments in (w, z, y, x) order;
    // the unused W lane is filled with z.
    m = _mm_set_ps(z, z, y, x);
}

Accessors

The __m128 type does not provide direct access to the X/Y/Z components, so it's necessary to implement wrapper functions.

To obtain the X component, we call _mm_cvtss_f32 to get a copy of the lower element.

To obtain the Y and Z components, we first call _mm_shuffle_ps to shuffle the elements using a mask built with the _MM_SHUFFLE macro, broadcasting element 1 (for the Y component) or element 2 (for the Z component).

inline float x() const { return _mm_cvtss_f32(m); }
inline float y() const { return _mm_cvtss_f32(_mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1))); }
inline float z() const { return _mm_cvtss_f32(_mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2))); }

⚠️
It's discouraged to access the X/Y/Z components through an indexer on the m128_f32 member (an MSVC-specific member), as it is slower and causes a spill to memory.

inline float operator[] (size_t i) const { return m.m128_f32[i]; };
inline float& operator[] (size_t i) { return m.m128_f32[i]; };

To set the X component, we call _mm_set_ss and _mm_move_ss to define a new __m128 value in which the X component is replaced while the Y and Z components are preserved.

To set the Y and Z components, we first call _mm_shuffle_ps to shuffle the target element into the lowest lane of a temporary value, call _mm_move_ss to replace it, and then shuffle the elements back.

⚠️
It's discouraged to provide direct write accessors to the X/Y/Z components, as they cause additional loads and stores.
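A sketch of the write accessors described above (the setX/setY/setZ names are illustrative; the struct repeats the earlier Vector3 layout):

```cpp
#include <xmmintrin.h>

struct Vector3
{
    __m128 m;

    // Overwrite X: build (x, 0, 0, 0) with _mm_set_ss, then take its
    // lower element and keep the upper elements of m.
    inline void setX(float x)
    {
        m = _mm_move_ss(m, _mm_set_ss(x));
    }

    // Overwrite Y: swap X and Y into a temporary, replace the lowest
    // lane, then swap back.
    inline void setY(float y)
    {
        __m128 t = _mm_shuffle_ps(m, m, _MM_SHUFFLE(3, 2, 0, 1));
        t = _mm_move_ss(t, _mm_set_ss(y));
        m = _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 2, 0, 1));
    }

    // Overwrite Z: same idea, swapping X and Z.
    inline void setZ(float z)
    {
        __m128 t = _mm_shuffle_ps(m, m, _MM_SHUFFLE(3, 0, 1, 2));
        t = _mm_move_ss(t, _mm_set_ss(z));
        m = _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 0, 1, 2));
    }
};
```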

Binary operations

Binary operations are implemented as dedicated functions that take another __m128 value.

inline Vector3 operator+ (Vector3 a, Vector3 b) { a.m = _mm_add_ps(a.m, b.m); return a; }
inline Vector3 operator- (Vector3 a, Vector3 b) { a.m = _mm_sub_ps(a.m, b.m); return a; }
inline Vector3 operator* (Vector3 a, Vector3 b) { a.m = _mm_mul_ps(a.m, b.m); return a; }
inline Vector3 operator/ (Vector3 a, Vector3 b) { a.m = _mm_div_ps(a.m, b.m); return a; }

Binary operations in which the other operand is a single floating-point value are implemented with the same functions where another __m128 value is computed using _mm_set1_ps.

inline Vector3 operator* (Vector3 a, float b) { a.m = _mm_mul_ps(a.m, _mm_set1_ps(b)); return a; }
inline Vector3 operator/ (Vector3 a, float b) { a.m = _mm_div_ps(a.m, _mm_set1_ps(b)); return a; }
inline Vector3 operator* (float a, Vector3 b) { b.m = _mm_mul_ps(_mm_set1_ps(a), b.m); return b; }
inline Vector3 operator/ (float a, Vector3 b) { b.m = _mm_div_ps(_mm_set1_ps(a), b.m); return b; }

Unary operations

To negate a vector, we compute the difference between the zero value and the current value. A __m128 value with all elements set to zero is created with _mm_setzero_ps.

// assumes a Vector3(__m128) constructor in addition to the float one
inline Vector3 operator- (Vector3 value) { return Vector3(_mm_setzero_ps()) - value; }

Function calls

As math types such as vectors and matrices can be quite large, they are typically passed by reference.

void Translate(const Vector3& value);

When working with SIMD types, we instead want to use the __vectorcall calling convention and pass the parameters by value. The values stay in the XMM0-XMM3 registers and no copy is made.

void __vectorcall Translate(Vector3 value);

NEON

Portability

Define an alias for each platform.

#if SIMD_SSE
  using Simd128Type = __m128;
#elif SIMD_NEON
  using Simd128Type = float32x4_t;
#elif SIMD_VMX128
  using Simd128Type = __vector4;
...
#endif

Define another alias for each platform's argument-passing requirement.

#if SIMD_BYVALUE
  using Simd128TypeArg = Simd128Type;
#elif SIMD_BYREF
  using Simd128TypeArg = Simd128Type&;
#endif

Then, when defining the common interface for SIMD operations, we use these aliases.

Simd128Type Add(Simd128TypeArg a, Simd128TypeArg b);

The implementation of each platform's SIMD operations is written in different header files.

For example, for the add operation with SSE:

namespace Simd
{
  inline Simd128Type Vector3Add(Simd128TypeArg vector1, Simd128TypeArg vector2)
  {
    return _mm_add_ps(vector1, vector2);
  }
}

The internal math structures are aligned and contain their corresponding SIMD type.

struct alignas(16) Vector3Type
{
  Simd128Type _v;
};

The math operations are implemented as wrappers.

inline void Vector3Add(Vector3Type* result, const Vector3Type& vector1, const Vector3Type& vector2)
{
  result->_v = Simd::Vector3Add(vector1._v, vector2._v);
}

The math types used in the game code include their corresponding internal math structure.

class Vector3
{
  Vector3Type _v;
  ...
};

When we implement the operations, we use the common interface.

inline Vector3 Vector3::operator+(const Vector3& value) const
{
  Vector3 result;
  Vector3Add(&result, *this, value);
  return result;
}

Optimizations

Partial Loads

On some platforms (such as VMX128), functions that initialize fewer than 4 components leave the remaining components uninitialized.

Other platforms (such as SSE) always initialize all 4 components by setting the remaining components to 0.
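With SSE, a 3-component load can be sketched by explicitly zeroing the fourth lane rather than leaving it undefined (the function name Load3 is illustrative):

```cpp
#include <xmmintrin.h>

// Build a 4-element register from 3 components, explicitly
// setting the fourth lane to 0.
inline __m128 Load3(float x, float y, float z)
{
    return _mm_set_ps(0.0f, z, y, x);  // arguments in (w, z, y, x) order
}
```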

Generic to SIMD

A naive way to adopt SIMD is to replace the members of generic math types such as Vector3 with a SIMD register.

struct Vector3
{
  SimdType _v;
};

However, SIMD is most efficient when operating on many components.

Instead, each Vector3 component should be stored in a distinct SIMD float4, and the operations should be performed on four objects at the same time.

SimdVector3 inputs[4];
SimdVector3 outputs[4];
SimdOperation(inputs, outputs);
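A sketch of this structure-of-arrays layout with SSE (the SimdVector3x4 name is illustrative): the X components of four objects share one register, so four objects are added with just three instructions.

```cpp
#include <xmmintrin.h>

// Four Vector3s stored component-wise: xs holds the X components of
// all four objects, ys the Ys, zs the Zs.
struct SimdVector3x4
{
    __m128 xs, ys, zs;
};

// Adds four pairs of Vector3s in three SIMD instructions.
inline SimdVector3x4 Add(const SimdVector3x4& a, const SimdVector3x4& b)
{
    return { _mm_add_ps(a.xs, b.xs),
             _mm_add_ps(a.ys, b.ys),
             _mm_add_ps(a.zs, b.zs) };
}
```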

Alignment

SIMD types should be 16-byte aligned, using compiler directives or aligned memory allocators.

SIMD operations are more efficient with aligned types.

Accessors and operators

Avoid accessing individual components through accessors or operator overloads.

It requires moving from the SIMD registers to the scalar ones and back again.

Pointer parameters

Pointer-based operations are more efficient than non-pointer operations because they load data directly from memory into the SIMD registers.
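A sketch of such a pointer-based operation with SSE (assuming 16-byte aligned data; the function name is illustrative):

```cpp
#include <xmmintrin.h>

// The operands are loaded directly from (aligned) memory into SIMD
// registers and the result is stored straight back, with no scalar
// round trip.
inline void Vector4Add(float* result, const float* a, const float* b)
{
    _mm_store_ps(result, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
}
```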