# SIMD


# Overview

SIMD (Single Instruction, Multiple Data) exploits data-level parallelism: dedicated instructions perform the same operation on multiple floating-point values simultaneously.

- SIMD is useful for speeding up floating-point computations.

- SIMD instructions are platform-dependent, and each platform exposes a different API.

# Alignment

SIMD types should be 16-byte aligned. Some intrinsics support operations on unaligned data, but at a performance cost.

# Calling Conventions

There are different calling requirements depending on the target platform.

On Xbox 360, Xbox One and Windows, SIMD types require the `__fastcall` calling convention.

# Platforms

## Microsoft

### Xbox

Microsoft provides the **Xbox Math Library** on Xbox.

The library is implemented in `Xboxmath.h`.

### Windows

In earlier versions of DirectX, Microsoft provided the **D3DX 9** and **D3DX 10** math libraries.

The library is available on Xbox 360, but its use is discouraged.

The main types are `D3DXVECTOR2`, `D3DXVECTOR3`, `D3DXVECTOR4`, `D3DXMATRIXA16`, `D3DXQUATERNION`, and `D3DXCOLOR`.

### Windows and Xbox 360

Microsoft provides the **XNA Math** library on Windows and Xbox 360.

The library supports:

- A generic implementation without intrinsics.

- **SSE/SSE2** intrinsics on Windows.

- **VMX128** intrinsics on Xbox 360.

The library is implemented in `Xnamath.h`.

The main types are `XMVECTOR` and `XMMATRIX`. They are 16-byte aligned.

`XMVECTOR` wraps a SIMD register. `XMMATRIX` wraps four SIMD registers.

`XMVECTOR` is an opaque data structure: its components cannot be accessed directly.

### Windows and Xbox One

Microsoft provides the **DirectXMath** library on Windows and Xbox One. Compared to XNA Math, it adds:

- C++11 support.

- Color types.

- Bounding volume types.

The library supports:

- A generic implementation without intrinsics.

- **SSE/SSE2** intrinsics on Windows and Xbox One.

- **NEON** intrinsics on Windows RT (ARM).

The library is implemented in `DirectXMath.h` and `DirectXPackedVector.h`.

The main types are `XMVECTOR` and `XMMATRIX`. Both are:

- Opaque data structures.

- 16-byte aligned.

Conversion between scalar and vector forms is inefficient and should only be done when required.

The library contains additional types whose components can be accessed directly.

- Unaligned: `XMFLOAT3`, `XMFLOAT4`, `XMFLOAT4X3`, `XMFLOAT4X4`.

- Aligned: `XMFLOAT3A`, `XMFLOAT4A`, `XMFLOAT4X3A`, `XMFLOAT4X4A`.

The library uses row-major matrices, row vectors, and pre-multiplication.

### Vectors

`XMVECTOR` wraps a SIMD register.

### Matrices

`XMMATRIX` wraps four SIMD registers.

Matrix operations are suffixed with `LH` or `RH` to work with either left-handed or right-handed view coordinates.

### Best Practices

- On Windows, enable `/fp:fast`.

- Call `XMVerifyCPUSupport` at startup to check for processor support.

- The library provides aligned and unaligned versions of types and memory operations. The aligned versions are more efficient.

- Many functions have an equivalent with a `Ptr` suffix, which is more efficient than the non-pointer version. For example, `XMVectorGetX` and `XMVectorGetXPtr`.

- Many functions have an equivalent with an `Est` suffix, which trades accuracy for improved performance. For example, `XMVector3Normalize` and `XMVector3NormalizeEst`.

# Intrinsics

## SSE

### Declaration

The main header is `<xmmintrin.h>`. It provides the `__m128` type and the `_mm_XXX()` functions.

To define a 3D vector, instead of storing three X/Y/Z single-precision floating-point components, we store packed single-precision floating-point elements in a `__m128` value.

⚠ The structure must not have any other members, so that the `vectorcall` calling convention works and uses as many registers as possible to pass function arguments.

```
struct Vector3
{
    __m128 m;
};
```

In the constructor, we call `_mm_set_ps` to set the `__m128` value from the supplied values.

```
inline explicit Vector3(float x, float y, float z)
{
    // _mm_set_ps takes the elements in reverse (W, Z, Y, X) order;
    // the unused fourth element is filled with a copy of z.
    m = _mm_set_ps(z, z, y, x);
}
```

### Accessors

The `__m128` type does not provide direct access to the X/Y/Z components, so it is necessary to implement wrapper functions.

To obtain the X component, we call `_mm_cvtss_f32` to get a copy of the lowest element.

To obtain the Y and Z components, we first call `_mm_shuffle_ps` to shuffle the elements using a mask built with the `_MM_SHUFFLE` macro, set to 1 (for the Y component) or 2 (for the Z component).

```
inline float x() const { return _mm_cvtss_f32(m); }
inline float y() const { return _mm_cvtss_f32(_mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1))); }
inline float z() const { return _mm_cvtss_f32(_mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2))); }
```

⚠ Accessing the X/Y/Z components through an indexer on the `m128_f32` member is discouraged, as it is slower and causes memory spills.

```
inline float operator[] (size_t i) const { return m.m128_f32[i]; };
inline float& operator[] (size_t i) { return m.m128_f32[i]; };
```

To set the X component, we call `_mm_set_ss` and `_mm_move_ss` to define a new `__m128` value in which the X component is overridden and the Y and Z components are preserved.

To set the Y and Z components, we first call `_mm_shuffle_ps` to move the target element into the lowest lane of a temporary value, overwrite that lane with `_mm_move_ss`, and finally call `_mm_shuffle_ps` again to restore the element order.
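A minimal sketch of these setters (the function names are mine, not from a library); each returns a new value rather than writing through a component reference:

```cpp
#include <xmmintrin.h>

// X: _mm_set_ss builds [x, 0, 0, 0]; _mm_move_ss replaces only the
// lowest element, preserving Y/Z/W.
inline __m128 SetX(__m128 v, float x)
{
    return _mm_move_ss(v, _mm_set_ss(x));
}

// Y: swap the X and Y lanes, overwrite the lowest lane, swap back.
inline __m128 SetY(__m128 v, float y)
{
    __m128 t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 0, 1));
    t = _mm_move_ss(t, _mm_set_ss(y));
    return _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 2, 0, 1));
}

// Z: same pattern, swapping the X and Z lanes.
inline __m128 SetZ(__m128 v, float z)
{
    __m128 t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 0, 1, 2));
    t = _mm_move_ss(t, _mm_set_ss(z));
    return _mm_shuffle_ps(t, t, _MM_SHUFFLE(3, 0, 1, 2));
}
```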

⚠ Providing direct write accessors for the X/Y/Z components is discouraged, as it causes more loads/stores.

### Binary operations

Binary operations are implemented as dedicated functions that take another `__m128` value.

- Addition: `_mm_add_ps`

- Subtraction: `_mm_sub_ps`

- Multiplication: `_mm_mul_ps`

- Division: `_mm_div_ps`

```
inline Vector3 operator+ (Vector3 a, Vector3 b) { a.m = _mm_add_ps(a.m, b.m); return a; }
inline Vector3 operator- (Vector3 a, Vector3 b) { a.m = _mm_sub_ps(a.m, b.m); return a; }
inline Vector3 operator* (Vector3 a, Vector3 b) { a.m = _mm_mul_ps(a.m, b.m); return a; }
inline Vector3 operator/ (Vector3 a, Vector3 b) { a.m = _mm_div_ps(a.m, b.m); return a; }
```

Binary operations in which the other operand is a single floating-point value are implemented with the same functions, where the other `__m128` value is computed using `_mm_set1_ps`.

```
inline Vector3 operator* (Vector3 a, float b) { a.m = _mm_mul_ps(a.m, _mm_set1_ps(b)); return a; }
inline Vector3 operator/ (Vector3 a, float b) { a.m = _mm_div_ps(a.m, _mm_set1_ps(b)); return a; }
inline Vector3 operator* (float a, Vector3 b) { b.m = _mm_mul_ps(_mm_set1_ps(a), b.m); return b; }
inline Vector3 operator/ (float a, Vector3 b) { b.m = _mm_div_ps(_mm_set1_ps(a), b.m); return b; }
```

### Unary operations

To negate a vector, we compute the difference between the zero value and the current value. A `__m128` value with all elements set to zero is created with `_mm_setzero_ps`.

`inline Vector3 operator- (Vector3 value) { return Vector3(_mm_setzero_ps()) - value; }`
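A common alternative, not covered above, flips the sign bit of every element directly with a bitwise XOR against `-0.0f`, avoiding the subtraction (a sketch on the raw `__m128` value; the function name is mine):

```cpp
#include <xmmintrin.h>

// XOR with -0.0f (only the sign bit set, broadcast to every lane)
// negates all four elements in a single instruction.
inline __m128 Negate(__m128 v)
{
    return _mm_xor_ps(v, _mm_set1_ps(-0.0f));
}
```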

### Function calls

As math types such as vectors and matrices can be quite large, they are typically passed by reference.

`void Translate(const Vector3& value);`

When working with SIMD types, we want to use the `vectorcall` calling convention and pass the parameters by value. The values stay in the XMM0-XMM3 registers, and there is no copy.

`void Translate(Vector3 value);`

## NEON

# Portability

- Each platform supports a different instruction set.

- Each platform uses different types.

- Some platforms require SIMD types to be passed by value, others by reference.

Define an alias for each platform.

```
#if SIMD_SSE
using Simd128Type = __m128;
#elif SIMD_NEON
using Simd128Type = float32x4_t;
#elif SIMD_VMX128
using Simd128Type = __vector4;
...
#endif
```

Define another alias for each platform's argument-passing requirement.

```
#if SIMD_BYVALUE
using Simd128TypeArg = Simd128Type;
#elif SIMD_BYREF
using Simd128TypeArg = Simd128Type&;
#endif
```

Then, when defining the common interface for SIMD operations, we use these aliases.

`Simd128Type Add(Simd128TypeArg a, Simd128TypeArg b);`

Each platform's SIMD operations are implemented in separate header files.

For example, for the *add* operation with SSE:

```
namespace Simd
{
    inline Simd128Type Vector3Add(Simd128TypeArg vector1, Simd128TypeArg vector2)
    {
        return _mm_add_ps(vector1, vector2);
    }
}
```

The internal math structures are aligned and contain their corresponding SIMD type.

```
struct alignas(16) Vector3Type
{
    Simd128Type _v;
};
```

The math operations are implemented as wrappers.

```
inline void Vector3Add(Vector3Type* result, const Vector3Type& vector1, const Vector3Type& vector2)
{
    result->_v = Simd::Vector3Add(vector1._v, vector2._v);
}
```

The math types used in the game code include their corresponding internal math structure.

```
class Vector3
{
    Vector3Type _v;
    ...
};
```

When we implement the operations, we use the common interface.

```
inline Vector3 Vector3::operator+(const Vector3& value) const
{
    Vector3 result;
    Vector3Add(&result, *this, value);
    return result;
}
```

# Optimizations

## Partial Loads

On some platforms (such as VMX128), functions that initialize fewer than four components leave the remaining components uninitialized.

Other platforms (such as SSE) always initialize all four components by setting the remaining ones to 0.
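For example, with SSE a three-component load can zero the fourth lane explicitly (a sketch; the helper name is mine):

```cpp
#include <xmmintrin.h>

// Packs x, y, z into a __m128 and explicitly zeroes the unused fourth lane.
inline __m128 LoadFloat3(float x, float y, float z)
{
    return _mm_set_ps(0.0f, z, y, x);
}
```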

## Generic to SIMD

A naive approach is to replace the generic implementation of math types such as `Vector3` with SIMD registers.

```
struct Vector3
{
    SimdType _v;
};
```

However, SIMD is most efficient when every lane is doing useful work. Storing one 3-component vector per register:

- Wastes 25% of the throughput by leaving one lane unused in common operations.

- Requires a lot of swizzling in common algorithms (for example, in a cross product).

Instead, each `Vector3` component should be stored in a distinct SIMD register (all X components together, all Y components together, and so on), and operations should be performed on four objects at a time.

```
SimdVector3 inputs[4];
SimdVector3 outputs[4];
SimdOperation(inputs, outputs);
```
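As an illustrative sketch (the type and function names are hypothetical), a structure-of-arrays layout keeps each component of four vectors in one register, so a single `_mm_add_ps` per component adds four vectors at once with no wasted lane and no swizzling:

```cpp
#include <xmmintrin.h>

// Structure of arrays: one register per component, four vectors at a time.
struct SimdVector3
{
    __m128 x; // x0 x1 x2 x3
    __m128 y; // y0 y1 y2 y3
    __m128 z; // z0 z1 z2 z3
};

// Adds four pairs of vectors with three instructions; every lane is used.
inline SimdVector3 Add(const SimdVector3& a, const SimdVector3& b)
{
    return { _mm_add_ps(a.x, b.x), _mm_add_ps(a.y, b.y), _mm_add_ps(a.z, b.z) };
}
```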

## Alignment

SIMD types should be 16-byte aligned, using compiler directives or aligned memory allocators.

SIMD operations are more efficient with aligned types.
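A sketch of both approaches, assuming standard C++11 `alignas` and the `_mm_malloc`/`_mm_free` allocator from the intrinsics headers:

```cpp
#include <xmmintrin.h>
#include <cstddef>
#include <cstdint>

// Compiler directive: stack and static instances are 16-byte aligned.
struct alignas(16) AlignedVector3
{
    __m128 m;
};

// Aligned allocator for heap data; release with _mm_free.
inline float* AllocateFloats(std::size_t count)
{
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
}
```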

## Accessors and operators

Avoid accessing individual components through accessors or operator overloads.

Doing so requires moving data from the SIMD registers to the scalar registers and back again.

## Pointer parameters

Pointer-based operations are more efficient than their non-pointer equivalents because they load data directly from memory into the SIMD registers.
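The underlying SSE loads can be sketched as follows (the wrapper names are mine): `_mm_load_ps` requires a 16-byte-aligned address, while `_mm_loadu_ps` accepts any address at a potential performance cost.

```cpp
#include <xmmintrin.h>

// Load four packed floats straight from memory into a SIMD register.
inline __m128 LoadAligned(const float* p)   { return _mm_load_ps(p); } // p must be 16-byte aligned
inline __m128 LoadUnaligned(const float* p) { return _mm_loadu_ps(p); }
```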