Best Practices
Category | System
---|---
Overview
This document contains best practices that are mostly platform-independent and applicable to C++, C#, and other languages.
32-bit vs 64-bit Architectures
- Casting pointers to 32-bit integers loses information; casting back results in an incorrect pointer.
- Structures that contain pointers change in size.
- Structures that contain pointers may no longer pack the same way due to alignment rules.
Precision
Floating point comparisons should use a small epsilon to allow for minor variations in final floating point bit patterns as hardware changes.
The same calculation may produce different results depending on the CPU model or GPU model.
Memory
Throughput
Memory throughput depends on the hardware.
Factors that can affect memory throughput include:
- Contention from another core or another thread.
- Cache design.
- Page sizes.
Caches
- Work with contiguous structures (dynamic structures cause more cache misses).
- Pack and unpack data in registers (with specific SIMD instructions).
- Split hot and cold data structures to prevent loading unneeded data into the cache.
Branch prediction
Branch mispredictions incur a heavy penalty.
- Branches that are always taken or never taken are, by far, the fastest patterns. The worst branching pattern is alternating taken and not taken.
In very critical code, avoid:
- Virtual function calls to wide varieties of polymorphic classes from the same base class.
- Function pointer dereferences.
- Conditions based on effectively random data.
Optimizations
- Order the data so that all not taken items are processed first, followed by all taken.
- Arrange nested branches so that branches with more predictable branching behavior are the outer branches.
- Keep jumps to branch targets within the same memory page (usually 4 KB).
- Aim for two branches per cache line, both in the same high or low 32 bytes.
- Consider unrolling loops.
Multithreading
General
- Use CPUID to determine the system architecture including the number of cores.
- Place threads on specific cores (using thread affinity).
- Two threads working with the same data should be on the same module to avoid remote-cache latencies (especially on systems with a split L2 cache).
- Split work into as many logical threads as can be reasonably synchronized.
- When synchronization is not required, use a thread pool sized to the number of cores detected.
Cross-Core Memory Costs
- False sharing occurs when independent variables used by different cores fall within the same cache line.
- Avoid multiple cores making concurrent writes to the same cache line.
- Pad structures to a multiple of the cache-line size.
- Data structure: switch from using an array of structures (AoS) to a structure of arrays (SoA) when each thread is operating on a unique structure.
- Separate read-only data from read/write data.
- Avoid directly sharing variables across threads, such as a reference count embedded in a shared structure.
- Prefer a unique job queue for each thread instead of one job queue that is shared by multiple threads (use work stealing algorithm to help balancing the load between the threads).
Interlocked Operations
Interlocked operations can help prevent race conditions between CPU cores, but their cost varies drastically between CPUs.
Order
There is no guarantee on the order of execution between threads unless specifically enforced with synchronization primitives.
Lock Convoy
A lock convoy shows up when dependent pieces of work run on different threads, each blocking while it waits for the previous piece to complete. If one of the tasks takes longer than normal, the entire system stalls waiting on a lock.
```cpp
void TaskA()
{
    DoWorkA();
    SetEvent(WorkADone);
}

void TaskB()
{
    WaitFor(WorkADone);
    DoWorkB();
    SetEvent(WorkBDone);
}

void TaskC()
{
    WaitFor(WorkBDone);
    DoWorkC();
    SetEvent(WorkCDone);
}
```
In this case, task C was waiting on task B, which was waiting on task A. Task A ends up taking slightly longer to execute, which causes task B to suspend until A is finally done. This delay then cascades down to task C.
OS Locking Primitives
The cost for using an OS locking primitive is not always cheap if the code is required to transition to the kernel.
In a producer/consumer setup, if the consumer thread is constantly suspending while waiting for work, each task pays the added cost of a kernel transition.
The solution is to create larger tasks or to submit jobs in batches.
Locks
When locks are used too heavily, threads stall.
A change in timing for just one piece of code ends up affecting all other threads that are using the same locking primitive.
A common example is the lock inside the memory allocator.
The solution is to reduce the amount of locking and contention.
- Make sure that the use of locks is protecting the minimum amount of code possible.
- Prefer lockless implementations whenever possible.
- Use reader/writer locks properly where reads dominate writes.
- Prefer algorithms that treat input data as read-only. All threads can read the data simultaneously without needing the lock.
Testing
Applications should test against a variable number of processor cores. Testing against multiple machine configurations is possible by using SetProcessAffinityMask
to restrict the application to a reduced set of cores.
I/O
Treat all I/O requests as asynchronous operations.
The standard C library functions are blocking. They perform one read operation at a time, waiting for each read to finish before requesting another read.
Using overlapped I/O with the maximum number of requests possible allows the hardware to reorder the requests to match the underlying layout.
Video
GPU
The GPUs in various platforms have different speeds as well as a different number of compute units.
Avoid tight coupling between GPU and CPU work, where latency on either side can stall the other.
Resolution
Support multiple resolutions for desktop applications, as well as the ability to toggle between windowed and full-screen modes.
Dynamic Resolution
To present the best resolution available and to preserve frame rate, downgrade or upgrade the resolution automatically as the application becomes more or less GPU bound.
Audio
The minimum size for audio packets can be larger or smaller on different systems.
Attempts to synchronize gameplay to audio by making assumptions on the size of those packets can result in audio stutter.
The solution is to query the packet sizes from the audio APIs where possible, then adjust the packets sent to the audio hardware so the OS does not add latency by converting them to match the hardware.
Networking
Connection time-outs can vary drastically on different platforms.
Be sure to test all network calls under varying conditions.
Avoid testing solely on internal LAN networks, to make sure your title can handle different network topologies, bandwidths and latencies.
Also try simulating varying levels of packet loss and delay in your network system.