Best Practices



This document contains best practices that are mostly platform-independent and applicable to C++, C#, and other languages.

32-bit vs 64-bit Architectures


Floating point comparisons should use a small epsilon to allow for minor variations in final floating point bit patterns as hardware changes.

The same calculation may produce different results depending on the CPU model or GPU model.
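A tolerant comparison can be sketched as follows. The helper name NearlyEqual and both default tolerances are illustrative assumptions, not values taken from this document; suitable tolerances depend on the magnitude and precision of the data being compared.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: returns true when a and b differ by no more than a
// tolerance scaled to their magnitude, with an absolute floor near zero.
// The default tolerances are illustrative assumptions.
bool NearlyEqual(float a, float b,
                 float relEpsilon = 1e-5f,
                 float absEpsilon = 1e-8f)
{
    assert(relEpsilon >= 0.0f && absEpsilon >= 0.0f);
    const float diff = std::fabs(a - b);
    if (diff <= absEpsilon) {
        return true;  // covers values that are both very close to zero
    }
    const float largest = std::max(std::fabs(a), std::fabs(b));
    return diff <= largest * relEpsilon;
}
```

A call such as NearlyEqual(computed, expected) then replaces a direct == test, so results that differ only in their last few bits still compare as equal.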



Memory throughput depends on the hardware.

Some of the things that can impact the throughput of memory include:


Branch prediction

Branch mispredictions incur a heavy penalty.

In very critical code, avoid:




Cross-Core Memory Costs

Interlocked Operations

Interlocked operations can help prevent race conditions between CPU cores, but their cost can differ drastically between CPUs.
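In portable C++, std::atomic provides read-modify-write operations comparable to the platform's interlocked functions. The sketch below uses it as a stand-in; the function name CountWithAtomics and its parameters are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical example: several threads increment one shared counter.
// fetch_add is an atomic read-modify-write, so no increment is lost.
int CountWithAtomics(int threadCount, int incrementsPerThread)
{
    assert(threadCount > 0 && incrementsPerThread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&counter, incrementsPerThread] {
            for (int i = 0; i < incrementsPerThread; ++i) {
                // Relaxed ordering is enough here: only the count matters,
                // and join() below makes the final value visible.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter.load();
}
```

Every increment contends for the same cache line, so the cost of this loop is a reasonable thing to measure when comparing CPUs.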


There is no guarantee on the order of execution between threads unless it is specifically enforced with synchronization primitives.
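One way to enforce such an ordering in standard C++ is a release store paired with an acquire load. This is a minimal sketch, and the function name PublishAndRead is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: the producer writes plain data, then publishes it
// with a release store. The consumer's acquire load guarantees that once it
// sees the flag, it also sees the data written before the flag was set.
int PublishAndRead(int value)
{
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};

    std::thread producer([&payload, &ready, value] {
        payload = value;                                 // 1. write the data
        ready.store(true, std::memory_order_release);    // 2. publish it
    });

    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();    // wait until the data is published
    }
    const int seen = payload;         // safe: ordered after the acquire load
    producer.join();
    assert(seen == value);
    return seen;
}
```

Without the release and acquire pair, the consumer would have no guarantee of observing the payload write, even after seeing the flag.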

Lock Convoy

A lock convoy shows up when pieces of work run on different threads that block while waiting for dependent pieces of work to complete. If one of the tasks happens to take longer than normal, the entire system stalls waiting for a lock.

void TaskA()

void TaskB()

void TaskC()

In this case, task C was waiting on task B, which was waiting on task A. Task A ends up taking slightly longer to execute, which causes task B to suspend until A is finally done. This delay then cascades down to task C.
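The cascade can be reproduced with a small sketch. The three lambdas below are hypothetical stand-ins for tasks A, B, and C, not the original task bodies, and std::async is used only to keep the example short.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical sketch: B blocks on A and C blocks on B, so any extra time
// spent in A is paid by every task queued behind it.
// Returns the time until C completes.
std::chrono::milliseconds ChainLatency(std::chrono::milliseconds delayInA)
{
    assert(delayInA.count() >= 0);
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();

    // Task A: simulates work that takes longer than expected.
    auto a = std::async(std::launch::async, [delayInA] {
        std::this_thread::sleep_for(delayInA);
    });
    // Task B: cannot proceed until A has finished.
    auto b = std::async(std::launch::async, [&a] { a.wait(); });
    // Task C: cannot proceed until B has finished.
    auto c = std::async(std::launch::async, [&b] { b.wait(); });

    c.wait();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        Clock::now() - start);
}
```

Although B and C do no work of their own here, C never finishes sooner than the delay injected into A.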

OS Locking Primitives

Using an OS locking primitive is not always cheap, particularly when the code is required to transition to the kernel.

Consider a creator job that feeds a consumer job. If the consumer thread constantly suspends while waiting for work, the cost of each task grows by the cost of the transition to the kernel.

The solution to the problem is to create larger tasks or to create a batch of jobs.
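Batching can be sketched as a queue that the consumer drains in a single step, so that it pays for one lock and at most one wake-up per batch rather than per job. The class name BatchQueue and the use of int as the job type are assumptions made for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <vector>

// Hypothetical batch queue: the creator submits several jobs at once and
// the consumer takes everything that is pending in one operation.
class BatchQueue {
public:
    void PushBatch(const std::vector<int>& jobs)
    {
        assert(!jobs.empty());
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.insert(pending_.end(), jobs.begin(), jobs.end());
        }
        wake_.notify_one();  // one notification for the whole batch
    }

    // Blocks until at least one job is pending, then returns all of them.
    std::vector<int> PopAll()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        wake_.wait(lock, [this] { return !pending_.empty(); });
        std::vector<int> batch;
        batch.swap(pending_);  // O(1) hand-off while the lock is held
        return batch;
    }

private:
    std::mutex mutex_;
    std::condition_variable wake_;
    std::vector<int> pending_;
};
```

The consumer processes the returned batch outside the lock, so the creator is never blocked while jobs are being executed.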


When locks are used too heavily, threads stall.

A change in timing for just one piece of code ends up affecting all other threads that are using the same locking primitive.

A common example is memory allocation, where threads can contend for a lock inside the allocator.

The solution is to reduce the amount of locking and contention.
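One way to reduce contention is to let each thread accumulate into a private value and take the shared lock only once, when it merges its result. This is a minimal sketch; the function name SumWithLocalTotals and the even split of the input are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example: each worker sums its own slice without locking and
// acquires the mutex exactly once to merge, instead of once per element.
long long SumWithLocalTotals(const std::vector<int>& values,
                             unsigned threadCount)
{
    assert(threadCount > 0);
    long long total = 0;
    std::mutex totalMutex;
    std::vector<std::thread> workers;
    const std::size_t chunk =
        (values.size() + threadCount - 1) / threadCount;

    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(values.size(), t * chunk);
        const std::size_t end = std::min(values.size(), begin + chunk);
        workers.emplace_back([&, begin, end] {
            long long local = 0;  // private to this thread, no lock needed
            for (std::size_t i = begin; i < end; ++i) {
                local += values[i];
            }
            std::lock_guard<std::mutex> lock(totalMutex);
            total += local;       // the only contended operation
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return total;
}
```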


Applications should test against a variable number of processor cores. Testing against multiple machine configurations is possible by using SetProcessAffinityMask to restrict the application to a reduced set of cores.
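The following sketch restricts the current process to its first few logical cores. SetProcessAffinityMask and GetCurrentProcess are Win32 calls; the helper names MaskForFirstCores and RestrictToFirstCores are hypothetical, and the sketch assumes the machine has no more than 64 logical processors in a single processor group.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: builds a mask with the lowest coreCount bits set.
std::uint64_t MaskForFirstCores(unsigned coreCount)
{
    assert(coreCount >= 1 && coreCount <= 64);
    return coreCount == 64 ? ~0ull : ((1ull << coreCount) - 1);
}

// Hypothetical helper: limits the current process to the first coreCount
// logical cores. Returns false if the call fails, for example when the mask
// names cores that are not available to the process, or on non-Windows builds.
bool RestrictToFirstCores(unsigned coreCount)
{
#ifdef _WIN32
    const DWORD_PTR mask =
        static_cast<DWORD_PTR>(MaskForFirstCores(coreCount));
    return SetProcessAffinityMask(GetCurrentProcess(), mask) != 0;
#else
    (void)coreCount;
    return false;  // this sketch only covers the Win32 API
#endif
}
```

Calling RestrictToFirstCores(2) at startup, for instance from a test-only command-line switch, approximates a dual-core machine on a development PC.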


Treat all I/O requests as asynchronous operations.

The standard C library I/O functions are blocking. They perform one read operation at a time, waiting for each read to finish before requesting the next.

Using overlapped I/O with the maximum number of requests possible allows the hardware to reorder the requests to match the underlying layout.



The GPUs in various platforms have different speeds as well as a different number of compute units.

Avoid tightly coupling GPU and CPU work in ways that are sensitive to latency.


Support multiple resolutions for desktop applications, as well as the ability to toggle between windowed and full-screen modes.

Dynamic Resolution

To present the best resolution available and to preserve frame rate, downgrade or upgrade the resolution automatically as the application becomes more or less GPU bound.
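A simple controller for this adjustment might look like the sketch below. The function name NextResolutionScale, the scale limits, the step size, and the headroom threshold are all illustrative assumptions that a real title would tune.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical controller: lowers the render scale when the GPU misses its
// frame budget and raises it again when there is clear headroom.
// All constants are illustrative assumptions.
float NextResolutionScale(float currentScale,
                          float gpuFrameMs,
                          float targetFrameMs)
{
    assert(targetFrameMs > 0.0f);
    const float kMinScale = 0.5f;   // never render below half resolution
    const float kMaxScale = 1.0f;   // native resolution
    const float kStep = 0.05f;      // change per adjustment
    const float kHeadroom = 0.85f;  // scale up only when well under budget

    float next = currentScale;
    if (gpuFrameMs > targetFrameMs) {
        next -= kStep;              // GPU bound: reduce resolution
    } else if (gpuFrameMs < kHeadroom * targetFrameMs) {
        next += kStep;              // spare GPU time: increase resolution
    }
    return std::min(kMaxScale, std::max(kMinScale, next));
}
```

The gap between the two thresholds keeps the scale from oscillating when the GPU time sits close to the budget.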


The minimum size for audio packets can be larger or smaller on different systems.

Attempts to synchronize gameplay to audio that assume a particular packet size can result in audio stutter.

The solution is to query the packet sizes from the audio APIs where possible and then adjust the packets sent to the audio hardware to avoid extra latency caused by the OS having to convert the packets to match the hardware.


Connection time-outs can vary drastically on different platforms.

Be sure to test all network calls under varying conditions.

Avoid testing solely on an internal LAN, so that you can confirm your title handles different network topologies, bandwidths, and latencies.

Also try simulating varying levels of packet loss and delay in your network system.
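Packet loss and delay can be simulated in tests with a thin layer in front of the real send path. The class name LossyLink and its interface are hypothetical; a fixed seed keeps test runs repeatable.

```cpp
#include <cassert>
#include <random>

// Hypothetical test helper: decides, per packet, whether it is dropped and
// how long it should be held before being passed to the real transport.
class LossyLink {
public:
    LossyLink(double lossRate, int minDelayMs, int maxDelayMs, unsigned seed)
        : rng_(seed), drop_(lossRate), delay_(minDelayMs, maxDelayMs)
    {
        assert(lossRate >= 0.0 && lossRate <= 1.0);
        assert(minDelayMs >= 0 && minDelayMs <= maxDelayMs);
    }

    // Returns -1 if the packet should be dropped; otherwise returns the
    // simulated delay in milliseconds to apply before sending it.
    int Send()
    {
        if (drop_(rng_)) {
            return -1;
        }
        return delay_(rng_);
    }

private:
    std::mt19937 rng_;
    std::bernoulli_distribution drop_;
    std::uniform_int_distribution<int> delay_;
};
```

Running the same test suite with several loss rates and delay ranges helps expose time-out and retry logic that only works on a fast local network.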