🔀

Multithreading

CategorySystem

Overview

The performance of single processor cores is now increasing very slowly.

However, the computing power available keeps growing as machines gain multiple processor cores.

Developers have to write multithreaded code in order to use this power.

Multi-threaded programming brings new design and programming challenges.

Goals

The simplest design goal for multithreading is to break up the code into independent pieces and restrict these pieces to communicating just a few times per frame.

Threads

Each program in the operating system is a process.

A process starts with a single thread, called the primary thread.

Additional threads can be created by the operating system.

All threads share the process's virtual address space and system resources.

Hardware thread

A hardware thread is a physical CPU core.

For example, in a quad-core CPU architecture, four hardware threads are supported at once and the CPU is really doing four things at the same time.

Each hardware thread has its own stack and set of registers.

🎮
On Xbox, each thread has a default stack size of 64 KB.

Software thread

A software thread is managed by the operating system's kernel.

A hardware thread can run many software threads.

The CPU time is divided among those threads.

Preemptive multitasking takes place on each processor.

Preemptive multitasking means that while a thread is executing, the kernel can interrupt it to schedule another thread (of equal or higher priority) for execution.

Each software thread maintains a set of structures the system will use to save the thread context until it is scheduled. The thread context includes the thread's stack and set of registers.

Scheduler

Each thread created by a process is assigned a scheduling priority.

The operating system's scheduler maintains a queue of threads for each priority level and assigns a time slice to each thread in turn.

Only the highest priority threads that are ready to run are scheduled.

If there are no threads in the highest priority class ready to run, the next lowest priority class is checked.

Simultaneous Multi-Threading

Simultaneous Multi-Threading (SMT) is a technique in which one CPU core runs two hardware threads.

SMT threads share the L1 instruction and data caches.

💡
SMT threads running different types of instructions cooperate better than SMT threads running the same instructions, as in the latter case they can end up fighting over the shared cache and causing many cache misses.

There should only be one CPU-intensive thread per CPU core. Developers can assign a software thread to a specific hardware thread by setting the proper thread affinity.
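For example, a minimal sketch on Windows that pins the calling thread to a given core with SetThreadAffinityMask (error handling omitted):

#include <windows.h>

void PinCurrentThreadToCore(DWORD core)
{
    // One bit per logical processor; set only the requested core's bit.
    DWORD_PTR mask = DWORD_PTR(1) << core;

    // Restrict the calling thread to that core.
    SetThreadAffinityMask(GetCurrentThread(), mask);
}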

💡
On Windows, the GetLogicalProcessorInformation function or the CPUID instruction can be used to return the CPU core topology.

Synchronization

On modern processors, reads and writes of naturally aligned native types are atomic.

However, there is no guarantee that the compiler will use an atomic instruction for increments or decrements.

When there is a shared resource like a global variable, the threads must make sure they share access to that resource.

class Object
{
private:
    int _counter;

public:
    void AddReference()
    {
        _counter++; // not atomic: compiles to a load, an increment, and a store
    }
};

If two threads happen to call AddReference at the same time, instead of both incrementing the _counter variable, a race condition could potentially occur.

The assembly for this simple operation would look like:

  1. Load _counter into register R.
  2. Increment register R.
  3. Store register R back into _counter.

The increment operation is actually divided into three operations, and when a second thread tries to increment the _counter variable, the first thread could be executing any of these three steps.

Assume the _counter variable is initially set to 1.

If thread 1 has not yet completed the third step, thread 2 will read the same initial value in the first step. Both threads would then increment _counter to 2. After the two threads have returned from AddReference, _counter has been set to 2 twice, instead of being incremented to 3.

Synchronization Primitives

For multiple threads to work together, they need to be synchronized in order to communicate and share resources.

Each platform provides a set of synchronization primitives.

Mutexes

Mutexes are used to protect a resource so that only one thread at a time can have access to that resource.

A thread waits on the mutex, and, when the wait succeeds, accesses the resource.

During this time, all other threads that wait on the mutex will be blocked and placed in a queue to obtain ownership of the resource.

When the first thread is finished accessing the resource, it releases the mutex. The next thread (if any) waiting on the mutex is then given access.

Critical Sections

Critical sections are similar to mutexes except that they are slightly faster and must be accessed with their own set of synchronization functions.

Semaphores

Semaphores are similar to mutexes in that they protect a resource, but they differ in that they allow a specified number of threads access to the resource before blocking additional threads requesting access.

For example, if only four threads were to be given access to a resource at a time, a semaphore with a count of four would be created. As each thread waited on the semaphore, the count would be decremented. When the fifth thread waited on the semaphore, it would block until one of the other threads currently using the resource finished doing so and signaled the semaphore.

Events

Events are used by threads to wait for certain conditions to occur.

A thread waiting on an event is blocked until that event is signaled by some other thread.

When an event is signaled, either a single thread or multiple threads waiting on that event are woken and allowed to execute.

Timers

Timers provide an object handle that will be signaled after a specified period of time has elapsed.
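For example, a minimal sketch on Windows: a manual-reset waitable timer that is signaled after one second (error handling omitted):

#include <windows.h>

HANDLE timer = CreateWaitableTimer(NULL, TRUE, NULL);

LARGE_INTEGER dueTime;
dueTime.QuadPart = -10000000LL; // negative means relative time, in 100-ns units (1 second)

SetWaitableTimer(timer, &dueTime, 0, NULL, NULL, FALSE);
WaitForSingleObject(timer, INFINITE); // blocks until the timer is signaled
CloseHandle(timer);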


Examples

Multiple threads can each have a handle to a mutex object that is used to control write access to a variable. Before accessing a shared resource, the threads must call one of the wait functions.

Critical sections and interlocked functions are the most lightweight and efficient synchronization primitives.

They provide a more efficient means of synchronization than mutexes.

Locking

When a thread needs exclusive access to a resource, it calls a locking function.

For example, using the C++11 mutex:

#include <mutex>

class Object
{
    std::mutex _mutex;
    int _counter;

public:
    void AddReference()
    {
        _mutex.lock();
        _counter++;
        _mutex.unlock();
    }
};

The _mutex variable is a mutex owned by the class.

Before incrementing the counter, the mutex is locked. Any other thread trying to lock the mutex will block.

After incrementing the counter, the mutex is unlocked. Only one other thread will be able to proceed and increment the counter again.
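In modern C++, a scope-based guard such as std::lock_guard is usually preferred over manual lock and unlock calls, since it releases the mutex even if an exception is thrown. A minimal sketch:

#include <mutex>

class Object
{
    std::mutex _mutex;
    int _counter;

public:
    void AddReference()
    {
        // Locks the mutex here; unlocks it automatically when the
        // guard goes out of scope, even on an exception.
        std::lock_guard<std::mutex> lock(_mutex);
        _counter++;
    }
};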

Atomics

A better alternative is to declare the counter as an atomic variable.

#include <atomic>

class Object
{
    std::atomic<int> _counter;

public:
    void AddReference()
    {
        _counter++;
    }
};

Another efficient implementation is to use platform-specific interlocked functions that can perform a load/increment/store as a single atomic operation.

On Windows and Xbox:

#include <windows.h>

class Object
{
    LONG _counter; // InterlockedIncrement operates on a LONG

public:
    void AddReference()
    {
        InterlockedIncrement(&_counter);
    }
};

On macOS and iOS:

#include <libkern/OSAtomic.h>

class Object
{
    int32_t _counter; // OSIncrementAtomic operates on an int32_t

public:
    void AddReference()
    {
        OSIncrementAtomic(&_counter);
    }
};
⚠️
Interlocked functions and critical section operations will take much longer if other threads are writing to the same cache line.
⚠️
On Windows, the interlocked functions are full-memory barriers — they effectively have a CPU memory barrier both before and after them, which means that they are a full read-acquire or write-release barrier all by themselves. On Xbox 360, the Acquire and Release versions of the interlocked functions should be used as they come with a memory barrier.

Volatile Variables and Reordering

In the C++ standard, the volatile keyword prevents the compiler from eliding or caching reads and writes of the qualified variable.

Volatile reads and writes cannot be reordered by the compiler.

However, the compiler can still reorder non-volatile reads and writes relative to volatile reads and writes.

Additionally, volatile does not prevent CPU reordering.

On Windows and Xbox, the Visual Studio C++ compiler will not rearrange any reads and writes past volatile variables. On Windows, the compiler also prevents the CPU from reordering reads and writes. On Xbox 360, the compiler does not.
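Under the Visual Studio C++ compiler's volatile semantics the two stores below stay in program order, but on a weakly ordered CPU such as the Xbox 360's, a hardware memory barrier would still be needed between them. A sketch of the classic hazard, with hypothetical g_data and g_ready variables:

int g_data = 0;
volatile bool g_ready = false;

void Producer()
{
    g_data = 42;
    g_ready = true; // the CPU may make this store visible before g_data
}

void Consumer()
{
    while (!g_ready) {}
    // On a weakly ordered CPU, g_data may not yet be visible as 42 here.
}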

Deadlock

A deadlock occurs when two threads each hold a lock that the other thread needs.

It can occur if two threads try to acquire the same two locks but acquire them in a different order.

For example, with two threads trying to access object A and B:

void Thread1Func()
{
    A->Lock();
    B->Lock();

    ...

    B->Unlock();
    A->Unlock();
}

void Thread2Func()
{
    B->Lock();
    A->Lock();

    ...

    A->Unlock();
    B->Unlock();
}

If thread 1 acquires object A while, at the same time, thread 2 acquires object B, thread 1 will wait indefinitely for object B and thread 2 will wait indefinitely for object A. This is an example of deadlock.
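One common fix is to always acquire the locks in the same global order in every thread. In C++17, std::scoped_lock can also acquire multiple mutexes using a deadlock-avoidance algorithm. A minimal sketch:

#include <mutex>

std::mutex mutexA;
std::mutex mutexB;

void Thread1Func()
{
    // Both mutexes are acquired together, deadlock-free, regardless of
    // the order in which they are listed.
    std::scoped_lock lock(mutexA, mutexB);
    // ...
}

void Thread2Func()
{
    std::scoped_lock lock(mutexB, mutexA);
    // ...
}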

Priority Inversion

The scenario in which a high-priority thread spins in a loop while waiting for a low-priority thread to release a lock is called priority inversion.

🎮
On Windows and macOS, the operating system's scheduler mitigates this problem by randomly boosting the priority of the low-priority thread.
On Xbox, the developer explicitly assigns software threads to hardware threads, and there is no protection against priority inversion.

Managing Threads

The proper way of handling a thread is to wait for it to terminate and then clean it up.

Child threads do not need as much stack as their parent thread. On Xbox, their stack size should be specified as a multiple of 64 KB.

Creating and destroying threads is an expensive operation. It is recommended to use a pool of threads that wait for work instead.

It is recommended to let a thread exit normally by returning from its function.

🎮
On Windows and Xbox One, a thread is created with CreateThread. In C++, it is not recommended to exit a thread with ExitThread as destructors will not be called.
💡
On Xbox 360, the software threads must be explicitly assigned to hardware threads with XSetThreadProcessor. Otherwise, they are created on the same hardware thread. On Windows, SetThreadAffinityMask can be used to suggest to the operating system which hardware threads to use, but this is not recommended as other processes are running on the system.
🎮
On Xbox 360, it is recommended to use the CRT _beginthreadex function to create a thread and properly initialize the C runtime. Similarly, a thread should either return normally or be exited with _endthreadex.
Threads should never be suspended manually as it can lead to deadlocks. On Windows and Xbox, SuspendThread should never be used. Threads should also never be terminated manually as they will not be properly cleaned up. On Windows, TerminateThread should never be used; on Xbox it is not available.

Fibers

A fiber is a unit of execution that must be manually scheduled by a program.

Fibers run in the context of the threads that schedule them.

Each thread can schedule multiple fibers.

A fiber has its own stack and a subset of the thread's registers.

Fibers are not preemptively scheduled.

A fiber can be terminated by any running fiber, including the currently executing fiber.
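On Windows, fibers are created and scheduled with the fiber API. A minimal sketch in which the main fiber runs a worker fiber once:

#include <windows.h>

void __stdcall FiberProc(void* mainFiber)
{
    // ... do some work ...

    // Fibers are cooperatively scheduled: explicitly switch back.
    SwitchToFiber(mainFiber);
}

void RunFiberOnce()
{
    // The calling thread must become a fiber before scheduling others.
    void* mainFiber = ConvertThreadToFiber(NULL);
    void* worker = CreateFiber(0, FiberProc, mainFiber);

    SwitchToFiber(worker); // runs FiberProc until it switches back
    DeleteFiber(worker);
    ConvertFiberToThread();
}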

Multithreading architectures

Pipeline

In a typical pipeline architecture, there are two threads working independently, for example an update thread and a render thread. The threads synchronize their execution using a semaphore. The result is at most twice as fast as a single-threaded architecture.

To share data structures, there are two methods:

🎮
Direct3D 9 and 11 use a pipelined architecture in which buffers need to be locked and unlocked (Direct3D 9: Lock and Unlock; Direct3D 11: Map and Unmap). If the GPU is still using a buffer, the runtime returns a pointer to a new region of memory instead of the old buffer data. In Direct3D 12, the application is responsible for pipelining data updates.

Contractor queue

In a contractor queue architecture, a primary thread acts as the task master and creates a queue of independent tasks. While the task master waits for the completion of the tasks, the worker threads each take a set of iterations and run over them in parallel. When the queue is empty, the worker threads go idle and control returns to the master thread.

The queue can be implemented with a lock.
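A minimal sketch of a lock-protected task queue, assuming tasks are plain std::function callbacks:

#include <functional>
#include <mutex>
#include <queue>

class TaskQueue
{
    std::mutex _mutex;
    std::queue<std::function<void()>> _tasks;

public:
    void Push(std::function<void()> task)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _tasks.push(std::move(task));
    }

    // Returns false when the queue is empty.
    bool Pop(std::function<void()>& task)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_tasks.empty())
            return false;
        task = std::move(_tasks.front());
        _tasks.pop();
        return true;
    }
};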

💡
The compute shaders executed by the GPU use a contractor queue architecture.

Work crew

In the work crew architecture, many threads work on separate tasks that interoperate.

For example, loading threads use a work crew architecture.

The threads are synchronized with events.

Synchronization Examples


Exclusive Access

Mutex

A mutex can be used to gain exclusive access to a resource.

The operating system guarantees that, for a particular mutex, only one thread at a time can acquire it.

💡
A mutex can be used to synchronize between multiple processes.

On Windows and Xbox:

  1. Create a mutex
    HANDLE mutex = CreateMutex(NULL, FALSE, NULL);
  2. Manipulate shared data
    WaitForSingleObject(mutex, INFINITE);
    ...
    ReleaseMutex(mutex);
  3. Destroy the mutex
    CloseHandle(mutex);

Critical section

⚠️
The main disadvantage to mutexes is that they are relatively expensive to acquire and release.

A faster alternative is a critical section.

💡
A critical section can be used to synchronize only within a process.

On Windows and Xbox:

  1. Create a critical section
    CRITICAL_SECTION cs;
    InitializeCriticalSection(&cs);
  2. Manipulate shared data
    EnterCriticalSection(&cs);
    ...
    LeaveCriticalSection(&cs);
  3. Destroy the critical section
    DeleteCriticalSection(&cs);

Events

Events can be used to share data across multiple threads when implementing a double-buffering or triple-buffering strategy.

For example, if the update thread and the render thread are taking turns using a pair of render description buffers, an event can be associated with each buffer to indicate when a thread is done with it.

On Windows and Xbox:

  1. Create an event for each buffer with CreateEvent.
    1. When a thread is done with a buffer, it uses SetEvent to signal it.
    2. The other thread calls WaitForSingleObject on that buffer's event.
  2. During cleanup, the events are destroyed with CloseHandle.
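A minimal sketch of the signaling for one buffer, assuming an auto-reset event (hypothetical names, error handling omitted):

#include <windows.h>

HANDLE bufferReady = CreateEvent(NULL, FALSE, FALSE, NULL); // auto-reset, initially unsignaled

// Update thread: fill the render description buffer, then signal it.
void UpdateThreadWork()
{
    // ... fill the buffer ...
    SetEvent(bufferReady);
}

// Render thread: block until the buffer is ready, then consume it.
void RenderThreadWork()
{
    WaitForSingleObject(bufferReady, INFINITE);
    // ... read the buffer ...
}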

Semaphores

A semaphore can be used to implement a work queue, controlling how many threads can be running at the same time.

One thread adds work to the queue, and whenever it adds a new item, it releases the semaphore to allow one worker thread to be released from the pool of waiting threads.

The worker threads wait on the semaphore, and when the wait returns, they know there is a work item in the queue for them.

Additionally, another synchronization technique must be used to guarantee safe access to the shared work queue.

On Windows and Xbox:

  1. The pool creates a semaphore with CreateSemaphore.
    1. The worker threads call WaitForSingleObject.
    2. When a worker thread's task is finished, it releases the semaphore with ReleaseSemaphore.
  2. During cleanup, the pool waits for all threads to terminate with WaitForMultipleObjects.
  3. The semaphore is destroyed with CloseHandle.
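A minimal sketch of the queue signaling, assuming the shared queue itself is protected separately (for example with a critical section):

#include <limits.h>
#include <windows.h>

HANDLE workAvailable = CreateSemaphore(NULL, 0, LONG_MAX, NULL);

// Producer: after pushing an item onto the shared queue, release the
// semaphore so exactly one waiting worker wakes up.
void PushWork()
{
    // ... push an item onto the shared queue ...
    ReleaseSemaphore(workAvailable, 1, NULL);
}

// Worker: block until at least one item is available.
void WorkerThread()
{
    for (;;)
    {
        WaitForSingleObject(workAvailable, INFINITE);
        // ... pop and process one item from the shared queue ...
    }
}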

Lockless Programming

Lockless programming is a way to safely share data between multiple threads without the cost of acquiring and releasing locks.

Interlocked functions can perform simple thread-safe operations without using locks.

The use of these functions is called lockless programming.

⚠️
Lockless algorithms are not guaranteed to be faster than algorithms that use locks.

Lockless Containers

Data can be safely shared between threads using lockless containers.

Code that executes on parallel threads can add and remove elements from lock-free containers simultaneously.

Lock-free Stack

Provides a thread-safe mechanism for adding and removing data elements from a last-in-first-out (LIFO) data structure.
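For example, Windows ships a ready-made lock-free LIFO, the interlocked singly linked list (SList). A minimal sketch:

#include <malloc.h>
#include <windows.h>

// Each entry must begin with an SLIST_ENTRY and be properly aligned.
struct alignas(MEMORY_ALLOCATION_ALIGNMENT) Node
{
    SLIST_ENTRY entry;
    int value;
};

SLIST_HEADER head;

void Example()
{
    InitializeSListHead(&head);

    Node* node = (Node*)_aligned_malloc(sizeof(Node), MEMORY_ALLOCATION_ALIGNMENT);
    node->value = 42;

    InterlockedPushEntrySList(&head, &node->entry);       // lock-free push
    SLIST_ENTRY* popped = InterlockedPopEntrySList(&head); // lock-free pop

    if (popped != NULL)
        _aligned_free(CONTAINING_RECORD(popped, Node, entry));
}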

Lock-free Queue

Provides a thread-safe mechanism for adding and removing data elements from a first-in-first-out (FIFO) data structure.

Lock-free Priority Queue

Provides a thread-safe mechanism for adding and removing data elements from a prioritized queue structure where the elements are sorted according to a key.

Lock-free Hash Table

Provides a thread-safe mechanism for adding and removing data elements from a hash table structure based on a hash key value.

Logging

The logging functions should be implemented using a thread-safe mechanism to perform physical writes to a log file on a specified hardware thread.

Best Practices

Synchronization

Scheduling Priorities

A typical strategy is to set background threads, particularly those that are processor intensive, to a low priority, to ensure that they can be preempted when necessary.

The waiting high-priority thread then uses a wait function, which lets the system's scheduler boost the dynamic priority of the background thread.

Thread Local Storage

Thread local storage (TLS) is an important way of avoiding synchronization.

Global variables can be defined as being per-thread. This gives each thread its own copy, which can be referenced safely and efficiently without requiring synchronization.

TLS Storage

Steps:

  • During the program initialization, one thread allocates a TLS index/key.

On Windows and Xbox, use TlsAlloc.

  • Each thread allocates storage and associates the index/key with a pointer to its storage.

On Windows and Xbox, use TlsSetValue.

On POSIX, use pthread_key_create and pthread_setspecific.

  • When a thread needs to access its storage, use the TLS index/key to retrieve the pointer.

On Windows and Xbox, use TlsGetValue.

On POSIX, use pthread_getspecific.

  • When a thread no longer needs the storage that it has associated with a TLS index/key, it must free that storage.

On Windows and Xbox, use TlsFree.

On POSIX, use pthread_key_delete.
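A minimal sketch of these steps on Windows, with a hypothetical per-thread int as the storage (error handling and TlsFree cleanup omitted):

#include <windows.h>

DWORD tlsIndex;

// Called once during program initialization.
void InitTls()
{
    tlsIndex = TlsAlloc();
}

// Called by each thread to set up its private storage.
void ThreadInit()
{
    TlsSetValue(tlsIndex, new int(0));
}

// Any code running on that thread can then retrieve the pointer.
int* GetThreadStorage()
{
    return static_cast<int*>(TlsGetValue(tlsIndex));
}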

TLS Variable

Alternatively, a TLS variable can be declared.

  • MSVC — the __declspec keyword with the thread attribute.
__declspec(thread) int tls = 1;
  • GCC and Clang — the __thread keyword.
__thread int tls = 1;
  • C++11 — the standard thread_local keyword.
thread_local int tls = 1;

Timing

The rdtsc instruction should not be used directly as it is not necessarily synchronized between CPUs.

Use the Windows QueryPerformanceCounter (QPC) API, or the C++ chrono library with a monotonic high-resolution clock, instead.
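A minimal sketch using the C++ chrono library: steady_clock is monotonic and suitable for measuring elapsed time.

#include <chrono>

auto start = std::chrono::steady_clock::now();
// ... work to be timed ...
auto end = std::chrono::steady_clock::now();

auto elapsedMs =
    std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();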

Debugging

A well-balanced multithreaded system will reduce long stalls where threads are waiting on each other.

Some multithreaded bugs show up only rarely, making them difficult to find and fix.

Use Cases

Loading

Data is typically compressed, which helps reduce the size on disk but increases the workload at load time.

A file loading thread that handles decompression should use asynchronous file I/O to process load requests from the update thread.

The loading thread should perform large reads in order to maximize data-reading efficiency.

Rendering

The rendering can be put on a separate thread.

The simplest strategy is for the rendering thread to handle rendering operations so that the update thread has less workload.

A more complex strategy is for the update thread to prepare render buffers and render commands that the rendering thread processes. The update thread is one frame ahead of the render thread, which means that it takes two frames before user actions appear on the screen.

Since the update and rendering threads work in lockstep with each other, their communication buffers can simply be double buffered: the update thread fills one buffer while the render thread reads from the other.
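A minimal sketch of the double-buffered handoff, with hypothetical names (the event-based synchronization is omitted):

struct RenderBuffer
{
    // ... render commands and state for one frame ...
};

RenderBuffer buffers[2];
int updateIndex = 0; // buffer currently being filled by the update thread

void SwapBuffers()
{
    // At the frame boundary the threads trade buffers: the render thread
    // consumes the buffer just filled while the update thread fills the other.
    updateIndex = 1 - updateIndex;
}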

On Windows using Direct3D, the D3DCREATE_MULTITHREADED flag allows rendering on one thread and resource creation on other threads, while Direct3D handles the synchronization. It should not be used, as it degrades the performance of the rendering thread.

Physics and Animation

Physics and animation cannot be put onto a separate thread because the update thread requires the results of the calculations immediately.

Instead, the calculations are handled as jobs on multiple threads.

Careful design of data structures and synchronization points is important.

Optional Simulations

Some simulations can be handled by the job system, such as procedural geometry (clouds, ...), cloth, hair, and fluid simulations.

They typically don't require complex synchronization with the update thread.

They do not need to run at a fixed frame rate.

Their complexity can be scaled down on lower systems.

Worker threads

There are two designs to implement worker threads.

  1. Each worker thread pulls assignments from its own work queue as quickly as it can in a loop. Because the data goes from the update thread to the worker threads and then to the render thread, there can be a delay of three or more frames before some actions appear on the screen.
  2. Several worker threads pull assignments from the same work queue.