Multithreading
Category | System
---|---
Overview
The performance of single processor cores is now increasing very slowly.
However, the total computing power available keeps growing because machines now have multiple processor cores.
Developers have to write multithreaded code in order to use this power.
Multi-threaded programming brings new design and programming challenges.
Goals
The simplest design goal for multithreading is to break up the code into independent pieces and restrict these pieces to communicating just a few times per frame.
Threads
Each program in the operating system is a process.
A process starts with a single thread, called the primary thread.
Additional threads can be created by the process through operating system calls.
All threads share the process's virtual address space and system resources.
Hardware thread
A hardware thread is a physical core of the CPU.
For example, in a quad-core CPU architecture, four hardware threads are supported at once and the CPU is really doing four things at the same time.
Each hardware thread has its own stack and set of registers.
Software thread
A software thread is managed by the operating system's kernel.
A hardware thread can run many software threads.
The CPU time is divided among those threads.
Preemptive multitasking takes place on each processor.
Preemptive multitasking means that while a thread is executing, the kernel can interrupt it to schedule another thread (of equal or higher priority) for execution.
Each software thread maintains a set of structures the system will use to save the thread context until it is scheduled. The thread context includes the thread's stack and set of registers.
Scheduler
Each thread created by a process is assigned a scheduling priority.
The operating system's scheduler maintains a queue of threads for each priority level and assigns a time slice to each thread in turn.
Only the highest priority threads that are ready to run are scheduled.
If there are no threads in the highest priority class ready to run, the next lowest priority class is checked.
Simultaneous Multi-Threading
Simultaneous Multi-Threading (SMT) is a technique in which two hardware threads run on one CPU core.
SMT threads share the L1 instruction and data caches.
There should only be one CPU-intensive thread per CPU core. Developers can assign a software thread to a specific hardware thread by setting the proper thread affinity.
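For example, on Windows a software thread can be pinned to a particular hardware thread with SetThreadAffinityMask; a minimal sketch (the mask value is only an example and should be chosen from the target CPU topology):

```cpp
#include <windows.h>

// Pin the calling thread to hardware thread 2 (bit 2 of the affinity mask).
void PinCurrentThreadToCore2()
{
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << 2);
}
```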
Synchronization
On modern processors, reads and writes of naturally aligned native types are atomic.
However, there is no guarantee that the compiler will use an atomic instruction for increments or decrements.
When there is a shared resource like a global variable, the threads must make sure they share access to that resource.
class Object
{
private:
    int _counter;

public:
    void AddReference()
    {
        _counter++;
    }
};
If two threads happen to call AddReference at the same time, instead of both incrementing the _counter variable, a race condition could potentially occur.
The assembly for this simple operation would look like:
- Load _counter into register R.
- Increment register R.
- Store register R back into _counter.
The increment operation is actually divided into three steps, and when a second thread tries to increment the _counter variable, the first thread could be executing any of these three steps.
Assume the _counter variable is initially set to 1.
If thread 1 hasn't yet completed the third step, then thread 2 will read the same initial value in the first step. Both threads would then increment _counter to 2. After the two threads have returned from AddReference, the _counter variable has been updated to 2 twice, instead of holding 3.
Synchronization Primitives
For multiple threads to work together, they need to be synchronized in order to communicate and share resources.
Each platform provides a set of synchronization primitives.
- Mutexes
- Critical Sections
- Semaphores
- Events
- Timers
Mutexes
Mutexes are used to protect a resource so that only one thread at a time can have access to that resource.
A thread waits on the mutex, and, when the wait succeeds, accesses the resource.
During this time, all other threads that wait on the mutex will be blocked and placed in a queue to obtain ownership of the resource.
When the first thread is finished accessing the resource, it releases the mutex. The next thread (if any) waiting on the mutex is then given access.
Critical Sections
Critical sections are similar to mutexes except that they are faster (they avoid a kernel transition when there is no contention), can only be used between the threads of a single process, and must be accessed with their own set of synchronization functions.
Semaphores
Semaphores are similar to mutexes in that they protect a resource, but they differ in that they allow a specified number of threads access to the resource before blocking additional threads requesting access.
For example, if only four threads were to be given access to a resource at a time, a semaphore with a count of four would be created. As each thread waited on the semaphore, the count would be decremented. When the fifth thread waited on the semaphore, it would block until one of the other threads currently using the resource finished doing so and signaled the semaphore.
Events
Events are used by threads to wait for certain events to occur.
A thread waiting on an event is blocked until that event is signaled by some other thread.
When an event is signaled, either a single thread or multiple threads waiting on that event are woken and allowed to execute.
Timers
Timers provide an object handle that will be signaled after a specified period of time has elapsed.
Examples
Multiple threads can each have a handle to a mutex object that is used to control write access to a variable. Before accessing a shared resource, the threads must call one of the wait functions.
Critical sections and interlocked functions are the most lightweight and efficient synchronization primitives.
They provide a more efficient means of synchronization than mutexes.
- A thread can request ownership of a critical section without blocking if the critical section cannot be obtained immediately.
- An interlocked function can perform simple operations atomically without requiring a lock.
Locking
When a thread needs exclusive access to a resource, it calls a locking function.
- If the object is already locked, the thread will stop running until the resource becomes available.
- If the object isn't already locked, the thread will acquire the lock to stop other threads from accessing the resource.
- When the thread is done with the resource, it releases the lock and continues.
For example, using the C++11 std::mutex:
#include <mutex>

class Object
{
    std::mutex _mutex;
    int _counter;

public:
    void AddReference()
    {
        _mutex.lock();
        _counter++;
        _mutex.unlock();
    }
};
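In practice it is safer to let a scope guard release the mutex so the lock cannot be leaked if the protected code returns early or throws. A minimal sketch using std::lock_guard:

```cpp
#include <mutex>

class Object
{
    std::mutex _mutex;
    int _counter = 0;

public:
    void AddReference()
    {
        // Locked here, automatically unlocked when the guard goes out of scope.
        std::lock_guard<std::mutex> lock(_mutex);
        _counter++;
    }
};
```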
The _mutex variable is a mutex owned by the class.
Before incrementing the counter, the mutex is locked; any other thread trying to lock the mutex is blocked.
After incrementing the counter, the mutex is unlocked, and a single waiting thread is then able to proceed and increment the counter in turn.
Atomics
A better alternative is to declare the counter as an atomic variable.
#include <atomic>

class Object
{
    std::atomic<int> _counter;

public:
    void AddReference()
    {
        _counter++;
    }
};
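If the counter is only used for reference counting and the increment does not publish any other data, the operation can also be written explicitly with fetch_add; a sketch (the relaxed memory order is an assumption about how the counter is used):

```cpp
#include <atomic>

std::atomic<int> counter{0};

void AddReference()
{
    // fetch_add returns the previous value; relaxed ordering is enough when
    // the increment is not used to synchronize access to other data.
    counter.fetch_add(1, std::memory_order_relaxed);
}
```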
Another efficient implementation is to use platform-specific interlocked functions that perform the load/increment/store as a single atomic operation.
On Windows and Xbox:
#include <windows.h>

class Object
{
    LONG _counter;

public:
    void AddReference()
    {
        InterlockedIncrement(&_counter);
    }
};
On macOS and iOS:
#include <libkern/OSAtomic.h>

class Object
{
    int _counter;

public:
    void AddReference()
    {
        OSIncrementAtomic(&_counter);
    }
};
Volatile Variables and Reordering
In the C++ standard, the volatile keyword prevents the compiler from optimizing away or caching accesses to a variable.
Volatile reads and writes cannot be reordered by the compiler relative to each other.
However, the compiler can still reorder non-volatile reads and writes relative to volatile reads and writes.
Additionally, volatile does not prevent CPU reordering, so it is not a substitute for atomics or other synchronization primitives.
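When ordering does matter, C++11 atomics constrain both compiler and CPU reordering. A minimal sketch of a release/acquire handoff between two threads (the variable names are only illustrative):

```cpp
#include <atomic>

int payload = 0;                  // ordinary data shared between threads
std::atomic<bool> ready{false};   // flag used to publish the data

void Producer()
{
    payload = 42;
    // The release store guarantees the write to payload is visible
    // before any thread observes ready == true.
    ready.store(true, std::memory_order_release);
}

void Consumer()
{
    // The acquire load pairs with the release store above.
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    // At this point payload is guaranteed to be 42.
}
```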
Deadlock
A deadlock occurs when two threads each hold a lock that the other one needs, so neither can make progress.
It typically happens when two threads try to acquire the same two locks but acquire them in a different order.
For example, with two threads trying to access object A and B:
void Thread1Func()
{
    A->Lock();
    B->Lock();
    ...
    B->Unlock();
    A->Unlock();
}

void Thread2Func()
{
    B->Lock();
    A->Lock();
    ...
    A->Unlock();
    B->Unlock();
}
If thread 1 acquires object A while, at the same time, thread 2 acquires object B, then thread 1 will wait indefinitely for object B and thread 2 will wait indefinitely for object A. This is an example of deadlock.
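The standard fix is to make every thread acquire the locks in the same order. In C++17, std::scoped_lock (or std::lock in C++11) can acquire several mutexes at once with a built-in deadlock-avoidance algorithm; a minimal sketch, assuming A and B are each protected by a std::mutex:

```cpp
#include <mutex>

std::mutex mutexA;
std::mutex mutexB;

void Thread1Func()
{
    // Locks both mutexes without deadlocking, regardless of listing order.
    std::scoped_lock lock(mutexA, mutexB);
    // ... use A and B ...
}

void Thread2Func()
{
    std::scoped_lock lock(mutexB, mutexA);
    // ... use A and B ...
}
```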
Priority Inversion
Priority inversion is the scenario in which a high-priority thread is kept waiting (for example, spinning in a loop) on a lock held by a low-priority thread that may not get scheduled to release it.
Managing Threads
The proper way of handling a thread is to wait for it to terminate and then clean it up.
Child threads do not need as much stack as their parent thread. On Xbox, their stack size should be specified as a multiple of 64 KB.
Creating and destroying threads is an expensive operation. It is recommended to use a pool of threads that wait around for work instead.
It is recommended to let a thread exit normally by returning from its function.
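For example, with the C++11 std::thread API, a minimal create/wait/clean-up sketch looks like this (the worker function is only illustrative):

```cpp
#include <thread>

void WorkerMain()
{
    // ... do the thread's work, then return to let the thread exit normally ...
}

int main()
{
    std::thread worker(WorkerMain);
    worker.join();   // wait for the thread to terminate; resources are then released
    return 0;
}
```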
Fibers
A fiber is a unit of execution that must be manually scheduled by a program.
Fibers run in the context of the threads that schedule them.
Each thread can schedule multiple fibers.
A fiber has its own stack and a subset of the thread's registers.
Fibers are not preemptively scheduled.
A fiber can be terminated by any running fiber, including the currently executing fiber.
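On Windows and Xbox, the fiber API follows this pattern; a minimal sketch:

```cpp
#include <windows.h>

LPVOID mainFiber = nullptr;

void WINAPI FiberFunc(LPVOID /*param*/)
{
    // ... do some work ...
    // A fiber must switch away rather than return from its function.
    SwitchToFiber(mainFiber);
}

void RunFiberOnce()
{
    // The calling thread must become a fiber before it can schedule others.
    mainFiber = ConvertThreadToFiber(nullptr);
    LPVOID fiber = CreateFiber(0, FiberFunc, nullptr); // 0 = default stack size
    SwitchToFiber(fiber);        // manually schedule the fiber
    DeleteFiber(fiber);          // terminate it once it has switched back
    ConvertFiberToThread();      // return the thread to normal scheduling
}
```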
Multithreading architectures
Pipeline
In a typical pipeline architecture, there are two threads working independently, for example an Update thread and a Render thread. The threads synchronize their execution using a semaphore. The gain is at most a factor of two over a single-threaded architecture.
To share data structures, there are two methods:
- Data structures need to be locked before being written to or read from.
- Data structures can be double-buffered or triple-buffered to minimize the amount of time spent during a lock.
Contractor queue
In a contractor queue architecture, a primary thread acts as the task master and creates a queue of independent tasks. While the task master waits for the completion of the tasks, the worker threads each take a set of iterations and run over them in parallel. When the queue is empty, the workers all join back into the master thread.
The queue can be implemented with a lock.
Work crew
In the work crew architecture, many threads work on independent tasks that all interoperate.
For example, loading threads use a work crew architecture.
The threads are synchronized with events.
Synchronization Examples
For multiple threads to work together, they need to be synchronized in order to communicate and share resources.
Each platform provides a set of synchronization primitives.
Exclusive Access
Mutex
A mutex can be used to gain exclusive access to a resource.
The operating system guarantees that, for a particular mutex, only one thread at a time can acquire it.
On Windows and Xbox:
- Create a mutex
HANDLE mutex = CreateMutex(0, FALSE, 0);
- Manipulate shared data
WaitForSingleObject(mutex, INFINITE); ... ReleaseMutex(mutex);
- Destroy the mutex
CloseHandle(mutex);
Critical section
A faster alternative is a critical section.
On Windows and Xbox:
- Create a critical section
CRITICAL_SECTION cs; InitializeCriticalSection(&cs);
- Manipulate shared data
EnterCriticalSection(&cs); ... LeaveCriticalSection(&cs);
- Destroy the critical section
DeleteCriticalSection(&cs);
Events
Events can be used to share data across multiple threads when implementing a double-buffering or triple-buffering strategy.
For example, if the update thread and the render thread are taking turns using a pair of render description buffers, an event can be associated with each buffer to indicate when a thread is done with it.
On Windows and Xbox:
- Create an event with CreateEvent.
- When a thread is done with a buffer, it can use SetEvent to signal it.
- The thread can call WaitForSingleObject on the other buffer's event.
- During cleanup, the event is destroyed with CloseHandle.
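A minimal sketch of this handoff, assuming one auto-reset event per buffer (a complete implementation would also need events in the other direction to tell the render thread that a buffer is ready to be consumed):

```cpp
#include <windows.h>

// One auto-reset event per buffer; signaled means "the buffer is free to fill".
HANDLE bufferFree[2];

void InitEvents()
{
    bufferFree[0] = CreateEvent(NULL, FALSE, TRUE, NULL); // starts signaled
    bufferFree[1] = CreateEvent(NULL, FALSE, TRUE, NULL);
}

void UpdateThreadFillBuffer(int index)
{
    // Wait until the render thread is done with this buffer.
    WaitForSingleObject(bufferFree[index], INFINITE);
    // ... write render descriptions into buffer[index] ...
}

void RenderThreadReleaseBuffer(int index)
{
    // ... read render descriptions from buffer[index] ...
    SetEvent(bufferFree[index]); // hand the buffer back to the update thread
}

void CleanupEvents()
{
    CloseHandle(bufferFree[0]);
    CloseHandle(bufferFree[1]);
}
```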
Semaphores
A semaphore is used to implement work queues to control how many threads can be running at the same time.
One thread adds work to a queue, and whenever it adds a new item to the queue, it releases the semaphore to allow one worker thread to be released from the pool of waiting threads.
The worker threads wait on the semaphore, and when the wait returns they know there is a work item in the queue for them.
Additionally, another synchronization technique must be used to guarantee safe access to the shared work queue.
On Windows and Xbox:
- The pool creates a semaphore with CreateSemaphore.
- The worker threads call WaitForSingleObject on the semaphore.
- Whenever work is added to the queue, the semaphore is released with ReleaseSemaphore so that one worker wakes up.
- During cleanup, the pool waits for all threads to terminate with WaitForMultipleObjects.
- The semaphore is destroyed with CloseHandle.
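A minimal sketch of such a work queue on Windows, with a critical section guarding the container (all names are invented for illustration):

```cpp
#include <windows.h>
#include <climits>
#include <deque>

CRITICAL_SECTION queueLock;
std::deque<int> workQueue;   // the work items; int is used only for illustration
HANDLE workAvailable;        // counting semaphore, one count per queued item

void InitQueue()
{
    InitializeCriticalSection(&queueLock);
    workAvailable = CreateSemaphore(NULL, 0, LONG_MAX, NULL); // starts empty
}

void PushWork(int item)
{
    EnterCriticalSection(&queueLock);
    workQueue.push_back(item);
    LeaveCriticalSection(&queueLock);
    ReleaseSemaphore(workAvailable, 1, NULL); // wake exactly one worker
}

int PopWork()
{
    WaitForSingleObject(workAvailable, INFINITE); // blocks until an item exists
    EnterCriticalSection(&queueLock);
    int item = workQueue.front();
    workQueue.pop_front();
    LeaveCriticalSection(&queueLock);
    return item;
}
```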
Lockless Programming
Lockless programming is a way to safely share data between multiple threads without the cost of acquiring and releasing locks.
Interlocked functions can perform simple thread-safe operations without using locks.
The use of these functions is called lockless programming.
Lockless Containers
Data can be safely shared between threads using lockless containers.
Code that executes on parallel threads can add and remove elements on the lock-free containers simultaneously.
- Lock-free Stack
- Lock-free Queue
- Lock-free Priority Queue
- Lock-free Hash Table
Lock-free Stack
Provides a thread-safe mechanism for adding and removing data elements from a last-in-first-out (LIFO) data structure.
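For example, pushing onto a lock-free stack can be done with an atomic compare-and-swap on the head pointer; a minimal sketch (pop is omitted because a safe pop must also deal with the ABA problem and memory reclamation):

```cpp
#include <atomic>

struct Node
{
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void Push(Node* node)
{
    node->next = head.load(std::memory_order_relaxed);
    // Retry until no other thread has changed head between the load and the swap.
    // On failure, compare_exchange_weak reloads the current head into node->next.
    while (!head.compare_exchange_weak(node->next, node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed))
    {
    }
}
```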
Lock-free Queue
Provides a thread-safe mechanism for adding and removing data elements from a first-in-first-out (FIFO) data structure.
Lock-free Priority Queue
Provides a thread-safe mechanism for adding and removing data elements from a prioritized queue structure where the elements are sorted according to a key.
Lock-free Hash Table
Provides a thread-safe mechanism for adding and removing data elements from a hash-table structure based on a hash key value.
Logging
The logging functions should be implemented using a thread-safe mechanism to perform physical writes to a log file on a specified hardware thread.
Best Practices
- Use as few threads as possible on each processor.
- Minimize context switching.
- Minimize use of system resources.
- Schedule threads on all available processors.
Synchronization
- Even though some synchronization methods are faster than others, it is usually better to synchronize less often.
- All shared data that can be written to must be protected with a synchronization primitive.
- Minimize the amount of shared data.
- Minimize the time spent updating shared data.
- Use lockless programming when appropriate.
Scheduling Priorities
A typical strategy is to set background threads, particularly those that are processor intensive, to a low priority, to ensure that they can be preempted when necessary.
The waiting high-priority thread then uses a wait function, which lets the system's scheduler boost the dynamic priority of the background thread.
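For example, on Windows a background worker can lower its own priority; a minimal sketch:

```cpp
#include <windows.h>

// Lower the calling thread's priority so it can be preempted by more urgent work.
void MakeCurrentThreadBackground()
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
}
```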
Thread Local Storage
Thread local storage (TLS) is an important way of avoiding synchronization.
Global variables can be defined as being per-thread, which gives each thread its own copy that can be referenced safely and efficiently without requiring synchronization.
TLS Storage
- On Windows and Xbox, use the TLS API.
- On other platforms, use the POSIX pthread library.
Steps:
- During program initialization, one thread allocates a TLS index/key. On Windows and Xbox, use TlsAlloc. On POSIX, use pthread_key_create.
- Each thread allocates storage and associates the index/key with a pointer to its storage. On Windows and Xbox, use TlsSetValue. On POSIX, use pthread_setspecific.
- When a thread needs to access its storage, it uses the TLS index/key to retrieve the pointer. On Windows and Xbox, use TlsGetValue. On POSIX, use pthread_getspecific.
- When a thread no longer needs the storage associated with a TLS index/key, it must free that storage and release the index/key. On Windows and Xbox, use TlsFree. On POSIX, use pthread_key_delete.
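A minimal sketch using the Windows TLS API (error handling omitted, and the int storage is only illustrative):

```cpp
#include <windows.h>

DWORD tlsIndex = TLS_OUT_OF_INDEXES;

void ProgramInit()
{
    tlsIndex = TlsAlloc();               // one index for the whole process
}

void ThreadInit()
{
    // Each thread associates the index with a pointer to its own storage.
    TlsSetValue(tlsIndex, new int(0));
}

int* GetThreadStorage()
{
    // Retrieve the calling thread's private pointer.
    return static_cast<int*>(TlsGetValue(tlsIndex));
}

void ThreadShutdown()
{
    delete static_cast<int*>(TlsGetValue(tlsIndex));
}

void ProgramShutdown()
{
    TlsFree(tlsIndex);                   // release the index
}
```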
TLS Variable
Alternatively, a TLS variable can be declared.
- MSVC — the __declspec keyword and the thread attribute.
__declspec(thread) int tls = 1;
- Clang/GCC — the __thread keyword.
__thread int tls = 1;
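C++11 also provides a portable alternative, the thread_local storage-class specifier.
thread_local int tls = 1;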
Timing
The rdtsc instruction should not be used directly as it is not necessarily synchronized between CPUs.
Use the Windows QPC API (QueryPerformanceCounter) or the C++ chrono library with a high-performance clock instead.
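For example, with the C++ chrono library:

```cpp
#include <chrono>

double MeasureSeconds()
{
    auto start = std::chrono::steady_clock::now(); // monotonic, high-resolution clock
    // ... work being timed ...
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count(); // elapsed seconds
}
```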
Debugging
A well balanced multithreaded system will reduce long stalls where threads are waiting on each other.
Some multithreaded bugs show up only rarely, making them difficult to find and fix.
Use Cases
Loading
Data is typically compressed, which helps reduce the size on disk but increases the workload of loading.
A file loading thread that handles decompression should use asynchronous file I/O to process load requests from the update thread.
The loading thread should perform large reads in order to maximize data-reading efficiency.
Rendering
The rendering can be put on a separate thread.
The simplest strategy is for the rendering thread to handle rendering operations so that the update thread has less workload.
A more complex strategy is for the update thread to prepare render buffers and render commands which the rendering thread then processes. The update thread is one frame ahead of the render thread, which means that it takes two frames before user actions appear on the screen.
Because the update and rendering threads work in lockstep with each other, their communication buffers are simply double-buffered: the update thread fills one buffer while the render thread reads from the other.
Physics and Animation
Physics and animation cannot be put onto a separate thread because the update thread requires the results of the calculations immediately.
Instead, the calculations are handled as jobs on multiple threads.
Careful design of data structures and synchronization points is important.
Optional Simulations
Some simulations can be handled by the job system, such as procedural geometry (clouds...), cloth, hair and fluid simulations.
They typically don't require complex synchronization with the update thread.
They do not need to run at a fixed frame rate.
Their complexity can be scaled down on lower systems.
Worker threads
There are two designs for implementing worker threads.
- Each worker thread pulls assignments from its own work queue as quickly as it can in a loop. The data goes from the update thread to the worker threads and then to the render thread, so there can be a delay of three or more frames before some actions appear on the screen.
- Several worker threads pull assignments from the same work queue.