Cache Memory
Overview
Access to main memory (also known as system memory) by the CPU is a slow operation.
Therefore, the CPU uses a set of memory caches.
Caches are smaller, faster banks of memory.
Cache Levels
There is a tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency.
For this reason, modern CPUs have two or three levels of cache.
The L1 cache is divided into two parts:
- The instruction cache holds copies of the instructions to execute.
- The data cache holds copies of memory that are read and written by the instructions.
The L2 and L3 caches are typically unified caches containing both instructions and data.
Higher levels of cache contain more memory than the previous level of cache but are also slower.
Different architectures have different levels of cache. Generally, there are three levels of cache and the level 3 cache is shared between the cores.
The L1 cache is generally located on the CPU itself and is fastest to access.
Higher levels of cache are generally located on external chips but are still faster to access than main memory.
Memory Access
When the processor tries to access a memory address:
- It checks the fastest L1 cache first.
- If there is a cache hit, the processor proceeds at high speed without checking higher levels of cache, and more importantly without accessing the main memory.
- If there is a cache miss, the next fastest cache is checked.
- The processor checks higher and slower levels of cache until there is a cache hit.
- If the data does not exist in any level of cache, it is copied from main memory into the cache (a conceptual sketch of this cascade follows the list).
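The following is a minimal conceptual sketch of this lookup cascade, assuming a toy CacheLevel type invented for illustration. Real hardware performs these checks in parallel in dedicated logic, not in software.

#include <cstdint>
#include <cstdio>
#include <optional>
#include <unordered_map>
#include <vector>

// Toy cache level: just an address -> value map (no lines, no eviction).
struct CacheLevel
{
    std::unordered_map<std::uintptr_t, std::uint32_t> entries;

    std::optional<std::uint32_t> Lookup(std::uintptr_t address) const
    {
        auto it = entries.find(address);
        if (it == entries.end())
            return std::nullopt;   // miss at this level
        return it->second;         // hit
    }
};

// Stand-in for slow main memory: every address "holds" its own low 32 bits.
std::uint32_t ReadMainMemory(std::uintptr_t address)
{
    return static_cast<std::uint32_t>(address);
}

std::uint32_t Read(std::vector<CacheLevel>& levels, std::uintptr_t address)
{
    for (CacheLevel& level : levels)                // check L1, then L2, then L3
    {
        if (auto value = level.Lookup(address))
            return *value;                          // cache hit: stop searching
    }
    std::uint32_t value = ReadMainMemory(address);  // miss in every level
    for (CacheLevel& level : levels)
        level.entries[address] = value;             // copy the data into the caches
    return value;
}

int main()
{
    std::vector<CacheLevel> levels(3);              // L1, L2, L3
    std::printf("%u\n", Read(levels, 0x1000));      // miss everywhere, fetched from memory
    std::printf("%u\n", Read(levels, 0x1000));      // hit in the first (fastest) level
}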
Architectures
On Intel i7 processors, the L1 caches are 32 KB each (instruction and data) and dedicated to each core. The L2 cache is 256 KB and also dedicated to each core. The L3 cache is 6 MB to 8 MB and shared by all the cores.
The Xbox One and the PS4 have a CPU built from two quad-core modules. There are only two levels of cache, but with an additional bridge between the two split L2 caches. The total L2 size is also increased to 4 MB (2 MB per module).
Cache Lines
Each cache is organized into fixed-size units called cache lines.
Their size is typically a power of two (commonly 64 bytes).
When a program references a memory address, the CPU loads an entire cache line, copying from system memory starting at the nearest lower address that is aligned to the cache-line size.
Therefore, properly aligning data allows better use of the cache line.
Predictable access patterns are very prefetch-friendly.
For example, linear array traversals are very cache-friendly (see locality).
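As a sketch of both points (the Particle type and the 64-byte line size are assumptions for illustration), alignas keeps each element from straddling two cache lines, and a linear traversal lets the hardware prefetcher stream the data in:

#include <cstddef>

// Assume 64-byte cache lines (typical, but architecture-dependent).
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical particle data aligned to a cache-line boundary so that an
// element never straddles two lines.
struct alignas(kCacheLineSize) Particle
{
    float position[3];
    float velocity[3];
    float mass;
    // padded by the compiler up to 64 bytes
};

// Linear traversal: consecutive elements occupy consecutive cache lines, an
// access pattern the hardware prefetcher handles well.
float TotalMass(const Particle* particles, std::size_t count)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        total += particles[i].mass;
    return total;
}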
Cache Misses
Cache hits and cache misses occur when a memory address is read or written by the CPU.
When the CPU tries to access a memory address:
- If the memory address is in the cache, a cache hit occurs.
- If the memory address is not in the cache, a cache miss occurs.
Cache misses can be reduced through strategic use of memory that promotes locality of reference.
The benefit of locality of reference is to increase the likelihood that data is already in the cache when instructions need access to it. The data should ideally fit in a cache line.
Spatial locality
Spatial locality is the use of data elements located at nearby memory addresses.
To benefit from spatial locality, keep data structures close together in memory.
In other words, if a block of memory is accessed by the CPU, there is a good chance that adjacent memory blocks will also be touched by the CPU.
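A small sketch of the difference (the container choices are illustrative): summing a contiguous array touches adjacent addresses, while walking a linked list of separately allocated nodes usually does not:

#include <list>
#include <numeric>
#include <vector>

// Good spatial locality: elements are adjacent in memory, so each cache line
// that is loaded contains several upcoming elements.
int SumContiguous(const std::vector<int>& values)
{
    return std::accumulate(values.begin(), values.end(), 0);
}

// Poor spatial locality: each node is a separate allocation, so consecutive
// elements are usually far apart and each access may load a new cache line.
int SumScattered(const std::list<int>& values)
{
    int total = 0;
    for (int value : values)
        total += value;
    return total;
}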
Temporal locality
Temporal locality is the reuse of data structures within a small time duration.
To benefit from temporal locality, keep accesses to the same data as close together in time as possible.
In other words, if a block of memory is accessed by the CPU, there is a good chance that it will be accessed again in a short amount of time.
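One common way to exploit temporal locality is to process data in tiles small enough to stay resident in the cache, finishing all work on a tile before moving on. The tile size below is an assumption chosen to fit comfortably in a typical 32 KB L1 data cache:

#include <cstddef>

// Apply two passes over the data, tile by tile, so that each tile is still in
// the cache when the second pass touches it again (temporal locality).
void ScaleThenBias(float* data, std::size_t count, float scale, float bias)
{
    constexpr std::size_t kTile = 4096;              // assumed tile size: 16 KB of floats
    for (std::size_t begin = 0; begin < count; begin += kTile)
    {
        std::size_t end = begin + kTile < count ? begin + kTile : count;
        for (std::size_t i = begin; i < end; ++i)    // first pass over the tile
            data[i] *= scale;
        for (std::size_t i = begin; i < end; ++i)    // second pass reuses cached data
            data[i] += bias;
    }
}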
Cache Thrashing
Poor locality results in cache thrashing and cache pollution.
Cache thrashing occurs when multiple memory locations compete for the same cache lines, resulting in excessive cache misses.
Cache pollution occurs when unnecessary data is loaded into the CPU cache causing other useful data to be evicted from the cache.
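A sketch of how thrashing can arise (the sizes are illustrative and the exact behaviour depends on cache size and associativity): traversing a large matrix column by column with a power-of-two row stride makes successive accesses map to the same small group of cache sets, so they keep evicting each other:

#include <cstddef>

// With a 4096-byte row stride, every access in a column tends to map to the
// same cache set on a typical 32 KB, 8-way, 64-byte-line L1 data cache, so the
// lines evict each other long before they are reused.
constexpr std::size_t kRows = 1024;
constexpr std::size_t kCols = 1024;   // 1024 * sizeof(float) = 4096-byte stride

float SumColumn(const float (&matrix)[kRows][kCols], std::size_t column)
{
    float total = 0.0f;
    for (std::size_t row = 0; row < kRows; ++row)
        total += matrix[row][column]; // each access is 4 KB after the previous one
    return total;
}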
Loads and Stores
A cache load occurs when the CPU copies data from the main memory to the caches.
A cache flush (or store) occurs when the CPU copies data back from the caches to the main memory.
Load-Hit-Stores
Data that was just written is not immediately available for reading.
A load-hit-store occurs when a thread stores data to a memory location and shortly afterward reloads from that same location.
There are several possible causes of a load-hit-store.
- Transferring data between register sets
- Updating data in arrays
- Passing structures by value
Transferring Data between Register Sets
Data is transferred between register sets when performing type conversions (for example, float to integer and integer to float conversions).
A data transfer between register sets can only be done by going through memory.
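On some CPUs (notably older in-order PowerPC cores used in earlier consoles), there is no direct move between the floating-point and integer register files, so a conversion like the one below compiles to a store followed almost immediately by a load of the same address, which is exactly the load-hit-store pattern. This is an illustrative snippet, not a guarantee about any particular compiler or CPU:

// Hypothetical conversion: a float volume quantized to an integer step count.
// On architectures without a register-to-register move between the float and
// integer register files, the converted value is stored to the stack and
// reloaded, causing a potential load-hit-store in a hot loop.
int QuantizeVolume(float volume)
{
    return static_cast<int>(volume * 255.0f);
}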
Updating Data in Arrays
Frequent read-modify-write sequences to the same element of an array can cause a load-hit-store.
When writing data through a pointer that is not marked as restrict, the compiler assumes that other variables may be overwritten, which can force it to flush their values to memory and reload them later.
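A hedged sketch of the aliasing problem, using the __restrict spelling provided by common C++ compilers (the functions are invented for illustration): without the qualifier, the compiler must assume that writing through out can change what in points to, so the accumulator is read, modified, and written back to memory on every iteration:

#include <cstddef>

// Without restrict, the compiler must assume that writing through 'out' could
// modify the memory that 'in' points to, so it keeps re-writing out[0] to
// memory and reloading 'in' values on every iteration (read-modify-write).
void SumInto(float* out, const float* in, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[0] += in[i];
}

// With the (compiler-specific) __restrict qualifier, the compiler may keep the
// accumulator in a register and write it back once at the end.
void SumIntoRestrict(float* __restrict out, const float* __restrict in, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[0] += in[i];
}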
Passing Structures by Value
Function calls that pass structures by value may have to pass the structures in memory.
The calling function writes the structure to the stack, and the called function reads it back, causing a load-hit-store.
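A sketch with an invented structure: when the structure is too large to be passed in registers, the by-value call writes it to the stack and the callee reads it straight back, while passing by reference avoids both the copy and the potential load-hit-store:

// Hypothetical large structure: passing it by value may force the caller to
// write it to the stack and the callee to read it straight back (a potential
// load-hit-store), in addition to the cost of the copy itself.
struct Transform
{
    float matrix[16];
};

float TraceByValue(Transform t)            // copies 64 bytes through memory
{
    return t.matrix[0] + t.matrix[5] + t.matrix[10] + t.matrix[15];
}

float TraceByReference(const Transform& t) // passes a pointer instead
{
    return t.matrix[0] + t.matrix[5] + t.matrix[10] + t.matrix[15];
}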
Virtual Memory
Read and write instructions specify memory addresses in the virtual memory address space.
Virtual addresses are mapped to physical memory addresses.
The address translation is performed by the Memory Management Unit (MMU) using a Translation Look-aside Buffer (TLB).
Virtual memory is used only by CPUs. GPUs and audio hardware need contiguous physical memory.
Pages
The CPU divides memory into fixed-size pages (commonly 4 KB, with larger page sizes also available).
Translation Look-aside Buffer
Before a read or write instruction can be executed, the CPU needs to translate the virtual address to a physical memory address.
Each CPU core has a Translation Look-aside Buffer (TLB) which is a virtual-to-physical translation cache used for both code and data.
Coherency
Coherent
Coherent means that memory accesses from other clients can snoop the CPU caches.
Coherent clients must still explicitly flush modified data from their own caches for the latest copy to become visible to the CPU.
Fully-coherent
Fully coherent means that the CPU does not need to explicitly flush caches in order for the latest copy of modified data to be available.
Memory Types
There are three memory types: cacheable (default), write-combined, and non-cached.
Cacheable
Loads from cacheable memory involve fetching data in aligned units of 64 bytes known as cache lines.
Write-combined
Loads and stores bypass the data cache, reducing pressure on the cache.
Stores are batched into 64-byte write-gather buffers and later directly written to main RAM in bursts.
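Write-combined memory itself is normally set up through platform-specific memory-mapping APIs, but on x86 a similar effect can be sketched with non-temporal store intrinsics, which route stores through write-combining buffers instead of the data cache. The function below is illustrative and assumes a 16-byte-aligned destination:

#include <emmintrin.h>   // SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence
#include <cstddef>
#include <cstdint>

// Fill a buffer using non-temporal stores: the data goes out through
// write-combining buffers in bursts and does not pollute the data cache.
void FillNonTemporal(std::uint32_t* dst, std::size_t count, std::uint32_t value)
{
    __m128i pattern = _mm_set1_epi32(static_cast<int>(value));
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), pattern);
    for (; i < count; ++i)   // tail that does not fill a whole 16-byte store
        dst[i] = value;
    _mm_sfence();            // make the streaming stores visible before returning
}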
Branch Prediction
Speculative out-of-order execution means that each core can speculatively execute instructions before the address of a jump target has been resolved.
The branch predictor is responsible for supplying the address at which to start speculative execution.
Without speculative execution, the CPU would have to stall, waiting for the resolution of the address at which to continue execution.
If the branch predictor mispredicts the address, all speculative instructions are discarded and never retired.
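A classic sketch of the effect (function names are invented): the branch below is cheap when the data lets the predictor find a pattern, for example when the values are sorted, and expensive when the outcomes are effectively random; a branchless formulation removes the dependence on the data distribution:

#include <cstddef>
#include <cstdint>

// The branch on 'values[i] >= 128' is easy to predict when the data is sorted
// (long runs of the same outcome) and hard to predict when it is random, where
// frequent mispredictions discard speculatively executed work.
std::int64_t SumLarge(const std::uint8_t* values, std::size_t count)
{
    std::int64_t total = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        if (values[i] >= 128)   // prediction quality depends on the data
            total += values[i];
    }
    return total;
}

// A branchless alternative trades the branch for arithmetic, making the cost
// independent of the data distribution.
std::int64_t SumLargeBranchless(const std::uint8_t* values, std::size_t count)
{
    std::int64_t total = 0;
    for (std::size_t i = 0; i < count; ++i)
        total += (values[i] >= 128) ? values[i] : 0;  // often compiled to a conditional move
    return total;
}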
Policies
On modern architectures:
- L1 is no-allocate and write-through.
- L2 is write-allocate and write-back.
Write-allocate
Write-allocate means that when writing to an address that is not already in the cache, a cache line is allocated for it.
No-allocate
No-allocate means that when writing to an address that is not already in the cache, the data is not stored in the cache.
It avoids wasting space with data that is never read.
Write-back
Write-back means that writes to the cache are not written to memory or the next cache until they have to be.
Write-through
Write-through means that writes are immediately propagated to memory or to the next cache level, so the current cache never holds the only modified copy of the data.
This keeps synchronizing data between processors fast, because there is never any modified data in a lower-level cache that needs to be flushed to a higher-level cache.
Best Practices
General
Code and data structures must be laid out to maximize cache efficiency.
Work with contiguous structures; dynamic, pointer-based structures cause more cache misses.
Pack and unpack data in registers with specific SIMD instructions.
Hot/cold split data structures to prevent loading unneeded data into the cache (a sketch follows this list).
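A sketch of a hot/cold split with invented fields: data needed every frame stays in a small hot structure, while rarely used data lives behind a pointer so it never occupies cache lines during the update loop:

#include <cstdint>
#include <string>
#include <vector>

// Cold data: only needed occasionally (debugging, tooling, spawning).
struct EntityCold
{
    std::string debugName;
    std::uint64_t creationTime;
};

// Hot data: touched every frame. Keeping it small means more entities fit in
// each cache line during the update loop.
struct EntityHot
{
    float position[3];
    float velocity[3];
    EntityCold* cold;   // rarely dereferenced, so the cold data stays out of the cache
};

void Integrate(std::vector<EntityHot>& entities, float dt)
{
    for (EntityHot& e : entities)
        for (int axis = 0; axis < 3; ++axis)
            e.position[axis] += e.velocity[axis] * dt;
}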
Thread affinity
On systems with a split L2 cache, two threads on different modules working with the same data set will incur heavy penalties through remote L2 and L1 latencies.
Branch Misprediction
In very critical code, avoid:
- virtual function calls.
- function pointer dereferences.
- switch statements based on effectively random data (see the sketch after this list).
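Where the data allows it, one mitigation for the last point is to partition work by type ahead of time so the per-element switch disappears from the hot loop. The types below are invented for illustration:

#include <vector>

enum class ShapeType { Circle, Box };

struct Shape
{
    ShapeType type;
    float size;
};

// Branchy version: the switch outcome follows the (possibly random) order of
// the shapes, which the branch predictor may handle poorly.
float TotalAreaMixed(const std::vector<Shape>& shapes)
{
    float total = 0.0f;
    for (const Shape& s : shapes)
    {
        switch (s.type)
        {
        case ShapeType::Circle: total += 3.14159f * s.size * s.size; break;
        case ShapeType::Box:    total += s.size * s.size;            break;
        }
    }
    return total;
}

// Partitioned version: each loop body has no per-element branch on the type.
float TotalAreaPartitioned(const std::vector<Shape>& circles,
                           const std::vector<Shape>& boxes)
{
    float total = 0.0f;
    for (const Shape& s : circles)
        total += 3.14159f * s.size * s.size;
    for (const Shape& s : boxes)
        total += s.size * s.size;
    return total;
}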
Data Cache
- Perform linear array traversals.
- Use as much of a cache line as possible.
Instruction Cache
- Fit the working set in the cache.
- Reduce branching.
  - Modern branch predictors have high accuracy.
  - Avoid the cost of branch mispredictions.
- Inline cautiously.
  - Can increase instruction cache misses (larger code).
  - Can decrease instruction cache misses (no jump to out-of-line code).
Examples
Matrix Multiplication
Matrices are generally stored in row-major order, which means that consecutive elements of a row reside in contiguous memory locations.
If the multiplication traverses B column by column (the i, j, k loop order below), each access to B[k][j] touches a different cache line and evicts previously loaded elements from the cache.
for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j)
        for (int k = 0; k < p; ++k)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
If the multiplication instead traverses B row by row (the i, k, j loop order below), locality of reference causes the algorithm to perform much more efficiently.
for (int i = 0; i < n; ++i)
    for (int k = 0; k < p; ++k)
        for (int j = 0; j < m; ++j)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
Member variable
In this example, we compute a sum from the values in an array and store the result in a member variable.
struct MyType
{
    size_t _total = 0;
    void Update(int* values, size_t count);
};
When a member variable is accessed inside a loop, performance can suffer.
Because _total is reached through the this pointer, the compiler may be unable to keep it in a register, and the repeated stores and reloads slow down the execution of the loop.
void MyType::Update(int* values, size_t count)
{
    for (size_t i = 0; i < count; ++i)
    {
        // The member may be stored to memory and reloaded on every iteration.
        _total += values[i];
    }
}
Defining a local variable to perform the calculation in the loop, and then storing the result in the data member, can be much faster.
void MyType::Update(int* values, size_t count)
{
    size_t total = 0;           // local accumulator can stay in a register
    for (size_t i = 0; i < count; ++i)
    {
        total += values[i];
    }
    _total = total;             // single store once the loop is done
}
Alternatively, the restrict keyword (spelled __restrict in most C++ compilers) can be used as a hint to the compiler that the memory is not aliased.
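A sketch of that alternative, assuming the MyType definition above and the __restrict spelling available in common compilers:

void MyType::Update(int* __restrict values, size_t count)
{
    // The qualifier promises that 'values' is not an alias of '_total' (reached
    // through 'this'), so the compiler is free to keep the sum in a register
    // and write it back once.
    for (size_t i = 0; i < count; ++i)
    {
        _total += values[i];
    }
}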