Cache Memory
Overview
Access to main memory (also known as system memory) by the CPU is a slow operation.
Therefore, the CPU uses a set of memory caches.
Caches are smaller, faster banks of memory.
Cache Levels
There is a tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency.
For this reason, modern CPUs have two or three levels of cache.
The L1 cache is divided into two parts:
- The instruction cache holds copies of the instructions to execute.
- The data cache holds copies of memory that are read and written by the instructions.
The L2 and L3 caches are typically unified caches containing both instructions and data.
Higher levels of cache contain more memory than the previous level of cache but are also slower.
Different architectures have different levels of cache. Generally, there are three levels of cache and the level 3 cache is shared between the cores.
The L1 cache is generally located on the CPU itself and is fastest to access.
Higher levels of cache are generally located on external chips but are still faster to access than main memory.
Memory Access
When the processor tries to access a memory address:
- It checks the fastest L1 cache first.
- If there is a cache hit, the processor proceeds at high speed without checking higher levels of cache, and more importantly without accessing the main memory.
- If there is a cache miss, the next fastest cache is checked.
- The processor checks higher and slower levels of cache until there is a cache hit.
- If the data does not exist in any level of cache, it is copied from main memory into the cache (a conceptual sketch of this cascade follows the list).
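The following is a minimal conceptual sketch of this lookup cascade, assuming a toy CacheLevel type invented for illustration. Real hardware performs these checks in parallel in dedicated logic, not in software.

#include <cstdint>
#include <cstdio>
#include <optional>
#include <unordered_map>
#include <vector>

// Toy cache level: just an address -> value map (no lines, no eviction).
struct CacheLevel
{
    std::unordered_map<std::uintptr_t, std::uint32_t> entries;

    std::optional<std::uint32_t> Lookup(std::uintptr_t address) const
    {
        auto it = entries.find(address);
        if (it == entries.end())
            return std::nullopt;   // miss at this level
        return it->second;         // hit
    }
};

// Stand-in for slow main memory: every address "holds" its own low 32 bits.
std::uint32_t ReadMainMemory(std::uintptr_t address)
{
    return static_cast<std::uint32_t>(address);
}

std::uint32_t Read(std::vector<CacheLevel>& levels, std::uintptr_t address)
{
    for (CacheLevel& level : levels)                // check L1, then L2, then L3
    {
        if (auto value = level.Lookup(address))
            return *value;                          // cache hit: stop searching
    }
    std::uint32_t value = ReadMainMemory(address);  // miss in every level
    for (CacheLevel& level : levels)
        level.entries[address] = value;             // copy the data into the caches
    return value;
}

int main()
{
    std::vector<CacheLevel> levels(3);              // L1, L2, L3
    std::printf("%u\n", Read(levels, 0x1000));      // miss everywhere, fetched from memory
    std::printf("%u\n", Read(levels, 0x1000));      // hit in the first (fastest) level
}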
Architectures
On Intel i7 processors, the L1 caches are 32 KB each (instruction and data) and dedicated to each core. The L2 cache is 256 KB and also dedicated to each core. The L3 cache is 6 MB to 8 MB and shared by all the cores.
The Xbox One and the PS4 have a CPU built from two quad-core modules. There are only two levels of cache, but with an additional bridge between the two split L2 caches. The total L2 size is also increased to 4 MB (2 MB per module).
Cache Lines
Each cache is organized into fixed-size units called cache lines.
Their size is typically a power of two (commonly 64 bytes).
When a program references a memory address, the CPU loads an entire cache line, copying from system memory starting at the nearest lower address that is aligned to the cache-line size.
Therefore, properly aligning data allows better use of the cache line.
Predictable access patterns are very prefetch-friendly.
For example, linear array traversals are very cache-friendly (see locality).
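As a sketch of both points (the Particle type and the 64-byte line size are assumptions for illustration), alignas keeps each element from straddling two cache lines, and a linear traversal lets the hardware prefetcher stream the data in:

#include <cstddef>

// Assume 64-byte cache lines (typical, but architecture-dependent).
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical particle data aligned to a cache-line boundary so that an
// element never straddles two lines.
struct alignas(kCacheLineSize) Particle
{
    float position[3];
    float velocity[3];
    float mass;
    // padded by the compiler up to 64 bytes
};

// Linear traversal: consecutive elements occupy consecutive cache lines, an
// access pattern the hardware prefetcher handles well.
float TotalMass(const Particle* particles, std::size_t count)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        total += particles[i].mass;
    return total;
}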
Cache Misses
Cache hits and cache misses occur when a memory address is read or written by the CPU.
When the CPU tries to access a memory address:
- If the memory address is in the cache, a cache hit occurs.
- If the memory address is not in the cache, a cache miss occurs.
Cache misses can be reduced through strategic use of memory that promotes locality of reference.
The benefit of locality of reference is to increase the likelihood that data is already in the cache when instructions need access to it. The data should ideally fit in a cache line.
Spatial locality
Spatial locality is the use of data elements located at nearby memory addresses.
To benefit from spatial locality, keep data structures close together in memory.
In other words, if a block of memory is accessed by the CPU, there is a good chance that adjacent memory blocks will also be touched by the CPU.
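A small sketch of the difference (the container choices are illustrative): summing a contiguous array touches adjacent addresses, while walking a linked list of separately allocated nodes usually does not:

#include <list>
#include <numeric>
#include <vector>

// Good spatial locality: elements are adjacent in memory, so each cache line
// that is loaded contains several upcoming elements.
int SumContiguous(const std::vector<int>& values)
{
    return std::accumulate(values.begin(), values.end(), 0);
}

// Poor spatial locality: each node is a separate allocation, so consecutive
// elements are usually far apart and each access may load a new cache line.
int SumScattered(const std::list<int>& values)
{
    int total = 0;
    for (int value : values)
        total += value;
    return total;
}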
Temporal locality
Temporal locality is the reuse of data structures within a small time duration.
To benefit from temporal locality, keep accesses to the same data as close together in time as possible.
In other words, if a block of memory is accessed by the CPU, there is a good chance that it will be accessed again in a short amount of time.
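One common way to exploit temporal locality is to process data in tiles small enough to stay resident in the cache, finishing all work on a tile before moving on. The tile size below is an assumption chosen to fit comfortably in a typical 32 KB L1 data cache:

#include <cstddef>

// Apply two passes over the data, tile by tile, so that each tile is still in
// the cache when the second pass touches it again (temporal locality).
void ScaleThenBias(float* data, std::size_t count, float scale, float bias)
{
    constexpr std::size_t kTile = 4096;              // assumed tile size: 16 KB of floats
    for (std::size_t begin = 0; begin < count; begin += kTile)
    {
        std::size_t end = begin + kTile < count ? begin + kTile : count;
        for (std::size_t i = begin; i < end; ++i)    // first pass over the tile
            data[i] *= scale;
        for (std::size_t i = begin; i < end; ++i)    // second pass reuses cached data
            data[i] += bias;
    }
}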
Cache Thrashing
Poor locality results in cache thrashing and cache pollution.
Cache thrashing occurs when multiple memory locations compete for the same cache lines, resulting in excessive cache misses.
Cache pollution occurs when unnecessary data is loaded into the CPU cache causing other useful data to be evicted from the cache.
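A sketch of how thrashing can arise (the sizes are illustrative and the exact behaviour depends on cache size and associativity): traversing a large matrix column by column with a power-of-two row stride makes successive accesses map to the same small group of cache sets, so they keep evicting each other:

#include <cstddef>

// With a 4096-byte row stride, every access in a column tends to map to the
// same cache set on a typical 32 KB, 8-way, 64-byte-line L1 data cache, so the
// lines evict each other long before they are reused.
constexpr std::size_t kRows = 1024;
constexpr std::size_t kCols = 1024;   // 1024 * sizeof(float) = 4096-byte stride

float SumColumn(const float (&matrix)[kRows][kCols], std::size_t column)
{
    float total = 0.0f;
    for (std::size_t row = 0; row < kRows; ++row)
        total += matrix[row][column]; // each access is 4 KB after the previous one
    return total;
}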
Loads and Stores
A cache load occurs when the CPU copies data from the main memory to the caches.
A cache flush (or store) occurs when the CPU copies data back from the caches to the main memory.
Load-Hit-Stores
Data that was just written is not immediately available for reading.
A load-hit-store occurs when a thread stores data to a memory location and shortly afterward reloads from that same location.
There are several possible causes of a load-hit-store.
- Transferring data between register sets
- Updating data in arrays
- Passing structures by value
Transferring Data between Register Sets
Data is transferred between register sets when performing type conversions (for example, float to integer and integer to float conversions).
A data transfer between register sets can only be done by going through memory.
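On some CPUs (notably older in-order PowerPC cores used in earlier consoles), there is no direct move between the floating-point and integer register files, so a conversion like the one below compiles to a store followed almost immediately by a load of the same address, which is exactly the load-hit-store pattern. This is an illustrative snippet, not a guarantee about any particular compiler or CPU:

// Hypothetical conversion: a float volume quantized to an integer step count.
// On architectures without a register-to-register move between the float and
// integer register files, the converted value is stored to the stack and
// reloaded, causing a potential load-hit-store in a hot loop.
int QuantizeVolume(float volume)
{
    return static_cast<int>(volume * 255.0f);
}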
Updating Data in Arrays
Frequent read-modify-write sequences to the same element of an array can cause a load-hit-store.
When writing data through a pointer that is not marked as restrict, the compiler assumes that other variables may be overwritten, which can force it to flush their values to memory and reload them later.
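A hedged sketch of the aliasing problem, using the __restrict spelling provided by common C++ compilers (the functions are invented for illustration): without the qualifier, the compiler must assume that writing through out can change what in points to, so the accumulator is read, modified, and written back to memory on every iteration:

#include <cstddef>

// Without restrict, the compiler must assume that writing through 'out' could
// modify the memory that 'in' points to, so it keeps re-writing out[0] to
// memory and reloading 'in' values on every iteration (read-modify-write).
void SumInto(float* out, const float* in, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[0] += in[i];
}

// With the (compiler-specific) __restrict qualifier, the compiler may keep the
// accumulator in a register and write it back once at the end.
void SumIntoRestrict(float* __restrict out, const float* __restrict in, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[0] += in[i];
}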
Passing Structures by Value
Function calls that pass structures by value may have to pass the structures in memory.
The calling function writes the structure to the stack, and the called function reads it back, causing a load-hit-store.
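A sketch with an invented structure: when the structure is too large to be passed in registers, the by-value call writes it to the stack and the callee reads it straight back, while passing by reference avoids both the copy and the potential load-hit-store:

// Hypothetical large structure: passing it by value may force the caller to
// write it to the stack and the callee to read it straight back (a potential
// load-hit-store), in addition to the cost of the copy itself.
struct Transform
{
    float matrix[16];
};

float TraceByValue(Transform t)            // copies 64 bytes through memory
{
    return t.matrix[0] + t.matrix[5] + t.matrix[10] + t.matrix[15];
}

float TraceByReference(const Transform& t) // passes a pointer instead
{
    return t.matrix[0] + t.matrix[5] + t.matrix[10] + t.matrix[15];
}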
Virtual Memory
Read and write instructions specify memory addresses in the virtual memory address space.
Virtual addresses are mapped to physical memory addresses.
The address translation is performed by the Memory Management Unit (MMU) using a Translation Look-aside Buffer (TLB).
Virtual memory is used only by CPUs. GPUs and audio hardware need contiguous physical memory.
Pages
The CPU divides memory into fixed-size pages (commonly 4 KB, with larger page sizes also available).
Translation Look-aside Buffer
Before a read or write instruction can be executed, the CPU needs to translate the virtual address to a physical memory address.
Each CPU core has a Translation Look-aside Buffer (TLB) which is a virtual-to-physical translation cache used for both code and data.
Coherency
Coherent
Coherent means that memory accesses from other clients can snoop the CPU caches.
Coherent clients must still explicitly flush modified data from their own caches for the latest copy to become visible to the CPU.
Fully-coherent
Fully coherent means that the CPU does not need to explicitly flush caches in order for the latest copy of modified data to be available.
Memory Types
There are three memory types: cacheable (default), write-combined, and non-cached.
Cacheable
Loads from cacheable memory involve fetching data in aligned units of 64 bytes known as cache lines.
Write-combined
Loads and stores bypass the data cache, reducing pressure on the cache.
Stores are batched into 64-byte write-gather buffers and later directly written to main RAM in bursts.
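Write-combined memory itself is normally set up through platform-specific memory-mapping APIs, but on x86 a similar effect can be sketched with non-temporal store intrinsics, which route stores through write-combining buffers instead of the data cache. The function below is illustrative and assumes a 16-byte-aligned destination:

#include <emmintrin.h>   // SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence
#include <cstddef>
#include <cstdint>

// Fill a buffer using non-temporal stores: the data goes out through
// write-combining buffers in bursts and does not pollute the data cache.
void FillNonTemporal(std::uint32_t* dst, std::size_t count, std::uint32_t value)
{
    __m128i pattern = _mm_set1_epi32(static_cast<int>(value));
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), pattern);
    for (; i < count; ++i)   // tail that does not fill a whole 16-byte store
        dst[i] = value;
    _mm_sfence();            // make the streaming stores visible before returning
}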
Branch Prediction
Speculative out-of-order execution means that each core can speculatively execute instructions before the address of a jump target has been resolved.
The branch predictor is responsible for supplying the address at which to start speculative execution.
Without speculative execution, the CPU would have to stall, waiting for the resolution of the address at which to continue execution.
If the branch predictor mispredicts the address, all speculative instructions are discarded and never retired.
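A classic sketch of the effect (function names are invented): the branch below is cheap when the data lets the predictor find a pattern, for example when the values are sorted, and expensive when the outcomes are effectively random; a branchless formulation removes the dependence on the data distribution:

#include <cstddef>
#include <cstdint>

// The branch on 'values[i] >= 128' is easy to predict when the data is sorted
// (long runs of the same outcome) and hard to predict when it is random, where
// frequent mispredictions discard speculatively executed work.
std::int64_t SumLarge(const std::uint8_t* values, std::size_t count)
{
    std::int64_t total = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        if (values[i] >= 128)   // prediction quality depends on the data
            total += values[i];
    }
    return total;
}

// A branchless alternative trades the branch for arithmetic, making the cost
// independent of the data distribution.
std::int64_t SumLargeBranchless(const std::uint8_t* values, std::size_t count)
{
    std::int64_t total = 0;
    for (std::size_t i = 0; i < count; ++i)
        total += (values[i] >= 128) ? values[i] : 0;  // often compiled to a conditional move
    return total;
}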
Policies
On modern architectures:
- L1 is no-allocate and write-through.
- L2 is write-allocate and write-back.
Write-allocate
Write-allocate means that when writing to an address that is not already in the cache, a cache line is allocated for it.
No-allocate
No-allocate means that when writing to an address that is not already in the cache, the data is not stored in the cache.
It avoids wasting space with data that is never read.
Write-back
Write-back means that writes to the cache are not written to memory or the next cache until they have to be.
Write-through
Write-through means that writes are immediately propagated to memory or to the next cache level, so the current cache never holds the only modified copy of the data.
This keeps synchronizing data between processors fast, because there is never any modified data in a lower-level cache that needs to be flushed to a higher-level cache.
Best Practices
General
Code and data structures must be laid out to maximize cache efficiency.
Work with contiguous structures; dynamic, pointer-based structures cause more cache misses.
Pack and unpack data in registers with specific SIMD instructions.
Hot/cold split data structures to prevent loading unneeded data into the cache (a sketch follows this list).
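A sketch of a hot/cold split with invented fields: data needed every frame stays in a small hot structure, while rarely used data lives behind a pointer so it never occupies cache lines during the update loop:

#include <cstdint>
#include <string>
#include <vector>

// Cold data: only needed occasionally (debugging, tooling, spawning).
struct EntityCold
{
    std::string debugName;
    std::uint64_t creationTime;
};

// Hot data: touched every frame. Keeping it small means more entities fit in
// each cache line during the update loop.
struct EntityHot
{
    float position[3];
    float velocity[3];
    EntityCold* cold;   // rarely dereferenced, so the cold data stays out of the cache
};

void Integrate(std::vector<EntityHot>& entities, float dt)
{
    for (EntityHot& e : entities)
        for (int axis = 0; axis < 3; ++axis)
            e.position[axis] += e.velocity[axis] * dt;
}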
Thread affinity
On systems with a split L2 cache, two threads on different modules working with the same data set will incur heavy penalties through remote L2 and L1 latencies.
Branch Misprediction
In very critical code, avoid:
- virtual function calls.
- function pointer dereferences.
- switch statements based on effectively random data (see the sketch after this list).
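Where the data allows it, one mitigation for the last point is to partition work by type ahead of time so the per-element switch disappears from the hot loop. The types below are invented for illustration:

#include <vector>

enum class ShapeType { Circle, Box };

struct Shape
{
    ShapeType type;
    float size;
};

// Branchy version: the switch outcome follows the (possibly random) order of
// the shapes, which the branch predictor may handle poorly.
float TotalAreaMixed(const std::vector<Shape>& shapes)
{
    float total = 0.0f;
    for (const Shape& s : shapes)
    {
        switch (s.type)
        {
        case ShapeType::Circle: total += 3.14159f * s.size * s.size; break;
        case ShapeType::Box:    total += s.size * s.size;            break;
        }
    }
    return total;
}

// Partitioned version: each loop body has no per-element branch on the type.
float TotalAreaPartitioned(const std::vector<Shape>& circles,
                           const std::vector<Shape>& boxes)
{
    float total = 0.0f;
    for (const Shape& s : circles)
        total += 3.14159f * s.size * s.size;
    for (const Shape& s : boxes)
        total += s.size * s.size;
    return total;
}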
Data Cache
- Perform linear array traversals.
- Use as much of a cache line as possible.
Instruction Cache
- Fit the working set in the cache.
- Reduce branching.
  - Modern branch predictors have high accuracy.
  - Avoid the cost of branch mispredictions.
- Inline cautiously.
  - Can increase instruction cache misses (larger code).
  - Can decrease instruction cache misses (no jump to out-of-line code).
Examples
Matrix Multiplication
Matrices are generally stored in row-major order, which means that consecutive elements of a row reside in contiguous memory locations.
If the multiplication traverses B column by column (the i, j, k loop order below), each access to B[k][j] touches a different cache line and evicts previously loaded elements from the cache.
for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j)
        for (int k = 0; k < p; ++k)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
If the multiplication instead traverses B row by row (the i, k, j loop order below), locality of reference causes the algorithm to perform much more efficiently.
for (int i = 0; i < n; ++i)
    for (int k = 0; k < p; ++k)
        for (int j = 0; j < m; ++j)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
Member variable
In this example, we compute a sum from the values in an array and store the result in a member variable.
struct MyType
{
    size_t _total = 0;
    void Update(int* values, size_t count);
};
When a member variable is accessed inside a loop, performance can suffer.
Because _total is reached through the this pointer, the compiler may be unable to keep it in a register, and the repeated stores and reloads slow down the execution of the loop.
void MyType::Update(int* values, size_t count)
{
    for (size_t i = 0; i < count; ++i)
    {
        // The member may be stored to memory and reloaded on every iteration.
        _total += values[i];
    }
}
Defining a local variable to perform the calculation in the loop, and then storing the result in the data member, can be much faster.
void MyType::Update(int* values, size_t count)
{
    size_t total = 0;           // local accumulator can stay in a register
    for (size_t i = 0; i < count; ++i)
    {
        total += values[i];
    }
    _total = total;             // single store once the loop is done
}
Alternatively, the restrict keyword (spelled __restrict in most C++ compilers) can be used as a hint to the compiler that the memory is not aliased.
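A sketch of that alternative, assuming the MyType definition above and the __restrict spelling available in common compilers:

void MyType::Update(int* __restrict values, size_t count)
{
    // The qualifier promises that 'values' is not an alias of '_total' (reached
    // through 'this'), so the compiler is free to keep the sum in a register
    // and write it back once.
    for (size_t i = 0; i < count; ++i)
    {
        _total += values[i];
    }
}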