🗄

Memory

CategoryMemory

Overview

Memory is an important resource in a game.

Games use large amounts of memory, and memory allocation has a performance cost.

Carefully planning the memory allocation strategy is important when designing a system.

Some languages provide automatic memory management using different strategies: garbage collection (GC), automatic reference counting (ARC), resource acquisition is initialization (RAII).

History

On 16-bit architectures, the memory was segmented and there were near and far pointers. Therefore, on 16-bit Windows, the heap was divided into a local heap and a global heap.

Starting from 32-bit architectures, the heap is backed by a paged memory model. The stack and heap are both backed by pages of memory managed by the operating system.

STL Allocator

The STL allocator was originally designed to work on 16-bit architectures and abstract the near and far pointers.

STL allocators no longer have any reference to near and far pointers, but their original purpose is now gone.

❌STL allocators were long assumed to be stateless, which made the implementation of stateful custom allocators impractical (C++11 relaxed this restriction).

❌STL allocators must be supplied as a template parameter, which makes the allocator part of the container's type.
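Since C++11, the minimal allocator requirements permit per-instance state. A sketch of a hypothetical counting allocator (the name and the counter mechanism are illustrative, not a standard facility):

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical stateful allocator: forwards to operator new/delete and
// records how many bytes it handed out, which requires per-instance state.
template <typename T>
struct CountingAllocator
{
    using value_type = T;

    std::size_t* allocated;  // per-instance state

    explicit CountingAllocator(std::size_t* counter) : allocated(counter) {}

    template <typename U>
    CountingAllocator(const CountingAllocator<U>& other) : allocated(other.allocated) {}

    T* allocate(std::size_t n)
    {
        *allocated += n * sizeof(T);
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }

    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};

template <typename T, typename U>
bool operator==(const CountingAllocator<T>& a, const CountingAllocator<U>& b)
{
    return a.allocated == b.allocated;
}

template <typename T, typename U>
bool operator!=(const CountingAllocator<T>& a, const CountingAllocator<U>& b)
{
    return !(a == b);
}
```

It can then be used as `std::vector<int, CountingAllocator<int>> v(alloc);`, with the counter tracking every allocation made by the container.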

Virtual Memory

Each process has its own virtual address space (or logical address space). All threads of a process can access its virtual address space. However, a process cannot access memory that belongs to another process unless it is shared.

A virtual address does not represent the actual physical location in memory.

The operating system maintains a page table for each process that maps virtual addresses into their corresponding physical addresses.

The virtual address space can be smaller or larger than the total physical memory available on the system.

The virtual address space is organized into uniformly-sized chunks of memory called pages.

The size of a page depends on the system.

On desktop platforms, the system can perform demand paging by moving (swapping) pages of physical memory to and from a paging file on disk (backing store).

On mobile and on consoles, there is no demand paging.

⚠️
  • Paging affects performance negatively and should be avoided on desktop platforms.
  • Allocations smaller than the size of a virtual memory page are not page-aligned.
  • Large allocations are guaranteed to be page-aligned.

The advantage to using virtual memory addresses, backed by pages of RAM, is that available RAM can be mapped into a contiguous virtual address range, even if the RAM itself is fragmented throughout physical memory.

Heap Memory

A process can create and access a private heap, which is a block of one or more pages of virtual memory in the address space of the process.

On most platforms, the heap is designed and optimized for the allocation and deallocation of small blocks of memory (smaller than the system page size).

Heaps allocate virtual memory and manage smaller allocations and deallocations within that section of virtual memory.

💡
The memory-management routines in the CRT libraries are built on top of the platform-specific heap functions. It is more efficient to call the heap functions directly.

Memory Types

There are different kinds of memory in the hardware of a system.

At the lowest level, the processor performs instructions using its registers and uses its cache memory to reduce accesses to the main memory.

Access to the main memory is performed through the bus, which is a very slow operation compared to access to the cache memory.

The main memory is managed by the operating system as paged memory. The system allocates pages of memory during the lifetime of a program. The main memory is divided into three main blocks: static memory, stack memory and heap memory.

Finally, external storage is the slowest memory but holds permanent data. Accessing disk data requires copying it to the main memory; alternatively, using file-mapped data, the memory can be streamed from the disk to the main memory.


CPU

Registers

The CPU registers are the fastest accessible locations.

The CPU performs operations using its registers.

There are few registers, and they are constantly being overwritten with values from the cache.

Cache

The CPU accesses memory from the cache very efficiently but the cache is very small.

When the data is not in the cache, the data is retrieved from the main memory and copied into the cache. This operation is costly.

There are different levels of cache (typically 3 levels) to reduce access to the main memory.

Data is copied from the main memory to the cache in units of a cache line.

Typically, on a 64-bit CPU, the cache line is 64 bytes.

Optimizing the layout of data in main memory for efficient use of cache lines is called improving locality.
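As an illustration, traversal order alone changes how effectively cache lines are used. Both functions below compute the same sum over a row-major matrix, but the first touches memory sequentially while the second strides across rows (function names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Row-major traversal touches memory sequentially, so every byte of each
// cache line loaded from main memory is used before the line is evicted.
long SumRowMajor(const std::vector<long>& m, std::size_t rows, std::size_t cols)
{
    long sum = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c];
    return sum;
}

// Column-major traversal strides by `cols` elements, so on large matrices
// only one element per loaded cache line is used, causing many more misses.
long SumColumnMajor(const std::vector<long>& m, std::size_t rows, std::size_t cols)
{
    long sum = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            sum += m[r * cols + c];
    return sum;
}
```

On a matrix that does not fit in cache, the row-major version is typically several times faster despite performing the same arithmetic.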

Static Memory

A process usually has in its address space the machine code and the program data.

The static memory is part of the executable code of the program and is allocated in main memory.

Static variables are stored in the data segments of the executable image (initialized data and zero-initialized data).

Static variables persist for the lifetime of the program.

✔️Allocated automatically when the program starts.

❌Persists for the lifetime of the program even when not needed.

Stack Memory

The stack memory is a region in main memory where data is added or removed in a last-in-first-out manner.

Each thread has its own reserved region of stack memory.

Stack memory is allocated and deallocated automatically when functions are called and return.

Allocating more memory on the stack than is available can result in a crash due to stack overflow.

✔️Run-time dynamic allocation from the stack is possible with the alloca() function.

❌Cannot persist across multiple function calls.

❌The alloca() function is platform- and compiler-dependent and its use is discouraged.

Dynamic Memory

When applications need more memory, they can request a block of memory from the operating system. The memory is allocated in the heap.

The heap memory is managed using virtual memory.

To dynamically allocate memory in a program, virtual memory pages are requested from the operating system. Alternatively, the heap memory can be allocated directly and the operating system will manage the virtual pages.

Fragmentation

Multiple allocations and deallocations over time create memory fragmentation.

Fragmentation occurs when there are many small gaps between allocated memory blocks, which prevent their use for another allocation request.

Heap

The heap memory (or free store) is stored in the RAM and is slower to access than the CPU cache.

✔️The size of the memory is dynamic at runtime.

✔️Custom allocation strategies can be implemented.

❌Creates memory fragmentation.

❌Memory leaks are difficult to track.

There are different methods to allocate heap memory.

C runtime

The malloc() function allocates a block of memory on the heap (CRT heap). The program accesses this block of memory via a pointer that the function returns.

When the memory is no longer needed, the pointer is passed to the free() function which deallocates the memory.

Most implementations allocate several pages of virtual memory and divide those pages into smaller arenas of memory to reduce lock contention.

Most implementations also coalesce adjacent free blocks into contiguous regions and place several small allocations on the same page.

ℹThe malloc() function works on raw bytes of memory.

✔️The simplest method to allocate heap memory.

❌The implementation of malloc() and free() is highly system-dependent.

❌Some implementations suffer from lock contention.
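A minimal malloc()/free() round trip as a sketch (DuplicateString is an illustrative helper; the caller owns the returned pointer and must release it exactly once):

```cpp
#include <cstdlib>
#include <cstring>

// malloc() returns raw, uninitialized bytes; the caller is responsible for
// passing the pointer to free() exactly once when the memory is no longer needed.
char* DuplicateString(const char* text)
{
    std::size_t length = std::strlen(text) + 1;  // include the terminator
    char* copy = static_cast<char*>(std::malloc(length));
    if (copy != nullptr)
    {
        std::memcpy(copy, text, length);
    }
    return copy;  // caller passes this to free()
}
```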

C++ runtime

The new and new[] operators typically call the malloc() function, and the delete and delete[] operators call the free() function.

The new operator also calls the constructor of the allocated object type if a constructor exists.

ℹThe new operator works with types instead of raw bytes of memory.

✔️The new and delete operators can be overloaded. The only rule is that the first argument to the new operator must be of type size_t, which is automatically passed by the compiler.

❌The implementation of new and delete is compiler dependent.

❌The new / delete and new[] / delete[] operators are inconsistent.

ℹThe placement new operator doesn't allocate memory but can be called to invoke a constructor on a block of memory.
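A sketch of placement new constructing an object in a pre-existing static buffer (the Point type and helper names are illustrative):

```cpp
#include <new>

struct Point
{
    int x;
    int y;
    Point(int px, int py) : x(px), y(py) {}
};

// A raw, suitably aligned buffer; no Point object is constructed in it yet.
alignas(Point) unsigned char buffer[sizeof(Point)];

Point* ConstructInBuffer(int x, int y)
{
    // Placement new invokes the constructor on existing storage
    // without allocating any memory.
    return new (buffer) Point(x, y);
}

void DestroyInBuffer(Point* p)
{
    // There is no matching "placement delete" for cleanup;
    // the destructor must be called explicitly.
    p->~Point();
}
```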

Windows

On Windows, the malloc() function calls the HeapAlloc() function from the Windows API, and the free() function calls the HeapFree() function.

Each process has a default heap provided by the system. The handle to the default heap is returned by the GetProcessHeap() function.

A private heap can be created by calling the HeapCreate() function. Initially, at least one page is committed. If the maximum size of the heap is not specified, the heap can be resized as needed. Otherwise, the maximum size is rounded up to a multiple of the system page size.

If allocation requests exceed the size of the committed pages, the system commits additional pages of memory for the heap until the maximum size is reached, or the system runs out of physical memory.

A heap is destroyed by calling the HeapDestroy() function. The committed pages are then decommitted and released.

macOS

On macOS, all malloc allocations are zoned allocations. The malloc() function calls the malloc_zone_malloc() function using the default malloc zone, and the free() function calls the malloc_zone_free() function. Malloc zones can be created with malloc_create_zone() and destroyed with malloc_destroy_zone().

Virtual Memory

Operating systems use virtual memory to separate the memory addresses used by a process from actual physical addresses.

Virtual memory is allocated with the granularity of a page (on most platforms the default page size is 4 KB). This means that smaller allocations (for example, 10 bytes) still consume a whole page (4096 bytes).

Pages of virtual memory are reserved by a process to restrict their use from other processes. The physical memory is only allocated when the process accesses that memory. On Windows, it's necessary to commit the pages before accessing them.

Managing pages of virtual memory manually allows us to use specific debugging techniques such as:

Protected Pages

Pages can have memory-protection options to enable read-only access or write-only access; or disable all access to the page.

Protected pages are used to monitor invalid access to freed memory.

  1. Allocate the block of memory at the end of a page.
  2. Reserve and protect the following page.
void* AllocateAtEndOfPage(size_t size)
{
  size_t pages = (size + PageSize - 1) / PageSize;
  // Allocate the payload pages plus one extra page to protect.
  char* address = (char*)VirtualAlloc(nullptr, (pages + 1) * PageSize,
                                      MEM_COMMIT, PAGE_READWRITE);
  // Revoke access to the extra page; any overrun past the block now faults.
  DWORD oldProtect;
  VirtualProtect(address + pages * PageSize, PageSize, PAGE_NOACCESS, &oldProtect);
  // Place the block so that it ends exactly at the protected page.
  size_t offset = (pages * PageSize) - size;
  return address + offset;
}
⚠️
This allocation model will waste a page size for each allocation.
⚠️
It only works for fixed-size allocations.
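A POSIX sketch of the same technique using mmap() and mprotect() (the function name is illustrative and error handling is minimal):

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Place the allocated block at the end of its pages and make the following
// page inaccessible, so any read or write past the block faults immediately.
void* AllocateAtEndOfPagePosix(std::size_t size)
{
    std::size_t pageSize = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    std::size_t pages = (size + pageSize - 1) / pageSize;

    // Map the payload pages plus one trailing page.
    char* base = static_cast<char*>(mmap(nullptr, (pages + 1) * pageSize,
                                         PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (base == MAP_FAILED)
        return nullptr;

    // Revoke all access to the trailing page.
    mprotect(base + pages * pageSize, pageSize, PROT_NONE);

    // The block ends exactly at the protected page.
    return base + pages * pageSize - size;
}
```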

Guard Pages

Guard pages are used to monitor the growth of large dynamic data structures.

When accessing an address within a guard page, the system raises a page fault.

Windows

If the requested size is not a multiple of the system page size, it is rounded up to the next page boundary.

  • The MEM_COMMIT flag indicates that the memory is immediately committed to physical memory.
  • It's possible to allocate a large quantity of memory with the MEM_RESERVE flag to indicate that the memory is only reserved but not yet committed. VirtualAlloc can be called later on with the MEM_COMMIT flag and a specific address range to commit part of the reserved memory.

Additionally, VirtualLock enables a process to lock one or more pages of committed memory into physical memory, preventing the system from swapping the pages out to secondary storage.

To determine the size of a page on the system, call GetSystemInfo and inspect the value of SYSTEM_INFO.dwPageSize.

void* Allocate(size_t size)
{
    // VirtualAlloc returns nullptr on failure.
    return VirtualAlloc(nullptr,
                        size,
                        MEM_RESERVE | MEM_COMMIT,
                        PAGE_READWRITE);
}
bool Deallocate(void* address)
{
    return VirtualFree((LPVOID)address,
                       0,
                       MEM_RELEASE);
}

macOS

The requested size must be a multiple of a virtual page size.

To determine the size of a page on the system, use the host_page_size() function and inspect the value of page_size.

ℹ️macOS doesn’t distinguish reserve and commit operations.

void* Allocate(size_t size)
{
    void* address;
    kern_return_t error;
  
    error = vm_allocate((vm_map_t)mach_task_self(),
                        (vm_address_t*)&address,
                        size,
                        VM_FLAGS_ANYWHERE);
 
    if (error != KERN_SUCCESS)
    {
        address = nullptr;
    }
 
    return address;
}
bool Deallocate(void* address, size_t size)
{
    kern_return_t error;
  
    error = vm_deallocate((vm_map_t)mach_task_self(),
                          (vm_address_t*)address,
                           size);
 
    if (error != KERN_SUCCESS)
    {
        return false;
    }
 
    return true;
}

Linux

The size of a page is obtained with getpagesize() or sysconf(_SC_PAGESIZE).

ℹ️Linux doesn’t distinguish reserve and commit operations.

#include <sys/mman.h>

void* Allocate(size_t size)
{
    void* address;
  
    address = mmap(nullptr,
                   size,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS,
                   -1,
                   0);
 
    if (address == MAP_FAILED)
    {
        address = nullptr;
    }
 
    return address;
}
bool Deallocate(void* address, size_t size)
{  
    int result = munmap(address, size);
 
    if (result != 0)
    {
        return false;
    }

    return true;
}

Write-Combined Memory

Write-combined memory is a type of non-cacheable memory where writes bypass the CPU caches and are written directly to main memory.

💡
Useful for data destined for the GPU that will not be read by the CPU.
🎮
On the Xbox 360, use the XPhysicalAlloc function to create write-combined memory heaps.

File Mapping

File mapping is the association of a file with a portion of the virtual address space of a process.

It also allows the process to work efficiently with a large data file, without having to map the whole file into memory.

File mapping can be used for inter-process communication, and can be a solution for the communication between a runtime (in dev mode) and an editor.

Alignment

Memory alignment means storing data at a memory address that is a multiple of its alignment requirement (often the word size of the processor).

❌Some processors cannot access unaligned memory.

⚠Modern processors can access unaligned memory, but at a performance cost, as the processor may have to read multiple words and combine them.

Heap allocations are guaranteed to be aligned to the fundamental alignment (alignof(std::max_align_t)), typically 8 bytes on 32-bit platforms and 16 bytes on 64-bit platforms.

For larger alignment requirements, platform-specific functions can be used, or manual alignment is required.

An alignment is specified as a value of type std::size_t.

An alignment is expressed as a positive power of two; alignments larger than the fundamental alignment are called extended alignments.

When a block of memory is allocated with a specific alignment, the allocated size can be larger than the requested size. The additional bytes are called padding bytes.
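One way to satisfy a larger alignment manually is to over-allocate, align the result, and stash the original pointer in the padding bytes just before the aligned block. A sketch (function names are illustrative):

```cpp
#include <cstdlib>
#include <cstdint>
#include <cstddef>

// Over-allocate so there is always room to both align the result and store
// the original pointer immediately before the aligned block.
void* AlignedAllocate(std::size_t size, std::size_t alignment)
{
    void* raw = std::malloc(size + alignment + sizeof(void*));
    if (raw == nullptr)
        return nullptr;

    // Leave room for the stored pointer, then round up to the alignment.
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t aligned = (start + (alignment - 1)) & ~(std::uintptr_t)(alignment - 1);

    // Remember the original pointer just before the aligned block.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

void AlignedFree(void* p)
{
    if (p != nullptr)
        std::free(reinterpret_cast<void**>(p)[-1]);
}
```

C11 provides aligned_alloc() and C++17 provides aligned operator new, which make this manual scheme unnecessary on recent toolchains.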

Power of two

Numbers which are powers of two have one and only one bit set in their binary representation.

So if x is a power of two then x & (x-1) will be 0.

bool IsPowerOfTwo(std::uintptr_t x)
{
	return (x != 0) && ((x & (x-1)) == 0);
}

Align an address

The memory address must be a multiple of the specified alignment.

To align a memory address to a specified alignment, we use modulo arithmetic.

As the alignment is a power of two, the modulo ptr % alignment can be replaced by ptr & (alignment - 1).

void* Align(std::uintptr_t ptr, std::size_t alignment)
{
  // Round up to the next multiple of alignment (alignment must be a power of two).
  return (void*)((ptr + (alignment - 1)) & ~(std::uintptr_t)(alignment - 1));
}
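A few sanity checks on the alignment arithmetic (the helper is restated here so the snippet stands alone):

```cpp
#include <cstdint>
#include <cstddef>

// Round ptr up to the next multiple of alignment (a power of two).
void* Align(std::uintptr_t ptr, std::size_t alignment)
{
    return (void*)((ptr + (alignment - 1)) & ~(std::uintptr_t)(alignment - 1));
}
```

For example, aligning address 13 to 8 yields 16, while an already-aligned address such as 16 is left unchanged.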

Structure

The C++ standard guarantees that the members of a class or struct appear in memory in the same order as they are declared.

When a structure is allocated, padding bytes are inserted between member fields to ensure that each member is properly aligned.

The size of a structure is a multiple of its alignment; ordering members carefully minimizes the memory wasted by padding.

By default, a structure is aligned to the alignment of its most strictly aligned member.

Each member is aligned to respect the natural alignment of its type.

The natural alignment of a fundamental type typically corresponds to its size in memory.


struct Vertex
{
  uint16_t a; //0x0000+2 (+2 padding)
  uint32_t b; //0x0004+4
  uint8_t  c; //0x0008+1 (+3 padding)
};

The size of Vertex is 12 bytes.

The offset of a member can be inspected using the offsetof macro.

offsetof(Vertex, a)

We can reorder the members in the structure to reduce the padding.

struct Vertex
{
  uint32_t b; //0x0000+4
  uint16_t a; //0x0004+2
  uint8_t  c; //0x0006+1 (+1 padding)
};

The size of Vertex is 8 bytes.
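The reordered layout can be verified at compile time with static_assert and the offsetof macro (the struct is renamed here so the snippet stands alone; the offsets assume a typical ABI):

```cpp
#include <cstddef>
#include <cstdint>

// Same member order as the reordered Vertex above.
struct VertexReordered
{
    uint32_t b;
    uint16_t a;
    uint8_t  c;
};

// Compile-time checks of the layout claims.
static_assert(offsetof(VertexReordered, a) == 4, "a follows b with no padding");
static_assert(offsetof(VertexReordered, c) == 6, "c follows a with no padding");
static_assert(sizeof(VertexReordered) == 8, "one trailing padding byte after c");
```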


To force the compiler to remove any padding, a keyword can be specified in the declaration of the struct.

⚠Unfortunately, it comes at a performance cost, as the compiler generates extra instructions to perform the memory accesses in a way that does not cause an unaligned access violation (typically by reading parts of the data from two memory addresses and combining them).

#pragma pack(push, 1)
struct Vertex
{
  uint32_t b; //0x0000+4
  uint16_t a; //0x0004+2
  uint8_t  c; //0x0006+1
};
#pragma pack(pop)
struct __attribute__((packed)) Vertex
{
  uint32_t b; //0x0000+4
  uint16_t a; //0x0004+2
  uint8_t  c; //0x0006+1
};

The size of Vertex is 7 bytes.

Alignment

Aligning data structures can improve performance by ensuring that cache lines are properly used.

C++11 provides the alignof operator to determine the alignment of a specified type.

alignof(Vertex)

The alignment of the Vertex structure would be 4 as its largest member is of type uint32_t.


The C++11 alignas specifier can be used in the declaration of the structure to force a specific alignment.

struct alignas(16) Vertex
{
  uint32_t b;
  uint16_t a;
  uint8_t  c;
};

The alignment of Vertex is now 16.
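The effect can be checked with alignof and sizeof (the struct is renamed here so the snippet stands alone; note that the size is also padded up to the forced alignment):

```cpp
#include <cstdint>

// Same members as Vertex above, with the alignment forced to 16.
struct alignas(16) VertexAligned16
{
    uint32_t b;
    uint16_t a;
    uint8_t  c;
};

static_assert(alignof(VertexAligned16) == 16, "alignment forced to 16");
static_assert(sizeof(VertexAligned16) == 16, "size padded up to the alignment");
```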

Relocatable Heap

A relocatable heap helps with fragmentation by moving the allocated blocks in memory to remove the holes.

⚠️When a block of memory is moved, all the pointers to that block of memory become invalid.

To support a relocatable heap, it's necessary to have a handle system instead of using pointers directly.

A relocatable heap is not practical as the only method of avoiding fragmentation, but can be a good solution in some cases. For example, scripted objects usually work with handles instead of pointers to support hot-reload.
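A sketch of a hypothetical handle table: callers store small handles and resolve them on every access, so the heap is free to move blocks as long as it patches the table (all names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// An opaque index into the table; stable even when blocks move.
struct Handle
{
    std::uint32_t index;
};

class HandleTable
{
public:
    // Register a block and hand back a stable handle to it.
    Handle Register(void* block)
    {
        m_pointers.push_back(block);
        return Handle{ static_cast<std::uint32_t>(m_pointers.size() - 1) };
    }

    // Called by the heap after compaction moves a block.
    void Relocate(Handle h, void* newAddress)
    {
        m_pointers[h.index] = newAddress;
    }

    // Callers resolve the handle on each access instead of caching pointers.
    void* Resolve(Handle h) const
    {
        return m_pointers[h.index];
    }

private:
    std::vector<void*> m_pointers;
};
```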

Garbage Collection

Garbage collection is a strategy for automatically detecting memory allocated to objects that are no longer in use.

Manual memory management enables several types of bugs in a program: memory leaks, dangling pointers (use-after-free), and double frees.

Garbage collection provides a solution for these bugs, though memory leaks can still occur (for example, through unintentionally retained references).

However, garbage collection also suffers from other issues: collection pauses are unpredictable, and the collector adds CPU and memory overhead.
