Article 2: Metal 4 Memory Mastery — Explicit Resource Management for Maximum Performance

Part 2 of the Metal 4 Optimization Series

Metal 4’s most profound architectural shift lies in memory management. Where previous Metal versions handled residency and synchronization implicitly, Metal 4 places these responsibilities squarely in developer hands. This explicit control model—mirroring DirectX 12 and Vulkan—unlocks significant performance gains while demanding deeper understanding of GPU memory hierarchies.

This article dissects Metal 4’s memory architecture from command allocators through placement sparse resources, providing production-ready patterns for high-performance applications.

The philosophy behind explicit memory control

Apple’s decision to require explicit memory management stems from observing modern application patterns. Games and professional applications now manage thousands of resources per scene—a dramatic increase from the single-texture-per-object days. Implicit tracking overhead scales poorly with resource count, and the CPU cycles spent determining residency and synchronization become measurable bottlenecks.

Metal 4’s explicit model inverts this relationship. Applications declare resource requirements upfront, typically at load time, then incur minimal per-frame overhead. The trade-off is clear: more setup complexity in exchange for consistently lower runtime cost.

Consider a typical game scenario loading an open-world environment. Under Metal 3’s implicit model, the runtime continuously tracked which resources each command buffer accessed, maintaining internal data structures and performing validation. Under Metal 4, the application populates residency sets once during level load, attaches them to command queues, and pays nearly zero per-frame cost for resource management.

Command allocators: taking control of command memory

Metal 4 introduces MTL4CommandAllocator as the mechanism for explicit command buffer memory management. Unlike Metal 3, where command buffers obtained memory from queue-managed pools, Metal 4 requires applications to create and manage allocators explicitly.

The allocator lifecycle follows a straightforward pattern:

// Create allocator descriptor
let allocatorDescriptor = MTL4CommandAllocatorDescriptor()
allocatorDescriptor.heapSize = 16 * 1024 * 1024  // 16 MB command heap

// Create the allocator from the device
let commandAllocator = try device.makeCommandAllocator(descriptor: allocatorDescriptor)

// Create command buffer from allocator
let commandBufferDescriptor = MTL4CommandBufferDescriptor()
let commandBuffer = try commandAllocator.makeCommandBuffer(descriptor: commandBufferDescriptor)

The critical insight is that command allocators are reusable after GPU completion. A typical triple-buffered renderer maintains three allocators, resetting each after its corresponding frame completes:

class FrameResources {
    let allocator: MTL4CommandAllocator
    let event: MTLSharedEvent   // timeline event; MTLFence carries no 64-bit value
    var frameIndex: UInt64 = 0

    init(allocator: MTL4CommandAllocator, event: MTLSharedEvent) {
        self.allocator = allocator
        self.event = event
    }
}

// Per-frame pattern
func renderFrame(frameResources: FrameResources) throws {
    // Wait for this allocator's previous work to complete
    commandQueue.wait(for: frameResources.event, value: frameResources.frameIndex)

    // Reset allocator - reclaims all command buffer memory
    frameResources.allocator.reset()

    // Create new command buffer from reset allocator
    let commandBuffer = try frameResources.allocator.makeCommandBuffer(descriptor: descriptor)

    // Encode and commit work...

    // Signal completion
    frameResources.frameIndex += 1
    commandQueue.signal(frameResources.event, value: frameResources.frameIndex)
}

Sizing the allocator heap requires profiling. Start with 16-64 MB for typical rendering workloads, monitoring for allocation failures. The allocator will request additional system memory if the initial heap proves insufficient, but properly sized allocators avoid this runtime cost.

Residency sets: the foundation of GPU resource access

Residency sets represent Metal 4’s most fundamental memory management primitive. Every resource the GPU accesses must be contained in an attached residency set—there is no implicit residency in Metal 4.

Creating and populating a residency set follows a two-phase pattern: add resources, then commit:

// Create residency set with estimated capacity
let residencyDescriptor = MTLResidencySetDescriptor()
residencyDescriptor.initialCapacity = 256

let residencySet = try device.makeResidencySet(descriptor: residencyDescriptor)

// Phase 1: Add resources
residencySet.addAllocation(diffuseTexture)
residencySet.addAllocation(normalTexture)
residencySet.addAllocation(vertexBuffer)
residencySet.addAllocations([material1, material2, material3])

// Phase 2: Commit makes resources available
residencySet.commit()

// Attach to command queue for all commands
commandQueue.addResidencySet(residencySet)

The commit() call is essential—resources added without committing remain invisible to the GPU. This two-phase design enables efficient batch updates: accumulate many additions, then commit once.

Residency set attachment strategies

Metal 4 offers two attachment points for residency sets: command queues and individual command buffers. The choice depends on resource access patterns:

Queue-attached residency sets suit stable resources accessed across many frames:

// Attach once at initialization
commandQueue.addResidencySet(sceneResidencySet)
commandQueue.addResidencySet(globalTexturesSet)

// All command buffers submitted to this queue inherit these sets

Queue attachment incurs zero per-frame cost for stable resources. Remedy Entertainment’s Control Ultimate Edition team specifically highlighted this benefit: by separating resources into different residency sets based on usage patterns, they achieved significant reductions in residency management overhead.

Buffer-attached residency sets suit frame-specific resources:

// Per-frame resources attached to individual command buffers
commandBuffer.useResidencySet(frameSpecificResources)
commandBuffer.useResidencySets([decalSet, particleSet])

Use buffer attachment for resources that change frequently or are specific to particular rendering passes.
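
A common hybrid keeps stable sets on the queue and rebuilds one transient set per frame. The sketch below assumes illustrative names (transientSet, shadowAtlas) and the attachment calls shown above:

// Hypothetical per-frame setup: the queue already holds the stable sets
func attachTransientResources(commandBuffer: MTL4CommandBuffer,
                              transientSet: MTLResidencySet,
                              shadowAtlas: MTLTexture) {
    // Rebuild the transient set with this frame's short-lived resources
    transientSet.removeAllAllocations()
    transientSet.addAllocation(shadowAtlas)
    transientSet.commit()

    // Attach to this command buffer only; the queue's sets stay untouched
    commandBuffer.useResidencySet(transientSet)
}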

Residency set updates and streaming

For streaming scenarios where resources load and unload continuously, update residency sets on background threads:

// Background streaming thread
func onAssetLoaded(texture: MTLTexture) {
    streamingResidencySet.addAllocation(texture)
    streamingResidencySet.commit()
    // Set is already attached to queue - no additional work needed
}

func onAssetUnloaded(texture: MTLTexture) {
    streamingResidencySet.removeAllocation(texture)
    streamingResidencySet.commit()
}

Resources only become non-resident when they exist in no attached residency set. The union of all attached sets determines the resident resource collection. This enables overlapping sets for different purposes without concern for duplicate entries.
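
For example, a texture referenced by two attached sets stays resident until it leaves both. A minimal sketch, assuming levelSet and uiSet are both attached to the queue:

// sharedIcon appears in two attached residency sets
levelSet.addAllocation(sharedIcon)
uiSet.addAllocation(sharedIcon)
levelSet.commit()
uiSet.commit()

// Removing it from one set does not evict it...
levelSet.removeAllocation(sharedIcon)
levelSet.commit()
// ...sharedIcon remains resident through uiSet

// Residency ends only when the last containing set drops it
uiSet.removeAllocation(sharedIcon)
uiSet.commit()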

Argument tables: modern resource binding

Metal 4’s MTL4ArgumentTable replaces the implicit argument table mechanism with explicit objects. Argument tables store bindings to resources, enabling efficient access patterns that scale to thousands of resources.

Create argument tables with explicit capacity for each resource type:

let argumentDescriptor = MTL4ArgumentTableDescriptor()
argumentDescriptor.maxBufferBindCount = 16
argumentDescriptor.maxTextureBindCount = 32
argumentDescriptor.maxSamplerBindCount = 8

let argumentTable = try device.makeArgumentTable(descriptor: argumentDescriptor)

Binding resources uses GPU virtual addresses rather than object references:

// Bind texture using gpuResourceID
argumentTable.setTexture(albedoTexture.gpuResourceID, index: 0)
argumentTable.setTexture(normalTexture.gpuResourceID, index: 1)

// Bind buffer using gpuAddress
argumentTable.setAddress(uniformBuffer.gpuAddress, index: 0)

// Bind buffer at offset - critical for shared buffers
let instanceOffset = instanceIndex * MemoryLayout<InstanceData>.stride
argumentTable.setAddress(instanceBuffer.gpuAddress + UInt64(instanceOffset), index: 1)

The offset arithmetic pattern (gpuAddress + offset) enables efficient sub-buffer addressing without creating buffer views. This is essential for instancing, streaming buffers, and ring-buffer patterns.
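
As a sketch of the instancing case, each draw rebinds the same table slot at a new offset into one shared buffer. The setArgumentTable call and the InstanceData type are assumptions here:

// Hypothetical per-instance draw loop over one shared buffer
let stride = UInt64(MemoryLayout<InstanceData>.stride)
for instanceIndex in 0..<instanceCount {
    argumentTable.setAddress(instanceBuffer.gpuAddress + UInt64(instanceIndex) * stride,
                             index: 1)
    renderEncoder.setArgumentTable(argumentTable, stages: .vertex)
    renderEncoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: vertexCount)
}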

Bindless rendering with argument tables

For true bindless rendering, argument tables typically need just one buffer binding that points to your resource hierarchy:

// Scene descriptor buffer contains all resource references
struct SceneDescriptor {
    var meshes: UInt64      // gpuAddress of mesh array
    var materials: UInt64   // gpuAddress of material array
    var textures: UInt64    // gpuAddress of texture ID array
    var lights: UInt64      // gpuAddress of light array
}

// Single binding provides access to entire scene
let bindlessDescriptor = MTL4ArgumentTableDescriptor()
bindlessDescriptor.maxBufferBindCount = 1   // capacity is set on the descriptor, not the table

let bindlessTable = try device.makeArgumentTable(descriptor: bindlessDescriptor)
bindlessTable.setAddress(sceneDescriptorBuffer.gpuAddress, index: 0)

The shader then navigates this hierarchy using pointer arithmetic:

struct SceneDescriptor {
    constant MeshData* meshes;
    constant MaterialData* materials;
    constant texture2d<float>* textures;  // texture handles stored in the buffer
    constant LightData* lights;
};

fragment float4 fragment_main(
    VertexOut in [[stage_in]],            // illustrative: carries uv and meshIndex
    constant SceneDescriptor& scene [[buffer(0)]])
{
    constexpr sampler linearSampler(filter::linear, address::repeat);

    // Access any mesh, material, texture, or light by index
    MeshData mesh = scene.meshes[in.meshIndex];
    MaterialData material = scene.materials[mesh.materialIndex];

    // Sample the texture referenced by the material
    float4 albedo = scene.textures[material.albedoIndex].sample(linearSampler, in.uv);
    return albedo;
}

This pattern eliminates per-draw binding overhead entirely—bind once, draw everything.
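
Populating the descriptor buffer from the CPU is a single copy. A sketch, assuming shared storage and a Swift struct whose layout matches the shader-side SceneDescriptor:

// CPU mirror of the shader struct (field order must match)
struct SceneDescriptorCPU {
    var meshes: UInt64
    var materials: UInt64
    var textures: UInt64
    var lights: UInt64
}

var sceneDesc = SceneDescriptorCPU(
    meshes: meshBuffer.gpuAddress,
    materials: materialBuffer.gpuAddress,
    textures: textureIDBuffer.gpuAddress,
    lights: lightBuffer.gpuAddress
)

// Write into the buffer bound at buffer(0)
sceneDescriptorBuffer.contents()
    .copyMemory(from: &sceneDesc, byteCount: MemoryLayout<SceneDescriptorCPU>.stride)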

Placement sparse resources: fine-grained memory control

Metal 4 introduces placement sparse resources for applications requiring dynamic memory control beyond standard allocation. These resources allocate virtual address space without physical backing initially, with memory pages mapped on demand from placement heaps.

The use case is massive streaming: open-world games with gigabytes of texture data where only visible regions require physical memory.

// Create placement heap
let heapDescriptor = MTLHeapDescriptor()
heapDescriptor.type = .placement
heapDescriptor.size = 512 * 1024 * 1024  // 512 MB placement heap
heapDescriptor.storageMode = .private

let placementHeap = try device.makeHeap(descriptor: heapDescriptor)

// Create sparse texture (virtual allocation only)
let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .bc7_rgbaUnorm,
    width: 16384,
    height: 16384,
    mipmapped: true
)
textureDescriptor.resourceOptions = [.storageModePrivate]
// Placement sparse textures opt in through a page size; Metal 4 resources
// are untracked, with hazards handled by explicit barriers
textureDescriptor.placementSparsePageSize = .size64
textureDescriptor.hazardTrackingMode = .untracked

let sparseTexture = device.makeTexture(descriptor: textureDescriptor)!

Map and unmap memory regions using resource state encoders:

// Map visible mip levels and tiles
let stateEncoder = commandBuffer.makeResourceStateCommandEncoder()

// Map specific tile region
let tileRegion = MTLRegion(origin: MTLOrigin(x: 0, y: 0, z: 0),
                           size: MTLSize(width: 4, height: 4, depth: 1))
stateEncoder.updateTextureMapping(sparseTexture,
                                   mode: .map,
                                   region: tileRegion,
                                   mipLevel: 0,
                                   slice: 0)

// Unmap region no longer visible
stateEncoder.updateTextureMapping(sparseTexture,
                                   mode: .unmap,
                                   region: distantRegion,
                                   mipLevel: 0,
                                   slice: 0)

stateEncoder.endEncoding()

Apple’s sample code demonstrates 15% texture memory reduction using sparse textures in realistic rendering scenarios. For streaming systems, the savings compound as camera movement exposes new regions while discarding distant ones.
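
A streaming system typically batches a frame's mapping changes into one encoder. A sketch, assuming the visibility logic elsewhere produces the two region lists:

// Apply this frame's sparse mapping changes in a single pass
func updateSparseMappings(commandBuffer: MTLCommandBuffer,
                          newlyVisible: [MTLRegion],
                          newlyHidden: [MTLRegion]) {
    guard let encoder = commandBuffer.makeResourceStateCommandEncoder() else { return }

    for region in newlyVisible {
        encoder.updateTextureMapping(sparseTexture, mode: .map,
                                     region: region, mipLevel: 0, slice: 0)
    }
    for region in newlyHidden {
        encoder.updateTextureMapping(sparseTexture, mode: .unmap,
                                     region: region, mipLevel: 0, slice: 0)
    }
    encoder.endEncoding()
}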

Barrier API: explicit synchronization

Metal 4’s barrier API provides stage-to-stage synchronization with explicit scope control. Unlike implicit hazard tracking, barriers require developers to specify exactly which operations must complete before others begin.

Queue barriers synchronize work across encoders on the same queue:

// Compute pass writes to texture
let computeEncoder = commandBuffer.makeComputeCommandEncoder()
computeEncoder.setComputePipelineState(blurPipeline)
computeEncoder.setTexture(inputTexture, index: 0)
computeEncoder.setTexture(outputTexture, index: 1)
computeEncoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadsPerGroup)
computeEncoder.endEncoding()

// Queue barrier ensures compute completes before render reads
commandQueue.barrier(
    resources: [outputTexture],
    afterStages: .dispatchStage,
    beforeStages: .fragmentStage
)

// Render pass reads computed texture
let renderEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDescriptor)
renderEncoder.setFragmentTexture(outputTexture, index: 0)
// ... render
renderEncoder.endEncoding()

Pass barriers synchronize within a single encoder:

let renderEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: descriptor)

// First draw writes to buffer
renderEncoder.setVertexBuffer(outputBuffer, offset: 0, index: 0)
renderEncoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: vertexCount)

// Pass barrier ensures write completes
renderEncoder.memoryBarrier(
    resources: [outputBuffer],
    afterStages: .vertexStage,
    beforeStages: .fragmentStage
)

// Second draw reads from buffer
renderEncoder.setFragmentBuffer(outputBuffer, offset: 0, index: 0)
renderEncoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: secondVertexCount)

renderEncoder.endEncoding()

Stage identifiers map to GPU pipeline stages:

  • .vertexStage - Vertex shader execution
  • .fragmentStage - Fragment shader execution
  • .dispatchStage - Compute shader execution
  • .meshStage - Mesh shader execution
  • .objectStage - Object shader execution
  • .tileStage - Tile shader execution
  • .machineLearningStage - ML encoder execution

Correct barrier placement requires understanding data flow. A compute pass writing to a texture that a fragment shader samples requires a barrier from .dispatchStage to .fragmentStage. Missing barriers produce undefined results—typically visual corruption or GPU hangs.

Memory pooling patterns for production

Production applications rarely allocate resources individually. Memory pooling amortizes allocation cost and enables efficient reuse patterns.

Ring buffer pattern for dynamic data

Per-frame uniform data fits naturally into ring buffers:

class UniformRingBuffer {
    let buffer: MTLBuffer
    let alignment: Int
    let frameCount: Int
    var currentFrame: Int = 0
    var currentOffset: Int = 0

    init(device: MTLDevice, frameCapacity: Int, frameCount: Int = 3) {
        self.alignment = 256  // Metal uniform buffer alignment
        self.frameCount = frameCount

        let alignedFrameSize = (frameCapacity + alignment - 1) & ~(alignment - 1)
        let totalSize = alignedFrameSize * frameCount

        self.buffer = device.makeBuffer(length: totalSize, options: .storageModeShared)!
    }

    func beginFrame() {
        currentFrame = (currentFrame + 1) % frameCount
        currentOffset = currentFrame * (buffer.length / frameCount)
    }

    func allocate<T>(count: Int) -> (offset: Int, pointer: UnsafeMutablePointer<T>) {
        let size = MemoryLayout<T>.stride * count
        let alignedSize = (size + alignment - 1) & ~(alignment - 1)

        let offset = currentOffset
        currentOffset += alignedSize
        precondition(currentOffset <= (currentFrame + 1) * (buffer.length / frameCount),
                     "Frame allocation exceeded its ring buffer region")

        let pointer = (buffer.contents() + offset).bindMemory(to: T.self, capacity: count)
        return (offset, pointer)
    }
}
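
Per-frame usage pairs an allocation with an argument table binding. The CameraUniforms struct here is illustrative:

struct CameraUniforms {
    var viewProjection: simd_float4x4
}

// At the start of each frame
uniformRing.beginFrame()

let slice: (offset: Int, pointer: UnsafeMutablePointer<CameraUniforms>) =
    uniformRing.allocate(count: 1)
slice.pointer.pointee = CameraUniforms(viewProjection: viewProjectionMatrix)

// Bind the slice by GPU address plus offset, as shown earlier
argumentTable.setAddress(uniformRing.buffer.gpuAddress + UInt64(slice.offset), index: 0)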

Heap suballocation for textures

Texture pooling using heaps provides O(1) allocation:

class TexturePool {
    let heap: MTLHeap
    var availableTextures: [MTLPixelFormat: [MTLTexture]] = [:]

    init(device: MTLDevice, heapSize: Int) {
        let heapDescriptor = MTLHeapDescriptor()
        heapDescriptor.size = heapSize
        heapDescriptor.storageMode = .private
        heapDescriptor.type = .automatic

        self.heap = device.makeHeap(descriptor: heapDescriptor)!
    }

    func acquireTexture(width: Int, height: Int, format: MTLPixelFormat) -> MTLTexture {
        // Check pool for matching texture
        if var available = availableTextures[format], !available.isEmpty {
            if let index = available.firstIndex(where: { $0.width == width && $0.height == height }) {
                let texture = available.remove(at: index)
                availableTextures[format] = available
                return texture
            }
        }

        // Allocate from heap; transient targets need render-target usage
        let descriptor = MTLTextureDescriptor.texture2DDescriptor(
            pixelFormat: format, width: width, height: height, mipmapped: false)
        descriptor.usage = [.renderTarget, .shaderRead]
        descriptor.storageMode = .private   // must match the heap's storage mode
        return heap.makeTexture(descriptor: descriptor)!
    }

    func releaseTexture(_ texture: MTLTexture) {
        var available = availableTextures[texture.pixelFormat] ?? []
        available.append(texture)
        availableTextures[texture.pixelFormat] = available
    }
}
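
Typical usage brackets a render pass; in production, defer the release until the frame's event signals GPU completion. A sketch:

// Acquire a transient target for a post-processing pass
let bloomTarget = texturePool.acquireTexture(width: 960, height: 540,
                                             format: .rgba16Float)

// ... encode passes that render into and then read bloomTarget ...

// Return it to the pool once the GPU has finished using it
texturePool.releaseTexture(bloomTarget)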

Command allocator pooling

Multiple command allocators enable parallel encoding:

class CommandAllocatorPool {
    private var availableAllocators: [MTL4CommandAllocator] = []
    private var inFlightAllocators: [(allocator: MTL4CommandAllocator, fence: MTLSharedEvent, value: UInt64)] = []
    private let device: MTLDevice

    init(device: MTLDevice) {
        self.device = device
    }

    func acquireAllocator() -> MTL4CommandAllocator {
        // Reclaim completed allocators
        reclaimCompleted()

        if let allocator = availableAllocators.popLast() {
            allocator.reset()
            return allocator
        }

        // Create new allocator
        let descriptor = MTL4CommandAllocatorDescriptor()
        return try! device.makeCommandAllocator(descriptor: descriptor)
    }

    func releaseAllocator(_ allocator: MTL4CommandAllocator, afterEvent event: MTLSharedEvent, value: UInt64) {
        inFlightAllocators.append((allocator, event, value))
    }

    private func reclaimCompleted() {
        inFlightAllocators.removeAll { entry in
            if entry.fence.signaledValue >= entry.value {
                availableAllocators.append(entry.allocator)
                return true
            }
            return false
        }
    }
}
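
Wiring the pool into a frame loop follows the same signal-then-release pattern used earlier. The frameEvent and frameNumber names are assumptions:

// Per-frame usage with a shared event tracking GPU progress
let allocator = allocatorPool.acquireAllocator()
let commandBuffer = try allocator.makeCommandBuffer(descriptor: MTL4CommandBufferDescriptor())

// ... encode and commit work ...

frameNumber += 1
commandQueue.signal(frameEvent, value: frameNumber)
allocatorPool.releaseAllocator(allocator, afterEvent: frameEvent, value: frameNumber)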

Debugging memory issues

Xcode 26’s enhanced Metal Debugger provides comprehensive memory diagnostics for Metal 4:

GPU Frame Capture snapshots all allocated resources, their sizes, and residency status. Look for resources not in any residency set—these will cause GPU faults.

Memory Graph visualizes resource relationships, helping identify leaks where resources remain allocated but unused.

Validation Layer catches common errors:

  • Resources accessed without residency set membership
  • Barriers with incorrect stage specifications
  • Command buffer submission with missing residency sets

Enable validation during development through the Xcode scheme editor (Diagnostics > Metal API Validation), or by setting the MTL_DEBUG_LAYER=1 environment variable when launching outside Xcode.

Runtime validation adds overhead but catches errors that would otherwise manifest as corruption or hangs.

Performance measurement and optimization

Profile memory management overhead using Metal System Trace:

  1. Residency set commit time - Should be minimal for incremental updates
  2. Command allocator reset time - Indicates whether allocator sizing is appropriate
  3. Barrier stall time - Excessive stalls suggest suboptimal barrier placement

Key metrics to monitor:

  • Memory footprint - Total GPU-accessible memory consumption
  • Residency set churn - Frequency of add/remove operations
  • Allocation rate - Resources created per frame (should be near zero in steady state)

Target zero per-frame allocations. All resource creation should occur during loading, with runtime using pooled resources exclusively.
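
A lightweight way to surface these numbers in Instruments is to wrap the hot calls in os_signpost intervals. A minimal sketch using OSSignposter:

import os

let signposter = OSSignposter(subsystem: "com.example.renderer", category: "Memory")

func commitWithSignpost(_ set: MTLResidencySet) {
    // Inspect the interval with the os_signpost instrument
    let state = signposter.beginInterval("residencyCommit")
    set.commit()
    signposter.endInterval("residencyCommit", state)
}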

Migration strategy from Metal 3

Migrate incrementally, leveraging Metal 4’s extension of existing MTLDevice:

  1. Add residency sets first - Create sets containing all resources, attach to queues
  2. Adopt command allocators - Replace queue-created command buffers with allocator-created ones
  3. Convert to argument tables - Replace implicit binding with explicit tables
  4. Add explicit barriers - Identify synchronization points and add barriers
  5. Consider sparse resources - For streaming scenarios, evaluate placement sparse resources

Each step can be adopted independently. Metal 4 APIs extend MTLDevice, enabling Metal 3 and Metal 4 code paths to coexist during transition.

Conclusion

Metal 4’s explicit memory management demands more from developers but delivers measurable performance improvements. Command allocators eliminate per-frame allocation overhead, residency sets provide fine-grained control over GPU-visible resources, and argument tables enable efficient bindless rendering patterns.

The patterns presented here—pooled allocators, stable residency sets, ring buffers for dynamic data—form the foundation for high-performance Metal 4 applications. Combined with explicit barriers for synchronization, these techniques enable consistent frame times even in resource-intensive scenarios.

The next article in this series explores Metal 4’s shader compilation pipeline, examining flexible pipeline states and the new MTL4Compiler interface for optimal shader management.