MetalFX Frame Interpolation
Generating intermediate frames for smoother gameplay
Frame Interpolation Architecture
MetalFX Frame Interpolation analyzes two consecutive rendered frames and synthesizes intermediate frames, effectively doubling (or more) the apparent frame rate.
Implementation
Setup
// Check support
guard MTLFXFrameInterpolation.isSupported(by: device) else {
    fatalError("Frame interpolation not supported")
}

// Create descriptor
let descriptor = MTLFXFrameInterpolationDescriptor()
descriptor.inputWidth = renderWidth
descriptor.inputHeight = renderHeight
descriptor.outputWidth = displayWidth
descriptor.outputHeight = displayHeight
descriptor.colorTextureFormat = .rgba16Float
descriptor.depthTextureFormat = .depth32Float
descriptor.motionTextureFormat = .rg16Float
descriptor.outputTextureFormat = .rgba16Float

// Create effect
let frameInterpolation = MTLFXFrameInterpolation(device: device, descriptor: descriptor)!
This Swift snippet shows an initialization pattern for MetalFX frame interpolation on Apple platforms. It begins by asserting that the runtime environment supports the feature: MTLFXFrameInterpolation.isSupported(by: device) is queried and, if it returns false, execution is aborted via fatalError. The immediate failure makes clear that the subsequent code path assumes the device-level capabilities required for frame interpolation, but it also makes initialization brittle, because it terminates the process instead of providing a graceful fallback.
Next, a MTLFXFrameInterpolationDescriptor instance is created and populated with a concise set of configuration properties. Two pairs of dimensions—inputWidth/inputHeight and outputWidth/outputHeight—describe the resolution of the source (rendered) textures and the resolution of the destination (the display or upscaling target). Getting these values right is essential: mismatched sizes can lead to sampling artifacts or to motion vectors and depth data being applied at the wrong scale.
The descriptor also sets explicit texture formats for each required resource. colorTextureFormat = .rgba16Float indicates a high-precision color buffer using 16-bit floating components per channel, which helps preserve color fidelity when synthesizing intermediate frames. depthTextureFormat = .depth32Float ensures a 32-bit floating point depth buffer, which is useful for accurate reprojection and occlusion handling. Motion vectors are configured with motionTextureFormat = .rg16Float, a two-component half-precision format sufficient to store screen-space motion (x and y). Finally, outputTextureFormat = .rgba16Float specifies the format for the generated frame; matching input color precision can reduce resampling artifacts.
After the descriptor is configured, the snippet calls the initializer MTLFXFrameInterpolation(device:descriptor:) and force-unwraps the result with !. The initializer is failable and may return nil under some conditions, so force-unwrapping assumes success and will crash at runtime if initialization fails. Combined with the earlier fatalError on unsupported devices, the approach privileges simplicity over robustness and user experience.
From an engineering perspective, this pattern communicates intent clearly but should be hardened before production use. Replace fatalError with a controlled fallback or an error path that reports the limitation to higher layers. Use guard let or if let to handle the potentially nil return from the initializer, and surface a recoverable error to the caller rather than crashing: for example, log the failure, disable interpolation while continuing to render, or show a user-visible message.
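As a concrete illustration of that hardening, here is a minimal sketch that wraps the setup in a helper returning an optional instead of crashing. It reuses the MTLFXFrameInterpolation and MTLFXFrameInterpolationDescriptor names from the snippet above; the makeFrameInterpolation function and its parameters are hypothetical.

import Metal
import MetalFX

// Hypothetical helper: returns nil instead of crashing when the device
// or the initializer cannot provide frame interpolation.
func makeFrameInterpolation(device: MTLDevice,
                            renderSize: (width: Int, height: Int),
                            displaySize: (width: Int, height: Int)) -> MTLFXFrameInterpolation? {
    guard MTLFXFrameInterpolation.isSupported(by: device) else {
        // Report the limitation upward and let the renderer continue without interpolation.
        return nil
    }

    let descriptor = MTLFXFrameInterpolationDescriptor()
    descriptor.inputWidth = renderSize.width
    descriptor.inputHeight = renderSize.height
    descriptor.outputWidth = displaySize.width
    descriptor.outputHeight = displaySize.height
    descriptor.colorTextureFormat = .rgba16Float
    descriptor.depthTextureFormat = .depth32Float
    descriptor.motionTextureFormat = .rg16Float
    descriptor.outputTextureFormat = .rgba16Float

    // guard let handles the failable initializer without a force unwrap.
    guard let effect = MTLFXFrameInterpolation(device: device, descriptor: descriptor) else {
        return nil
    }
    return effect
}

Callers can then treat a nil result as "interpolation unavailable" and keep presenting rendered frames directly.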
Other practical considerations include validating that the chosen texture formats are supported by the GPU and that the pixel formats are compatible with the rendering pipeline. Ensure the device actually supports the specified floating-point formats and that the rest of the renderer can produce color, depth, and motion textures in those formats. Also consider performance: high-precision formats increase memory and bandwidth use, so balance quality against resource constraints. Finally, maintain clear ownership and lifecycle management of the frameInterpolation instance and its associated textures to avoid synchronization problems or resource leaks when encoding to command buffers.
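Continuing the same setup, a brief sketch of allocating color, depth, and motion textures that match the formats above using MTLTextureDescriptor; the usage flags are assumptions and should mirror how the renderer actually reads and writes each resource.

// Color buffer at render resolution, high-precision float.
let colorDesc = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .rgba16Float, width: renderWidth, height: renderHeight, mipmapped: false)
colorDesc.usage = [.renderTarget, .shaderRead]

// Depth buffer used for reprojection and occlusion handling.
let depthDesc = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .depth32Float, width: renderWidth, height: renderHeight, mipmapped: false)
depthDesc.usage = [.renderTarget, .shaderRead]
depthDesc.storageMode = .private

// Two-channel half-float motion vectors.
let motionDesc = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .rg16Float, width: renderWidth, height: renderHeight, mipmapped: false)
motionDesc.usage = [.renderTarget, .shaderRead]

// makeTexture returns an optional; handle nil rather than force-unwrapping.
let colorTexture = device.makeTexture(descriptor: colorDesc)
let depthTexture = device.makeTexture(descriptor: depthDesc)
let motionTexture = device.makeTexture(descriptor: motionDesc)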
Per-Frame Usage
func interpolateFrame(
    previousColor: MTLTexture,
    currentColor: MTLTexture,
    previousDepth: MTLTexture,
    currentDepth: MTLTexture,
    motionVectors: MTLTexture,
    output: MTLTexture,
    commandBuffer: MTLCommandBuffer
) {
    frameInterpolation.previousColorTexture = previousColor
    frameInterpolation.currentColorTexture = currentColor
    frameInterpolation.previousDepthTexture = previousDepth
    frameInterpolation.currentDepthTexture = currentDepth
    frameInterpolation.motionTexture = motionVectors
    frameInterpolation.outputTexture = output

    // Interpolation factor: 0.5 = halfway between frames
    frameInterpolation.interpolationFactor = 0.5

    // Reset flag for scene cuts
    frameInterpolation.reset = isSceneCut

    frameInterpolation.encode(to: commandBuffer)
}
The function sets up and enqueues a single MetalFX frame interpolation pass. It accepts five texture inputs—previous and current color, previous and current depth, and motion vectors—plus an output texture and a command buffer. Inside the body, each texture is assigned to the corresponding property of a preconfigured frameInterpolation effect instance, binding the color, depth, and motion resources the effect needs to synthesize an intermediate frame.

The interpolationFactor is set to 0.5, requesting a temporal blend halfway between the previous and current frames; changing this value shifts the temporal sampling point of the synthesized frame. The reset flag is driven by isSceneCut, which must be determined elsewhere; when true, it tells the interpolator to discard temporal history rather than blend across an abrupt scene change, preventing ghosting. Finally, frameInterpolation.encode(to: commandBuffer) records the interpolation workload into the provided command buffer for GPU execution.

Important operational notes: the textures must match the formats and sizes declared in the descriptor used to create frameInterpolation (color, depth, and motion formats, plus input/output resolutions). Ownership and synchronization matter—make sure each texture is in the correct usage state and that its lifetime extends until GPU execution completes. Also validate isSceneCut reliably to avoid unnecessary resets, and consider exposing the interpolation factor as a parameter so it can adapt to variable frame pacing.
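Picking up that last point, here is a small sketch that derives the interpolation factor from frame timestamps instead of hard-coding 0.5; the function and its parameters are hypothetical and assume the app tracks presentation times for the previous and current rendered frames.

import Foundation

// Hypothetical timing helper: where should the synthesized frame land
// between the two rendered frames, given its target presentation time?
func interpolationFactor(previousFrameTime: TimeInterval,
                         currentFrameTime: TimeInterval,
                         presentationTime: TimeInterval) -> Float {
    let renderInterval = currentFrameTime - previousFrameTime
    guard renderInterval > 0 else { return 0.5 }      // fall back to the midpoint
    let t = (presentationTime - previousFrameTime) / renderInterval
    return Float(min(max(t, 0.0), 1.0))               // clamp to the valid range
}

The result can be passed into interpolateFrame in place of the constant 0.5 once the factor is exposed as a parameter.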
Motion Vector Generation
Quality motion vectors are critical:
struct MotionVectorOutput {
    float4 color [[color(0)]];
    float2 motion [[color(1)]];
    float depth [[depth(any)]];
};
fragment MotionVectorOutput motion_vector_fragment(
    VertexOutput in [[stage_in]],
    constant FrameData& frame [[buffer(0)]]
) {
    MotionVectorOutput out;

    // Standard shading
    out.color = shade(in);
    out.depth = in.position.z;

    // Calculate screen-space motion
    float4 currentClip = frame.viewProjection * float4(in.worldPosition, 1.0);
    float4 previousClip = frame.previousViewProjection * float4(in.previousWorldPosition, 1.0);

    float2 currentNDC = currentClip.xy / currentClip.w;
    float2 previousNDC = previousClip.xy / previousClip.w;

    // Motion in UV space (0 to 1)
    out.motion = (currentNDC - previousNDC) * 0.5;

    return out;
}
This fragment shader produces three outputs per pixel—a shaded color, a two-component motion vector, and a depth value—and packages them into a small struct so they can be written to multiple render targets at once. The intent is to compute per-pixel screen-space motion from the difference between the current and previous world-space positions, while also performing normal shading and a depth write suitable for downstream passes such as temporal reprojection or frame interpolation.

The struct MotionVectorOutput declares three members with Metal output attributes. color is a float4 bound to color attachment 0 and carries the usual shaded color, including alpha if the shading model supplies it. motion is a float2 bound to color attachment 1, a separate render target dedicated to velocity, packed into the red and green channels. depth is a single float marked [[depth(any)]]; the any qualifier lets the fragment write an arbitrary depth value, with no promised relationship to the rasterized depth (unlike the greater or less qualifiers, which preserve some early-depth optimizations).

The fragment function motion_vector_fragment receives the interpolated VertexOutput for the current fragment and a constant FrameData block in buffer slot 0. VertexOutput typically carries per-vertex data such as the world-space position, the previous frame's world-space position, normals, texture coordinates, and the clip-space position. FrameData groups camera matrices and other per-frame constants, including both the current viewProjection matrix and the previousViewProjection matrix used to transform world-space positions into clip space for their respective frames.

Inside the function, a local MotionVectorOutput named out is declared. The shading routine shade(in) computes the conventional surface color, which is assigned to out.color. The depth output is taken from the interpolated in.position.z; depending on how VertexOutput is set up, that value may be a view-space depth or a clip-space component, so the shader author must ensure it matches the depth range and coordinate convention expected by the depth attachment.

The motion computation projects the current and previous world positions into clip space with the corresponding matrices—frame.viewProjection for the current frame and frame.previousViewProjection for the previous one—then performs the perspective divide to obtain normalized device coordinates (NDC), dividing xy by w for both currentClip and previousClip. The resulting currentNDC and previousNDC lie roughly in [-1, 1] per axis. The motion vector is the difference currentNDC - previousNDC, multiplied by 0.5 and stored in out.motion, with a comment that reads "Motion in UV space (0 to 1)".

That comment is misleading as written. The difference of two NDC coordinates ranges roughly over [-2, 2], so multiplying by 0.5 yields a signed displacement roughly in [-1, 1], measured in UV units but not remapped into the [0, 1] range. Because the motion target declared earlier is rg16Float, a signed float format, storing the signed displacement is natural, and the comment should say so; a +0.5 bias such as (currentNDC - previousNDC) * 0.5 + 0.5 is only needed when packing signed motion into an unsigned-normalized format. Note also that Metal's NDC y axis points up while texture-coordinate v grows downward, so a displacement expressed in UV space normally has its y component negated. Whatever convention is chosen, downstream consumers (interpolators, reprojection passes) must decode the stored motion with exactly the same scale, sign, and bias.

A few practical caveats follow from the shader logic. First, the shader assumes previous world positions are available per vertex and interpolated across the triangle; producing them accurately requires either storing previous model transforms with each vertex or computing previous world positions in a dedicated motion pass, and with skeletal animation or per-object motion the previous position must reflect that movement. Second, the perspective divide should guard against near-zero w: if previousClip.w or currentClip.w is very small, the NDC values become unstable and the motion vectors noisy, so a robust implementation clamps or substitutes fallback values. Third, screen-space motion captures the apparent motion of visible surfaces but cannot represent newly revealed pixels (disocclusions) or geometry that was occluded in the previous frame; handling those cases typically needs extra heuristics or depth-aware blending in the temporal pass. Fourth, precision and format choices matter: a two-channel half-precision texture such as rg16Float limits range and precision, so choose a scale and bias that avoid overflow and minimize quantization artifacts.

Finally, this shader integrates naturally into temporal workflows such as TAA or frame interpolation: the color output is the base rendering, the motion texture guides reprojection of previous frames, and the depth helps detect mismatches and occlusions. The shader's conventions (coordinate mapping, velocity encoding) must match what the consumers of its outputs expect, or reprojections will be wrong and ghosting will appear.
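To make the scale and sign conventions concrete, here is a minimal Swift sketch of converting an NDC-space displacement into a UV-space displacement; the helper name is illustrative.

import simd

// Convert a displacement measured in NDC units ([-1, 1] per axis) into a
// displacement measured in UV units ([0, 1] per axis). NDC spans 2 units
// where UV spans 1, hence the 0.5 scale; NDC +y points up while texture v
// grows downward, hence the sign flip on y. No +0.5 bias is added because
// the delta stays signed in an rg16Float target.
func ndcDeltaToUVDelta(_ d: SIMD2<Float>) -> SIMD2<Float> {
    SIMD2<Float>(d.x * 0.5, d.y * -0.5)
}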
Frame Pacing Strategy
Triple Buffering with Interpolation
class FrameInterpolationController {
    private var frameHistory: [FrameData] = []
    private let historySize = 3

    func submitFrame(_ frame: FrameData, displayLink: CADisplayLink) {
        frameHistory.append(frame)
        if frameHistory.count > historySize {
            frameHistory.removeFirst()
        }

        // Calculate required interpolated frames
        let targetRefreshRate = displayLink.targetTimestamp - displayLink.timestamp
        let renderInterval = frame.timestamp - frameHistory[frameHistory.count - 2].timestamp
        let interpolatedFrameCount = Int(ceil(renderInterval / targetRefreshRate)) - 1

        for i in 1...interpolatedFrameCount {
            let factor = Float(i) / Float(interpolatedFrameCount + 1)
            generateInterpolatedFrame(factor: factor)
        }
    }
}
This code implements a simple controller that generates interpolated frames from a short history of actually rendered frames. It keeps an array frameHistory and a constant historySize = 3, so the window holds at most the last three entries. The public method submitFrame takes a new frame and a CADisplayLink, appends the frame to the history, and removes the oldest entry if the history has grown too large. It then estimates how many intermediate frames are needed and calls generateInterpolatedFrame with the calculated interpolation factors.

The number of interpolated frames is computed in a few steps. First the target refresh interval is taken as displayLink.targetTimestamp - displayLink.timestamp, the planned time until the next vsync. Then the render interval is the difference between the new frame's timestamp and the previous frame in the history (frame.timestamp - frameHistory[frameHistory.count - 2].timestamp). Dividing renderInterval by targetRefreshRate, rounding up with ceil, and subtracting one gives the number of intermediate frames. For each i from 1 to that count, factor = Float(i) / Float(interpolatedFrameCount + 1) yields evenly spaced values strictly between 0 and 1, which are passed to the generator.

This approach has several assumptions and weaknesses. It assumes there are at least two frames in frameHistory—frameHistory[frameHistory.count - 2] will crash on the first call. It does not validate timestamps or targetRefreshRate (a zero or negative value breaks the division). If interpolatedFrameCount is zero or negative, the 1...interpolatedFrameCount loop traps. Converting to Int without checks can produce wrong counts. There is no cap on the number of generated frames, no protection against GPU overload, no scene-cut detection or history reset, and no synchronization guarantees for generateInterpolatedFrame.

Suggested fixes: check that at least two frames exist before accessing the previous one; ensure the refresh interval is positive; clamp interpolatedFrameCount = max(0, Int(ceil(…)) - 1) so the loop is safe; consider using displayLink.duration or a fixed refresh interval; keep time units consistent; impose a reasonable maximum on the number of interpolated frames; and handle resource lifetime and GPU synchronization (for example via MetalFX and command buffer completion) when generating frames. A hardened variant along these lines is sketched below.
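Putting those suggestions together, here is a hedged sketch of a hardened submitFrame variant for the same controller, assuming the FrameData type and generateInterpolatedFrame(factor:) seen above; the cap of four synthesized frames per rendered frame is an arbitrary illustrative limit.

func submitFrameHardened(_ frame: FrameData, displayLink: CADisplayLink) {
    frameHistory.append(frame)
    if frameHistory.count > historySize {
        frameHistory.removeFirst()
    }

    // Need at least two rendered frames before anything can be interpolated.
    guard frameHistory.count >= 2 else { return }

    // Prefer the display link's nominal frame duration; fall back if it is unusable.
    let refreshInterval = displayLink.duration > 0
        ? displayLink.duration
        : displayLink.targetTimestamp - displayLink.timestamp
    guard refreshInterval > 0 else { return }

    let previous = frameHistory[frameHistory.count - 2]
    let renderInterval = frame.timestamp - previous.timestamp
    guard renderInterval > 0 else { return }

    // Clamp the count so a hitch cannot queue an unbounded amount of GPU work.
    let maxInterpolatedFrames = 4
    let rawCount = Int(ceil(renderInterval / refreshInterval)) - 1
    let interpolatedFrameCount = min(max(rawCount, 0), maxInterpolatedFrames)
    guard interpolatedFrameCount > 0 else { return }

    for i in 1...interpolatedFrameCount {
        let factor = Float(i) / Float(interpolatedFrameCount + 1)
        generateInterpolatedFrame(factor: factor)
    }
}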
Quality Considerations
Handling Disocclusions
Disoccluded regions (newly visible areas) require special handling:
// Enable disocclusion handling
descriptor.enableDisocclusionHandling = true
// Provide reactive mask for UI elements
frameInterpolation.reactiveTexture = uiMaskTexture
The first line enables a dedicated disocclusion mode in the interpolator, so the algorithm handles regions that are newly visible in the current frame after being hidden in the previous one. Detecting those pixels and treating them separately, rather than simply blending old data, prevents ghosting and reprojection artifacts. The second line assigns a mask texture, reactiveTexture, for reactive UI elements: the mask marks pixels that should not be smoothly interpolated (for example a blinking HUD, cursors, or transient UI), so the interpolator can treat them differently or exclude them from the temporal blend. Practical notes: keep the mask's format and size compatible with the effect, synchronize writes to it, and weigh the performance impact of high-resolution or multiple masks.
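Continuing from the snippet above, a brief sketch of allocating a single-channel mask at output resolution; the r8Unorm format is an assumption, and which mask value marks a pixel as reactive should follow the effect's documentation.

// A one-channel mask is typically enough for UI coverage; the renderer
// writes coverage into it each frame before interpolation is encoded.
let maskDesc = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .r8Unorm, width: displayWidth, height: displayHeight, mipmapped: false)
maskDesc.usage = [.renderTarget, .shaderRead]
let uiMaskTexture = device.makeTexture(descriptor: maskDesc)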
Scene Cut Detection
func detectSceneCut(previousFrame: MTLTexture, currentFrame: MTLTexture) -> Bool {
    // Calculate histogram difference
    let histogramDiff = computeHistogramDifference(previousFrame, currentFrame)

    // Threshold for scene cut
    return histogramDiff > 0.7
}
Scene cut detection is important for frame interpolation. When the scene changes suddenly, frame history cannot be used safely: reprojection and blending would pull in wrong data and cause strong ghosting or artifacts. The example uses a simple method based on the histogram difference between two consecutive frames, compared against a threshold of 0.7. This is fast and often sufficient, but it needs careful additions to be robust in real scenes.

The histogram method represents each frame as a discrete histogram of color or brightness (pixel counts per bin) and measures the distance between those histograms. You can use luminance histograms or multi-channel histograms (R/G/B or YUV), compare them with L1, L2, chi-square, or correlation metrics, and normalize the result to 0..1 so it can be thresholded easily. A threshold of 0.7 means a normalized difference above 70% is treated as a scene cut, but the right value depends on the bin count, the normalization, and the metric.

Simple histogram thresholds have weaknesses. Histograms ignore spatial layout, so global brightness changes (flicker, auto exposure, day/night transitions) can cause false positives. Large camera motion or panning can change the color distribution without an actual cut. Small but important cuts that preserve a similar color distribution can be missed.

Practical improvements include using multiple histograms (separate luminance and chrominance, or grid histograms that capture local changes), adaptive thresholding based on recent variability or hysteresis, and combining signals. The histogram detector can be fused with motion-vector statistics, SSIM/PSNR on reprojected frames, edge detection, or keypoint matches (ORB/SIFT); motion vectors in particular help distinguish camera motion from real cuts. Make the histograms robust to exposure changes by normalizing for global exposure or comparing chrominance in a luma/chroma space such as YUV, so lighting changes alone do not trigger false cuts. For performance, compute histograms on the GPU or with parallel reductions, and choose bin counts and grid resolution to match latency targets.

The detector should return a binary cut flag plus a confidence level. On a confirmed cut, reset the interpolator history, ignore the previous frame for reprojection, and shorten or skip interpolated frames as needed. Also log timing and synchronize with the render thread to avoid races when accessing textures. In summary, a simple histogram test is a good quick start, but production use requires adaptivity, multi-channel and spatial analysis, and possibly multimodal fusion of signals to reduce false positives and negatives and keep interpolation visually correct.
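As a CPU-side illustration of the comparison step, here is a minimal sketch that normalizes two luminance histograms and measures their L1 distance, with simple hysteresis around the threshold; the histogram arrays are assumed to be computed elsewhere (normally on the GPU), and the thresholds are illustrative.

// Normalized L1 distance between two histograms, in [0, 1].
func histogramDifference(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count && !a.isEmpty)
    let sumA = max(a.reduce(0, +), .leastNormalMagnitude)
    let sumB = max(b.reduce(0, +), .leastNormalMagnitude)
    var diff: Float = 0
    for i in 0..<a.count {
        diff += abs(a[i] / sumA - b[i] / sumB)
    }
    return diff * 0.5   // the L1 distance of two unit-sum vectors is at most 2
}

// Hysteresis: a high value is required to enter the "cut" state and a lower
// value to leave it, which suppresses flicker around a single fixed threshold.
struct SceneCutDetector {
    private var inCut = false
    mutating func update(difference: Float) -> Bool {
        if inCut {
            if difference < 0.4 { inCut = false }
        } else {
            if difference > 0.7 { inCut = true }
        }
        return inCut
    }
}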
Conclusion
MetalFX Frame Interpolation provides an effective way to synthesize intermediate frames and improve smoothness. Key points from this section:
Implementation: use safe initialization instead of fatalError and force-unwrapping; return errors or disable interpolation as a fallback.
Format validation: verify GPU support for the pixel formats and ensure compatibility with the rendering pipeline (color, depth, motion, output).
Motion vectors: unify the encoding (range, sign, and bias); if a UV 0..1 encoding is desired, use (currentNDC - previousNDC) * 0.5 + 0.5, otherwise document that the values are signed.
Synchronization and lifecycle: ensure correct texture states, lifetimes, and command buffer completion waits to avoid races and resource leaks.
Pacing and controller: add checks (sufficient history, a positive refresh interval, a cap on interpolated frames), robust timing, and protection against GPU overload.
Cut detection and disocclusion: use multimodal detectors (local histograms, motion statistics, SSIM) and enable disocclusion handling and reactive masks for UI while considering performance (GPU reductions, lower resolution).
Quality vs. performance: balance precision against memory and bandwidth use; test on real hardware and measure latency and bandwidth.
Recommendation: harden the implementation with validation, fallbacks, synchronization, and consistent motion-data encoding before production deployment.