Vulkan Schnee 0.0.1
High-performance rendering engine
GPU-Driven Rendering Pipeline

The engine uses a 4-stage compute pipeline to cull, bin, and prepare geometry for mesh shader rendering. All four stages run on the GPU, eliminating CPU-side draw call generation.

Pipeline Overview

PrimitiveCulling → MeshletBinning → MeshletUnpacking → PrepareDraw → Mesh Shader
    (Stage 0)        (Stage 1)         (Stage 2)       (Stage 3)

Each stage reads outputs from the previous stage. The pipeline processes primitives (scene objects) through frustum culling, groups them by material, linearizes their meshlets, and generates indirect draw commands.

Stage 0: PrimitiveCulling

Shader: PrimitiveCulling.comp

Tests primitives against view frustum and Hi-Z occlusion. Operates in two passes for accurate occlusion culling.

Pass 1

  • Tests all primitives against previous frame's Hi-Z pyramid
  • Survivors write to culling_survivors[]
  • Primitives that pass frustum but fail Hi-Z write to culling_failed[] for Pass 2

Pass 2

  • Re-tests culling_failed[] primitives against current frame's Hi-Z
  • Appends newly visible primitives to culling_survivors[]
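The two-pass flow can be modeled on the CPU as a small classification routine (a sketch; the struct and function names are illustrative stand-ins for the GPU visibility tests, not engine API):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the per-primitive GPU test results.
struct VisibilityTests {
    bool inFrustum;
    bool passesPrevHiZ;  // Pass 1: previous frame's pyramid
    bool passesCurrHiZ;  // Pass 2: current frame's pyramid
};

// Classifies one frame's primitives the way the two passes do.
void cullTwoPass(const std::vector<VisibilityTests>& prims,
                 std::vector<uint32_t>& survivors,
                 std::vector<uint32_t>& failed) {
    // Pass 1: frustum + stale Hi-Z.
    for (uint32_t id = 0; id < prims.size(); ++id) {
        if (!prims[id].inFrustum) continue;          // rejected outright
        if (prims[id].passesPrevHiZ) survivors.push_back(id);
        else                         failed.push_back(id);
    }
    // Pass 2: re-test only the Pass-1 Hi-Z failures against fresh Hi-Z.
    for (uint32_t id : failed)
        if (prims[id].passesCurrHiZ) survivors.push_back(id);
}
```

Note that a primitive recovered in Pass 2 simply appends to the same survivor list, so downstream stages never distinguish the two passes.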

Culling Tests

Frustum culling: Tests bounding sphere against 6 frustum planes for both eyes. Primitive passes if visible in either eye.
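The sphere-vs-frustum test reduces to six signed-distance checks. A minimal host-side sketch (the plane layout and names are assumptions; the sign convention puts "inside" at non-negative distance):

```cpp
#include <array>

struct Plane  { float nx, ny, nz, d; };  // n·p + d >= 0 means "inside"
struct Sphere { float x, y, z, r; };

// True if the sphere is at least partially inside all six planes.
bool sphereInFrustum(const Sphere& s, const std::array<Plane, 6>& planes) {
    for (const Plane& p : planes) {
        float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.r) return false;  // fully behind one plane: culled
    }
    return true;
}

// Per the pipeline above, a primitive survives if visible in either eye.
bool visibleEitherEye(const Sphere& s,
                      const std::array<Plane, 6>& leftEye,
                      const std::array<Plane, 6>& rightEye) {
    return sphereInFrustum(s, leftEye) || sphereInFrustum(s, rightEye);
}
```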

Hi-Z occlusion: Projects bounding sphere to screen space, samples Hi-Z mipmap pyramid at appropriate LOD. Conservative: only culls objects behind ALL scene geometry in their screen tile.

GPU transform optimization: Computes world-space bounds from local bounds + world matrix on GPU. Eliminates CPU iteration for transform updates.
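Transforming a local bounding sphere by a world matrix amounts to transforming the center and scaling the radius by the largest basis-column length, which stays conservative under non-uniform scale. A sketch of that computation (types and names are illustrative, not engine API):

```cpp
#include <algorithm>
#include <cmath>

struct Sphere { float x, y, z, r; };

// Column-major 4x4 world matrix, as typically uploaded to the GPU.
struct Mat4 { float m[16]; };  // m[col * 4 + row]

// Conservative world-space bounds: transform the center, then scale the
// radius by the length of the longest basis column.
Sphere worldBounds(const Sphere& local, const Mat4& w) {
    auto colLen = [&](int c) {
        float x = w.m[c*4+0], y = w.m[c*4+1], z = w.m[c*4+2];
        return std::sqrt(x*x + y*y + z*z);
    };
    float maxScale = std::max({colLen(0), colLen(1), colLen(2)});
    Sphere out;
    out.x = w.m[0]*local.x + w.m[4]*local.y + w.m[8]*local.z  + w.m[12];
    out.y = w.m[1]*local.x + w.m[5]*local.y + w.m[9]*local.z  + w.m[13];
    out.z = w.m[2]*local.x + w.m[6]*local.y + w.m[10]*local.z + w.m[14];
    out.r = local.r * maxScale;
    return out;
}
```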

Path Routing

PrimitiveCulling routes primitives to three rendering paths:

Mesh shader path (multi-meshlet): Primitives with meshletCount > 1 → writes primitive ID to culling_survivors[] → processed by MeshletBinning

Vertex shader path (single-meshlet): Primitives with meshletCount == 1 → writes primitive ID to vs_visible_instances[] → separate instanced VS pipeline

LOD path (cluster-based): Primitives with LOD data → per-cluster selection → writes ClusterSurvivor[] to lod_cluster_survivors[] → processed by MeshletBinning
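In host-side terms the routing is a three-way switch on per-primitive metadata. A sketch (the struct fields are assumptions derived from the buffer names above, not engine API):

```cpp
#include <cstdint>

enum class Path { MeshShader, VertexShader, Lod };

struct PrimitiveMeta {
    uint32_t meshletCount;
    bool     hasLodData;
};

// Mirrors the routing described above: LOD data takes the cluster path,
// single-meshlet primitives take the instanced VS path, the rest go to
// the mesh shader path.
Path routePrimitive(const PrimitiveMeta& p) {
    if (p.hasLodData)        return Path::Lod;
    if (p.meshletCount == 1) return Path::VertexShader;
    return Path::MeshShader;
}
```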

Inputs

  • local_bounds_buffer[] - Static local bounding spheres (uploaded once)
  • per_object_transforms[] - World matrices (updated per frame)
  • primitive_meshlet_data[] - Meshlet counts and pipeline IDs
  • frustum_planes - View frustum for both eyes (UBO)
  • u_hiZPyramid - Previous frame's depth pyramid (Pass 1)
  • u_hiZPyramidCurrent - Current frame's depth pyramid (Pass 2)

Outputs

  • culling_survivors[] - Primitive IDs for mesh shader path
  • cull_count - Atomic counter for survivors
  • vs_visible_instances[] - Primitive IDs for VS path
  • vs_visible_count - Atomic counter for VS survivors
  • lod_cluster_survivors[] - LOD cluster data with dither factors
  • lod_cluster_survivor_count - Atomic counter for LOD clusters
  • culling_failed[] - Primitive IDs that failed Pass 1 Hi-Z (for Pass 2)
  • culling_failed_count - Atomic counter for failed primitives

Dispatch

vkCmdDispatch(cmd, (primitiveCount + threadCount - 1) / threadCount, 1, 1);
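The `(n + threadCount - 1) / threadCount` expression, used by every fixed-size dispatch in this document, is integer ceiling division: the smallest workgroup count whose total thread count covers all items.

```cpp
#include <cstdint>

// Number of workgroups needed so that groups * threadCount >= items.
constexpr uint32_t groupCount(uint32_t items, uint32_t threadCount) {
    return (items + threadCount - 1) / threadCount;
}
```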

Stage 1: MeshletBinning (BinningAllocator)

Shader: MeshletBinningAllocator.comp

Groups meshlets by rendering pipeline (material/shader type) to batch draw calls. All meshlets using the same material get binned together for a single indirect draw command.

Algorithm

Processes two input streams:

  1. Regular primitive survivors (indices [0, survivor_count))
  2. LOD cluster survivors (indices [survivor_count, survivor_count + lod_cluster_survivor_count))

For each survivor:

  1. Look up pipeline ID from primitive_meshlet_data[primitiveID].pipelineID (which material/shader this object uses)
  2. Use subgroup operations to batch atomic increments per pipeline
  3. Allocate write offset in that pipeline's bin
  4. Store allocation info for Stage 2

The output groups all meshlets that share the same rendering pipeline together. Mesh shader Stage 3 dispatches one indirect draw per pipeline, processing all binned meshlets in a single call.

Subgroup optimization: Threads targeting the same pipeline aggregate their meshlet counts and perform a single atomic add per subgroup instead of per thread. Reduces atomic contention by ~32x (typical subgroup size).
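The effect of the aggregation can be modeled on the CPU: instead of one atomic add per thread, lanes that share a pipeline combine their counts first and issue one add per distinct pipeline in the subgroup. A simulation sketch that counts the atomics issued (illustrative only; the real shader uses GLSL subgroup intrinsics):

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Result {
    uint32_t atomics;                     // atomic adds issued
    std::map<uint32_t, uint32_t> totals;  // per-pipeline meshlet totals
};

// Simulates one subgroup: each lane contributes (pipelineID, meshletCount).
Result aggregateSubgroup(const std::vector<std::pair<uint32_t, uint32_t>>& lanes) {
    std::map<uint32_t, uint32_t> partial;
    for (auto [pipe, count] : lanes) partial[pipe] += count;  // in-subgroup reduce
    Result r{0, {}};
    for (auto [pipe, sum] : partial) {                        // one atomic per pipeline
        r.totals[pipe] += sum;
        r.atomics++;
    }
    return r;
}
```

With 32 lanes all targeting one pipeline, the naive scheme issues 32 atomic adds while the aggregated scheme issues 1, which is the ~32x reduction cited above.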

Inputs

  • culling_survivors[] - Primitive IDs from PrimitiveCulling
  • survivor_count - Count buffer from PrimitiveCulling
  • lod_cluster_survivors[] - LOD cluster data from PrimitiveCulling
  • lod_cluster_survivor_count - LOD counter from PrimitiveCulling
  • primitive_meshlet_data[] - Meshlet start/count/pipeline per primitive
  • cluster_lod_data[] - Meshlet index lookup for LOD clusters

Outputs

  • allocations[] - Per-primitive/cluster allocation info:
    struct PrimitiveAllocation {
        uint primitiveID;       // For transform/material lookup
        uint pipelineID;        // Which material bin
        uint binWriteOffset;    // Where to write in bin
        uint meshletStartIndex; // Global meshlet start
        uint meshletCount;      // Number of meshlets
        float ditherFactor;     // LOD transition (1.0 for non-LOD)
    };
  • pipeline_meshlet_counts[] - Total meshlets per pipeline (atomic counters)

Dispatch

// Worst case: all primitives survive + all LOD clusters selected
const uint32_t maxWork = primitiveCount + clusterCount;
vkCmdDispatch(cmd, (maxWork + threadCount - 1) / threadCount, 1, 1);

Stage 2: MeshletUnpacking

Shader: MeshletUnpacking.comp

Expands each primitive into linearized meshlet entries within its pipeline bin, producing a fixed-stride array that mesh shaders index into.

Layout

Output buffer uses fixed-stride layout:

[Pipeline0: MAX_MESHLETS_PER_BIN slots][Pipeline1: MAX_MESHLETS_PER_BIN slots]...

Each pipeline reserves MAX_MESHLETS_PER_BIN (262144) slots. Mesh shader computes offset: pipelineIndex * MAX_MESHLETS_PER_BIN + gl_WorkGroupID.x.
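The fixed-stride address computation is a single multiply-add; in isolation, with the stride from above:

```cpp
#include <cstdint>

constexpr uint32_t MAX_MESHLETS_PER_BIN = 262144;

// Slot in binned_meshlet_info[] for workgroup `workGroupId` of pipeline
// `pipelineIndex` (mirrors the mesh shader's offset computation above).
constexpr uint32_t meshletSlot(uint32_t pipelineIndex, uint32_t workGroupId) {
    return pipelineIndex * MAX_MESHLETS_PER_BIN + workGroupId;
}
```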

Algorithm

Reads allocation info from Stage 1, expands each primitive's meshlets:

uint bin_base = alloc.pipelineID * MAX_MESHLETS_PER_BIN;
uint write_start = bin_base + alloc.binWriteOffset;
for (uint i = 0; i < alloc.meshletCount; i++) {
    VisibleMeshletInfo info;
    info.objectID = alloc.primitiveID;
    info.meshletIndex = alloc.meshletStartIndex + i;
    info.ditherFactor = alloc.ditherFactor;
    binned_meshlet_info[write_start + i] = info;
}

Inputs

  • allocations[] - From MeshletBinning
  • survivor_count - Regular survivor count
  • lod_cluster_survivor_count - LOD survivor count

Outputs

  • binned_meshlet_info[] - Linearized VisibleMeshletInfo per pipeline:
    struct VisibleMeshletInfo {
        uint objectID;      // Primitive ID for transform lookup
        uint meshletIndex;  // Global meshlet index
        float ditherFactor; // LOD transition (0.0-1.0)
    };

Dispatch

GPU-driven via indirect dispatch. CountDispatcher reads survivor_count + lod_cluster_survivor_count and generates dispatch commands.

vkCmdDispatchIndirect(cmd, dispatchBuffer, 0);

Stage 3: PrepareDraw

Shader: PrepareDraw.comp

Converts per-pipeline meshlet counts to indirect draw commands for mesh shaders.

Algorithm

One thread per pipeline. Reads meshlet count, writes indirect command:

uint pipelineIndex = gl_GlobalInvocationID.x;
commands[pipelineIndex].groupCountX = pipeline_meshlet_counts[pipelineIndex];
commands[pipelineIndex].groupCountY = 1;
commands[pipelineIndex].groupCountZ = 1;

Mesh shader work group count = meshlet count (one thread group processes one meshlet).

Inputs

  • pipeline_meshlet_counts[] - From MeshletBinning (Stage 1)

Outputs

  • commands[] - VkDrawMeshTasksIndirectCommandEXT per pipeline

Dispatch

vkCmdDispatch(cmd, 1, 1, 1); // One workgroup; one thread per pipeline

Data Flow Summary

CPU Upload:
  - Local bounds (once)
  - Per-object transforms (per frame)
  - Frustum planes (per frame)
Stage 0 (PrimitiveCulling):
  Reads:  local_bounds, transforms, frustum, Hi-Z
  Writes: culling_survivors[], cull_count, vs_visible_instances[], lod_cluster_survivors[]
Stage 1 (MeshletBinning):
  Reads:  culling_survivors[], cull_count, lod_cluster_survivors[]
  Writes: allocations[], pipeline_meshlet_counts[]
Stage 2 (MeshletUnpacking):
  Reads:  allocations[], survivor_count, lod_cluster_survivor_count
  Writes: binned_meshlet_info[]
Stage 3 (PrepareDraw):
  Reads:  pipeline_meshlet_counts[]
  Writes: indirect_draw_commands[]
Mesh Shader:
  Reads: binned_meshlet_info[pipelineIndex * MAX_MESHLETS_PER_BIN + gl_WorkGroupID.x]
  Draws via: vkCmdDrawMeshTasksIndirectEXT(cmd, indirect_draw_commands, ...)

Barriers

Each stage synchronizes via compute-to-compute pipeline barriers:

VkMemoryBarrier2 {
    .srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT,
    .dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
    .dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT
}

Final barrier before mesh shader:

VkBufferMemoryBarrier2 {
    .srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT,
    .dstStageMask  = VK_PIPELINE_STAGE_2_MESH_SHADER_BIT_EXT | VK_PIPELINE_STAGE_2_DRAW_INDIRECT_BIT,
    .dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT | VK_ACCESS_2_INDIRECT_COMMAND_READ_BIT
}

Performance Characteristics

Atomic contention reduction: Subgroup aggregation in MeshletBinning reduces atomic operations by ~32x (one atomic per subgroup instead of per thread).

GPU-driven dispatch: MeshletUnpacking uses indirect dispatch. No CPU readback required to determine work size.

Fixed-stride indexing: Mesh shaders use simple arithmetic (pipelineID * stride + meshletID) instead of indirection chains.

Two-pass occlusion: Pass 1 uses stale Hi-Z (allows overlap with previous frame rendering). Pass 2 catches false negatives with accurate Hi-Z.

Buffer Initialization

Counters and dispatch buffers must be zeroed before PrimitiveCulling:

vkCmdFillBuffer(cmd, cull_count, 0, sizeof(uint32_t), 0);
vkCmdFillBuffer(cmd, vs_visible_count, 0, sizeof(uint32_t), 0);
vkCmdFillBuffer(cmd, lod_cluster_survivor_count, 0, sizeof(uint32_t), 0);
vkCmdFillBuffer(cmd, culling_failed_count, 0, sizeof(uint32_t), 0);
// Fill pipeline_meshlet_counts[32] with zeros
vkCmdFillBuffer(cmd, pipeline_meshlet_counts, 0, 32 * sizeof(uint32_t), 0);

Vertex Shader Offloading Pipeline

Single-meshlet geometry bypasses the mesh shader pipeline using instanced vertex shader rendering. This reduces overhead for tiny meshes and heavily instanced objects.

Routing Decision

PrimitiveCulling routes primitives based on meshlet count:

  • meshletCount == 1 → Vertex shader path
  • meshletCount > 1 → Mesh shader path

Single-meshlet primitives write to vs_visible_instances[] instead of culling_survivors[].

Pipeline Stages

PrimitiveCulling → VSBinningAllocator → VSInstanceUnpacking → VSPrepareDraw → vkCmdDrawIndexedIndirectCount
    (Stage 0)        (VS Stage 1)          (VS Stage 2)       (VS Stage 3)

VS Stage 1: VSBinningAllocator

Shader: VSBinningAllocator.comp

Bins visible single-meshlet primitives by geometry type. Groups instances sharing the same mesh to enable instanced drawing.

Algorithm:

Reads primitive IDs from vs_visible_instances[], resolves each primitive's geometry type via instance_culling_data[].meshGeometryId → mesh_geometry_data[].singleMeshletGeoIndex, and allocates write offsets per geometry bin using subgroup-optimized atomics.

uint primitive_id = vs_visible_instances[survivor_idx];
uint mesh_geo_id = instance_culling_data[primitive_id].meshGeometryId;
uint geo_index = mesh_geometry_data[mesh_geo_id].singleMeshletGeoIndex;
// Subgroup optimization batches atomics per geometry type
uint global_offset = atomicAdd(vs_geometry_counters[geo_index], instance_count);

Inputs:

  • vs_visible_instances[] - Primitive IDs from PrimitiveCulling
  • vs_visible_count - Number of VS path survivors
  • instance_culling_data[] - Maps primitive ID to geometry ID
  • mesh_geometry_data[] - Contains singleMeshletGeoIndex

Outputs:

  • vs_instance_allocations[] - Per-survivor allocation: (primitiveID, geoIndex, writeOffset)
  • vs_geometry_counters[] - Atomic instance count per geometry type

Dispatch:

vkCmdDispatch(cmd, (primitiveCount + threadCount - 1) / threadCount, 1, 1);

VS Stage 2: VSInstanceUnpacking

Shader: VSInstanceUnpacking.comp

Writes primitive IDs to fixed-stride instance buffer organized by geometry type.

Buffer Layout:

[Geo0: slots 0..MAX-1][Geo1: slots MAX..2*MAX-1][Geo2: ...]

Each geometry type reserves MAX_VS_INSTANCES_PER_GEO (16384) slots.

Algorithm:

uint dest_idx = alloc.geoIndex * MAX_VS_INSTANCES_PER_GEO + alloc.writeOffset;
vs_instance_ids[dest_idx] = alloc.primitiveID;

Vertex shader reads: uint objectID = vs_instance_ids[gl_InstanceIndex];
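This lookup works because in Vulkan `gl_InstanceIndex` equals `firstInstance + instanceWithinDraw`, and VSPrepareDraw sets `firstInstance` to the geometry's bin base. The index math in isolation (function name is illustrative):

```cpp
#include <cstdint>

constexpr uint32_t MAX_VS_INSTANCES_PER_GEO = 16384;

// Vulkan semantics: gl_InstanceIndex = firstInstance + instanceWithinDraw,
// where firstInstance = geoIdx * MAX_VS_INSTANCES_PER_GEO per the layout above.
constexpr uint32_t instanceIndex(uint32_t geoIdx, uint32_t instanceWithinDraw) {
    uint32_t firstInstance = geoIdx * MAX_VS_INSTANCES_PER_GEO;
    return firstInstance + instanceWithinDraw;
}
```

So instance 3 of geometry 1 lands on slot 16387, exactly where VSInstanceUnpacking wrote it.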

Inputs:

  • vs_instance_allocations[] - From VSBinningAllocator
  • vs_visible_count - Number of survivors

Outputs:

  • vs_instance_ids[] - Fixed-stride buffer for vertex shader lookup

Dispatch:

vkCmdDispatch(cmd, (primitiveCount + threadCount - 1) / threadCount, 1, 1);

VS Stage 3: VSPrepareDraw

Shader: VSPrepareDraw.comp

Generates one VkDrawIndexedIndirectCommand per geometry type with visible instances.

Algorithm:

One thread per geometry type. Reads instance count, writes indirect command if count > 0.

uint instance_count = vs_geometry_counters[geo_idx];
if (instance_count == 0) return;
uint slot = atomicAdd(vs_draw_count, 1); // Compact: only non-empty geometries emit a draw
vs_indirect_draws[slot].indexCount    = geo.indexCount;
vs_indirect_draws[slot].instanceCount = instance_count;
vs_indirect_draws[slot].firstIndex    = geo.firstIndex;
vs_indirect_draws[slot].vertexOffset  = geo.vertexOffset;
vs_indirect_draws[slot].firstInstance = geo_idx * MAX_VS_INSTANCES_PER_GEO; // Base into instance buffer

Subgroup optimization: one thread is elected to allocate draw command slots for all geometries handled by the subgroup.

Inputs:

  • single_meshlet_geo[] - Static geometry data (index/vertex info)
  • vs_geometry_counters[] - Instance counts from VSBinningAllocator

Outputs:

  • vs_indirect_draws[] - VkDrawIndexedIndirectCommand per geometry
  • vs_draw_count - Number of commands generated (atomic counter)

Push Constants:

  • uniqueGeometryCount - Number of geometry types to process

Dispatch:

vkCmdDispatch(cmd, (uniqueGeometryCount + threadCount - 1) / threadCount, 1, 1);

Vertex Shader Draw

Command:

vkCmdDrawIndexedIndirectCount(
    cmd,
    indirectBuffer,  // vs_indirect_draws[]
    0,
    countBuffer,     // vs_draw_count
    0,
    maxDrawCount,    // uniqueGeometryCount
    stride
);

Vertex Shader Lookup:

Placeholder.vert maps gl_InstanceIndex to object ID:

uint objectID = vs_instance_ids[gl_InstanceIndex];
mat4 worldMatrix = objects[objectID].worldMatrix;
gl_Position = viewProjection.matrices[gl_ViewIndex] * worldMatrix * vec4(inPosition.xyz, 1.0);

Data Flow Summary

PrimitiveCulling:
Input: primitive bounds, transforms, frustum
Output: vs_visible_instances[] (primitive IDs)
VSBinningAllocator:
Input: vs_visible_instances[], geometry metadata
Output: vs_instance_allocations[], vs_geometry_counters[]
VSInstanceUnpacking:
Input: vs_instance_allocations[]
Output: vs_instance_ids[] (fixed-stride per geometry)
VSPrepareDraw:
Input: vs_geometry_counters[], geometry data
Output: vs_indirect_draws[], vs_draw_count
DrawIndexedIndirectCount:
Reads: vs_indirect_draws[], vs_draw_count
Vertex shader reads: vs_instance_ids[gl_InstanceIndex]

Performance Characteristics

Draw call reduction: 5000 visible cubes with 1 geometry type: 5000 draws → 1 instanced draw.

Memory overhead: Instance ID buffer: uniqueGeometryCount × MAX_VS_INSTANCES_PER_GEO × 4 bytes (8 MiB for 128 geometry types).
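The footprint is straightforward to compute, since every geometry type reserves its full fixed-stride bin of 4-byte slots regardless of how many instances survive:

```cpp
#include <cstdint>

constexpr uint32_t MAX_VS_INSTANCES_PER_GEO = 16384;

// Instance-ID buffer footprint: one uint32 slot per potential instance.
constexpr uint64_t vsInstanceBufferBytes(uint32_t uniqueGeometryCount) {
    return uint64_t(uniqueGeometryCount) * MAX_VS_INSTANCES_PER_GEO * sizeof(uint32_t);
}
```

For 128 geometry types this comes to 128 × 16384 × 4 = 8,388,608 bytes (8 MiB).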

Compute overhead: 3 lightweight dispatches (O(visible_instances) + O(unique_geometries)).

Routing threshold: Single-meshlet geometry (meshletCount == 1) automatically uses VS path. Multi-meshlet geometry uses mesh shader path. No manual configuration required.

Barriers

// After PrimitiveCulling
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (READ)
// After VSBinningAllocator
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (READ)
// After VSInstanceUnpacking
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (READ) | VK_PIPELINE_STAGE_2_VERTEX_SHADER_BIT (READ)
// After VSPrepareDraw
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_DRAW_INDIRECT_BIT (READ)

Related Documentation

  • VR Renderer
  • Shader Bindings
  • Rendering Data Manager