The engine uses a four-stage compute pipeline to cull, bin, and prepare geometry for mesh shader rendering. All stages run on the GPU, eliminating CPU-side draw call generation.
Pipeline Overview
PrimitiveCulling → MeshletBinning → MeshletUnpacking → PrepareDraw → Mesh Shader
    (Stage 0)         (Stage 1)        (Stage 2)        (Stage 3)
Each stage reads outputs from the previous stage. The pipeline processes primitives (scene objects) through frustum culling, groups them by material, linearizes their meshlets, and generates indirect draw commands.
Stage 0: PrimitiveCulling
Shader: PrimitiveCulling.comp
Tests primitives against view frustum and Hi-Z occlusion. Operates in two passes for accurate occlusion culling.
Pass 1
- Tests all primitives against previous frame's Hi-Z pyramid
- Survivors write to culling_survivors[]
- Primitives that pass frustum but fail Hi-Z write to culling_failed[] for Pass 2
Pass 2
- Re-tests culling_failed[] primitives against current frame's Hi-Z
- Appends newly visible primitives to culling_survivors[]
Culling Tests
Frustum culling: Tests bounding sphere against 6 frustum planes for both eyes. Primitive passes if visible in either eye.
Hi-Z occlusion: Projects bounding sphere to screen space, samples Hi-Z mipmap pyramid at appropriate LOD. Conservative: only culls objects behind ALL scene geometry in their screen tile.
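The Hi-Z test above can be sketched on the CPU. This is a minimal illustration, not the shader's exact math: the mip-selection rule (one texel covers the sphere's screen footprint) and the depth convention (larger depth = farther, i.e. not reverse-Z) are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical sketch of the conservative Hi-Z test. The shader projects the
// bounding sphere to a screen-space footprint and picks the mip level whose
// texel covers that footprint, so a handful of samples bound the whole sphere.
int hizMipForFootprint(float radiusPixels) {
    // A texel at mip level L covers 2^L screen pixels, so ceil(log2(diameter))
    // is the coarsest level at which one texel still covers the footprint.
    float diameter = std::max(2.0f * radiusPixels, 1.0f);
    return static_cast<int>(std::ceil(std::log2(diameter)));
}

// Conservative visibility: cull only if the sphere's nearest depth is farther
// than the farthest depth stored in the Hi-Z texel (reverse-Z would flip this).
bool hizVisible(float sphereNearestDepth, float hizFarthestDepthInTile) {
    return sphereNearestDepth <= hizFarthestDepthInTile;
}
```

Because the pyramid stores the farthest depth per tile, a false "visible" is possible but a false "culled" is not, which is exactly the conservatism the text describes.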
GPU transform optimization: Computes world-space bounds from local bounds + world matrix on GPU. Eliminates CPU iteration for transform updates.
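The bounds transform can be sketched as follows; the `Sphere` type and column-major matrix layout are illustrative assumptions, not the engine's actual data layout. The key point is that the radius is scaled by the largest axis scale so the world-space sphere stays conservative under non-uniform scaling.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

// Sketch of the GPU-side bounds update: transform the local bounding sphere by
// the world matrix. The center is transformed as a point; the radius is scaled
// by the largest column length (the largest axis scale) to stay conservative.
struct Sphere { std::array<float, 3> center; float radius; };

Sphere toWorldBounds(const Sphere& local, const std::array<float, 16>& m /* column-major 4x4 */) {
    Sphere out{};
    for (int r = 0; r < 3; ++r) {
        out.center[r] = m[0 * 4 + r] * local.center[0] + m[1 * 4 + r] * local.center[1] +
                        m[2 * 4 + r] * local.center[2] + m[3 * 4 + r];
    }
    float maxScale = 0.0f;
    for (int c = 0; c < 3; ++c) {
        // Length of each basis column = scale along that axis.
        float len = std::sqrt(m[c * 4 + 0] * m[c * 4 + 0] + m[c * 4 + 1] * m[c * 4 + 1] +
                              m[c * 4 + 2] * m[c * 4 + 2]);
        maxScale = std::max(maxScale, len);
    }
    out.radius = local.radius * maxScale;
    return out;
}
```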
Path Routing
PrimitiveCulling routes primitives to three rendering paths:
Mesh shader path (multi-meshlet): Primitives with meshletCount > 1 → writes primitive ID to culling_survivors[] → processed by MeshletBinning
Vertex shader path (single-meshlet): Primitives with meshletCount == 1 → writes primitive ID to vs_visible_instances[] → separate instanced VS pipeline
LOD path (cluster-based): Primitives with LOD data → per-cluster selection → writes ClusterSurvivor[] to lod_cluster_survivors[] → processed by MeshletBinning
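The three-way routing can be condensed into a small decision function. This is a hedged sketch: the precedence of the LOD path over the other two, and the `hasLodData` flag, are assumptions inferred from the list above.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of the routing decision at the end of PrimitiveCulling:
// each surviving primitive is written to exactly one of the three output streams.
enum class Path { MeshShader, VertexShader, Lod };

Path routePrimitive(uint32_t meshletCount, bool hasLodData) {
    if (hasLodData) return Path::Lod;                  // per-cluster selection
    if (meshletCount == 1) return Path::VertexShader;  // instanced VS pipeline
    return Path::MeshShader;                           // multi-meshlet path
}
```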
Inputs
- local_bounds_buffer[] - Static local bounding spheres (uploaded once)
- per_object_transforms[] - World matrices (updated per frame)
- primitive_meshlet_data[] - Meshlet counts and pipeline IDs
- frustum_planes - View frustum for both eyes (UBO)
- u_hiZPyramid - Previous frame's depth pyramid (Pass 1)
- u_hiZPyramidCurrent - Current frame's depth pyramid (Pass 2)
Outputs
- culling_survivors[] - Primitive IDs for mesh shader path
- cull_count - Atomic counter for survivors
- vs_visible_instances[] - Primitive IDs for VS path
- vs_visible_count - Atomic counter for VS survivors
- lod_cluster_survivors[] - LOD cluster data with dither factors
- lod_cluster_survivor_count - Atomic counter for LOD clusters
- culling_failed[] - Primitive IDs that failed Pass 1 Hi-Z (for Pass 2)
- culling_failed_count - Atomic counter for failed primitives
Dispatch
vkCmdDispatch(cmd, (primitiveCount + threadCount - 1) / threadCount, 1, 1);
Stage 1: MeshletBinning (BinningAllocator)
Shader: MeshletBinningAllocator.comp
Groups meshlets by rendering pipeline (material/shader type) to batch draw calls. All meshlets using the same material get binned together for a single indirect draw command.
Algorithm
Processes two input streams:
- Regular primitive survivors (indices [0, survivor_count))
- LOD cluster survivors (indices [survivor_count, survivor_count + lod_cluster_survivor_count))
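The two streams share one dispatch, so each thread first decides which stream its global index falls into. A minimal sketch (struct and function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// A global thread index below survivor_count reads a regular survivor;
// anything above it reads an LOD cluster at (index - survivor_count).
struct WorkItem { bool isLodCluster; uint32_t streamIndex; };

WorkItem mapThreadToStream(uint32_t tid, uint32_t survivorCount) {
    if (tid < survivorCount) return {false, tid};
    return {true, tid - survivorCount};
}
```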
For each survivor:
- Look up pipeline ID from primitive_meshlet_data[primitiveID].pipelineID (which material/shader this object uses)
- Use subgroup operations to batch atomic increments per pipeline
- Allocate write offset in that pipeline's bin
- Store allocation info for Stage 2
The output groups all meshlets that share the same rendering pipeline together. Mesh shader Stage 3 dispatches one indirect draw per pipeline, processing all binned meshlets in a single call.
Subgroup optimization: Threads targeting the same pipeline aggregate their meshlet counts and perform a single atomic add per subgroup instead of per thread. Reduces atomic contention by ~32x (typical subgroup size).
Inputs
- culling_survivors[] - Primitive IDs from PrimitiveCulling
- survivor_count - Count buffer from PrimitiveCulling
- lod_cluster_survivors[] - LOD cluster data from PrimitiveCulling
- lod_cluster_survivor_count - LOD counter from PrimitiveCulling
- primitive_meshlet_data[] - Meshlet start/count/pipeline per primitive
- cluster_lod_data[] - Meshlet index lookup for LOD clusters
Outputs
- allocations[] - Per-primitive/cluster allocation info:
  struct PrimitiveAllocation {
      uint primitiveID;       // For transform/material lookup
      uint pipelineID;        // Which material bin
      uint binWriteOffset;    // Where to write in bin
      uint meshletStartIndex; // Global meshlet start
      uint meshletCount;      // Number of meshlets
      float ditherFactor;     // LOD transition (1.0 for non-LOD)
  };
- pipeline_meshlet_counts[] - Total meshlets per pipeline (atomic counters)
Dispatch
const uint32_t maxWork = primitiveCount + clusterCount;
vkCmdDispatch(cmd, (maxWork + threadCount - 1) / threadCount, 1, 1);
Stage 2: MeshletUnpacking
Shader: MeshletUnpacking.comp
Expands surviving primitives into linearized meshlet indices within each pipeline bin, producing the fixed-stride array that mesh shaders index into.
Layout
Output buffer uses fixed-stride layout:
[Pipeline0: MAX_MESHLETS_PER_BIN slots][Pipeline1: MAX_MESHLETS_PER_BIN slots]...
Each pipeline reserves MAX_MESHLETS_PER_BIN (262144) slots. Mesh shader computes offset: pipelineIndex * MAX_MESHLETS_PER_BIN + gl_WorkGroupID.x.
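The addressing rule both the writer (MeshletUnpacking) and the reader (mesh shader) agree on is a one-liner; a sketch with a bounds check on the bin capacity:

```cpp
#include <cassert>
#include <cstdint>

// Fixed-stride addressing into binned_meshlet_info[]. The constant mirrors the
// MAX_MESHLETS_PER_BIN value quoted above.
constexpr uint32_t MAX_MESHLETS_PER_BIN = 262144;

uint32_t binSlot(uint32_t pipelineIndex, uint32_t slotInBin) {
    // Overflowing a bin would silently corrupt the next pipeline's bin.
    assert(slotInBin < MAX_MESHLETS_PER_BIN);
    return pipelineIndex * MAX_MESHLETS_PER_BIN + slotInBin;
}
```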
Algorithm
Reads allocation info from Stage 1, expands each primitive's meshlets:
uint bin_base = alloc.pipelineID * MAX_MESHLETS_PER_BIN;
uint write_start = bin_base + alloc.binWriteOffset;
for (uint i = 0; i < alloc.meshletCount; i++) {
    binned_meshlet_info[write_start + i] = VisibleMeshletInfo {
        .objectID = alloc.primitiveID,
        .meshletIndex = alloc.meshletStartIndex + i,
        .ditherFactor = alloc.ditherFactor
    };
}
Inputs
- allocations[] - From MeshletBinning
- survivor_count - Regular survivor count
- lod_cluster_survivor_count - LOD survivor count
Outputs
- binned_meshlet_info[] - Linearized VisibleMeshletInfo per pipeline:
  struct VisibleMeshletInfo {
      uint objectID;      // Primitive ID for transform lookup
      uint meshletIndex;  // Global meshlet index
      float ditherFactor; // LOD transition (0.0-1.0)
  };
Dispatch
GPU-driven via indirect dispatch. CountDispatcher reads survivor_count + lod_cluster_survivor_count and generates dispatch commands.
vkCmdDispatchIndirect(cmd, dispatchBuffer, 0);
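What the CountDispatcher pass computes can be sketched as below; it fills a `VkDispatchIndirectCommand`-shaped triple from the two counters so the CPU never reads them back. The function name and signature are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the GPU-side dispatch-command generation for MeshletUnpacking:
// total work = regular survivors + LOD cluster survivors, rounded up to
// whole workgroups.
struct DispatchIndirectCommand { uint32_t x, y, z; };

DispatchIndirectCommand buildUnpackDispatch(uint32_t survivorCount,
                                            uint32_t lodClusterCount,
                                            uint32_t threadsPerGroup) {
    uint32_t work = survivorCount + lodClusterCount;
    return { (work + threadsPerGroup - 1) / threadsPerGroup, 1, 1 };
}
```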
Stage 3: PrepareDraw
Shader: PrepareDraw.comp
Converts per-pipeline meshlet counts to indirect draw commands for mesh shaders.
Algorithm
One thread per pipeline. Reads meshlet count, writes indirect command:
uint pipelineIndex = gl_GlobalInvocationID.x;
commands[pipelineIndex] = DrawMeshTasksIndirectCommand {
    .groupCountX = pipeline_meshlet_counts[pipelineIndex],
    .groupCountY = 1,
    .groupCountZ = 1
};
Mesh shader work group count = meshlet count (one thread group processes one meshlet).
Inputs
- pipeline_meshlet_counts[] - From MeshletBinning (Stage 1)
Outputs
- commands[] - VkDrawMeshTasksIndirectCommandEXT per pipeline
Dispatch
vkCmdDispatch(cmd, 1, 1, 1);
Data Flow Summary
CPU Upload:
- Local bounds (once)
- Per-object transforms (per frame)
- Frustum planes (per frame)
Stage 0 (PrimitiveCulling):
Reads: local_bounds, transforms, frustum, Hi-Z
Writes: culling_survivors[], cull_count, vs_visible_instances[], lod_cluster_survivors[]
Stage 1 (MeshletBinning):
Reads: culling_survivors[], cull_count, lod_cluster_survivors[]
Writes: allocations[], pipeline_meshlet_counts[]
Stage 2 (MeshletUnpacking):
Reads: allocations[], survivor_count, lod_cluster_survivor_count
Writes: binned_meshlet_info[]
Stage 3 (PrepareDraw):
Reads: pipeline_meshlet_counts[]
Writes: indirect_draw_commands[]
Mesh Shader:
Reads: binned_meshlet_info[pipelineIndex * MAX_MESHLETS_PER_BIN + gl_WorkGroupID.x]
Draws via: vkCmdDrawMeshTasksIndirectEXT(cmd, indirect_draw_commands, ...)
Barriers
Each stage synchronizes via compute-to-compute pipeline barriers:
VkMemoryBarrier2 {
    .srcStageMask = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
    .dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT
}
Final barrier before mesh shader:
VkBufferMemoryBarrier2 {
    .srcStageMask = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_2_MESH_SHADER_BIT_EXT | VK_PIPELINE_STAGE_2_DRAW_INDIRECT_BIT,
    .dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT | VK_ACCESS_2_INDIRECT_COMMAND_READ_BIT
}
Performance Characteristics
Atomic contention reduction: Subgroup aggregation in MeshletBinning reduces atomic operations by ~32x (one atomic per subgroup instead of per thread).
GPU-driven dispatch: MeshletUnpacking uses indirect dispatch. No CPU readback required to determine work size.
Fixed-stride indexing: Mesh shaders use simple arithmetic (pipelineID * stride + meshletID) instead of indirection chains.
Two-pass occlusion: Pass 1 uses stale Hi-Z (allows overlap with previous frame rendering). Pass 2 catches false negatives with accurate Hi-Z.
Buffer Initialization
Counters and dispatch buffers must be zeroed before PrimitiveCulling:
vkCmdFillBuffer(cmd, cull_count, 0, sizeof(uint32_t), 0);
vkCmdFillBuffer(cmd, vs_visible_count, 0, sizeof(uint32_t), 0);
vkCmdFillBuffer(cmd, lod_cluster_survivor_count, 0, sizeof(uint32_t), 0);
vkCmdFillBuffer(cmd, culling_failed_count, 0, sizeof(uint32_t), 0);
vkCmdFillBuffer(cmd, pipeline_meshlet_counts, 0, 32 * sizeof(uint32_t), 0);
Vertex Shader Offloading Pipeline
Single-meshlet geometry bypasses the mesh shader pipeline using instanced vertex shader rendering. This reduces overhead for tiny meshes and heavily instanced objects.
Routing Decision
PrimitiveCulling routes primitives based on meshlet count:
- meshletCount == 1 → Vertex shader path
- meshletCount > 1 → Mesh shader path
Single-meshlet primitives write to vs_visible_instances[] instead of culling_survivors[].
Pipeline Stages
PrimitiveCulling → VSBinningAllocator → VSInstanceUnpacking → VSPrepareDraw → vkCmdDrawIndexedIndirectCount
    (Stage 0)        (VS Stage 1)          (VS Stage 2)       (VS Stage 3)
VS Stage 1: VSBinningAllocator
Shader: VSBinningAllocator.comp
Bins visible single-meshlet primitives by geometry type. Groups instances sharing the same mesh to enable instanced drawing.
Algorithm:
Reads primitive IDs from vs_visible_instances[], resolves geometry type via instance_culling_data[].meshGeometryId → mesh_geometry_data[].singleMeshletGeoIndex, allocates write offsets per geometry bin using subgroup-optimized atomics.
uint primitive_id = vs_visible_instances[survivor_idx];
uint mesh_geo_id = instance_culling_data[primitive_id].meshGeometryId;
uint geo_index = mesh_geometry_data[mesh_geo_id].singleMeshletGeoIndex;
// Subgroup optimization batches atomics per geometry type
uint global_offset = atomicAdd(vs_geometry_counters[geo_index], instance_count);
Inputs:
- vs_visible_instances[] - Primitive IDs from PrimitiveCulling
- vs_visible_count - Number of VS path survivors
- instance_culling_data[] - Maps primitive ID to geometry ID
- mesh_geometry_data[] - Contains singleMeshletGeoIndex
Outputs:
- vs_instance_allocations[] - Per-survivor allocation: (primitiveID, geoIndex, writeOffset)
- vs_geometry_counters[] - Atomic instance count per geometry type
Dispatch:
vkCmdDispatch(cmd, (primitiveCount + threadCount - 1) / threadCount, 1, 1);
VS Stage 2: VSInstanceUnpacking
Shader: VSInstanceUnpacking.comp
Writes primitive IDs to fixed-stride instance buffer organized by geometry type.
Buffer Layout:
[Geo0: slots 0..MAX-1][Geo1: slots MAX..2*MAX-1][Geo2: ...]
Each geometry type reserves MAX_VS_INSTANCES_PER_GEO (16384) slots.
Algorithm:
uint dest_idx = alloc.geoIndex * MAX_VS_INSTANCES_PER_GEO + alloc.writeOffset;
vs_instance_ids[dest_idx] = alloc.primitiveID;
Vertex shader reads: uint objectID = vs_instance_ids[gl_InstanceIndex];
Inputs:
- vs_instance_allocations[] - From VSBinningAllocator
- vs_visible_count - Number of survivors
Outputs:
- vs_instance_ids[] - Fixed-stride buffer for vertex shader lookup
Dispatch:
vkCmdDispatch(cmd, (primitiveCount + threadCount - 1) / threadCount, 1, 1);
VS Stage 3: VSPrepareDraw
Shader: VSPrepareDraw.comp
Generates one VkDrawIndexedIndirectCommand per geometry type with visible instances.
Algorithm:
One thread per geometry type. Reads instance count, writes indirect command if count > 0.
uint instance_count = vs_geometry_counters[geo_idx];
if (instance_count == 0) return;
DrawIndexedIndirectCommand cmd = {
    .indexCount = geo.indexCount,
    .instanceCount = instance_count,
    .firstIndex = geo.firstIndex,
    .vertexOffset = geo.vertexOffset,
    .firstInstance = geo_idx * MAX_VS_INSTANCES_PER_GEO // Base into instance buffer
};
Subgroup optimization: one thread is elected to allocate draw command slots for all geometries handled by its subgroup, batching the atomic on vs_draw_count.
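The compaction behind this stage can be sketched on the CPU: only non-empty geometry bins emit a command, commands are packed contiguously, and the returned count is what `vkCmdDrawIndexedIndirectCount` consumes. The loop stands in for the per-thread shader logic, and `firstIndex`/`vertexOffset` are zeroed for brevity (the shader copies them from the geometry record).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of VSPrepareDraw's output: one packed indirect command per geometry
// type with visible instances. firstInstance encodes the geometry's base slot
// in the fixed-stride vs_instance_ids[] buffer.
struct DrawIndexedIndirect {
    uint32_t indexCount, instanceCount, firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
};
constexpr uint32_t MAX_VS_INSTANCES_PER_GEO = 16384;

uint32_t prepareDraws(const std::vector<uint32_t>& geoCounters,
                      const std::vector<uint32_t>& geoIndexCounts,
                      std::vector<DrawIndexedIndirect>& outCommands) {
    uint32_t drawCount = 0; // the shader maintains this with an atomic
    for (uint32_t g = 0; g < geoCounters.size(); ++g) {
        if (geoCounters[g] == 0) continue; // skip empty bins
        outCommands.push_back({
            geoIndexCounts[g], geoCounters[g], 0, 0,
            g * MAX_VS_INSTANCES_PER_GEO // base into vs_instance_ids[]
        });
        ++drawCount;
    }
    return drawCount;
}
```

In Vulkan, `gl_InstanceIndex` includes `firstInstance`, which is what lets the vertex shader index `vs_instance_ids[gl_InstanceIndex]` directly.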
Inputs:
- single_meshlet_geo[] - Static geometry data (index/vertex info)
- vs_geometry_counters[] - Instance counts from VSBinningAllocator
Outputs:
- vs_indirect_draws[] - VkDrawIndexedIndirectCommand per geometry
- vs_draw_count - Number of commands generated (atomic counter)
Push Constants:
- uniqueGeometryCount - Number of geometry types to process
Dispatch:
vkCmdDispatch(cmd, (uniqueGeometryCount + threadCount - 1) / threadCount, 1, 1);
Vertex Shader Draw
Command:
vkCmdDrawIndexedIndirectCount(
    cmd,
    indirectBuffer,
    0,            // indirect buffer offset
    countBuffer,  // vs_draw_count
    0,            // count buffer offset
    maxDrawCount,
    stride
);
Vertex Shader Lookup:
Placeholder.vert maps gl_InstanceIndex to object ID:
uint objectID = vs_instance_ids[gl_InstanceIndex];
mat4 worldMatrix = objects[objectID].worldMatrix;
gl_Position = viewProjection.matrices[gl_ViewIndex] * worldMatrix * vec4(inPosition.xyz, 1.0);
Data Flow Summary
PrimitiveCulling:
Input: primitive bounds, transforms, frustum
Output: vs_visible_instances[] (primitive IDs)
VSBinningAllocator:
Input: vs_visible_instances[], geometry metadata
Output: vs_instance_allocations[], vs_geometry_counters[]
VSInstanceUnpacking:
Input: vs_instance_allocations[]
Output: vs_instance_ids[] (fixed-stride per geometry)
VSPrepareDraw:
Input: vs_geometry_counters[], geometry data
Output: vs_indirect_draws[], vs_draw_count
DrawIndexedIndirectCount:
Reads: vs_indirect_draws[], vs_draw_count
Vertex shader reads: vs_instance_ids[gl_InstanceIndex]
Performance Characteristics
Draw call reduction: 5000 visible cubes sharing a single geometry type collapse from 5000 draws into one instanced draw.
Memory overhead: Instance ID buffer: uniqueGeometryCount × MAX_VS_INSTANCES_PER_GEO × 4 bytes (8 MB for 128 geometry types).
Compute overhead: 3 lightweight dispatches (O(visible_instances) + O(unique_geometries)).
Routing threshold: Single-meshlet geometry (meshletCount == 1) automatically uses VS path. Multi-meshlet geometry uses mesh shader path. No manual configuration required.
Barriers
PrimitiveCulling → VSBinningAllocator:
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (READ)
VSBinningAllocator → VSInstanceUnpacking:
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (READ)
VSInstanceUnpacking → VSPrepareDraw / vertex shader:
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (READ) | VK_PIPELINE_STAGE_2_VERTEX_SHADER_BIT (READ)
VSPrepareDraw → indirect draw:
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_DRAW_INDIRECT_BIT (READ)
Related Documentation
- VR Renderer
- Shader Bindings
- Rendering Data Manager