Vulkan Schnee 0.0.1
High-performance rendering engine
GPU-Driven Rendering Pipeline

The engine uses a 4-stage compute pipeline to cull, bin, and prepare geometry for mesh shader rendering. All four stages run on the GPU, eliminating CPU-side draw call generation.

Pipeline Overview

PrimitiveCulling → MeshletBinning → MeshletUnpacking → PrepareDraw → Mesh Shader
    (Stage 0)        (Stage 1)         (Stage 2)       (Stage 3)

Each stage reads outputs from the previous stage. The pipeline processes primitives (scene objects) through frustum culling, groups them by material, linearizes their meshlets, and generates indirect draw commands.

Stage 0: PrimitiveCulling

Shader: PrimitiveCulling.comp

Tests primitives against view frustum and Hi-Z occlusion. Operates in two passes for accurate occlusion culling.

Pass 1

  • Tests all primitives against previous frame's Hi-Z pyramid
  • Survivors write to culling_survivors[]
  • Primitives that pass frustum but fail Hi-Z write to culling_failed[] for Pass 2

Pass 2

  • Re-tests culling_failed[] primitives against current frame's Hi-Z
  • Appends newly visible primitives to culling_survivors[]
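The two-pass flow can be modeled on the CPU as a small classification routine (a sketch; the struct and function names are illustrative stand-ins for the GPU visibility tests, not engine API):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the per-primitive GPU test results.
struct VisibilityTests {
    bool inFrustum;
    bool passesPrevHiZ;  // Pass 1: previous frame's pyramid
    bool passesCurrHiZ;  // Pass 2: current frame's pyramid
};

// Classifies one frame's primitives the way the two passes do.
void cullTwoPass(const std::vector<VisibilityTests>& prims,
                 std::vector<uint32_t>& survivors,
                 std::vector<uint32_t>& failed) {
    // Pass 1: frustum + stale Hi-Z.
    for (uint32_t id = 0; id < prims.size(); ++id) {
        if (!prims[id].inFrustum) continue;          // rejected outright
        if (prims[id].passesPrevHiZ) survivors.push_back(id);
        else                         failed.push_back(id);
    }
    // Pass 2: re-test only the Pass-1 Hi-Z failures against fresh Hi-Z.
    for (uint32_t id : failed)
        if (prims[id].passesCurrHiZ) survivors.push_back(id);
}
```

Note that a primitive recovered in Pass 2 simply appends to the same survivor list, so downstream stages never distinguish the two passes.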

Culling Tests

Frustum culling: Tests bounding sphere against 6 frustum planes for both eyes. Primitive passes if visible in either eye.
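The sphere-vs-frustum test reduces to six signed-distance checks. A minimal host-side sketch (the plane layout and names are assumptions; the sign convention puts "inside" at non-negative distance):

```cpp
#include <array>

struct Plane  { float nx, ny, nz, d; };  // n·p + d >= 0 means "inside"
struct Sphere { float x, y, z, r; };

// True if the sphere is at least partially inside all six planes.
bool sphereInFrustum(const Sphere& s, const std::array<Plane, 6>& planes) {
    for (const Plane& p : planes) {
        float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.r) return false;  // fully behind one plane: culled
    }
    return true;
}

// Per the pipeline above, a primitive survives if visible in either eye.
bool visibleEitherEye(const Sphere& s,
                      const std::array<Plane, 6>& leftEye,
                      const std::array<Plane, 6>& rightEye) {
    return sphereInFrustum(s, leftEye) || sphereInFrustum(s, rightEye);
}
```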

Hi-Z occlusion: Projects bounding sphere to screen space, samples Hi-Z mipmap pyramid at appropriate LOD. Conservative: only culls objects behind ALL scene geometry in their screen tile.

GPU transform optimization: Computes world-space bounds from local bounds + world matrix on GPU. Eliminates CPU iteration for transform updates.
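Transforming a local bounding sphere by a world matrix amounts to transforming the center and scaling the radius by the largest basis-column length, which stays conservative under non-uniform scale. A sketch of that computation (types and names are illustrative, not engine API):

```cpp
#include <algorithm>
#include <cmath>

struct Sphere { float x, y, z, r; };

// Column-major 4x4 world matrix, as typically uploaded to the GPU.
struct Mat4 { float m[16]; };  // m[col * 4 + row]

// Conservative world-space bounds: transform the center, then scale the
// radius by the length of the longest basis column.
Sphere worldBounds(const Sphere& local, const Mat4& w) {
    auto colLen = [&](int c) {
        float x = w.m[c*4+0], y = w.m[c*4+1], z = w.m[c*4+2];
        return std::sqrt(x*x + y*y + z*z);
    };
    float maxScale = std::max({colLen(0), colLen(1), colLen(2)});
    Sphere out;
    out.x = w.m[0]*local.x + w.m[4]*local.y + w.m[8]*local.z  + w.m[12];
    out.y = w.m[1]*local.x + w.m[5]*local.y + w.m[9]*local.z  + w.m[13];
    out.z = w.m[2]*local.x + w.m[6]*local.y + w.m[10]*local.z + w.m[14];
    out.r = local.r * maxScale;
    return out;
}
```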

Path Routing

PrimitiveCulling routes primitives to three rendering paths:

Mesh shader path (multi-meshlet): Primitives with meshletCount > 1 → writes primitive ID to culling_survivors[] → processed by MeshletBinning

Vertex shader path (single-meshlet): Primitives with meshletCount == 1 → writes primitive ID to vs_visible_instances[] → separate instanced VS pipeline

LOD path (cluster-based): Primitives with LOD data → per-cluster selection → writes ClusterSurvivor[] to lod_cluster_survivors[] → processed by MeshletBinning
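In host-side terms the routing is a three-way switch on per-primitive metadata. A sketch (the struct fields are assumptions derived from the buffer names above, not engine API):

```cpp
#include <cstdint>

enum class Path { MeshShader, VertexShader, Lod };

struct PrimitiveMeta {
    uint32_t meshletCount;
    bool     hasLodData;
};

// Mirrors the routing described above: LOD data takes the cluster path,
// single-meshlet primitives take the instanced VS path, the rest go to
// the mesh shader path.
Path routePrimitive(const PrimitiveMeta& p) {
    if (p.hasLodData)        return Path::Lod;
    if (p.meshletCount == 1) return Path::VertexShader;
    return Path::MeshShader;
}
```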

Inputs

  • local_bounds_buffer[] - Static local bounding spheres (uploaded once)
  • per_object_transforms[] - World matrices (updated per frame)
  • primitive_meshlet_data[] - Meshlet counts and pipeline IDs
  • frustum_planes - View frustum for both eyes (UBO)
  • u_hiZPyramid - Previous frame's depth pyramid (Pass 1)
  • u_hiZPyramidCurrent - Current frame's depth pyramid (Pass 2)

Outputs

  • culling_survivors[] - Primitive IDs for mesh shader path
  • cull_count - Atomic counter for survivors
  • vs_visible_instances[] - Primitive IDs for VS path
  • vs_visible_count - Atomic counter for VS survivors
  • lod_cluster_survivors[] - LOD cluster data with dither factors
  • lod_cluster_survivor_count - Atomic counter for LOD clusters
  • culling_failed[] - Primitive IDs that failed Pass 1 Hi-Z (for Pass 2)
  • culling_failed_count - Atomic counter for failed primitives

Dispatch

vkCmdDispatch(cmd, (primitiveCount + threadCount - 1) / threadCount, 1, 1);
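The `(n + threadCount - 1) / threadCount` expression, used by every fixed-size dispatch in this document, is integer ceiling division: the smallest workgroup count whose total thread count covers all items.

```cpp
#include <cstdint>

// Number of workgroups needed so that groups * threadCount >= items.
constexpr uint32_t groupCount(uint32_t items, uint32_t threadCount) {
    return (items + threadCount - 1) / threadCount;
}
```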

Stage 1: MeshletBinning (BinningAllocator)

Shader: MeshletBinningAllocator.comp

Groups meshlets by rendering pipeline (material/shader type) to batch draw calls. All meshlets using the same material get binned together for a single indirect draw command.

Algorithm

Processes two input streams:

  1. Regular primitive survivors (indices [0, survivor_count))
  2. LOD cluster survivors (indices [survivor_count, survivor_count + lod_cluster_survivor_count))

For each survivor:

  1. Look up pipeline ID from primitive_meshlet_data[primitiveID].pipelineID (which material/shader this object uses)
  2. Use subgroup operations to batch atomic increments per pipeline
  3. Allocate write offset in that pipeline's bin
  4. Store allocation info for Stage 2

The output groups all meshlets that share the same rendering pipeline together. Mesh shader Stage 3 dispatches one indirect draw per pipeline, processing all binned meshlets in a single call.

Subgroup optimization: Threads targeting the same pipeline aggregate their meshlet counts and perform a single atomic add per subgroup instead of per thread. Reduces atomic contention by ~32x (typical subgroup size).
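The effect of the aggregation can be modeled on the CPU: instead of one atomic add per thread, lanes that share a pipeline combine their counts first and issue one add per distinct pipeline in the subgroup. A simulation sketch that counts the atomics issued (illustrative only; the real shader uses GLSL subgroup intrinsics):

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Result {
    uint32_t atomics;                     // atomic adds issued
    std::map<uint32_t, uint32_t> totals;  // per-pipeline meshlet totals
};

// Simulates one subgroup: each lane contributes (pipelineID, meshletCount).
Result aggregateSubgroup(const std::vector<std::pair<uint32_t, uint32_t>>& lanes) {
    std::map<uint32_t, uint32_t> partial;
    for (auto [pipe, count] : lanes) partial[pipe] += count;  // in-subgroup reduce
    Result r{0, {}};
    for (auto [pipe, sum] : partial) {                        // one atomic per pipeline
        r.totals[pipe] += sum;
        r.atomics++;
    }
    return r;
}
```

With 32 lanes all targeting one pipeline, the naive scheme issues 32 atomic adds while the aggregated scheme issues 1, which is the ~32x reduction cited above.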

Inputs

  • culling_survivors[] - Primitive IDs from PrimitiveCulling
  • survivor_count - Count buffer from PrimitiveCulling
  • lod_cluster_survivors[] - LOD cluster data from PrimitiveCulling
  • lod_cluster_survivor_count - LOD counter from PrimitiveCulling
  • primitive_meshlet_data[] - Meshlet start/count/pipeline per primitive
  • cluster_lod_data[] - Meshlet index lookup for LOD clusters

Outputs

  • allocations[] - Per-primitive/cluster allocation info:
    struct PrimitiveAllocation {
        uint primitiveID;       // For transform/material lookup
        uint pipelineID;        // Which material bin
        uint binWriteOffset;    // Where to write in bin
        uint meshletStartIndex; // Global meshlet start
        uint meshletCount;      // Number of meshlets
        float ditherFactor;     // LOD transition (1.0 for non-LOD)
    };
  • pipeline_meshlet_counts[] - Total meshlets per pipeline (atomic counters)

Dispatch

// Worst case: all primitives survive + all LOD clusters selected
const uint32_t maxWork = primitiveCount + clusterCount;
vkCmdDispatch(cmd, (maxWork + threadCount - 1) / threadCount, 1, 1);

Stage 2: MeshletUnpacking

Shader: MeshletUnpacking.comp

Expands each primitive into linearized meshlet entries within its pipeline bin, producing a fixed-stride array that mesh shaders index into.

Layout

Output buffer uses fixed-stride layout:

[Pipeline0: MAX_MESHLETS_PER_BIN slots][Pipeline1: MAX_MESHLETS_PER_BIN slots]...

Each pipeline reserves MAX_MESHLETS_PER_BIN (262144) slots. Mesh shader computes offset: pipelineIndex * MAX_MESHLETS_PER_BIN + gl_WorkGroupID.x.
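The fixed-stride address computation is a single multiply-add; in isolation, with the stride from above:

```cpp
#include <cstdint>

constexpr uint32_t MAX_MESHLETS_PER_BIN = 262144;

// Slot in binned_meshlet_info[] for workgroup `workGroupId` of pipeline
// `pipelineIndex` (mirrors the mesh shader's offset computation above).
constexpr uint32_t meshletSlot(uint32_t pipelineIndex, uint32_t workGroupId) {
    return pipelineIndex * MAX_MESHLETS_PER_BIN + workGroupId;
}
```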

Algorithm

Reads allocation info from Stage 1, expands each primitive's meshlets:

uint bin_base = alloc.pipelineID * MAX_MESHLETS_PER_BIN;
uint write_start = bin_base + alloc.binWriteOffset;
for (uint i = 0; i < alloc.meshletCount; i++) {
    VisibleMeshletInfo info;
    info.objectID = alloc.primitiveID;
    info.meshletIndex = alloc.meshletStartIndex + i;
    info.ditherFactor = alloc.ditherFactor;
    binned_meshlet_info[write_start + i] = info;
}

Inputs

  • allocations[] - From MeshletBinning
  • survivor_count - Regular survivor count
  • lod_cluster_survivor_count - LOD survivor count

Outputs

  • binned_meshlet_info[] - Linearized VisibleMeshletInfo per pipeline:
    struct VisibleMeshletInfo {
        uint objectID;      // Primitive ID for transform lookup
        uint meshletIndex;  // Global meshlet index
        float ditherFactor; // LOD transition (0.0-1.0)
    };

Dispatch

GPU-driven via indirect dispatch. CountDispatcher reads survivor_count + lod_cluster_survivor_count and generates dispatch commands.

vkCmdDispatchIndirect(cmd, dispatchBuffer, 0);

Stage 3: PrepareDraw

Shader: PrepareDraw.comp

Converts per-pipeline meshlet counts to indirect draw commands for mesh shaders.

Algorithm

One thread per pipeline. Reads meshlet count, writes indirect command:

uint pipelineIndex = gl_GlobalInvocationID.x;
commands[pipelineIndex].groupCountX = pipeline_meshlet_counts[pipelineIndex];
commands[pipelineIndex].groupCountY = 1;
commands[pipelineIndex].groupCountZ = 1;

Mesh shader work group count = meshlet count (one thread group processes one meshlet).

Inputs

  • pipeline_meshlet_counts[] - From MeshletBinning (Stage 1)

Outputs

  • commands[] - VkDrawMeshTasksIndirectCommandEXT per pipeline

Dispatch

vkCmdDispatch(cmd, 1, 1, 1); // One workgroup; one thread per pipeline

Data Flow Summary

CPU Upload:
  - Local bounds (once)
  - Per-object transforms (per frame)
  - Frustum planes (per frame)
Stage 0 (PrimitiveCulling):
  Reads:  local_bounds, transforms, frustum, Hi-Z
  Writes: culling_survivors[], cull_count, vs_visible_instances[], lod_cluster_survivors[]
Stage 1 (MeshletBinning):
  Reads:  culling_survivors[], cull_count, lod_cluster_survivors[]
  Writes: allocations[], pipeline_meshlet_counts[]
Stage 2 (MeshletUnpacking):
  Reads:  allocations[], survivor_count, lod_cluster_survivor_count
  Writes: binned_meshlet_info[]
Stage 3 (PrepareDraw):
  Reads:  pipeline_meshlet_counts[]
  Writes: indirect_draw_commands[]
Mesh Shader:
  Reads: binned_meshlet_info[pipelineIndex * MAX_MESHLETS_PER_BIN + gl_WorkGroupID.x]
  Draws via: vkCmdDrawMeshTasksIndirectEXT(cmd, indirect_draw_commands, ...)

Barriers

Each stage synchronizes via compute-to-compute pipeline barriers:

VkMemoryBarrier2 {
    .srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT,
    .dstStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
    .dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT
}

Final barrier before mesh shader:

VkBufferMemoryBarrier2 {
    .srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT,
    .dstStageMask  = VK_PIPELINE_STAGE_2_MESH_SHADER_BIT_EXT | VK_PIPELINE_STAGE_2_DRAW_INDIRECT_BIT,
    .dstAccessMask = VK_ACCESS_2_SHADER_READ_BIT | VK_ACCESS_2_INDIRECT_COMMAND_READ_BIT
}

Performance Characteristics

Atomic contention reduction: Subgroup aggregation in MeshletBinning reduces atomic operations by ~32x (one atomic per subgroup instead of per thread).

GPU-driven dispatch: MeshletUnpacking uses indirect dispatch. No CPU readback required to determine work size.

Fixed-stride indexing: Mesh shaders use simple arithmetic (pipelineID * stride + meshletID) instead of indirection chains.

Two-pass occlusion: Pass 1 uses stale Hi-Z (allows overlap with previous frame rendering). Pass 2 catches false negatives with accurate Hi-Z.

Buffer Initialization

Counters and dispatch buffers must be zeroed before PrimitiveCulling:

vkCmdFillBuffer(cmd, cull_count, 0, sizeof(uint32_t), 0);
vkCmdFillBuffer(cmd, vs_visible_count, 0, sizeof(uint32_t), 0);
vkCmdFillBuffer(cmd, lod_cluster_survivor_count, 0, sizeof(uint32_t), 0);
vkCmdFillBuffer(cmd, culling_failed_count, 0, sizeof(uint32_t), 0);
// Fill pipeline_meshlet_counts[32] with zeros
vkCmdFillBuffer(cmd, pipeline_meshlet_counts, 0, 32 * sizeof(uint32_t), 0);

Vertex Shader Offloading Pipeline

Single-meshlet geometry bypasses the mesh shader pipeline using instanced vertex shader rendering. This reduces overhead for tiny meshes and heavily instanced objects.

Routing Decision

PrimitiveCulling routes primitives based on meshlet count:

  • meshletCount == 1 → Vertex shader path
  • meshletCount > 1 → Mesh shader path

Single-meshlet primitives write to vs_visible_instances[] instead of culling_survivors[].

Pipeline Stages

PrimitiveCulling → VSBinningAllocator → VSInstanceUnpacking → VSPrepareDraw → vkCmdDrawIndexedIndirectCount
    (Stage 0)        (VS Stage 1)          (VS Stage 2)       (VS Stage 3)

VS Stage 1: VSBinningAllocator

Shader: VSBinningAllocator.comp

Bins visible single-meshlet primitives by geometry type. Groups instances sharing the same mesh to enable instanced drawing.

Algorithm:

Reads primitive IDs from vs_visible_instances[], resolves each primitive's geometry type via instance_culling_data[].meshGeometryId → mesh_geometry_data[].singleMeshletGeoIndex, and allocates write offsets per geometry bin using subgroup-optimized atomics.

uint primitive_id = vs_visible_instances[survivor_idx];
uint mesh_geo_id = instance_culling_data[primitive_id].meshGeometryId;
uint geo_index = mesh_geometry_data[mesh_geo_id].singleMeshletGeoIndex;
// Subgroup optimization batches atomics per geometry type
uint global_offset = atomicAdd(vs_geometry_counters[geo_index], instance_count);

Inputs:

  • vs_visible_instances[] - Primitive IDs from PrimitiveCulling
  • vs_visible_count - Number of VS path survivors
  • instance_culling_data[] - Maps primitive ID to geometry ID
  • mesh_geometry_data[] - Contains singleMeshletGeoIndex

Outputs:

  • vs_instance_allocations[] - Per-survivor allocation: (primitiveID, geoIndex, writeOffset)
  • vs_geometry_counters[] - Atomic instance count per geometry type

Dispatch:

vkCmdDispatch(cmd, (primitiveCount + threadCount - 1) / threadCount, 1, 1);

VS Stage 2: VSInstanceUnpacking

Shader: VSInstanceUnpacking.comp

Writes primitive IDs to fixed-stride instance buffer organized by geometry type.

Buffer Layout:

[Geo0: slots 0..MAX-1][Geo1: slots MAX..2*MAX-1][Geo2: ...]

Each geometry type reserves MAX_VS_INSTANCES_PER_GEO (16384) slots.

Algorithm:

uint dest_idx = alloc.geoIndex * MAX_VS_INSTANCES_PER_GEO + alloc.writeOffset;
vs_instance_ids[dest_idx] = alloc.primitiveID;

Vertex shader reads: uint objectID = vs_instance_ids[gl_InstanceIndex];
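This lookup works because in Vulkan `gl_InstanceIndex` equals `firstInstance + instanceWithinDraw`, and VSPrepareDraw sets `firstInstance` to the geometry's bin base. The index math in isolation (function name is illustrative):

```cpp
#include <cstdint>

constexpr uint32_t MAX_VS_INSTANCES_PER_GEO = 16384;

// Vulkan semantics: gl_InstanceIndex = firstInstance + instanceWithinDraw,
// where firstInstance = geoIdx * MAX_VS_INSTANCES_PER_GEO per the layout above.
constexpr uint32_t instanceIndex(uint32_t geoIdx, uint32_t instanceWithinDraw) {
    uint32_t firstInstance = geoIdx * MAX_VS_INSTANCES_PER_GEO;
    return firstInstance + instanceWithinDraw;
}
```

So instance 3 of geometry 1 lands on slot 16387, exactly where VSInstanceUnpacking wrote it.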

Inputs:

  • vs_instance_allocations[] - From VSBinningAllocator
  • vs_visible_count - Number of survivors

Outputs:

  • vs_instance_ids[] - Fixed-stride buffer for vertex shader lookup

Dispatch:

vkCmdDispatch(cmd, (primitiveCount + threadCount - 1) / threadCount, 1, 1);

VS Stage 3: VSPrepareDraw

Shader: VSPrepareDraw.comp

Generates one VkDrawIndexedIndirectCommand per geometry type with visible instances.

Algorithm:

One thread per geometry type. Reads instance count, writes indirect command if count > 0.

uint instance_count = vs_geometry_counters[geo_idx];
if (instance_count == 0) return;
uint slot = atomicAdd(vs_draw_count, 1); // Compact: only non-empty geometries emit a draw
vs_indirect_draws[slot].indexCount    = geo.indexCount;
vs_indirect_draws[slot].instanceCount = instance_count;
vs_indirect_draws[slot].firstIndex    = geo.firstIndex;
vs_indirect_draws[slot].vertexOffset  = geo.vertexOffset;
vs_indirect_draws[slot].firstInstance = geo_idx * MAX_VS_INSTANCES_PER_GEO; // Base into instance buffer

Subgroup optimization: one thread is elected to allocate draw command slots for all geometries handled by the subgroup.

Inputs:

  • single_meshlet_geo[] - Static geometry data (index/vertex info)
  • vs_geometry_counters[] - Instance counts from VSBinningAllocator

Outputs:

  • vs_indirect_draws[] - VkDrawIndexedIndirectCommand per geometry
  • vs_draw_count - Number of commands generated (atomic counter)

Push Constants:

  • uniqueGeometryCount - Number of geometry types to process

Dispatch:

vkCmdDispatch(cmd, (uniqueGeometryCount + threadCount - 1) / threadCount, 1, 1);

Vertex Shader Draw

Command:

vkCmdDrawIndexedIndirectCount(
    cmd,
    indirectBuffer,  // vs_indirect_draws[]
    0,
    countBuffer,     // vs_draw_count
    0,
    maxDrawCount,    // uniqueGeometryCount
    stride
);

Vertex Shader Lookup:

Placeholder.vert maps gl_InstanceIndex to object ID:

uint objectID = vs_instance_ids[gl_InstanceIndex];
mat4 worldMatrix = objects[objectID].worldMatrix;
gl_Position = viewProjection.matrices[gl_ViewIndex] * worldMatrix * vec4(inPosition.xyz, 1.0);

Data Flow Summary

PrimitiveCulling:
Input: primitive bounds, transforms, frustum
Output: vs_visible_instances[] (primitive IDs)
VSBinningAllocator:
Input: vs_visible_instances[], geometry metadata
Output: vs_instance_allocations[], vs_geometry_counters[]
VSInstanceUnpacking:
Input: vs_instance_allocations[]
Output: vs_instance_ids[] (fixed-stride per geometry)
VSPrepareDraw:
Input: vs_geometry_counters[], geometry data
Output: vs_indirect_draws[], vs_draw_count
DrawIndexedIndirectCount:
Reads: vs_indirect_draws[], vs_draw_count
Vertex shader reads: vs_instance_ids[gl_InstanceIndex]

Performance Characteristics

Draw call reduction: 5000 visible cubes with 1 geometry type: 5000 draws → 1 instanced draw.

Memory overhead: Instance ID buffer: uniqueGeometryCount × MAX_VS_INSTANCES_PER_GEO × 4 bytes (8 MiB for 128 geometry types).
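The footprint is straightforward to compute, since every geometry type reserves its full fixed-stride bin of 4-byte slots regardless of how many instances survive:

```cpp
#include <cstdint>

constexpr uint32_t MAX_VS_INSTANCES_PER_GEO = 16384;

// Instance-ID buffer footprint: one uint32 slot per potential instance.
constexpr uint64_t vsInstanceBufferBytes(uint32_t uniqueGeometryCount) {
    return uint64_t(uniqueGeometryCount) * MAX_VS_INSTANCES_PER_GEO * sizeof(uint32_t);
}
```

For 128 geometry types this comes to 128 × 16384 × 4 = 8,388,608 bytes (8 MiB).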

Compute overhead: 3 lightweight dispatches (O(visible_instances) + O(unique_geometries)).

Routing threshold: Single-meshlet geometry (meshletCount == 1) automatically uses VS path. Multi-meshlet geometry uses mesh shader path. No manual configuration required.

Barriers

// After PrimitiveCulling
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (READ)
// After VSBinningAllocator
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (READ)
// After VSInstanceUnpacking
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (READ) | VK_PIPELINE_STAGE_2_VERTEX_SHADER_BIT (READ)
// After VSPrepareDraw
VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT (WRITE)
→ VK_PIPELINE_STAGE_2_DRAW_INDIRECT_BIT (READ)

Related Documentation

  • VR Renderer
  • Shader Bindings
  • Rendering Data Manager