You already know the deal. You've written vertex shaders that nudge positions, fragment shaders that paint pixels. The pipeline is your friend. But let's take a step back and look at what it actually is — not to learn it, but to name the thing we're about to leave behind.
The pipeline is powerful. It's also opinionated. Your vertex shader runs once per vertex — the mesh decides how many times. Your fragment shader runs once per fragment — the screen resolution and geometry decide. You don't get to choose.
And the data flow is fixed. Vertices come in. Colours come out. You can be creative within those constraints — we all have been — but the structure is rigid. Your fragment shader can't say "hey, what colour did the fragment next to me get?" It can't write to an arbitrary location in a buffer. It can't decide to spawn more work.
For rendering, this is fine. The pipeline was designed for rendering, and it's spectacularly good at it.
But what about everything else? What if you want to simulate ten thousand particles and update their positions? What if you want to blur an image with a kernel that needs to read neighbouring pixels? What if you want to sort a list, run a physics step, build a histogram of brightness values?
You can hack some of these into fragment shaders — render a fullscreen quad, encode your data into textures, read it back. People did this for years. It works, but it's like delivering a letter by putting it inside a pizza box and handing it to the pizza delivery driver. Technically functional. Architecturally absurd.
There should be a way to just say: here's some data, here's a program, run it a bunch of times. No conveyor belt. No pizza boxes.
There is.
A compute shader is exactly that. A program that runs on the GPU with no pipeline around it. No vertices feeding in, no framebuffer coming out. Just:
1. Here's a buffer of data.
2. Here's a program.
3. Run it N times.
That's it. That's the whole concept.
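In WGSL, the whole concept fits in a dozen lines. A minimal sketch, with illustrative binding numbers and an equally illustrative "double every element" program:

```wgsl
// One storage buffer in, one out. The binding numbers are whatever
// your bind group layout says; these are just examples.
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // Guard against the last workgroup overshooting the buffer length.
    if (id.x >= arrayLength(&input)) { return; }
    // "Run it N times": each invocation handles one element.
    output[id.x] = input[id.x] * 2.0;
}
```

Bind a buffer, dispatch enough workgroups to cover it, and every element gets processed in parallel. That's the buffer, the program, and the N.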
On the compute side there is no rasterizer, no blending stage, no depth test. You're not shading fragments; you're just running code. The GPU is now a general-purpose parallel processor, and you're the one deciding what it does.
The inputs and outputs are buffers — chunks of memory you define. They can hold anything: floats, vectors, structs, a million particle positions, an image's worth of pixel data. You read what you need, compute what you want, write the result wherever you want.
And here's the part that might feel strange coming from the frag shader world: there's no implicit "current pixel." A fragment shader always knows what fragment it's shading, because the rasterizer told it. A compute shader has no such context — it's not attached to any geometry. So you need a different way to know "which piece of the work am I doing?"
That's where invocations come in.
When you dispatch a compute shader, you're telling the GPU: "run this program in a grid of invocations." Each invocation is one execution of your shader — one little worker. And every worker gets a unique ID, so it knows which slice of the problem it owns.
But the GPU doesn't just launch a flat list of workers. It organises them into workgroups — small teams of invocations that execute together on the same compute unit. You choose two things:
Workgroup size — how many invocations per team (defined in the shader)
Dispatch size — how many workgroups to launch (defined CPU-side)
Total invocations = workgroup size × dispatch size. Let's make this concrete.
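As a sketch of where each number lives (the counts here are illustrative):

```wgsl
// Shader side: the workgroup size is baked into the shader.
// 64 invocations per team.
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // id.x runs from 0 up to (64 × number of workgroups) - 1.
}

// CPU side (WebGPU JavaScript, shown here as a comment):
//   pass.dispatchWorkgroups(16);   // launch 16 workgroups
// 64 invocations/workgroup × 16 workgroups = 1024 total invocations.
```

Change either number and the total changes; the shader code itself stays the same.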
So when you see @builtin(global_invocation_id) in a compute shader, that's all it is: your invocation's address in the big grid. If your buffer stores a 512×512 image, your global ID is the pixel coordinate. You use it to index into the buffer, do your work, and write the result.
And @builtin(local_invocation_id)? That's your position within the team. It doesn't matter much if every invocation is independent — but it matters a lot when your team needs to cooperate. We'll get there.
First, let's talk about what you're actually reading and writing.
In a fragment shader, the output is implicit: you write to a colour (and maybe depth), and the pipeline puts it in the framebuffer. You don't choose where it goes — the rasterizer already decided which pixel you're shading.
In a compute shader, you're talking directly to memory. You bind storage buffers — readable, writable chunks of typed data — and you access them however you want.
Here's the key thing to notice: each worker indexes into the buffer using its own ID. Worker 0 reads input[0], worker 1 reads input[1], and so on. The buffer is just an array, and the global invocation ID is just the index.
But reads don't have to stay in your own lane. In a kernel that averages each element with its neighbours, worker 5 doesn't just read input[5]; it also reads input[4] and input[6]. That's perfectly legal. Any invocation can read any position in a storage buffer.
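A sketch of such a neighbour-averaging kernel in WGSL (bindings illustrative; the edges are clamped so every read stays in bounds):

```wgsl
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    let n = arrayLength(&input);
    if (id.x >= n) { return; }
    // Clamp at the edges so worker 0 and worker n-1 stay in bounds.
    let left  = input[max(id.x, 1u) - 1u];
    let right = input[min(id.x + 1u, n - 1u)];
    output[id.x] = (left + input[id.x] + right) / 3.0;
}
```

Each invocation still writes only its own slot, but it reads three: its own and both neighbours'.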
So far, that's fine for reading. But what about when invocations need to share intermediate results with each other — not through the big global buffer, but quickly, within a workgroup? That's when things get interesting.
Here's a scenario. You want to find the maximum value in a tile of pixels — say, an 8×8 block. One approach: every invocation writes its value to the output buffer, then you do a second pass on the CPU. That works, but it's slow — you're bouncing data back through the bus.
Better approach: let the 64 invocations in a workgroup cooperate on the GPU itself. They can do this via workgroup shared memory — a small, fast block of memory that all invocations in a workgroup can see.
But there's a catch. GPU invocations don't all execute in lockstep the way you might imagine. Within a workgroup, invocations run in sub-batches (often 32, called "warps" or "waves"). So if invocation 0 writes a value to shared memory, invocation 33 might not be running yet — and if it tries to read that value, it'll get garbage.
The solution: barriers. A workgroup barrier says "everyone stop here and wait until ALL invocations in the workgroup have reached this point." After the barrier, every value written before it is guaranteed to be visible.
This is the mental model for shared memory: it's a scratchpad that lives on-chip, close to the compute unit. It's much faster than reading from a global buffer, but it's only visible within one workgroup, and you need barriers to keep everyone in sync.
This is also why workgroup size matters. A 64-invocation workgroup can do a 64-element reduction internally. A 256-invocation workgroup can handle 256 elements. The workgroup size determines how much cooperation you can do before you need to go back to global memory.
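Putting the pieces together, here is a sketch of that 64-element maximum reduction: shared memory, barriers, and a tree of max() steps. The bindings and names are illustrative, and for brevity it assumes the input length is a multiple of 64.

```wgsl
// On-chip scratchpad, visible only within one workgroup.
var<workgroup> tile: array<f32, 64>;

@group(0) @binding(0) var<storage, read> values: array<f32>;
@group(0) @binding(1) var<storage, read_write> tile_max: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>,
        @builtin(local_invocation_id) lid: vec3<u32>,
        @builtin(workgroup_id) wid: vec3<u32>) {
    // Each invocation parks its value in shared memory.
    // (Assumes the buffer length is a multiple of 64.)
    tile[lid.x] = values[gid.x];
    workgroupBarrier();

    // Tree reduction: 64 -> 32 -> 16 -> ... -> 1 surviving value.
    var stride = 32u;
    while (stride > 0u) {
        if (lid.x < stride) {
            tile[lid.x] = max(tile[lid.x], tile[lid.x + stride]);
        }
        workgroupBarrier(); // everyone syncs before the next halving
        stride = stride / 2u;
    }

    // Invocation 0 writes the team's result to global memory.
    if (lid.x == 0u) {
        tile_max[wid.x] = tile[0];
    }
}
```

Note that every barrier sits outside the `if`, in uniform control flow, so all 64 invocations reach it; a barrier that only some invocations reach is undefined behaviour.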
Let's walk through a real compute shader — a Gaussian blur on an image. This is a task you could do in a fragment shader (sample the texture at offsets, average the results), but a compute shader can do it more efficiently by using shared memory to avoid redundant texture reads.
The idea: each workgroup loads a tile of pixels (plus a border of "apron" pixels for the blur radius) into shared memory. Then each invocation reads from shared memory instead of the global texture. Since many invocations need overlapping neighbourhoods, this avoids reading the same texels over and over from slow global memory.
Here's the naive version first — no shared memory, just straight global reads:
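Something like the following, where the bindings, the output texture format, and the Gaussian weight helper are all illustrative choices:

```wgsl
@group(0) @binding(0) var src: texture_2d<f32>;
@group(0) @binding(1) var dst: texture_storage_2d<rgba8unorm, write>;

const RADIUS: i32 = 2; // 5x5 kernel

// Illustrative Gaussian weight (sigma ≈ 1); a real shader would
// precompute the 25 weights into a constant array.
fn gauss(dx: i32, dy: i32) -> f32 {
    return exp(-f32(dx * dx + dy * dy) / 2.0);
}

@compute @workgroup_size(16, 16)
fn blur(@builtin(global_invocation_id) gid: vec3<u32>) {
    let dims = vec2<i32>(textureDimensions(src));
    if (i32(gid.x) >= dims.x || i32(gid.y) >= dims.y) { return; }

    var sum = vec4<f32>(0.0);
    var wsum = 0.0;
    for (var dy = -RADIUS; dy <= RADIUS; dy++) {
        for (var dx = -RADIUS; dx <= RADIUS; dx++) {
            // 25 reads from global texture memory per invocation.
            let p = clamp(vec2<i32>(gid.xy) + vec2(dx, dy),
                          vec2<i32>(0), dims - vec2<i32>(1));
            let w = gauss(dx, dy);
            sum += textureLoad(src, p, 0) * w;
            wsum += w;
        }
    }
    textureStore(dst, vec2<i32>(gid.xy), sum / wsum);
}
```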
This works fine. For a 5×5 blur, each invocation does 25 texture reads. In a 16×16 workgroup, that's 6,400 reads — but many of those overlap with neighbouring invocations. The texel at position (8, 8) gets read by every invocation within a 2-pixel radius of it.
Now here's the shared-memory version. The workgroup cooperatively loads a tile (including the apron border), syncs, and then each thread reads from the fast local copy:
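Again as a sketch, with the same illustrative bindings and a 20×20 shared tile (16×16 workgroup plus a radius-2 apron on every side):

```wgsl
const RADIUS: i32 = 2;
const TILE: i32 = 16 + 2 * RADIUS;          // 20x20 = 400 shared texels

var<workgroup> tile: array<vec4<f32>, 400>; // TILE * TILE

@group(0) @binding(0) var src: texture_2d<f32>;
@group(0) @binding(1) var dst: texture_storage_2d<rgba8unorm, write>;

fn gauss(dx: i32, dy: i32) -> f32 {
    return exp(-f32(dx * dx + dy * dy) / 2.0); // sigma ≈ 1, illustrative
}

@compute @workgroup_size(16, 16)
fn blur(@builtin(workgroup_id) wg: vec3<u32>,
        @builtin(local_invocation_index) local_idx: u32,
        @builtin(local_invocation_id) lid: vec3<u32>,
        @builtin(global_invocation_id) gid: vec3<u32>) {
    let dims = vec2<i32>(textureDimensions(src));
    let origin = vec2<i32>(wg.xy) * 16 - vec2(RADIUS); // apron's top-left

    // Cooperative load: 256 threads stride across the 400-texel tile,
    // so each thread loads one or two texels from global memory.
    var i = i32(local_idx);
    while (i < TILE * TILE) {
        let p = clamp(origin + vec2(i % TILE, i / TILE),
                      vec2<i32>(0), dims - vec2<i32>(1));
        tile[i] = textureLoad(src, p, 0);
        i += 256;
    }
    workgroupBarrier(); // wait until the whole tile is loaded

    if (i32(gid.x) >= dims.x || i32(gid.y) >= dims.y) { return; }

    // 25 reads per thread, all from fast shared memory.
    var sum = vec4<f32>(0.0);
    var wsum = 0.0;
    for (var dy = -RADIUS; dy <= RADIUS; dy++) {
        for (var dx = -RADIUS; dx <= RADIUS; dx++) {
            let t = vec2<i32>(lid.xy) + vec2(RADIUS) + vec2(dx, dy);
            let w = gauss(dx, dy);
            sum += tile[t.y * TILE + t.x] * w;
            wsum += w;
        }
    }
    textureStore(dst, vec2<i32>(gid.xy), sum / wsum);
}
```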
The cooperative load phase is the clever bit. Each of the 256 threads in the workgroup loads one or two texels from global memory into the shared tile (400 ÷ 256 ≈ 1.6 on average). Then, once the barrier confirms everyone has finished loading, each thread does its 25 reads from shared memory instead of global memory. That's a massive reduction in global texture traffic.
So you've got this new tool. When should you actually use it?
The honest answer: not always. A fragment shader is still the right tool when you're producing one colour per pixel and each pixel is independent. The pipeline sets up the invocations for you, handles the output, and it's less code. Don't over-engineer.
Compute shaders earn their place when you need one or more of these:
Arbitrary output locations. A fragment shader writes to its own pixel. A compute shader can write anywhere — scatter patterns, histograms, indirect draw buffers.
Cross-invocation cooperation. Shared memory and barriers let invocations within a workgroup talk to each other. Fragment shaders can't do this.
Non-image data. Particle positions, physics state, sort keys, BVH nodes — compute shaders work on buffers of any structure, not just textures.
Multi-pass efficiency. When an algorithm has stages that feed into each other (like separable blur: horizontal pass, then vertical pass), compute shaders with shared memory can avoid redundant global memory traffic.
Variable work per element. A fragment shader always runs once per fragment. A compute shader can branch, skip, or spawn indirect dispatches.
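As one concrete example of arbitrary output locations, a brightness histogram is a pure scatter: many invocations may land on the same bin at once, so the writes go through atomics. A sketch, with illustrative bindings and a 256-bin layout:

```wgsl
// Brightness values assumed normalised to the range 0..1.
@group(0) @binding(0) var<storage, read> brightness: array<f32>;
@group(0) @binding(1) var<storage, read_write> histogram: array<atomic<u32>, 256>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    if (id.x >= arrayLength(&brightness)) { return; }
    // Scatter write: any invocation can increment any bin, and
    // atomicAdd keeps concurrent increments from losing counts.
    let bin = u32(clamp(brightness[id.x], 0.0, 1.0) * 255.0);
    atomicAdd(&histogram[bin], 1u);
}
```

A fragment shader has no way to express this: each fragment owns exactly one output pixel, and there is no "increment whichever bin my value falls into."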
You already knew how to harness the GPU's parallelism. Now you know how to harness it without the conveyor belt.