You already know the deal. You've written vertex shaders that nudge positions, fragment shaders that paint pixels. The pipeline is your friend. But let's take a step back and look at what it actually is — not to learn it, but to name the thing we're about to leave behind.
The pipeline is powerful. It's also opinionated. Your vertex shader runs once per vertex — the mesh decides how many times. Your fragment shader runs once per fragment — the screen resolution and geometry decide. You don't get to choose.
And the data flow is fixed. Vertices come in. Colours come out. You can be creative within those constraints — we all have been — but the structure is rigid. Your fragment shader can't say "hey, what colour did the fragment next to me get?" It can't write to an arbitrary location in a buffer. It can't decide to spawn more work.
For rendering, this is fine. The pipeline was designed for rendering, and it's spectacularly good at it.
But what about everything else? What if you want to simulate ten thousand particles and update their positions? What if you want to blur an image with a kernel that needs to read neighbouring pixels? What if you want to sort a list, run a physics step, build a histogram of brightness values?
You can hack some of these into fragment shaders — render a fullscreen quad, encode your data into textures, read it back. People did this for years. It works, but it's like delivering a letter by putting it inside a pizza box and handing it to the pizza delivery driver. Technically functional. Architecturally absurd.
There should be a way to just say: here's some data, here's a program, run it a bunch of times. No conveyor belt. No pizza boxes.
There is.
A compute shader is exactly that. A program that runs on the GPU with no pipeline around it. No vertices feeding in, no framebuffer coming out. Just:
1. Here's a buffer of data.
2. Here's a program.
3. Run it N times.
That's it. That's the whole concept.
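In WGSL, the whole concept fits in a dozen lines. A minimal sketch, with illustrative binding numbers and an equally illustrative "double every element" program:

```wgsl
// One storage buffer in, one out. The binding numbers are whatever
// your bind group layout says; these are just examples.
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // Guard against the last workgroup overshooting the buffer length.
    if (id.x >= arrayLength(&input)) { return; }
    // "Run it N times": each invocation handles one element.
    output[id.x] = input[id.x] * 2.0;
}
```

Bind a buffer, dispatch enough workgroups to cover it, and every element gets processed in parallel. That's the buffer, the program, and the N.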
On the compute side there is no rasterizer, no blending stage, no depth test. You're not shading fragments; you're just running code. The GPU is now a general-purpose parallel processor, and you're the one deciding what it does.
The inputs and outputs are buffers — chunks of memory you define. They can hold anything: floats, vectors, structs, a million particle positions, an image's worth of pixel data. You read what you need, compute what you want, write the result wherever you want.
And here's the part that might feel strange coming from the frag shader world: there's no implicit "current pixel." A fragment shader always knows what fragment it's shading, because the rasterizer told it. A compute shader has no such context — it's not attached to any geometry. So you need a different way to know "which piece of the work am I doing?"
That's where invocations come in.
When you dispatch a compute shader, you're telling the GPU: "run this program in a grid of invocations." Each invocation is one execution of your shader — one little worker. And every worker gets a unique ID, so it knows which slice of the problem it owns.
But the GPU doesn't just launch a flat list of workers. It organises them into workgroups — small teams of invocations that execute together on the same compute unit. You choose two things:
Workgroup size — how many invocations per team (defined in the shader)
Dispatch size — how many workgroups to launch (defined CPU-side)
Total invocations = workgroup size × dispatch size. Let's make this concrete.
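As a sketch of where each number lives (the counts here are illustrative):

```wgsl
// Shader side: the workgroup size is baked into the shader.
// 64 invocations per team.
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // id.x runs from 0 up to (64 × number of workgroups) - 1.
}

// CPU side (WebGPU JavaScript, shown here as a comment):
//   pass.dispatchWorkgroups(16);   // launch 16 workgroups
// 64 invocations/workgroup × 16 workgroups = 1024 total invocations.
```

Change either number and the total changes; the shader code itself stays the same.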
So when you see @builtin(global_invocation_id) in a compute shader, that's all it is: your invocation's address in the big grid. If your buffer stores a 512×512 image, your global ID is the pixel coordinate. You use it to index into the buffer, do your work, and write the result.
And @builtin(local_invocation_id)? That's your position within the team. It doesn't matter much if every invocation is independent — but it matters a lot when your team needs to cooperate. We'll get there.
First, let's talk about what you're actually reading and writing.
In a fragment shader, the output is implicit: you write to a colour (and maybe depth), and the pipeline puts it in the framebuffer. You don't choose where it goes — the rasterizer already decided which pixel you're shading.
In a compute shader, you're talking directly to memory. You bind storage buffers — readable, writable chunks of typed data — and you access them however you want.
Here's the key thing to notice: each worker indexes into the buffer using its own ID. Worker 0 reads input[0], worker 1 reads input[1], and so on. The buffer is just an array, and the global invocation ID is just the index.
But reads don't have to stay in your own lane. In a kernel that averages each element with its neighbours, worker 5 doesn't just read input[5]; it also reads input[4] and input[6]. That's perfectly legal. Any invocation can read any position in a storage buffer.
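A sketch of such a neighbour-averaging kernel in WGSL (bindings illustrative; the edges are clamped so every read stays in bounds):

```wgsl
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    let n = arrayLength(&input);
    if (id.x >= n) { return; }
    // Clamp at the edges so worker 0 and worker n-1 stay in bounds.
    let left  = input[max(id.x, 1u) - 1u];
    let right = input[min(id.x + 1u, n - 1u)];
    output[id.x] = (left + input[id.x] + right) / 3.0;
}
```

Each invocation still writes only its own slot, but it reads three: its own and both neighbours'.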
So far, that's fine for reading. But what about when invocations need to share intermediate results with each other — not through the big global buffer, but quickly, within a workgroup? That's when things get interesting.
Here's a scenario. You want to find the maximum value in a tile of pixels — say, an 8×8 block. One approach: every invocation writes its value to the output buffer, then you do a second pass on the CPU. That works, but it's slow — you're bouncing data back through the bus.
Better approach: let the 64 invocations in a workgroup cooperate on the GPU itself. They can do this via workgroup shared memory — a small, fast block of memory that all invocations in a workgroup can see.
But there's a catch. GPU invocations don't all execute in lockstep the way you might imagine. Within a workgroup, invocations run in sub-batches (often 32, called "warps" or "waves"). So if invocation 0 writes a value to shared memory, invocation 33 might not be running yet — and if it tries to read that value, it'll get garbage.
The solution: barriers. A workgroup barrier says "everyone stop here and wait until ALL invocations in the workgroup have reached this point." After the barrier, every value written before it is guaranteed to be visible.
This is the mental model for shared memory: it's a scratchpad that lives on-chip, close to the compute unit. It's much faster than reading from a global buffer, but it's only visible within one workgroup, and you need barriers to keep everyone in sync.
This is also why workgroup size matters. A 64-invocation workgroup can do a 64-element reduction internally. A 256-invocation workgroup can handle 256 elements. The workgroup size determines how much cooperation you can do before you need to go back to global memory.
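Putting the pieces together, here is a sketch of that 64-element maximum reduction: shared memory, barriers, and a tree of max() steps. The bindings and names are illustrative, and for brevity it assumes the input length is a multiple of 64.

```wgsl
// On-chip scratchpad, visible only within one workgroup.
var<workgroup> tile: array<f32, 64>;

@group(0) @binding(0) var<storage, read> values: array<f32>;
@group(0) @binding(1) var<storage, read_write> tile_max: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>,
        @builtin(local_invocation_id) lid: vec3<u32>,
        @builtin(workgroup_id) wid: vec3<u32>) {
    // Each invocation parks its value in shared memory.
    // (Assumes the buffer length is a multiple of 64.)
    tile[lid.x] = values[gid.x];
    workgroupBarrier();

    // Tree reduction: 64 -> 32 -> 16 -> ... -> 1 surviving value.
    var stride = 32u;
    while (stride > 0u) {
        if (lid.x < stride) {
            tile[lid.x] = max(tile[lid.x], tile[lid.x + stride]);
        }
        workgroupBarrier(); // everyone syncs before the next halving
        stride = stride / 2u;
    }

    // Invocation 0 writes the team's result to global memory.
    if (lid.x == 0u) {
        tile_max[wid.x] = tile[0];
    }
}
```

Note that every barrier sits outside the `if`, in uniform control flow, so all 64 invocations reach it; a barrier that only some invocations reach is undefined behaviour.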
Let's walk through a real compute shader — a Gaussian blur on an image. This is a task you could do in a fragment shader (sample the texture at offsets, average the results), but a compute shader can do it more efficiently by using shared memory to avoid redundant texture reads.
The idea: each workgroup loads a tile of pixels (plus a border of "apron" pixels for the blur radius) into shared memory. Then each invocation reads from shared memory instead of the global texture. Since many invocations need overlapping neighbourhoods, this avoids reading the same texels over and over from slow global memory.
Here's the naive version first — no shared memory, just straight global reads:
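Something like the following, where the bindings, the output texture format, and the Gaussian weight helper are all illustrative choices:

```wgsl
@group(0) @binding(0) var src: texture_2d<f32>;
@group(0) @binding(1) var dst: texture_storage_2d<rgba8unorm, write>;

const RADIUS: i32 = 2; // 5x5 kernel

// Illustrative Gaussian weight (sigma ≈ 1); a real shader would
// precompute the 25 weights into a constant array.
fn gauss(dx: i32, dy: i32) -> f32 {
    return exp(-f32(dx * dx + dy * dy) / 2.0);
}

@compute @workgroup_size(16, 16)
fn blur(@builtin(global_invocation_id) gid: vec3<u32>) {
    let dims = vec2<i32>(textureDimensions(src));
    if (i32(gid.x) >= dims.x || i32(gid.y) >= dims.y) { return; }

    var sum = vec4<f32>(0.0);
    var wsum = 0.0;
    for (var dy = -RADIUS; dy <= RADIUS; dy++) {
        for (var dx = -RADIUS; dx <= RADIUS; dx++) {
            // 25 reads from global texture memory per invocation.
            let p = clamp(vec2<i32>(gid.xy) + vec2(dx, dy),
                          vec2<i32>(0), dims - vec2<i32>(1));
            let w = gauss(dx, dy);
            sum += textureLoad(src, p, 0) * w;
            wsum += w;
        }
    }
    textureStore(dst, vec2<i32>(gid.xy), sum / wsum);
}
```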
This works fine. For a 5×5 blur, each invocation does 25 texture reads. In a 16×16 workgroup, that's 6,400 reads — but many of those overlap with neighbouring invocations. The texel at position (8, 8) gets read by every invocation within a 2-pixel radius of it.
Now here's the shared-memory version. The workgroup cooperatively loads a tile (including the apron border), syncs, and then each thread reads from the fast local copy:
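Again as a sketch, with the same illustrative bindings and a 20×20 shared tile (16×16 workgroup plus a radius-2 apron on every side):

```wgsl
const RADIUS: i32 = 2;
const TILE: i32 = 16 + 2 * RADIUS;          // 20x20 = 400 shared texels

var<workgroup> tile: array<vec4<f32>, 400>; // TILE * TILE

@group(0) @binding(0) var src: texture_2d<f32>;
@group(0) @binding(1) var dst: texture_storage_2d<rgba8unorm, write>;

fn gauss(dx: i32, dy: i32) -> f32 {
    return exp(-f32(dx * dx + dy * dy) / 2.0); // sigma ≈ 1, illustrative
}

@compute @workgroup_size(16, 16)
fn blur(@builtin(workgroup_id) wg: vec3<u32>,
        @builtin(local_invocation_index) local_idx: u32,
        @builtin(local_invocation_id) lid: vec3<u32>,
        @builtin(global_invocation_id) gid: vec3<u32>) {
    let dims = vec2<i32>(textureDimensions(src));
    let origin = vec2<i32>(wg.xy) * 16 - vec2(RADIUS); // apron's top-left

    // Cooperative load: 256 threads stride across the 400-texel tile,
    // so each thread loads one or two texels from global memory.
    var i = i32(local_idx);
    while (i < TILE * TILE) {
        let p = clamp(origin + vec2(i % TILE, i / TILE),
                      vec2<i32>(0), dims - vec2<i32>(1));
        tile[i] = textureLoad(src, p, 0);
        i += 256;
    }
    workgroupBarrier(); // wait until the whole tile is loaded

    if (i32(gid.x) >= dims.x || i32(gid.y) >= dims.y) { return; }

    // 25 reads per thread, all from fast shared memory.
    var sum = vec4<f32>(0.0);
    var wsum = 0.0;
    for (var dy = -RADIUS; dy <= RADIUS; dy++) {
        for (var dx = -RADIUS; dx <= RADIUS; dx++) {
            let t = vec2<i32>(lid.xy) + vec2(RADIUS) + vec2(dx, dy);
            let w = gauss(dx, dy);
            sum += tile[t.y * TILE + t.x] * w;
            wsum += w;
        }
    }
    textureStore(dst, vec2<i32>(gid.xy), sum / wsum);
}
```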
The cooperative load phase is the clever bit. Each of the 256 threads in the workgroup loads one or two texels from global memory into the shared tile (400 ÷ 256 ≈ 1.6 on average). Then, once the barrier confirms everyone has finished loading, each thread does its 25 reads from shared memory instead of global memory. That's a massive reduction in global texture traffic.
So you've got this new tool. When should you actually use it?
The honest answer: not always. A fragment shader is still the right tool when you're producing one colour per pixel and each pixel is independent. The pipeline sets up the invocations for you, handles the output, and it's less code. Don't over-engineer.
Compute shaders earn their place when you need one or more of these:
Arbitrary output locations. A fragment shader writes to its own pixel. A compute shader can write anywhere — scatter patterns, histograms, indirect draw buffers.
Cross-invocation cooperation. Shared memory and barriers let invocations within a workgroup talk to each other. Fragment shaders can't do this.
Non-image data. Particle positions, physics state, sort keys, BVH nodes — compute shaders work on buffers of any structure, not just textures.
Multi-pass efficiency. When an algorithm has stages that feed into each other (like separable blur: horizontal pass, then vertical pass), compute shaders with shared memory can avoid redundant global memory traffic.
Variable work per element. A fragment shader always runs once per fragment. A compute shader can branch, skip, or spawn indirect dispatches.
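As one concrete example of arbitrary output locations, a brightness histogram is a pure scatter: many invocations may land on the same bin at once, so the writes go through atomics. A sketch, with illustrative bindings and a 256-bin layout:

```wgsl
// Brightness values assumed normalised to the range 0..1.
@group(0) @binding(0) var<storage, read> brightness: array<f32>;
@group(0) @binding(1) var<storage, read_write> histogram: array<atomic<u32>, 256>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    if (id.x >= arrayLength(&brightness)) { return; }
    // Scatter write: any invocation can increment any bin, and
    // atomicAdd keeps concurrent increments from losing counts.
    let bin = u32(clamp(brightness[id.x], 0.0, 1.0) * 255.0);
    atomicAdd(&histogram[bin], 1u);
}
```

A fragment shader has no way to express this: each fragment owns exactly one output pixel, and there is no "increment whichever bin my value falls into."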
You already knew how to harness the GPU's parallelism. Now you know how to harness it without the conveyor belt.