An interactive guide to ddx, ddy, fwidth — and why a function that receives just a number can somehow know what the pixel next door computed
You're writing a fragment shader. You compute some value — maybe a distance to a circle, maybe a UV coordinate, maybe something totally custom. And now you want to know: how fast is this value changing across the screen?
GLSL hands you a function for this:
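(In HLSL the functions are spelled ddx, ddy, and fwidth; standard GLSL spells them dFdx, dFdy, and fwidth.) A minimal call looks like this — `v` stands in for any float you computed:

```glsl
float v = someValue;       // any value your shader computed
float dvdx = dFdx(v);      // rate of change, one pixel to the right
float dvdy = dFdy(v);      // rate of change, one pixel down
float total = fwidth(v);   // abs(dFdx(v)) + abs(dFdy(v))
```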
And this should feel deeply weird. Think about what you're asking. You pass in a number. Just a float. No metadata, no pixel coordinates, no reference to your shader code. And somehow this function returns the rate of change — which means it must know what the neighboring pixel computed for that same expression.
If you try to imagine how you'd implement this on a CPU, you immediately hit a wall:
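In CPU terms, the wall looks like this — a sketch, not a real API:

```python
# A CPU "implementation" of ddx immediately hits the wall:
# the function receives only a number, nothing else.
def ddx(value: float) -> float:
    # 'value' is just 3.7 (or whatever). There is no pixel coordinate,
    # no expression, no neighbor in scope -- nothing to subtract.
    raise NotImplementedError("a lone float carries no neighbor information")
```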
This is the heart of the confusion. A function that receives 3.7
can't compute a derivative. It doesn't know where 3.7 came from. It
can't look up what the pixel to the right computed, because it doesn't even know
which expression produced this number.
The answer isn't that the GPU is doing something clever at runtime to figure
out the expression. It's that the entire execution model is different
from what you're imagining. And once you see it, ddx goes
from feeling like magic to feeling obvious.
Here's the key fact: a GPU doesn't run your fragment shader on one pixel at a time. It runs it on four pixels simultaneously, in a 2×2 block called a quad.
And "simultaneously" doesn't mean "four threads that happen to run at the same time." It means something much more radical: all four pixels share the same instruction pointer. They execute the exact same instruction, at the exact same clock cycle, every step of the way. The only difference is that each one has its own set of registers holding its own data.
This is SIMD — Single Instruction, Multiple Data. Think of it like four calculators bolted together, all pressing the same button at the same time, but each with different numbers on the screen.
Here's the crucial consequence: when your shader computes float x =
uv.x * 2.0;, that instruction runs on all four lanes at once. After it
executes, all four results are sitting right there in adjacent
registers, at the same time, in the same hardware unit.
ddx, the value is already
So when you call
sitting in the neighbor's register — no lookup needed.
Let's make this concrete with pseudocode. On a CPU, if you wanted derivatives, you'd have to do something horrible — run the shader multiple times, or store values in shared memory, or pass the expression itself as a callback:
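A sketch of that CPU approach, assuming a hypothetical per-pixel `shade()` function:

```python
# Faking ddx on a CPU requires two passes over every pixel.
W, H = 4, 2

def shade(x, y):
    return x * 2.0  # stand-in for any per-pixel expression

# Pass 1: run the shader for every pixel and store all the results.
values = [[shade(x, y) for x in range(W)] for y in range(H)]

# Pass 2: go back and subtract each pixel's right-hand neighbor.
ddx = [[values[y][min(x + 1, W - 1)] - values[y][x] for x in range(W)]
       for y in range(H)]
```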
Two passes! You need to compute all the values first, store them, then go back and subtract neighbors. That's because on a CPU each pixel runs in sequence — when pixel (5, 3) runs, pixel (6, 3) hasn't been computed yet.
Now here's what the GPU actually does:
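Here is that lockstep execution simulated with plain lists — one list element per SIMD lane (lane order is an assumption; real hardware layouts vary):

```python
# Four lanes of one 2x2 quad. Assumed order:
# 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
uv_x = [0.10, 0.15, 0.10, 0.15]   # each lane holds its own uv.x

# One instruction, four lanes: "x = uv.x * 2.0" runs on all of them at once.
x = [v * 2.0 for v in uv_x]

# ddx: subtract across lanes. The neighbor's value is already sitting there.
ddx = [x[1] - x[0], x[1] - x[0], x[3] - x[2], x[3] - x[2]]
```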
See it? There's no "lookup." There's no shared memory. There's no second
pass. The four values are already sitting in four registers because the SIMD
unit just computed them all. ddx compiles down to a single
instruction that reads across lanes — something like a lane
shuffle or subgroup swap.
This is the last piece of the puzzle. When you write ddx(myValue),
the compiler — not the hardware at runtime — knows which register
holds myValue. It compiled your shader. It assigned
myValue to, say, register R7. So it emits something like:
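Something like this — a hypothetical ISA, with illustrative instruction names:

```
; ddx(myValue), where the compiler placed myValue in R7:
v_sub R8, lane_swap_x(R7), R7   ; horizontal neighbor's R7 minus this lane's R7
```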
The function doesn't need to know "where the value came from" at runtime.
The compiler already resolved that at compile time — it knows the register
address, and hardcodes "swap lane, read that register" into the
machine code. It's exactly the same way a CPU compiler knows that variable
x lives at stack offset [rbp-8].
Now let's be precise about which lane subtracts from which. A 2×2 quad has four pixels mapped to four SIMD lanes:
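One common mapping (the exact lane numbering is hardware-specific):

```
lane 0: pixel (x,   y  )    lane 1: pixel (x+1, y  )
lane 2: pixel (x,   y+1)    lane 3: pixel (x+1, y+1)
```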
The key things to notice: ddx is always right minus
left, computed once per horizontal pair and shared by both
pixels in the pair. Same for ddy: bottom minus top,
shared by both in a column. And fwidth is just
|ddx| + |ddy| — a quick Manhattan-distance measure of total change.
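A worked example with made-up values, one per quad pixel:

```python
# One 2x2 quad of samples (hypothetical values).
a, b = 1.0, 4.0   # top row:    a = top-left,    b = top-right
c, d = 2.0, 7.0   # bottom row: c = bottom-left, d = bottom-right

ddx_top    = b - a   # right minus left, shared by both top pixels
ddx_bottom = d - c   # the fine variant; coarse would reuse ddx_top
ddy_left   = c - a   # bottom minus top, shared by both left pixels
ddy_right  = d - b

# fwidth for the top-left pixel: Manhattan measure of total change.
fwidth_tl = abs(ddx_top) + abs(ddy_left)
```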

Now for the question that should be nagging at you: if derivatives are only computed within a 2×2 quad, what about changes that happen across quad boundaries?
The honest answer: the GPU doesn't see them. And yes, this
means derivatives are an approximation. Specifically, they're a
piecewise-constant approximation — every pixel in a quad gets
the same ddx, computed from just that one pair of neighbors.
Is this a problem in practice? Usually not, for a few reasons:
First, the values that fragment shaders work with — UV coordinates, world positions, distances — tend to change smoothly across the screen. A smooth gradient looks almost identical whether you sample it at the quad level or per-pixel. The 2×2 approximation is excellent for smooth signals.
Second, where it does break down is exactly where you'd expect: at hard discontinuities. A triangle edge, a texture seam, a sudden material change. But these are also the places where the GPU already has special handling — helper invocations along triangle edges, proper mip selection for texture boundaries, and so on.
Third, the alternative — true per-pixel derivatives with cross-quad communication — would be massively more expensive. You'd need synchronization barriers between quads, which would kill the pipeline. The 2×2 approach is a brilliant trade-off: trivially cheap, surprisingly accurate for 99% of cases, and only noticeably wrong in situations that are already being handled by other mechanisms.
One more consequence of the quad system. What if a triangle covers just one pixel in a 2×2 block? The GPU still needs the full quad for derivatives. So it launches helper invocations — fragment shader runs that execute the full shader but whose output is silently discarded. They exist only so the "real" pixels have neighbors to subtract from.
This means along triangle edges, you're paying for extra shader invocations that don't write anything. It's usually a tiny cost, but worth knowing about.
Remember: all four lanes execute the same instruction. If an if
statement causes some lanes to take one branch and others to take a different
branch, the hardware actually runs both branches for all lanes (masking
off results for the "wrong" lanes). But if you call ddx inside
one branch, the neighbor lane might be masked — its register might hold stale
or irrelevant data. Result: undefined.
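A GLSL sketch of the hazard and the usual fix (names are illustrative):

```glsl
// Risky: if lanes in the same quad disagree on this branch,
// the derivative inside it is undefined.
if (mask > 0.5) {
    float w = fwidth(dist);   // neighbor lane may be masked off here
    color.a = smoothstep(w, 0.0, dist);
}

// Safe: take the derivative in uniform control flow, then branch.
float w = fwidth(dist);
if (mask > 0.5) {
    color.a = smoothstep(w, 0.0, dist);
}
```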
In the basic model, both pixels in a row get the same ddx.
Newer APIs offer two explicit flavors: a coarse variant (dFdxCoarse in GLSL 4.50+, ddx_coarse in HLSL Shader Model 5) and a fine variant (dFdxFine, ddx_fine).
The "fine" variant lets the bottom-left pixel compute ddx from
the bottom row instead of the top row, so pixels in the same column can get
different ddx values. dFdx in standard GLSL is
implementation-defined — it could be either.
Let's finish with the most common real-world use of fwidth:
screen-space anti-aliasing. Below, a shape is drawn two ways:
hard step() on the left, and smoothstep with
fwidth on the right.
fwidth widens the smoothing
automatically as the spatial frequency increases.
The shader code for the smooth version is three lines:
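Something like the following — `circleSDF` is a hypothetical signed-distance function standing in for whatever shape you're drawing:

```glsl
float d = circleSDF(uv);                    // signed distance: negative inside
float w = fwidth(d);                        // how much d changes across one pixel
float alpha = 1.0 - smoothstep(-w, w, d);   // 1 inside, 0 outside, soft one-pixel edge
```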
That's it. One subtraction across SIMD lanes gives you the pixel footprint.
One smoothstep gives you a perfect anti-aliased edge. And it works
at every zoom level, every angle, every resolution — because the derivative
adapts automatically.
So, to summarize the whole journey:
ddx looks like a function that receives a number and magically
knows what the neighbor computed. But it's not a function in
any meaningful sense. It compiles to a single hardware instruction that reads
a specific register from the adjacent SIMD lane — a register that already holds
the neighbor's value, because all lanes in a 2×2 quad execute every instruction
together, in lockstep, on the same clock cycle.
The compiler knows which register to read because it assigned the variable to that register at compile time — the same way any compiler resolves variable addresses. No runtime introspection, no metadata, no magic.
The trade-off is that derivatives are only computed within the 2×2 quad, making them blind to changes across quad boundaries. For the smooth signals that shaders typically work with, this is an excellent approximation. For hard discontinuities, it's a known limitation — and one the GPU handles through other mechanisms.