How GPUs Know
Their Neighbors

An interactive guide to ddx, ddy, fwidth — and why a function that receives just a number can somehow know what the pixel next door computed

Act I

The Impossible Function

You're writing a fragment shader. You compute some value — maybe a distance to a circle, maybe a UV coordinate, maybe something totally custom. And now you want to know: how fast is this value changing across the screen?

GLSL hands you a function for this:

float rate = dFdx(myValue);

And this should feel deeply weird. Think about what you're asking. You pass in a number. Just a float. No metadata, no pixel coordinates, no reference to your shader code. And somehow this function returns the rate of change — which means it must know what the neighboring pixel computed for that same expression.

If you try to imagine how you'd implement this on a CPU, you immediately hit a wall:

CPU pseudocode — the problem
// I'm one pixel. I computed my value.
float myValue = computeSomething(myUV);

// Now I want the derivative. But...
float ddx(float n) {
    // I received: 3.7
    // I need: what my right neighbor computed
    // But I just have a number. Where do I even look?
    // I don't know the expression that produced n.
    // I can't re-run the shader for the neighbor pixel.
    // I have nothing.
    return ???;
}

This is the heart of the confusion. A function that receives 3.7 can't compute a derivative. It doesn't know where 3.7 came from. It can't look up what the pixel to the right computed, because it doesn't even know which expression produced this number.

The answer isn't that the GPU is doing something clever at runtime to figure out the expression. It's that the entire execution model is different from what you're imagining. And once you see it, ddx goes from feeling like magic to feeling obvious.

Act II

Four Pixels, One Brain

Here's the key fact: a GPU doesn't run your fragment shader on one pixel at a time. It runs it on four pixels simultaneously, in a 2×2 block called a quad.

And "simultaneously" doesn't mean "four threads that happen to run at the same time." It means something much more radical: all four pixels share the same instruction pointer. They execute the exact same instruction, at the exact same clock cycle, every step of the way. The only difference is that each one has its own set of registers holding its own data.

This is SIMD — Single Instruction, Multiple Data. Think of it like four calculators bolted together, all pressing the same button at the same time, but each with different numbers on the screen.

Here's the crucial consequence: when your shader computes float x = uv.x * 2.0;, that instruction runs on all four lanes at once. After it executes, all four results are sitting right there in adjacent registers, at the same time, in the same hardware unit.

INSTRUCTION: float x = uv.x * 2.0
Step through the instructions above. Watch how every instruction fills all four lanes at once. By the time we reach ddx, the value is already sitting in the neighbor's register — no lookup needed.
ddx doesn't "look up" the neighbor's value. The neighbor's value is already there, in the adjacent SIMD lane, because both lanes just executed the exact same instruction. It's a register read, not a function call.
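To make the lockstep model concrete, here is a minimal Python sketch of a quad: four lanes, each with its own register file, all stepping through the same instruction list together. The names (`run_quad`, `mul2`, the register dicts) are hypothetical, not anything a real driver exposes.

```python
# Hypothetical sketch: a 2x2 quad as four SIMD lanes in lockstep.
# Each lane has its own registers; every instruction runs on all four lanes.

def mul2(regs):
    regs["v"] = regs["u"] * 2.0   # the "instruction": same code, per-lane data

def run_quad(instructions, per_lane_inputs):
    """Execute each instruction on all four lanes before moving on."""
    registers = [dict(inp) for inp in per_lane_inputs]  # one register set per lane
    for op in instructions:
        for lane in registers:        # same instruction pointer for everyone
            op(lane)
    return registers

# Lane inputs chosen so v matches the article's quad: 3.0, 5.0, 4.0, 6.0
lanes = run_quad([mul2], [{"u": 1.5}, {"u": 2.5}, {"u": 2.0}, {"u": 3.0}])
print([lane["v"] for lane in lanes])  # [3.0, 5.0, 4.0, 6.0]
```

After the loop, all four results sit side by side in the `registers` list — which is exactly the situation the hardware is in when `ddx` executes.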
Act III

It's Not a Function Call

Let's make this concrete with pseudocode. On a CPU, if you wanted derivatives, you'd have to do something horrible — run the shader multiple times, or store values in shared memory, or pass the expression itself as a callback:

CPU approach — store and look up
// CPU: run each pixel one at a time, store results
float results[width][height];
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        results[x][y] = computeValue(x, y);
    }
}

// Then a second pass for derivatives:
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width - 1; x++) {   // stop one short: x+1 must exist
        ddx[x][y] = results[x+1][y] - results[x][y];
    }
}

Two passes! You need to compute all the values first, store them, then go back and subtract neighbors. That's because on a CPU each pixel runs in sequence — when pixel (5, 3) runs, pixel (6, 3) hasn't been computed yet.

Now here's what the GPU actually does:

GPU reality — it's already there
// GPU: all 4 lanes execute every instruction together.
// Each lane has its own registers. Let's call them R0.

// Instruction: "float v = uv.x * 2.0"
// After this executes:
//   Lane 0 (top-left)  → R0 = 3.0
//   Lane 1 (top-right) → R0 = 5.0
//   Lane 2 (bot-left)  → R0 = 4.0
//   Lane 3 (bot-right) → R0 = 6.0

// Instruction: "ddx(v)"
// This compiles to: "read lane 1's R0, subtract lane 0's R0"
//   Result for lanes 0 & 1: R0[lane1] - R0[lane0] = 5.0 - 3.0 = 2.0
//   Result for lanes 2 & 3: R0[lane3] - R0[lane2] = 6.0 - 4.0 = 2.0

See it? There's no "lookup." There's no shared memory. There's no second pass. The four values are already sitting in four registers because the SIMD unit just computed them all. ddx compiles down to a single instruction that reads across lanes — something like a lane shuffle or subgroup swap.
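The cross-lane read is easy to model in a few lines of Python. This is a hedged sketch, not real ISA: it uses the fact that in a quad, lane `i`'s horizontal partner is lane `i ^ 1` (0↔1, 2↔3), so "read the neighbor's register" is just an index flip.

```python
# Hypothetical sketch: ddx as a cross-lane register read, not a lookup.
# r0 holds the same register for all four lanes, in order [TL, TR, BL, BR].

def ddx_coarse(r0):
    out = []
    for lane in range(4):
        right = r0[lane | 1]    # the right pixel of this lane's horizontal pair
        left  = r0[lane & ~1]   # the left pixel of the same pair
        out.append(right - left)
    return out

print(ddx_coarse([3.0, 5.0, 4.0, 6.0]))  # [2.0, 2.0, 2.0, 2.0]
```

Both pixels in a pair get the same answer, because it is literally the same subtraction.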

But how does the hardware know which register?

This is the last piece of the puzzle. When you write ddx(myValue), the compiler — not the hardware at runtime — knows which register holds myValue. It compiled your shader. It assigned myValue to, say, register R7. So it emits something like:

What the compiler emits
// Your GLSL:
float v = uv.x * 2.0;   // → stored in R7
float d = dFdx(v);      // → LANE_SWAP R7, subtract

// The compiled instruction is roughly:
//   R8 = READ_NEIGHBOR_LANE(R7) - R7

The function doesn't need to know "where the value came from" at runtime. The compiler already resolved that at compile time — it knows the register address, and hardcodes "swap lane, read that register" into the machine code. It's exactly the same way a CPU compiler knows that variable x lives at stack offset [rbp-8].

ddx(v) doesn't receive "just a number." It receives a register address — baked in by the compiler — and reads that same register from the neighboring SIMD lane. One instruction, zero ambiguity.
Act IV

The Quad Up Close

Now let's be precise about which lane subtracts from which. A 2×2 quad has four pixels mapped to four SIMD lanes:

Lane 0 (top-left): 3.0    Lane 1 (top-right): 5.0
Lane 2 (bot-left): 4.0    Lane 3 (bot-right): 6.0
All four lanes just executed the same instruction. Each has its value in the same register.

The key things to notice: ddx is always right minus left, computed once per horizontal pair and shared by both pixels in the pair. Same for ddy: bottom minus top, shared by both in a column. And fwidth is just |ddx| + |ddy| — a quick Manhattan-distance measure of total change.

Playground: edit the values yourself

Click any cell below to type a number. The derivatives update instantly.

TL = 3.0   TR = 5.0
BL = 4.0   BR = 6.0

ddx (top row): 5.0 - 3.0 = 2.0      ddx (bot row): 6.0 - 4.0 = 2.0
ddy (left col): 4.0 - 3.0 = 1.0     ddy (right col): 6.0 - 5.0 = 1.0
fwidth per pixel: |2.0| + |1.0| = 3.0
Set all four to the same value — derivatives go to zero. Now try left column = 0, right column = 10. Can you make ddx large while ddy stays zero?
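The playground arithmetic fits in one small function. This is a sketch under one assumption: the coarse fwidth uses the top-row ddx and left-column ddy for every pixel (real hardware may pick differently, since the choice is implementation-defined).

```python
# Sketch of the playground math: quad values in order TL, TR, BL, BR.
def quad_derivatives(tl, tr, bl, br):
    ddx_top, ddx_bot = tr - tl, br - bl      # right minus left, per row
    ddy_left, ddy_right = bl - tl, br - tr   # bottom minus top, per column
    # assumed coarse fwidth: every pixel shares one ddx and one ddy
    fwidth = abs(ddx_top) + abs(ddy_left)
    return ddx_top, ddx_bot, ddy_left, ddy_right, fwidth

print(quad_derivatives(3.0, 5.0, 4.0, 6.0))   # (2.0, 2.0, 1.0, 1.0, 3.0)
print(quad_derivatives(0.0, 10.0, 0.0, 10.0)) # ddx large, ddy zero
```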
Act V

The Blind Spots

Now for the question that should be nagging at you: if derivatives are only computed within a 2×2 quad, what about changes that happen across quad boundaries?

The honest answer: the GPU doesn't see them. And yes, this means derivatives are an approximation. Specifically, they're a piecewise-constant approximation — every pixel in a quad gets the same ddx, computed from just that one pair of neighbors.

hover to see quad boundaries and derivatives
Switch to "Sharp step" and display "ddx". See those columns where ddx is high? They only appear inside quads that straddle the step edge. Quads entirely on one side show zero — even though the step is one pixel away. Now try "Checkerboard" — notice how the derivatives are huge everywhere because every quad contains a change.

Is this a problem in practice? Usually not, for a few reasons:

First, the values that fragment shaders work with — UV coordinates, world positions, distances — tend to change smoothly across the screen. A smooth gradient looks almost identical whether you sample it at the quad level or per-pixel. The 2×2 approximation is excellent for smooth signals.

Second, where it does break down is exactly where you'd expect: at hard discontinuities. A triangle edge, a texture seam, a sudden material change. But these are also the places where the GPU already has special handling — helper invocations along triangle edges, proper mip selection for texture boundaries, and so on.

Third, the alternative — true per-pixel derivatives with cross-quad communication — would be massively more expensive. You'd need synchronization barriers between quads, which would kill the pipeline. The 2×2 approach is a brilliant trade-off: trivially cheap, surprisingly accurate for 99% of cases, and only noticeably wrong in situations that are already being handled by other mechanisms.

Derivatives are blind to changes outside the 2×2 quad. For smooth signals this doesn't matter. For sharp edges, it means derivatives can be zero even right next to a discontinuity — if the discontinuity falls between quads.
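The blind spot is easy to demonstrate numerically. Here is a hedged one-dimensional sketch (a single row of pixels, paired into quad columns): a sharp step that falls exactly between two quads produces zero derivative everywhere, while the same step shifted one pixel is seen by exactly one quad.

```python
# Hypothetical 1D sketch: 8 pixels, quads pair columns (0,1), (2,3), (4,5), (6,7).

def ddx_per_quad(row):
    out = []
    for left in range(0, len(row), 2):   # one pair per quad column
        d = row[left + 1] - row[left]    # right minus left
        out += [d, d]                    # both pixels in the pair share it
    return out

step_between = [0, 0, 0, 0, 1, 1, 1, 1]  # edge between pixels 3 and 4: a quad boundary
step_inside  = [0, 0, 0, 1, 1, 1, 1, 1]  # edge between pixels 2 and 3: inside a quad

print(ddx_per_quad(step_between))  # [0, 0, 0, 0, 0, 0, 0, 0]  -> invisible
print(ddx_per_quad(step_inside))   # [0, 0, 1, 1, 0, 0, 0, 0]  -> seen by one quad
```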
Act VI

Helper Invocations and Gotchas

Helper Invocations

One more consequence of the quad system. What if a triangle covers just one pixel in a 2×2 block? The GPU still needs the full quad for derivatives. So it launches helper invocations — fragment shader runs that execute the full shader but whose output is silently discarded. They exist only so the "real" pixels have neighbors to subtract from.

This means along triangle edges, you're paying for extra shader invocations that don't write anything. It's usually a tiny cost, but worth knowing about.

Derivatives in Branches

Remember: all four lanes execute the same instruction. If an if statement causes some lanes to take one branch and others to take a different branch, the hardware actually runs both branches for all lanes (masking off results for the "wrong" lanes). But if you call ddx inside one branch, the neighbor lane might be masked — its register might hold stale or irrelevant data. Result: undefined.

// ⚠ DANGER: derivative inside non-uniform branch
if (someCondition) {           // some lanes take this, some don't
    float d = dFdx(value);     // ← neighbor might be in the else branch
}

// ✓ SAFE: compute derivative before branching
float d = dFdx(value);         // all lanes execute this together
if (someCondition) {
    // use d here safely
}

Coarse vs Fine

In the basic model, both pixels in a row get the same ddx. Newer APIs offer two flavors:

dFdxCoarse(v): one subtraction per pair (what we've seen)
dFdxFine(v):   computed per row, so it can vary per pixel

The "fine" variant lets the bottom-left pixel compute ddx from the bottom row instead of the top row, so pixels in the same column can get different ddx values. dFdx in standard GLSL is implementation-defined — it could be either.
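The difference shows up as soon as the two rows change at different rates. A minimal sketch, with the quad again in [TL, TR, BL, BR] order:

```python
# Sketch: coarse vs fine horizontal derivatives on one quad.
def ddx_coarse(q):
    d = q[1] - q[0]                       # one subtraction (top row), shared by all
    return [d, d, d, d]

def ddx_fine(q):
    top, bot = q[1] - q[0], q[3] - q[2]   # each row computes its own
    return [top, top, bot, bot]

quad = [3.0, 5.0, 4.0, 7.0]               # bottom row changes faster
print(ddx_coarse(quad))  # [2.0, 2.0, 2.0, 2.0]
print(ddx_fine(quad))    # [2.0, 2.0, 3.0, 3.0]
```

For smooth signals the two agree almost exactly; fine derivatives matter only when the rate of change itself varies within the quad.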

Act VII

Seeing It in Action

Let's finish with the most common real-world use of fwidth: screen-space anti-aliasing. Below, a shape is drawn two ways: hard step() on the left, and smoothstep with fwidth on the right.

Switch to "Grid lines" and zoom to 5×. The left side disintegrates into aliased noise. The right stays clean — fwidth widens the smoothing automatically as the spatial frequency increases.

The shader code for the smooth version is three lines:

float dist = length(uv) - radius;               // signed distance to circle
float fw = fwidth(dist);                        // screen-space pixel width
float alpha = 1.0 - smoothstep(-fw, fw, dist);  // smooth across ~1 pixel
                                                // (smoothstep requires edge0 < edge1)

That's it. One subtraction across SIMD lanes gives you the pixel footprint. One smoothstep gives you a perfect anti-aliased edge. And it works at every zoom level, every angle, every resolution — because the derivative adapts automatically.
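The same math, sketched in Python so you can check the numbers. `smoothstep` is the standard GLSL cubic; `coverage` is a hypothetical helper name for the alpha computation above.

```python
# Sketch of the anti-aliasing math: alpha from a signed distance and a pixel width.
def smoothstep(e0, e1, x):
    """GLSL smoothstep: clamp to [0,1], then the cubic 3t^2 - 2t^3."""
    t = min(max((x - e0) / (e1 - e0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)

def coverage(dist, fw):
    """Alpha: 1 inside the shape, 0 outside, smooth across a band of width 2*fw."""
    return 1.0 - smoothstep(-fw, fw, dist)

print(coverage(-1.0, 0.5))  # 1.0  (well inside)
print(coverage( 0.0, 0.5))  # 0.5  (exactly on the edge)
print(coverage( 1.0, 0.5))  # 0.0  (well outside)
```

Because `fw` comes from `fwidth`, the band is always about one pixel wide on screen, no matter how the shape is scaled.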

The Full Picture

So, to summarize the whole journey:

ddx looks like a function that receives a number and magically knows what the neighbor computed. But it's not a function in any meaningful sense. It compiles to a single hardware instruction that reads a specific register from the adjacent SIMD lane — a register that already holds the neighbor's value, because all lanes in a 2×2 quad execute every instruction together, in lockstep, on the same clock cycle.

The compiler knows which register to read because it assigned the variable to that register at compile time — the same way any compiler resolves variable addresses. No runtime introspection, no metadata, no magic.

The trade-off is that derivatives are only computed within the 2×2 quad, making them blind to changes across quad boundaries. For the smooth signals that shaders typically work with, this is an excellent approximation. For hard discontinuities, it's a known limitation — and one the GPU handles through other mechanisms.

ddx isn't a function that receives a number. It's a single hardware instruction that reads a register from the neighboring SIMD lane — a register the compiler chose at compile time, holding a value that's already there because all four lanes just executed the same code.