You know the vertex shader. You know the fragment shader. Let's explore everything in between — and beyond.
Every frame of every game, every spin of every 3D model viewer, every flashy WebGL landing page — they all follow the same sequence. Your GPU takes in raw vertex data and, through a carefully orchestrated chain of stages, produces the colored pixels you see on screen.
You've already written code for two of these stages: the vertex shader and the fragment shader. But the full pipeline has a lot more going on. Some stages are programmable (you write the code), some are fixed-function (the hardware handles it), and some are optional — only activated when you need them.
Here's the whole thing. Hover over any stage to see what it does:
That's a lot of stages! Don't worry — we're going to walk through each one, with interactive demos that let you see exactly what's happening at every step. Let's start at the very beginning.
Everything starts on the CPU. Your application — the game engine, the WebGL app, whatever it is — decides it's time to draw something. It issues a draw call: "Hey GPU, here's a vertex buffer, here's how to interpret it. Go."
The GPU's first job is Input Assembly. It reads raw numbers from your vertex buffer and groups them into primitives — the basic shapes the rest of the pipeline will work with. Usually triangles, but also points and lines.
Here's a common situation: you want to draw a square. That's two triangles, six vertices — but a square only has four corners. Without an index buffer, you'd have to list two of those corners twice. With an index buffer, you list each corner once and then say "triangle 1 uses corners 0, 1, 2 — triangle 2 uses corners 2, 3, 0."
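In code, the two layouts look like this — a plain Python sketch (the GPU does the index lookup in fixed-function hardware, but the bookkeeping is the same):

```python
# A unit square: four unique corners.
corners = [
    (0.0, 0.0),  # 0: bottom-left
    (1.0, 0.0),  # 1: bottom-right
    (1.0, 1.0),  # 2: top-right
    (0.0, 1.0),  # 3: top-left
]

# Non-indexed: two triangles, six vertices, corners 0 and 2 stored twice.
non_indexed = [corners[0], corners[1], corners[2],
               corners[2], corners[3], corners[0]]

# Indexed: each corner stored once, plus six small integer indices.
vertex_buffer = corners
index_buffer = [0, 1, 2, 2, 3, 0]

# Input assembly produces the same triangles either way.
assembled = [vertex_buffer[i] for i in index_buffer]
assert assembled == non_indexed
```

The win grows with mesh size: in a typical mesh every vertex is shared by several triangles, so indices (small integers) are much cheaper than repeating full vertices (position, normal, UVs, ...).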
Play with this — toggle between indexed and non-indexed mode and watch what happens to the vertex data:
This one's familiar territory! Your vertex shader runs once per vertex. It receives attributes — position, normal, UV coordinates, whatever you've packed into the buffer — and outputs a position in clip space plus any varyings you want passed down the pipeline.
But let's make sure we're precise about what "clip space" means, because it matters for the next stage. Your vertex shader outputs a 4D vector: (x, y, z, w). This isn't screen pixels yet — it's a coordinate system where everything visible lives inside a specific volume. We'll see exactly what that volume is in Act V.
Here's a tiny vertex shader doing the classic model-view-projection transform. Drag the vertex around in world space and watch the clip-space output change:
Notice how the clip-space coordinates change as you move the vertex or adjust the camera. The w component encodes depth — further away means a larger w. That's going to be important when we get to the perspective divide.
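To make that concrete, here's the projection half of the model-view-projection transform as a small Python sketch. The matrix layout and helper names are mine, but the math matches a standard OpenGL-style perspective projection:

```python
import math

def perspective(fov_y, aspect, near, far):
    """A standard OpenGL-style perspective projection matrix (row-major)."""
    f = 1.0 / math.tan(fov_y / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def transform(m, v):
    """Multiply a 4x4 row-major matrix by a 4D column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

proj = perspective(math.radians(60), 16 / 9, 0.1, 100.0)

# Two view-space points straight ahead of the camera (-z is "forward").
near_pt = transform(proj, [0.0, 0.0, -2.0, 1.0])
far_pt = transform(proj, [0.0, 0.0, -10.0, 1.0])

# That last matrix row copies -z into w: farther away means a larger w.
assert near_pt[3] == 2.0 and far_pt[3] == 10.0
```

The bottom row `[0, 0, -1, 0]` is the whole trick: it smuggles the view-space distance into w, where the perspective divide will use it later.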
Between the vertex shader and clipping, there are optional programmable stages that most tutorials skip over. These are powerful tools — but they're opt-in. If you don't enable them, the pipeline skips straight from the vertex shader to clipping.
Why would you want the GPU to create more triangles? One huge reason: level of detail. Send a coarse mesh to the GPU and let it subdivide based on distance. Close objects get smooth, detailed surfaces. Far objects stay low-poly. All on the GPU, no CPU cost.
Tessellation actually involves three sub-stages: the tessellation control shader (also called the hull shader) decides how much to subdivide, the tessellation generator (fixed-function) actually creates the new vertices, and the tessellation evaluation shader (domain shader) positions them.
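That division of labor can be sketched in plain Python — hypothetical function names, and real hardware works per-patch with factors supplied by the control shader, but the shape of the work is this:

```python
def tessellate_triangle(level):
    """Mimic the fixed-function generator: produce barycentric coordinates
    for a triangle domain subdivided `level` times along each edge."""
    pts = []
    for i in range(level + 1):
        for j in range(level + 1 - i):
            k = level - i - j
            pts.append((i / level, j / level, k / level))
    return pts

def evaluate(bary, a, b, c):
    """Mimic the evaluation (domain) shader: position a generated vertex
    by blending the patch corners with its barycentric coordinates."""
    u, v, w = bary
    return tuple(u * pa + v * pb + w * pc for pa, pb, pc in zip(a, b, c))

corners = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
verts = [evaluate(b, *corners) for b in tessellate_triangle(4)]
# Level n yields (n+1)(n+2)/2 vertices: 15 at level 4.
assert len(verts) == 15
```

In a real evaluation shader, this is also where you'd displace the new vertices — sampling a height map, or pushing them out along a curved surface.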
Drag the slider to crank up the tessellation level and watch a single triangle turn into hundreds:
The geometry shader is different. It receives an entire primitive (a triangle, a line, a point) and can output zero or more new primitives. Classic uses: expanding points into camera-facing quads (billboarding), generating wireframe overlays, or extruding shadow volumes.
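The billboarding case can be mimicked in a few lines of Python. This is only a sketch — in a real geometry shader the camera basis vectors come from your view matrix, and the output is a triangle strip:

```python
def expand_point_to_quad(center, camera_right, camera_up, size):
    """Mimic a billboard geometry shader: turn one input point into the four
    corner vertices of a camera-facing quad (triangle-strip order)."""
    h = size / 2.0
    cx, cy, cz = center
    rx, ry, rz = camera_right
    ux, uy, uz = camera_up
    corners = []
    for sx, sy in [(-1, -1), (1, -1), (-1, 1), (1, 1)]:
        corners.append((cx + sx * h * rx + sy * h * ux,
                        cy + sx * h * ry + sy * h * uy,
                        cz + sx * h * rz + sy * h * uz))
    return corners

# With an axis-aligned camera, a point at the origin becomes a unit quad.
quad = expand_point_to_quad((0, 0, 0), (1, 0, 0), (0, 1, 0), 1.0)
assert quad == [(-0.5, -0.5, 0.0), (0.5, -0.5, 0.0),
                (-0.5, 0.5, 0.0), (0.5, 0.5, 0.0)]
```

One input primitive in, four vertices out — exactly the "zero or more new primitives" behavior that makes the geometry shader different from everything before it.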
Here, each input point gets expanded into a camera-facing quad — perfect for particle systems:
One more optional feature: stream output (or transform feedback) lets you capture the transformed vertices and write them back to a buffer — without rasterizing at all. This creates a feedback loop: the GPU processes geometry, stores the result, and you can feed it back in on the next frame. It's how GPU-driven particle simulations work: each particle's new position is computed by the vertex shader and captured via stream output, ready for the next frame.
Your vertex shader (and optionally tessellation/geometry shaders) produced vertices in clip space — that 4D (x, y, z, w) coordinate. Now the GPU needs to figure out: what's actually visible?
The visible volume in clip space is defined by six planes: anything where -w ≤ x ≤ w, -w ≤ y ≤ w, and 0 ≤ z ≤ w (or -w ≤ z ≤ w depending on the API) is inside. Everything outside gets clipped away.
But here's the interesting part: if a triangle is partially outside, the GPU doesn't just discard it. It clips the triangle against the frustum planes, creating new vertices where the edges cross the boundary. One triangle can become two, or even more!
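The classic algorithm for this is Sutherland–Hodgman: clip the polygon against one plane at a time, emitting a new vertex wherever an edge crosses the boundary. Here's a Python sketch clipping a 2D triangle against the single plane x ≤ 1, standing in for one frustum plane:

```python
def clip_against_plane(polygon, inside, intersect):
    """One pass of Sutherland-Hodgman: clip a polygon against a single
    plane, creating new vertices where edges cross the boundary."""
    out = []
    for i, cur in enumerate(polygon):
        prev = polygon[i - 1]
        if inside(cur):
            if not inside(prev):
                out.append(intersect(prev, cur))  # edge enters: add crossing
            out.append(cur)
        elif inside(prev):
            out.append(intersect(prev, cur))      # edge leaves: add crossing
    return out

def inside(p):                       # keep everything with x <= 1
    return p[0] <= 1.0

def intersect(a, b):                 # where segment a->b crosses x = 1
    t = (1.0 - a[0]) / (b[0] - a[0])
    return (1.0, a[1] + t * (b[1] - a[1]))

tri = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]   # pokes past x = 1
clipped = clip_against_plane(tri, inside, intersect)
assert len(clipped) == 4   # one triangle became a quad (i.e. two triangles)
```

The real pipeline runs a pass like this for each of the six frustum planes, in homogeneous coordinates (against x = ±w and friends) rather than in 2D.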
Drag the triangle around and watch clipping happen in real time:
After clipping, every surviving vertex gets divided by its own w component: (x/w, y/w, z/w). This is the perspective divide, and it's what makes far-away things look smaller. The result is Normalized Device Coordinates (NDC) — a cube from -1 to 1 on each axis (in APIs with the 0 ≤ z ≤ w clip range, z lands in [0, 1] instead).
Finally, NDC gets mapped to actual pixel coordinates on your screen. The x and y go from [-1, 1] to [0, width] and [0, height]. The z gets mapped to the depth range (usually [0, 1]) for later depth testing.
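Both steps fit in a few lines. Here's a sketch assuming GL conventions (NDC in [-1, 1] on each axis, depth mapped to [0, 1], full-window viewport):

```python
def clip_to_window(clip, width, height):
    """Perspective divide, then viewport transform."""
    x, y, z, w = clip
    ndc = (x / w, y / w, z / w)             # perspective divide
    px = (ndc[0] * 0.5 + 0.5) * width       # x: [-1, 1] -> [0, width]
    py = (ndc[1] * 0.5 + 0.5) * height      # y: [-1, 1] -> [0, height]
    depth = ndc[2] * 0.5 + 0.5              # z: [-1, 1] -> [0, 1]
    return (px, py, depth)

# A vertex at clip-space (2, 0, 0, 4) lands right of center, at mid-depth.
assert clip_to_window((2.0, 0.0, 0.0, 4.0), 800, 600) == (600.0, 300.0, 0.5)
```

Note the order: clipping happened first precisely so that this divide never sees w = 0 (or negative w) for visible geometry.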
Before the GPU spends effort rasterizing a triangle, it asks a quick question: is this triangle facing towards the camera, or away from it? If it's facing away — and you've enabled backface culling — it gets discarded immediately.
How does the GPU know which way a triangle faces? Winding order. If the vertices appear in counter-clockwise order on screen, the triangle is front-facing. Clockwise? Back-facing. (This convention is configurable, but CCW = front is the most common.)
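The winding test is just the sign of the triangle's area in screen space. A Python sketch, using math-convention y-up coordinates (window coordinates are often y-down, which flips the apparent winding — part of why the convention is configurable):

```python
def is_front_facing(a, b, c):
    """Twice the signed area of the screen-space triangle: positive means
    the vertices wind counter-clockwise (front-facing under CCW = front)."""
    area2 = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return area2 > 0

tri = ((0, 0), (4, 0), (0, 3))
assert is_front_facing(*tri)            # listed CCW: front-facing
assert not is_front_facing(*tri[::-1])  # same triangle reversed: back-facing
```

For a closed mesh, every back-facing triangle is hidden behind a front-facing one, so culling them is free: roughly half the triangles never reach the rasterizer.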
Here's a rotating cube. Toggle backface culling and watch half the triangles disappear:
This is the big one. We've got a triangle in screen coordinates — three 2D points with some associated data (depth, interpolated varyings). The rasterizer's job is to figure out which pixels (or more precisely, which fragments) that triangle covers.
The distinction matters: a fragment is a candidate pixel. It carries interpolated data from the vertices, plus a position on the screen. It might not survive the depth test later — so it's not a pixel yet, just a candidate.
For each pixel in the triangle's bounding box, the rasterizer tests: is this pixel's center inside all three edges? This is done with a simple cross-product test against each edge — if the point is on the "inside" half-plane of all three edges, it's in.
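That inside test can be written out directly. A Python sketch of bounding-box rasterization — pixel centers and CCW winding assumed; real hardware adds a fill rule (the "top-left rule") so pixels on shared edges are drawn exactly once:

```python
def edge(a, b, p):
    """Edge function: the 2D cross product of a->b and a->p.
    Non-negative when p is on the inside half-plane of edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def covered_pixels(a, b, c):
    """Test each pixel center in the bounding box against all three edges."""
    min_x, max_x = int(min(a[0], b[0], c[0])), int(max(a[0], b[0], c[0]))
    min_y, max_y = int(min(a[1], b[1], c[1])), int(max(a[1], b[1], c[1]))
    pixels = []
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            if edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0:
                pixels.append((x, y))
    return pixels

covered = covered_pixels((0.0, 0.0), (8.0, 0.0), (0.0, 8.0))
assert (1, 1) in covered and (7, 7) not in covered
```

GPUs evaluate these edge functions for many pixels in parallel, which is why this brute-force-looking loop is actually the right shape for the hardware.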
Watch the rasterizer work, pixel by pixel:
For every fragment inside the triangle, the GPU computes barycentric coordinates — three weights (α, β, γ) that describe how close the fragment is to each vertex. These weights always add up to 1, and they're used to smoothly interpolate all the varyings: color, texture coordinates, normals, anything your vertex shader passed along.
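One common way to compute the weights is as ratios of sub-triangle areas to the whole triangle's area — a Python sketch (real GPUs also apply perspective correction, which this 2D version skips):

```python
def barycentric(a, b, c, p):
    """Weights (alpha, beta, gamma) for point p: each is the area of the
    sub-triangle opposite a vertex, divided by the total area."""
    def area2(p0, p1, p2):  # twice the signed area
        return ((p1[0] - p0[0]) * (p2[1] - p0[1])
                - (p1[1] - p0[1]) * (p2[0] - p0[0]))
    total = area2(a, b, c)
    alpha = area2(p, b, c) / total
    beta = area2(a, p, c) / total
    return alpha, beta, 1.0 - alpha - beta

def interpolate(weights, va, vb, vc):
    """Blend any per-vertex value (color, UV, normal) by the weights."""
    wa, wb, wc = weights
    return tuple(wa * x + wb * y + wc * z for x, y, z in zip(va, vb, vc))

a, b, c = (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)
red, green, blue = (1, 0, 0), (0, 1, 0), (0, 0, 1)

w = barycentric(a, b, c, (0.0, 0.0))        # a fragment exactly at vertex a
assert w == (1.0, 0.0, 0.0)
assert interpolate(w, red, green, blue) == (1.0, 0.0, 0.0)
```

At a vertex, one weight is 1 and the others are 0; at the centroid, all three are 1/3. Every varying your fragment shader receives went through exactly this blend.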
Here each vertex has a pure color. Watch how the barycentric weights create a smooth gradient across the triangle:
Look closely at the edges of a rasterized triangle and you'll see staircase-shaped jaggies — aliasing. The problem: each pixel is either "in" or "out". There's no in-between.
MSAA (Multi-Sample Anti-Aliasing) fixes this by testing multiple points within each pixel. If 2 of 4 sample points are inside the triangle, the pixel gets 50% coverage. The fragment shader still only runs once per pixel, but the coverage mask determines how much of the final color to blend in.
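The coverage computation itself is simple. Here's a Python sketch — the sample positions follow a common 4x pattern, but actual positions vary by GPU and API, and `inside_triangle` stands in for the edge tests from the rasterizer:

```python
# A common 4x sample pattern inside a pixel (positions vary by GPU/API).
SAMPLE_OFFSETS = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]

def coverage(pixel_x, pixel_y, inside_triangle):
    """Fraction of this pixel's sample points covered by the triangle.
    The fragment shader still runs once; this mask weights the result."""
    hits = sum(inside_triangle(pixel_x + ox, pixel_y + oy)
               for ox, oy in SAMPLE_OFFSETS)
    return hits / len(SAMPLE_OFFSETS)

# A half-plane edge standing in for a triangle edge: x < 0.5 is "inside".
cov = coverage(0, 0, lambda x, y: x < 0.5)
assert cov == 0.5   # 2 of 4 samples fall inside: 50% coverage
```

That's the key economy of MSAA versus supersampling: only the coverage test runs per sample, not the expensive fragment shader.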
We've got fragments. Each one knows its screen position, its depth, and its interpolated data. But should it actually become a pixel? The GPU runs up to three tests to decide — and some of them can happen before the fragment shader even runs.
If the GPU knows your fragment shader won't modify the depth value (which is most of the time), it can test depth before running the shader. Why? Because the fragment shader is expensive! If a fragment is behind something already drawn, there's no point running its shader. Early-Z is one of the most important performance optimizations in modern GPUs.
The depth buffer (or z-buffer) stores one depth value per pixel. When a new fragment arrives, its depth gets compared against what's already stored. If it's closer, it wins — its color replaces the old one, and its depth gets written. If it's further away, it's discarded.
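The z-buffer algorithm itself is tiny. A sketch with a "less-than" compare (the compare function is configurable in real APIs):

```python
def depth_test(depth_buffer, color_buffer, x, y, frag_depth, frag_color):
    """Classic z-buffer: closer fragments (smaller depth) win and write
    both buffers; farther ones are discarded."""
    if frag_depth < depth_buffer[y][x]:
        depth_buffer[y][x] = frag_depth
        color_buffer[y][x] = frag_color
        return True
    return False

W, H = 2, 2
depth = [[1.0] * W for _ in range(H)]   # cleared to the far plane
color = [[None] * W for _ in range(H)]

assert depth_test(depth, color, 0, 0, 0.8, "red")       # empty pixel: passes
assert not depth_test(depth, color, 0, 0, 0.9, "blue")  # behind red: discarded
assert depth_test(depth, color, 0, 0, 0.3, "green")     # in front: wins
assert color[0][0] == "green" and depth[0][0] == 0.3
```

Notice that the result is order-independent for opaque geometry: no matter which triangle arrives first, the closest fragment ends up in the buffer.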
Adjust the depth of the two triangles and watch the depth buffer update:
The stencil buffer is the pipeline's secret weapon. It stores an integer per pixel (usually 8-bit) and lets you define arbitrary pass/fail rules. You can increment, decrement, or set the stencil value when fragments pass or fail, and you can make later draw calls conditional on the stencil value.
Classic uses: portals (draw the portal shape into the stencil, then only draw the portal's world where stencil passes), outline effects (draw the object, increment stencil, then draw a slightly larger version only where stencil is zero), mirrors, shadow volumes, and more.
Draw a mask shape into the stencil buffer, then see how it controls what gets rendered:
You already know how to write a fragment shader — but now you understand what's feeding it. Each fragment arriving at your shader carries:
Screen position — where on the screen this fragment lands (gl_FragCoord)
Depth — how far from the camera, interpolated from the vertices
Interpolated varyings — everything your vertex shader output (UVs, normals, colors), blended via barycentric coordinates from the rasterizer
Face orientation — whether this fragment belongs to a front or back face (gl_FrontFacing)
And it's the rasterizer that did all that interpolation work. Your vertex shader set up three corners with different normals, UVs, and colors — the rasterizer smoothly blended them across every fragment in between. The fragment shader just gets to enjoy the result.
One thing that might now click: early-Z. If your fragment shader writes to gl_FragDepth (overriding the interpolated depth), the GPU can't do the early depth test — because it doesn't know the final depth until after the shader runs. That one innocent-looking line can cost you the entire early-Z optimization for that draw call.
The fragment survived the depth test, passed the stencil test, and your shader gave it a color. Now what? If blending is disabled (the default for opaque objects), the color simply overwrites whatever was in the framebuffer. Done.
But for transparent objects, we need blending — combining the new fragment's color with the existing framebuffer color using a configurable equation.
The most common: result = src × srcAlpha + dst × (1 - srcAlpha). This is classic alpha blending — a fragment with 50% alpha mixes equally with whatever's behind it.
But there are others: additive blending (src + dst) for glow and fire effects, multiplicative (src × dst) for tinting and shadows, and more exotic combinations.
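All three equations are a couple of multiplies and adds per channel. A Python sketch (channel values as floats in [0, 1]; mode names are mine, not any API's constants):

```python
def blend(src, dst, mode="alpha"):
    """Combine a new fragment color (src, with alpha) into the existing
    framebuffer color (dst)."""
    sr, sg, sb, sa = src
    src_rgb, dst_rgb = (sr, sg, sb), dst
    if mode == "alpha":        # src*srcAlpha + dst*(1 - srcAlpha)
        return tuple(s * sa + d * (1 - sa) for s, d in zip(src_rgb, dst_rgb))
    if mode == "additive":     # src + dst, clamped: glow, fire
        return tuple(min(s + d, 1.0) for s, d in zip(src_rgb, dst_rgb))
    if mode == "multiply":     # src * dst: tinting, shadows
        return tuple(s * d for s, d in zip(src_rgb, dst_rgb))
    raise ValueError(mode)

white = (1.0, 1.0, 1.0)
red_half = (1.0, 0.0, 0.0, 0.5)   # a red fragment with 50% alpha

assert blend(red_half, white) == (1.0, 0.5, 0.5)              # even mix
assert blend(red_half, white, "additive") == (1.0, 1.0, 1.0)  # saturates
assert blend(red_half, white, "multiply") == (1.0, 0.0, 0.0)  # darkens
```

Note that alpha blending is order-dependent — red over white is not white over red — which is why engines sort transparent geometry back-to-front before drawing it.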
Reorder the transparent triangles and see how draw order affects the result:
The GPU has been writing fragments to a framebuffer — but that's not what you see on screen. Your display reads from a different buffer: the front buffer. The GPU writes to the back buffer. When a frame is done, they swap.
Without double buffering, the display might read the framebuffer while the GPU is halfway through rendering — you'd see the top half of the new frame and the bottom half of the old one. That's tearing.
Double buffering fixes this: the GPU renders to a hidden back buffer, then swaps the buffers between display refresh cycles. V-Sync synchronizes this swap to the monitor's refresh rate — but if the GPU can't finish a frame in time, it has to wait a whole extra refresh cycle, causing stutter.
Triple buffering adds a third buffer: the GPU can start the next frame immediately into a second back buffer while the first one waits to be displayed. Less stutter, slightly more latency.
And there it is — the complete GPU rendering pipeline, from draw call to display. Let's zoom back out and see the whole thing one more time, now that every stage is familiar:
Every frame of every real-time 3D application runs this pipeline — often pushing millions of triangles through it in under 16 milliseconds. The GPU's massively parallel architecture handles the per-vertex and per-fragment stages across thousands of cores simultaneously, while the fixed-function hardware handles the rest at dedicated silicon speed.
Now when you write a vertex shader, you know exactly what prepared the data it receives. When you write a fragment shader, you know what rasterization did to create each fragment, what tests it might have already passed, and what happens to its output. The pipeline isn't a black box anymore — it's a well-oiled machine, and you understand every gear.