The GPU Rendering Pipeline

You know the vertex shader. You know the fragment shader. Let's explore everything in between — and beyond.

Act I

The Assembly Line

Every frame of every game, every spin of every 3D model viewer, every flashy WebGL landing page — they all follow the same sequence. Your GPU takes in raw vertex data and, through a carefully orchestrated chain of stages, produces the colored pixels you see on screen.

You've already written code for two of these stages: the vertex shader and the fragment shader. But the full pipeline has a lot more going on. Some stages are programmable (you write the code), some are fixed-function (the hardware handles it), and some are optional — only activated when you need them.

Here's the whole thing. Hover over any stage to see what it does:

Legend: Programmable · Fixed-function · Optional · You know this!
Try it: Hover each stage from top to bottom. Follow the data as it flows — vertices become primitives, primitives become fragments, fragments become pixels.

That's a lot of stages! Don't worry — we're going to walk through each one, with interactive demos that let you see exactly what's happening at every step. Let's start at the very beginning.

Act II

The Draw Call & Input Assembly

Everything starts on the CPU. Your application — the game engine, the WebGL app, whatever it is — decides it's time to draw something. It issues a draw call: "Hey GPU, here's a vertex buffer, here's how to interpret it. Go."

The GPU's first job is Input Assembly. It reads raw numbers from your vertex buffer and groups them into primitives — the basic shapes the rest of the pipeline will work with. Usually triangles, but also points and lines.

Index Buffers: Don't Repeat Yourself

Here's a common situation: you want to draw a square. That's two triangles, six vertices — but a square only has four corners. Without an index buffer, you'd have to list two of those corners twice. With an index buffer, you list each corner once and then say "triangle 1 uses corners 0, 1, 2 — triangle 2 uses corners 2, 3, 0."
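The bookkeeping is easy to sketch. Here's a toy Python version of what Input Assembly does with an index buffer (the square's coordinates are made up for illustration):

```python
# Four unique corners of a unit square, stored once.
vertices = [(0, 0), (1, 0), (1, 1), (0, 1)]

# Six index entries describe two triangles by referencing corners.
indices = [0, 1, 2,   2, 3, 0]

# Input Assembly expands the indices back into full triangles:
triangles = [tuple(vertices[i] for i in indices[t:t + 3])
             for t in range(0, len(indices), 3)]
# Six vertex slots are referenced, but only four vertices are stored.
```

With plain (non-indexed) drawing, `vertices` would need six entries, two of them duplicates; the savings grow quickly for real meshes, where a vertex is shared by many triangles.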

Play with this — toggle between indexed and non-indexed mode and watch what happens to the vertex data:

Try it: Switch to "With Index Buffer" — see how we go from 6 vertices down to 4? Now try "Triangle Strip" topology — notice how each new triangle shares an edge with the previous one, saving even more data.
Input Assembly is where raw buffer data becomes geometry. The index buffer and primitive topology tell the GPU how to connect the dots — literally.
Act III

The Vertex Shader

This one's familiar territory! Your vertex shader runs once per vertex. It receives attributes — position, normal, UV coordinates, whatever you've packed into the buffer — and outputs a position in clip space plus any varyings you want passed down the pipeline.

But let's make sure we're precise about what "clip space" means, because it matters for the next stage. Your vertex shader outputs a 4D vector: (x, y, z, w). This isn't screen pixels yet — it's a coordinate system where everything visible lives inside a specific volume. We'll see exactly what that volume is in Act V.

Here's a tiny vertex shader doing the classic model-view-projection transform. Drag the vertex around in world space and watch the clip-space output change:

drag the orange vertex

Notice how the clip-space coordinates change as you move the vertex or adjust the camera. The w component encodes depth — further away means a larger w. That's going to be important when we get to the perspective divide.
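To pin down where that w comes from, here's a minimal Python sketch of an OpenGL-style perspective projection applied to a view-space point (the fov, near, and far values are arbitrary illustration defaults):

```python
import math

def perspective_project(point, fov_y=60.0, aspect=1.0, near=0.1, far=100.0):
    """Apply an OpenGL-style perspective projection to a view-space
    point (camera looking down -z). Returns 4D clip-space coordinates."""
    x, y, z = point
    f = 1.0 / math.tan(math.radians(fov_y) / 2)
    clip_x = f / aspect * x
    clip_y = f * y
    clip_z = (z * (far + near) + 2 * far * near) / (near - far)
    clip_w = -z                      # w encodes view-space depth
    return (clip_x, clip_y, clip_z, clip_w)

near_pt = perspective_project((1.0, 1.0, -2.0))
far_pt  = perspective_project((1.0, 1.0, -20.0))
# The farther point gets the larger w, so the perspective divide will
# shrink its x and y more. That is what makes distant things look smaller.
```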

The vertex shader transforms each vertex into clip space — a 4D coordinate system. Everything after this stage works in clip space until the perspective divide flattens it onto your screen.
Act IV

The Optional Middle: Tessellation & Geometry Shaders

Between the vertex shader and clipping, there are optional programmable stages that most tutorials skip over. These are powerful tools — but they're opt-in. If you don't enable them, the pipeline skips straight from the vertex shader to clipping.

Tessellation: Making More Triangles

Why would you want the GPU to create more triangles? One huge reason: level of detail. Send a coarse mesh to the GPU and let it subdivide based on distance. Close objects get smooth, detailed surfaces. Far objects stay low-poly. All on the GPU, no CPU cost.

Tessellation actually involves three sub-stages: the tessellation control shader (also called the hull shader) decides how much to subdivide, the tessellation generator (fixed-function) actually creates the new vertices, and the tessellation evaluation shader (domain shader) positions them.

Drag the slider to crank up the tessellation level and watch a single triangle turn into hundreds:

Try it: Slide the tessellation level from 1 all the way to 32. Watch the triangle count explode. Then toggle "Displacement" on — now the subdivided vertices are being pushed along a procedural height field, turning a flat triangle into terrain. This is exactly how GPU-driven terrain works!

Geometry Shader: Creating, Destroying, and Transforming Primitives

The geometry shader is different. It receives an entire primitive (a triangle, a line, a point) and can output zero or more new primitives. Classic uses: expanding points into camera-facing quads (billboarding), generating wireframe overlays, or extruding shadow volumes.

Here, each input point gets expanded into a camera-facing quad — perfect for particle systems:

Stream Output

One more optional feature: stream output (or transform feedback) lets you capture the transformed vertices and write them back to a buffer — without rasterizing at all. This creates a feedback loop: the GPU processes geometry, stores the result, and you can feed it back in on the next frame. It's how GPU-driven particle simulations work: each particle's new position is computed by the vertex shader and captured via stream output, ready for the next frame.

Tessellation creates geometry on the GPU. The geometry shader transforms it. Stream output captures it. These are all optional — but when you need them, they're invaluable.
Act V

Clipping, Perspective Divide & Viewport Transform

Your vertex shader (and optionally tessellation/geometry shaders) produced vertices in clip space — that 4D (x, y, z, w) coordinate. Now the GPU needs to figure out: what's actually visible?

Clipping

The visible volume in clip space is defined by six planes: anything where -w ≤ x ≤ w, -w ≤ y ≤ w, and 0 ≤ z ≤ w (or -w ≤ z ≤ w depending on the API) is inside. Everything outside gets clipped away.

But here's the interesting part: if a triangle is partially outside, the GPU doesn't just discard it. It clips the triangle against the frustum planes, creating new vertices where the edges cross the boundary. One triangle can become two, or even more!
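The classic approach is Sutherland–Hodgman clipping: run the polygon through each plane in turn, emitting a crossing vertex wherever an edge enters or leaves. A 2D Python sketch of one such pass, clipping against a single stand-in plane x ≤ 1:

```python
def clip_against_plane(polygon, inside, intersect):
    """One pass of Sutherland-Hodgman clipping: keep the part of the
    polygon on the 'inside' of one plane, creating new vertices where
    edges cross it."""
    out = []
    for i, curr in enumerate(polygon):
        prev = polygon[i - 1]
        if inside(curr):
            if not inside(prev):
                out.append(intersect(prev, curr))  # entering: add crossing point
            out.append(curr)
        elif inside(prev):
            out.append(intersect(prev, curr))      # leaving: add crossing point
    return out

# Clip a triangle against x <= 1 (standing in for one frustum plane).
inside = lambda p: p[0] <= 1.0
def intersect(a, b):
    t = (1.0 - a[0]) / (b[0] - a[0])
    return (1.0, a[1] + t * (b[1] - a[1]))

tri = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
clipped = clip_against_plane(tri, inside, intersect)
# Three vertices in, four vertices out: one triangle became a quad,
# which the hardware re-triangulates into two triangles.
```

The real clipper does this against all six frustum planes (and tests against w, not a constant), but the vertex-creation step is the same idea.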

Drag the triangle around and watch clipping happen in real time:

drag the vertices around
Try it: Drag one vertex outside the visible area (the bright rectangle). Watch the GPU generate new vertices along the boundary — the triangle gets "trimmed" to fit. Drag two vertices out and see how the clipped shape changes.

Perspective Divide

After clipping, every surviving vertex gets divided by its own w component: (x/w, y/w, z/w). This is the perspective divide, and it's what makes far-away things look smaller. The result is Normalized Device Coordinates (NDC) — x and y run from -1 to 1, and z covers the API's depth range ([-1, 1] or [0, 1], matching the clip volume from the previous step).

Viewport Transform

Finally, NDC gets mapped to actual pixel coordinates on your screen. The x and y go from [-1, 1] to [0, width] and [0, height]. The z gets mapped to the depth range (usually [0, 1]) for later depth testing.
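Both steps fit in a few lines. A Python sketch, assuming OpenGL-style NDC with y up and a final depth range of [0, 1]:

```python
def clip_to_pixels(clip, width, height):
    """Perspective divide followed by the viewport transform.
    A sketch assuming OpenGL-style NDC, not any specific API's exact rules."""
    x, y, z, w = clip
    ndc = (x / w, y / w, z / w)          # perspective divide -> [-1, 1]
    px = (ndc[0] + 1) * 0.5 * width      # [-1, 1] -> [0, width]
    py = (ndc[1] + 1) * 0.5 * height     # [-1, 1] -> [0, height]
    depth = (ndc[2] + 1) * 0.5           # [-1, 1] -> [0, 1] for depth testing
    return (px, py, depth)

clip_to_pixels((0.0, 0.0, 0.0, 1.0), 800, 600)  # center of an 800x600 screen
```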

Clipping trims geometry to the visible volume. The perspective divide makes far things smaller. The viewport transform maps everything to screen pixels. Three fixed-function stages, each essential.
Act VI

Face Culling

Before the GPU spends effort rasterizing a triangle, it asks a quick question: is this triangle facing towards the camera, or away from it? If it's facing away — and you've enabled backface culling — it gets discarded immediately.

How does the GPU know which way a triangle faces? Winding order. If the vertices appear in counter-clockwise order on screen, the triangle is front-facing. Clockwise? Back-facing. (This convention is configurable, but CCW = front is the most common.)
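The test itself is a single cross product. A Python sketch, assuming y-up screen coordinates and the CCW-is-front convention:

```python
def is_front_facing(a, b, c):
    """Signed-area test on three screen-space vertices: a positive area
    means counter-clockwise winding, which we treat as front-facing
    (the common default; the convention is configurable)."""
    area2 = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return area2 > 0

is_front_facing((0, 0), (1, 0), (0, 1))   # CCW: front-facing
is_front_facing((0, 0), (0, 1), (1, 0))   # CW: back-facing, culled
```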

Here's a rotating cube. Toggle backface culling and watch half the triangles disappear:

Try it: Turn on "Cull Back Faces" — for a closed solid like a cube, you'll never see a back face anyway, so this is pure free performance. Now try "Cull Front Faces" — you see the inside of the cube! Toggle "Show winding" to see the arrows showing vertex order on each face.
Face culling is a cheap, powerful optimization. For any closed solid, roughly half the triangles face away from the camera — culling them means the rasterizer and fragment shader only run on what you can actually see.
Act VII

Rasterization

This is the big one. We've got a triangle in screen coordinates — three 2D points with some associated data (depth, interpolated varyings). The rasterizer's job is to figure out which pixels (or more precisely, which fragments) that triangle covers.

The distinction matters: a fragment is a candidate pixel. It carries interpolated data from the vertices, plus a position on the screen. It might not survive the depth test later — so it's not a pixel yet, just a candidate.

Edge Testing

For each pixel in the triangle's bounding box, the rasterizer tests: is this pixel's center inside all three edges? This is done with a simple cross-product test against each edge — if the point is on the "inside" half-plane of all three edges, it's in.
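A toy Python rasterizer makes the loop explicit (real GPUs test tiles of pixels in parallel and apply fill rules so shared edges aren't drawn twice; this sketch assumes a CCW triangle):

```python
def edge(a, b, p):
    """Edge function: positive when p is on the inside half-plane of a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2):
    """Test every pixel center in the triangle's bounding box against
    all three edges; collect the covered pixels as fragments."""
    xs = [v[0] for v in (v0, v1, v2)]
    ys = [v[1] for v in (v0, v1, v2)]
    fragments = []
    for y in range(int(min(ys)), int(max(ys)) + 1):
        for x in range(int(min(xs)), int(max(xs)) + 1):
            p = (x + 0.5, y + 0.5)           # sample at the pixel center
            if (edge(v0, v1, p) >= 0 and
                edge(v1, v2, p) >= 0 and
                edge(v2, v0, p) >= 0):
                fragments.append((x, y))
    return fragments

frags = rasterize((0, 0), (8, 0), (0, 8))
```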

Watch the rasterizer work, pixel by pixel:

drag the triangle vertices
Try it: Hit "Step" to advance one pixel at a time. Watch the edge tests happen — red outlines for "outside", filled for "inside". Then hit "Run" to watch the full rasterization. Drag the vertices to reshape the triangle and reset to try again.

Barycentric Interpolation

For every fragment inside the triangle, the GPU computes barycentric coordinates — three weights (α, β, γ) that describe how close the fragment is to each vertex. These weights always add up to 1, and they're used to smoothly interpolate all the varyings: color, texture coordinates, normals, anything your vertex shader passed along.
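One common way to compute the weights is as ratios of signed areas. A Python sketch (the triangle and colors are made up to mirror the demo):

```python
def barycentric(p, a, b, c):
    """Barycentric weights of point p in triangle (a, b, c), computed
    as ratios of signed sub-triangle areas. The weights sum to 1."""
    def area2(p0, p1, p2):
        return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p1[1] - p0[1]) * (p2[0] - p0[0])
    total = area2(a, b, c)
    alpha = area2(p, b, c) / total
    beta = area2(a, p, c) / total
    gamma = 1.0 - alpha - beta
    return alpha, beta, gamma

def interpolate(p, verts, attrs):
    """Blend per-vertex attributes (e.g. colors) across the triangle."""
    w = barycentric(p, *verts)
    return tuple(sum(wi * attr[k] for wi, attr in zip(w, attrs))
                 for k in range(len(attrs[0])))

verts = [(0, 0), (10, 0), (0, 10)]
colors = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]   # pure red / green / blue
interpolate((0, 0), verts, colors)            # at vertex a: pure red
```

One detail this sketch glosses over: for perspective-correct results, GPUs actually interpolate attributes divided by w and then undo the division per fragment.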

Here each vertex has a pure color. Watch how the barycentric weights create a smooth gradient across the triangle:

drag the vertices · hover to see weights

Multisampling (MSAA)

Look closely at the edges of a rasterized triangle and you'll see staircase-shaped jaggies — aliasing. The problem: each pixel is either "in" or "out". There's no in-between.

MSAA (Multi-Sample Anti-Aliasing) fixes this by testing multiple points within each pixel. If 2 of 4 sample points are inside the triangle, the pixel gets 50% coverage. The fragment shader still only runs once per pixel, but the coverage mask determines how much of the final color to blend in.
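Coverage is just "how many samples landed inside". A Python sketch using a rotated-grid 4× pattern (a common layout; actual sample positions vary by GPU):

```python
# Rotated-grid 4x sample offsets within a pixel (illustrative positions).
SAMPLES_4X = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]

def coverage(px, py, inside):
    """Fraction of a pixel's sample points covered by the triangle.
    'inside' is any point-in-triangle predicate."""
    hits = sum(inside((px + sx, py + sy)) for sx, sy in SAMPLES_4X)
    return hits / len(SAMPLES_4X)

# A half-plane x < 3.5 standing in for a triangle edge:
inside = lambda p: p[0] < 3.5
coverage(0, 0, inside)   # interior pixel: full coverage
coverage(3, 0, inside)   # edge pixel: partial coverage
```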

Try it: Start with "No AA" — see the harsh staircase on the edge? Now go to 4× MSAA. The edge pixels become semi-transparent as coverage kicks in. Toggle "Show samples" to see the individual sample points inside each pixel.
Rasterization converts triangles to fragments using edge tests and barycentric interpolation. It's the bridge between the world of geometry and the world of pixels — and MSAA smooths the transition at the edges.
Act VIII

The Depth & Stencil Gauntlet

We've got fragments. Each one knows its screen position, its depth, and its interpolated data. But should it actually become a pixel? The GPU runs up to three tests to decide — and some of them can happen before the fragment shader even runs.

Early Depth Test (Early-Z)

If the GPU knows your fragment shader won't modify the depth value (which is most of the time), it can test depth before running the shader. Why? Because the fragment shader is expensive! If a fragment is behind something already drawn, there's no point running its shader. Early-Z is one of the most important performance optimizations in modern GPUs.

The Depth Buffer

The depth buffer (or z-buffer) stores one depth value per pixel. When a new fragment arrives, its depth gets compared against what's already stored. If it's closer, it wins — its color replaces the old one, and its depth gets written. If it's further away, it's discarded.
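The whole algorithm fits in a few lines. A Python sketch of a less-than depth test over a tiny 2×2 buffer:

```python
def depth_test(depth_buffer, color_buffer, frag):
    """Classic less-than depth test: the closer fragment wins and
    writes both its color and its depth; farther ones are discarded."""
    x, y, depth, color = frag
    if depth < depth_buffer[y][x]:
        depth_buffer[y][x] = depth
        color_buffer[y][x] = color

W = H = 2
depth_buffer = [[1.0] * W for _ in range(H)]   # cleared to the far plane
color_buffer = [[None] * W for _ in range(H)]

depth_test(depth_buffer, color_buffer, (0, 0, 0.8, "blue"))
depth_test(depth_buffer, color_buffer, (0, 0, 0.3, "red"))    # closer: wins
depth_test(depth_buffer, color_buffer, (0, 0, 0.9, "green"))  # farther: discarded
# color_buffer[0][0] is now "red", regardless of draw order
```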

Adjust the depth of the two triangles and watch the depth buffer update:

Try it: Slide the red triangle closer (lower depth) and it covers the blue one. Slide it further away and the blue one wins. Now uncheck "Depth test" — both triangles render in draw order with no occlusion. That's what games looked like before z-buffers!

The Stencil Buffer

The stencil buffer is the pipeline's secret weapon. It stores an integer per pixel (usually 8-bit) and lets you define arbitrary pass/fail rules. You can increment, decrement, or set the stencil value when fragments pass or fail, and you can make later draw calls conditional on the stencil value.

Classic uses: portals (draw the portal shape into the stencil, then only draw the portal's world where stencil passes), outline effects (draw the object, increment stencil, then draw a slightly larger version only where stencil is zero), mirrors, shadow volumes, and more.
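Here's a toy two-pass Python sketch of the portal trick on a 4×4 grid: pass 1 writes the portal shape into the stencil, pass 2 renders only where a stencil test of EQUAL against a reference value of 1 passes (the grid and shapes are invented for illustration):

```python
W = H = 4
stencil = [[0] * W for _ in range(H)]

# Pass 1: draw the "portal" shape, setting stencil to 1 where it covers.
for (x, y) in [(1, 1), (2, 1), (1, 2), (2, 2)]:
    stencil[y][x] = 1

# Pass 2: draw the portal's world only where the stencil test passes
# (compare function EQUAL, reference value 1).
framebuffer = [["." for _ in range(W)] for _ in range(H)]
for y in range(H):
    for x in range(W):
        if stencil[y][x] == 1:          # stencil test
            framebuffer[y][x] = "P"     # portal-world pixel survives
```

The real API is more general: you pick the compare function and what to do to the stored value on pass or fail (keep, zero, increment, decrement, invert, replace), but the two-pass mask-then-draw pattern is the core of it.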

Draw a mask shape into the stencil buffer, then see how it controls what gets rendered:

click and drag to paint the stencil mask
Try it: Paint a shape on the left panel (that's the stencil buffer). Then look at the right panel — the scene only renders through your stencil mask. Switch to "Render where stencil != 1" to invert it. This is exactly how portal effects work!
Depth testing handles occlusion — closer things hide further things. Stencil testing handles masking — you define arbitrary rules for which pixels survive. Together, they're the gatekeepers between your fragment shader and the framebuffer.
Act IX

The Fragment Shader, Revisited

You already know how to write a fragment shader — but now you understand what's feeding it. Each fragment arriving at your shader carries:

Screen position — where on the screen this fragment lands (gl_FragCoord)
Depth — how far from the camera, interpolated from the vertices
Interpolated varyings — everything your vertex shader output (UVs, normals, colors), blended via barycentric coordinates from the rasterizer
Face orientation — whether this fragment belongs to a front or back face (gl_FrontFacing)

And it's the rasterizer that did all that interpolation work. Your vertex shader set up three corners with different normals, UVs, and colors — the rasterizer smoothly blended them across every fragment in between. The fragment shader just gets to enjoy the result.

One thing that might now click: early-Z. If your fragment shader writes to gl_FragDepth (overriding the interpolated depth), the GPU can't do the early depth test — because it doesn't know the final depth until after the shader runs. That one innocent-looking line can cost you the entire early-Z optimization for that draw call.

The fragment shader is powerful precisely because of everything that comes before it. Rasterization gives it smooth interpolation. Depth and stencil tests keep it from running on invisible fragments. Understanding the context makes you write better shaders.
Act X

Blending & the Framebuffer

The fragment survived the depth test, passed the stencil test, and your shader gave it a color. Now what? If blending is disabled (the default for opaque objects), the color simply overwrites whatever was in the framebuffer. Done.

But for transparent objects, we need blending — combining the new fragment's color with the existing framebuffer color using a configurable equation.

The Blend Equation

The most common: result = src × srcAlpha + dst × (1 - srcAlpha). This is classic alpha blending — a fragment with 50% alpha mixes equally with whatever's behind it.

But there are others: additive blending (src + dst) for glow and fire effects, multiplicative (src × dst) for tinting and shadows, and more exotic combinations.
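Both equations are simple per-channel arithmetic. A Python sketch (colors are (r, g, b) tuples in [0, 1]; source fragments carry an alpha):

```python
def alpha_blend(src, dst):
    """result = src * srcAlpha + dst * (1 - srcAlpha), per channel."""
    sr, sg, sb, sa = src
    return tuple(s * sa + d * (1 - sa) for s, d in zip((sr, sg, sb), dst))

def additive_blend(src, dst):
    """result = src + dst, clamped to 1.0 per channel."""
    sr, sg, sb, sa = src
    return tuple(min(1.0, s + d) for s, d in zip((sr, sg, sb), dst))

black = (0.0, 0.0, 0.0)
red_50 = (1.0, 0.0, 0.0, 0.5)    # 50%-alpha red fragment
alpha_blend(red_50, black)        # mixes halfway toward red

# Additive blending commutes: draw order doesn't change the result.
f1 = (0.2, 0.0, 0.0, 1.0)
f2 = (0.3, 0.0, 0.0, 1.0)
a = additive_blend(f2, additive_blend(f1, black))
b = additive_blend(f1, additive_blend(f2, black))
```

Alpha blending, by contrast, does not commute, which is exactly the ordering problem the demo below lets you poke at.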

Reorder the transparent triangles and see how draw order affects the result:

Try it: With alpha blending, change the draw order — notice how the final image changes? This is the classic transparency ordering problem. Now switch to "Additive" mode — order doesn't matter anymore! (But the look is very different.) This is why games draw opaque objects first (order doesn't matter with depth test), then transparent objects sorted back-to-front.
Blending combines the fragment's color with the framebuffer. For alpha blending, draw order matters — back-to-front for correct results. Additive blending is order-independent, which is why it's popular for particles and effects.
Act XI

Swap Chain & Presentation

The GPU has been writing fragments to a framebuffer — but that's not what you see on screen. Your display reads from a different buffer: the front buffer. The GPU writes to the back buffer. When a frame is done, they swap.

Double Buffering & V-Sync

Without double buffering, the display might read the framebuffer while the GPU is halfway through rendering — you'd see the top half of the new frame and the bottom half of the old one. That's tearing.

Double buffering fixes this: the GPU renders to a hidden back buffer, then swaps the buffers between display refresh cycles. V-Sync synchronizes this swap to the monitor's refresh rate — but if the GPU can't finish a frame in time, it has to wait a whole extra refresh cycle, causing stutter.

Triple buffering adds a third buffer: the GPU can start the next frame immediately into a second back buffer while the first one waits to be displayed. Less stutter, slightly more latency.

The swap chain is the final handoff: GPU rendering meets display hardware. Double buffering + V-Sync prevents tearing; triple buffering reduces stutter. It's the last step before photons hit your eyes.
Act XII

The Complete Picture

And there it is — the complete GPU rendering pipeline, from draw call to display. Let's zoom back out and see the whole thing one more time, now that every stage is familiar:


Every frame of every real-time 3D application runs this pipeline — often pushing millions of triangles through it in under 16 milliseconds. The GPU's massively parallel architecture handles the per-vertex and per-fragment stages across thousands of cores simultaneously, while the fixed-function hardware handles the rest at dedicated silicon speed.

Now when you write a vertex shader, you know exactly what prepared the data it receives. When you write a fragment shader, you know what rasterization did to create each fragment, what tests it might have already passed, and what happens to its output. The pipeline isn't a black box anymore — it's a well-oiled machine, and you understand every gear.

You started knowing two stages. Now you know them all. Every optimization tip, every rendering technique, every visual artifact you'll ever encounter traces back to one of these stages. Welcome to the full picture.