Morton Z-Curve vs Row-Major
How reordering memory made the simulation 2× faster at 1 billion cells, with zero changes to the GPU code
The Problem
GPU simulations touch neighbours. Every cell in the simulation reads the state of its 6 neighbouring cells, every tick. On a hex grid with 1 billion cells, that's 6 billion memory reads per tick.
Where those neighbours live in memory matters. GPUs read memory in chunks called "cache lines" (128 bytes on Apple Silicon). If your neighbour's data is in the same cache line you already loaded, the read is essentially free. If it's far away in memory, the GPU has to fetch a whole new chunk from main memory — that costs ~100 nanoseconds.
Row-major layout is the default: cells are stored left-to-right, row by row. Your left and right neighbours are right next to you in memory — fast. But your top and bottom neighbours are an entire row width away. On a 32,768-wide grid, that's 128 KB away — a guaranteed cache miss for every vertical neighbour read.
Morton Z-curve interleaves the bits of the column and row coordinates, creating a space-filling Z-pattern. Cells that are spatial neighbours end up close together in memory — in all directions, not just horizontal. The GPU loads one cache line and finds most of its neighbours already there.
Row-major:
- neighbours in the same row: fast (adjacent in memory)
- neighbours in other rows: slow (a full row width apart in memory)

Morton Z:
- ALL neighbours clustered within a few hundred indices
- one cache-line load serves multiple neighbour reads

At 1M cells, layout barely matters: everything fits in cache. At 1B cells, layout is everything: row-major thrashes while Morton stays fast.
The Experiment
We ran the exact same simulation — same hex grid, same predator-prey biology, same Metal GPU kernels — with two different memory layouts. The only change was how the state arrays are indexed: row-major vs Morton Z-curve.
No kernel code was modified. No algorithms were changed. No GPU shader was touched. Just the data layout.
Each grid size was run 10 times with a fresh grid and different random seed per run. We report the mean, standard deviation, and 95% confidence interval. The confidence intervals don't overlap at any scale ≥ 4M cells — the results are statistically significant.
Throughput (GCUPS) vs Grid Size
Row-major collapses at scale. Morton stays flat. The gap widens with every doubling.
Latency (ms/tick) vs Grid Size
Log-log scale. Perfect linear scaling = slope 1. Morton hugs the ideal line.
Speedup Factor vs Grid Size
At 1 billion cells, Morton is 2.11× faster — purely from data layout.
Summary Table
| Grid | Cells | Row-Major (ms/tick) | Morton (ms/tick) | Latency reduction | Row GCUPS | Morton GCUPS |
|---|---|---|---|---|---|---|
| 1024² | 1M | 0.65 | 0.58 | 10.8% | 11.2 | 12.6 |
| 2048² | 4M | 2.04 | 1.98 | 3.0% | 14.4 | 14.8 |
| 4096² | 16M | 8.01 | 7.78 | 3.0% | 14.7 | 15.1 |
| 8192² | 64M | 33.69 | 29.77 | 11.6% | 13.9 | 15.8 |
| 16384² | 256M | 206.90 | 126.88 | 38.7% | 9.1 | 14.8 |
| 32768² | 1B | 1102.24 | 522.75 | 52.6% (2.11×) | 6.8 | 14.4 |
10 runs per grid. 95% CIs non-overlapping at all scales ≥ 4M. Zero kernel changes between versions.