Morton Z-Curve vs Row-Major
How reordering memory made the simulation 2× faster at 1 billion cells, with zero changes to the GPU code
The Problem
GPU simulations touch neighbours. Every cell in the simulation reads the state of its 6 neighbouring cells, every tick. On a hex grid with 1 billion cells, that's 6 billion memory reads per tick.
Where those neighbours live in memory matters. GPUs read memory in chunks called "cache lines" (128 bytes on Apple Silicon). If your neighbour's data is in the same cache line you already loaded, the read is essentially free. If it's far away in memory, the GPU has to fetch a whole new chunk from main memory — that costs ~100 nanoseconds.
Row-major layout is the default: cells are stored left-to-right, row by row. Your left and right neighbours are right next to you in memory — fast. But your top and bottom neighbours are an entire row width away. On a 32,768-wide grid, that's 128 KB away — a guaranteed cache miss for every vertical neighbour read.
Morton Z-curve interleaves the bits of the column and row coordinates, creating a space-filling Z-pattern. Cells that are spatial neighbours end up close together in memory — in all directions, not just horizontal. The GPU loads one cache line and finds most of its neighbours already there.
Row-major:
- neighbours in the same row: fast (adjacent in memory)
- neighbours in other rows: slow (a full row width apart in memory)

Morton Z:
- ALL neighbours clustered within a few hundred indices
- one cache-line load serves multiple neighbour reads

At 1M cells, layout barely matters: everything fits in cache. At 1B cells, layout is everything: row-major thrashes while Morton stays fast.
The Experiment
We ran the exact same simulation — same hex grid, same predator-prey biology, same Metal GPU kernels — with two different memory layouts. The only change was how the state arrays are indexed: row-major vs Morton Z-curve.
No kernel code was modified. No algorithms were changed. No GPU shader was touched. Just the data layout.
Each grid size was run 10 times with a fresh grid and different random seed per run. We report the mean, standard deviation, and 95% confidence interval. The confidence intervals don't overlap at any scale ≥ 4M cells — the results are statistically significant.
Throughput (GCUPS) vs Grid Size
Row-major collapses at scale. Morton stays flat. The gap widens with every doubling.
Latency (ms/tick) vs Grid Size
Log-log scale. Perfect linear scaling = slope 1. Morton hugs the ideal line.
Speedup Factor vs Grid Size
At 1 billion cells, Morton is 2.11× faster — purely from data layout.
Summary Table
| Grid | Cells | Row-Major (ms/tick) | Morton (ms/tick) | Latency reduction | Row GCUPS | Morton GCUPS |
|---|---|---|---|---|---|---|
| 1024² | 1M | 0.65 | 0.58 | 10.8% | 11.2 | 12.6 |
| 2048² | 4M | 2.04 | 1.98 | 3.0% | 14.4 | 14.8 |
| 4096² | 16M | 8.01 | 7.78 | 3.0% | 14.7 | 15.1 |
| 8192² | 64M | 33.69 | 29.77 | 11.6% | 13.9 | 15.8 |
| 16384² | 256M | 206.90 | 126.88 | 38.7% | 9.1 | 14.8 |
| 32768² | 1B | 1102.24 | 522.75 | 52.6% (2.11×) | 6.8 | 14.4 |
10 runs per grid. 95% CIs non-overlapping at all scales ≥ 4M. Zero kernel changes between versions.