GPU Graph Compute at 15.8 Billion Cell-Updates Per Second
1,048,576 hexagonal cells
Each cell = one node in a graph
■ Grass grows ■ Zebras graze
■ Lions hunt ■ Water attracts
Simulated 1,722× per second
Displayed at 60-120 fps (screen refresh)
On one Apple M5 Max GPU
A million hexagonal cells. A living ecosystem simulated seventeen hundred and twenty-two times per second on a single GPU. Displayed at your screen's refresh rate. The biology is just the test workload; the product is an ultra-scale spatial lattice engine.
1 / 10
THE STATE TENSOR
Channel       Type     Size
entity        Int8     1 byte
energy        Int16    2 bytes
ternary       Int8     1 byte
age           Int16    2 bytes
orientation   Int8     1 byte

Per node: 7 bytes
× 1M nodes: 7 MB
+ 4 scent fields (Float32): 16 MB
Total state: 23 MB
At 1,722 tps: 40 GB/s
Five channels per node. Seven bytes per cell. Times one million. Plus four scent diffusion fields. Twenty-three megabytes of state, mutated seventeen hundred and twenty-two times per second. Forty gigabytes per second of state throughput.
2 / 10
GPU PIPELINE — 13 Kernels Per Tick
4× Scent Diffusion (1M cells each)
zebra, grass, lion, water
7× Entity Update (150K cells each)
one per colour group
sense → compute → emit
1× Grass Growth (1M cells)
1× Census (1M cells)
Total: 7M effective cell-updates per tick
Peak at 64M cells: 15.8B cell-updates/sec
Thirteen Metal kernel launches per tick. Four scent diffusions, each touching all million cells. Seven entity updates, one per colour group. Plus grass growth and census. Seven million effective cell updates per tick. Peak throughput at sixty-four million cells: fifteen point eight billion cell updates per second.
3 / 10
7-COLOURING — Lock-Free Parallelism
Problem: nodes MOVE.
Two threads writing the same cell = silent data loss.
Solution: colour the graph so no two
same-coloured nodes share ANY neighbour.
[diagram: hex lattice tiled with colours 1–7, each row's pattern offset so that no two same-coloured cells fall within two steps of each other]
Formula: (col + row + 4×(col&1)) mod 7
Molloy & Salavatipour, 2005
7 is the minimum, verified exhaustively.
lock-free · atomic-free · distance-2 safe
When nodes move, they write into a neighbour cell. If two GPU threads write the same cell simultaneously, data is silently destroyed. The seven-colouring guarantees that no two same-coloured cells share any common neighbour. Within each colour group, all one hundred and fifty thousand cells run in parallel. No locks. No atomics. Pure data parallelism.
4 / 10
PERFORMANCE — Where Time Goes
Component            Time
GPU compute (sim)    0.6 ms     (fast)
GPU render           0.2 ms     (fast)
vsync wait           15.6 ms    (idle)
Total per frame      16.6 ms    = 60 fps
The GPU is idle 95% of the time.
Waiting for the screen to refresh.
Without vsync: ~1,722 tps
= 15.8 BILLION cell-updates/sec
The GPU computes one million cells in zero point five eight milliseconds. Then renders in zero point two. Then waits fifteen milliseconds for the screen. The GPU is idle ninety-five percent of the time. Without vsync: seventeen hundred and twenty-two ticks per second. Peak at sixty-four million cells: fifteen point eight billion cell updates per second. On one chip.
5 / 10
SCALING — How Big Can We Go?
Grid              ms/tick   TPS     Memory
1M (1K×1K)        0.58      1,722   23 MB
4M (2K×2K)        1.98      504     92 MB
16M (4K×4K)       7.78      128     368 MB
64M (8K×8K)       29.8      33      1.5 GB
256M (16K×16K)    126.9     7.9     5.9 GB
1B (32K×32K)      522.8     1.9     23 GB
M5 Max: 128 GB unified memory
Theoretical max: ~5 billion cells
Real-time (>1 tps): up to 1 billion
Metal compute scales linearly. Double the cells, double the time. Sixteen million cells: one hundred twenty-eight ticks per second. Sixty-four million: thirty-three. One billion cells: nearly two ticks per second. The M5 Max has one hundred twenty-eight gigabytes of unified memory. Real-time up to one billion cells.
6 / 10
TWO PIPELINES
HTML (remote — phone, AirPlay)
GPU → CPU → File → HTTP → JS → Canvas   (~25 ms)
Metal Native (local)
GPU → Screen   (~1 ms)
HTML goes through six hops: GPU, CPU, file, HTTP, JavaScript, canvas. Twenty-five milliseconds. Metal native: GPU to screen. One millisecond. Twenty-five times lower latency. Both run simultaneously.
7 / 10
THE CEILING IS THE ALGORITHM
Today: 13 kernels per tick across 64M cells → 15.8B updates/sec (peak)
Phase-Transition Engine (next):
┌────────────────────────────────────┐
│ Herd of 10,000 nodes │
│ Currently: 10,000 individual sims │
│ With hull encoding: 1 block move │
│ Theoretical: O(√N) vs O(N) │
└────────────────────────────────────┘
Same hardware → 100× more nodes
Applications:
• Predator-prey ecosystems
• Power grid simulation
• Drug metabolism (liver digital twin)
• Any spatial lattice with local compute
The ceiling is not the hardware. It is the algorithm. When ten thousand nodes move as a block, we compute one hull translation instead of ten thousand individual decisions. One hundred times speedup. Same hardware. The savanna was the proof of concept. The engine runs any local computation on any colourable graph at fifteen point eight billion cell updates per second.
8 / 10
MORTON Z-CURVE — 2× Faster at Scale
Grid      Cells   Row-Major   Morton      Speedup
4096²     16M     8.01 ms     7.78 ms     +3%
8192²     64M     33.69 ms    29.77 ms    +12%
16384²    256M    206.90 ms   126.88 ms   +39%
32768²    1B      1,102 ms    523 ms      2.11×
GCUPS scaling (10 runs, 95% CI):
Row-Major: 14.7 → 13.9 → 9.1 → 6.8    (collapses at scale)
Morton:    15.1 → 15.8 → 14.8 → 14.4  (flat across 3 orders)
Same Metal kernels. Zero code changes.
The speedup comes from data layout alone.
Morton Z-curve ordering rearranges memory so that spatial neighbours stay close in address space. At one billion cells it is two point one times faster than row-major layout. The Metal kernels are identical, zero code changes; the speedup comes entirely from data layout. Row-major throughput collapses from fourteen point seven to six point eight GCUPS as the grid grows. Morton stays flat at fourteen point four.
9 / 10
HOW TO RUN
1. Clone and build
git clone https://github.com/norayr-m/savanna-engine.git
cd savanna-engine
swift build -c release
2. Run the benchmark
swift run -c release savanna-bench
3. Run the simulation
swift run -c release savanna-cli
4. Options
swift run -c release savanna-cli --grid 2048 # 4M cells
swift run -c release savanna-cli --ram 1 # 1GB (fits 8GB Mac)
swift run -c release savanna-cli --bench # no file I/O
swift run -c release savanna-cli --ticks 1000 # run 1000 then stop
Requirements
• macOS with Apple Silicon (M1 or later)
• Swift 5.9+
• No Xcode needed — command line tools sufficient
Memory
• Simulation: ~23 MB (any Mac)
• Ring buffer: configurable via --ram flag
• --ram 1 → 8 GB Mac
• --ram 4 → 16 GB Mac (default)
• --ram 50 → 128 GB Mac