SAVANNA ENGINE

GPU Graph Compute at 15.8 Billion Cell-Updates Per Second

1,048,576 hexagonal cells
Each cell = one node in a graph
Grass grows • Zebras graze • Lions hunt • Water attracts
Simulated 1,722× per second
Displayed at 60-120 fps (screen refresh)
On one Apple M5 Max GPU
A million hexagonal cells. A living ecosystem simulated seventeen hundred and twenty-two times per second on a single GPU. Displayed at your screen's refresh rate. The biology is the test workload. The engine is an ultra-scale spatial lattice engine.

THE STATE TENSOR

Channel                       Type      Size
entity                        Int8      1 byte
energy                        Int16     2 bytes
ternary                       Int8      1 byte
age                           Int16     2 bytes
orientation                   Int8      1 byte
Per node                                7 bytes
× 1M nodes                              7 MB
+ 4 scent fields (Float32)              16 MB
Total state                             23 MB
At 1,722 tps                            40 GB/s
Five channels per node. Seven bytes per cell. Times one million. Plus four scent diffusion fields. Twenty-three megabytes of state, mutated seventeen hundred and twenty-two times per second. Forty gigabytes per second of state throughput.
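The arithmetic behind those figures can be sketched in a few lines of Python. This is an illustration, not the engine's Metal code: the constant and function names are hypothetical, and it assumes every byte of state is touched once per tick. With a 1024 × 1024 grid the total comes to exactly 23 MiB, and the throughput lands near the slide's rounded 40 GB/s.

```python
# Channel widths in bytes, as listed on the slide.
CHANNELS = {"entity": 1, "energy": 2, "ternary": 1, "age": 2, "orientation": 1}

NODES = 1_048_576        # 1024 x 1024 cells
SCENT_FIELDS = 4         # four Float32 diffusion fields
FLOAT32_BYTES = 4
TICKS_PER_SECOND = 1_722


def state_bytes(nodes: int = NODES) -> int:
    """Total bytes of simulation state for a grid of `nodes` cells."""
    per_node = sum(CHANNELS.values())  # 7 bytes of packed channels
    return nodes * (per_node + SCENT_FIELDS * FLOAT32_BYTES)


def state_throughput_gb_per_s(nodes: int = NODES, tps: int = TICKS_PER_SECOND) -> float:
    """State bytes touched per second in decimal GB (assumes one full pass per tick)."""
    return state_bytes(nodes) * tps / 1e9
```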

GPU PIPELINE — 13 Kernels Per Tick

4× Scent Diffusion (1M cells each): zebra, grass, lion, water
7× Entity Update (150K cells each): one per colour group, sense → compute → emit
1× Grass Growth (1M cells)
1× Census (1M cells)
Total: 7M effective cell-updates per tick
Peak at 64M cells: 15.8B cell-updates/sec
Thirteen Metal kernel launches per tick. Four scent diffusions, each touching all million cells. Seven entity updates, one per colour group. Plus grass growth and census. Seven million effective cell updates per tick. Peak throughput at sixty-four million cells: fifteen point eight billion cell updates per second.
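A small Python tally shows how the thirteen launches add up to roughly seven full-grid passes. The kernel names are hypothetical labels, the launch order within a tick is not stated on the slide and is not implied here, and the throughput helper assumes seven effective passes per tick at every grid size.

```python
NODES = 1_048_576
COLOUR_GROUPS = 7

# (hypothetical kernel label, launches per tick, cells touched per launch)
KERNELS = [
    ("scent_diffusion", 4, NODES),
    ("entity_update", COLOUR_GROUPS, NODES // COLOUR_GROUPS),  # ~150K cells per group
    ("grass_growth", 1, NODES),
    ("census", 1, NODES),
]


def launches_per_tick(kernels=KERNELS) -> int:
    """Number of kernel launches in one simulation tick."""
    return sum(count for _, count, _ in kernels)


def cell_updates_per_tick(kernels=KERNELS) -> int:
    """Cells touched in one tick, summed over every launch."""
    return sum(count * cells for _, count, cells in kernels)


def cell_updates_per_second(cells: int, ms_per_tick: float) -> float:
    """Throughput for a grid of `cells`, assuming 7 effective passes per tick."""
    return 7 * cells / (ms_per_tick / 1000.0)
```

Feeding in the 64M grid (8192 × 8192 cells at 29.77 ms per tick) gives about 15.8 billion cell-updates per second, which is the peak figure quoted on the slide.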

7-COLOURING — Lock-Free Parallelism

Problem: nodes MOVE. Two threads writing same cell = silent data loss.
Solution: colour the graph so no two same-coloured nodes share ANY neighbour.

⬡₁ ⬡₂ ⬡₃ ⬡₄ ⬡₅ ⬡₆ ⬡₇ ⬡₁ ⬡₂ ⬡₃
  ⬡₅ ⬡₆ ⬡₇ ⬡₁ ⬡₂ ⬡₃ ⬡₄ ⬡₅ ⬡₆
⬡₂ ⬡₃ ⬡₄ ⬡₅ ⬡₆ ⬡₇ ⬡₁ ⬡₂ ⬡₃ ⬡₄

Formula: (col + row + 4×(col&1)) mod 7
Molloy & Salavatipour, 2005
7 is the minimum. Exhaustive-verified.

lock-free • atomic-free • distance-2 safe

When nodes move, they write to a neighbour cell. If two GPU threads write the same cell simultaneously, data is silently destroyed. The seven colouring guarantees no two same-coloured cells share any common neighbour. Within each group, all one hundred and fifty thousand cells run in parallel. No locks. No atomics. Pure data parallelism.
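The guarantee can be checked by brute force. The Python sketch below applies the slide's formula and confirms that every cell and its six neighbours carry seven distinct colours, which is what rules out two same-coloured cells being adjacent or sharing a neighbour. The neighbour table is an assumption: it uses an "odd-q" offset layout (odd columns shifted down half a cell), which the slide does not spell out. The function names are illustrative, and the grid is treated as an unbounded lattice so edge cells are checked too.

```python
def colour(col: int, row: int) -> int:
    """Colour group 0..6, using the formula from the slide."""
    return (col + row + 4 * (col & 1)) % 7


def neighbours(col: int, row: int):
    """Six hex neighbours, ASSUMING an odd-q offset layout (odd columns shifted down)."""
    shift = col & 1  # 0 for even columns, 1 for odd columns
    return [
        (col, row - 1), (col, row + 1),
        (col - 1, row - 1 + shift), (col - 1, row + shift),
        (col + 1, row - 1 + shift), (col + 1, row + shift),
    ]


def is_distance2_safe(width: int, height: int) -> bool:
    """True if each cell plus its six neighbours use seven distinct colours."""
    for col in range(width):
        for row in range(height):
            seen = {colour(col, row)}
            seen.update(colour(c, r) for c, r in neighbours(col, row))
            if len(seen) != 7:
                return False
    return True
```

Calling `is_distance2_safe(64, 64)` returns `True` under this layout assumption. With the opposite offset convention (even columns shifted down) the same formula produces collisions, so the layout matters.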

PERFORMANCE — Where Time Goes

Component            Time
GPU compute (sim)    0.6 ms     fast
GPU render           0.2 ms     fast
vsync wait           15.6 ms    idle
Total per frame      16.6 ms    = 60 fps
The GPU is idle 95% of the time, waiting for the screen to refresh.
Without vsync: ~1,722 tps at 1M cells
Peak at 64M cells: 15.8 BILLION cell-updates/sec
The GPU computes one million cells in zero point five eight milliseconds. Then renders in zero point two. Then waits fifteen milliseconds for the screen. The GPU is idle ninety-five percent of the time. Without vsync: seventeen hundred and twenty-two ticks per second. Peak at sixty-four million cells: fifteen point eight billion cell updates per second. On one chip.
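The idle share and the uncapped tick rate follow directly from the frame budget. A minimal Python sketch, with hypothetical function names, assuming the GPU does nothing but simulate and render within each frame:

```python
def idle_fraction(compute_ms: float, render_ms: float, frame_ms: float) -> float:
    """Share of each frame the GPU spends waiting rather than working."""
    return 1.0 - (compute_ms + render_ms) / frame_ms


def uncapped_tps(ms_per_tick: float) -> float:
    """Ticks per second if the simulation never waits for vsync."""
    return 1000.0 / ms_per_tick
```

With the slide's numbers, `idle_fraction(0.6, 0.2, 16.6)` is about 0.95, and `uncapped_tps(0.58)` is a little over 1,700 ticks per second.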

SCALING — How Big Can We Go?

Grid              ms/tick    TPS      Memory
1M (1K×1K)        0.58       1,722    23 MB
4M (2K×2K)        1.98       504      92 MB
16M (4K×4K)       7.78       128      368 MB
64M (8K×8K)       29.8       33       1.5 GB
256M (16K×16K)    126.9      7.9      5.9 GB
1B (32K×32K)      522.8      1.9      23 GB
M5 Max: 128 GB unified memory
Theoretical max: ~5 billion cells
Real-time (>1 tps): up to 1 billion
Metal compute scales linearly. Double the cells, double the time. Sixteen million cells: one hundred twenty-eight ticks per second. Sixty-four million: thirty-three. One billion cells: nearly two ticks per second. The M5 Max has one hundred twenty-eight gigabytes of unified memory. Real-time up to one billion cells.
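One way to see the linear scaling is to divide each tick time by the cell count: if the cost per cell stays roughly constant, time grows in proportion to grid size. The Python sketch below does that with the table's figures and also estimates the memory ceiling. The names are illustrative, and the ceiling assumes 23 bytes per cell with nothing else resident in memory.

```python
# (cells, ms per tick) taken from the scaling table.
MEASUREMENTS = [
    (1 << 20, 0.58), (1 << 22, 1.98), (1 << 24, 7.78),
    (1 << 26, 29.8), (1 << 28, 126.9), (1 << 30, 522.8),
]
BYTES_PER_CELL = 23  # 7 bytes of channels + 4 Float32 scent values


def ns_per_cell(cells: int, ms_per_tick: float) -> float:
    """Average time spent on one cell during one tick, in nanoseconds."""
    return ms_per_tick * 1e6 / cells


def max_cells(memory_bytes: int, bytes_per_cell: int = BYTES_PER_CELL) -> int:
    """Largest grid whose state fits in `memory_bytes` (state only, no overhead)."""
    return memory_bytes // bytes_per_cell
```

Every row of the table works out to between 0.4 and 0.6 ns per cell per tick, and `max_cells(128 * 10**9)` comes to a little over 5.5 billion cells, consistent with the "~5 billion" ceiling once some memory is left for everything else.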

TWO PIPELINES

HTML (remote — phone, AirPlay)

GPU → CPU → File → HTTP → JS → Canvas    ~25 ms

Metal Native (local — zero latency)

GPU → Screen    ~1 ms
HTML goes through six hops: GPU, CPU, file, HTTP, JavaScript, canvas. Twenty-five milliseconds. Metal native: GPU to screen. One millisecond. Twenty-five times less latency. Both run simultaneously.

THE CEILING IS THE ALGORITHM

Today: 13 kernels × 64M cells = 15.8B updates/sec (peak)

Phase-Transition Engine (next):
┌────────────────────────────────────┐
│ Herd of 10,000 nodes               │
│ Currently: 10,000 individual sims  │
│ With hull encoding: 1 block move   │
│ Theoretical: O(√N) vs O(N)         │
└────────────────────────────────────┘
Same hardware → 100× more nodes

Applications:
• Predator-prey ecosystems
• Power grid simulation
• Drug metabolism (liver digital twin)
• Any spatial lattice with local compute
The ceiling is not the hardware. It is the algorithm. When ten thousand nodes move as a block, we compute one hull translation instead of ten thousand individual decisions. One hundred times speedup. Same hardware. The savanna was the proof of concept. The engine runs any local computation on any colourable graph at fifteen point eight billion cell updates per second.

MORTON Z-CURVE — 2× Faster at Scale

Grid      Cells    Row-Major     Morton       Speedup
4096²     16M      8.01 ms       7.78 ms      +3%
8192²     64M      33.69 ms      29.77 ms     +12%
16384²    256M     206.90 ms     126.88 ms    +39%
32768²    1B       1,102 ms      523 ms       2.11×
GCUPS (billions of cell-updates per second) scaling, 16M → 64M → 256M → 1B cells (10 runs, 95% CI):
Row-Major: 14.7 → 13.9 → 9.1 → 6.8      collapses at scale
Morton:    15.1 → 15.8 → 14.8 → 14.4    flat from 16M to 1B cells
Same Metal kernels. Zero code changes.
The speedup comes from data layout alone.

Full comparison with charts →

A Morton Z-curve reorders memory so that spatial neighbours sit close together. At one billion cells, it is two point one times faster than row-major layout. The Metal kernels are identical, with zero code changes. The speedup comes entirely from data layout. Row-major throughput collapses from fourteen point seven to six point eight GCUPS as the grid grows from sixteen million to one billion cells. Morton stays nearly flat, between fourteen point four and fifteen point eight. See the full comparison with interactive charts.
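For readers unfamiliar with the layout, a Morton index is formed by interleaving the bits of the two grid coordinates. The Python sketch below uses the standard bit-spreading trick for coordinates up to 16 bits, which covers the 32768 × 32768 grid. The function names are illustrative, and which coordinate takes the even bits is an assumption here. The engine's own indexing code is not shown in this deck.

```python
def part1by1(v: int) -> int:
    """Spread the low 16 bits of v so a zero bit sits between each original bit."""
    v &= 0x0000FFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v


def morton_index(col: int, row: int) -> int:
    """Z-curve index: col bits land on even positions, row bits on odd positions."""
    return part1by1(col) | (part1by1(row) << 1)


def row_major_index(col: int, row: int, width: int) -> int:
    """Conventional layout, shown for comparison."""
    return row * width + col
```

In row-major order, the cell directly below `(0, 0)` sits `width` slots away. In Morton order, the four cells of each aligned 2 × 2 block occupy four consecutive slots, and the same holds recursively for 4 × 4 blocks, 8 × 8 blocks, and so on, which is why neighbouring cells tend to share cache lines.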

HOW TO RUN

1. Clone and build
   git clone https://github.com/norayr-m/savanna-engine.git
   cd savanna-engine
   swift build -c release

2. Run the benchmark
   swift run -c release savanna-bench

3. Run the simulation
   swift run -c release savanna-cli

4. Options
   swift run -c release savanna-cli --grid 2048     # 4M cells
   swift run -c release savanna-cli --ram 1         # 1GB (fits 8GB Mac)
   swift run -c release savanna-cli --bench         # no file I/O
   swift run -c release savanna-cli --ticks 1000    # run 1000 then stop
Requirements
• macOS with Apple Silicon (M1 or later)
• Swift 5.9+
• No Xcode needed; command line tools are sufficient

Memory
• Simulation: ~23 MB (any Mac)
• Ring buffer: configurable via --ram flag
• --ram 1 → 8 GB Mac
• --ram 4 → 16 GB Mac (default)
• --ram 50 → 128 GB Mac
Built by Norayr Matevosyan
AI: Claude Opus 4.6 (Anthropic) · Gemini Deep Think (Google)
7-colouring: Molloy & Salavatipour, 2005