GPU Graph Compute at 15.8 Billion Cell-Updates Per Second
1,048,576 hexagonal cells
Each cell = one node in a graph
■ Grass grows ■ Zebras graze
■ Lions hunt ■ Water attracts
Simulated 1,722× per second
Displayed at 60-120 fps (screen refresh)
On one Apple M5 Max GPU
A million hexagonal cells. A living ecosystem simulated seventeen hundred and twenty-two times per second on a single GPU. Displayed at your screen's refresh rate. The biology is just the test workload; the product is an ultra-scale spatial lattice engine.
1 / 10
THE STATE TENSOR
Channel       Type     Size
entity        Int8     1 byte
energy        Int16    2 bytes
ternary       Int8     1 byte
age           Int16    2 bytes
orientation   Int8     1 byte

Per node: 7 bytes
× 1M nodes: 7 MB
+ 4 scent fields (Float32): 16 MB
Total state: 23 MB
At 1,722 tps: 40 GB/s
Five channels per node. Seven bytes per cell. Times one million. Plus four scent diffusion fields. Twenty-three megabytes of state, mutated seventeen hundred and twenty-two times per second. Forty gigabytes per second of state throughput.
2 / 10
GPU PIPELINE — 13 Kernels Per Tick
4× Scent Diffusion (1M cells each)
zebra, grass, lion, water
7× Entity Update (150K cells each)
one per colour group
sense → compute → emit
1× Grass Growth (1M cells)
1× Census (1M cells)
Total: 7M effective cell-updates per tick
Peak at 64M cells: 15.8B cell-updates/sec
Thirteen Metal kernel launches per tick. Four scent diffusions, each touching all million cells. Seven entity updates, one per colour group. Plus grass growth and census. Seven million effective cell updates per tick. Peak throughput at sixty-four million cells: fifteen point eight billion cell updates per second.
3 / 10
7-COLOURING — Lock-Free Parallelism
Problem: nodes MOVE.
Two threads writing the same cell = silent data loss.
Solution: colour the graph so no two
same-coloured nodes share ANY neighbour.
[diagram: hex lattice tiled with colours 1–7, each row's pattern offset so that no two same-coloured cells fall within two steps of each other]
Formula: (col + row + 4×(col&1)) mod 7
Molloy & Salavatipour, 2005
7 is the minimum, verified exhaustively.
lock-free · atomic-free · distance-2 safe
When nodes move, they write into a neighbour cell. If two GPU threads write the same cell simultaneously, data is silently destroyed. The seven-colouring guarantees that no two same-coloured cells share any common neighbour. Within each colour group, all one hundred and fifty thousand cells run in parallel. No locks. No atomics. Pure data parallelism.
4 / 10
PERFORMANCE — Where Time Goes
Component            Time
GPU compute (sim)    0.6 ms     (fast)
GPU render           0.2 ms     (fast)
vsync wait           15.6 ms    (idle)
Total per frame      16.6 ms    = 60 fps
The GPU is idle 95% of the time.
Waiting for the screen to refresh.
Without vsync: ~1,722 tps
= 15.8 BILLION cell-updates/sec
The GPU computes one million cells in zero point five eight milliseconds. Then renders in zero point two. Then waits fifteen milliseconds for the screen. The GPU is idle ninety-five percent of the time. Without vsync: seventeen hundred and twenty-two ticks per second. Peak at sixty-four million cells: fifteen point eight billion cell updates per second. On one chip.
5 / 10
SCALING — How Big Can We Go?
Grid              ms/tick   TPS     Memory
1M (1K×1K)        0.58      1,722   23 MB
4M (2K×2K)        1.98      504     92 MB
16M (4K×4K)       7.78      128     368 MB
64M (8K×8K)       29.8      33      1.5 GB
256M (16K×16K)    126.9     7.9     5.9 GB
1B (32K×32K)      522.8     1.9     23 GB
M5 Max: 128 GB unified memory
Theoretical max: ~5 billion cells
Real-time (>1 tps): up to 1 billion
Metal compute scales linearly. Double the cells, double the time. Sixteen million cells: one hundred twenty-eight ticks per second. Sixty-four million: thirty-three. One billion cells: nearly two ticks per second. The M5 Max has one hundred twenty-eight gigabytes of unified memory. Real-time up to one billion cells.
6 / 10
TWO PIPELINES
HTML (remote — phone, AirPlay)
GPU → CPU → File → HTTP → JS → Canvas   (~25 ms)
Metal Native (local)
GPU → Screen   (~1 ms)
HTML goes through six hops: GPU, CPU, file, HTTP, JavaScript, canvas. Twenty-five milliseconds. Metal native: GPU to screen. One millisecond. Twenty-five times lower latency. Both run simultaneously.
7 / 10
THE CEILING IS THE ALGORITHM
Today: 13 kernels per tick across 64M cells → 15.8B updates/sec (peak)
Phase-Transition Engine (next):
┌────────────────────────────────────┐
│ Herd of 10,000 nodes │
│ Currently: 10,000 individual sims │
│ With hull encoding: 1 block move │
│ Theoretical: O(√N) vs O(N) │
└────────────────────────────────────┘
Same hardware → 100× more nodes
Applications:
• Predator-prey ecosystems
• Power grid simulation
• Drug metabolism (liver digital twin)
• Any spatial lattice with local compute
The ceiling is not the hardware. It is the algorithm. When ten thousand nodes move as a block, we compute one hull translation instead of ten thousand individual decisions. One hundred times speedup. Same hardware. The savanna was the proof of concept. The engine runs any local computation on any colourable graph at fifteen point eight billion cell updates per second.
8 / 10
MORTON Z-CURVE — 2× Faster at Scale
Grid      Cells   Row-Major   Morton      Speedup
4096²     16M     8.01 ms     7.78 ms     +3%
8192²     64M     33.69 ms    29.77 ms    +12%
16384²    256M    206.90 ms   126.88 ms   +39%
32768²    1B      1,102 ms    523 ms      2.11×
GCUPS scaling (10 runs, 95% CI):
Row-Major: 14.7 → 13.9 → 9.1 → 6.8    (collapses at scale)
Morton:    15.1 → 15.8 → 14.8 → 14.4  (flat across 3 orders)
Same Metal kernels. Zero code changes.
The speedup comes from data layout alone.
Morton Z-curve ordering rearranges memory so that spatial neighbours stay close in address space. At one billion cells it is two point one times faster than row-major layout. The Metal kernels are identical, zero code changes; the speedup comes entirely from data layout. Row-major throughput collapses from fourteen point seven to six point eight GCUPS as the grid grows. Morton stays flat at fourteen point four.
9 / 10
HOW TO RUN
1. Clone and build
git clone https://github.com/norayr-m/savanna-engine.git
cd savanna-engine
swift build -c release
2. Run the benchmark
swift run -c release savanna-bench
3. Run the simulation
swift run -c release savanna-cli
4. Options
swift run -c release savanna-cli --grid 2048 # 4M cells
swift run -c release savanna-cli --ram 1 # 1GB (fits 8GB Mac)
swift run -c release savanna-cli --bench # no file I/O
swift run -c release savanna-cli --ticks 1000 # run 1000 then stop
Requirements
• macOS with Apple Silicon (M1 or later)
• Swift 5.9+
• No Xcode needed — command line tools sufficient
Memory
• Simulation: ~23 MB (any Mac)
• Ring buffer: configurable via --ram flag
• --ram 1 → 8 GB Mac
• --ram 4 → 16 GB Mac (default)
• --ram 50 → 128 GB Mac