Leaderboard
Canonical benchmark tables across G5 and Lambda A100/H100, plus historical run context.
Lower time is better.
Canonical Cross-Hardware Snapshot (Paper Package, 2026-02-20)
Sources:
Phase 2 Cold/Hot
| Hardware | Cold Startup | Cold TTFT | Cold Full | Warm Mean | Warm p99 |
|---|---|---|---|---|---|
| g5 | 1002.273 ms | 8.460 ms | 150.035 ms | 80.602 ms | 90.350 ms |
| lambda_a100 | 1002.708 ms | 29.657 ms | 32.008 ms | 10.356 ms | 14.536 ms |
| lambda_h100 | 1004.890 ms | 56.944 ms | 62.064 ms | 18.491 ms | 24.944 ms |
Routing Matrix
| System | Overall Ext/Int | Baseline p00 Ext/Int | Stress p04 Ext/Int | Overall Int Error | Overall Ext Error |
|---|---|---|---|---|---|
| g5 | 1.208x | 1.042x | 1.436x | 0.0000 | 0.0347 |
| lambda_a100 | 2.430x | 1.396x | 3.785x | 0.0000 | 0.0347 |
| lambda_h100 | 2.397x | 1.533x | 3.545x | 0.0000 | 0.0347 |
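The ratio and error columns above can be reproduced from raw run records. A minimal sketch, assuming per-request records with `latency_ms` and `ok` fields (illustrative names, not the actual harness schema):

```python
# Sketch: deriving an Ext/Int latency ratio and error rates from
# per-request run records. Field names are illustrative only.

def summarize(records):
    """records: list of dicts with 'latency_ms' (float) and 'ok' (bool)."""
    latencies = [r["latency_ms"] for r in records]
    mean_latency = sum(latencies) / len(latencies)
    error_rate = sum(1 for r in records if not r["ok"]) / len(records)
    return mean_latency, error_rate

internal = [{"latency_ms": 80.0, "ok": True}, {"latency_ms": 82.0, "ok": True}]
external = [{"latency_ms": 96.0, "ok": True}, {"latency_ms": 100.0, "ok": False}]

int_mean, int_err = summarize(internal)
ext_mean, ext_err = summarize(external)
ext_int_ratio = ext_mean / int_mean  # reported in the table as e.g. "1.21x"
```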
Phase 3 Loops (Baseline/Stress)
| System | Baseline Internal Success | Baseline External Success | Baseline Ext/Int Latency | Stress Internal Success | Stress External Success | Stress Ext/Int Latency |
|---|---|---|---|---|---|---|
| g5 | 1.0000 | 0.9006 | 16.060x | 1.0000 | 0.8782 | 77.170x |
| lambda_a100 | 1.0000 | 0.9006 | 16.879x | 1.0000 | 0.8782 | 77.861x |
| lambda_h100 | 1.0000 | 0.9006 | 18.693x | 1.0000 | 0.8782 | 72.741x |
C2 Runtime-Native Uncertainty Deltas (Uncertainty On-Off Success Delta)
| System | Baseline Internal Delta | Baseline External Delta | Stress Internal Delta | Stress External Delta |
|---|---|---|---|---|
| g5 | +0.1539 | +0.1058 | +0.1539 | +0.1154 |
| lambda_a100 | +0.2308 | +0.2308 | +0.2308 | +0.2212 |
| lambda_h100 | +0.2308 | +0.2308 | +0.2308 | +0.2212 |
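Each delta above is the success-rate difference between the uncertainty-on and uncertainty-off arms of the same lane. A minimal sketch with illustrative outcome vectors (not data from any listed run):

```python
# Sketch: the on-off delta is the success-rate difference between the
# uncertainty-on and uncertainty-off arms of the same lane.
# Outcome vectors below are illustrative, not from any listed run.

def success_rate(outcomes):
    # outcomes: 1 = agentic loop converged successfully, 0 = failed
    return sum(outcomes) / len(outcomes)

on_arm = [1, 1, 1, 1, 0, 1, 1, 1]
off_arm = [1, 0, 1, 1, 0, 1, 0, 1]

delta = success_rate(on_arm) - success_rate(off_arm)  # +0.25 for these vectors
```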
Phase 5 Real-Benchmark (Diagnostic, G5, 2026-03-01)
Source artifacts:
- `phase5_awareness_realbench_qwen-realbench-r5-tokenizerfix2_20260301T114510Z.json`
- `phase5_awareness_realbench_qwen-realbench-r6-qwentpl1_20260301T120235Z.json`
- `phase5_hf_reference_qwen_r5_20260301T1900Z.json`
Arm A (Control) Score Means
| Run | GPQA | IFEval | GSM8K | AIME25 |
|---|---|---|---|---|
| r5 tokenizerfix2 (canonical diagnostic) | 0.5000 | 0.5625 | 0.0000 | 0.0000 |
| r6 qwentpl1 (template A/B, non-canonical) | 0.1250 | 0.3125 | 0.0000 | 0.0000 |
Runtime (r5 Arm A) vs HF Reference (Same Sampled Set)
| System | GPQA | IFEval | GSM8K | AIME25 |
|---|---|---|---|---|
| Runtime r5 Arm A | 0.5000 | 0.5625 | 0.0000 | 0.0000 |
| HF reference control | 0.2500 | 0.6250 | 0.0000 | 0.0000 |
Interpretation:
- `r5` is the current Phase 5 diagnostic reference.
- `r6` was an explicit template-path experiment and regressed quality/latency; it is not used as canonical.
- HF parity result for this sampled set: runtime and HF are tied on GSM8K/AIME (`0.0`), so current math-task failures are not runtime-only breakage.
Qwen3.5 One-Host Strict Matrix (AWS G5, 2026-03-07)
Source artifacts:
- `phase5_qwen35_remote_strict_matrix_20260307T191653Z.json`
- `phase5_qwen35_remote_strict_matrix_20260307T191653Z.md`
Overall / Per-Task (Arm A, gpqa_diamond+ifeval, seeds 7/17/27, 8/task)
| System | Overall Score | Overall Latency | GPQA | GPQA Latency | IFEval | IFEval Latency |
|---|---|---|---|---|---|---|
| Runtime | 0.3333 | 3809.745 ms | 0.2917 | 2867.493 ms | 0.3750 | 4751.996 ms |
| vLLM | 0.3160 | 1626.068 ms | 0.2917 | 418.173 ms | 0.3403 | 2833.964 ms |
Interpretation:
- This is the cleanest current one-host strict A/B on Qwen3.5-0.8B.
- Runtime is no longer behind on aggregate score on this set.
- Runtime is still far slower overall, so this is not a claim of universal superiority.
- The result remains task-stratified: `gpqa_diamond` score is now at parity but runtime latency is still far worse, and `ifeval` score is higher for runtime but latency is still worse there too.
Qwen3.5 Nightly vLLM Diagnostic (G5, 2026-03-02)
Source artifacts:
- `phase5_awareness_realbench_qwen35-realbench-r1-s8-nonthinking_20260302T184159Z.json`
- `phase5_awareness_realbench_qwen35-realbench-r2-policyfix1-s8-nonthinking_20260302T184624Z.json`
- `phase5_awareness_realbench_qwen35-realbench-r3-sharedfirst-s8-nonthinking_20260302T184947Z.json`
| Run | GPQA A/B/C | IFEval A/B/C | GSM8K A/B/C | AIME25 A/B/C |
|---|---|---|---|---|
| r1 (first nightly run) | 0.500 / 0.250 / 0.625 | 0.5625 / 0.4375 / 0.5000 | 0.000 / 0.000 / 0.000 | 0.000 / 0.000 / 0.000 |
| r2 (conservative policy) | 0.375 / 0.250 / 0.250 | 0.3125 / 0.5000 / 0.3750 | 0.000 / 0.000 / 0.000 | 0.000 / 0.125 / 0.000 |
| r3 (shared-first fairness fix) | 0.375 / 0.375 / 0.375 | 0.3125 / 0.3125 / 0.3125 | 0.000 / 0.000 / 0.000 | 0.000 / 0.000 / 0.000 |
Interpretation:
- `r3` removes arm-to-arm sampling noise (all arms share the same first completion).
- Post-fix state is no-regression parity (`B-A=0`, `C-A=0`) rather than uplift.
Qwen3.5 Strict Runtime vs vLLM Matrix (G5, 2026-03-03)
Source artifacts:
- `phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.json`
- `phase5_qwen35_runtime_vs_vllm_matrix_20260302T221546Z.json`
- `phase5_qwen35_runtime_vs_vllm_matrix_20260302T222013Z.json`
| Run | Matrix mode | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|---|
| 20260302T222013Z | strict (all arms validated) | 0.0503 | 0.2170 | -0.1667 | 1881.188 ms | 178.093 ms | +1703.095 ms |
| 20260303T104038Z | strict Arm A-only (`--phase5-arms arm_a_control`) | 0.1563 | 0.1910 | -0.0347 | 1723.685 ms | 958.757 ms | +764.928 ms |
Latest run (20260303T104038Z) per-task score deltas (runtime-vLLM):
- `gpqa_diamond`: -0.0833
- `ifeval`: -0.0972
- `gsm8k`: +0.0417
- `aime25`: 0.0000
Interpretation:
- Strict matrix is no longer blocked and is now fully reproducible.
- Decoder-path fixes narrowed the quality gap materially, but runtime is still slower overall and still slightly behind on aggregate score.
- Top target remains request-path latency recovery on Qwen3.5 while preserving this improved score parity.
Qwen3.5 Parse-Fix AB3 (GPQA+IFEval, G5, 2026-03-04)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.3403 | 0.3229 | +0.0174 | 1772.931 ms | 1553.034 ms | +219.897 ms |
| gpqa_diamond | 0.2708 | 0.2708 | 0.0000 | 2319.086 ms | 437.310 ms | +1881.776 ms |
| ifeval | 0.4097 | 0.3750 | +0.0347 | 1226.775 ms | 2668.758 ms | -1441.983 ms |
Interpretation:
- This run is task-family stratified, not a universal win.
- Runtime currently wins on `ifeval` (quality + latency) but remains far slower on `gpqa_diamond`.
Qwen3.5 Hybrid Full-Batch AB3 (GPQA+IFEval, AWS G5, 2026-03-08)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.4132 | 0.3472 | +0.0660 | 2940.172 ms | 1686.263 ms | +1253.909 ms |
| gpqa_diamond | 0.4583 | 0.2083 | +0.2500 | 1347.582 ms | 512.075 ms | +835.507 ms |
| ifeval | 0.3681 | 0.4861 | -0.1181 | 4532.763 ms | 2860.452 ms | +1672.311 ms |
Interpretation:
- This is the first late AWS Qwen3.5 strict AB3 where runtime clearly leads overall on score.
- The runtime is still slower on latency across both tasks.
- GPQA is now a strong runtime-quality win; the next blocker is warm decode/request-path latency and the remaining IFEval score deficit.
Qwen3.5 Deterministic Strict Matrix (GPQA+IFEval, AWS G5, 2026-03-08)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.2951 | 0.2674 | +0.0278 | 824.714 ms | 1572.529 ms | -747.815 ms |
| gpqa_diamond | 0.1667 | 0.1667 | 0.0000 | 671.640 ms | 436.583 ms | +235.057 ms |
| ifeval | 0.4236 | 0.3681 | +0.0556 | 977.787 ms | 2708.475 ms | -1730.688 ms |
Interpretation:
- This is the current claim-safe Qwen3.5 strict lane.
- Runtime wins overall on both score and latency.
- `gpqa_diamond` is now exact parity on score, while `ifeval` is a runtime win on both score and latency.
- Sampled-lane reproducibility is now fixed separately; this deterministic lane remains the cleanest low-variance claim-safe slice.
Qwen3.5 Sampled Strict Matrix (GPQA+IFEval, AWS G5, 2026-03-08)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.4097 | 0.3021 | +0.1076 | 1617.187 ms | 2017.206 ms | -400.019 ms |
| gpqa_diamond | 0.3750 | 0.2500 | +0.1250 | 710.693 ms | 435.823 ms | +274.870 ms |
| ifeval | 0.4444 | 0.3542 | +0.0903 | 2523.680 ms | 3598.588 ms | -1074.908 ms |
Interpretation:
- This is the post-fix sampled strict AB3 lane after the harness seed bug was removed.
- Runtime now wins overall on both score and latency here as well.
- `gpqa_diamond` is still runtime-slower, but runtime is ahead on score in both tasks.
Qwen3.5 Sampled Strict Matrix (Larger-N, GPQA+IFEval, AWS G5, 2026-03-08)
Source artifact:
- `phase5_qwen35_remote_strict_matrix_20260308T222640Z.json`
- `phase5_qwen35_remote_strict_matrix_20260308T235013Z.json`
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval, 16/task) | 0.3715 | 0.2969 | +0.0747 | 1255.344 ms | 1585.043 ms | -329.699 ms |
| gpqa_diamond | 0.3750 | 0.3125 | +0.0625 | 801.900 ms | 433.256 ms | +368.644 ms |
| ifeval | 0.3681 | 0.2813 | +0.0868 | 1708.789 ms | 2736.831 ms | -1028.043 ms |
Interpretation:
- This is the stronger non-thinking sampled strict result.
- Runtime still wins overall on both score and latency at larger sample count.
- The weakest remaining slice is still GPQA latency, not overall sampled quality.
Qwen3.5 Thinking Strict Matrix (Finalized Closed-Form Lane, AWS G5, 2026-03-09)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.2500 | 0.1944 | +0.0556 | 6823.816 ms | 7503.000 ms | -679.184 ms |
| gpqa_diamond | 0.1667 | 0.1667 | 0.0000 | 7727.880 ms | 7741.028 ms | -13.148 ms |
| ifeval | 0.3333 | 0.2222 | +0.1111 | 5919.753 ms | 7264.973 ms | -1345.220 ms |
Interpretation:
- This is the current best thinking tradeoff.
- The closed-form finalize path stays active, but a smaller GPQA reasoning budget removes the old latency collapse.
- Runtime now wins this finalized thinking lane overall on both score and latency.
Qwen3.5 Fast-Sampler Tie-Stable AB3 (GPQA+IFEval, AWS G5, 2026-03-08)
Source artifact:
| Scope | Runtime score | vLLM score | Delta (runtime-vLLM) | Runtime latency | vLLM latency | Delta (runtime-vLLM) |
|---|---|---|---|---|---|---|
| Overall (gpqa_diamond+ifeval) | 0.3160 | 0.3472 | -0.0313 | 1422.818 ms | 1659.878 ms | -237.060 ms |
| gpqa_diamond | 0.2917 | 0.2083 | +0.0833 | 886.296 ms | 515.171 ms | +371.125 ms |
| ifeval | 0.3403 | 0.4861 | -0.1458 | 1959.340 ms | 2804.584 ms | -845.244 ms |
Interpretation:
- This is the first clean late AWS Qwen3.5 strict AB3 where runtime is faster overall than vLLM.
- The remaining blocker is score recovery, not another large latency rescue.
- GPQA stays runtime-positive on score; the main open quality deficit is `ifeval` plus residual GPQA fidelity loss versus the older, slower sampler lane.
Qwen3.5 Endpoint Probe Matrix (Warm, Extended, G5, 2026-03-06)
Source artifact:
| Backend | Mode | all_ok | Plain Chat | Tool First Turn | Tool Follow-up | Notes |
|---|---|---|---|---|---|---|
| runtime | non-thinking | true | 387.672 ms | 5885.725 ms | 4406.168 ms | strongest current functional lane |
| runtime | thinking | true | 11113.612 ms | 13404.890 ms | 9349.855 ms | verbose reasoning-heavy outputs |
| vLLM | non-thinking | false | 112.543 ms | 1162.850 ms | 490.202 ms | multimodal blocked by `--language-model-only`; exact-output thinking case hit length |
| vLLM | thinking | false | 3564.340 ms | 3044.431 ms | 608.328 ms | several exact-output prompts end at `finish_reason=length` |
Interpretation:
- Runtime is now functionally solid on Qwen3.5 in `non-thinking` mode.
- The main remaining runtime gap is long-prompt/tool latency, not base chat correctness.
- Current vLLM launch is not a full multimodal parity configuration; failures here are partly launch-mode/config artifacts, not only model behavior.
G5 External Cold Backends (Canonical 3-Run Means)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response | TTFT/Runtime | Full/Runtime | Cold Total/Runtime |
|---|---|---|---|---|---|---|---|
| runtime | 1003.836 ms | 5.130 ms | 316.403 ms | 1320.240 ms | - | - | - |
| pytorch_transformers | - | 556.949 ms | 2322.565 ms | 19276.667 ms | 108.567x | 7.341x | 14.601x |
| vllm | 27704.770 ms | 84.837 ms | 1232.660 ms | 28937.430 ms | 16.537x | 3.896x | 21.918x |
| ollama (GGUF) | 1002.567 ms | 2638.945 ms | 2996.765 ms | 3999.332 ms | 514.414x | 9.471x | 3.029x |
G5 Foundation (Canonical)
| Metric | Value |
|---|---|
| Baseline pipeline mean | 2407.974 ms |
| Runtime warm request mean (3-run) | 82.707 ms |
| Runtime warm request p99 (3-run) | 91.738 ms |
| Baseline/runtime ratio (pipeline) | 29.11x |
G5 Cold First-Hit — True TTFT (3-run Means, 2026-02-17)
| Model | Early TTFT | clean4 TTFT | Speedup |
|---|---|---|---|
| qwen | 27574.564 ms | 1100.044 ms | 25.067x |
| donut | 67360.388 ms | 150.322 ms | 448.107x |
| bart | 77520.798 ms | 125.011 ms | 620.112x |
| minilm | 23.342 ms | 22.621 ms | 1.032x |
All values above are runtime-instrumented `timing.ttft_ms` values.
Validation rerun (clean7, 2026-02-18) matches these values within run noise.
Qwen Cold Upload Ablation (G5, 2026-02-19, Same Harness, Env Toggle Only)
| Metric | GPU Convert Off | GPU Convert On | Speedup (Off/On) |
|---|---|---|---|
| full_latency_ms | 1116.567 ms | 238.740 ms | 4.677x |
| decoder_tensor_upload_ms | 1007 ms | 129 ms | 7.806x |
| decoder_tensor_convert_ms | 862 ms | 6 ms | 143.667x |
| decoder_tensor_h2d_ms | 143 ms | 121 ms | 1.182x |
| startup + full response | 2119.906 ms | 1242.057 ms | 1.707x |
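The speedup column is simply the off-path value divided by the on-path value, so values above `1.0` mean GPU convert helped. A small sketch reproducing two rows of the table:

```python
# Sketch: "Speedup (Off/On)" is the GPU-convert-off value divided by the
# GPU-convert-on value; > 1.0 means enabling GPU convert helped.
# Numbers mirror two rows of the table above.

rows = {
    "full_latency_ms": (1116.567, 238.740),
    "decoder_tensor_convert_ms": (862.0, 6.0),
}

speedups = {name: off / on for name, (off, on) in rows.items()}
```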
Interpretation:
- CPU-side tensor conversion was the dominant cold bottleneck for Qwen.
- Moving BF16/F16 conversion to GPU materially reduced first-hit latency.
External Cold Runtime-Only Ablation (G5, 2026-02-19, Preload On, max_tokens=48)
| Metric | GPU Convert Off | GPU Convert On | Speedup (Off/On) |
|---|---|---|---|
| startup_to_healthy_ms | 2004.560 ms | 1003.455 ms | 1.997x |
| request_ttft_ms | 5.137 ms | 5.127 ms | 1.002x |
| request_full_ms | 317.989 ms | 317.276 ms | 1.002x |
| cold_total_first_token_ms | 2009.697 ms | 1008.582 ms | 1.993x |
| cold_total_first_response_ms | 2322.549 ms | 1320.731 ms | 1.759x |
AWS G5 TTFT Kernel Pass (2026-02-22, Same Host/Container)
Cold (run_mode=cold_first_hit):
| Config | Startup | TTFT | Full |
|---|---|---|---|
| lt0_sync0 baseline | 1002.203 ms | 16.738 ms | 424.685 ms |
| lt0_sync0 softmax pass | 1002.435 ms | 16.715 ms | 425.390 ms |
| lt0_sync0 norm+softmax | 1002.304 ms | 14.813 ms | 399.781 ms |
| lt1_sync0 norm+softmax | 1002.284 ms | 13.974 ms | 396.814 ms |
| lt1_sync0 seq1 tiny-kernel default | 1002.256 ms | 12.504 ms | 390.099 ms |
Warm (run_mode=warm_steady_state, 16 measured, 2 warmup):
| Config | Mean | p95 | p99 |
|---|---|---|---|
| lt0_sync0 baseline | 174.237 ms | 600.575 ms | 1035.823 ms |
| lt0_sync0 softmax pass | 174.504 ms | 601.855 ms | 1035.638 ms |
| lt0_sync0 norm+softmax | 147.495 ms | 501.527 ms | 936.281 ms |
| lt1_sync0 norm+softmax | 147.269 ms | 501.587 ms | 936.297 ms |
| lt1_sync0 seq1 tiny-kernel default | 143.230 ms | 490.296 ms | 924.276 ms |
Interpretation:
- softmax-only was near-parity.
- norm rewrite was the material gain.
- seq2seq/Bart follow-up (`seq_q=1` tiny-kernel path + direct K/V cache write) produced another measurable lift.
- best measured profile (`lt1_sync0` seq1 tiny-kernel default) vs baseline:
  - cold TTFT: `1.339x` faster (16.738 -> 12.504 ms)
  - cold full: `1.089x` faster (424.685 -> 390.099 ms)
  - warm mean: `1.216x` faster (174.237 -> 143.230 ms)
  - warm p99: `1.121x` faster (1035.823 -> 924.276 ms)
- follow-up profile vs previous `lt1_sync0 norm+softmax` best:
  - cold TTFT: `1.118x` faster
  - warm mean: `1.028x` faster
- Bart cold TTFT moved from `16.573 -> 12.842 ms` (`1.29x` faster).
- 3-seed repeatability (new default path):
  - cold TTFT `12.563 ± 0.037 ms`
  - cold full `390.961 ± 0.270 ms`
  - warm mean `143.297 ± 0.222 ms`
  - warm p99 `925.668 ± 1.070 ms`
- seq1 fused follow-up matrix (`seq1_hybrid_fused_20260222T192656Z`) further improved the request path:
  - warm default mean `54.505 -> 52.535 ms` (`1.037x`)
  - warm default p99 `82.134 -> 80.554 ms` (`1.020x`)
  - cold default TTFT `6.447 -> 6.209 ms` (`1.038x`)
  - cold default full `147.756 -> 145.587 ms` (`1.015x`)
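The `mean ± stdev` repeatability figures above can be sketched as follows. Whether the harness reports sample or population stdev is not stated here, so sample stdev is an assumption, and the seed values are illustrative:

```python
# Sketch: per-seed results reported as "mean ± stdev". Whether the
# harness uses sample or population stdev is not specified, so sample
# stdev is an assumption; the seed values below are illustrative.
from statistics import mean, stdev

seed_cold_ttft_ms = [12.52, 12.57, 12.60]

m = mean(seed_cold_ttft_ms)
s = stdev(seed_cold_ttft_ms)
summary = f"{m:.3f} ± {s:.3f} ms"
```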
AWS G5 True Fused cuDNN Frontend A/B (2026-02-22, fixed qwen, warmed)
Source:
| Metric | custom | cudnn_sdpa_frontend | custom/frontend |
|---|---|---|---|
| warm request mean | 19.324 ms | 21.503 ms | 0.899 |
| warm request p99 | 22.087 ms | 24.875 ms | 0.888 |
| warm infer | 18.803 ms | 20.976 ms | 0.896 |
| warm TTFT | 4.199 ms | 4.498 ms | 0.934 |
| cold TTFT | 4.220 ms | 710.641 ms | 0.006 |
| cold full | 250.929 ms | 6610.148 ms | 0.038 |
Interpretation:
- Warm steady-state is close but still favors custom.
- Cold first-hit is still dominated by fused plan-build misses.
AWS G5 Frontend Repeatability Matrix (2026-02-22, repeats=3)
Source:
| Profile | Metric | custom (mean +/- stdev) | fused_frontend (mean +/- stdev) | ratio custom/frontend | wins custom/frontend/tie |
|---|---|---|---|---|---|
| warm_fixed | warm_request_mean_ms | 19.271 +/- 0.050 ms | 21.468 +/- 0.018 ms | 0.898 +/- 0.003 | 3/0/0 |
| warm_fixed | warm_infer_ms | 18.812 +/- 0.059 ms | 20.984 +/- 0.026 ms | 0.896 +/- 0.004 | 3/0/0 |
| warm_fixed | warm_ttft_ms | 4.198 +/- 0.001 ms | 4.498 +/- 0.001 ms | 0.933 +/- 0.000 | 3/0/0 |
| mixed_churn | warm_request_mean_ms | 47.864 +/- 0.018 ms | 843.141 +/- 0.735 ms | 0.057 +/- 0.000 | 3/0/0 |
| mixed_churn | warm_infer_ms | 47.331 +/- 0.050 ms | 842.542 +/- 0.747 ms | 0.056 +/- 0.000 | 3/0/0 |
| mixed_churn | warm_ttft_ms | 4.197 +/- 0.002 ms | 179.744 +/- 0.263 ms | 0.023 +/- 0.000 | 3/0/0 |
Interpretation:
- Current custom path wins all tracked metrics across both profiles in repeated runs.
AWS G5 Frontend Claim-Strength (2026-02-22)
Source:
Delta is frontend - custom (positive means custom is faster):
| Profile | Metric | Delta Mean | Delta CI95 | Ratio Mean (custom/frontend) |
|---|---|---|---|---|
| warm_fixed | warm_request_mean_ms | +2.197 ms | [2.125, 2.238] | 0.898 |
| warm_fixed | warm_ttft_ms | +0.300 ms | [0.299, 0.301] | 0.933 |
| mixed_churn | warm_request_mean_ms | +795.277 ms | [794.408, 795.747] | 0.057 |
| mixed_churn | warm_ttft_ms | +175.546 ms | [175.300, 175.820] | 0.023 |
Interpretation:
- Current evidence is strongly directional for `custom > fused_frontend` on this hardware/workload.
- Repeat count is still low (`n=3`/profile), so high-N reruns remain desirable for paper-grade confidence.
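One common way to produce a CI95 for paired deltas like these is a percentile bootstrap over per-run means. The harness's actual CI construction is not documented in this table, so both the method and the delta values below are assumptions:

```python
# Sketch: percentile-bootstrap CI95 for paired latency deltas
# (frontend - custom). The harness's real CI method is not documented
# here; the method and the delta values are assumptions.
import random

def bootstrap_ci95(deltas, iters=10_000, seed=0):
    rng = random.Random(seed)
    boot_means = []
    for _ in range(iters):
        sample = [rng.choice(deltas) for _ in deltas]
        boot_means.append(sum(sample) / len(sample))
    boot_means.sort()
    return boot_means[int(0.025 * iters)], boot_means[int(0.975 * iters)]

per_run_deltas = [2.19, 2.21, 2.18]  # illustrative per-repeat deltas, ms
lo, hi = bootstrap_ci95(per_run_deltas)
```

With only three repeats per profile, the interval is wide; this is exactly why the text calls for high-N reruns.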
AWS G5 Frontend Miss-Mitigation (Updated Canonical, 2026-02-22)
Sources:
- `attn_backend_frontend_matrix_20260222T230445Z.md`
- `attn_backend_frontend_matrix_20260222T231139Z.md`
- `attn_backend_frontend_missmit_compare_20260222T231335Z.md`
- `preload_exact_prompt_probe_20260222T231050Z.json`
Comparison (no_preload -> startup_preload_benchmark_queries):
| Profile | Metric | Fused Frontend Before | Fused Frontend After | Speedup |
|---|---|---|---|---|
| mixed_churn | warm request mean | 843.242 ms | 22.433 ms | 37.590x |
| mixed_churn | warm infer mean | 842.684 ms | 21.965 ms | 38.365x |
| mixed_churn | warm TTFT | 179.541 ms | 4.497 ms | 39.928x |
| mixed_churn | cold TTFT | 704.521 ms | 4.495 ms | 156.723x |
| mixed_churn | cold full latency | 6593.495 ms | 25.785 ms | 255.707x |
Exact cold-prompt probe:
- fused first-hit TTFT: `4.499 ms`
- fused first-hit full latency: `26.090 ms`
Interpretation:
- With benchmark-query preload coverage, the prior fused cold/mixed miss spikes are effectively removed in this harness.
- Custom still leads warmed request-path latency slightly (about `0.90` custom/frontend ratio), but cold TTFT is now near parity on the covered prompt set.
AWS G5 Frontend Miss-Mitigation (Shape Prebuild, No Preload Prompts, 2026-02-22)
Sources:
- `attn_backend_frontend_matrix_20260222T230445Z.md`
- `attn_backend_frontend_matrix_20260222T233003Z.md`
- `attn_backend_frontend_missmit_compare_20260222T233116Z.md`
- `prebuild_startup_nopreload_probe_20260222T232932Z.json`
- `prebuild_startup10_nopreload_probe_20260222T235944Z.json`
- `prebuild_startup8_nopreload_probe_20260223T000600Z.json`
- `attn_backend_frontend_matrix_20260223T000256Z.md`
- `attn_backend_frontend_missmit_compare_20260223T000343Z.md`
Comparison (no_preload -> shape_prebuild_nopreload):
| Profile | Metric | Fused Frontend Before | Fused Frontend After | Speedup |
|---|---|---|---|---|
| mixed_churn | cold TTFT | 704.521 ms | 5.805 ms | 121.364x |
| mixed_churn | cold full latency | 6593.495 ms | 255.267 ms | 25.830x |
| mixed_churn | warm request mean | 843.242 ms | 51.482 ms | 16.379x |
| mixed_churn | warm TTFT | 179.541 ms | 4.824 ms | 37.218x |
Cold probe (fused frontend, no preload prompts, startup prebuild on):
- startup->healthy: `11017.541 ms`
- request TTFT: `5.814 ms`
- request full latency: `255.434 ms`

Tuned startup probe (`seq_kv_max: 16 -> 10`):

- startup->healthy: `7011.472 ms` (`1.571x` faster startup)
- request TTFT: `5.826 ms` (near-identical)
- request full latency: `254.936 ms` (near-identical)

Lower-range probe (`seq_kv_max=8`):

- startup->healthy: `6010.381 ms`
- request TTFT: `703.771 ms` (regression)
- request full latency: `1660.576 ms` (regression)

Tuned matrix confirmation (`seq_kv_max=16 -> 10`):

- warm-fixed fused request mean: `22.556 -> 22.265 ms`
- mixed fused request mean: `51.482 -> 50.974 ms`
- cold fused TTFT: `5.827 -> 5.819 ms`

Interpretation:

- Shape-level prebuild removes no-preload fused request-path spikes without prompt-list curation.
- Tradeoff: startup latency is still elevated due to up-front plan builds, but the tuned range materially reduces that startup cost.
- `seq_kv_max=10` is currently the minimum safe tuned range for this benchmark prompt profile.
AWS G5 Frontend Miss-Mitigation (Hybrid Shape Gate, No Preload Prompts, 2026-02-23)
Sources:
- `prebuild_hybrid10_nopreload_probe_r1_20260223T002214Z.json`
- `prebuild_hybrid10_nopreload_probe_r2_20260223T002214Z.json`
- `prebuild_hybrid10_nopreload_probe_r3_20260223T002214Z.json`
- `attn_backend_frontend_matrix_20260223T001959Z.md`
- `attn_backend_frontend_missmit_compare_20260223T002153Z.md`
- `hybrid_shape_sanity_20260223T002857Z.json`
- `hybrid_shape_sanity_maxgate_20260223T003453Z.json`
- `attn_backend_frontend_matrix_20260223T003611Z.md`
- `attn_backend_frontend_missmit_compare_20260223T003734Z.md`
Policy:
- `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV=10`
- `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=10`
- `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128`
- `TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV=10`
Startup probe (3 runs, fused frontend, no preload prompts):
- startup->healthy: `2004.840 +/- 0.146 ms`
- request TTFT: `4.955 +/- 0.011 ms`
- request full latency: `242.673 +/- 0.352 ms`

Delta vs prior tuned no-gate probe (`prebuild_startup10_nopreload_probe_20260222T235944Z`):

- startup->healthy: `7011.472 -> 2004.840 ms` (`3.497x` faster)
- request TTFT: `5.826 -> 4.955 ms` (`1.176x` faster)
- request full latency: `254.936 -> 242.673 ms` (`1.051x` faster)

Matrix deltas vs prior tuned no-gate matrix (`attn_backend_frontend_matrix_20260223T000256Z`):

- warm-fixed fused request mean: `22.265 -> 20.354 ms` (`1.094x` faster)
- mixed fused request mean: `50.974 -> 47.904 ms` (`1.064x` faster)
- cold fused TTFT: `5.819 -> 4.959 ms` (`1.173x` faster)
- cold fused full latency: `254.146 -> 242.569 ms` (`1.048x` faster)

Interpretation:

- Hybrid shape gating removes most remaining startup cost while keeping low no-preload request-path latency.
- In this harness, strict fused mode stays inference-valid with low-shape fallback to the custom path.
- Initial broader-shape sanity exposed out-of-window (`seq_kv>10`) miss cascades; adding a bounded gate (`TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10`) removed those cascades and cut the same 5-shape set's mean full latency from `9974.576 ms` to `274.072 ms` (`36.395x`) while keeping fixed-profile matrix behavior near-identical.
Internal vs External Routing (G5, 2026-02-17)
| Metric | Internal | External | Ratio |
|---|---|---|---|
| Mean latency | 94.849 ms | 97.927 ms | 1.032x |
| Task | Internal | External |
|---|---|---|
| general_short | 150.767 ms | 152.274 ms |
| receipt_extract | 80.732 ms | 81.270 ms |
| search_grounded | 46.945 ms | 57.237 ms |
| summarize_short | 100.950 ms | 100.928 ms |
Internal routing is faster in aggregate and faster on 3/4 tasks in this set (the remaining task is effectively tied).
Internal vs External Routing (G5, 2026-02-18, Failure-Amplification Stress)
| Metric | Internal | External | Ratio |
|---|---|---|---|
| Mean latency | 76.071 ms | 109.806 ms | 1.443x |
| Error rate | 0.0000 | 0.0833 | inf |
Stress profile:
- Tool failure injection: every 2nd request.
- Tool timeout injection: every 3rd request (`0.9 s` sleep) with controller timeout `0.25 s`.
- Controller tool retries: `1`.

Observed amplification:

- External taxonomy: `tool_hop_failed=4`.
- External tool retries mean: `0.182`.
Internal vs External Routing Matrix (G5, 2026-02-19)
| Profile | Ext/Int Latency Ratio | Int Error Rate | Ext Error Rate | Ext Tool Retries |
|---|---|---|---|---|
| p00 baseline | 1.042x | 0.0000 | 0.0000 | 0.000 |
| p01 fail mild | 1.048x | 0.0000 | 0.0000 | 0.021 |
| p02 timeout mild | 1.142x | 0.0000 | 0.0000 | 0.021 |
| p03 mixed moderate | 1.164x | 0.0000 | 0.0417 | 0.065 |
| p04 mixed aggressive | 1.436x | 0.0000 | 0.0833 | 0.182 |
| p05 mixed aggressive + retry2 | 1.416x | 0.0000 | 0.0833 | 0.091 |
Matrix mean ratio (all profiles): 1.208x external/internal.
Internal vs External Routing Cross-Host Pilot (2026-02-19)
Topology:
- local benchmark client
- SSH tunnel to G5 host
- runtime + external router on G5
| Profile | Internal Mean | External Mean | Ext/Int Ratio | Int Error | Ext Error |
|---|---|---|---|---|---|
| crosshost-p00-baseline | 1071.477 ms | 1059.478 ms | 0.989x | 0.0000 | 0.0000 |
| crosshost-p02-timeout-mild | 1054.123 ms | 1123.393 ms | 1.066x | 0.0000 | 0.0000 |
| crosshost-p04-stress | 1056.013 ms | 1100.010 ms | 1.042x | 0.0000 | 0.0833 |
Stress notes:
- injected tool failure every 2nd request
- injected timeout every 3rd request (`900 ms` sleep)
- controller tool retries `1` with `20 ms` backoff
Internal vs External Routing Split-Host Matrix (2026-02-19, Canonical Track B)
Topology:
- GPU host: runtime endpoint
- CPU host: external controller + tool services
- controller/tool call runtime over private VPC network
| Profile | Internal Mean | External Mean | Ext/Int Ratio | Int Error | Ext Error | Ext Tool Retries |
|---|---|---|---|---|---|---|
| splithost-p00-baseline | 1052.392 ms | 1046.702 ms | 0.995x | 0.0000 | 0.0000 | 0.000 |
| splithost-p01_fail_mild | 1051.906 ms | 1049.576 ms | 0.998x | 0.0000 | 0.0000 | 0.021 |
| splithost-p02_timeout_mild | 1055.872 ms | 1100.629 ms | 1.042x | 0.0000 | 0.0000 | 0.021 |
| splithost-p03_mixed_moderate | 1071.284 ms | 1072.634 ms | 1.001x | 0.0000 | 0.0417 | 0.065 |
| splithost-p04_mixed_aggressive | 1051.099 ms | 1142.209 ms | 1.087x | 0.0000 | 0.0833 | 0.182 |
| splithost-p05_mixed_aggressive_retry2 | 1064.215 ms | 1112.362 ms | 1.045x | 0.0000 | 0.0833 | 0.091 |
Matrix means (all profiles):
- External/Internal latency ratio: `1.028x`.
- Internal error rate: `0.0000`.
- External error rate: `0.0347`.
Internet Multi-Hop Routing (Fly + Commercial APIs, 2026-02-20)
Topology:
- internal path: local client -> commercial API
- external path: local client -> Fly controller/tool -> same commercial API
| Provider/Model | Runs/Profile | Matrix Mean Ext/Int | Baseline p00 | Timeout-Mild p02 | Stress p04 | Internal Error | External Error (p04) |
|---|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 | 3 | 1.112x | 1.110x | 1.082x | 1.145x | 0.0000 | 0.0833 |
| OpenRouter openai/gpt-5.2 | 3 | 0.755x | 0.686x | 0.891x | 0.689x | 0.0000 | 0.1667 |
| OpenRouter anthropic/claude-sonnet-4.6 | 3 | 1.028x | 1.236x | 0.968x | 0.879x | 0.0000 | 0.1667 |
Interpretation:
- OpenAI matrix shows the expected direction for Track B (external hop overhead under internet routing).
- OpenRouter rows are mixed/inverted by profile and remain non-canonical for Track B direction claims in this topology.
Local Control Routing (No Fly Scheduler Path, 2026-02-20)
Topology:
- internal path: local client -> commercial API
- external path: local client -> local standalone controller/tool -> same commercial API
| Provider/Model | Runs/Profile | Matrix Mean Ext/Int | Baseline p00 | Timeout-Mild p02 | Stress p04 | Internal Error | External Error Mean |
|---|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 | 8 | 0.987x | 0.995x | 0.977x | 0.988x | 0.0000 | 0.0313 |
| OpenRouter anthropic/claude-sonnet-4.6 | 8 | 1.066x | 1.055x | 1.141x | 1.003x | 0.0000 | 0.0313 |
Interpretation:
- With higher-N local controls, OpenAI is near parity and OpenRouter Sonnet trends `external > internal`.
- External errors still appear under stress, while internal remained error-free in these runs.
Task-Family Parity Split (Local Control, runs=8, 2026-02-20, Legacy Pre-Fairness)
| Provider/Model | Model-Only Ext/Int | Tool-Only Ext/Int | Model-Only Errors (Int/Ext) | Tool-Only Errors (Int/Ext) |
|---|---|---|---|---|
| OpenAI gpt-5.2 | 0.958x | 1.136x | 0 / 0 | 0 / 0 |
| OpenRouter anthropic/claude-sonnet-4.6 | 1.044x | 1.051x | 0 / 0 | 0 / 0 |
Interpretation:
- Tool-only tasks support the architecture claim on both providers.
- Model-only behavior is near parity on OpenAI and still favorable on Sonnet.
Task-Family Parity Split (Fairness-Hardened, Local Control, runs=8, 2026-02-20)
Harness controls:
- interleaved ordering (`pair_order=alternate`)
- deterministic defaults (`temperature=0`)
- strict tool parity on `tool_only`
- token-normalized reporting (`ms/completion_token`)
| Provider/Model | Model-Only Ext/Int | Tool-Only Ext/Int | Model-Only Int ms/token | Model-Only Ext ms/token | Tool-Only Int ms/token | Tool-Only Ext ms/token |
|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 | 0.971x | 1.038x | 57.657 | 57.663 | 37.553 | 38.990 |
| OpenRouter anthropic/claude-sonnet-4.6 | 1.102x | 1.063x | 61.606 | 70.054 | 41.212 | 43.791 |
Interpretation:
- Tool-only tasks remain favorable to internal on both providers after parity hardening.
- OpenAI model-only remains near parity/slight inversion; Sonnet model-only favors internal.
- Track B commercial claims remain task-family-stratified.
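Token-normalized reporting divides request wall time by completion-token count so that verbose and terse providers compare fairly. A minimal sketch (field names illustrative, not the harness schema):

```python
# Sketch: token-normalized latency (ms per completion token) lets
# verbose and terse completions compare fairly. Field names are
# illustrative, not the harness schema.

def ms_per_completion_token(latency_ms, completion_tokens):
    if completion_tokens <= 0:
        raise ValueError("need at least one completion token")
    return latency_ms / completion_tokens

# e.g. a 5766 ms response that produced 100 completion tokens
normalized = ms_per_completion_token(5766.0, 100)
```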
Commercial Root-Cause Grouping (Fairness r4+r8, 2026-02-22)
Source:
Delta is external - internal (positive means internal is faster):
| Group | Paired N | Latency Delta Mean | Latency CI95 | Class | Controller Overhead Mean | Model Hop Mean |
|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 model-only | 36 | -69.311 ms | [-193.985, 61.444] | near_parity_noise_dominated | 2.081 ms | 1406.971 ms |
| OpenAI gpt-5.2 tool-only parity | 12 | +49.601 ms | [-162.047, 274.981] | near_parity_noise_dominated | 12.842 ms | 2456.108 ms |
| OpenRouter Sonnet 4.6 model-only | 24 | +204.883 ms | [-148.517, 683.114] | near_parity_noise_dominated | 2.254 ms | 2220.251 ms |
| OpenRouter Sonnet 4.6 tool-only parity | 8 | +165.092 ms | [-124.650, 423.449] | near_parity_noise_dominated | 14.446 ms | 2788.196 ms |
Interpretation:
- Current commercial control set is not statistically locked as win/loss by group (all CI95 include zero).
- Measured controller overhead is small relative to model-hop variance, so higher-N region-pinned reruns are needed before claiming directional commercial performance differences.
Phase 3 Agentic Loops (Canonical G5, 2026-02-19, 3 Seeds)
| Metric | Internal | External | Ratio |
|---|---|---|---|
| Success rate (mean) | 1.0000 | 0.9006 | +0.0994 (internal-ext) |
| Mean latency (mean) | 2.668 ms | 42.855 ms | 16.060x |
| Mean steps to convergence (success, mean) | 2.077 | 3.770 | 1.815x |
Per-scenario snapshot:
| Scenario | Int Success | Ext Success | Ext/Int Latency | Ext/Int Steps |
|---|---|---|---|---|
| retrieval_correction | 1.0000 | 1.0000 | 8.853x | 1.333x |
| tool_state_adaptation | 1.0000 | 0.7417 | 29.793x | 3.000x |
| confidence_gated_branching | 1.0000 | 1.0000 | 17.948x | 1.700x |
Stress variant (tool_fail_every=9, tool_timeout_every=11, controller retries 2):
| Metric | Internal | External | Ratio |
|---|---|---|---|
| Success rate (mean) | 1.0000 | 0.8782 | +0.1218 (internal-ext) |
| Mean latency (mean) | 2.669 ms | 205.942 ms | 77.170x |
| Mean steps to convergence (success, mean) | 2.077 | 3.789 | 1.824x |
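The Ratio and success-gap columns in these loop tables are direct arithmetic on the per-arm means. A quick recomputation from the stress-variant row (expect a few hundredths of drift when deriving ratios from rounded table values):

```python
# Values copied from the stress-variant table above.
int_success, ext_success = 1.0000, 0.8782
int_latency_ms, ext_latency_ms = 2.669, 205.942

success_gap = int_success - ext_success          # "internal-ext" column
latency_ratio = ext_latency_ms / int_latency_ms  # "Ratio" column

assert abs(success_gap - 0.1218) < 1e-9
assert abs(latency_ratio - 77.170) < 0.05  # drift from table rounding
```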
Phase 3 Uncertainty Ablation (G5, 2026-02-19, baseline+stress repeatability)
Normalized-logprob source (paper-aligned), 3-seed means:
| Arm | Internal Success | External Success | Ext/Int Latency |
|---|---|---|---|
| baseline int_on_ext_on | 1.0000 | 0.9006 | 16.721x |
| baseline int_off_ext_on | 0.7692 | 0.9006 | 22.654x |
| baseline int_on_ext_off | 1.0000 | 0.6698 | 12.108x |
| baseline int_off_ext_off | 0.7692 | 0.6698 | 15.316x |
| stress int_on_ext_on | 1.0000 | 0.8782 | 68.233x |
| stress int_off_ext_on | 0.7692 | 0.8782 | 90.080x |
| stress int_on_ext_off | 1.0000 | 0.6570 | 48.146x |
| stress int_off_ext_off | 0.7692 | 0.6570 | 62.463x |
Uncertainty-on gains (success delta means):
- Baseline internal: +0.2308
- Baseline external: +0.2308
- Stress internal: +0.2308
- Stress external: +0.2212
Cross-source check:
- `raw_logit_margin` and `hybrid` preserve the same internal/external success deltas as normalized-logprob in both baseline and stress sets.
- These rows are harness-level synthetic uncertainty; runtime-native canonical corroboration is published below.
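The uncertainty-on gains are pairwise on/off success differences. Note in the table that internal success depends only on the internal flag and external success only on the external flag (rows agree column-wise), so each gain reduces to one subtraction:

```python
# Per-flag success means copied from the normalized-logprob table above.
baseline = {"int_on": 1.0000, "int_off": 0.7692, "ext_on": 0.9006, "ext_off": 0.6698}
stress   = {"int_on": 1.0000, "int_off": 0.7692, "ext_on": 0.8782, "ext_off": 0.6570}

for name, arm in (("baseline", baseline), ("stress", stress)):
    int_gain = arm["int_on"] - arm["int_off"]
    ext_gain = arm["ext_on"] - arm["ext_off"]
    print(f"{name}: internal {int_gain:+.4f}, external {ext_gain:+.4f}")
# baseline: internal +0.2308, external +0.2308
# stress: internal +0.2308, external +0.2212
```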
Phase 3 Uncertainty Ablation (G5, 2026-02-19, runtime-native canonical rerun, 3 seeds, superseded)
Runtime-native source (/v1/chat/completions uncertainty payload), after greedy uncertainty kernel fix:
| Arm | Internal Success | External Success | Ext/Int Latency |
|---|---|---|---|
| baseline int_on_ext_on | 0.8718 | 0.7853 | 10.9504x |
| baseline int_off_ext_on | 0.7692 | 0.9006 | 19.9477x |
| baseline int_on_ext_off | 1.0000 | 0.6698 | 10.7348x |
| baseline int_off_ext_off | 0.7692 | 0.6698 | 13.8492x |
| stress int_on_ext_on | 1.0000 | 0.8782 | 74.1471x |
| stress int_off_ext_on | 0.7692 | 0.8782 | 95.3550x |
| stress int_on_ext_off | 1.0000 | 0.6570 | 53.1571x |
| stress int_off_ext_off | 0.7692 | 0.6570 | 68.2719x |
Uncertainty-on gains (runtime-native success delta means):
- Baseline internal: +0.1026
- Baseline external: +0.1155
- Stress internal: +0.2308
- Stress external: +0.2212
Note:
- This set established runtime-native wiring, but later audit found fallback contamination on part of the seed set.
- Use the 2026-02-20 quality-gated rerun below for current canonical C2 interpretation.
Phase 3 Uncertainty Ablation (G5, 2026-02-20, runtime-native awareness3 rerun, zero-fallback quality gate)
Runtime-native source (awareness.generation first, legacy fallback preserved), seeds 7/11/19, baseline + stress:
| Metric | Baseline | Stress |
|---|---|---|
| Internal uncertainty-on success delta | -0.1538 | -0.1538 |
| External uncertainty-on success delta | -0.1217 | -0.1089 |
Quality gate:
- All runtime-native arm artifacts have non-zero runtime requests/ok, `fallback=0`, `errors=0`.
Interpretation:
- With clean runtime-native probes, uncertainty-on currently hurts success in this harness.
- Runtime-native awareness plumbing is validated; this negative set is superseded by the calibrated rerun below.
Phase 3 Uncertainty Ablation (G5, 2026-02-20, runtime-native calibrated rerun calib1, zero-fallback quality gate)
Runtime-native source, seeds 7/11/19, baseline + stress, calibration params:
- prior weight `0.75`
- confidence floor `0.10`
- confidence ceil `0.35`
- route blend `0.10`
| Metric | Baseline | Stress |
|---|---|---|
| Internal uncertainty-on success delta | +0.1539 | +0.1539 |
| External uncertainty-on success delta | +0.1058 | +0.1154 |
Quality gate:
- All runtime-native arm artifacts have non-zero runtime requests/ok, `fallback=0`, `errors=0`.
Interpretation:
- Calibrated runtime-native uncertainty restores positive uncertainty-on gains in both baseline and stress profiles.
- C2 is re-locked for this harness configuration.
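For intuition only, here is one way the four calib1 parameters could compose. The runtime's actual formula is not published in this table, so `calibrate`, `blend_route`, and the neutral `PRIOR_CONF` value are hypothetical illustrations of prior-weight shrinkage, floor/ceil clamping, and route blending:

```python
# calib1 parameters from the section above; everything else is assumed.
PRIOR_WEIGHT, CONF_FLOOR, CONF_CEIL, ROUTE_BLEND = 0.75, 0.10, 0.35, 0.10
PRIOR_CONF = 0.2  # hypothetical neutral prior confidence

def calibrate(raw_conf: float) -> float:
    # Shrink the raw runtime confidence toward the prior, then clamp.
    blended = PRIOR_WEIGHT * PRIOR_CONF + (1.0 - PRIOR_WEIGHT) * raw_conf
    return min(CONF_CEIL, max(CONF_FLOOR, blended))

def blend_route(old_score: float, new_conf: float) -> float:
    # Let the calibrated confidence move the routing score only fractionally.
    return (1.0 - ROUTE_BLEND) * old_score + ROUTE_BLEND * new_conf
```

Under these assumptions a raw confidence of 1.0 calibrates to the 0.35 ceiling, i.e. the floor/ceil pair keeps any single runtime-native probe from dominating routing.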
External Cold Comparison (G5, 2026-02-24, Step0 Exp-Reuse Patch, 3-run means, Current Best)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 1003.287 ms | 4.018 ms | 238.400 ms | 1241.688 ms |
| pytorch_transformers | - | 509.427 ms | 2234.756 ms | 7847.001 ms |
| vllm | 23366.391 ms | 50.406 ms | 997.514 ms | 24363.905 ms |
| ollama (GGUF) | - | - | - | - |
Interpretation:
- Runtime keeps a large margin vs PyTorch and vLLM on request path and cold-total in this rerun.
- Runtime also improved vs the immediate seq1mh baseline (2026-02-24T192020Z): TTFT 4.022 -> 4.018 ms, full 239.277 -> 238.400 ms, cold-total 1242.592 -> 1241.688 ms.
- Runtime remains materially better than the prior host-prefetch means (2026-02-19): TTFT 5.130 -> 4.018 ms, full 316.403 -> 238.400 ms, cold-total 1320.240 -> 1241.688 ms.
- A follow-up shared-probability patch (`external_cold_step0shared_repeatability_20260224T194913Z`) did not beat this row and was reverted.
- Ollama is intentionally blank here because it was not installed on the rerun host.
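Within millisecond rounding, the Cold Total First Response column in these comparisons decomposes as startup-to-healthy plus request full latency. A quick consistency check on the runtime and vLLM rows, assuming that decomposition:

```python
# (startup_ms, request_full_ms, cold_total_ms) from the table above.
rows = {
    "runtime": (1003.287, 238.400, 1241.688),
    "vllm": (23366.391, 997.514, 24363.905),
}
for name, (startup_ms, full_ms, cold_total_ms) in rows.items():
    # Startup plus first-request full should reproduce the cold total.
    assert abs((startup_ms + full_ms) - cold_total_ms) < 0.01, name
```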
External Cold Decode Profiling + Uncertainty A/B (G5, 2026-02-25, No Preload, 64 Tokens)
Profile source:
- external_cold_stepn_profile_20260225T001334Z
- external_cold_uncert_on_20260225T001702Z
- external_cold_uncert_off_20260225T001704Z
| Runtime Mode | Request TTFT | Request Full | Infer | decoder_stepN_layers_mean | decoder_stepN_logits_sample_mean |
|---|---|---|---|---|---|
| uncertainty on | 4.109 ms | 479.889 ms | 461.771 ms | 1.360 ms | 2.671 ms |
| uncertainty off | 3.991 ms | 473.367 ms | 454.878 ms | 1.360 ms | 2.562 ms |
Interpretation:
- The dominant decode stage in this profile is `decoder_stepN_logits_sample`, then `decoder_stepN_layers`.
- Disabling uncertainty stats helps, but the uplift is modest relative to total decode time; the main custom-kernel target remains logits+sample compute.
External Cold Runtime vs vLLM (G5, 2026-02-25, Same Profile, Uncertainty-Off Runtime)
Source:
- external_cold_runtime_vllm_uncertoff_20260225T001929Z
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 1003.392 ms | 3.929 ms | 472.724 ms | 1476.116 ms |
| vllm | 23032.532 ms | 49.577 ms | 1311.481 ms | 24344.013 ms |
Ratios (vLLM/runtime):
- TTFT: 12.618x
- Request Full: 2.774x
- Cold Total First Response: 16.492x
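The ratio bullets follow directly from the table rows:

```python
# Per-backend means from the table above (ms).
runtime = {"ttft": 3.929, "full": 472.724, "cold_total": 1476.116}
vllm = {"ttft": 49.577, "full": 1311.481, "cold_total": 24344.013}

ratios = {k: vllm[k] / runtime[k] for k in runtime}
# ttft ~12.618x, full ~2.774x, cold_total ~16.492x
```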
Full-Depth FFN Projection Batched2 (G5, 2026-02-27, --layers 36, preload64)
Source:
- external_cold_layers36_preload64_ab3_ffnprojbatch2_off_s{1,2,3}_20260227T1820*.json
- external_cold_layers36_preload64_ab3_ffnprojbatch2_on_s{1,2,3}_20260227T1820*.json
- external_cold_layers36_preload64_ab3_ffnprojbatch2_vllm_off_s{1,2,3}_20260227T1821*/1822*.json
- external_cold_layers36_preload64_ab3_ffnprojbatch2_vllm_on_s{1,2,3}_20260227T1823*/1824*/182512Z.json
- external_cold_layers36_stageprofile_ffnprojbatch2_off_20260227T182949Z
- external_cold_layers36_stageprofile_ffnprojbatch2_on_20260227T182728Z
Runtime-only 3-seed means:
| Mode | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|
| batched2 off | 15.189 ms | 1702.190 ms | 4708.109 ms |
| batched2 on | 15.018 ms | 1689.991 ms | 4696.805 ms |
Runtime-vLLM 3-seed (runtime leg):
| Mode | Runtime TTFT | Runtime Full | Runtime Cold Total |
|---|---|---|---|
| batched2 off | 15.207 ms | 1704.091 ms | 4710.111 ms |
| batched2 on | 15.032 ms | 1691.116 ms | 4697.207 ms |
Stage-profile corroboration (off -> on):
- `decoder_step_profile_ffn_proj_mean`: 0.205 -> 0.196 ms/layer
- `decoder_stepN_layers_mean`: 19.140 -> 18.447 ms
- `decoder_stepN_total_mean`: 20.761 -> 20.044 ms
Interpretation:
- Batched2 is a real full-depth uplift and is now default-on.
- Runtime still trails vLLM on first-request full latency in this profile, but the gap narrowed again.
Full-Depth FFN Proj Fast-Compute Probe (G5, 2026-02-27, --layers 36, preload64)
Source:
- external_cold_layers36_preload64_ab3_ffnprojfast_off_s{1,2,3}_20260227T194728Z.json
- external_cold_layers36_preload64_ab3_ffnprojfast_on_s{1,2,3}_20260227T194728Z.json
- external_cold_layers36_preload64_ab8_ffnprojfast_off_s{1..8}_20260227T195024Z.json
- external_cold_layers36_preload64_ab8_ffnprojfast_on_s{1..8}_20260227T195024Z.json
- week3_parity_report_ffnprojfast_20260227T194853Z.json
Runtime-only means:
| Set | Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|---|
| 3-seed | off | 14.690 ms | 1663.669 ms | 1641.932 ms | 4669.765 ms |
| 3-seed | on | 14.680 ms | 1662.115 ms | 1640.919 ms | 4668.126 ms |
| 8-seed | off | 14.683 ms | 1662.812 ms | 1640.942 ms | 4668.851 ms |
| 8-seed | on | 14.685 ms | 1662.601 ms | 1641.667 ms | 4668.678 ms |
Interpretation (historical for this cycle):
- This specific cycle was too small/noisy to justify default promotion at that time.
- Later clean-path reruns on 2026-02-28 (pool16g, no fallback) looked positive, but the full foundation gate (`foundation_ffnprojfast_gate_ab2_20260228T195240Z`) rejected global promotion; the canonical default remains `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0`.
Full-Depth U16 Tensor Cache Unlock (G5, 2026-02-27, claim-safe A/B, --layers 36, preload64)
Source:
- external_cold_layers36_preload64_u16cache_claimsafe_summary_20260227T200242Z.json
- external_cold_layers36_preload64_u16cache_claimsafe_summary_20260227T200242Z.md
- week3_parity_report_u16cache_toggle_default_20260227T200652Z.json
- week3_parity_report_u16cachefix_default_20260227T195625Z.json
Runtime-only 3-seed A/B:
| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|
| `TRENI_TENSOR_CACHE_U16=0` | 14.679 ms | 1661.982 ms | 1640.118 ms | 4667.860 ms |
| `TRENI_TENSOR_CACHE_U16=1` | 14.682 ms | 1189.452 ms | 1168.883 ms | 4195.511 ms |
Runtime-vLLM same-window A/B (2-seed):
| Mode | Runtime Full | vLLM Full | Runtime-vLLM Full Delta |
|---|---|---|---|
| `TRENI_TENSOR_CACHE_U16=0` | 1663.314 ms | 1325.189 ms | +338.124 ms |
| `TRENI_TENSOR_CACHE_U16=1` | 1192.145 ms | 1290.816 ms | -98.671 ms |
Mechanism check:
- Request-path `decoder_tensor_upload` dropped from ~476 ms to ~5 ms.
- Request-path `decoder_tensor_h2d` dropped from ~468 ms to 0 ms.
Interpretation:
- This is the primary full-depth request-path unlock in the current cycle.
- With `TRENI_TENSOR_CACHE_U16=1`, runtime flips from trailing vLLM on request full to leading in the same-window compare.
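The mechanism check ties to the headline numbers: the request-full saving in the runtime-only A/B should be explained almost entirely by the `decoder_tensor_upload` drop. Using the approximate stage means quoted above:

```python
# Runtime-only 3-seed request-full means (ms), U16 cache off vs on.
full_off, full_on = 1661.982, 1189.452
# Approximate decoder_tensor_upload stage means (ms) from the bullets above.
upload_off, upload_on = 476.0, 5.0

full_saving = full_off - full_on        # ~472.5 ms end-to-end
upload_saving = upload_off - upload_on  # ~471 ms in the upload stage alone
assert abs(full_saving - upload_saving) < 5.0  # agree within a few ms
```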
Full-Depth FFN Follow-Up (G5, 2026-02-27 Late Night, --layers 36, preload64)
Source:
external_cold_layers36_ffn_followup_summary_20260227T223458Z.jsonexternal_cold_layers36_ffn_followup_summary_20260227T223458Z.md
Lane 1: TRENI_LINEAR_BATCHED2_USE_LT (new optional backend)
| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|
| off | 14.683 ms | 1190.339 ms | 1169.828 ms | 4196.547 ms |
| on | 14.846 ms | 1202.808 ms | 1182.363 ms | 4209.043 ms |
Lane 2: TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=1 + TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 (runtime-only AB8)
| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|
| off | 14.683 ms | 1190.212 ms | 1169.500 ms | 4196.259 ms |
| on | 14.682 ms | 1190.013 ms | 1169.399 ms | 4196.057 ms |
Lane 3: FFN fused path bias-deferral follow-up (TRENI_DECODER_FFN_PROJ_U16_FUSED=0/1)
| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|
| off | 14.688 ms | 1190.539 ms | 1169.862 ms | 4196.613 ms |
| on | 14.684 ms | 1190.156 ms | 1169.701 ms | 4196.147 ms |
Interpretation:
- No new lane is promoted from this cycle.
- `TRENI_LINEAR_BATCHED2_USE_LT=1` regresses materially in runtime-only full-depth A/B.
- F32-input/fast-compute and fused-bias-deferral follow-ups are both near-noise and not material enough for canonical change.
Fast-Profile Logits Follow-Up (G5, 2026-02-28, --layers 2, preload64)
Source:
- external_cold_layers2_logitsfast_ab8_summary_20260228T005529Z.json
- external_cold_layers2_logitsfast_ab8_summary_20260228T005529Z.md
Runtime-only AB8 (TRENI_DECODER_LOGITS_U16_FAST_COMPUTE=0/1):
| Mode | Request TTFT | Request Full | Infer | Cold Total First Response |
|---|---|---|---|---|
| off | 2.575 ms | 205.244 ms | 185.880 ms | 1208.636 ms |
| on | 2.573 ms | 204.945 ms | 185.867 ms | 1208.291 ms |
Interpretation:
- Fast-profile logits fast-compute remains near-noise (full -0.299 ms), so it is not promoted.
Mixed-Load p99 Repeatability (G5, 2026-02-28, Canonical Lane)
Source:
- mixed_load_repeatability_summary_20260228T005626Z.json
- mixed_load_repeatability_summary_20260228T005626Z.md
3-run (run_mode=mixed_load, http_runs=120 each):
| Metric | Mean Across Runs |
|---|---|
| Request Mean | 122.247 ms |
| Request p95 | 198.518 ms |
| Request p99 | 199.608 ms |
Interpretation:
- Current canonical lane remains stable under this mixed-load repeatability set; no configuration change.
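The p95/p99 columns are order statistics over the 120 HTTP requests in each run. The harness's exact percentile convention is not stated here, but a common nearest-rank definition looks like:

```python
import math

def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: smallest value with >= p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 120 synthetic latencies (1..120 ms) stand in for one http_runs=120 run.
latencies = [float(i) for i in range(1, 121)]
assert percentile_nearest_rank(latencies, 95) == 114.0
assert percentile_nearest_rank(latencies, 99) == 119.0
```

With only 120 samples, p99 is effectively the second-worst request, which is why p95 and p99 sit close together in the table above.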
External Cold Comparison (G5, 2026-02-18)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 1003.537 ms | 1108.979 ms | 1339.459 ms | 2342.996 ms |
| pytorch_transformers | - | 528.483 ms | 2288.516 ms | 8725.259 ms |
| vllm | 24032.203 ms | 51.763 ms | 1036.815 ms | 25069.018 ms |
| ollama (GGUF) | 1002.695 ms | 2168.902 ms | 2527.411 ms | 3530.106 ms |
Cold total first response ratio over runtime:
- PyTorch: 3.724x
- vLLM: 10.700x
- Ollama: 1.507x
Request-path only note:
- vLLM is fastest on request-path TTFT/full once healthy, but has high startup in this run.
External Cold Comparison (G5, 2026-02-18, Runtime Preload + Tokenizer Cache, Non-Parity)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 2004.735 ms | 91.596 ms | 271.346 ms | 2276.081 ms |
| pytorch_transformers | - | 522.795 ms | 2252.382 ms | 8374.324 ms |
| vllm | 27036.682 ms | 51.725 ms | 1035.826 ms | 28072.508 ms |
| ollama (GGUF) | 1002.508 ms | 2182.542 ms | 2538.609 ms | 3541.117 ms |
Runtime advantage in this variant:
- Request full latency vs vLLM: 3.817x faster.
- Cold total first response vs vLLM: 12.334x faster.
- TTFT still trails vLLM (91.596 ms vs 51.725 ms).
- Caveat: runtime was still using 4 decode steps in this run while vLLM/PyTorch/Ollama used 48.
External Cold Comparison (G5, 2026-02-18, Token Parity = 48)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 2004.203 ms | 91.207 ms | 2518.142 ms | 4522.345 ms |
| pytorch_transformers | - | 501.946 ms | 2244.327 ms | 11272.831 ms |
| vllm | 27036.248 ms | 51.310 ms | 1075.404 ms | 28111.652 ms |
| ollama (GGUF) | 1002.560 ms | 2197.797 ms | 2556.652 ms | 3559.212 ms |
Parity interpretation:
- Runtime still wins cold-total first response vs vLLM (6.216x better).
- vLLM wins request-path TTFT and full latency at the equal 48-token budget.
External Cold Comparison (G5, 2026-02-18, Token Parity = 48, Decoder/Sampling Fix)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 2004.759 ms | 5.022 ms | 311.289 ms | 2316.048 ms |
| pytorch_transformers | - | 515.597 ms | 2291.854 ms | 8237.561 ms |
| vllm | 24036.762 ms | 52.995 ms | 1094.517 ms | 25131.279 ms |
| ollama (GGUF) | 1002.630 ms | 2184.383 ms | 2543.219 ms | 3545.849 ms |
Post-fix interpretation:
- Runtime now wins TTFT and full request latency vs vLLM at token parity.
- Runtime keeps the large cold-total lead vs vLLM.
- Initial 3-run repeatability (2026-02-18) means: runtime 5.022 ms TTFT, 311.444 ms full, 2316.002 ms cold-total vs vLLM 51.894/1052.767/24752.842 ms.
- Superseded by the 2026-02-19 rerun and the all-backend repeatability below.
External Cold Comparison (G5, 2026-02-19, GPU-Convert Fix2, All Backends, 3-run means)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 2339.131 ms | 5.131 ms | 318.315 ms | 2657.447 ms |
| pytorch_transformers | - | 591.635 ms | 2389.837 ms | 10420.381 ms |
| vllm | 27704.185 ms | 82.560 ms | 1226.201 ms | 28930.385 ms |
| ollama (GGUF) | 1002.622 ms | 10819.259 ms | 11178.525 ms | 12181.148 ms |
Interpretation:
- Runtime keeps the request-path lead across all backends.
- Runtime keeps the cold-total lead vs all backends in this set.
- A runtime preload upload outlier affected one run; stable runs 1-2 sit at ~1004 ms startup and ~1321 ms cold-total.
External Cold Comparison (G5, 2026-02-19, GPU-Convert + Host-Prefetch Fix, All Backends, 3-run means)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 1003.836 ms | 5.130 ms | 316.403 ms | 1320.240 ms |
| pytorch_transformers | - | 556.949 ms | 2322.565 ms | 19276.667 ms |
| vllm | 27704.770 ms | 84.837 ms | 1232.660 ms | 28937.430 ms |
| ollama (GGUF) | 1002.567 ms | 2638.945 ms | 2996.765 ms | 3999.332 ms |
Interpretation:
- Runtime keeps the request-path lead across all backends.
- Runtime keeps the cold-total lead vs all backends in this set.
- Host-prefetch rerun removed the startup/upload outlier observed in the prior repeatability set.
Historical Legacy Mixed-Mode Context
| Set | Runtime HTTP request mean | Runtime HTTP request p99 |
|---|---|---|
| T4 (2026-02-15) | 146279.609 ms | 156769.1 ms |
| G5 (2026-02-15) | 77449.605 ms | 83346.187 ms |
| G5 registry-cached single run (2026-02-16) | 82.913 ms | 91.877 ms |
Parity Health
| Set | Checked | Failed | Strict |
|---|---|---|---|
| T4 | 3 | 0 | true |
| G5 | 3 | 0 | true |