# Leaderboard
Side-by-side benchmark results for the T4 and G5 experiment sets. Lower times are better throughout.
| Metric | Value |
|---|---|
| Baseline pipeline mean | 2407.974 ms |
| Runtime warm request mean (3-run) | 82.707 ms |
| Runtime warm request p99 (3-run) | 91.738 ms |
| Baseline/runtime ratio (pipeline) | 29.11x |
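The ratio row is the baseline pipeline mean divided by the runtime warm-request mean: 2407.974 / 82.707 ≈ 29.11.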
| Model | Early TTFT | clean4 TTFT | Speedup |
|---|---|---|---|
| qwen | 27574.564 ms | 1100.044 ms | 25.067x |
| donut | 67360.388 ms | 150.322 ms | 448.107x |
| bart | 77520.798 ms | 125.011 ms | 620.112x |
| minilm | 23.342 ms | 22.621 ms | 1.032x |
All values above are runtime-instrumented `timing.ttft_ms` readings.
Validation rerun (clean7, 2026-02-18) matches these values within run noise.
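The Speedup column is simply Early TTFT divided by clean4 TTFT; a minimal recomputation, with the values hardcoded from the table above:
```python
# TTFT values (ms) copied from the table above.
ttft_ms = {
    "qwen":   (27574.564, 1100.044),
    "donut":  (67360.388, 150.322),
    "bart":   (77520.798, 125.011),
    "minilm": (23.342, 22.621),
}
for model, (early, clean4) in ttft_ms.items():
    print(f"{model}: {early / clean4:.3f}x")  # matches the Speedup column
```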
| Metric | Internal | External | Ratio |
|---|---|---|---|
| Mean latency | 94.849 ms | 97.927 ms | 1.032x |
| Task | Internal | External |
|---|---|---|
| general_short | 150.767 ms | 152.274 ms |
| receipt_extract | 80.732 ms | 81.270 ms |
| search_grounded | 46.945 ms | 57.237 ms |
| summarize_short | 100.950 ms | 100.928 ms |
Internal routing is faster in aggregate and faster on 3/4 tasks in this set (the remaining task is effectively tied).
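As a check on that claim, here is a sketch of the per-task comparison; the values are copied from the table above, and the 0.1 ms tie threshold is an assumption of this sketch, not something the harness defines:
```python
# Per-task mean latencies (ms) copied from the table above:
# (internal, external) pairs.
tasks = {
    "general_short":   (150.767, 152.274),
    "receipt_extract": (80.732, 81.270),
    "search_grounded": (46.945, 57.237),
    "summarize_short": (100.950, 100.928),
}
TIE_THRESHOLD_MS = 0.1  # assumed cutoff for "effectively tied"
for name, (internal, external) in tasks.items():
    delta = external - internal
    if abs(delta) < TIE_THRESHOLD_MS:
        verdict = "tied"
    else:
        verdict = "internal faster" if delta > 0 else "external faster"
    print(f"{name}: {delta:+.3f} ms -> {verdict}")
```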
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 1003.537 ms | 1108.979 ms | 1339.459 ms | 2342.996 ms |
| pytorch_transformers | - | 528.483 ms | 2288.516 ms | 8725.259 ms |
| vllm | 24032.203 ms | 51.763 ms | 1036.815 ms | 25069.018 ms |
| ollama (GGUF) | 1002.695 ms | 2168.902 ms | 2527.411 ms | 3530.106 ms |
Cold-total first-response ratios relative to runtime:
- PyTorch: 3.724x
- vLLM: 10.700x
- Ollama: 1.507x
Request-path-only note:
- vLLM is fastest on request-path TTFT/full once healthy, but it pays a high startup cost in this run.
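For the server-style backends, the Cold Total First Response column in these tables is consistent with Startup->Healthy plus Request Full (pytorch_transformers reports no separate startup, so the identity does not apply to it). A quick consistency check, with values from the table above:
```python
# (startup_to_healthy, request_full, cold_total) in ms, from the table above.
# The summation is an observed property of these tables, not a documented
# definition of the column.
rows = {
    "runtime": (1003.537, 1339.459, 2342.996),
    "vllm":    (24032.203, 1036.815, 25069.018),
    "ollama":  (1002.695, 2527.411, 3530.106),
}
for backend, (startup, full, cold_total) in rows.items():
    assert abs((startup + full) - cold_total) < 1e-3, backend
```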
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 2004.735 ms | 91.596 ms | 271.346 ms | 2276.081 ms |
| pytorch_transformers | - | 522.795 ms | 2252.382 ms | 8374.324 ms |
| vllm | 27036.682 ms | 51.725 ms | 1035.826 ms | 28072.508 ms |
| ollama (GGUF) | 1002.508 ms | 2182.542 ms | 2538.609 ms | 3541.117 ms |
Runtime advantage in this variant:
- Request full latency vs vLLM: 3.817x faster.
- Cold total first response vs vLLM: 12.334x faster.
- TTFT still trails vLLM (91.596 ms vs 51.725 ms).
- Caveat: runtime was still using 4 decode steps in this run while vLLM/PyTorch/Ollama used 48.
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 2004.203 ms | 91.207 ms | 2518.142 ms | 4522.345 ms |
| pytorch_transformers | - | 501.946 ms | 2244.327 ms | 11272.831 ms |
| vllm | 27036.248 ms | 51.310 ms | 1075.404 ms | 28111.652 ms |
| ollama (GGUF) | 1002.560 ms | 2197.797 ms | 2556.652 ms | 3559.212 ms |
Parity interpretation:
- Runtime still wins cold-total first response vs vLLM (6.216x better).
- vLLM wins request-path TTFT and full latency at an equal 48-token budget.
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
|---|---|---|---|---|
| runtime | 2004.759 ms | 5.022 ms | 311.289 ms | 2316.048 ms |
| pytorch_transformers | - | 515.597 ms | 2291.854 ms | 8237.561 ms |
| vllm | 24036.762 ms | 52.995 ms | 1094.517 ms | 25131.279 ms |
| ollama (GGUF) | 1002.630 ms | 2184.383 ms | 2543.219 ms | 3545.849 ms |
Post-fix interpretation:
- Runtime now wins TTFT and full request latency vs vLLM at token parity.
- Runtime keeps the large cold-total lead vs vLLM.
- 3-run repeatability (runtime and vLLM): runtime averages 5.022 ms TTFT, 311.444 ms full, and 2316.002 ms cold-total, vs vLLM at 51.894/1052.767/24752.842 ms respectively.
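Mean and p99 figures like these are straightforward to recompute from raw per-request samples. A minimal sketch, assuming linear-interpolation percentiles (numpy's default); the harness may use a different percentile definition, and the sample values below are made up for illustration:
```python
import numpy as np

def summarize(latencies_ms: list[float]) -> tuple[float, float]:
    """Return (mean, p99) over per-request latencies in milliseconds."""
    samples = np.asarray(latencies_ms, dtype=float)
    return float(samples.mean()), float(np.percentile(samples, 99))

# Hypothetical usage; real inputs would be the pooled timing records
# from the 3 repeatability runs.
mean_ms, p99_ms = summarize([90.1, 91.4, 93.9])
```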
| Set | Runtime HTTP request mean | Runtime HTTP request p99 |
|---|---|---|
| T4 (2026-02-15) | 146279.609 ms | 156769.1 ms |
| G5 (2026-02-15) | 77449.605 ms | 83346.187 ms |
| G5 registry-cached single run (2026-02-16) | 82.913 ms | 91.877 ms |
| Set | Checked | Failed | Strict |
|---|---|---|---|
| T4 | 3 | 0 | true |
| G5 | 3 | 0 | true |