Leaderboard

Side-by-side benchmark results for T4 and G5 experiment sets.

Lower time is better.

G5 Foundation (Canonical)

| Metric | Value |
| --- | --- |
| Baseline pipeline mean | 2407.974 ms |
| Runtime warm request mean (3-run) | 82.707 ms |
| Runtime warm request p99 (3-run) | 91.738 ms |
| Baseline/runtime ratio (pipeline) | 29.11x |
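
The headline ratio in the table is simply the baseline pipeline mean divided by the runtime warm request mean. A minimal sketch of that derivation, using the table's values (variable names are illustrative, not from the benchmark harness):

```python
# G5 Foundation means from the table above (ms).
baseline_pipeline_mean_ms = 2407.974
runtime_warm_mean_ms = 82.707

# Baseline/runtime ratio (pipeline): higher means the warm runtime
# path is that many times faster than the baseline pipeline.
ratio = baseline_pipeline_mean_ms / runtime_warm_mean_ms
print(f"{ratio:.2f}x")  # 29.11x
```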

G5 Cold First-Hit — True TTFT (3-run Means, 2026-02-17)

| Model | Early TTFT | clean4 TTFT | Speedup |
| --- | --- | --- | --- |
| qwen | 27574.564 ms | 1100.044 ms | 25.067x |
| donut | 67360.388 ms | 150.322 ms | 448.107x |
| bart | 77520.798 ms | 125.011 ms | 620.112x |
| minilm | 23.342 ms | 22.621 ms | 1.032x |

All values above are runtime-instrumented timing.ttft_ms readings. A validation rerun (clean7, 2026-02-18) matches these values within run noise.
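
The Speedup column is the early TTFT divided by the clean4 TTFT for each model. A small sketch reproducing the column from the table's values (the dict layout is illustrative, not the benchmark's data format):

```python
# (early TTFT, clean4 TTFT) per model, in ms, from the table above.
ttft_ms = {
    "qwen":   (27574.564, 1100.044),
    "donut":  (67360.388, 150.322),
    "bart":   (77520.798, 125.011),
    "minilm": (23.342, 22.621),
}

# Speedup = early / clean4, rounded to 3 decimals as in the table.
for model, (early, clean4) in ttft_ms.items():
    print(f"{model}: {early / clean4:.3f}x")
```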

Internal vs External Routing (G5, 2026-02-17)

| Metric | Internal | External | Ratio |
| --- | --- | --- | --- |
| Mean latency | 94.849 ms | 97.927 ms | 1.032x |

| Task | Internal | External |
| --- | --- | --- |
| general_short | 150.767 ms | 152.274 ms |
| receipt_extract | 80.732 ms | 81.270 ms |
| search_grounded | 46.945 ms | 57.237 ms |
| summarize_short | 100.950 ms | 100.928 ms |

Internal routing is faster in aggregate and faster on 3/4 tasks in this set (the remaining task is effectively tied).

External Cold Comparison (G5, 2026-02-18)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- | --- |
| runtime | 1003.537 ms | 1108.979 ms | 1339.459 ms | 2342.996 ms |
| pytorch_transformers | - | 528.483 ms | 2288.516 ms | 8725.259 ms |
| vllm | 24032.203 ms | 51.763 ms | 1036.815 ms | 25069.018 ms |
| ollama (GGUF) | 1002.695 ms | 2168.902 ms | 2527.411 ms | 3530.106 ms |

Cold total first response ratio over runtime:

  • PyTorch: 3.724x
  • vLLM: 10.7x
  • Ollama: 1.507x
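
For the backends with a separate startup phase, cold total first response is the startup->healthy time plus the request full latency; the ratios above divide each backend's cold total by the runtime's. A minimal sketch of that decomposition, assuming the values from the table (pytorch_transformers reports no separate startup phase, so its cold total is taken directly from the table):

```python
# Cold total first response (ms) = startup->healthy + request full,
# using the first cold-comparison table's values.
cold_ms = {
    "runtime": 1003.537 + 1339.459,   # 2342.996 ms
    "vllm":    24032.203 + 1036.815,  # 25069.018 ms
    "ollama":  1002.695 + 2527.411,   # 3530.106 ms
    "pytorch_transformers": 8725.259, # reported directly; no startup split
}

# Ratio over runtime: how many times longer each backend's cold
# first response takes compared to the runtime.
for backend in ("pytorch_transformers", "vllm", "ollama"):
    print(f"{backend}: {cold_ms[backend] / cold_ms['runtime']:.3f}x")
```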

Request-path note:

  • vLLM is fastest on request-path TTFT and full latency once it is healthy, but it pays a high startup cost in this run.

External Cold Comparison (G5, 2026-02-18, Runtime Preload + Tokenizer Cache, Non-Parity)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- | --- |
| runtime | 2004.735 ms | 91.596 ms | 271.346 ms | 2276.081 ms |
| pytorch_transformers | - | 522.795 ms | 2252.382 ms | 8374.324 ms |
| vllm | 27036.682 ms | 51.725 ms | 1035.826 ms | 28072.508 ms |
| ollama (GGUF) | 1002.508 ms | 2182.542 ms | 2538.609 ms | 3541.117 ms |

Runtime advantage in this variant:

  • Request full latency vs vLLM: 3.817x faster.
  • Cold total first response vs vLLM: 12.334x faster.
  • TTFT still trails vLLM (91.596 ms vs 51.725 ms).
  • Caveat: runtime was still using 4 decode steps in this run while vLLM/PyTorch/Ollama used 48.

External Cold Comparison (G5, 2026-02-18, Token Parity = 48)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- | --- |
| runtime | 2004.203 ms | 91.207 ms | 2518.142 ms | 4522.345 ms |
| pytorch_transformers | - | 501.946 ms | 2244.327 ms | 11272.831 ms |
| vllm | 27036.248 ms | 51.310 ms | 1075.404 ms | 28111.652 ms |
| ollama (GGUF) | 1002.560 ms | 2197.797 ms | 2556.652 ms | 3559.212 ms |

Parity interpretation:

  • Runtime still wins cold-total first response vs vLLM (6.216x better).
  • vLLM wins request-path TTFT and full latency at equal 48-token budget.

External Cold Comparison (G5, 2026-02-18, Token Parity = 48, Decoder/Sampling Fix)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Response |
| --- | --- | --- | --- | --- |
| runtime | 2004.759 ms | 5.022 ms | 311.289 ms | 2316.048 ms |
| pytorch_transformers | - | 515.597 ms | 2291.854 ms | 8237.561 ms |
| vllm | 24036.762 ms | 52.995 ms | 1094.517 ms | 25131.279 ms |
| ollama (GGUF) | 1002.630 ms | 2184.383 ms | 2543.219 ms | 3545.849 ms |

Post-fix interpretation:

  • Runtime now wins TTFT and full request latency vs vLLM at token parity.
  • Runtime keeps the large cold-total lead vs vLLM.
  • 3-run repeatability means (runtime vs vLLM): runtime 5.022 ms TTFT, 311.444 ms full, 2316.002 ms cold-total; vLLM 51.894 ms TTFT, 1052.767 ms full, 24752.842 ms cold-total.

Historical Legacy Mixed-Mode Context

| Set | Runtime HTTP request mean | Runtime HTTP request p99 |
| --- | --- | --- |
| T4 (2026-02-15) | 146279.609 ms | 156769.1 ms |
| G5 (2026-02-15) | 77449.605 ms | 83346.187 ms |
| G5 registry-cached single run (2026-02-16) | 82.913 ms | 91.877 ms |

Parity Health

| Set | Checked | Failed | Strict |
| --- | --- | --- | --- |
| T4 | 3 | 0 | true |
| G5 | 3 | 0 | true |
