Benchmark Status
What was run already, what still remains, and why.
Direct Answers
- "Did we run the start benchmark (Phase 1 baseline)?" Yes.
- "Did we rerun after true TTFT and cold fixes?" Yes (2026-02-17 on G5).
- "Did we run all benchmarks in the full plan?" Core plan: yes (through Phase 4 + paper package).
- "Did we run a full-depth runtime-vLLM cold check?" Yes (2026-02-25, `--layers 36`, `--pool-mb 16384`).
- "Is there anything else to run?" Yes: the next real blocker is thinking-mode answer emission on GPQA-style tasks. The old sampled reproducibility blocker is fixed, and the non-thinking strict lanes are green.
Latest 2026-03-11 Update: Native Hermes 4B Conversation Lane
- Native Hermes same-VM registration is now the real integration path:
  - `/_vendor/hermes-agent/tools/treni_samevm_tools.py`
  - `/_vendor/hermes-agent/model_tools.py`
  - `/_vendor/hermes-agent/toolsets.py`
  - `/_vendor/hermes-agent/hermes_cli/tools_config.py`
- Tool name audit on the AWS Hermes checkout is clean:
  - 73 total tool names, 73 unique
  - no duplicate `execute_code`, `browser_*`, or `samevm_*` registrations
- Multi-turn 4B conversation bugfixes now in place:
  - unique runtime tool-call IDs from `monolith/server/http.c`
  - compact multi-turn carry-over in `scripts/samevm_agent_conversation_suite.py`
  - structured 400 JSON worker errors in `scripts/treni_local_tool_worker.py`
  - `samevm_rag_ingest` HTTP bridge now preserves worker-side error payloads in `scripts/hermes_same_vm_mvp.py`
- Canonical repaired split workflow artifact: `benchmarks/same_vm_mvp/results/hemkesh-v22_20260311T020710Z.json`
- Result:
  - turn 1: local discovery + grounded facts
  - turn 2: exact facts stored in SQLite and queried back
  - turn 3: broader background stored in RAG and retrieval-checked
  - turn 4: memory note saved
  - turn 5: final recall correctly distinguishes SQLite exact facts vs RAG broader background
- Interpretation:
  - the native Hermes 4B lane is now green for the split real-world persistence workflow,
  - the remaining non-canonical case is the single freeform turn that tries to do SQLite + RAG + memory all at once.
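The compact multi-turn carry-over mentioned in the bugfix list can be sketched roughly as below. This is an illustrative sketch only: `compact_history`, `KEEP_LAST_TURNS`, and `TOOL_STUB_CHARS` are hypothetical names, not the actual `scripts/samevm_agent_conversation_suite.py` API.

```python
# Hypothetical sketch of compact multi-turn carry-over: keep the system
# prompt and the last few user turns verbatim, and collapse older tool
# results to short stubs so the prompt stays under the runtime token cap.
KEEP_LAST_TURNS = 2        # most recent user turns kept in full
TOOL_STUB_CHARS = 120      # older tool outputs truncated to this many chars

def compact_history(messages):
    """Return a compacted copy of an OpenAI-style message list."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Cut point: everything before the last KEEP_LAST_TURNS user turns
    # is eligible for compaction.
    user_idxs = [i for i, m in enumerate(rest) if m["role"] == "user"]
    cut = user_idxs[-KEEP_LAST_TURNS] if len(user_idxs) > KEEP_LAST_TURNS else 0
    compacted = []
    for i, m in enumerate(rest):
        if i < cut and m["role"] == "tool":
            body = str(m.get("content", ""))
            if len(body) > TOOL_STUB_CHARS:
                m = {**m, "content": body[:TOOL_STUB_CHARS] + " …[truncated]"}
        compacted.append(m)
    return system + compacted
```

The point of the design is that exact facts the model still needs live in SQLite/RAG anyway, so older tool payloads do not have to ride along verbatim in every turn.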
Latest 2026-03-08 Update: Deterministic Strict Lane
- Runtime-side request override handling is now serialized in `monolith/server/http.c`, so request-scoped decode overrides no longer race through process-global env state.
- Direct runtime reproducibility proof on AWS:
  - `benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r1.json`
  - `benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r2.json`
  - repeated `temperature=0` IFEval seed-7 runs are identical (`score_mean=0.5625` both).
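A reproducibility proof like the one above reduces to comparing two result artifacts field by field. A minimal sketch, assuming each artifact is a JSON object with a top-level `score_mean` and a per-sample `outputs` list (field names are assumptions, not the exact phase5 artifact schema):

```python
import json

def artifacts_identical(a, b):
    """a, b: parsed result dicts; identical score and per-sample outputs."""
    return (a.get("score_mean") == b.get("score_mean")
            and a.get("outputs") == b.get("outputs"))

def runs_identical(path_a, path_b):
    # Thin wrapper for comparing two on-disk artifacts.
    with open(path_a) as fa, open(path_b) as fb:
        return artifacts_identical(json.load(fa), json.load(fb))
```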
- New deterministic one-host strict matrix: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T204248Z.json`
  - overall:
    - runtime score 0.295139, vLLM score 0.267361
    - runtime latency 824.714 ms, vLLM latency 1572.529 ms
  - gpqa_diamond:
    - score parity (0.166667 vs 0.166667)
    - runtime slower (671.640 ms vs 436.583 ms)
  - ifeval:
    - runtime higher score (0.423611 vs 0.368055)
    - runtime much faster (977.787 ms vs 2708.475 ms)
- Interpretation:
  - the runtime now has a claim-safe deterministic strict lane where it wins overall on both score and latency,
  - sampled runs are now also fixed separately below.
Latest 2026-03-08 Update: Sampled Lane Fixed
- Root cause:
  - the bug was in `scripts/phase5_awareness_realbench.py`, not in runtime decode math,
  - the shared first-pass used for `arm_a_control` skipped the request seed and task-specific decode payload.
- Post-fix runtime-only sampled reproducibility probes on AWS:
  - `benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r1.json`
  - `benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r2.json`
- Result:
  - repeated sampled IFEval seed-7 runs are identical (`score_mean=0.3125` both),
  - all 8/8 outputs are identical across the two reruns.
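The class of bug fixed here is easy to reintroduce, so it is worth stating concretely: every arm's first pass must carry the request seed and the task-specific decode payload, never a shared unseeded default. A hypothetical sketch (`build_request` and `TASK_DECODE` are illustrative names, not the harness API):

```python
# Illustrative per-task decode overrides; the real values live in the
# harness config, these numbers are placeholders.
TASK_DECODE = {
    "ifeval": {"temperature": 0.7, "top_p": 0.95, "max_tokens": 512},
    "gpqa_diamond": {"temperature": 0.7, "top_p": 0.95, "max_tokens": 768},
}

def build_request(task, prompt, seed):
    """Build one chat request carrying both the task payload and the seed."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    payload.update(TASK_DECODE[task])   # task-specific decode payload
    payload["seed"] = seed              # request seed must not be dropped
    return payload
```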
- New post-fix sampled one-host strict matrix: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T220806Z.json`
  - overall:
    - runtime score 0.409722, vLLM score 0.302083
    - runtime latency 1617.187 ms, vLLM latency 2017.206 ms
  - gpqa_diamond:
    - runtime higher score (0.3750 vs 0.2500)
    - runtime slower (710.693 ms vs 435.823 ms)
  - ifeval:
    - runtime higher score (0.4444 vs 0.3542)
    - runtime faster (2523.680 ms vs 3598.588 ms)
- Immediate repeatability confirmation: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T221823Z.json`
  - overall:
    - runtime score 0.409722, vLLM score 0.281250
    - runtime latency 1607.757 ms, vLLM latency 2008.759 ms
- Interpretation:
  - sampled-lane drift is fixed,
  - the new sampled strict lane is promotable and runtime wins overall on both score and latency,
  - and a second full-matrix rerun stays aligned with that conclusion.
Latest 2026-03-08 Update: Larger-N Sampled Strict Confirmation
- Stronger sampled strict matrix: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T235013Z.json`
- Result (16 samples/task, 3 seeds):
  - overall:
    - runtime score 0.371528, vLLM score 0.296875
    - runtime latency 1255.344 ms, vLLM latency 1585.043 ms
  - gpqa_diamond:
    - runtime higher score (0.3750 vs 0.3125)
    - runtime slower (801.900 ms vs 433.256 ms)
  - ifeval:
    - runtime higher score (0.368056 vs 0.281250)
    - runtime faster (1708.789 ms vs 2736.831 ms)
- Interpretation:
  - the non-thinking sampled win is now stronger than the original 8-sample pass,
  - score and latency stay runtime-positive overall with positive confidence intervals.
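The "positive confidence intervals" claim can be checked with a simple bootstrap over per-sample score deltas (runtime minus vLLM): if the 95% interval stays above zero, the win is claim-safe. A minimal sketch, not the harness's actual statistics code:

```python
import random

def bootstrap_delta_ci(runtime_scores, vllm_scores, iters=2000, seed=0):
    """95% bootstrap CI of the mean per-sample score delta."""
    rng = random.Random(seed)
    deltas = [r - v for r, v in zip(runtime_scores, vllm_scores)]
    means = []
    for _ in range(iters):
        sample = [rng.choice(deltas) for _ in deltas]  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * iters)], means[int(0.975 * iters)]
```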
Latest 2026-03-08 Update: Thinking-Mode Parity Exploration
- First explicit thinking-mode strict matrix: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T223442Z.json`
- Budget-fixed follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T224358Z.json`
- Finalized follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T235628Z.json`
- Lower-cost finalized follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T010353Z.json`
- Key result:
  - the lane is no longer all-zero on `gpqa_diamond`; the close-form finalize pass makes it measurable
  - lower-cost finalized result (`gpqa_max_tokens=256`):
    - overall:
      - runtime score 0.250000, vLLM score 0.194444
      - runtime latency 6823.816 ms, vLLM latency 7503.000 ms
    - gpqa_diamond:
      - runtime score 0.166667, vLLM score 0.166667
      - runtime near parity on latency (7727.880 ms vs 7741.028 ms)
    - ifeval:
      - runtime score 0.333333, vLLM score 0.222222
      - runtime faster (5919.753 ms vs 7264.973 ms)
- One-example long-budget probes:
  - `benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_runtime_1024_20260308T230352Z.json`
  - `benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_vllm_1024_20260308T230352Z.json`
- Interpretation:
  - the old blockers were real and are now fixed enough to measure the lane:
    - runtime no longer hard-clips at 512,
    - long-decode host-buffer corruption is fixed,
    - close-form finalize converts length-exhausted reasoning into parseable answers,
  - the better current thinking tradeoff is the reduced-budget finalized lane:
    - runtime still leads on score overall,
    - and with `gpqa_max_tokens=256` it now also beats vLLM overall on latency.
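The close-form finalize idea described above can be sketched as a single follow-up request issued only when the first pass exhausts its token budget: instead of discarding an unparseable length-truncated reasoning trace, ask for the final answer alone with thinking off. `chat` here is a hypothetical stand-in for the real client call, not an actual API:

```python
def finalize_if_exhausted(chat, messages, first):
    """first: dict with 'finish_reason' and 'content' from the first pass."""
    if first.get("finish_reason") != "length":
        return first["content"]          # normal stop: nothing to repair
    followup = messages + [
        {"role": "assistant", "content": first["content"]},
        {"role": "user",
         "content": "Based on your reasoning so far, reply with only the "
                    "final answer letter, nothing else."},
    ]
    # Tiny budget, thinking disabled: just emit the answer.
    second = chat(followup, max_tokens=16, thinking=False)
    return second["content"]
```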
- Early GSM8K-only finalized thinking follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T022347Z.json`
  - result (32 samples/task, 3 seeds):
    - runtime score 0.197917, vLLM score 0.177083
    - runtime latency 7174.829 ms, vLLM latency 7643.231 ms
  - interpretation:
    - this extends the closed-form thinking lane beyond `gpqa_diamond`,
    - runtime remains directionally ahead on both score and latency,
    - but the score CI still crosses zero, so this GSM8K thinking lane is exploratory, not claim-safe yet.
- AIME25 isolated follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T021732Z.json`
  - result (8 samples, 1 seed, 512 thinking tokens, patched AIME prompts):
    - runtime score 0.0, vLLM score 0.0
    - runtime latency 19776.254 ms, vLLM latency 16092.718 ms
  - interpretation:
    - AIME25 does not recover even after an AIME-specific prompt/finalize pass adjustment,
    - so this is currently a task-family limitation, not a benchmark-wide thinking win.
- AIME25 second-thinking recovery attempt: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T021331Z.json`
  - result (8 samples, 1 seed):
    - runtime score 0.0, vLLM score 0.0
    - runtime latency 21409.322 ms, vLLM latency 22110.402 ms
  - interpretation:
    - giving AIME a second short thinking pass did not recover score,
    - that experiment is non-canonical and should not replace the lower-cost default finalized path.
Late 2026-03-08 Update: Fast Sampler + Tie-Stable AB3
- After the hybrid prefill fix, sampled decode became the dominant remaining hotspot.
- Focused GPQA probe after fast top-k sampling: `q35-gpqa-profile-aws-samplefast1_20260308T003727Z.json`
  - first-call moves:
    - `decoder_step0_logits_sample` 40.701 -> 3.538 ms
    - `decoder_ttft` 1019.079 -> 982.663 ms
  - step-N moves:
    - `decoder_stepN_sample_mean` 37.090 -> 2.366 ms
    - `decoder_stepN_total_mean` 47.748 -> 12.721 ms
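The size of this sampler win is consistent with replacing full-vocabulary work with a partial top-k selection before the softmax/sample step. A pure-Python sketch of the idea (the real fix lives in the CUDA decode path, not in Python):

```python
import heapq
import math
import random

def topk_sample(logits, k, rng):
    """Sample a token id from the top-k logits only.

    Partial selection is O(V log k) instead of a full O(V log V) sort,
    and the softmax then touches k entries instead of the whole vocab.
    """
    top = heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
    m = logits[top[0]]                                  # max logit, for stability
    weights = [math.exp(logits[i] - m) for i in top]    # softmax numerators over k
    return rng.choices(top, weights=weights, k=1)[0]
```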
- First clean strict AB3 after fast sampling: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T003749Z.json`
  - overall:
    - runtime score 0.305556, vLLM score 0.347222
    - runtime latency 1405.707 ms, vLLM latency 1676.336 ms
- interpretation:
  - runtime flipped to a real overall latency win,
  - but quality regressed enough that this run was not promotable as canonical.
- Tie-stable sampler follow-up:
  - one-seed proof: `phase5_qwen35_remote_strict_matrix_20260308T004511Z.json`
    - runtime wins both score (0.4375 vs 0.375) and latency (1497.984 ms vs 2026.199 ms) on seed 7
  - full AB3 rerun: `phase5_qwen35_remote_strict_matrix_20260308T004758Z.json`
    - overall:
      - runtime score 0.315972, vLLM score 0.347222
      - runtime latency 1422.818 ms, vLLM latency 1659.878 ms
    - task split:
      - gpqa_diamond: runtime better score (0.291667 vs 0.208333) but still slower
      - ifeval: runtime lower score (0.340278 vs 0.486111) but much faster
- Interpretation:
  - the runtime now has a clean strict latency lead on this Qwen3.5 one-host matrix,
  - the remaining work is score recovery, not another large latency rescue.
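The tie-stable part of the sampler can be illustrated with greedy argmax: when several tokens share the maximum logit (common at low temperature), always resolve the tie the same way, here by lowest token id, so repeated runs cannot flip between equally scored tokens. A sketch of the idea, not the runtime's kernel:

```python
def argmax_tie_stable(logits):
    """Return the lowest token id among the maximal logits."""
    best_id, best_val = 0, logits[0]
    for i, v in enumerate(logits):
        if v > best_val:           # strictly greater only: the first
            best_id, best_val = i, v  # (lowest-id) occurrence of the max wins
    return best_id
```

A parallel reduction that compares with `>=` instead of `>` (or reduces in nondeterministic order) can return different winners across runs even with identical logits, which is exactly the instability the tie-stable follow-up removed.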
Late 2026-03-08 Update: Batched Hybrid Qwen3.5 Prefill
- Qwen3.5 hybrid prompt prefill is now materially improved on AWS:
  - new code paths: `monolith/models/decoder.cu`, `monolith/main.c`, `monolith/include/treni_models.h`
  - changes:
    - batched linear-attention hidden forward for sequence prefill,
    - batched full-attention prefill with K/V cache materialization,
    - hybrid layer-major prefill in `main.c` instead of token-by-token fallback for Qwen3.5 prompt runs.
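Why batching helps can be shown with a toy stand-in: token-by-token prefill issues one projection "launch" per prompt token, while batched prefill produces the same K/V cache in a single pass. This is an illustrative model of the launch-count difference, not `decoder.cu`:

```python
LAUNCHES = {"n": 0}  # crude stand-in for kernel-launch count

def kv_project_batch(embs, w):
    """Project a batch of embeddings through weight columns w (one 'launch')."""
    LAUNCHES["n"] += 1
    return [[sum(a * b for a, b in zip(e, col)) for col in w] for e in embs]

def prefill_token_by_token(prompt_embs, w):
    cache = []
    for emb in prompt_embs:                      # one launch per token (slow path)
        cache.extend(kv_project_batch([emb], w))
    return cache

def prefill_batched(prompt_embs, w):
    return kv_project_batch(prompt_embs, w)      # single launch (fast path)
```

Both paths produce an identical cache; the fast path just amortizes launch and upload overhead across the whole prompt, which is where the 3263 ms -> 275 ms prefill move below comes from.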
- Focused GPQA profile progression:
  - old clean profile: `q35-gpqa-profile-aws-clean_20260307T220200Z.json`
    - `decoder_prefill=3263.527 ms`, `decoder_ttft=3317.441 ms`
  - linear-batch profile: `q35-gpqa-profile-aws-linearbatch_20260307T235448Z.json`
    - `decoder_prefill=1341.628 ms`, `decoder_ttft=1405.739 ms`
  - full-batch profile: `q35-gpqa-profile-aws-fullbatch_20260308T000420Z.json`
    - `decoder_prefill=275.372 ms`, `decoder_ttft=1017.876 ms`
- Latest strict AB3 summary: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T000429Z.json`
  - overall:
    - runtime score 0.413195, vLLM score 0.347222
    - runtime latency 2940.172 ms, vLLM latency 1686.263 ms
  - task split:
    - gpqa_diamond: runtime 0.458333 vs vLLM 0.208333, runtime 1347.582 ms vs vLLM 512.075 ms
    - ifeval: runtime 0.368055 vs vLLM 0.486111, runtime 4532.763 ms vs vLLM 2860.452 ms
- Interpretation:
  - this older AB3 still matters because it proved prompt prefill was a real architectural blocker and fixable,
  - but it is no longer the latest strict state after the fast-sampler reruns above.
Late 2026-03-07 Update: ORPO Reload + Cache-Tier A/B
- Same-VM ORPO reload loop is now real on AWS:
  - artifact: `benchmarks/same_vm_mvp/results/samevm-orpo-reload-aws_20260307T222341Z.json`
  - path proved:
    - local ORPO job finishes,
    - adapter output is merged into a full HF model dir,
    - merged model is packed into a new monolith container,
    - a second runtime is restarted against that new container,
    - that runtime answers a real chat request.
- Important scope note:
  - this proof currently uses the Qwen2.5 ORPO demo model path, not the main Qwen3.5 strict benchmark target.
  - the harness/control-plane path is real; the remaining work is promoting the same self-improvement loop onto the main target family.
- Qwen3.5 smarter shared-prefix tiering now has a clean runtime-side A/B:
  - direct sequential GPQA profile: `q35-gpqa-profile-aws-seq2-cap112_20260307T222540Z.json`
    - second related request improved: `decoder_prefill` 2696.101 -> 2544.202 ms, `decoder_ttft` 2747.697 -> 2595.907 ms
  - clean strict seed-7 spot A/B:
    - cap112: `phase5_qwen35_remote_strict_matrix_20260307T223218Z.json`
    - cap64: `phase5_qwen35_remote_strict_matrix_20260307T223555Z.json`
  - runtime-only latency effect (112 - 64):
    - overall: -363.908 ms
    - gpqa_diamond: -420.699 ms
    - ifeval: -307.116 ms
  - quality effect on this one-seed spot:
    - overall score unchanged (0.291667 both),
    - per-task scores moved in opposite directions (gpqa down, ifeval up), so this is not a quality claim yet.
- Non-canonical artifact note:
  - `phase5_qwen35_remote_strict_matrix_20260307T222736Z.json` is contaminated and should not be cited.
  - cause: an ORPO demo runtime was still alive on port 18081 and holding GPU memory during that A/B run.
  - the clean strict comparison is `20260307T223218Z` vs `20260307T223555Z`.
- Qwen3.5 strict launcher/config drift is now corrected across the strict AWS runner and the same-VM worker/runtime launcher:
  - shared env source: `scripts/qwen_runtime_env.py`
  - clean AB3 artifact: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T231500Z.json`
  - effect vs the older clean cap64 AB3 (`20260307T225716Z`):
    - runtime overall score: 0.333334 -> 0.335648
    - runtime overall latency: 3801.258 -> 3690.124 ms
    - runtime ifeval score: 0.333333 -> 0.421296
    - runtime gpqa_diamond latency: 2857.693 -> 2753.838 ms
  - current clean paired result:
    - overall: runtime 0.335648 vs vLLM 0.291667, runtime 3690.124 ms vs vLLM 1646.672 ms
    - gpqa_diamond: runtime 0.25 vs vLLM 0.25, runtime 2753.838 ms vs vLLM 529.098 ms
    - ifeval: runtime 0.421296 vs vLLM 0.333333, runtime 4626.410 ms vs vLLM 2764.246 ms
  - interpretation: score-side evidence improved, but the remaining blocker is still long-prompt prefill latency.
  - code-level explanation:
    - `monolith/models/decoder.cu` rejects `treni_decoder_forward_f32(...)` for `ctx->is_linear_attn`,
    - Qwen3.5 linear-attention is therefore only covered by cached/token decode today,
    - so strict long-prompt Qwen3.5 prefill still runs through the token-by-token cached loop in `monolith/main.c`.
Qwen3.5 One-Host Strict Rerun + Request-Path Fixes (2026-03-07)
- New contract-validation artifacts on the active AWS host:
  - tokenizer audit: `benchmarks/qwen35_tokenizer_audit/results/qwen35-tokenizer-audit-active_20260307T173024Z.json`
  - runtime smoke: `benchmarks/qwen35_smoke/results/qwen35-runtime-smoke-active2_20260307T173132Z.json`
  - isolated semantic A/B: `benchmarks/qwen35_smoke/results/qwen35-isolated-ab-active_20260307T173228Z.json`
- New strict one-host matrix runner: `scripts/phase5_qwen35_remote_strict_matrix.py`
- New late strict one-host matrix summary after request-path fixes:
  - `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T191653Z.json`
  - `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T191653Z.md`
- Contract status:
  - the packed tokenizer/full vocab now matches HF exactly for `Qwen/Qwen3.5-0.8B` (248077 tokens),
  - the runtime extended non-thinking smoke passes 7/7 cases on the active AWS host,
  - the isolated non-thinking probe A/B is mixed but useful:
    - runtime `all_ok=true`,
    - vLLM `all_ok=false` in that probe harness because the current text-only launch rejects multimodal placeholders and the forced-thinking exact-output case still ends at `finish_reason=length`.
- Request-path changes validated before the late rerun:
  - the Qwen3.5 decoder prefix cache is now default-on (`TRENI_DECODER_PREFIX_CACHE=1`, 64 prefix tokens),
  - `timing.ttft_ms` now includes request-path pre-decode time plus decoder first-token timing, not only the decode-loop step-0 proxy,
  - a repeated prompt-family hot probe on AWS dropped from `infer_ms` ~1798.5 -> 842.4 ms and `ttft_ms` ~1531.9 -> 782.5 ms on the second related request with a logged prefix-cache hit.
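The `ttft_ms` definition change amounts to starting the clock at request send rather than at decode step 0, so template build, tokenization, and tensor upload are all counted. A client-side sketch of that measurement:

```python
import time

def timed_first_token(stream):
    """Consume an iterable of streamed tokens.

    Returns (ttft_ms, tokens): TTFT is measured from just before iteration
    starts, so all server-side pre-decode work is included.
    """
    start = time.perf_counter()
    tokens, ttft_ms = [], None
    for tok in stream:
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000.0
        tokens.append(tok)
    return ttft_ms, tokens
```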
- Strict one-host realbench result (`gpqa_diamond` + `ifeval`, Arm A, seeds 7/17/27, 8/task, `request_logprobs=false`):
  - overall score: runtime 0.333333 vs vLLM 0.315972
  - overall latency: runtime 3809.745 ms vs vLLM 1626.068 ms
  - task split:
    - gpqa_diamond: runtime 0.291667 vs vLLM 0.291667, runtime 2867.493 ms vs vLLM 418.173 ms
    - ifeval: runtime 0.375000 vs vLLM 0.340278, runtime 4751.996 ms vs vLLM 2833.964 ms
- Status impact:
  - Qwen3.5 compatibility is no longer the question; the tokenizer/chat/tool contract is working.
  - Score is no longer behind on this strict set.
  - The remaining blocker is request-path latency, especially benchmark-prompt prefill behavior.
Same-VM Wrapper Recovery (2026-03-07)
- The explicit same-VM AWS wrapper is now recovered and usable: `benchmarks/same_vm_mvp/results/samevm-q35-stack_20260307T172158Z.json`
- Current entrypoints:
  - `scripts/hermes_same_vm_mvp.py`
  - `scripts/run_samevm_qwen35_stack.sh`
- What changed in the harness:
  - the runtime prompt-token cap is now passed explicitly (4096) for Hermes-started Qwen3.5 runs,
  - the system prompt only advertises tools actually loaded in the session,
  - the wrapper no longer loads unrelated builtin tools by default,
  - the final wrapper response is now deterministically rewritten from tool outputs when the model emits malformed JSON-like summaries.
- End-to-end result on the recovered wrapper path:
  - runtime health: ok
  - extended smoke: PASS, 7/7 cases
  - case latencies:
    - plain_chat: 234.641 ms
    - multi_turn_memory: 422.339 ms
    - multimodal_content_items: 647.002 ms
    - tool_call_first_turn: 3080.0 ms
    - tool_followup_after_result: 2596.057 ms
    - thinking_plain_chat: 1194.161 ms
    - tool_followup_after_result_no_tool_call_id: 3339.867 ms
- Current caveat:
  - the runtime log still shows an intermittent first tool-turn retry in one observed wrapper run (`compute/ops.cu:765`, invalid argument during prefill gather). The request recovered and the smoke artifact still passed, but this retry path is not yet closed.
- New multimodal same-VM tool surface is now wired in code:
  - status/embed/rerank/tts/stt live in:
    - `scripts/samevm_multimodal_models.py`
    - `scripts/treni_local_tool_worker.py`
    - `scripts/hermes_same_vm_mvp.py`
    - `scripts/samevm_stack_probe.py`
  - defaults:
    - embedding: `Qwen/Qwen3-VL-Embedding-2B`
    - reranker: `Qwen/Qwen3-VL-Reranker-2B`
    - tts: `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`
    - stt: `Qwen/Qwen3-ASR-0.6B`
    - whisper fallback: supported when `model` contains `whisper`
- worker-level smoke on `POST /v1/mm/status` passes and reports the AWS machine state accurately.
- runtime-admin proof on AWS is now clean: `benchmarks/same_vm_mvp/results/samevm-q35-runtime-admin-proof-v5_20260307T212852Z.json`
  - Hermes calls `samevm_runtime_status` + `samevm_multimodal_status`, and the wrapper deterministically rewrites the final summary from tool outputs if the model truncates.
- first real same-VM local-tool stack proof is complete on AWS: `benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v5.json`
  - covered in one pass: runtime status, SQLite exec/query, RAG ingest/search, TTS, Qwen ASR STT, embedding, reranking
  - observed outputs:
    - SQLite rows: 1
    - RAG top hit: `Same VM locality`
    - TTS output path: `/home/ubuntu/treni/benchmarks/same_vm_mvp/results/samevm_probe_tts.wav`
    - Qwen ASR transcript: usable but still imperfect on the synthetic voice (`Treni` was still heard as `Trinity`)
    - embedding dim: 2048
    - rerank top document: the local-inference sentence ranked first
  - current caveat:
    - timestamped STT still depends on the forced-aligner path and enough local disk to materialize that model on the AWS box
- new ORPO control-plane proof is complete: `benchmarks/same_vm_mvp/results/samevm-orpo-probe-aws_20260307T215307Z.json`
  - one local dataset write + background ORPO train + job polling cycle completed with `returncode=0`
  - hot-reload of trained output back into the runtime is still not implemented
- new operational fix:
  - the local multimodal worker was retaining about 13.3 GiB of GPU memory after model loads, which can starve the runtime and invalidate latency experiments
  - mitigation now exists via `POST /v1/mm/clear_cache`
  - the status endpoint now exposes loaded multimodal models and current CUDA allocation/reservation
Canonical Same-VM MVP (2026-03-10)
- The canonical investor-demo same-VM MVP is now green on AWS:
  - `benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json`
  - `benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.md`
- What the canonical v15 run proves in one flow:
  - local Qwen3.5 runtime health: ok
  - local tool worker health: ok
  - Hermes runtime-status tool call: ok
  - Hermes multimodal-status tool call: ok
  - direct same-VM runtime smoke: `all_ok=True` on the basic non-thinking profile (5 cases, includes first-turn tool calling)
  - direct same-VM runtime thinking smoke: `all_ok=True` on the extended/thinking profile with exact-match checks
  - local stack probe: SQLite + RAG + embedding + reranking + TTS + Qwen ASR STT all pass
  - Qwen3.5 ORPO reload proof is available and reused from the latest successful sidecar artifact: `benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json`
  - sidecar cleanup now stops cleanly on port 18081
  - multimodal cache clear runs at the end and returns GPU memory close to idle
- Additional post-v15 Hermes tool proofs on AWS:
  - SQLite query via Hermes: `benchmarks/same_vm_mvp/results/hermes-demo-sqlite-query-v2.json`
    - returned row count 1 from `demo_notes_v3`
  - RAG search via Hermes: `benchmarks/same_vm_mvp/results/hermes-demo-rag-search-v1.json`
    - returned a valid top result for `same machine`
  - TTS via Hermes: `benchmarks/same_vm_mvp/results/hermes-tts-v2.json`
    - generated `/home/ubuntu/treni/benchmarks/same_vm_mvp/results/hermes_tts_v2.wav`
  - STT via Hermes: `benchmarks/same_vm_mvp/results/hermes-stt-v2.json`
    - transcribed the generated WAV successfully
- Current observed v15 stack outputs:
  - SQLite rows: 1
  - RAG top hit: `Same VM locality`
  - embedding dim: 2048
  - top reranked text: `Treni keeps inference and tools on one local machine.`
  - TTS output path: `/home/ubuntu/treni/benchmarks/same_vm_mvp/results/samevm_probe_tts.wav`
  - Qwen ASR STT transcript is directionally correct but still imperfect on synthetic voice ("Trinity" drift observed in the current probe)
- Live speed snapshot on the current AWS Qwen3.5 runtime (2026-03-10): 3 deterministic runs on a 130-token response
  - mean `infer_ms`: about 1156.9
  - mean `ttft_ms`: about 98.6
  - mean end-to-end throughput: 112.37 tok/s
  - mean decode-only throughput: 121.90 tok/s
- Live current-model speed probe on AWS (2026-03-10):
  - `qwen35` (0.8B): 128 completion tokens in about 1111.4 ms, `ttft_ms` ≈ 103.1, `decode_tps` ≈ 115.37
  - `qwen35_4b` (4B): 128 completion tokens in about 3313.3 ms, `ttft_ms` ≈ 170.7, `decode_tps` ≈ 38.64
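The throughput numbers above follow the usual definitions: end-to-end tok/s divides completion tokens by total inference time, and decode-only tok/s excludes the first token and its TTFT. A small helper that reproduces the 130-token snapshot numbers under those definitions (the exact formulas used by the probe script are an assumption):

```python
def throughput(completion_tokens, infer_ms, ttft_ms):
    """Return (end-to-end tok/s, decode-only tok/s), rounded to 2 places."""
    e2e_tps = completion_tokens / (infer_ms / 1000.0)
    # Decode-only: the first token belongs to TTFT, so exclude both the
    # token and its time from the decode-rate calculation.
    decode_tps = (completion_tokens - 1) / ((infer_ms - ttft_ms) / 1000.0)
    return round(e2e_tps, 2), round(decode_tps, 2)
```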
- Real-world document caveat:
  - current same-VM RAG ingests text payloads, text files, and raw PDF paths directly
  - live worker proof now ingests `/home/ubuntu/treni/benchmarks/same_vm_mvp/data/manual-pncp-api.pdf` through `samevm_rag_ingest(paths=[...])`
- Runtime compatibility note:
  - the live runtime now accepts both `/v1/chat/completions` and `/chat/completions`
  - the live runtime now exposes both `/v1/models` and `/models`
  - Hermes can therefore target the runtime root URL directly on AWS without wrapper-specific path rewriting
- Scope note:
  - the canonical MVP acceptance gate now includes both the basic non-thinking runtime smoke lane and the extended/thinking runtime smoke lane.
  - the extended non-thinking lane now also passes on AWS (`benchmarks/qwen35_smoke/results/postmvp-extended_20260308T185130Z.json`, 7/7 cases).
  - latest passing thinking artifact: `benchmarks/qwen35_smoke/results/postmvp-extended-thinking-r4_20260308T193858Z.json`.
Clean GPQA Runtime Profile (2026-03-07)
- New direct runtime profile artifact: `benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-clean_20260307T220200Z.json`
- New probe runner: `scripts/q35_gpqa_profile_once.py`
- Method:
  - restart the runtime cleanly on AWS with `TRENI_STEP0_PROFILE=1` and `TRENI_DECODE_STAGE_PROFILE=1`
  - send the same real GPQA prompt twice through the Qwen3.5 runtime API path
  - parse timing lines from the managed runtime log
- Result:
  - call 1:
    - `decoder_tensor_upload`: 218.091 ms
    - `decoder_prefill`: 3263.527 ms
    - `decoder_ttft`: 3317.441 ms
  - call 2:
    - `decoder_tensor_upload`: 11.216 ms
    - `decoder_prefix_cache_copy`: 0.162 ms
    - `decoder_prefill`: 2690.001 ms
    - `decoder_ttft`: 2750.672 ms
  - step-0 decode is not the main limiter:
    - `decoder_step0_layers`: about 8 ms
    - `decoder_step0_logits_sample`: about 33-36 ms
- Interpretation:
  - the current strict GPQA latency gap is still dominated by prefill, not tokenizer cost and not decoder step-0.
  - prefix cache helps, but only partially on this prompt family.
  - the next optimization target remains the long-prompt prefill/kernel path, not sampling logic.
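The "parse timing lines from the managed runtime log" step can be sketched with a regex over `name: X ms` lines; the exact log format is an assumption based on the metric names reported above, not a verified excerpt of the runtime log:

```python
import re

# Matches lines like "decoder_prefill: 3263.527 ms"
TIMING_RE = re.compile(r"(decoder_[a-z0-9_]+):\s*([0-9.]+)\s*ms")

def parse_timings(log_text):
    """Collect every timing metric into {name: [ms, ms, ...]} in log order."""
    out = {}
    for name, ms in TIMING_RE.findall(log_text):
        out.setdefault(name, []).append(float(ms))
    return out
```

Keeping every occurrence (rather than only the last) is what lets the probe compare call 1 against call 2 for the same metric.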
Qwen3.5 Probe Matrix + Same-VM MVP (2026-03-06)
- New tokenizer/full-vocab audit is complete: `benchmarks/qwen35_tokenizer_audit/results/runtime-q35-tokenizer-audit-r4_20260306T190418Z.json`
  - result: the packed runtime tokenizer matches HF exactly at full-vocab level for `Qwen/Qwen3.5-0.8B` (248077 tokens), with control probes like `<think>`, `<|im_start|>`, `<|vision_start|>`, and `<|image_pad|>` all aligned.
- New endpoint smoke/probe work is complete:
  - base smoke: `benchmarks/qwen35_smoke/results/runtime-q35-smoke-r2_20260306T190530Z.json`
  - consolidated matrix: `benchmarks/qwen35_smoke/results/qwen35-probe-matrix-r2_20260306T200035Z.json`
- Probe matrix summary (`profile=extended`, same cases on both backends):
  - runtime non-thinking: `all_ok=true`
  - runtime thinking: `all_ok=true`, but outputs are verbose and the tool path is very slow
  - vLLM non-thinking: `all_ok=false`
  - vLLM thinking: `all_ok=false`
- Important case-level interpretation:
  - runtime non-thinking is the strongest current functional lane for Qwen3.5:
    - plain_chat: 387.672 ms
    - multi_turn_memory: 573.434 ms
    - tool_call_first_turn: 5885.725 ms
    - tool_followup_after_result: 4406.168 ms
  - vLLM non-thinking is much faster on the tool path:
    - plain_chat: 112.543 ms
    - tool_call_first_turn: 1162.850 ms
    - tool_followup_after_result: 490.202 ms
  - vLLM failures in this matrix are concrete and expected from launch/config:
    - the multimodal placeholder case fails because the current launch is `--language-model-only`
    - several thinking/exact-output cases stop at `finish_reason=length`
- Same-VM harness status:
  - Hermes same-VM Qwen3.5 smoke succeeds: `benchmarks/same_vm_mvp/results/hermes-samevm-q35-smoke-r5_20260306T192703Z.json`
  - Hermes same-VM ORPO smoke-train succeeds and launches a real job: `benchmarks/same_vm_mvp/results/hermes-samevm-q35-orpo-smoke-r1_20260306T194152Z.json`
  - local worker ORPO run completed successfully on-host:
    - training output: `benchmarks/same_vm_mvp/trainings/samevm-orpo-qwen25-smoke3/`
- Status impact:
  - Qwen3.5 runtime is now functionally testable and smoke-clean in a real same-VM harness.
  - The main blocker is no longer “does it run?”; it is the long-prompt/tool latency gap and thinking-mode output discipline.
Phase 5 Strict Parse-Fix AB3 (2026-03-04)
- New paired AB3 summary (`gpqa_diamond` + `ifeval`, Arm A, seeds 7/17/27, 16/task, `request_logprobs=false`) is published:
  - `benchmarks/phase5_awareness_realbench/results/phase5_q35_runtime_vs_vllm_ifeval_gpqa_ab3_20260304T034227Z.json`
  - `benchmarks/phase5_awareness_realbench/results/phase5_q35_runtime_vs_vllm_ifeval_gpqa_ab3_20260304T034227Z.md`
- Outcome:
  - Overall score: runtime 0.3403 vs vLLM 0.3229 (small runtime edge, CI includes parity).
  - Overall latency: runtime 1772.931 ms vs vLLM 1553.034 ms (runtime slower on aggregate due to GPQA).
  - Task-family split:
    - gpqa_diamond: score parity, runtime latency deficit remains large.
    - ifeval: runtime is both faster and slightly higher-scoring.
- Status impact:
  - the strict matrix is now better framed as task-family stratified, not universal runtime superiority yet.
Phase 5 Real-Benchmark Update (2026-03-01)
- Canonical diagnostic run is now complete on the active G5 host: `phase5_awareness_realbench_qwen-realbench-r5-tokenizerfix2_20260301T114510Z.json`
- Runtime fixes validated before this run:
  - full message aggregation in the HTTP path (`system` + `user`, not just the last message),
  - prompt cap default increased (32 -> 256),
  - tokenizer BPE merges + `added_tokens` loading + improved pretokenization/decode behavior.
- r5 key outcomes (max-samples-per-task=8):
  - gpqa_diamond: A=0.500, B=0.500, C=0.375
  - ifeval: A=0.5625, B=0.5625, C=0.5625
  - gsm8k: A/B/C=0.0
  - aime25: A/B/C=0.0
- Qwen-template auto mode A/B (r6: `phase5_awareness_realbench_qwen-realbench-r6-qwentpl1_20260301T120235Z.json`) regressed quality and latency vs r5, so this mode is kept opt-in-only (env-controlled) and not canonical.
- HF-reference parity run on the same sampled set is now complete: `phase5_hf_reference_qwen_r5_20260301T1900Z.json`
  - score deltas (HF minus runtime Arm A): gpqa -0.25, ifeval +0.0625, gsm8k 0.0, aime25 0.0
  - key claim-safe interpretation: the GSM8K/AIME 0.0 is not runtime-only breakage in this setup (the HF control is also 0.0).
- Current status:
  - first real-data set is run and documented,
  - claim-safe parity interpretation is now locked for this sampled set,
  - next open work is raising the math-task quality floor (prompt/eval/model-task fit), not proving runtime-vs-HF parity existence.
Phase 5 + Qwen05 Follow-up (2026-03-02)
- qwen05 deterministic empty-completion parity gap is now resolved in the runtime:
  - root cause fixed in the HTTP chat-template build (inject a default Qwen system preamble when no `system` message is present),
  - validation artifact (runtime + vLLM): `benchmarks/phase2_external_cold/results/external_cold_qwen05_templatefix_20260302T154019Z.json`
  - same harness rerun (`--no-vllm-ignore-eos`): `benchmarks/phase2_external_cold/results/external_cold_qwen05_templatefix_nofixeos_20260302T154151Z.json`
  - key signal: the runtime now returns non-empty output (`usage_completion_tokens=3`, `completion_chars=241`) instead of a token-0 stop.
- qwen05 Phase 5 diagnostic rerun completed after the parity fix:
  - `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen05-realbench-r2-templatefix1_20260302T154443Z.json`
  - quality remains low on this small model/sample (`A/B/C` all `0.0` across tasks in this run), so this is a correctness fix, not a quality win.
- canonical `qwen` rerun with matched depth and sample count is now complete:
  - new run (`layers=36`, `max_samples_per_task=8`): `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen-realbench-r9-templatefix1-l36s8_20260302T161123Z.json`
  - prior canonical reference: `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen-realbench-r5-tokenizerfix2_20260301T114510Z.json`
  - `r9` outcomes:
    - `gpqa_diamond`: `A/B/C = 0.125 / 0.125 / 0.125`
    - `ifeval`: `A/B/C = 0.500 / 0.5625 / 0.5625`
    - `gsm8k`: `A/B/C = 0.625 / 0.625 / 0.750`
    - `aime25`: `A/B/C = 0.000 / 0.000 / 0.125`
  - overall awareness deltas vs A: `B +0.015625`, `C +0.078125`
- interpretation:
  - the math floor improved materially (`gsm8k`, `aime25` Arm C) vs `r5`,
  - GPQA dropped vs `r5`, so the current claim stays mixed by task family (not a universal quality uplift yet).
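The qwen05 fix above injects a default system preamble when the request carries none. A minimal sketch of that branch (the `default_system` text here is a placeholder assumption, not the runtime's actual Qwen preamble):

```python
def build_chat_messages(messages, default_system="You are a helpful assistant."):
    """Prepend a default system message when the request has none.

    `default_system` is a placeholder; the runtime's actual Qwen preamble
    text lives in its chat-template build, not here. Without this branch,
    a system-less request can render a template the model treats as
    complete, producing the deterministic token-0 stop described above.
    """
    if any(m.get("role") == "system" for m in messages):
        return list(messages)  # caller already supplied a system message
    return [{"role": "system", "content": default_system}] + list(messages)
```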
Phase 5 + Qwen3.5 Nightly vLLM Follow-up (2026-03-02)
- vLLM main/nightly path is now validated for Qwen3.5 on AWS G5:
  - env: `.venv-vllm-nightly-q35`
  - server: `vllm 0.16.1rc1.dev...`
  - endpoint: `http://127.0.0.1:18081/v1/*`
- Infra issue resolved during setup:
  - the root filesystem hit `100%`, causing a Python/vLLM tempdir failure.
  - cleaned caches/old envs, restored ~21 GB free, and launched with an explicit `TMPDIR`.
- Qwen3.5 diagnostic run set:
  - baseline sampled run: `phase5_awareness_realbench_qwen35-realbench-r1-s8-nonthinking_20260302T184159Z.json`
  - conservative retry/vote policy probe: `phase5_awareness_realbench_qwen35-realbench-r2-policyfix1-s8-nonthinking_20260302T184624Z.json`
  - fairness-fixed canonical probe (shared-first across arms): `phase5_awareness_realbench_qwen35-realbench-r3-sharedfirst-s8-nonthinking_20260302T184947Z.json`
- Strict canonical runtime-vs-vLLM matrix is now complete (2026-03-02) with the strict inference guard enabled:
  - runtime strict mode (`TRENI_HTTP_REQUIRE_INFERENCE=1`) hard-fails invalid inference paths (`502 {"error":"inference_required"}`),
  - matrix runner: `scripts/phase5_qwen35_runtime_vs_vllm_matrix.py`,
  - canonical artifacts:
    - `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260302T221546Z.json`
    - `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260302T222013Z.json`
  - canonical outcome (`20260302T222013Z`, Arm A):
    - score: runtime `0.0503` vs vLLM `0.2170` (delta `-0.1667`)
    - latency: runtime `1881.188 ms` vs vLLM `178.093 ms` (delta `+1703.095 ms`)
  - interpretation: the Qwen3.5 strict matrix is no longer blocked; it is now a negative-result benchmark that defines the next optimization target.
- Post-fix rerun (`qnorm-check1`, 2026-03-02) after wiring decoder Q/K head RMS-norm:
  - artifact: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260302T225529Z.json`
  - result remained negative (`rt_score=0.0000`, `vllm_score=0.0625`; runtime latency `1880.622 ms` vs `187.453 ms`)
- Decoder full-attn `q_proj` gate-layout parity fix landed (2026-03-03), and the strict matrix was rerun in Arm A-only backend mode (`--phase5-arms arm_a_control`) to remove retry/vote-path contamination:
  - artifact: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.json`
  - overall Arm A score: runtime `0.15625` vs vLLM `0.19097` (delta `-0.03472`, CI includes near-parity)
  - overall Arm A latency: runtime `1723.685 ms` vs vLLM `958.757 ms` (delta `+764.928 ms`)
  - the quality gap is materially narrower than `20260302T222013Z`, but runtime is still slower overall and still behind on aggregate score.
- `r3` result snapshot:
  - `gpqa_diamond`: `A/B/C = 0.375 / 0.375 / 0.375`
  - `ifeval`: `A/B/C = 0.3125 / 0.3125 / 0.3125`
  - `gsm8k`: `A/B/C = 0.0 / 0.0 / 0.0`
  - `aime25`: `A/B/C = 0.0 / 0.0 / 0.0`
  - awareness deltas: all `B-A=0.0`, `C-A=0.0` (no down, no up).
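A client-side sketch of honoring the strict guard contract described above (the status/JSON shape follows the documented `502 {"error":"inference_required"}` behavior; the wrapper function itself is hypothetical):

```python
import json

def check_strict_response(status_code, body_text):
    """Hard-fail when the runtime's strict inference guard rejects a call.

    With TRENI_HTTP_REQUIRE_INFERENCE=1 the runtime answers 502 with
    {"error": "inference_required"} instead of silently serving a
    non-inference path; a benchmark client must treat that as a fatal
    error, never as a scored (empty) completion.
    """
    if status_code == 502:
        try:
            payload = json.loads(body_text)
        except ValueError:
            payload = {}
        if payload.get("error") == "inference_required":
            raise RuntimeError("strict guard tripped: inference_required")
    if status_code != 200:
        raise RuntimeError(f"unexpected status {status_code}")
    return json.loads(body_text)
```

This is why the co-located contamination run below is marked invalid: every call there hit the strict 502 path, so no scored output was real inference.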
Phase 5 Paper-Mode Debug (2026-03-03)
- Harness bug fix is now applied:
  - in `paper` mode, retry now commits the refined output directly (paper semantics) instead of confidence-margin replacement filtering.
  - code: `/Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py`
- Live sanity outcomes after the fix:
  - vLLM sanity (`gpqa_diamond`, `ifeval`, `8` samples/task): `phase5_awareness_realbench_qwen35-paperfix-sanity1_20260303T201156Z.json`
    - outcome: overall `B-A=0.0` (`gpqa +0.125`, `ifeval -0.125`), latency up due to retries.
  - runtime sanity on isolated GPU: `phase5_awareness_realbench_qwen35-paperfix-sanity2-runtime_20260303T201744Z.json`
    - outcome: overall `B-A=-0.125`, retry rate `100%`, large latency penalty.
- Important contamination note:
  - `phase5_awareness_realbench_qwen35-paperfix-sanity1-runtime_20260303T201620Z.json` is invalid for performance interpretation (vLLM and runtime were co-located; runtime OOM; strict `502 inference_required` on all calls).
- Calibration sweep result (runtime):
  - `ppl 1.4/1.8/2.2` produced identical outcomes with full retry (`max_entropy` dominated the trigger).
  - entropy threshold `7.0` reduced retry volume (`16 -> 9`) but still gave no score uplift.
  - artifacts:
    - `phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-p1_4_20260303T202135Z.json`
    - `phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-p1_8_20260303T202255Z.json`
    - `phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-p2_2_20260303T202415Z.json`
    - `phase5_awareness_realbench_qwen35-paperfix-runtime-sweep-ent7_20260303T202617Z.json`
- Summary-mode calibration fix is now implemented in the harness:
  - summary uncertainty detection now uses `uncertainty_source=runtime_summary`,
  - the paper trigger uses a guarded summary vote rule (`paper_summary_max_entropy_threshold`, `paper_summary_confidence_threshold`, `paper_summary_min_votes`).
  - code: `/Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py`
- Post-fix runtime sanity (`8`/task): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-sanity2_20260303T204120Z.json`
  - result: retries `9/16` (down from `16/16`) and quality recovered to parity (overall `B-A=0.0`), but latency overhead remains high (`~+1386 ms`).
- Post-fix confidence sweep (`8`/task):
  - `conf=0.40`: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_40_20260303T204257Z.json`
  - `conf=0.45`: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_45_20260303T204357Z.json`
  - `conf=0.50`: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_50_20260303T204500Z.json`
  - `conf=0.55`: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf0_55_20260303T204602Z.json`
  - all four remained at overall `B-A=0.0` (latency deltas `~+1252` to `+1386 ms`, retry rate `~0.50` to `0.5625`).
- Higher-N check (`32`/task, conf `0.45`): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32_20260303T204751Z.json`
  - result: `gpqa +0.03125`, `ifeval -0.0625`, overall `B-A=-0.015626`.
- Task-aware follow-up (summary-mode retries disabled for IFEval) produced the first positive repeatable signal on this track:
  - larger run (`32`/task): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32-ifevaloff_20260303T222841Z.json`
    - overall `B-A=+0.015624` (`arm_a 0.273438 -> arm_b 0.289062`)
    - latency delta `+618.068 ms`
    - per-task deltas: `gpqa +0.03125`, `ifeval +0.0`
  - 3-seed repeatability (`16`/task, `s7/s17/s27`):
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-ifevaloff-rpt-s7_20260303T223228Z.json`
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-ifevaloff-rpt-s17_20260303T223410Z.json`
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-ifevaloff-rpt-s27_20260303T223541Z.json`
    - overall `B-A` mean `+0.020833` (range `0.0` to `+0.03125`)
    - mean latency delta `+712.276 ms`
    - retries occurred only on GPQA under this policy (IFEval `retries=0`).
- Late optimization pass (2026-03-03): compact invalid-parse retry prompt + confidence-gated invalid-parse retries (`--invalid-parse-retry-confidence-max`).
  - code: `/Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py`
  - best tradeoff policy on this host so far: `invalid_parse_retry_confidence_max=0.73` with `paper_summary_disable_ifeval_retry=true`.
  - 3-seed (`16`/task) artifacts:
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s16_20260303T232029Z.json`
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-rpt-s17_20260303T232254Z.json`
    - `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-rpt-s27_20260303T232516Z.json`
  - result vs the prior `...ifevaloff-rpt-s{7,17,27}` baseline:
    - quality preserved (overall `B-A` mean: `+0.020833 -> +0.020833`),
    - latency overhead reduced (`+712.276 ms -> +404.603 ms`),
    - GPQA retry rate reduced (`0.5833 -> 0.2917`).
  - `32`/task confirmation (`s7`): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s32_20260303T232755Z.json`
    - same quality delta as the prior `s32` policy (overall `B-A=+0.015624`) with lower latency overhead (`+618.068 ms -> +326.187 ms`).
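The task-aware, confidence-gated retry policy above can be sketched as a single gate function (this is an assumed reading of `--invalid-parse-retry-confidence-max` and `paper_summary_disable_ifeval_retry`, not the harness's exact code):

```python
def should_retry(task, parsed_ok, confidence,
                 confidence_max=0.73, disable_ifeval_retry=True):
    """Decide whether to issue one compact-prompt retry.

    Task-aware policy from above: summary-mode retries stay off for
    IFEval. Invalid-parse retries are confidence-gated (assumption:
    only retry when the first pass was both unparseable and below the
    confidence cap; 0.73 was the best tradeoff on that host).
    """
    if disable_ifeval_retry and task == "ifeval":
        return False  # IFEval retries hurt more than they helped
    return (not parsed_ok) and (confidence < confidence_max)
```

The point of the gate is the latency column in the numbers above: confident-but-unparseable outputs are kept as-is, which is what cut the GPQA retry rate from `0.5833` to `0.2917` without losing the quality delta.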
Phase 5 Paper-Loop Alignment (2026-03-02 Late)
- Reference paper code is now local in this workspace: `third_party/weave-logprobs-reasoning-loop`
- Phase 5 harness now uses paper-aligned uncertainty triggering:
  - `--awareness-trigger-mode paper|confidence|hybrid` (default: `paper`)
  - paper trigger = any of:
    - `perplexity > trigger_perplexity_threshold` (default `1.4`)
    - `max_entropy > trigger_max_entropy_threshold` (default `1.5`)
    - `low_confidence_tokens >= trigger_low_confidence_tokens` (default `3`)
- Retry/refinement prompts now carry first-pass uncertainty summary (top uncertain token positions + alternatives).
- Artifacts now store per-call loop trace with uncertainty metrics/tables for case-level debugging.
- End-to-end smoke run completed on AWS Qwen3.5 nightly with paper mode:
  - `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-paper-smoke_20260302T191420Z.json`
  - validation signal: the paper trigger fired with explicit reason fields (`paper_reasons`) and per-call uncertainty traces in the output.
- Full `r4` run is now complete on the same config/sampling envelope:
  - `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r4-paper-s8-nonthinking_20260302T191642Z.json`
  - `r4` snapshot:
    - `gpqa_diamond`: `A/B/C = 0.375 / 0.375 / 0.375` (unchanged vs `r3`)
    - `ifeval`: `A/B/C = 0.625 / 0.4375 / 0.625` (baseline and Arm C up, Arm B down)
    - `gsm8k`: `A/B/C = 0.0 / 0.0 / 0.0` (unchanged)
    - `aime25`: `A/B/C = 0.0 / 0.0 / 0.0` (unchanged)
  - overall deltas vs Arm A: `B -0.046875`, `C 0.0`; both with higher latency from retries.
- Interpretation:
- paper-mode trigger path is functionally integrated and reproducible,
- current default thresholds are too eager for this setup and do not yet produce net quality uplift on Qwen3.5.
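The paper-aligned any-of trigger rule above can be sketched directly from per-token logprob telemetry (the `0.5` low-confidence probability cutoff is an assumption; the three thresholds are the defaults listed above):

```python
import math

def paper_trigger_reasons(token_logprobs, token_entropies,
                          ppl_threshold=1.4,
                          max_entropy_threshold=1.5,
                          low_conf_min_tokens=3,
                          low_conf_prob=0.5):
    """Paper-aligned trigger: retry if ANY of the three signals fires.

    token_logprobs: logprobs of the sampled tokens.
    token_entropies: per-token distribution entropies (nats), assumed to
    come from the runtime's logprob summary. Returns the list of fired
    reasons, mirroring the `paper_reasons` artifact field.
    """
    reasons = []
    # Perplexity of the sampled sequence: exp of mean negative logprob.
    ppl = math.exp(-sum(token_logprobs) / len(token_logprobs))
    if ppl > ppl_threshold:
        reasons.append("perplexity")
    if max(token_entropies) > max_entropy_threshold:
        reasons.append("max_entropy")
    # Count tokens whose sampled probability fell below the cutoff.
    low_conf = sum(1 for lp in token_logprobs if math.exp(lp) < low_conf_prob)
    if low_conf >= low_conf_min_tokens:
        reasons.append("low_confidence_tokens")
    return reasons
```

With the default thresholds, an average per-token logprob below about `-0.336` (ppl > 1.4) already fires the perplexity arm, which is consistent with the "too eager" reading above.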
Phase 5 Adaptive Uncertainty Fix (2026-03-02 Late 2)
- Harness fix landed: adaptive uncertainty mode now uses rolling per-task uncertainty history (`perplexity`, `max_entropy`, `low_conf_ratio`) with robust thresholds.
  - script: `scripts/phase5_awareness_realbench.py`
  - new mode/default: `--awareness-trigger-mode adaptive`
- Full rerun (`r5`, adaptive default): `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r5-adaptive-s8-nonthinking_20260302T202105Z.json`
  - vs `r4` paper mode:
    - `B-A`: `-0.046875 -> -0.015625` (improved),
    - `C-A`: `0.0 -> 0.0` (kept parity),
  - latency deltas reduced:
    - Arm B: `+904 ms -> +536 ms`
    - Arm C: `+1427 ms -> +623 ms`
- Stricter adaptive variant (`r6`) was tested: `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r6-adaptive-strict-s8-nonthinking_20260302T202314Z.json`
  - result: Arm B reached parity (`B-A=0.0`) but Arm C regressed (`C-A=-0.03125`) and latency worsened vs `r5`.
- Decision: keep the adaptive default settings from `r5` as the current best policy for this setup.
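The adaptive mode above replaces fixed thresholds with rolling per-task history. A sketch of one plausible robust-threshold rule (median + k·MAD is an assumption; the harness's exact statistic is not stated here):

```python
import statistics
from collections import defaultdict, deque

class AdaptiveTrigger:
    """Rolling per-task uncertainty history with robust thresholds.

    Tracks (perplexity, max_entropy, low_conf_ratio) per task and flags
    a call as uncertain when any metric sits well above that task's
    rolling median. The median + k*MAD rule (k=3) is an assumed choice,
    not the harness's exact statistic.
    """

    def __init__(self, window=32, k=3.0):
        self.k = k
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, task, metrics):
        # metrics: (perplexity, max_entropy, low_conf_ratio)
        self.history[task].append(metrics)

    def should_retry(self, task, metrics):
        hist = self.history[task]
        if len(hist) < 4:
            return False  # too little history: stay conservative
        for i, value in enumerate(metrics):
            series = [h[i] for h in hist]
            med = statistics.median(series)
            mad = statistics.median([abs(x - med) for x in series])
            if value > med + self.k * max(mad, 1e-6):
                return True
        return False
```

The design point matches the `r4 -> r5` result above: a per-task baseline retries only genuine outliers for that task, which is how retry volume (and hence the latency deltas) came down without giving back the quality parity.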
Decision Update (2026-02-28 Late)
- `TRENI_LINEAR_U16_FAST_COMPUTE` has now been rerun with higher-confidence repeats and is promoted default-on.
- Validation pack:
  - warm+mixed AB5: `benchmarks/phase2_runtime/results/aws_speedpass/linearfast_ab5_20260228T124736Z/summary_ab5.json`
    - warm `on-off`: request `-0.139 ms`, p95 `-0.128 ms`, p99 `-0.009 ms`
    - mixed `on-off`: request `-0.139 ms`, p95 `-0.156 ms`, p99 `-0.208 ms`
  - cold AB3: `benchmarks/phase2_runtime/results/aws_speedpass/linearfast_cold_ab3_20260228T124510Z/summary_ab3.json`
    - `full +0.302 ms`, `TTFT -0.019 ms`, startup `-4.207 ms` (near-flat on cold full, positive on startup/TTFT)
  - strict parity: `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_linearfast_20260228T124557Z.json` (`checked=3`, `failed=0`)
  - post-default strict parity smoke: `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_post_linearfast_default_20260228T125804Z.json` (`checked=3`, `failed=0`)
- Runtime parser default is now `TRENI_LINEAR_U16_FAST_COMPUTE=1` (override to `0` for a strict fallback A/B).
- Same-window sanity A/B after promotion (`linearfast_default_sanity_20260228T125957Z`) confirms default-on behavior is directionally better than forced-off on the mixed request path:
  - `default - force_off`: mean `-0.603 ms`, p95 `-0.984 ms`, p99 `+0.029 ms`.
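The `on-off` deltas reported throughout these packs reduce to per-mode summary stats subtracted between arms. A minimal sketch (nearest-rank percentile is an assumption; the packs' exact method may differ):

```python
def percentile(values, pct):
    """Nearest-rank percentile over a list of latencies in ms.

    This is one simple convention; the benchmark packs' exact
    percentile method is not specified in the notes above.
    """
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]

def ab_delta(on_ms, off_ms):
    """on-off deltas for mean/p95/p99, as reported in the AB packs.

    Negative values mean the 'on' arm was faster.
    """
    def stats(xs):
        return {"mean": sum(xs) / len(xs),
                "p95": percentile(xs, 95),
                "p99": percentile(xs, 99)}
    s_on, s_off = stats(on_ms), stats(off_ms)
    return {k: s_on[k] - s_off[k] for k in s_on}
```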
Rerun Update (2026-02-28 Late 2)
- Fresh canonical foundation rerun on the new default is now published:
  - pack root: `benchmarks/phase2_runtime/results/aws_speedpass/foundation_linearfastdefault_pack_20260228T134157Z`
  - summary: `benchmarks/phase2_runtime/results/aws_speedpass/foundation_linearfastdefault_pack_20260228T134157Z/summary_ab3.json`
- Versus the prior parser-default foundation pack (`20260228T114315Z`):
  - warm AB3: near-flat/slightly slower (`request +0.101 ms`, `p95 +0.326 ms`, `p99 +0.208 ms`)
  - cold AB3: near-flat/slightly slower (`full +0.491 ms`, `infer +0.530 ms`, `TTFT +0.002 ms`)
  - mixed AB3: improved (`request -0.629 ms`, `p95 -1.281 ms`, `p99 -0.163 ms`)
- Same-window runtime-vLLM full-depth AB3 was rerun on this updated canonical lane:
  - run set root: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_linearfastdefault_ab3_20260228T134630Z`
  - summary: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_linearfastdefault_ab3_20260228T134630Z/summary_ab3.json`
  - averages:
    - runtime first-request full: `1185.186 ms`
    - vLLM first-request full: `1305.971 ms`
    - `vLLM/runtime` full ratio: `1.102x` (runtime faster on this run set)
    - `vLLM/runtime` cold-total-first-response ratio: `5.807x`
    - `vLLM/runtime` cold-total-first-token ratio: `7.648x`
- Batched2-Lt fast-fallback short-circuit experiment (skip Lt gate/state/timing work when Lt is disabled) was tested and reverted:
  - isolation A/B (`fastfallback_isolation_ab3_20260228T140122Z/summary_ab3.json`) using `on` (short-circuit) vs `off` (forced old path) showed:
    - warm `on-off`: request `+1.155 ms`, p95 `+2.124 ms`, p99 `+1.504 ms` (regression)
    - cold `on-off`: full `-0.846 ms` (improvement)
    - mixed `on-off`: mean `+0.144 ms`, p95 `+0.569 ms`, p99 `-0.221 ms` (mixed/slightly worse overall)
  - decision: keep reverted (not canonical).
  - post-revert strict parity passed: `week3_parity_report_post_fastfallback_revert_20260228T140626Z.json`.
Decision Update (2026-02-28 Late 3)
- `TRENI_TENSOR_H2D_CHUNK_MB` default is now promoted from `64` to `0` (no chunking) on this canonical profile.
- AB3 evidence:
  - cold AB3 (`h2d_chunk_cold_ab3_20260228T142114Z/summary_ab3.json`), `chunk0 - chunk64`:
    - startup `-4.022 ms`, full `-2.562 ms`, infer `-2.542 ms`, TTFT `-0.060 ms`
    - `decoder_tensor_h2d -3.347 ms`, `decoder_tensor_upload -3.222 ms`
  - warm+mixed AB3 (`h2d_chunk_warm_mixed_ab3_20260228T142258Z/summary_ab3.json`), `chunk0 - chunk64`:
    - warm: mean `-0.442 ms`, p95 `-0.697 ms`, p99 `-0.966 ms`
    - mixed: mean `-0.044 ms`, p95 `-0.368 ms`, p99 `-0.279 ms`
- Post-promotion strict parity passed: `week3_parity_report_h2dchunk0_default_20260228T142805Z.json` (`checked=3`, `failed=0`)
- Single-run sanity (`h2d_chunk_default_vs64_sanity_20260228T142845Z`) showed small mixed sensitivity (`default - force64` mean `+0.340 ms`), so this lane should be kept under repeatability watch in future packs.
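Why promoting `TRENI_TENSOR_H2D_CHUNK_MB` from `64` to `0` can help: chunking multiplies fixed per-call overhead while the bytes moved stay constant. A toy cost model with illustrative (not measured) constants:

```python
def upload_cost_ms(total_mb, chunk_mb, per_call_overhead_ms=0.05,
                   bandwidth_gb_s=12.0):
    """Toy host-to-device upload cost model.

    chunk_mb == 0 means a single unchunked copy (the new default).
    per_call_overhead_ms and bandwidth_gb_s are illustrative values
    only, not measurements from the packs above; the transfer term is
    identical in both modes, so only the per-call term differs.
    """
    calls = 1 if chunk_mb <= 0 else -(-total_mb // chunk_mb)  # ceil div
    transfer_ms = total_mb / (bandwidth_gb_s * 1024) * 1000.0
    return calls * per_call_overhead_ms + transfer_ms
```

Under this model a 1 GB upload in 64 MB chunks pays 16 call overheads instead of 1, consistent in direction (not magnitude) with the `-3.2` to `-3.3 ms` `decoder_tensor_*` deltas above.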
Decision Update (2026-02-28 Late 4)
- Higher-N same-window runtime-vLLM full-depth rerun is now complete on the updated defaults (AB5):
  - run root: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z`
  - summary: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.json`
  - summary markdown: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.md`
- AB5 means (runtime vs vLLM):
  - first-request full: `1184.812 ms` vs `1318.675 ms` (`vLLM/runtime=1.113x`)
  - TTFT: `14.640 ms` vs `50.309 ms` (`vLLM/runtime=3.436x`)
  - cold-total first response: `4190.848 ms` vs `24350.818 ms` (`vLLM/runtime=5.810x`)
- Comparison vs the prior same-window AB3 (`...linearfastdefault_ab3_20260228T134630Z`) is published:
  - `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/compare_vs_prev_linearfastdefault_ab3.json`
  - runtime full mean improved slightly (`1185.186 -> 1184.812 ms`, `-0.375 ms`).
  - the full-latency ratio improved (`1.102x -> 1.113x`), while the TTFT ratio narrowed because vLLM TTFT was lower in this run window.
- Interpretation:
  - the request-path win vs vLLM remains stable at higher N under claim-safe fixed-token settings.
  - the remaining active Track A work is still deeper custom layer-compute reduction (`decoder_stepN_layers` / FFN-heavy path), not re-establishing baseline direction.
Decision Update (2026-02-28 Late 5)
- Full-depth gate sweep on top of current defaults is now complete:
  - gate root: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_gate_newdefaults_20260228T150709Z`
  - gate summary: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_gate_newdefaults_20260228T150709Z/summary_gate_ab2.json`
- AB2 gate outcomes:
  - delayed-Lt (`TRENI_LINEAR_BATCHED2_USE_LT=1`, `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=10000`) was directionally positive in both modes:
    - warm `on-off`: request `-0.384 ms`, infer `-0.343 ms`, p99 `-0.719 ms`
    - mixed `on-off`: request `-0.256 ms`, infer `-0.200 ms`, p99 `-0.279 ms`
  - FFN `proj_fast` (`TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1`) remained mixed/noise:
    - warm `on-off`: request `-0.096 ms`, infer `-0.082 ms`, p99 `+0.129 ms`
    - mixed `on-off`: request `-0.327 ms`, infer `-0.207 ms`, p99 `+0.022 ms`
  - decision at gate stage: only delayed-Lt advanced to AB3 confirmation.
- Delayed-Lt AB3 confirmation is complete:
  - run root: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_delayedlt_ab3_20260228T151322Z`
  - summary: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_delayedlt_ab3_20260228T151322Z/summary_ab3.json`
  - warm `on-off`: request `-0.330 ms`, infer `-0.270 ms`, p99 `-0.098 ms`
  - mixed `on-off`: request `+0.173 ms`, infer `+0.191 ms`, p99 `+0.291 ms`
- Decision:
  - keep delayed-Lt non-canonical on defaults (`TRENI_LINEAR_BATCHED2_USE_LT=0`, `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0`).
  - at this stage, keep `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0` as canonical (later temporarily promoted in `Decision Update (2026-02-28 Late 8)`, then rejected in `Decision Update (2026-02-28 Late 9)` and restored to canonical off).
  - the next custom-kernel focus remains structural layer-compute reduction (`decoder_stepN_layers` / FFN-heavy path), not env-toggle promotion.
Decision Update (2026-02-28 Late 6)
- Tuned delayed-Lt slow-gate rescue probe is complete:
  - run root: `benchmarks/phase2_runtime/results/aws_speedpass/delayedlt_tunedslow_ab2_20260228T152358Z`
  - summary: `benchmarks/phase2_runtime/results/aws_speedpass/delayedlt_tunedslow_ab2_20260228T152358Z/summary_gate_ab2.json`
- Tuned `on` config:
  - `TRENI_LINEAR_BATCHED2_USE_LT=1`
  - `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=10000`
  - `TRENI_LINEAR_BATCHED2_LT_SLOW_RATIO_PCT=0`
  - `TRENI_LINEAR_BATCHED2_LT_SLOW_STREAK_DISABLE=4`
- AB2 deltas (`on-off`):
  - warm: request `-0.185 ms`, infer `-0.054 ms`, TTFT `+0.016 ms`, p99 `-0.417 ms`
  - mixed: request `-0.004 ms`, infer `-0.032 ms`, TTFT `-0.011 ms`, p99 `+0.221 ms`
- Decision: the tuned policy is still non-promotable (mixed mean is near zero and mixed p99 regresses), so delayed-Lt remains non-canonical on defaults.
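The delayed-Lt policy above combines a time gate with a slow-streak auto-disable. A sketch of that state machine (the exact slow-call criterion wired through `TRENI_LINEAR_BATCHED2_LT_SLOW_*` is an assumed reading, not the runtime's code):

```python
import time

class DelayedLtGate:
    """Time-gated feature enable with slow-streak auto-disable.

    Keeps the Lt path off for the first `enable_after_ms` of process
    lifetime, then permanently disables it again after
    `slow_streak_disable` consecutive slow calls. The caller decides
    what counts as "slow" (assumed to be a ratio vs the fallback path).
    """

    def __init__(self, enable_after_ms=10000, slow_streak_disable=4,
                 now_ms=None):
        self.start_ms = time.monotonic() * 1000 if now_ms is None else now_ms
        self.enable_after_ms = enable_after_ms
        self.slow_streak_disable = slow_streak_disable
        self.slow_streak = 0
        self.disabled = False

    def use_lt(self, now_ms):
        if self.disabled:
            return False  # tripped the slow-streak breaker: stay off
        return (now_ms - self.start_ms) >= self.enable_after_ms

    def record_call(self, was_slow):
        self.slow_streak = self.slow_streak + 1 if was_slow else 0
        if self.slow_streak >= self.slow_streak_disable:
            self.disabled = True
```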
Decision Update (2026-02-28 Late 7)
- FFN proj batched2 `f32_input` fallback-path patch is now validated:
  - code change: `/Users/andrewcorrea/treni/monolith/models/linear.cu`
  - behavior: cache unsupported mixed-input batched2 GEMM combos and short-circuit repeated failing calls.
- Forced-Lt diagnostic (same profile/settings) before vs after the patch:
  - before: `benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_ltalways_20260228T153113Z.json`
  - after: `benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_ltalways_patch_20260228T154942Z.json`
  - delta:
    - request mean `175.208 -> 173.124 ms` (`-2.084 ms`)
    - p99 `206.780 -> 204.405 ms` (`-2.375 ms`)
    - `linear_batched2_lt_failures` `26112 -> 1` (repeated failure loop removed)
- Canonical AB2 re-gate for `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0/1` after the patch:
  - root: `benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_gate_patch_ab2_20260228T155033Z`
  - summary: `benchmarks/phase2_runtime/results/aws_speedpass/ffnproj_f32input_gate_patch_ab2_20260228T155033Z/summary_gate_ab2.json`
  - deltas (`on-off`):
    - warm: request `+0.026 ms`, infer `-0.060 ms`, p99 `+0.099 ms`
    - mixed: request `+0.057 ms`, infer `+0.028 ms`, p99 `+0.446 ms`
- Decision:
  - keep `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0` canonical (still non-promotable on the default path).
  - keep the fallback-path patch (it removes pathological repeated-failure overhead and improves robustness in forced-Lt/stress configurations).
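The fallback-path patch above amounts to a combo-keyed fail cache. A sketch of the idea (a Python stand-in for the `linear.cu` change; `try_lt` is a hypothetical placeholder for the real cuBLASLt call):

```python
class GemmFallback:
    """Combo-keyed fail cache for an unsupported-GEMM fallback path.

    Mirrors the described linear.cu behavior in spirit: the first time a
    (dtype, shape) combo fails, remember it and route every later call
    straight to the fallback instead of re-attempting and re-failing
    (the 26112 -> 1 failure-count collapse above).
    """

    def __init__(self, try_lt, fallback):
        self.try_lt = try_lt          # hypothetical fast-path call
        self.fallback = fallback      # always-supported slow path
        self.failed_combos = set()
        self.lt_failures = 0

    def gemm(self, key, *args):
        if key not in self.failed_combos:
            try:
                return self.try_lt(*args)
            except RuntimeError:
                self.lt_failures += 1
                self.failed_combos.add(key)  # never retry this combo
        return self.fallback(*args)
```

The key design point is that the cache is scoped per combo rather than disabling the fast path process-wide, matching the shape-scoped Lt fail cache noted in the 2026-02-27 update.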
Decision Update (2026-02-28 Late 8)
- Full-depth FFN projection fast-compute rerun is now complete on a clean inference path (`pool=16384`, classifier disabled, no fallback errors):
  - profiled AB3 (`ffnprojfast_fullstep_ab3_20260228T160255Z/summary_ab3.json`), `on-off`:
    - request `-0.370 ms`, infer `-0.348 ms`, TTFT `-0.045 ms`, p99 `-0.533 ms`.
  - non-profiled warm AB3 (`ffnprojfast_fullwarm_ab3_20260228T160358Z/summary_ab3.json`), `on-off`:
    - request `-0.249 ms`, infer `-0.225 ms`, TTFT `-0.015 ms`, p99 `-0.328 ms`.
- Strict parity passed with the explicit candidate env and then again on a temporarily promoted parser build:
  - candidate env: `week3_parity_report_ffnprojfast_candidate_20260228T160459Z.json` (`checked=3`, `failed=0`)
  - temporarily promoted build: `week3_parity_report_ffnprojfast_default_20260228T160639Z.json` (`checked=3`, `failed=0`)
- Interim interpretation: this looked promotable on qwen-focused clean-path profiling, but needed full foundation validation before a final canonical decision.
- Post-promotion same-window sanity AB3 (`ffnprojfast_default_sanity_ab3_20260228T160557Z/summary_ab3.json`) confirms near-flat but directionally positive default behavior:
  - `default - force_off`: request `-0.094 ms`, infer `-0.093 ms`, TTFT `-0.003 ms`, p99 `+0.057 ms`.
Decision Update (2026-02-28 Late 9)
- Canonical foundation rerun + same-window gate resolved the contradiction and rejected global promotion:
  - foundation pack: `benchmarks/phase2_runtime/results/aws_speedpass/foundation_ffnprojfastdefault_pack_20260228T194204Z/summary_ab3.json`
    - vs the prior canonical (`foundation_newdefaults_pack_20260228T143605Z`), all three modes were slower:
      - warm request `+1.317 ms`, cold full `+3.117 ms`, mixed request `+1.112 ms`.
  - same-window foundation gate AB2 (`default` vs `force_off`):
    - root: `benchmarks/phase2_runtime/results/aws_speedpass/foundation_ffnprojfast_gate_ab2_20260228T195240Z`
    - summary: `benchmarks/phase2_runtime/results/aws_speedpass/foundation_ffnprojfast_gate_ab2_20260228T195240Z/summary_gate_ab2.json`
    - `default - force_off`:
      - warm: request `+0.489 ms`, infer `+0.479 ms`, p99 `+0.841 ms`
      - cold: full `+0.746 ms`, infer `+0.537 ms` (startup improved `-1.323 ms`)
      - mixed: mean near-flat `+0.004 ms`, tails improved (`p95 -0.320 ms`, `p99 -0.823 ms`)
- Decision:
  - keep `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0` as the canonical parser default.
  - retain the lane as opt-in for qwen-focused profiling where it can still be useful.
Decision Update (2026-02-27)
- Additional full-depth FFN/linear probe cycle is complete and did not produce a new canonical win.
- 3-seed outcomes:
  - `TRENI_DECODER_FFN_PROJ_U16_FUSED=1`: slight regression vs off in runtime-only and runtime-vLLM sets.
  - `TRENI_LINEAR_U16_FAST_COMPUTE=1`: near-neutral/slight regression vs off in the initial runtime-only set (later superseded by the `2026-02-28` AB5 promotion evidence).
  - `TRENI_LINEAR_LT_WORKSPACE_MB=64`: clear regression (`first_request_full_ms +40.546 ms`, `+2.38%`).
  - `TRENI_LINEAR_USE_LT=0`: clear regression (`first_request_full_ms +48.826 ms`, `+2.87%`).
  - shape-scoped Lt fail cache (replacing process-wide disable-on-first-fail) was implemented and validated; perf impact was near-neutral (`~0.05%` full-latency movement) in both runtime-only and runtime-vLLM checks.
- FFN projection batched2 lane (`TRENI_DECODER_FFN_PROJ_U16_BATCHED2`) is now validated and promoted default-on:
  - runtime-only 3-seed delta (on-off): `first_request_full_ms -12.199 ms` (`-0.72%`), `TTFT -0.171 ms`.
  - runtime-vLLM 3-seed runtime-leg delta (on-off): `first_request_full_ms -12.974 ms` (`-0.76%`), `TTFT -0.175 ms`.
  - stage profile corroborates the layer-compute reduction (`decoder_stepN_layers_mean 19.140 -> 18.447 ms`).
- Canonical full-depth linear lane remains:
  - `TRENI_LINEAR_USE_LT=1`
  - `TRENI_LINEAR_LT_WORKSPACE_MB=0`
  - `TRENI_DECODER_FFN_PROJ_U16_FUSED=0`
  - `TRENI_DECODER_FFN_PROJ_U16_BATCHED2=1` (default-on)
- Fresh stage profiles (`external_cold_layers36_stageprofile_ffnprojbatch2_off_20260227T182949Z`, `..._on_20260227T182728Z`) still show `decoder_stepN_layers` as dominant, but improved under batched2 (`19.140 -> 18.447 ms`); FFN projection remains the top layer sub-stage (`0.205 -> 0.196 ms`/layer), so the next optimization remains structural layer-compute work.
Decision Update (2026-02-27 Late, Full-Depth Lane)
- `TRENI_DECODER_DIRECT_OUT_HIDDEN` is now promoted default-on in this full-depth lane after a positive 3-seed runtime-only A/B:
  - off: `TTFT=15.024 ms`, `full=1690.855 ms`, `cold_full=4696.944 ms`, `infer=1668.381 ms`
  - on: `TTFT=14.950 ms`, `full=1684.908 ms`, `cold_full=4691.002 ms`, `infer=1662.753 ms`
  - delta (on-off): `full -5.948 ms`, `infer -5.629 ms`
  - strict parity passed: `week3_parity_report_directouthidden_default_20260227T184738Z.json` (`checked=3`, `failed=0`).
- External-cold harness now captures completion-length signals and supports fixed-token vLLM fairness:
  - new fields: `completion_chars`, `completion_words`, streamed `usage_*` (when available).
  - the vLLM path now uses `ignore_eos=true` for fixed-token comparisons.
  - a fixed-length rerun confirms matched `completion_tokens=64` for runtime and vLLM.
- New fused qkv split+bias path (`TRENI_DECODER_QKV_SPLIT_BIAS_FUSED`) is implemented and promoted default-on in this lane:
  - runtime-only 3-seed A/B:
    - off: `TTFT=14.951 ms`, `full=1684.135 ms`, `cold_full=4690.132 ms`, `infer=1662.833 ms`
    - on: `TTFT=14.687 ms`, `full=1663.776 ms`, `cold_full=4669.847 ms`, `infer=1641.322 ms`
    - delta (on-off): `TTFT -0.265 ms`, `full -20.359 ms`, `cold_full -20.285 ms`, `infer -21.511 ms`
  - strict parity passed: `week3_parity_report_qkvsplitbias_default_20260227T190739Z.json` (`checked=3`, `failed=0`).
- Latest fixed-length runtime-vLLM 3-seed set (both `completion_tokens=64`):
  - runtime: `TTFT=14.685 ms`, `full=1662.478 ms`
  - vLLM: `TTFT=50.272 ms`, `full=1293.215 ms`
  - interpretation: runtime remains clearly ahead on TTFT, but request-full still trails in this profile.
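The fixed-token fairness setting above pins completion length on the vLLM side. A sketch of the request body for an OpenAI-compatible completions endpoint (`ignore_eos` is a vLLM extension field; `temperature=0` is an added assumption for determinism, not stated in the notes):

```python
def fixed_token_request(prompt, completion_tokens=64, model="qwen"):
    """Request body for a claim-safe fixed-length latency comparison.

    ignore_eos is a vLLM extension to the OpenAI completions schema: it
    makes generation run the full max_tokens budget instead of stopping
    at EOS, so both engines emit exactly `completion_tokens` tokens and
    per-request latency is comparable.
    """
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": completion_tokens,
        "temperature": 0,     # deterministic decoding (assumption)
        "ignore_eos": True,   # vLLM extension: force fixed length
    }
```

Without `ignore_eos`, a model that stops early would report a shorter (and unfairly faster) request-full time, which is exactly the claim-safety issue the harness change addresses.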
Decision Update (2026-02-27 Night, Logits Fast-Compute Hook)
- `TRENI_DECODER_LOGITS_U16_FAST_COMPUTE` is now wired into the runtime logits projection path (`*_f32_input_ex(..., use_fast_compute)`).
- Runtime-only 3-seed A/B (`layers=36`, `pool=16384`, preload `64`):
  - off: `TTFT=14.687 ms`, `full=1661.945 ms`, `infer=1640.884 ms`, `cold_full=4667.855 ms`
  - on: `TTFT=14.676 ms`, `full=1662.713 ms`, `infer=1640.797 ms`, `cold_full=4668.751 ms`
  - delta (on-off): `TTFT -0.011 ms`, `full +0.767 ms`, `infer -0.086 ms`, `cold_full +0.896 ms`
- Decision: keep this knob disabled by default in this lane (`TRENI_DECODER_LOGITS_U16_FAST_COMPUTE=0`), because there is no material win and request-full regresses slightly.
- Fixed-token runtime-vLLM sanity rerun (`completion_tokens=64`):
  - runtime: `TTFT=14.700 ms`, `full=1662.793 ms`
  - vLLM: `TTFT=49.778 ms`, `full=1306.676 ms`
  - artifact: `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_logitsfast_off_vllm_s1_20260227T193632Z.json`
- Strict Week 3 parity after the hook integration passed:
  - artifact: `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_logitsfast_hook_20260227T193756Z.json`
  - summary: `checked=3`, `failed=0`.
Decision Update (2026-02-27 Night, U16 Cache Unlock)
- Implemented structural cache fix:
  - `copy_tensor_to_gpu_u16` now uses tensor-cache lookup/store.
  - new env gate `TRENI_TENSOR_CACHE_U16` (default `1`) for explicit A/B.
  - the logits-u16 request path now goes through the shared cached helper.
- Runtime-only 3-seed A/B (`u16cache` off/on, full-depth preload `64`):
  - off: `TTFT=14.679 ms`, `full=1661.982 ms`, `infer=1640.118 ms`, `cold_full=4667.860 ms`
  - on: `TTFT=14.682 ms`, `full=1189.452 ms`, `infer=1168.883 ms`, `cold_full=4195.511 ms`
  - delta (on-off): `TTFT +0.003 ms`, `full -472.529 ms`, `infer -471.235 ms`, `cold_full -472.349 ms`
- Runtime-vLLM same-window A/B (`u16cache` off/on, 2 seeds each):
  - off means:
    - runtime: `TTFT=14.681 ms`, `full=1663.314 ms`
    - vLLM: `TTFT=50.073 ms`, `full=1325.189 ms`
    - runtime-vLLM full delta: `+338.124 ms` (runtime slower)
  - on means:
    - runtime: `TTFT=14.688 ms`, `full=1192.145 ms`
    - vLLM: `TTFT=50.183 ms`, `full=1290.816 ms`
    - runtime-vLLM full delta: `-98.671 ms` (runtime faster)
- Mechanism check from logs (measured request after preload):
  - off: `decoder_tensor_upload ~476 ms`, `decoder_tensor_h2d ~468 ms`
  - on: `decoder_tensor_upload ~5 ms`, `decoder_tensor_h2d 0 ms`
- Strict parity on the final default-on build passed:
  - artifact: `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_u16cache_toggle_default_20260227T200652Z.json`
  - summary: `checked=3`, `failed=0`.
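The u16 cache unlock above is, at its core, a key-to-device-buffer memo around the upload path. A hypothetical Python sketch of the shape of the fix (`upload` stands in for the real H2D copy in `copy_tensor_to_gpu_u16`; the real cache key is runtime-internal):

```python
class TensorCache:
    """Memoize device uploads by tensor identity.

    Mirrors the mechanism check above: with the cache on, a preloaded
    tensor is uploaded once and later requests reuse the device buffer,
    collapsing the per-request ~470 ms h2d/upload stage to a lookup.
    `enabled` plays the role of the TRENI_TENSOR_CACHE_U16 gate.
    """

    def __init__(self, upload, enabled=True):
        self.upload = upload
        self.enabled = enabled
        self.device = {}
        self.uploads = 0

    def get(self, key, host_tensor):
        if self.enabled and key in self.device:
            return self.device[key]      # cache hit: no H2D copy
        self.uploads += 1
        buf = self.upload(host_tensor)   # real path does the H2D copy
        if self.enabled:
            self.device[key] = buf
        return buf
```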
Decision Update (2026-02-27 Late Night, FFN Follow-Up)
- Consolidated artifacts:
  - `benchmarks/phase2_external_cold/results/external_cold_layers36_ffn_followup_summary_20260227T223458Z.json`
  - `benchmarks/phase2_external_cold/results/external_cold_layers36_ffn_followup_summary_20260227T223458Z.md`
- New optional `TRENI_LINEAR_BATCHED2_USE_LT` lane was implemented and tested:
  - runtime-only `ab3` delta (on-off): `TTFT +0.162 ms`, `full +12.469 ms`, `infer +12.534 ms`.
  - decision: not promoted.
- Higher-N retest of `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=1` + `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1` (`ab8` runtime-only):
  - delta (on-off): `TTFT -0.001 ms`, `full -0.198 ms`, `infer -0.101 ms`.
  - decision: not promoted (near-noise).
- FFN fused path follow-up:
  - the code path now allows gate/up bias deferral into the fused SiLU*Up activation when `TRENI_DECODER_FFN_PROJ_U16_FUSED=1`.
  - runtime-only `ab3` delta (on-off): `TTFT -0.003 ms`, `full -0.383 ms`, `infer -0.161 ms`.
  - decision: not promoted (near-noise).
- Net status:
  - no canonical change from this cycle.
  - the full-depth hotspot remains layer compute (`decoder_stepN_layers` / FFN-heavy path), so next work stays on deeper structural compute reductions plus mixed-load repeatability.
Decision Update (2026-02-28 Early, Fast-Profile + Mixed-Load Repeatability)
- Fast-profile (--layers 2) higher-N logits fast-compute retest is complete:
  - artifact: benchmarks/phase2_external_cold/results/external_cold_layers2_logitsfast_ab8_summary_20260228T005529Z.json
  - runtime-only AB8 delta (on-off): TTFT -0.002 ms, full -0.299 ms, infer -0.013 ms, cold_full -0.345 ms
  - stage means remained effectively unchanged (decoder_stepN_logits_proj_mean ~1.261 ms in both modes)
  - decision: not promoted (near-noise effect).
- Mixed-load repeatability on the canonical lane is complete (run_mode=mixed_load, http_runs=120, 3 runs):
  - artifact: benchmarks/phase2_runtime/results/aws_speedpass/mixed_load_repeatability_summary_20260228T005626Z.json
  - means across runs: mean=122.247 ms, p95=198.518 ms, p99=199.608 ms
  - decision: stable; no canonical config change from this sweep.
- Strict Week 3 parity follow-up on the latest patched build:
  - artifact: benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_followup_20260228T005805Z.json
  - summary: checked=3, failed=0, strict.
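The mixed-load summary above reports mean/p95/p99 over per-request latencies. A minimal sketch of one common way to compute such a summary (nearest-rank percentiles; the harness's exact method may differ, and the function name is illustrative):

```python
def latency_summary(samples_ms):
    """Summarize per-request latencies (ms) into mean/p95/p99.

    Uses nearest-rank percentiles: the smallest sample that covers at
    least p% of all samples.  This is a sketch, not the harness's API.
    """
    xs = sorted(samples_ms)
    n = len(xs)

    def pct(p):
        # ceil(p*n/100) - 1, clamped into [0, n-1]
        k = max(0, min(n - 1, -(-p * n // 100) - 1))
        return xs[k]

    return {"mean": sum(xs) / n, "p95": pct(95), "p99": pct(99)}
```

For 120 HTTP runs per sweep, p99 is effectively the second-slowest request, which is why the log treats p99 shifts of a few hundred microseconds as meaningful only across repeated runs.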
Decision Update (2026-02-28, Parser Fix + Full-Depth FFN Follow-Up)
- phase2_runtime_benchmark.py timing parser was fixed to preserve decimals in timing stage=... ms=... lines.
  - root cause: the regex escaped the decimal point incorrectly, which truncated stage values to their integer prefixes.
  - impact: request-level metrics (ttft, infer, full) were unaffected; stage telemetry was underreported.
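The failure mode is easy to reproduce: a pattern that stops matching at the decimal point reports only the integer prefix of each stage time. An illustrative reconstruction (these are not the harness's exact regexes):

```python
import re

# A capture group that only matches digits stops at the decimal point,
# so "ms=1.261" is parsed as 1 rather than 1.261.
BROKEN = re.compile(r"timing stage=(\S+) ms=(\d+)")
# The fix: optionally capture the fractional part as well.
FIXED = re.compile(r"timing stage=(\S+) ms=(\d+(?:\.\d+)?)")

def parse_stage_ms(line, pattern):
    """Return (stage_name, value_ms) or None if the line does not match."""
    m = pattern.search(line)
    return (m.group(1), float(m.group(2))) if m else None
```

This also explains why request-level metrics were unaffected: they come from a different measurement path, so only the stage telemetry lines were underreported.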
- Rerun artifacts with the fixed parser:
  - benchmarks/phase2_runtime/results/aws_speedpass/cold_profile_qwen_layers36_fixparse_20260228T011037Z.json
  - benchmarks/phase2_runtime/results/aws_speedpass/warm_profile_qwen_layers36_fixparse_20260228T011037Z.json
- Confirmed full-depth hotspot (qwen, layers=36) remains FFN-heavy:
  - decoder_step_profile_ffn_proj_mean ~0.366 ms/layer
  - decoder_step_profile_ffn_down_resid_mean ~0.190 ms/layer
  - decoder_step_profile_total_mean ~0.705 ms/layer
- Full-depth warm AB3 on TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE:
  - artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_fast_compute_ab3_20260228T011146Z_summary.json
  - delta (on-off): request +0.317 ms, infer +0.305 ms, stage means flat.
  - decision: not promoted.
- New strided-batched Lt path for the batched2 FFN Lt fallback was implemented and benchmarked:
  - warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2lt_strided_ab3_20260228T011651Z_summary.json
  - warm AB3 delta (on-off): request -0.190 ms, infer -0.194 ms, stage means flat.
  - runtime-only external-cold sanity (layers=36, preload64): slight regression (full +0.579 ms, infer +0.609 ms).
  - decision: keep the path opt-in (TRENI_LINEAR_BATCHED2_USE_LT=1) and not canonical.
- FFN gate/up dual-bias fused add path (TRENI_DECODER_FFN_BIAS_PAIR_FUSED) is now implemented and benchmarked:
  - warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_bias_pair_ab3_20260228T020257Z/summary.json
  - warm AB3 delta (on-off): request -0.229 ms, infer -0.090 ms, p99 -0.390 ms, TTFT +0.009 ms.
  - cold follow-up artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_bias_pair_cold_ab2_20260228T020723Z/summary.json (3 seeds each after extension).
  - cold delta (on-off): TTFT -0.003 ms, infer +1.875 ms, full +1.928 ms.
  - decision: keep the lane opt-in (non-canonical) until the cold regression is eliminated.
- Batched2 seq1 split-GEMM lane (TRENI_LINEAR_BATCHED2_SPLIT_SEQ1) is now implemented and benchmarked:
  - warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_splitseq1_ab3_20260228T025841Z/summary.json
  - warm AB3 delta (on-off): request +0.014 ms, infer +0.105 ms, p99 +0.124 ms, TTFT +0.004 ms (near-noise/slight regression).
  - cold AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_splitseq1_cold_ab3_20260228T025841Z/summary.json
  - cold AB3 delta (on-off): TTFT -0.021 ms, infer -2.002 ms, full -2.070 ms.
  - decision: keep opt-in and non-canonical (no warm-path win).
- Batched2 dup-input strided lane (TRENI_LINEAR_BATCHED2_DUP_INPUT) is now implemented and benchmarked:
  - warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_dupinput_ab3_20260228T031816Z/summary.json
  - warm AB3 delta (on-off): request +0.317 ms, infer +0.293 ms, TTFT +0.009 ms, p99 -0.208 ms.
  - cold AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_dupinput_cold_ab3_20260228T031816Z/summary.json
  - cold AB3 delta (on-off): TTFT +0.010 ms, infer +1.388 ms, full +1.307 ms.
  - decision: keep opt-in and non-canonical (regresses the mean request path in both warm and cold).
- Batched2 dup-input v2 probe (duplication-kernel swap for the dup path) was run as a warm AB2 gate set and rejected:
  - gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_dupinput_v2warm_ab2_20260228T032741Z/summary_gate_ab2.json
  - gate delta (on-off): request +0.438 ms, infer +0.381 ms, TTFT +0.015 ms, p99 +0.217 ms.
  - decision: probe implementation reverted; no AB3/cold expansion.
- FFN proj u16 fused gate rerun (TRENI_DECODER_FFN_PROJ_U16_FUSED=0/1, warm AB2):
  - gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_proj_u16_fused_gate_ab2_20260228T033524Z/summary_gate_ab2.json
  - gate delta (on-off): request +0.149 ms, infer +0.173 ms, TTFT +0.002 ms, p99 -0.006 ms.
  - decision: near-flat/slight mean regression; no AB3 expansion.
- FFN proj batched2 f32-input gate rerun (TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0/1, warm AB2):
  - gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_proj_batched2_f32input_gate_ab2_20260228T033758Z/summary_gate_ab2.json
  - gate delta (on-off): request +0.236 ms, infer +0.248 ms, TTFT +0.011 ms, p99 +0.512 ms.
  - decision: rejected at the gate stage; no AB3 expansion.
- Linear u16 compute16f gate probe (TRENI_LINEAR_U16_FORCE_COMPUTE_16F=0/1, warm AB2):
  - gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/linear_u16_compute16f_gate_ab2_20260228T034412Z/summary_gate_ab2.json
  - gate delta (on-off): request +0.210 ms, infer +0.240 ms, TTFT -0.001 ms, p99 +0.594 ms.
  - decision: rejected at the gate stage and reverted; no AB3 expansion.
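The recurring gate-then-expand workflow above (cheap AB2 gate first, AB3 only for promising toggles) reduces to a small decision rule. A hedged sketch, with an illustrative noise band that is an assumption, not a documented harness threshold:

```python
def gate_decision(off_ms, on_ms, noise_band_ms=0.25):
    """AB gate: decide whether a toggle earns a full AB3 sweep.

    off_ms / on_ms: per-run request-latency means (ms) with the toggle
    off and on.  Negative delta means 'on' is faster.  A toggle only
    advances past the gate if it improves by more than the noise band;
    near-noise or regressing deltas are rejected early, saving the cost
    of AB3/cold expansion.  Sketch only -- not the harness's actual API.
    """
    delta = sum(on_ms) / len(on_ms) - sum(off_ms) / len(off_ms)
    if delta <= -noise_band_ms:
        return delta, "expand_to_ab3"
    return delta, "reject_at_gate"
```

This matches the pattern in the log: all three gate reruns above showed positive (regressing) request deltas, so none of them was expanded to AB3.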
- Full-depth warm u16-lane re-baseline (explicit u16 decode flags, qwen, layers=36) confirms the active hotspot split:
  - request mean in this lane is ~173 ms (120-request warm profile), with decoder_step_profile_total_mean ~0.402 ms.
  - FFN projection remains dominant (decoder_step_profile_ffn_proj_mean ~0.196 ms, mostly ffn_proj_gate; ffn_proj_up stays 0.0 under batched2).
- FFN gate/up contiguous-pair packing probe (TRENI_DECODER_FFN_PAIR_PACK_U16) is implemented as experimental and benchmarked:
  - AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/ffn_pair_pack_gate_ab2_20260228T040616Z/summary_ab3.json
  - warm AB3 delta (on-off): request -0.423 ms, infer -0.442 ms, p99 -0.673 ms.
  - caveat: both sides already reported the contiguous gate/up pair active, so this delta is not a causal promotion signal.
  - decision: keep the path default-off (TRENI_DECODER_FFN_PAIR_PACK_U16=0) and experimental only.
- Batched2 Lt rerun on the explicit u16 lane (TRENI_LINEAR_BATCHED2_USE_LT) now has fresh warm+cold evidence:
  - warm AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_use_lt_u16lane_gate_ab2_20260228T041041Z/summary_ab3.json
  - warm AB3 delta (on-off): request -0.313 ms, infer -0.468 ms, p99 -0.511 ms, TTFT -0.058 ms.
  - cold AB3 artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_use_lt_u16lane_cold_ab2_20260228T041359Z/summary_ab3.json
  - cold AB3 delta (on-off): full +1.165 ms, infer +1.424 ms, TTFT +0.001 ms.
  - fixed-on decision: keep non-canonical (the warm gain did not survive the cold first-hit tradeoff).
- Adaptive delayed batched2 Lt policy (TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS) has warm/cold wins but is not canonical (2026-02-28):
  - 5000 ms AB3 (batched2_lt_enable_after_ms5000_warm_ab3_20260228T104525Z, batched2_lt_enable_after_ms5000_cold_ab3_20260228T104712Z): warm improved but cold full still regressed (+0.422 ms).
  - 10000 ms AB3 (batched2_lt_enable_after_ms10000_warm_ab3_20260228T105028Z, batched2_lt_enable_after_ms10000_cold_ab3_20260228T105213Z): both modes improved.
    - warm delta (on-off): request -0.363 ms, infer -0.326 ms, p99 -0.696 ms.
    - cold delta (on-off): startup -4.307 ms, full -0.635 ms, infer -0.347 ms, TTFT -0.070 ms.
  - strict parity (week3_parity_report_batched2_lt_delay10000_20260228T105329Z.json): pass (checked=3, failed=0).
  - default-path strict parity smoke (no explicit batched2 Lt env overrides) also passed: week3_parity_report_batched2_lt_defaultdelay_20260228T110825Z.json.
  - same-window mixed-load A/B (mixed_load_defaultdelay_onoff_ab3_20260228T115010Z.json) regressed with delayed-on: on-off mean +0.846 ms, p95 +1.627 ms, p99 +0.679 ms.
  - decision: keep the defaults off (TRENI_LINEAR_BATCHED2_USE_LT=0, TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0) and leave delayed-on as opt-in.
  - post-revert default-path strict parity also passed: week3_parity_report_postrevert_defaults_20260228T115543Z.json.
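The delayed-enable policy can be sketched as a process-uptime gate; the semantics below are assumed from the flag name (TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS), not taken from the runtime source:

```python
import time

class DelayedToggle:
    """Sketch of a delayed-enable policy: keep a faster-warm but
    slower-cold code path OFF until the process has been up for N ms,
    so the cold first hit takes the safe path while steady-state
    traffic gets the win.  Assumed semantics, illustrative class name.
    """

    def __init__(self, enable_after_ms, now_ms=None):
        self.enable_after_ms = enable_after_ms
        self.start_ms = time.monotonic() * 1000.0 if now_ms is None else now_ms

    def enabled(self, now_ms=None):
        if self.enable_after_ms <= 0:  # 0 means the policy is off entirely
            return False
        now = time.monotonic() * 1000.0 if now_ms is None else now_ms
        return (now - self.start_ms) >= self.enable_after_ms
```

The mixed-load regression above shows the limitation of a pure uptime gate: once the delay elapses, the Lt path is on for all traffic, including load patterns where it loses.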
- Foundation parser-default rerun pack is now published (foundation_defaultdelay_pack_20260228T114315Z):
  - warm AB3 means (foundation_defaultdelay_warm_ab3_20260228T114315Z/summary_ab3.json): request 147.258 ms, p99 247.617 ms, infer 128.450 ms, TTFT 16.999 ms.
  - cold AB3 means (foundation_defaultdelay_cold_ab3_20260228T114315Z/summary_ab3.json): startup 425.532 ms, full 598.787 ms, infer 580.173 ms, TTFT 12.210 ms.
  - mixed repeatability (mixed_load_repeatability_summary_defaultdelay_20260228T114748Z.json) vs the prior canonical summary (mixed_load_repeatability_summary_20260228T005626Z.json) remained slower (mean +2.841 ms, p95 +5.587 ms, p99 +5.140 ms), reinforcing the non-canonical decision for delayed-on defaults.
- Added an experimental Lt prewarm path for FFN batched2 (TRENI_DECODER_FFN_BATCHED2_LT_PREWARM) and measured it with Lt fixed-on:
  - warm AB2 (batched2_lt_prewarm_warm_ab2_20260228T042453Z/summary_gate_ab2.json): small gain (request -0.328 ms, infer -0.394 ms).
  - cold AB3 (batched2_lt_prewarm_cold_ab3_20260228T042649Z/summary_ab3.json): first-hit gain (full -1.497 ms, infer -1.406 ms).
- Direct same-window combo A/B (lt=0, prewarm=0 vs lt=1, prewarm=1) is mixed and non-promotable:
  - combined summary artifact: benchmarks/phase2_runtime/results/aws_speedpass/batched2_lt_prewarm_combo_summary_20260228T042733Z.json
  - warm AB3 (batched2_lt_prewarm_combo_warm_ab2_20260228T042733Z/summary_ab3.json): regression (request +0.198 ms, infer +0.178 ms, p99 +0.407 ms).
  - cold AB3 (batched2_lt_prewarm_combo_cold_ab3_20260228T042733Z): still improved (full -1.099 ms, infer -0.819 ms).
  - decision: keep the prewarm path experimental/default-off; not canonical.
- FFN down fast-compute lane (TRENI_DECODER_FFN_DOWN_U16_FAST_COMPUTE) is now promoted default-on (2026-02-28) after full-depth canonical A/B + strict parity:
  - warm AB3 (ffn_down_fast_compute_gate_ab3_20260228T044546Z/summary_ab3.json): request -0.565 ms, infer -0.566 ms, p99 -1.405 ms, TTFT -0.030 ms.
  - cold AB3 (ffn_down_fast_compute_cold_ab3_20260228T044753Z/summary_ab3.json): startup -8.405 ms, full -0.351 ms, infer -0.406 ms, TTFT -0.028 ms.
  - strict parity (week3_parity_report_ffn_down_fast_20260228T044846Z.json): pass (checked=3, failed=0).
- Post-promotion retest cycle on the updated canonical baseline (2026-02-28) closed additional FFN toggle candidates as non-canonical:
  - new structural TRENI_LINEAR_BATCHED2_STACKED_SEQ1=1 AB3 probe regressed warm materially (request +1.259 ms, infer +1.229 ms, p99 +2.830 ms) with near-flat cold full (+0.030 ms), so it remains experimental/default-off.
  - TRENI_LINEAR_BATCHED2_SPLIT_SEQ1 AB3 retest regressed warm and cold.
  - TRENI_LINEAR_BATCHED2_USE_LT fixed-on AB3 retest improved warm but still regressed cold startup/full; delayed-on improved warm/cold but still regressed mixed-load, so the lane stays non-canonical.
  - combo TRENI_LINEAR_BATCHED2_USE_LT=1 + TRENI_DECODER_FFN_BATCHED2_LT_PREWARM=1 looked positive at AB3 but failed AB5 cold confirmation.
  - TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 was non-canonical in that cycle (later temporarily promoted in Decision Update (2026-02-28 Late 8), then rejected in Decision Update (2026-02-28 Late 9) and returned to canonical off).
  - TRENI_LINEAR_U16_FAST_COMPUTE=1 was revalidated in a later AB5/cold/parity cycle and is now promoted (see Decision Update (2026-02-28 Late) above).
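Most of the decisions above reduce to env-var toggles with default-on or default-off semantics ("enabled unless explicitly disabled (=0)" vs "opt-in"). A minimal sketch of that convention (the helper is illustrative, not the runtime's actual flag parsing):

```python
import os

def env_flag(name, default=False):
    """Read a TRENI_*-style boolean toggle from the environment.

    default=True models promoted, default-on lanes: they stay enabled
    unless explicitly set to 0.  default=False models opt-in
    experimental lanes.  Sketch of the convention only -- the runtime's
    actual accepted spellings may differ.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip() not in ("0", "false", "off", "")
```

Keeping rejected lanes behind default-off flags (rather than deleting the code) is what makes the later retest cycles in this log cheap to rerun.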
Decision Update (2026-02-26)
- Full-depth FFN activation-to-u16 fused path (TRENI_DECODER_FFN_ACT_U16_FUSED) was implemented, benchmarked, and promoted to default-on.
- Runtime-only 3-seed A/B:
  - off: TTFT=15.333 ms, full=1715.700 ms, cold_full=4721.653 ms
  - on: TTFT=15.193 ms, full=1704.987 ms, cold_full=4710.958 ms
  - delta (on-off): TTFT -0.140 ms, full -10.713 ms, cold_full -10.696 ms
- Runtime-vLLM 3-seed A/B (same host class/window):
  - off: runtime full=1716.052 ms, vLLM full=1299.219 ms (runtime/vLLM=1.3208x)
  - on: runtime full=1704.248 ms, vLLM full=1309.801 ms (runtime/vLLM=1.3012x)
- Strict parity passed for explicit-on and default-on runs:
  - week3_parity_report_ffnactu16_20260226T1100.json
  - week3_parity_report_ffnactu16_default_20260226T1108.json
- cuBLASLt workspace probe (TRENI_LINEAR_LT_WORKSPACE_MB=32) was tested and rejected in this lane (full 1711.213 -> 1754.568 ms in trial A/B).
Decision Update (2026-02-24)
- cuDNN/frontend optimization lane is parked for now.
  - Reason: high fused coverage remains slower than custom on the warm path and dramatically worse on cold first-hit.
  - Active priority is the custom-kernel best path only.
- Custom-lane implementation update: added a seq1 microfused attention path (TRENI_ATTN_SEQ1_USE_MICROFUSED) and cached cuBLAS stream binding.
- G5 A/B update (2026-02-23): the microfused path showed no net win (mean/TTFT regressions across qwen/bart profiles), so it remains opt-in and defaults off.
  - summary artifact: benchmarks/phase2_runtime/seq1_microfused_ab/seq1_microfused_ab_summary_20260223T014848Z.md.
- G5 stream-cache A/B update (2026-02-23): TRENI_LINEAR_STREAM_CACHE / TRENI_ATTN_STREAM_CACHE showed near-neutral impact in short runs; the cache stays enabled by default.
  - summary artifact: benchmarks/phase2_runtime/results/stream_cache_ab_summary_20260223T015222Z.md.
- G5 registry/model-index hash A/B update (2026-02-23): TRENI_REGISTRY_LOOKUP_HASH / TRENI_MODEL_INDEX_NAME_HASH showed no meaningful cold/setup gain on this profile; the path remains opt-in and defaults off.
  - summary artifact: benchmarks/phase2_runtime/results/registry_hash_ab_summary_20260223T020353Z.md.
- Cold-start measurement contract fix (2026-02-23): phase2_runtime_benchmark.py health polling moved from a 1 s cadence to a 50 ms cadence.
  - implication: startup_to_healthy_ms is now high-fidelity (the older ~1002 ms plateaus were quantization artifacts of the 1 s poll, not true startup plateaus).
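The quantization artifact is a property of poll cadence: with a 1 s poll, any startup that completes between the first and second probe reads as ~1000 ms. A minimal sketch of the fixed contract (the health probe itself is left abstract; this is not the harness's actual code):

```python
import time

def wait_until_healthy(probe, poll_interval_s=0.05, timeout_s=30.0):
    """Poll probe() (e.g. an HTTP GET against the runtime's health
    endpoint) on a 50 ms cadence and return elapsed ms at first success.

    With the old 1 s cadence, startup_to_healthy_ms was quantized to
    ~1000 ms steps; a 50 ms cadence bounds the quantization error to
    one poll interval.  Illustrative sketch only.
    """
    start = time.monotonic()
    while True:
        if probe():
            return (time.monotonic() - start) * 1000.0  # startup_to_healthy_ms
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError("health probe did not succeed in time")
        time.sleep(poll_interval_s)
```

The residual measurement error is now bounded by the poll interval plus probe round-trip time, which is why the post-fix references report values like 437.653 ms rather than ~1002 ms plateaus.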
- Runtime startup-smoke control is now first-class in the harness (--runtime-skip-startup-smoke, default true).
  - validated A/B (startup_smoke_ab_hf_20260223T030059Z): startup-to-healthy improved 488.027 -> 404.184 ms (-17.18%) and start-to-first-response improved 705.454 -> 622.167 ms (-11.81%) with the startup smoke skipped.
  - the runtime default now also skips the startup smoke unless explicitly disabled (TRENI_SKIP_STARTUP_SMOKE=0).
- New high-fidelity cold reference (cold_foundation_hf_20260223T030257Z, qwen profile):
  - startup-to-healthy 437.653 ms
  - request TTFT 4.295 ms
  - request full 217.998 ms
  - dominant request-path stage remains decoder_tensor_upload / decoder_tensor_h2d.
- Consolidated knob-probe summary artifact: benchmarks/phase2_runtime/results/cold_path_knob_probe_20260223T0303Z.md
- Per-tensor upload hotspot probe is now available: TRENI_TENSOR_UPLOAD_TOPK identified model.embed_tokens.weight as dominant in qwen cold upload (~79.3 ms, ~63.8% share in the probe artifact).
  - artifact: benchmarks/phase2_runtime/results/cold_upload_hotspot_summary_20260223T1915Z.md.
- Container readahead probe (TRENI_CONTAINER_WILLNEED) is now benchmarked:
  - 8-run A/B (container_willneed_ab8_20260223T191145Z) shows a modest, repeatable cold-total improvement (~-1.94% start-to-first-response).
  - the TRENI_CONTAINER_WILLNEED + TRENI_TENSOR_HOST_REGISTER combo did not improve further on this profile (container_hostreg_ab8_20260223T191255Z).
  - the runtime default now enables TRENI_CONTAINER_WILLNEED unless explicitly disabled (=0).
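A top-K upload report of the kind TRENI_TENSOR_UPLOAD_TOPK produces can be sketched as a ranking of per-tensor upload times by share of the total (field names and the function itself are illustrative, not the runtime's output format):

```python
def upload_topk(per_tensor_ms, k=5):
    """Rank tensors by upload time and report each one's share of the
    total cold-upload cost.  This kind of share breakdown is how a
    single tensor (e.g. model.embed_tokens.weight) shows up as the
    dominant ~63.8% contributor.  Sketch only.
    """
    total = sum(per_tensor_ms.values())
    ranked = sorted(per_tensor_ms.items(), key=lambda kv: kv[1], reverse=True)
    return [
        {"tensor": name, "ms": ms, "share": ms / total}
        for name, ms in ranked[:k]
    ]
```

A heavily skewed share like this is the signal that cold-upload work should target one tensor's transfer path before spreading effort across the long tail.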
- Staged H2D upload follow-up is now complete (TRENI_TENSOR_H2D_STAGING):
  - min64/chunk32 8-run A/B (h2d_staging_ab_20260224T100915Z) regressed full latency (+21.22%) and the upload/h2d stages (+37.70% / +38.68%).
  - min64/chunk128 3-run probe (h2d_staging_chunk128_probe_20260224T101012Z) regressed further (full +44.43%, decoder_tensor_h2d +76.92%).
  - Decision: keep the staging path parked (opt-in only), and focus Track A cold work on non-staging upload/H2D plus decoder_step0_layers.
  - consolidated artifact: benchmarks/phase2_runtime/results/h2d_staging_followup_summary_20260224T101324Z.md.
- Non-staging H2D chunk-size matrix (TRENI_TENSOR_H2D_CHUNK_MB=0/64/128, 8 runs each) is now complete:
  - request-path and upload-stage deltas were near-neutral in that initial profile sweep; this was later superseded by the 2026-02-28 full-depth AB3 promotion of default TRENI_TENSOR_H2D_CHUNK_MB=0 (see Decision Update (2026-02-28 Late 3)).
  - consolidated artifact: benchmarks/phase2_runtime/results/h2d_chunk_matrix_summary_20260224T101730Z.md.
- Host page-touch pre-fault path (TRENI_TENSOR_HOST_TOUCH) is now implemented and benchmarked (TRENI_TENSOR_HOST_TOUCH_MIN_MB=256, 8-run A/B):
  - decoder_tensor_h2d decreased, but the prefetch/upload stages increased and net request latency regressed (full +7.73%, infer +8.22%).
  - Decision: keep host-touch opt-in/default-off; not promoted into canonical Track A settings.
  - consolidated artifact: benchmarks/phase2_runtime/results/host_touch_ab_summary_20260224T102444Z.md.
- Upload sync diagnostic probe (TRENI_TENSOR_UPLOAD_SYNC=0/1, 3 runs each) is now complete:
  - with synchronization on, conversion is visible (~6 ms) but H2D remains dominant (~118 ms) on this profile.
  - implication: cold upload optimization remains transfer-path first.
  - consolidated artifact: benchmarks/phase2_runtime/results/upload_sync_probe_summary_20260224T102618Z.md.
- Synchronized host-register probe (TRENI_TENSOR_HOST_REGISTER=0/1, with TRENI_TENSOR_UPLOAD_SYNC=1) is now complete:
  - no transfer-stage gain and a slight request-path regression on this profile.
  - implication: the host-register lane is currently deprioritized.
  - consolidated artifact: benchmarks/phase2_runtime/results/host_register_sync_probe_summary_20260224T102915Z.md.
- Decoder logits u16 path (TRENI_DECODER_LOGITS_U16_PATH) is now implemented and benchmarked:
  - cold upload/setup improves slightly, but request-path latency regresses materially (ttft/infer/full) in valid A/B runs.
  - a follow-up fix2 pilot still regresses the request path materially; the lane remains parked.
  - implication: keep this lane parked as opt-in experimental; not part of canonical Track A settings.
  - consolidated artifact: benchmarks/phase2_runtime/results/logits_u16_ab_fix1_summary_20260224T105532Z.md.
- Tensor-cache hash lookup path (TRENI_TENSOR_CACHE_HASH) is now implemented and benchmarked:
  - mixed + warm 3-seed A/B remains near-neutral, with a slight warm p99 regression (+0.149 ms) in this profile.
  - implication: keep this lane opt-in/default-off.
  - artifacts:
    - benchmarks/phase2_runtime/results/tensor_cache_hash_ab_20260224T113911Z/
    - benchmarks/phase2_runtime/results/tensor_cache_hash_warm3_20260224T114126Z/
- Sampler direct-store path (TRENI_SAMPLE_DIRECT_STORE) is now implemented and benchmarked:
  - the enabled path regressed warm request latency (3-seed A/B: mean +0.062 ms, p95 +0.076 ms, p99 +0.143 ms).
  - implication: keep this lane opt-in/default-off.
  - artifact: benchmarks/phase2_runtime/results/sample_direct_store_ab_20260224T114633Z/.
- Decoder direct-out residual path (TRENI_DECODER_DIRECT_OUT_HIDDEN) initial warm-profile A/B (2026-02-24) regressed and was kept opt-in at that time:
  - the enabled path regressed warm request and infer metrics (3-seed A/B: mean +0.540 ms, p95 +0.495 ms, p99 +0.444 ms, infer +0.150 ms).
  - artifact: benchmarks/phase2_runtime/results/direct_outhidden_ab_20260224T115051Z/.
  - note: this is superseded for the current full-depth lane by the 2026-02-27 late-cycle promotion (see Decision Update above).
- Consolidated summary for these three custom-path probes: benchmarks/phase2_runtime/results/custom_path_probe_summary_20260224T115602Z.md.
- Multi-head seq1 attention path (TRENI_ATTN_SEQ1_USE_MULTIHEAD) is now implemented and benchmarked:
  - qwen warm 3-seed: request mean 1.041x, p99 1.042x, infer 1.074x (seq1_multihead_ab_20260224T125127Z).
  - qwen mixed 3-seed: request mean 1.036x, p99 1.045x, infer 1.074x, cold wall 1.010x (seq1_multihead_ab_20260224T125127Z).
  - bart warm 3-seed: request mean 1.097x, p99 1.112x, TTFT 1.429x, infer 1.185x (seq1_multihead_bart_ab_20260224T125404Z).
  - default sanity rerun (no env override) remains faster than the forced-off profile (seq1_multihead_default_sanity_20260224T125713Z).
  - decision: promoted default-on (TRENI_ATTN_SEQ1_USE_MULTIHEAD=1, TRENI_ATTN_SEQ1_MULTIHEAD_MAX_KV=2048).
  - consolidated artifact: benchmarks/phase2_runtime/results/seq1_multihead_ab_summary_20260224T125619Z.md.
- External-cold rerun after the seq1 multi-head default promotion is now complete (2026-02-24, same G5 host/config, 3 runs):
  - runtime means: startup 1003.315 ms, TTFT 4.022 ms, request full 239.277 ms, cold-total first response 1242.592 ms.
  - runtime-normalized ratios: PyTorch 127.900x TTFT / 9.378x full / 6.320x cold-total; vLLM 12.350x TTFT / 4.139x full / 19.333x cold-total.
  - note: Ollama was skipped on this host because the Ollama service/model were not installed for this rerun.
  - consolidated artifact: benchmarks/phase2_external_cold/results/external_cold_seq1mh_default_repeatability_20260224T192020Z.md.
- First decoder_step0_layers optimization follow-up on the seq1 multi-head path is now benchmarked (2026-02-24):
  - change: reuse normalized probs in the multi-head seq1 softmax+PV (removing the repeated exp in the inner PV accumulation loop).
  - 3-run external-cold repeatability (runtime + PyTorch + vLLM) runtime deltas vs the prior seq1mh baseline:
    - TTFT: 4.022 -> 4.018 ms
    - request full: 239.277 -> 238.400 ms
    - cold-total first response: 1242.592 -> 1241.688 ms
  - interpretation: measurable but small gain; confirms the direction, and more step0 work is still needed for material uplift.
  - consolidated artifact: benchmarks/phase2_external_cold/results/external_cold_step0expfix_repeatability_20260224T194226Z.md.
- Second decoder_step0_layers follow-up (seq1 multi-head shared-prob cache) was benchmarked and reverted:
  - 3-run means were slightly worse than step0expfix (full +0.278 ms, cold-total +0.282 ms) while still better than the older seq1mh baseline.
  - decision: keep step0expfix as the current best path and revert the shared-prob patch.
  - artifact: benchmarks/phase2_external_cold/results/external_cold_step0shared_repeatability_20260224T194913Z.md.
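The first follow-up amounts to computing the softmax weights once and reusing them in the PV accumulation, rather than re-exponentiating inside the inner loop. A plain-Python sketch for a single head (shapes and names are illustrative, not the kernel's actual layout):

```python
import math

def seq1_softmax_pv(scores, values):
    """Softmax + PV accumulation for one query position (seq_q=1), one head.

    scores: kv_len attention logits for the new token.
    values: kv_len rows of head_dim value-cache entries.
    The optimization described above: exponentiate and normalize ONCE
    into `probs`, then reuse those probs in the PV loop, instead of
    recomputing exp(score - max) per element of the accumulation.
    """
    m = max(scores)                        # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    probs = [e / denom for e in exps]      # computed exactly once
    head_dim = len(values[0])
    out = [0.0] * head_dim
    for p, row in zip(probs, values):      # PV accumulation reuses probs
        for d in range(head_dim):
            out[d] += p * row[d]
    return out
```

The small measured deltas are consistent with this: the change removes redundant transcendental work but leaves the matrix-multiply cost of the layer stack untouched.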
- Decode-stage profiling beyond step0 is now available (TRENI_DECODE_STAGE_PROFILE):
  - first profiled run (external_cold_stepn_profile_20260225T001334Z) shows decoder_stepN_logits_sample_mean=2.671 ms and decoder_stepN_layers_mean=1.360 ms (qwen fast profile: --layers 2, 64 tokens, no preload).
  - implication (fast profile): the next custom-kernel priority is the logits projection/sampling path.
- Decode split follow-up (2026-02-25) now isolates logits projection from sampling:
  - external_cold_stepn_split_20260225T081450Z and external_cold_stepn_split_revert_20260225T082055Z show decoder_stepN_logits_proj_mean=2.458 ms vs decoder_stepN_sample_mean=0.106 ms.
  - implication: the residual decode hotspot is specifically logits projection.
- Immediate logits-projection probe matrix (2026-02-25) is complete and near-neutral:
  - lt16 A/B: external_cold_stepn_lt16_off/on_20260225T081717Z/081718Z
  - fast16 GEMMEx probe: external_cold_stepn_split_fast16_20260225T082158Z
  - direct-u16-input A/B: external_cold_stepn_u16direct_off/on_20260225T082445Z/082447Z
  - lt_u16 workspace A/B: external_cold_stepn_ltu16ws_off/on_20260225T082735Z/082737Z
  - decision: all no-gain probe code paths were reverted; the baseline remains canonical.
- Uncertainty-capture A/B on the same profile (TRENI_DEMO_CAPTURE_UNCERTAINTY=1/0) is now complete:
  - request full 479.889 -> 473.367 ms
  - infer 461.771 -> 454.878 ms
  - decoder_stepN_logits_sample_mean 2.671 -> 2.562 ms
  - implication: uncertainty overhead is measurable but not the dominant decode cost.
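Stage means like decoder_stepN_logits_proj_mean come from accumulating per-step stage timings and averaging them at the end of a run. A minimal illustrative accumulator (not the runtime's actual profiler interface):

```python
import time
from collections import defaultdict

class StageProfiler:
    """Sketch of a decode-stage profiler: time named stages per token
    step, then report per-stage means.  Names and interface are
    illustrative of how fields like decoder_stepN_logits_proj_mean
    could be derived, not the runtime's actual implementation.
    """

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, stage, ms):
        self.samples[stage].append(ms)

    def timed(self, stage):
        profiler = self

        class _Timer:
            def __enter__(self):
                self.t0 = time.perf_counter()

            def __exit__(self, *exc):
                profiler.record(stage, (time.perf_counter() - self.t0) * 1000.0)

        return _Timer()

    def means(self):
        return {f"{s}_mean": sum(v) / len(v) for s, v in self.samples.items()}
```

Splitting one coarse stage into finer ones (as the logits_proj vs sample split above does) is purely a matter of where the timer scopes are placed in the decode loop.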
- Runtime-vLLM cold rerun (external_cold_runtime_vllm_uncertoff_20260225T001929Z) confirms the runtime remains clearly ahead on this profile:
  - runtime TTFT/full/cold-total full: 3.929 / 472.724 / 1476.116 ms
  - vLLM TTFT/full/cold-total full: 49.577 / 1311.481 / 24344.013 ms
- Full-depth qwen check (--layers 36, --pool-mb 16384) is now explicitly captured:
  - profiled runtime-only artifact: external_cold_stepn_split_layers36_pool16g_20260225T083216Z shows decoder_stepN_layers_mean=24.306 ms, decoder_stepN_logits_proj_mean=2.458 ms, decoder_stepN_total_mean=26.875 ms.
  - implication (full depth): decoder layers are the dominant request-path stage; logits projection is secondary.
- Full-depth runtime-vLLM cold comparison (external_cold_runtime_vllm_layers36_pool16g_20260225T083306Z):
  - runtime TTFT/full/cold-total full: 26.775 / 2983.780 / 3987.092 ms
  - vLLM TTFT/full/cold-total full: 49.998 / 1315.478 / 24346.938 ms
  - implication: the runtime is better on TTFT and cold-total, but currently slower on first-request full latency in this full-depth configuration.
- Full-depth preload follow-up (external_cold_runtime_vllm_layers36_pool16g_preload_20260225T150209Z):
  - the runtime request path improves to TTFT/full/infer = 26.748 / 2136.131 / 2114.951 ms with cache hits (cache_hit_delta=434, cache_miss_delta=0).
  - vLLM in the same run remains faster on full (1279.729 ms) but much worse on cold-total (24310.219 ms).
  - implication: after removing upload misses, the residual gap is decode/layer compute.
- Full-depth preload-max-tokens probe (external_cold_runtime_vllm_layers36_pool16g_preload64_20260225T150410Z) is near-neutral vs preload=1 on runtime request full (2133.948 ms), but increases cold-total due to the heavier startup preload.
- Full-depth seq1 hybrid matrix rerun (external_cold_layers36_hybrid_*_20260225T1508*.json):
  - default custom is best (infer ~2113 ms).
  - qk/pv/both cublas variants regress materially (infer ~2459-2556 ms).
- Full-depth direct-u16-input probe (external_cold_layers36_preload_a2_u16direct_off/on_20260225T150710Z/150715Z) is near-neutral/regressed and was reverted.
TRENI_DECODER_FFN_U16_PATH=1) is now complete:- artifacts:
external_cold_layers36_preload64_ab2_base_20260225T1628Zexternal_cold_layers36_preload64_ab2_ffnu16_20260225T1628Z
- runtime deltas (
ffnu16 - base):- TTFT
26.872 -> 18.077 ms - request full
2148.336 -> 1820.345 ms - cold-total full
6155.513 -> 4826.635 ms
- TTFT
- implication: significant full-depth gain is validated, but runtime request full is still slower than vLLM (
~1.38x) in this matched run.
- artifacts:
- Full-depth 3-seed expansion (
basevsATTN+FFN u16vsATTN+FFN+LOGITS u16) is now complete:- baseline means: runtime
TTFT=26.863 ms,full=2147.754 ms,cold_full=6154.978 ms ATTN+FFN u16means: runtimeTTFT=17.080 ms,full=1791.873 ms,cold_full=4797.910 msATTN+FFN+LOGITS u16means: runtimeTTFT=16.104 ms,full=1775.313 ms,cold_full=4780.830 ms- implication: best runtime/vLLM full ratio improved to
1.365x(from1.653xbaseline), but full request parity is still not reached.
- baseline means: runtime
- Full-depth decode-input reuse + u16-Lt follow-up (
2026-02-25) is now complete:- pre-cast reuse 3-seed means: runtime
TTFT=15.866 ms,full=1755.374 ms,cold_full=4761.440 ms - pre-cast reuse + u16-Lt 3-seed means: runtime
TTFT=15.522 ms,full=1729.351 ms,cold_full=4735.345 ms - vs prior best (
ATTN+FFN+LOGITS u16): request full-45.962 ms, TTFT-0.582 ms, cold-total full-45.485 ms - implication: best runtime/vLLM full ratio improved further to
1.323x, but request-full parity is still open.
- pre-cast reuse 3-seed means: runtime
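The deltas and runtime/vLLM ratios tracked through these updates are simple arithmetic over run-mean summaries. A small illustrative helper (field names mirror the reports; the function is not the harness's actual API):

```python
def ab_report(off, on, vllm_full_ms=None):
    """Compute on-off deltas (ms) between two run-mean summaries.

    off/on: dicts of metric means, e.g. {"TTFT": 26.863, "full": 2147.754}.
    Negative deltas mean the 'on' configuration is faster.  Optionally
    also report the runtime/vLLM full-latency ratio used throughout
    these updates as the parity yardstick.  Sketch only.
    """
    report = {k: round(on[k] - off[k], 3) for k in off}
    if vllm_full_ms is not None:
        report["runtime_vllm_full_ratio"] = round(on["full"] / vllm_full_ms, 3)
    return report
```

Because vLLM's own numbers drift between rerun windows, the log consistently recomputes the ratio from same-window pairs rather than comparing against a fixed vLLM baseline.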
- Full-depth residual-fused u16-Lt follow-up (2026-02-26) is now complete:
  - 3-seed means: runtime TTFT=15.400 ms, full=1719.302 ms, cold_full=4725.923 ms
  - vs the prior precastreuse+u16lt set: request full -10.049 ms, TTFT -0.122 ms, cold-total full -9.422 ms
  - implication: the runtime request path improved again; vLLM moved in the same rerun window, so the ratio remained mixed and request-full parity is still open.
- Full-depth FFN gate+up fused-batch probe (2026-02-26) is now closed as non-canonical:
  - trial and 3-seed runtime-only A/B were completed.
  - result: regression on 3-seed means (full +4.768 ms when enabled), so the path was reverted.
- Full-depth attention qkv fused-alias follow-up (TRENI_DECODER_ATTN_U16_QKV_FUSED) is now complete:
  - runtime-only 3-seed means:
    - off: TTFT=15.412 ms, full=1720.295 ms, cold_full=4726.358 ms
    - on: TTFT=15.323 ms, full=1714.426 ms, cold_full=4720.376 ms
    - delta (on-off): TTFT -0.089 ms, full -5.869 ms, cold_full -5.982 ms
  - runtime-vLLM 3-seed means:
    - off: runtime full=1720.062 ms, vLLM full=1282.024 ms (runtime/vLLM=1.3417x)
    - on: runtime full=1713.520 ms, vLLM full=1295.140 ms (runtime/vLLM=1.3230x)
  - implementation note: the fused alias is now default-on in this lane; runtime logs confirm activation (attn qkv fused alias=on) on the current qwen profile.
- Post-rebuild full-depth sanity checks (2026-02-26) confirm no regression in this lane:
  - external_cold_layers36_sanity_postltwsoff_residfuse_u16lt_20260226T093127Z: runtime TTFT=15.384 ms, full=1720.056 ms, cold_full=4726.451 ms
  - external_cold_layers36_sanity_postbatch2revert_residfuse_u16lt_20260226T093905Z: runtime TTFT=15.399 ms, full=1720.835 ms, cold_full=4726.968 ms
  - external_cold_layers36_sanity_postffnsubprof_residfuse_u16lt_20260226T094109Z: runtime TTFT=15.432 ms, full=1720.919 ms, cold_full=4727.189 ms
  - external_cold_layers36_sanity_qkvfuseddefault_residfuse_u16lt_20260226T102520Z: runtime TTFT=15.319 ms, full=1713.886 ms, cold_full=4720.190 ms
- Full-depth FFN sub-stage profile split (external_cold_layers36_stepn_profile_ffnsub_20260226T094140Z.log):
  - decoder_step_profile_ffn_proj_mean=0.205 ms
  - decoder_step_profile_ffn_proj_cast_mean=0.005 ms
  - decoder_step_profile_ffn_proj_gate_mean=0.101 ms
  - decoder_step_profile_ffn_proj_up_mean=0.099 ms
  - implication: the cast is minor; the remaining ffn_proj hotspot is gate/up linear compute itself.
- FAST_16 follow-up probe on top of u16-Lt (2026-02-25) was evaluated and not promoted:
  - request-full deltas were small (~1-2 ms) and one startup run in the repeatability set showed a large shared-host outlier.
  - decision: keep the canonical baseline on non-fast compute and continue next work from residfuse+u16lt.
What Has Been Run
Phase 1 (Baseline, Python stack)
- T4 set: baseline JSON exists.
- G5 set: baseline JSON exists.
- Includes cold start breakdown, warm model runs, and pipeline runs.
Phase 2 (Minimal runtime benchmark)
- T4 set: runtime JSON exists.
- G5 set: runtime JSON exists.
- Includes cold starts, model run timing, and HTTP request latency.
- True TTFT rerun exists (runtime timing, not SSE proxy).
- Cold optimization rerun exists after tensor index-cache fix.
- Stage-level cold decomposition exists (tokenizer/index/upload/prefill/step0 timings).
- Fast tensor collect optimization rerun exists (`clean4`).
- External cold canonical run exists across four backends (runtime, PyTorch, vLLM, Ollama) on G5 (2026-02-18).
- External cold optimized run exists with runtime startup preload + tokenizer cache (2026-02-18).
- External cold token-parity rerun exists after decoder/sampling fixes; runtime now wins request and cold-total vs vLLM (2026-02-18).
- Qwen cold upload sub-stage ablation exists with GPU conversion toggle on G5 (2026-02-19).
- External-cold runtime-only GPU-convert ablation exists on G5 (on/off toggle, preload + token-parity settings, 2026-02-19).
- External-cold runtime-vLLM rerun exists after vLLM env restore (2026-02-19, 3-run repeatability).
- External-cold all-backend repeatability exists after GPU-convert fix (2026-02-19, 3 runs; runtime + PyTorch + vLLM + Ollama).
- Runtime-only cold stability sweep exists (2026-02-19, 5 runs with preload upload sub-stage inspection).
- Runtime host-prefetch cold-variance fix exists (`TRENI_TENSOR_HOST_PREFETCH`, 2026-02-19) with a stable runtime-only 5-run sweep.
- External-cold all-backend repeatability rerun exists after the host-prefetch fix (2026-02-19, 3 runs).
- External-cold repeatability rerun exists after seq1 multi-head default promotion (2026-02-24, 3 runs, runtime + PyTorch + vLLM; Ollama skipped on host).
- AWS G5 speedpass matrix exists for upload-sync + `cublasLt` toggles (2026-02-22).
- AWS G5 TTFT kernel pass exists (2026-02-22): softmax near-parity, then a norm-kernel rewrite (rmsnorm/layernorm) produced measurable cold/warm latency gains.
- AWS G5 TTFT follow-up exists (2026-02-22): a `seq_q=1` tiny-attention kernel path + direct K/V cache writes further improved TTFT/warm latency and moved Bart TTFT materially.
- AWS G5 attention backend A/B exists for `custom` vs `cudnn_sdpa` proxy, including a reverse-order rerun to remove call-order cold bias (2026-02-22).
- AWS G5 seq1 hybrid tuning matrix exists (2026-02-22): default custom seq1 vs qk-cublas vs pv-cublas vs both-cublas.
- AWS G5 seq1 fused-softmax/PV follow-up exists (2026-02-22): default custom path rerun after fused seq1 softmax+PV + QK block retune (`seq1_hybrid_fused_20260222T192656Z`).
- Attention runtime now caches backend env config once per process (removes per-call parse overhead on the request path).
- `cudnn_sdpa` is now fused-only by default; legacy proxy A/B runs are explicit opt-in via `TRENI_ATTN_ALLOW_SDPA_PROXY=1`.
- H100 fused cuDNN SDPA probe pack exists (2026-02-22): alignment/shape/layout sweeps plus debug traces; no viable SDPA engine configs under the current backend descriptor path.
- AWS G5 strict fused frontend A/B rerun exists (`attn_backend_ab_frontend_20260222T220111Z`, fixed `qwen`, warmed query set): warm path near parity while cold-first-hit still regresses heavily.
- Fused frontend stage-profile probe exists (`cudnn_frontend_profile_probe_20260222T2204Z`): miss-cost root cause isolated (`~705 ms` per plan-build miss on A10G; pack/execute/unpack are negligible).
- Frontend A/B runner now hard-fails non-fused contamination (missing fused log marker or a `TRENI_WITH_CUDNN=0` runtime build).
- AWS G5 frontend repeatability matrix exists (`attn_backend_frontend_matrix_20260222T221948Z`, 3 repeats/profile, `warm_fixed` + `mixed_churn`) and shows custom wins all tracked metrics in both profiles.
- Frontend claim-strength report exists (`attn_backend_frontend_claim_report_20260222T222958Z`) with paired delta CI95 summaries for each metric/profile.
- AWS G5 frontend miss-trace probe exists (`attn_backend_ab_frontend_20260222T224739Z`) with explicit miss-key logging (`TRENI_ATTN_CUDNN_FRONTEND_TRACE_MISSES=1`).
- AWS G5 startup-preload mitigation matrix exists (`attn_backend_frontend_matrix_20260222T224521Z`) plus a direct compare report vs no-preload (`attn_backend_frontend_missmit_compare_20260222T225215Z`).
- AWS G5 preload splitter fix is verified (`TRENI_HTTP_PRELOAD_PROMPTS` now executes the full list; `run=1/4 ... run=4/4` in runtime logs).
- AWS G5 benchmark-query preload mitigation matrix exists (`attn_backend_frontend_matrix_20260222T231139Z`) with a direct compare vs a matched no-preload baseline (`attn_backend_frontend_missmit_compare_20260222T231335Z`).
- AWS G5 shape-level no-preload mitigation probe exists (`prebuild_startup_nopreload_probe_20260222T232932Z`): fused cold TTFT/request latency are near custom with startup prebuild enabled.
- AWS G5 no-preload shape-prebuild matrix probe exists (`attn_backend_frontend_matrix_20260222T233003Z`) with a direct compare vs the no-preload baseline (`attn_backend_frontend_missmit_compare_20260222T233116Z`).
- AWS G5 hybrid no-preload fused policy rerun exists (`attn_backend_frontend_matrix_20260223T001959Z`) with a direct compare vs the prior tuned no-gate shape-prebuild baseline (`attn_backend_frontend_missmit_compare_20260223T002153Z`), plus 3x startup probe repeatability (`prebuild_hybrid10_nopreload_probe_r{1,2,3}_20260223T002214Z`).
- AWS G5 broader-shape sanity run exists for the hybrid policy (`hybrid_shape_sanity_20260223T002857Z`): startup stays near `2.0 s` and inference stays valid, but long-prompt growth past `seq_kv=10` still triggers fused miss cascades.
- AWS G5 bounded-hybrid follow-up exists (`TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10`): the broader-shape sanity rerun (`hybrid_shape_sanity_maxgate_20260223T003453Z`) removes miss cascades, and the 3x matrix rerun (`attn_backend_frontend_matrix_20260223T003611Z`) remains near-parity with prior hybrid fixed-profile metrics.
- Runtime now emits per-request attention backend telemetry (`attention.total_calls`, custom/fused/proxy shares, gate/fail counters) in chat responses; the phase2 harness/reporting now aggregates this in `attention_backend`.
- Coverage-instrumented fused reruns exist (2026-02-23): 3x matrix (`attn_backend_frontend_matrix_20260223T011158Z`) plus warm/cold coverage profiles (`fused_coverage_profiles_20260223T011504Z`, `fused_coverage_cold_profiles_20260223T011534Z`).
- Execution direction is now explicit: fused/frontend work is parked; next optimization cycles are custom-only.
- Routing failure-amplification stress run exists with injected tool failures/timeouts plus controller retries (2026-02-18).
- Routing matrix expansion exists on G5 (baseline + 5 stress profiles, 2026-02-19).
- Cross-host routing pilot exists (local client via SSH tunnel to G5 runtime/controller; baseline + mild-timeout + stress, 2026-02-19).
- Split-host routing matrix exists (CPU controller/tool host + GPU runtime host, 6 profiles, 2026-02-19).
- Internet multi-hop expansion exists (Fly.io controller/tool hops with commercial APIs):
  - OpenAI `gpt-5.2` profile matrix (2026-02-20, repeatability rerun at `runs=3`/profile).
  - OpenRouter `openai/gpt-5.2` profile matrix (2026-02-20, repeatability rerun at `runs=3`/profile).
  - OpenRouter `anthropic/claude-sonnet-4.6` profile matrix (2026-02-20, repeatability rerun at `runs=3`/profile).
- Local control routing matrices exist (same harness, local standalone external router, no Fly scheduler path):
  - OpenAI `gpt-5.2` (2026-02-20, `runs=3`/profile).
  - OpenRouter `anthropic/claude-sonnet-4.6` (2026-02-20, `runs=3`/profile).
  - higher-N reruns (`runs=8`/profile) for the same pair.
- Task-family parity split (`model_only`, `tool_only`) exists on the same pair (`runs=8`).
- Grouped commercial root-cause report exists (`commercial_gap_root_cause_20260222T222958Z`) combining fairness artifacts (r4+r8) by provider/model/task family with stage decomposition.
Week 3 (Numerical parity)
- T4 parity: strict mode, 0 failures.
- G5 parity: strict mode, 0 failures.
- Donut is intentionally skipped in parity check and marked as skipped.
Phase 3 comparison report
- T4 comparison report exists.
- G5 comparison report exists.
Phase 3 agentic loop benchmark (canonical G5 set)
- Dedicated harness implemented with 3 scenarios:
- retrieval correction
- tool-state adaptation
- confidence-gated branching
- Evaluator metrics included:
- success rate
- steps-to-convergence / correction efficiency
- latency per task and per successful task
- failure taxonomy
- Canonical G5 set run complete (2026-02-19):
  - baseline profile: 3 seeds
  - stress profile: 3 seeds
  - consolidated summary artifact published.
- Realistic-v1 fixture run set complete (2026-02-22):
  - baseline profile: 3 seeds
  - stress profile: 3 seeds
  - consolidated summary artifact published.
Phase 3 uncertainty-awareness ablation (baseline + stress + comparison)
- Harness now supports:
  - uncertainty source modes: `normalized_logprob`, `raw_logit_margin`, `hybrid`, `runtime_native`
  - independent uncertainty toggles: internal/external on/off
- Matrix runner added for 4-arm ablation per source.
- Baseline repeatability set complete (2026-02-19, `runs=8`, seeds `7/11/19`, all 3 sources).
- Stress repeatability set complete (2026-02-19, same seeds/sources, injected timeout/failure profile).
- Consolidated baseline-vs-stress comparison report published.
- Runtime/kernel-native uncertainty wiring is now implemented:
  - runtime HTTP response includes an `uncertainty` payload
  - Phase 3 harness can consume it via `runtime_native`
- Canonical G5 C2 rerun with the `runtime_native` source is now published (2026-02-19, baseline+stress, 3 seeds each).
- Realistic-v1 C2 baseline+stress pair is now published for `normalized_logprob`, `raw_logit_margin`, and `hybrid` (2026-02-22, seed `7`).
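The exact formulas behind these uncertainty sources are not spelled out in this report. As an illustration only, here is one plausible sketch of a `normalized_logprob`-style source and a `hybrid` blend; the function names, the mapping, and the blend weight are all hypothetical, not the harness's actual definitions:

```python
import math

def normalized_logprob_uncertainty(token_logprobs):
    """Hypothetical 'normalized_logprob'-style source: map the mean token
    logprob (logprobs <= 0) into [0, 1], where 0 means fully confident."""
    if not token_logprobs:
        return 1.0  # no evidence -> maximally uncertain
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    # exp(mean logprob) is the geometric-mean token probability in (0, 1].
    return 1.0 - math.exp(mean_lp)

def hybrid_uncertainty(norm_lp_u, margin_u, weight=0.5):
    """Hypothetical 'hybrid' source: convex blend of two uncertainty signals."""
    return weight * norm_lp_u + (1.0 - weight) * margin_u

u = normalized_logprob_uncertainty([-0.05, -0.10, -0.02])
print(f"uncertainty={u:.4f}")
```

A `runtime_native` source would instead read the `uncertainty` payload from the runtime HTTP response directly rather than recomputing anything harness-side.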
Phase 4 hardware expansion (Lambda A100/H100)
- Full A100 run set complete (phase2 cold/hot + routing matrix + C2 runtime-native calibrated).
- Full H100 run set complete (phase2 cold/hot + routing matrix + C2 runtime-native calibrated).
- Canonical loop summaries on A100/H100 are also complete (baseline+stress, 3 seeds each).
- Paper-grade package generated from canonical G5 + A100 + H100 artifacts:
  - `/benchmarks/paper_package/latest/package_summary.json`
  - `/benchmarks/paper_package/latest/paper_package.md`
  - `/benchmarks/paper_package/latest/tables/*.csv`
  - `/benchmarks/paper_package/latest/manuscript/*` (captions, claims, figure manifest, mermaid figure specs)
Latest Key Findings (2026-02-17)
- Warm path on G5 remains strong (`~80.6 ms` mean, `~90.4 ms` p99 in the latest clean7 sanity run).
- Internal routing is faster than external routing (`1.032x` external/internal ratio).
- Cold TTFT dropped further after stage decomposition + fast tensor collect:
  - qwen: `1.41s -> 1.10s` (`22.1%` lower)
  - donut: `619ms -> 150ms` (`75.7%` lower)
  - bart: `777ms -> 125ms` (`83.9%` lower)
  - minilm: `23.4ms -> 22.6ms` (`3.4%` lower)
- `model_tensor_index_build` is no longer dominant (`~1-2.3 ms` mean across models in clean4).
- An async pinned-upload experiment regressed Qwen cold TTFT and was reverted; clean4 remains the accepted cold-path reference.
- Revert validation set (clean7, 2026-02-18 UTC) confirms clean4 numbers are reproducible within noise.
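The report mixes two improvement conventions, "Nx faster" ratios and "% lower" percentages; both derive from the same before/after pair. A small helper makes the mapping explicit, using the qwen cold-TTFT numbers above as input:

```python
def improvement(before_ms: float, after_ms: float) -> tuple[float, float]:
    """Return (ratio, percent_lower) for a before -> after latency drop.
    ratio is the 'Nx faster' convention; percent_lower is the '% lower' one."""
    ratio = before_ms / after_ms
    pct_lower = 100.0 * (1.0 - after_ms / before_ms)
    return ratio, pct_lower

# qwen cold TTFT from the clean4 findings above: 1.41 s -> 1.10 s
ratio, pct = improvement(1410.0, 1100.0)
print(f"{ratio:.3f}x faster, {pct:.1f}% lower")  # ~1.282x, ~22.0%
```

The tiny difference vs the reported `22.1%` comes from the source rounding `1.41s`/`1.10s` to two decimals before publishing.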
Latest Key Findings (2026-02-22, True Fused cuDNN Frontend Rerun)
- Strict fused frontend A/B (`attn_backend_ab_frontend_20260222T220111Z`) with fixed `qwen` and warmup policy (`http_warmup_runs=8`) shows:
  - warm request mean: custom `19.324 ms` vs fused frontend `21.503 ms` (custom/frontend = `0.899`)
  - warm infer mean: custom `18.803 ms` vs fused frontend `20.976 ms` (custom/frontend = `0.896`)
  - warm TTFT: custom `4.199 ms` vs fused frontend `4.498 ms`
- Cold-first-hit remains the blocker:
  - cold TTFT: custom `4.220 ms` vs fused frontend `710.641 ms`
  - cold full latency: custom `250.929 ms` vs fused frontend `6610.148 ms`
- Stage-profile probe (`TRENI_ATTN_CUDNN_FRONTEND_PROFILE=1`) shows the root cause is miss compile cost, not execution kernels:
  - plan-build miss cost: `~704.8 ms` per miss
  - pack/execute/unpack per-call costs are tiny (`~0.010 / 0.021-0.048 / 0.005 ms`)
- Interpretation:
- fused path is real and validated.
- warm steady-state is close to custom when shapes are warmed.
- unresolved work is miss mitigation for cold/mixed shape churn.
Latest Key Findings (2026-02-22, Frontend Repeatability Matrix)
- Artifact: `attn_backend_frontend_matrix_20260222T221948Z` (`repeats=3` per profile).
- Profiles:
  - `warm_fixed`: fixed model (qwen) with warmup (`http_warmup_runs=8`)
  - `mixed_churn`: fixed model with no warmup (`http_warmup_runs=0`) to expose miss churn
- Win counts:
  - custom is faster on every tracked metric in both profiles (`3/3` wins each metric)
- Warm-fixed aggregate:
  - request mean: custom `19.271 +/- 0.050 ms` vs fused `21.468 +/- 0.018 ms`
  - infer mean: custom `18.812 +/- 0.059 ms` vs fused `20.984 +/- 0.026 ms`
  - TTFT mean: custom `4.198 +/- 0.001 ms` vs fused `4.498 +/- 0.001 ms`
- Mixed-churn aggregate:
  - request mean: custom `47.864 +/- 0.018 ms` vs fused `843.141 +/- 0.735 ms`
  - infer mean: custom `47.331 +/- 0.050 ms` vs fused `842.542 +/- 0.747 ms`
  - TTFT mean: custom `4.197 +/- 0.002 ms` vs fused `179.744 +/- 0.263 ms`
- Interpretation:
- custom clearly wins under both stable warmed and churned request conditions.
- fused frontend remains sensitive to shape misses; miss mitigation is still the blocker for cold/mixed competitiveness.
Latest Key Findings (2026-02-22, Frontend Claim-Strength Report)
- Artifact: `attn_backend_frontend_claim_report_20260222T222958Z`.
- This report computes paired deltas (`frontend - custom`) with CI95 from the repeatability matrix.
- Warm-fixed signal:
  - request mean delta: `+2.197 ms` CI95 `[2.125, 2.238]`
  - TTFT delta: `+0.300 ms` CI95 `[0.299, 0.301]`
- Mixed-churn signal:
  - request mean delta: `+795.277 ms` CI95 `[794.408, 795.747]`
  - TTFT delta: `+175.546 ms` CI95 `[175.300, 175.820]`
- Interpretation:
- current custom path is faster than current fused frontend path in both profiles, with non-overlapping positive deltas for all tracked latency metrics.
- repeat count remains low (`n=3`/profile), but effect sizes are large and stable.
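The report's exact CI construction is not specified here; as a minimal sketch, a paired-delta CI95 under a normal approximation looks like the following. The repeat values are illustrative stand-ins for an `n=3` matrix, not the actual artifact data:

```python
import math
import statistics

def paired_delta_ci95(frontend_ms, custom_ms):
    """Paired (frontend - custom) mean delta with a normal-approximation CI95.
    The actual report's CI method may differ (e.g. t-interval or bootstrap)."""
    deltas = [f - c for f, c in zip(frontend_ms, custom_ms)]
    mean = statistics.fmean(deltas)
    if len(deltas) < 2:
        return mean, (mean, mean)  # degenerate interval for a single repeat
    sem = statistics.stdev(deltas) / math.sqrt(len(deltas))
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)

# Illustrative warm request means per repeat (ms), fused vs custom:
mean, (lo, hi) = paired_delta_ci95([21.47, 21.45, 21.49], [19.27, 19.22, 19.32])
print(f"delta={mean:+.3f} ms CI95[{lo:.3f}, {hi:.3f}]")
```

A consistently positive interval, as in the warm-fixed and mixed-churn signals above, is what supports the "custom faster" claim despite the low repeat count.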
Latest Key Findings (2026-02-22, Startup-Preload Miss-Mitigation, Updated Canonical)
- Artifacts:
  - baseline matrix (`no_preload`): `attn_backend_frontend_matrix_20260222T230445Z`
  - candidate matrix (`startup_preload_benchmark_queries`): `attn_backend_frontend_matrix_20260222T231139Z`
  - comparison report: `attn_backend_frontend_missmit_compare_20260222T231335Z`
  - exact-prompt probe: `preload_exact_prompt_probe_20260222T231050Z.json`
- Mitigation used:
  - startup multi-prompt preload (`TRENI_HTTP_PRELOAD_PROMPTS`) with the prompt set matched to benchmark cold/warm queries.
- Mixed-churn deltas (`no_preload` -> `startup_preload_benchmark_queries`):
  - fused warm request mean: `843.242 -> 22.433 ms` (`37.590x` faster)
  - fused warm infer mean: `842.684 -> 21.965 ms` (`38.365x` faster)
  - fused warm TTFT: `179.541 -> 4.497 ms` (`39.928x` faster)
  - fused cold TTFT: `704.521 -> 4.495 ms` (`156.723x` faster)
  - fused cold full latency: `6593.495 -> 25.785 ms` (`255.707x` faster)
- Exact-prompt probe result:
  - preload on the exact cold prompt drops first-hit fused TTFT to `4.499 ms` and full latency to `26.090 ms`.
- Interpretation:
- fused cold/mixed miss penalty can be removed on this harness when preload coverage matches serving prompts.
- custom remains slightly faster in warmed steady state (the ratio remains about `0.90` custom/frontend), but the prior `~704 ms` first-hit fused TTFT blocker is resolved for this canonical prompt set.
- still open: make this robust without prompt-list curation (shape-level prebuild/reuse path).
Latest Key Findings (2026-02-22, Shape-Prebuild No-Preload Probe)
- Artifacts:
  - cold probe (no preload, startup shape prebuild): `prebuild_startup_nopreload_probe_20260222T232932Z.json`
  - matrix probe (`repeats=1`): `attn_backend_frontend_matrix_20260222T233003Z`
  - compare vs no-preload baseline: `attn_backend_frontend_missmit_compare_20260222T233116Z`
- Mitigation used:
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=16` (initial probe)
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128`
  - no startup preload prompts (`TRENI_HTTP_PRELOAD` unset)
- Cold probe (qwen, fused frontend):
  - startup->healthy: `11017.541 ms`
  - request TTFT: `5.814 ms`
  - request full latency: `255.434 ms`
- Matrix probe highlights (`shape_prebuild_nopreload`, fused frontend):
  - mixed-churn cold TTFT: `5.805 ms`
  - mixed-churn cold full latency: `255.267 ms`
  - mixed-churn warm request mean: `51.482 ms`
  - mixed-churn warm TTFT: `4.824 ms`
- Interpretation:
  - shape-level prebuild removes the no-preload fused cold/mixed request-path spike without curated prompt lists.
  - the current tradeoff is a startup cost shift (`http_attn_prebuild` dominates startup time), so next work is reducing compile-at-startup overhead.
- Follow-up tuning (`seq_kv_max: 16 -> 10`) artifact: `prebuild_startup10_nopreload_probe_20260222T235944Z.json`
  - startup->healthy: `11017.541 -> 7011.472 ms` (`1.571x` faster startup)
  - request TTFT: `5.814 -> 5.826 ms` (near-identical)
  - request full latency: `255.434 -> 254.936 ms` (near-identical)
- Matrix confirmation for the tuned range:
  - tuned matrix (`seq_kv_max=10`): `attn_backend_frontend_matrix_20260223T000256Z`
  - compare vs `seq_kv_max=16`: `attn_backend_frontend_missmit_compare_20260223T000343Z`
  - request-path behavior stayed near-identical while startup dropped materially:
    - warm-fixed fused request mean: `22.556 -> 22.265 ms`
    - mixed fused request mean: `51.482 -> 50.974 ms`
- Lower-range probe (`seq_kv_max=8`) artifact: `prebuild_startup8_nopreload_probe_20260223T000600Z.json`
  - startup->healthy: `6010.381 ms` (faster startup)
  - request TTFT: `703.771 ms` (regression)
  - request full latency: `1660.576 ms` (regression)
  - interpretation: `seq_kv_max=8` under-covers this query profile; `10` is the minimum safe tuned range in the current harness.
- Heuristic-mode probe (`TRENI_ATTN_CUDNN_FRONTEND_HEUR_MODE`) on the current `sm86` path:
  - `A` and `B` had near-identical prebuild/startup behavior.
  - `FALLBACK` produced no valid engine configs for this frontend descriptor path.
Latest Key Findings (2026-02-23, Coverage-Instrumented Fused Reruns)
- Coverage-instrumented 3x matrix (`attn_backend_frontend_matrix_20260223T011158Z`) confirms:
  - warm-fixed fused coverage is low (`warm_attn_fused_share ~0.030303`) under the bounded hybrid policy.
  - mixed-churn fused coverage is similarly low (`~0.030303`), with custom handling most calls.
  - warm TTFT remains slightly better for custom (`4.194 ms` custom vs `4.269 ms` fused profile).
  - warm request mean/p99 stay near-parity on the fixed profile; the mixed profile still favors custom in request-path totals.
- High-coverage fused profile (`fused_coverage_profiles_20260223T011504Z`) shows the current fused frontend path is slower when heavily used:
  - `frontend_all` fused share `~0.878788` with warm request mean `22.310 ms` vs custom `20.292 ms` (`~1.099x` slower).
  - warm TTFT `4.496 ms` vs custom `4.196 ms`.
- Cold coverage profile (`fused_coverage_cold_profiles_20260223T011534Z`) shows a strong first-hit regression when fused coverage is high:
  - `frontend_all` fused share `~0.9` with cold TTFT `704.176 ms` vs custom `4.215 ms`.
  - cold full latency `6595.157 ms` vs custom `246.306 ms`.
- Interpretation:
- fused frontend path is now measurable and reproducible with explicit coverage accounting.
- in current implementation, high fused coverage still regresses latency; bounded gating avoids worst regressions by keeping most calls on custom.
- next optimization target remains dynamic shape plan reuse/coverage so fused can be exercised without miss-build penalties.
Latest Key Findings (2026-02-23, Hybrid Shape-Gate Frontend Policy)
- Artifacts:
  - 3x startup probe: `prebuild_hybrid10_nopreload_probe_r{1,2,3}_20260223T002214Z.json`
  - 3x frontend matrix: `attn_backend_frontend_matrix_20260223T001959Z`
  - compare vs prior tuned no-gate baseline: `attn_backend_frontend_missmit_compare_20260223T002153Z`
  - broader-shape sanity (initial): `hybrid_shape_sanity_20260223T002857Z`
  - broader-shape sanity (bounded gate): `hybrid_shape_sanity_maxgate_20260223T003453Z`
  - 3x bounded-gate matrix: `attn_backend_frontend_matrix_20260223T003611Z`
  - bounded-gate compare vs prior hybrid: `attn_backend_frontend_missmit_compare_20260223T003734Z`
- Policy used:
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV=10`
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=10`
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128`
  - `TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV=10`
  - bounded-gate follow-up adds: `TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10`
- 3-run startup probe summary (qwen, fused frontend, no preload prompts):
  - startup->healthy: `2004.840 +/- 0.146 ms`
  - request TTFT: `4.955 +/- 0.011 ms`
  - request full latency: `242.673 +/- 0.352 ms`
- Delta vs prior tuned shape-prebuild no-gate probe (`prebuild_startup10_nopreload_probe_20260222T235944Z`):
  - startup->healthy: `7011.472 -> 2004.840 ms` (`3.497x` faster)
  - request TTFT: `5.826 -> 4.955 ms` (`1.176x` faster)
  - request full latency: `254.936 -> 242.673 ms` (`1.051x` faster)
- Matrix deltas vs prior tuned no-gate matrix (`attn_backend_frontend_matrix_20260223T000256Z`):
  - warm-fixed fused request mean: `22.265 -> 20.354 ms` (`1.094x` faster)
  - mixed fused request mean: `50.974 -> 47.904 ms` (`1.064x` faster)
  - cold fused TTFT: `5.819 -> 4.959 ms` (`1.173x` faster)
  - cold fused full latency: `254.146 -> 242.569 ms` (`1.048x` faster)
- Bounded-gate broader-shape follow-up (`TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10`):
  - broader-shape set mean full latency: `9974.576 -> 274.072 ms` (`36.395x` faster)
  - broader-shape set max full latency: `30654.303 -> 434.776 ms` (`70.504x` faster)
  - the fixed-profile matrix stayed near-parity vs the prior hybrid (`attn_backend_frontend_missmit_compare_20260223T003734Z`).
- Interpretation:
- the startup compile-burst tradeoff has been materially reduced while preserving low no-preload request-path latency.
- strict fused runs remain inference-valid with low-shape custom fallback (`inference.used=true`), so this is now the best prompt-independent frontend policy in this harness.
- the broader-shape limitation seen in the initial hybrid sanity (`hybrid_shape_sanity_20260223T002857Z`) is mitigated by the bounded-gate follow-up (`hybrid_shape_sanity_maxgate_20260223T003453Z`), which removes miss cascades by routing out-of-window shapes to custom.
- remaining work is wider fused coverage without fallback (dynamic shape-reuse/plan persistence).
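The bounded-gate policy above reduces to a simple dispatch predicate: seq1 decode shapes inside the prebuilt KV window go to the fused frontend, and everything else falls back to custom. A hypothetical sketch of that decision (env names are from this section; the real gate lives in the C runtime, and the function name here is invented):

```python
import os

def use_fused_frontend(seq_q: int, seq_kv: int, head_dim: int) -> bool:
    """Hypothetical sketch of the bounded shape-gate dispatch: only shapes
    inside the prebuilt window go to the fused cuDNN frontend; everything
    else stays on the custom kernels, avoiding plan-build miss cascades."""
    min_kv = int(os.environ.get("TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV", "10"))
    max_kv = int(os.environ.get("TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV", "10"))
    prebuilt_head_dim = int(
        os.environ.get("TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM", "128"))
    if seq_q != 1 or head_dim != prebuilt_head_dim:
        return False  # only seq1 decode shapes are prebuilt
    return min_kv <= seq_kv <= max_kv

print(use_fused_frontend(1, 10, 128))  # inside window -> fused
print(use_fused_frontend(1, 11, 128))  # past max gate -> custom fallback
```

With `MIN_SEQ_KV=MAX_SEQ_KV=10` this window is a single shape, which is exactly why the telemetry above shows fused shares near `0.03`: custom handles everything outside it.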
Latest Key Findings (2026-02-22, Commercial Root-Cause Grouped Analysis)
- Artifact: `commercial_gap_root_cause_20260222T222958Z`.
- Grouped on fairness-hardened splits (r4+r8) by provider/model/task-family.
- OpenAI `gpt-5.2` model-only (`paired_n=36`):
  - latency delta mean (external-internal): `-69.311 ms`, CI95 `[-193.985, 61.444]` (near parity/noise).
  - external controller overhead mean: `2.081 ms`; external model-hop mean: `1406.971 ms`.
- OpenAI `gpt-5.2` tool-only parity (`paired_n=12`):
  - latency delta mean: `+49.601 ms`, CI95 `[-162.047, 274.981]` (near parity/noise).
  - external controller overhead mean: `12.842 ms`; external model-hop mean: `2456.108 ms`.
- OpenRouter Sonnet 4.6 model-only (`paired_n=24`):
  - latency delta mean: `+204.883 ms`, CI95 `[-148.517, 683.114]` (near parity/noise).
  - external controller overhead mean: `2.254 ms`; external model-hop mean: `2220.251 ms`.
- Interpretation:
- current commercial control evidence does not show a statistically locked directional win/loss.
- controller overhead is small relative to model-hop variance; higher-N reruns are required before claiming directional commercial gap outcomes.
Latest Key Findings (2026-02-18, External Cold Canonical)
- Runtime cold total first response: `2342.996 ms`.
- PyTorch cold total first response: `8725.259 ms` (`3.724x` runtime).
- vLLM cold total first response: `25069.018 ms` (`10.7x` runtime).
- Ollama cold total first response: `3530.106 ms` (`1.507x` runtime).
- vLLM has the fastest request-path TTFT once healthy (`51.763 ms`), but startup (`24032.203 ms`) dominates end-to-end cold in this run.
Latest Key Findings (2026-02-18, External Cold Optimized Runtime)
- Runtime request full latency: `271.346 ms` (vs vLLM `1035.826 ms`).
- Runtime cold total first response: `2276.081 ms` (vs vLLM `28072.508 ms`).
- Runtime still trails vLLM in request TTFT (`91.596 ms` vs `51.725 ms`).
- This run was not yet token-parity (runtime decode steps were still 4 while the others used 48).
Latest Key Findings (2026-02-18, External Cold Token-Parity = 48, Pre-Fix)
- Runtime request full latency: `2518.142 ms` (vLLM: `1075.404 ms`).
- Runtime request TTFT: `91.207 ms` (vLLM: `51.310 ms`).
- Runtime cold total first response: `4522.345 ms` (vLLM: `28111.652 ms`, `6.216x` runtime advantage).
- The request-path gap remains: runtime per-token decode is now the dominant issue.
Latest Key Findings (2026-02-18, External Cold Token-Parity = 48, Decoder/Sampling Fix)
- Runtime request TTFT: `5.022 ms` (vLLM: `52.995 ms`, runtime `10.553x` faster).
- Runtime request full latency: `311.289 ms` (vLLM: `1094.517 ms`, runtime `3.516x` faster).
- Runtime cold total first response: `2316.048 ms` (vLLM: `25131.279 ms`, runtime `10.851x` better).
- Startup remained stable (`~2004.8 ms`) while the request-path bottleneck was removed.
- Confirmation rerun (runtime+vLLM) matched the direction: runtime `5.021/310.376/2314.581 ms` vs vLLM `51.655/1033.214/24065.623 ms` (TTFT/full/cold-total).
- Initial 3-run repeatability set (2026-02-18) showed TTFT `10.333x`, full `3.380x`, cold-total `10.688x`; this was superseded by the 2026-02-19 rerun below.
Latest Key Findings (2026-02-19, Qwen Cold Upload GPU-Convert Fix)
- A/B setup:
  - same G5 host, same runtime build, same `cold_first_hit` harness.
  - only toggle changed: `TRENI_TENSOR_CONVERT_GPU=0` (off) vs default on.
- Qwen results:
  - `full_latency_ms`: `1116.567 -> 238.740` (`4.677x` faster).
  - `decoder_tensor_upload`: `1007 ms -> 129 ms` (`7.806x` faster).
  - `decoder_tensor_convert`: `862 ms -> 6 ms` (`143.667x` faster).
  - `decoder_tensor_h2d`: `143 ms -> 121 ms` (`1.182x` faster).
  - startup + first response total: `2119.906 ms -> 1242.057 ms` (`1.707x` faster).
- Interpretation:
- the dominant cold bottleneck was CPU-side BF16/F16 conversion; moving conversion to GPU largely removed that bottleneck.
- External-cold runtime-only confirmation (2026-02-19, preload enabled, `max_tokens=48`):
  - startup-to-healthy: `2004.560 -> 1003.455 ms` (`1.997x` faster).
  - request full latency: `317.989 -> 317.276 ms` (effectively unchanged).
  - cold total first response: `2322.549 -> 1320.731 ms` (`1.759x` faster).
  - cold total first token: `2009.697 -> 1008.582 ms` (`1.993x` faster).
Latest Key Findings (2026-02-19, Runtime vs vLLM After GPU-Convert Fix, 3-run)
- Setup:
  - same G5 host and model family (Qwen 3B), token parity (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + vLLM (PyTorch/Ollama skipped in this repeatability set).
- Mean over 3 runs:
  - runtime TTFT `5.135 ms` vs vLLM `84.390 ms` (`16.433x` faster).
  - runtime request full `319.063 ms` vs vLLM `1111.463 ms` (`3.484x` faster).
  - runtime cold-total first response `1656.573 ms` vs vLLM `31151.892 ms` (`18.805x` better).
- Runs 2-3 only (post-first-run stabilization):
  - TTFT `17.211x`, full `3.416x`, cold-total `22.395x` in runtime's favor.
- Interpretation:
- after restoring vLLM env and rerunning on matched settings, runtime remains decisively ahead on request path and end-to-end cold total.
Latest Key Findings (2026-02-19, External Cold All-Backend Repeatability, GPU-Convert Fix2)
- Setup:
  - same G5 host, same model family (Qwen 3B), token parity (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + PyTorch + vLLM + Ollama.
- 3-run means (all runs):
  - runtime: startup `2339.131 ms`, TTFT `5.131 ms`, request full `318.315 ms`, cold-total first response `2657.447 ms`.
  - runtime-normalized ratios:
    - PyTorch: TTFT `115.313x`, full `7.508x`, cold-total `3.921x`.
    - vLLM: TTFT `16.091x`, full `3.852x`, cold-total `10.887x`.
    - Ollama: TTFT `2108.743x`, full `35.118x`, cold-total `4.584x`.
- Stable reference (runs 1-2):
  - runtime: startup `1003.915 ms`, TTFT `5.131 ms`, request full `317.290 ms`, cold-total first response `1321.205 ms`.
  - vLLM vs runtime (runs 1-2): TTFT `18.275x`, full `4.298x`, cold-total `21.875x`.
- Interpretation:
- runtime remains decisively ahead on request path and cold-total across all backends.
- one run had a startup/preload outlier (a `decoder_tensor_h2d` spike) that inflated the all-3-run startup mean.
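The "runtime-normalized ratios" used throughout these sections are each backend's metric divided by the runtime's. A sketch reproducing the vLLM row from this 3-run set; note the vLLM absolute means below are back-computed from the reported ratios for illustration, not taken from the artifact:

```python
def runtime_normalized(backend: dict, runtime: dict) -> dict:
    """Each backend metric divided by the runtime metric (>1 means slower)."""
    return {k: backend[k] / runtime[k] for k in runtime}

# Runtime 3-run means from this section.
runtime_means = {"ttft_ms": 5.131, "full_ms": 318.315, "cold_total_ms": 2657.447}
# vLLM means implied by the reported ratios (illustrative reconstruction).
vllm_means = {"ttft_ms": 82.563, "full_ms": 1226.149, "cold_total_ms": 28931.63}

ratios = runtime_normalized(vllm_means, runtime_means)
print({k: round(v, 3) for k, v in ratios.items()})
```

The same helper applied to the PyTorch and Ollama means yields the other two rows; the Ollama TTFT ratio is so large because Ollama's first token includes model load in this harness.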
Latest Key Findings (2026-02-19, External Cold All-Backend Repeatability, GPU-Convert + Host-Prefetch Fix)
- Setup:
  - same G5 host, same model family (Qwen 3B), token parity (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + PyTorch + vLLM + Ollama.
  - runtime cold path change: `TRENI_TENSOR_HOST_PREFETCH=1` with host-page `MADV_WILLNEED` on large tensor ranges.
- 3-run means:
  - runtime: startup `1003.836 ms`, TTFT `5.130 ms`, request full `316.403 ms`, cold-total first response `1320.240 ms`.
  - runtime-normalized ratios:
    - PyTorch: TTFT `108.567x`, full `7.341x`, cold-total `14.601x`.
    - vLLM: TTFT `16.537x`, full `3.896x`, cold-total `21.918x`.
    - Ollama: TTFT `514.414x`, full `9.471x`, cold-total `3.029x`.
- Runtime-only 5-run stability comparison (before vs after host-prefetch):
  - startup max: `3006.388 -> 1003.627 ms`.
  - cold-total first response max: `3324.212 -> 1322.338 ms`.
  - decoder tensor h2d max: `1869.296 -> 120.671 ms`.
  - decoder tensor upload max: `1877.485 -> 128.777 ms`.
- Interpretation:
- the intermittent preload upload outlier is removed in this sweep while request-path lead is preserved.
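The host-prefetch fix amounts to hinting the kernel about upcoming reads before the upload path touches the pages, so first-touch page faults do not land inside timed stages. A Python analogue of the idea (the runtime does this in C via `madvise`; the file, offsets, and sizes here are illustrative):

```python
import mmap
import tempfile

def prefetch_tensor_range(mm: mmap.mmap, offset: int, length: int) -> None:
    """Sketch of the TRENI_TENSOR_HOST_PREFETCH idea: advise the kernel
    (MADV_WILLNEED) that a large tensor range will be read soon, so the
    pages are faulted in ahead of the upload path. Linux-only semantics."""
    page = mmap.PAGESIZE
    start = (offset // page) * page  # madvise requires a page-aligned start
    mm.madvise(mmap.MADV_WILLNEED, start, (offset + length) - start)

# Demo on a throwaway file standing in for a weights blob.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"\x00" * (1 << 20))
    f.flush()
    with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
        prefetch_tensor_range(mm, offset=4096, length=512 * 1024)
        _ = mm[4096:4160]  # subsequent reads hit already-resident pages
```

This is purely a hint: it changes when the page-in work happens, not whether it happens, which is exactly why it shows up as reduced variance (outlier removal) rather than a lower steady-state mean.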
Latest Key Findings (2026-02-24, External Cold Repeatability After Seq1 Multi-Head Default)
- Setup:
  - same G5 host class and token-parity prompt budget (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + PyTorch + vLLM (Ollama skipped in this rerun host environment).
- 3-run means:
  - runtime: startup `1003.315 ms`, TTFT `4.022 ms`, request full `239.277 ms`, cold-total first response `1242.592 ms`.
  - runtime-normalized ratios:
    - PyTorch: TTFT `127.900x`, full `9.378x`, cold-total `6.320x`.
    - vLLM: TTFT `12.350x`, full `4.139x`, cold-total `19.333x`.
- Delta vs prior host-prefetch repeatability (2026-02-19, 3-run means):
  - runtime TTFT: `5.130 -> 4.022 ms` (`1.275x` faster).
  - runtime request full: `316.403 -> 239.277 ms` (`1.322x` faster).
  - runtime cold-total first response: `1320.240 -> 1242.592 ms` (`1.062x` faster).
- Interpretation:
- after default-on seq1 multi-head promotion, runtime keeps a large cross-system margin and also improved its own cold request path vs the prior repeatability baseline.
Latest Key Findings (2026-02-24, External Cold Repeatability After Step0 Exp-Reuse Patch)
- Setup:
  - same G5 host class and token-parity budget (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + PyTorch + vLLM (Ollama skipped in this host environment).
  - custom-kernel change: the seq1 multi-head softmax/PV path now reuses normalized probabilities rather than recomputing `exp` in the inner PV loop.
- 3-run means:
  - runtime: startup `1003.287 ms`, TTFT `4.018 ms`, request full `238.400 ms`, cold-total first response `1241.688 ms`.
  - runtime-normalized ratios:
    - PyTorch: TTFT `126.786x`, full `9.374x`, cold-total `6.320x`.
    - vLLM: TTFT `12.545x`, full `4.184x`, cold-total `19.622x`.
- Delta vs immediate pre-patch repeatability baseline (`external_cold_seq1mh_default_repeatability_20260224T192020Z`):
  - runtime TTFT: `4.022 -> 4.018 ms` (`-0.004 ms`)
  - runtime request full: `239.277 -> 238.400 ms` (`-0.877 ms`)
  - runtime cold-total first response: `1242.592 -> 1241.688 ms` (`-0.904 ms`)
- Interpretation:
- this step0 patch is valid and non-regressing with a small positive shift.
- next gains likely require deeper reduction-path/launch-structure work in `decoder_step0_layers`, not only exp reuse.
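The exp-reuse structure can be illustrated in plain Python: compute the softmax normalization once, then run the PV accumulation as pure multiply-adds over the stored probabilities, instead of re-deriving `exp(s - max)` inside the inner loop. This is an illustrative sketch only; the actual kernel is CUDA and operates per head:

```python
import math

def seq1_softmax_pv(scores, values):
    """Seq1 softmax + PV with exp computed once (the 'exp reuse' structure).
    scores: attention scores for one query against seq_kv keys.
    values:  seq_kv value vectors; returns the probability-weighted sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # single exp pass
    inv_denom = 1.0 / sum(exps)
    probs = [e * inv_denom for e in exps]     # normalized once, then reused
    dim = len(values[0])
    out = [0.0] * dim
    for p, v in zip(probs, values):           # PV loop: multiply-add only
        for d in range(dim):
            out[d] += p * v[d]
    return out

out = seq1_softmax_pv([0.2, 0.1, -0.3], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The small measured gain is consistent with this being a per-inner-iteration transcendental removed from an already short seq1 loop.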
Latest Key Findings (2026-02-24, External Cold Repeatability After Step0 Shared-Prob Follow-Up)
- Setup:
  - same G5 host class and token-parity budget (`max_tokens=48`), runtime preload enabled.
  - backends run: runtime + PyTorch + vLLM (Ollama skipped in this host environment).
  - follow-up change: cached per-head seq1 probabilities in shared memory inside the multi-head softmax/PV path.
- 3-run means:
  - runtime: TTFT `4.019 ms`, request full `238.678 ms`, cold-total first response `1241.970 ms`.
- Delta vs immediate `step0expfix` run:
  - TTFT: `4.018 -> 4.019 ms` (`+0.001 ms`).
  - request full: `238.400 -> 238.678 ms` (`+0.278 ms`).
  - cold-total first response: `1241.688 -> 1241.970 ms` (`+0.282 ms`).
- Interpretation:
  - this follow-up did not beat the prior exp-reuse patch.
  - the path was reverted; the current best remains `step0expfix`.
Latest Key Findings (2026-02-18, Routing Failure-Amplification Stress)
- Stress profile: injected tool `503` every 2nd request, injected tool timeout every 3rd request, controller tool timeout `0.25s`, controller tool retries `1`.
- Internal mean latency: `76.071 ms`.
- External mean latency: `109.806 ms` (`1.443x` external/internal).
- Internal error rate: `0.0000`.
- External error rate: `0.0833` (4 tool-hop failures over 48 requests).
- External/internal error-rate ratio: `inf` (external errored while internal did not).
- Retry signal: external tool retries mean `0.182`; taxonomy shows `tool_hop_failed=4`.
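The `inf` ratio above follows from dividing a non-zero external error rate by a zero internal one. A small sketch of that bookkeeping (the `1.0` convention for the zero/zero case is an assumption, not documented harness behavior):

```python
import math

def error_rate_ratio(external_errors, internal_errors, requests):
    """External/internal error-rate ratio, with the inf convention used above."""
    ext = external_errors / requests
    inte = internal_errors / requests
    if inte == 0.0:
        # external errored while internal did not -> inf;
        # both clean -> 1.0 (assumed convention for this sketch)
        return math.inf if ext > 0.0 else 1.0
    return ext / inte

print(error_rate_ratio(4, 0, 48))   # -> inf (the 0.0833 vs 0.0000 case above)
```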
Latest Key Findings (2026-02-19, Routing Matrix Expansion, G5)
- Matrix set: 6 profiles (`p00` baseline + `p01..p05` stress variants).
- Baseline profile: external/internal latency ratio `1.0420x`, external error rate `0.0000`.
- Mild timeout profile (`p02`): ratio `1.1420x`, external error rate `0.0000`.
- Mixed moderate profile (`p03`): ratio `1.1640x`, external error rate `0.0417`.
- Mixed aggressive profile (`p04`): ratio `1.4360x`, external error rate `0.0833`.
- Mixed aggressive + retry2 (`p05`): ratio `1.4160x`, external error rate `0.0833`.
- Internal error rate stayed `0.0000` across all 6 profiles.
- Interpretation: external-path degradation scales with timeout/failure pressure; extra retries reduce some retry counts but do not close the latency/error gap.
Latest Key Findings (2026-02-19, Routing Cross-Host Pilot)
- Topology:
  - local benchmark client
  - SSH tunnel to G5 host
  - runtime and external router on G5 host
- Baseline profile (`crosshost-p00-baseline`, 12 runs):
  - internal mean: `1071.477 ms`
  - external mean: `1059.478 ms`
  - external/internal ratio: `0.989x`
  - error rates: internal `0.0000`, external `0.0000`
- Mild-timeout profile (`crosshost-p02-timeout-mild`, 12 runs):
  - internal mean: `1054.123 ms`
  - external mean: `1123.393 ms`
  - external/internal ratio: `1.066x`
  - error rates: internal `0.0000`, external `0.0000`
  - external tool retries mean: `0.083`
- Stress profile (`crosshost-p04-stress`, 12 runs, fail/timeout injection):
  - internal mean: `1056.013 ms`
  - external mean: `1100.010 ms`
  - external/internal ratio: `1.042x`
  - error rates: internal `0.0000`, external `0.0833`
  - external tool retries mean: `0.182`
- Interpretation:
  - under cross-host stress, the external path again degrades in both latency and errors while the internal path remains error-free.
  - this is a pilot sanity check; canonical Track B completion is the split-host matrix below.
Latest Key Findings (2026-02-19, Routing Split-Host Matrix, Canonical Track B)
- Topology:
  - GPU host: runtime endpoint
  - CPU host: external controller + tool services
  - same-VPC private-network runtime calls from controller/tool to runtime
- Matrix set: 6 profiles (`splithost-p00` baseline + `splithost-p01..p05` stress variants), each with 12 runs.
- Baseline (`splithost-p00-baseline`):
  - internal mean `1052.392 ms`
  - external mean `1046.702 ms`
  - ratio `0.995x`
  - external error `0.0000`
- Mild fail (`splithost-p01_fail_mild`): ratio `0.998x`, external error `0.0000`, external tool retries `0.021`.
- Mild timeout (`splithost-p02_timeout_mild`): ratio `1.042x`, external error `0.0000`, external tool retries `0.021`.
- Mixed moderate (`splithost-p03_mixed_moderate`): ratio `1.001x`, external error `0.0417`, external tool retries `0.065`.
- Mixed aggressive (`splithost-p04_mixed_aggressive`): ratio `1.087x`, external error `0.0833`, external tool retries `0.182`.
- Mixed aggressive + retry2 (`splithost-p05_mixed_aggressive_retry2`): ratio `1.045x`, external error `0.0833`, external tool retries `0.091`.
- Matrix-wide summary:
  - external/internal latency ratio mean `1.028x`
  - internal error mean `0.0000`
  - external error mean `0.0347`
- Interpretation:
  - the split-host matrix confirms the same failure-amplification shape: baseline is near parity, but under timeout/failure pressure the external path degrades in latency and error rate while the internal path remains error-free.
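The matrix-wide summary is a plain mean over the six per-profile results. Reproducing it from the published split-host numbers:

```python
# Per-profile values as published in the split-host matrix above.
profiles = {
    "splithost-p00-baseline":               {"ratio": 0.995, "ext_err": 0.0000},
    "splithost-p01_fail_mild":              {"ratio": 0.998, "ext_err": 0.0000},
    "splithost-p02_timeout_mild":           {"ratio": 1.042, "ext_err": 0.0000},
    "splithost-p03_mixed_moderate":         {"ratio": 1.001, "ext_err": 0.0417},
    "splithost-p04_mixed_aggressive":       {"ratio": 1.087, "ext_err": 0.0833},
    "splithost-p05_mixed_aggressive_retry2":{"ratio": 1.045, "ext_err": 0.0833},
}
ratio_mean = sum(p["ratio"] for p in profiles.values()) / len(profiles)
err_mean = sum(p["ext_err"] for p in profiles.values()) / len(profiles)
print(round(ratio_mean, 3), round(err_mean, 4))  # 1.028 0.0347
```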
Latest Key Findings (2026-02-20, Internet Multi-Hop Matrix on Commercial APIs)
- Topology:
  - local benchmark client
  - Fly.io-hosted external controller/tool hop
  - commercial model runtime endpoints (`api.openai.com`, `openrouter.ai`)
- OpenAI (`gpt-5.2`, 3 profiles, `runs=3`):
  - matrix mean external/internal ratio: `1.1123x`
  - baseline profile: `1.110x`
  - timeout-mild profile: `1.082x`
  - mixed-aggressive profile: `1.145x`
  - internal error rate `0.0000`; mixed-aggressive external error rate `0.0833`
- OpenRouter (`openai/gpt-5.2`, 3 profiles, `runs=3`):
  - matrix mean external/internal ratio: `0.7553x`
  - baseline profile: `0.686x`
  - timeout-mild profile: `0.891x`
  - mixed-aggressive profile: `0.689x`
  - internal error rate `0.0000`; mixed-aggressive external error rate `0.1667`
- OpenRouter (`anthropic/claude-sonnet-4.6`, 3 profiles, `runs=3`):
  - matrix mean external/internal ratio: `1.0277x`
  - baseline profile: `1.236x`
  - timeout-mild profile: `0.968x`
  - mixed-aggressive profile: `0.879x`
  - internal error rate `0.0000`; mixed-aggressive external error rate `0.1667`
- Interpretation:
  - the OpenAI matrix supports the routing thesis directionally: public-network external hops are slower and less reliable under stress.
  - OpenRouter remains non-canonical for Track B direction claims in this topology due to mixed/inverted profile direction and elevated external errors.
Latest Key Findings (2026-02-20, Local Control Matrix, No Fly Scheduler Path)
- Topology:
  - local benchmark client
  - local standalone external controller/tool server
  - commercial model runtime endpoints
- OpenAI (`gpt-5.2`, `runs=3`, 3 profiles):
  - matrix mean external/internal ratio: `0.9867x`
  - profiles: baseline `0.995x`, timeout-mild `0.977x`, mixed-aggressive `0.988x`
  - external error mean `0.0313`
- OpenRouter (`anthropic/claude-sonnet-4.6`, `runs=3`, 3 profiles):
  - matrix mean external/internal ratio: `1.0663x`
  - profiles: baseline `1.055x`, timeout-mild `1.141x`, mixed-aggressive `1.003x`
  - external error mean `0.0313`
- Interpretation:
  - with higher-N local controls, OpenAI is near parity while OpenRouter Sonnet trends in the expected direction (external > internal).
  - external errors still appear under stress, while the internal path stayed error-free in these runs.
Latest Key Findings (2026-02-20, Task-Family Parity Split, Local Control, runs=8)
- OpenAI `gpt-5.2`:
  - `model_only`: external/internal `0.958x` (slight inversion, near parity).
  - `tool_only`: external/internal `1.136x` (external slower).
- OpenRouter `anthropic/claude-sonnet-4.6`:
  - `model_only`: external/internal `1.044x` (external slower).
  - `tool_only`: external/internal `1.051x` (external slower).
- Errors:
  - all four task-family runs recorded `0` internal errors and `0` external errors.
- Interpretation:
  - when isolating tool-required tasks, the architecture hypothesis holds on both providers.
  - `model_only` remains provider-sensitive; OpenAI is close to parity while Sonnet keeps the external path slower.
Latest Key Findings (2026-02-19, Phase 3 Canonical G5 Baseline, 3 seeds)
- Internal success rate mean: `1.0000`.
- External success rate mean: `0.9006`.
- External/internal latency ratio mean: `16.0603x`.
- External/internal steps ratio mean: `1.8147x`.
- Scenario split (external success mean):
  - retrieval correction: `1.0000`
  - tool-state adaptation: `0.7417`
  - confidence-gated branching: `1.0000`
Latest Key Findings (2026-02-19, Phase 3 Canonical G5 Stress, 3 seeds)
- Stress profile: tool fail every 9th request, timeout sleep every 11th request (`1.1s`), controller timeout `0.35s`, controller retries `2`.
- Internal success rate mean: `1.0000`.
- External success rate mean: `0.8782`.
- External/internal latency ratio mean: `77.1703x`.
- External/internal steps ratio mean: `1.8240x`.
- Scenario split (external success mean):
  - retrieval correction: `1.0000`
  - tool-state adaptation: `0.6833`
  - confidence-gated branching: `1.0000`
Latest Key Findings (2026-02-19, Phase 3 Uncertainty Ablation, baseline runs=8)
- Uncertainty enabled vs disabled changes success materially (same tasks, same hardware):
  - internal success: `1.0000 -> 0.7692` when internal uncertainty is disabled (`-0.2308`).
  - external success: `0.8846 -> 0.6538` when external uncertainty is disabled (`-0.2308`).
- Direction is consistent across all uncertainty sources:
  - `normalized_logprob`
  - `raw_logit_margin`
  - `hybrid`
- Interpretation:
  - uncertainty-aware branching improves loop-task completion in this benchmark.
  - this was the first-pass synthetic-signal proof; the runtime-native canonical rerun is published separately below.
Latest Key Findings (2026-02-19, Phase 3 Uncertainty Ablation Repeatability + Stress)
- Baseline repeatability set (3 seeds) confirms stable uncertainty gains:
  - internal uncertainty-on success delta mean: `+0.2308` (all three sources).
  - external uncertainty-on success delta mean: `+0.2308` (all three sources).
- Stress repeatability set (3 seeds, tool fail every 9, timeout every 11, sleep `1.1s`, controller timeout `0.35s`, retries `2`) shows:
  - internal uncertainty-on success delta mean: `+0.2308` (all three sources).
  - external uncertainty-on success delta mean: `+0.2212` (all three sources).
- Stress minus baseline:
  - internal uncertainty gain change: `0.0000`.
  - external uncertainty gain change: `-0.0096`.
- Interpretation:
  - the uncertainty-aware branching benefit is stable in this harness under both normal and stressed routing conditions.
  - the synthetic-source result had an initial runtime-native corroboration; the final canonical interpretation now uses the calibrated zero-fallback rerun below.
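The "uncertainty-on success delta" used throughout these sections is the success rate with the uncertainty signal enabled minus the matched disabled arm, averaged over seeds. A sketch with illustrative per-seed arms that reproduce the `+0.2308` baseline delta:

```python
def uncertainty_delta(on_success, off_success):
    """Mean over seeds of (uncertainty-on success - uncertainty-off success)."""
    assert len(on_success) == len(off_success)
    per_seed = [on - off for on, off in zip(on_success, off_success)]
    return sum(per_seed) / len(per_seed)

# illustrative 3-seed arms matching the published 1.0000 vs 0.7692 internal split
delta = uncertainty_delta([1.0, 1.0, 1.0], [0.7692, 0.7692, 0.7692])
print(round(delta, 4))  # 0.2308
```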
Latest Key Findings (2026-02-19, Phase 3 Uncertainty Ablation Runtime-Native Canonical Rerun, Superseded)
- Root-cause fix before rerun:
  - greedy decode uncertainty in `/monolith/compute/sample.cu` was emitting flat zeros (`mean_logprob=0`, `mean_entropy=0`).
  - the patched greedy path now computes logprob + entropy from logits (log-sum-exp).
- Initial runtime-native rerun (`runs=8`, seeds 7/11/19) showed positive uncertainty-on deltas:
  - baseline internal uncertainty success delta: `+0.1026`
  - baseline external uncertainty success delta: `+0.1155`
  - stress internal uncertainty success delta: `+0.2308`
  - stress external uncertainty success delta: `+0.2212`
- Runtime-native `int_on_ext_on` arm means:
  - baseline: internal success `0.8718`, external success `0.7853`, ext/int latency `10.9504x`
  - stress: internal success `1.0000`, external success `0.8782`, ext/int latency `74.1471x`
- Interpretation:
  - this run confirmed runtime-native plumbing after the kernel fix, but was later superseded due to fallback contamination in part of the seed set.
  - the runtime API now emits a unified `awareness` payload with both `route` and `generation` uncertainty sections (legacy `uncertainty` preserved); the Phase 3 runtime-native client now consumes `awareness.generation` first when present.
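The greedy-path fix above derives the chosen token's logprob and the step entropy from raw logits via a stable log-sum-exp, instead of emitting flat zeros. A numpy sketch of that math (the actual patch lives in CUDA in `/monolith/compute/sample.cu`; this mirrors only the per-step arithmetic):

```python
import numpy as np

def greedy_uncertainty(logits):
    """Greedy pick plus logprob/entropy derived from logits via log-sum-exp."""
    m = logits.max()
    lse = m + np.log(np.exp(logits - m).sum())   # numerically stable log-sum-exp
    logprobs = logits - lse                      # log softmax
    probs = np.exp(logprobs)
    token = int(logits.argmax())                 # greedy selection
    entropy = float(-(probs * logprobs).sum())   # distribution entropy (nats)
    return token, float(logprobs[token]), entropy

token, lp, ent = greedy_uncertainty(np.array([2.0, 1.0, 0.5, 0.5]))
assert token == 0 and lp < 0.0 and ent > 0.0
```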
Latest Key Findings (2026-02-20, Phase 3 Runtime-Native Rerun with Unified Awareness, Quality-Gated)
- Rerun profile:
  - source: `runtime_native` only
  - seeds: 7/11/19
  - baseline + stress
  - runtime configured with fast probe path (`TRENI_DEMO_LAYERS=2`)
  - client consumes unified `awareness.generation` first (legacy fallback preserved)
- Probe quality gate:
  - all runtime-native arm artifacts in this rerun have `fallback=0`, `errors=0`, and non-zero `requests`/`ok`.
- Clean rerun deltas (runtime-native):
  - baseline internal uncertainty success delta: `-0.1538`
  - baseline external uncertainty success delta: `-0.1217`
  - stress internal uncertainty success delta: `-0.1538`
  - stress external uncertainty success delta: `-0.1089`
  - stress-baseline external uncertainty delta change: `+0.0128`
- Important interpretation correction:
  - previously published positive runtime-native deltas were influenced by runtime probe fallback in part of the seed set (notably s11/s19 in older fix1 artifacts).
  - with zero-fallback runtime-native probes, this awareness3 rerun showed uncertainty-on was not yet beneficial in this harness.
  - this kept the runtime-native uncertainty wiring validated and motivated the calibration pass documented below.
Latest Key Findings (2026-02-20, Phase 3 Runtime-Native Calibration Rerun calib1, Quality-Gated)
- Calibration update:
  - runtime-native confidence is now calibrated/blended for decision usage (runtime confidence floor/ceil scaling + prior blend + optional route blend), while preserving raw runtime fields.
  - the runner now forwards calibration knobs through the ablation harness to child benchmark runs.
- Canonical rerun profile:
  - source: `runtime_native` only
  - seeds: 7/11/19
  - baseline + stress
  - calibration params: prior weight `0.75`, confidence floor `0.10`, confidence ceil `0.35`, route blend `0.10`
- Probe quality gate:
  - all runtime-native arm artifacts in this rerun have non-zero `requests`/`ok` and `fallback=0`, `errors=0`.
- Calibrated rerun deltas (runtime-native):
  - baseline internal uncertainty success delta: `+0.1539`
  - baseline external uncertainty success delta: `+0.1058`
  - stress internal uncertainty success delta: `+0.1539`
  - stress external uncertainty success delta: `+0.1154`
  - stress-baseline external uncertainty delta change: `+0.0096`
- Interpretation:
  - calibrated runtime-native uncertainty recovers positive uncertainty-on gains in both baseline and stress while staying zero-fallback.
  - C2 is re-locked for the current harness; core phases are complete, and remaining optional work is region-pinned internet-hop controls (plus higher-N where needed).
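The exact calibration formula is not reproduced in this report, so the following is only a hedged sketch of one plausible shape for the described floor/ceil rescale plus prior and route blending, using the published knob values; the `calibrate` function and its argument names are illustrative, not the runtime's API:

```python
def calibrate(raw_conf, prior, route_conf,
              prior_w=0.75, floor=0.10, ceil=0.35, route_w=0.10):
    """Hypothetical calib1-style shaping: rescale, then blend with prior/route."""
    # rescale raw runtime confidence from [floor, ceil] into [0, 1]
    scaled = min(max((raw_conf - floor) / (ceil - floor), 0.0), 1.0)
    # blend with a task prior, then optionally with route-level confidence
    blended = prior_w * prior + (1.0 - prior_w) * scaled
    return (1.0 - route_w) * blended + route_w * route_conf

c = calibrate(raw_conf=0.22, prior=0.8, route_conf=0.5)
assert 0.0 <= c <= 1.0
```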
Latest Key Findings (2026-02-20, Phase 4 Lambda Full Reruns + Paper Package)
- A100 (Phase 2):
  - cold summary: startup `1002.708 ms`, TTFT `29.657 ms`, full `32.008 ms`
  - warm request latency: mean `10.356 ms`, p99 `14.536 ms`
  - routing matrix overall: external/internal `2.4300x`, external error `0.0347`, internal error `0.0000`
- H100 (Phase 2):
  - cold summary: startup `1004.890 ms`, TTFT `56.944 ms`, full `62.064 ms`
  - warm request latency: mean `18.491 ms`, p99 `24.944 ms`
  - routing matrix overall: external/internal `2.3972x`, external error `0.0347`, internal error `0.0000`
- A100 + H100 C2 runtime-native calibrated deltas:
  - baseline internal/external: `+0.2308 / +0.2308`
  - stress internal/external: `+0.2308 / +0.2212`
- Paper package is generated and published at:
Latest Key Findings (2026-02-20, Track B Fairness-Hardened Commercial Reruns, Local Control r8)
- Harness changes applied and validated:
  - interleaved internal/external ordering (`pair_order=alternate`)
  - deterministic generation default (`temperature=0`)
  - token-usage export and `ms/completion_token` normalization
  - strict tool parity enabled for `tool_only` runs
- OpenAI `gpt-5.2`:
  - model-only ext/int: `0.971x` (near parity/slight inversion remains)
  - tool-only ext/int (strict parity): `1.038x` (internal faster)
  - model-only ms/token internal/external: `57.657 / 57.663` (effectively tied)
  - tool-only ms/token internal/external: `37.553 / 38.990` (internal better)
- OpenRouter `anthropic/claude-sonnet-4.6`:
  - model-only ext/int: `1.102x` (internal faster)
  - tool-only ext/int (strict parity): `1.063x` (internal faster)
  - model-only ms/token internal/external: `61.606 / 70.054` (internal better)
  - tool-only ms/token internal/external: `41.212 / 43.791` (internal better)
- Interpretation update:
  - fairness hardening removes most of the ambiguity for tool tasks; `tool_only` now favors internal on both providers.
  - OpenAI `model_only` remains near parity/provider-sensitive, so claim language stays task-family-stratified.
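The `ms/completion_token` normalization divides each request's latency by the number of completion tokens it produced, so answer-length differences do not masquerade as speed differences. A sketch with illustrative latencies and token counts (chosen to land near the published OpenAI model-only values; the zero-token guard behavior is an assumption):

```python
def ms_per_completion_token(latency_ms, completion_tokens):
    """Length-normalized latency; the harness's exact zero-token handling is assumed."""
    if completion_tokens <= 0:
        return None
    return latency_ms / completion_tokens

# two hypothetical runs with different answer lengths compare fairly per token
a = ms_per_completion_token(2306.3, 40)   # ~57.66 ms/token
b = ms_per_completion_token(4036.4, 70)   # ~57.66 ms/token
assert abs(a - b) < 0.5
```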
Latest Key Findings (2026-02-22, AWS G5 TTFT Kernel Pass)
- Matched setup:
  - same AWS G5 host (`g5.2xlarge`, A10G), same container, same benchmark harness.
  - baseline reference: `lt0_sync0` post-cache (`TRENI_LINEAR_USE_LT=0`, `TRENI_TENSOR_UPLOAD_SYNC=0`).
  - TTFT pass:
    - softmax/reduction parallelization (near-parity result),
    - norm kernel rewrite (`rmsnorm`/`layernorm` from single-thread to row-parallel 256-thread reductions).
- Best measured config in this pass: `norm+softmax`, `TRENI_LINEAR_USE_LT=1`, `TRENI_TENSOR_UPLOAD_SYNC=0`.
- Baseline -> best deltas:
  - cold TTFT: `16.738 -> 13.974 ms` (`1.198x` faster).
  - cold full latency: `424.685 -> 396.814 ms` (`1.070x` faster).
  - warm mean latency: `174.237 -> 147.269 ms` (`1.183x` faster).
  - warm p99 latency: `1035.823 -> 936.297 ms` (`1.106x` faster).
- Per-model cold TTFT signal:
  - `qwen`: `39.537 -> 29.411 ms` (dominant gain).
  - `donut`: `3.505 -> 2.619 ms`.
  - `bart`: near-flat (`16.523 -> 16.573 ms`), so the seq2seq-specific step0 bottleneck still needs isolation.
- Interpretation:
  - linear GEMM plumbing is no longer the main limiter for this run profile.
  - norm/reduction work gives a real TTFT lift; next targeted work should isolate the residual Bart/seq2seq path.
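The norm rewrite changes only the execution strategy (one thread per row becomes a 256-thread parallel reduction per row); the per-row math is unchanged. A numpy reference for the `rmsnorm` semantics, where the `eps` placement and weight handling are assumptions of this sketch rather than confirmed kernel details:

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    """Per-row RMS normalization: one mean-of-squares reduction, then scale."""
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.random.default_rng(1).normal(size=(4, 256)).astype(np.float32)
w = np.ones(256, dtype=np.float32)
y = rmsnorm(x, w)
assert y.shape == x.shape
```

The GPU version parallelizes the `(x * x).mean(...)` reduction across 256 threads per row, which is where the TTFT lift comes from.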
Latest Key Findings (2026-02-22, AWS G5 TTFT Follow-Up: seq_q=1 Attention Path)
- Follow-up work after step0 profiling:
  - profile gate `TRENI_STEP0_PROFILE=1` added for the stage split (`decoder_step0_embed`, `decoder_step0_layers`, `decoder_step0_logits_sample`).
  - the seq2seq/Bart profile showed step0 dominated by `decoder_step0_layers` (not embedding/logits).
  - implemented tiny-shape `seq_q=1` attention kernels (QK + PV) and direct K/V projection-to-cache in the decoder step path.
  - `TRENI_ATTN_SEQ1_USE_KERNEL` now defaults to on (`1`; set `0` to force the cuBLAS fallback).
- Previous best (`norm+softmax`, `lt1_sync0`) -> new default path:
  - cold TTFT: `13.974 -> 12.504 ms` (`1.118x` faster).
  - cold full latency: `396.814 -> 390.099 ms` (`1.017x` faster).
  - warm mean latency: `147.269 -> 143.230 ms` (`1.028x` faster).
  - warm p99 latency: `936.297 -> 924.276 ms` (`1.013x` faster).
- Bart-specific impact:
  - cold TTFT `16.573 -> 12.842 ms` (`1.29x` faster).
- 3-seed repeatability on the new default path:
  - cold TTFT `12.563 ± 0.037 ms`.
  - cold full `390.961 ± 0.270 ms`.
  - warm mean `143.297 ± 0.222 ms`.
  - warm p99 `925.668 ± 1.070 ms`.
- Parity status note:
  - the parser now classifies interleaved runtime logs correctly (fallback/failure markers are detected even when stderr is merged into tensor lines).
  - a debug rerun identified the old-container root cause: `minilm` used out-of-bounds tensor offsets in `monolith_phase3.bin`.
  - the strict gate is now resolved with the rebuilt parity container `monolith_phase3_qbm.bin` (qwen+bart+minilm): `week3_parity_qbm_report_20260222T132155Z.json` => `checked_total=3`, `failed_total=0`, `missing_decoder_models=[]`, `missing_encoder_models=[]`.
  - the runtime-on/off Bart step0 logits A/B remains numerically stable (max abs diff `~2e-6`, cosine `~1.0`).
- Interpretation:
  - this confirms the residual TTFT loss was in tiny decode-attention execution overhead, not the linear GEMM path.
  - the seq2seq step0 path now moved in the expected direction while preserving gains from the prior norm pass.
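The tiny-shape `seq_q=1` case is why dedicated kernels pay off: with a single query row, QK^T collapses to one dot product per cached key and PV to one weighted sum over cached values, so general batched-GEMM machinery is overkill. A numpy sketch of the decode-step math (shapes illustrative; the real kernels run per head in CUDA):

```python
import numpy as np

def seq1_attention(q, k_cache, v_cache):
    """Single-query decode attention: QK dot products, softmax, PV weighted sum."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (k_cache @ q) * scale          # QK: one dot product per cached key
    scores -= scores.max()                  # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum()
    return p @ v_cache                      # PV: weighted sum of cached values

rng = np.random.default_rng(2)
q = rng.normal(size=64)                     # the single decode-step query
k = rng.normal(size=(10, 64))               # cached keys
v = rng.normal(size=(10, 64))               # cached values
out = seq1_attention(q, k, v)
assert out.shape == (64,)
```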
Latest Key Findings (2026-02-22, AWS G5 Attention Backend A/B, Deconfounded)
- Setup:
  - runtime rebuilt with `WITH_CUDNN=1` and `TRENI_ATTN_BACKEND_STRICT=1`.
  - compared `TRENI_ATTN_BACKEND=custom` vs `TRENI_ATTN_BACKEND=cudnn_sdpa` in explicit proxy mode (`TRENI_ATTN_ALLOW_SDPA_PROXY=1`).
  - the reverse-order rerun is used as canonical to remove first-run cold-cache bias.
- Reverse-order canonical (`attn_backend_ab_rev_20260222T144736Z`):
  - cold TTFT: custom `6.460 ms`, cudnn `6.447 ms` (custom/cudnn `1.002x`).
  - cold full: custom `147.789 ms`, cudnn `146.707 ms` (`1.007x`).
  - warm mean: custom `53.545 ms`, cudnn `53.341 ms` (`1.004x`).
  - warm p99: custom `82.031 ms`, cudnn `80.754 ms` (`1.016x`).
- Interpretation:
  - legacy proxy mode is near parity/slightly faster in this decomposition.
  - the runtime now treats `cudnn_sdpa` as fused-only by default; proxy behavior is explicit opt-in (`TRENI_ATTN_ALLOW_SDPA_PROXY=1`).
  - a true fused cuDNN SDPA/flash-attention path is still pending.
Latest Key Findings (2026-02-22, AWS G5 Seq1 Hybrid Tuning + Fused Follow-Up)
- Setup:
  - runtime rebuilt with seq1-path tuning changes:
    - specialized `seq_q=1` softmax kernel
    - one-time cached attention env-config reads
    - optional hybrid knobs: `TRENI_ATTN_SEQ1_USE_CUBLAS_QK`, `TRENI_ATTN_SEQ1_USE_CUBLAS_PV`
  - warm matrix (12 runs, 4 warmups): default vs qk-cublas vs pv-cublas vs both-cublas.
- Warm results (`seq1_hybrid_20260222T1554Z`):
  - default: mean `54.505 ms`, p99 `82.134 ms`
  - qk-cublas: mean `54.572 ms`, p99 `81.776 ms`
  - pv-cublas: mean `54.281 ms`, p99 `80.754 ms`
  - both-cublas: mean `54.822 ms`, p99 `79.947 ms`
- Cold sanity (`seq1_hybrid_20260222T1558Z`):
  - default: TTFT `6.447 ms`, full `147.756 ms`
  - pv-cublas: TTFT `6.450 ms`, full `149.293 ms`
- Fused follow-up (`seq1_hybrid_fused_20260222T192656Z`):
  - code changes:
    - fused `seq_q=1` softmax+PV custom kernel
    - seq1 QK kernel block retune (`64/128/256` based on `head_dim`)
  - warm default: mean `54.505 -> 52.535 ms`, p99 `82.134 -> 80.554 ms`
  - warm pv-cublas: mean `54.281 -> 51.964 ms`, p99 `80.754 -> 78.519 ms`
  - cold default: TTFT `6.447 -> 6.209 ms`, full `147.756 -> 145.587 ms`
  - cold pv-cublas: TTFT `6.450 -> 6.215 ms`, full `149.293 -> 147.937 ms`
- Interpretation:
  - the fused seq1 follow-up improved both warm and cold for the default custom path.
  - pv-cublas remains the fastest warm variant in this pass, but the default custom path keeps a stronger cold-first-hit balance.
  - this closes more request-path overhead without changing model/tool behavior.
Latest Key Findings (2026-02-22, H100 cuDNN SDPA Fused Probe)
- Probe pack: `phase2_runtime/results/cudnn_sdpa_h100_probe_20260222T1935Z`.
- Environment:
  - Lambda H100 (`sm90`) probe host.
  - tested staged system cuDNN 9.19 and pip `nvidia-cudnn-cu12==9.19.0.56`.
- Results:
  - alignment sweep: `cnt=0` for `align={16,32,64,128,256}`.
  - shape/layout sweep: `tested=1440`, `supported=0`.
  - debug logs show candidate SDPA engines (`8/9/10/11`) but no viable configs:
    - `NOT_SUPPORTED_GRAPH_PATTERN` (8/9/11)
    - `NOT_SUPPORTED_ARCH_MISMATCH` (10, Blackwell-only).
- Interpretation:
  - true fused `cudnn_sdpa` is still unresolved in the current backend descriptor path, even on H100.
  - the runtime stays on explicit fused-only semantics for `cudnn_sdpa`; the proxy remains opt-in only.
Latest Key Findings (2026-02-22, Phase 3 Realistic-v1 Reruns)
- Realistic-v1 loop summary (`phase3_realistic_v1_summary_20260222T143919Z`):
  - baseline:
    - internal success `1.0000`
    - external success `0.9010`
    - external/internal latency ratio `15.8563x`
    - external/internal steps ratio `1.8037x`
  - stress:
    - internal success `1.0000`
    - external success `0.9010`
    - external/internal latency ratio `75.3563x`
    - external/internal steps ratio `1.8037x`
- Realistic-v1 uncertainty compare (`phase3_uncertainty_compare_realistic_v1_s7_20260222T144116Z`):
  - baseline uncertainty-on success deltas: internal `+0.2500`, external `+0.2500` (all 3 sources).
  - stress uncertainty-on success deltas: internal `+0.2500`, external `+0.2344` (all 3 sources).
- Interpretation:
  - richer file-backed fixtures keep the same thesis direction: internal loops are faster and more stable.
  - uncertainty-aware branching remains beneficial on realistic-v1.
What Is Still Missing Per Plan
If following the full sequence:
- Optional: add region-pinned internet multi-hop controls (Fly-to-Fly or fixed-region affinity) to reduce provider-path confounding.
- Still open: replace the `cudnn_sdpa` proxy route with a true fused cuDNN SDPA/flash-attention frontend path and rerun the A/B.
Canonical Clarification
- The full-system canonical set remains `g5-20260216-foundation`.
- Cold optimization is tracked as `g5-20260217-cold-indexcache` (latest cold-specific canonical evidence).
- Cold decomposition/collect optimization is tracked as phase2-runtime `clean4` (latest cold-stage evidence).
- External-cold canonical repeatability after the GPU-convert + host-prefetch fix is tracked in `phase2_external_cold` under `external_cold_gpuconvert_prefetch_allbackends_repeatability_20260219T203017Z`.
Artifact Pointers
- True TTFT set: `/benchmarks/g5-20260217-truettft/`
- Cold index-cache set: `/benchmarks/g5-20260217-cold-indexcache/`
- AWS G5 seq1 fused follow-up set: `/benchmarks/phase2_runtime/results/seq1_hybrid_fused_20260222T192656Z/`
- H100 fused cuDNN SDPA probe pack: `/benchmarks/phase2_runtime/results/cudnn_sdpa_h100_probe_20260222T1935Z/`
- Routing comparison set: `/benchmarks/g5-20260217-routing/`
- Routing failure stress set: `/benchmarks/phase2_internal_external/results/`
- Routing matrix report: `/benchmarks/phase2_internal_external/results/routing_matrix_20260219T005022Z.md`
- Routing split-host matrix report: `/benchmarks/phase2_internal_external/results/routing_matrix_splithost_20260219T161945Z.md`
- Routing internet multi-hop matrix report (OpenAI, repeatability): `/benchmarks/phase2_internal_external/results/routing_matrix_fly_20260220T104002Z.md`
- Routing internet multi-hop matrix report (OpenRouter, repeatability): `/benchmarks/phase2_internal_external/results/routing_matrix_fly_20260220T104550Z.md`
- Routing internet multi-hop matrix report (OpenRouter Claude Sonnet 4.6, repeatability): `/benchmarks/phase2_internal_external/results/routing_matrix_fly_20260220T112444Z.md`
- Routing local control matrix report (OpenAI): `/benchmarks/phase2_internal_external/results/routing_matrix_control_openai_20260220T115444Z.md`
- Routing local control matrix report (OpenRouter Claude Sonnet 4.6): `/benchmarks/phase2_internal_external/results/routing_matrix_control_openrouter_sonnet46_20260220T115815Z.md`
- Routing local control matrix report (OpenAI, higher-N): `/benchmarks/phase2_internal_external/results/routing_matrix_control_openai_r8_20260220T121820Z.md`
- Routing local control matrix report (OpenRouter Claude Sonnet 4.6, higher-N): `/benchmarks/phase2_internal_external/results/routing_matrix_control_openrouter_sonnet46_r8_20260220T122446Z.md`
- Routing task-family run (OpenAI model-only, higher-N): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openai_model_only_r8_20260220T123747Z.json`
- Routing task-family run (OpenRouter Sonnet model-only, higher-N): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openrouter_sonnet46_model_only_r8_20260220T123923Z.json`
- Routing task-family run (OpenAI tool-only, higher-N): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openai_tool_only_r8_20260220T124123Z.json`
- Routing task-family run (OpenRouter Sonnet tool-only, higher-N): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openrouter_sonnet46_tool_only_r8_20260220T124216Z.json`
- Track B commercial parity appendix (JSON): `/benchmarks/phase2_internal_external/results/trackb_commercial_parity_appendix_20260220T124509Z.json`
- Track B commercial parity appendix (Markdown): `/benchmarks/phase2_internal_external/results/trackb_commercial_parity_appendix_20260220T124509Z.md`
- Track B fairness-hardened commercial appendix (JSON): `/benchmarks/phase2_internal_external/results/trackb_commercial_parity_appendix_fairness_20260220T193757Z.json`
- Track B fairness-hardened commercial appendix (Markdown): `/benchmarks/phase2_internal_external/results/trackb_commercial_parity_appendix_fairness_20260220T193757Z.md`
- OpenAI model-only fairness run (`r8`): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openai_model_only_fairness_r8_20260220T193120Z.json`
- OpenAI tool-only fairness run (`r8`, strict parity): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openai_tool_only_fairness_r8_20260220T193246Z.json`
- OpenRouter Sonnet model-only fairness run (`r8`): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openrouter_sonnet46_model_only_fairness_r8_20260220T193432Z.json`
- OpenRouter Sonnet tool-only fairness run (`r8`, strict parity): `/benchmarks/phase2_internal_external/results/internal_vs_external_control_openrouter_sonnet46_tool_only_fairness_r8_20260220T193633Z.json`
- Routing internet multi-hop matrix report (OpenAI, initial exploratory): `/benchmarks/phase2_internal_external/results/routing_matrix_fly_20260220T102124Z.md`
- Routing internet multi-hop matrix report (OpenRouter, initial exploratory): `/benchmarks/phase2_internal_external/results/routing_matrix_fly_20260220T102706Z.md`
- Qwen cold upload GPU-convert ablation summary: `/benchmarks/phase2_runtime/results/cold_gpuconvert_ablation_qwen_20260219T164135Z.md`
- External cold runtime-only GPU-convert ablation summary: `/benchmarks/phase2_external_cold/results/external_cold_gpuconvert_ablation_runtime_20260219T164521Z.md`
- External cold runtime-vLLM repeatability summary: `/benchmarks/phase2_external_cold/results/external_cold_gpuconvert_fix2_runtime_vllm_repeatability_20260219T184234Z.md`
- External cold runtime-vLLM full-depth AB5 (new defaults) summary: `/benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.md`
- External cold runtime-vLLM full-depth AB5 (new defaults) summary JSON: `/benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.json`
- External cold runtime-vLLM AB5 vs prior AB3 compare: `/benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/compare_vs_prev_linearfastdefault_ab3.md`
- External cold all-backend repeatability summary (host-prefetch fix): `/benchmarks/phase2_external_cold/results/external_cold_gpuconvert_prefetch_allbackends_repeatability_20260219T203017Z.md`
- External cold repeatability summary after seq1 multi-head default (runtime + PyTorch + vLLM): `/benchmarks/phase2_external_cold/results/external_cold_seq1mh_default_repeatability_20260224T192020Z.md`
- External cold repeatability summary after first step0 exp-reuse optimization (runtime + PyTorch + vLLM): `/benchmarks/phase2_external_cold/results/external_cold_step0expfix_repeatability_20260224T194226Z.md`
- External cold repeatability summary after second step0 shared-prob follow-up (runtime + PyTorch + vLLM, reverted): `/benchmarks/phase2_external_cold/results/external_cold_step0shared_repeatability_20260224T194913Z.md`
- Runtime cold-stability before-vs-after summary: `/benchmarks/phase2_external_cold/results/runtime_cold_stability_prefetch_compare_20260219T203017Z.md`
- External cold all-backend repeatability summary: `/benchmarks/phase2_external_cold/results/external_cold_gpuconvert_fix2_allbackends_repeatability_20260219T185610Z.md`
- Runtime cold-stability sweep summary: `/benchmarks/phase2_external_cold/results/runtime_cold_stability_gpuconvert_fix2_20260219T185738Z.md`
- Phase 3 canonical set: `/benchmarks/phase3_agentic_loops/results/`
- Phase 3 realistic-v1 summary: `/benchmarks/phase3_agentic_loops/results/phase3_realistic_v1_summary_20260222T143919Z.md`
- Phase 3 uncertainty ablation set: `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_ablation_20260219T094047Z.md`
- Phase 3 realistic-v1 uncertainty compare: `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_compare_realistic_v1_s7_20260222T144116Z.md`
- Phase 3 uncertainty baseline-vs-stress comparison: `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_compare_20260219T113526Z.md`
- Phase 3 uncertainty runtime-native canonical comparison: `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_compare_runtime_native_fix1_20260219T123933Z.md`
- Phase 3 uncertainty runtime-native awareness3 comparison (clean zero-fallback rerun): `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_compare_runtime_native_awareness3_20260220T020947Z.md`
- Phase 3 uncertainty runtime-native calibrated comparison (`calib1`): `/benchmarks/phase3_agentic_loops/results/phase3_uncertainty_compare_runtime_native_calib1_20260220T023521Z.md`
- Phase 4 paper package summary: `/benchmarks/paper_package/latest/package_summary.json`
- Phase 4 paper package markdown: `/benchmarks/paper_package/latest/paper_package.md`
- Phase 4 paper package manuscript figure manifest: `/benchmarks/paper_package/latest/manuscript/figure_manifest.json`
- Phase 4 paper package manuscript captions: `/benchmarks/paper_package/latest/manuscript/captions.md`
- Phase 4 paper package manuscript claims: `/benchmarks/paper_package/latest/manuscript/claims.md`
- AWS G5 speedpass + TTFT kernel summary: `/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_20260222_g5_summary.md`
- AWS G5 attention backend A/B summary (first order): `/benchmarks/phase2_runtime/results/attn_backend_ab_20260222T143605Z/attn_backend_ab_20260222T143605Z.md`
- AWS G5 attention backend A/B summary (reverse-order canonical): `/benchmarks/phase2_runtime/results/attn_backend_ab_rev_20260222T144736Z/attn_backend_ab_rev_20260222T144736Z.md`
- AWS G5 TTFT kernel artifact (`lt0_sync0`, norm+softmax, cold): `/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_norm_ttft_lt0_sync0_cold_20260222T121958Z.json`
- AWS G5 TTFT kernel artifact (`lt0_sync0`, norm+softmax, warm16): `/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_norm_ttft_lt0_sync0_warm16_20260222T122012Z.json`
- AWS G5 TTFT kernel artifact (
lt1_sync0, norm+softmax, cold):/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_norm_ttft_lt1_sync0_cold_20260222T122202Z.json - AWS G5 TTFT kernel artifact (
lt1_sync0, norm+softmax, warm16):/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_norm_ttft_lt1_sync0_warm16_20260222T122439Z.json - AWS G5 TTFT follow-up artifact (
lt1_sync0, seq1 tiny-kernel default, cold):/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_lt1_sync0_cold_20260222T124156Z.json - AWS G5 TTFT follow-up artifact (
lt1_sync0, seq1 tiny-kernel default, warm16):/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_lt1_sync0_warm16_20260222T124212Z.json - AWS G5 TTFT follow-up repeatability artifacts:
/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat1_cold_20260222T124601Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat2_cold_20260222T124608Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat3_cold_20260222T124615Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat1_warm16_20260222T124630Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat2_warm16_20260222T124634Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/speedpass_seq1kernel_default_repeat3_warm16_20260222T124638Z.json - AWS G5 parity artifacts (parser fix, root-cause debug, strict-pass rerun):
/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_bart_report_trace_20260222.json,/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_bart_report_trace_seq1off_20260222.json,/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_trace_fix_20260222T130917Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_runtime_trace_fix_20260222T130917Z.log,/benchmarks/phase2_runtime/results/aws_speedpass/minilm_demo_dbg_20260222T131659Z.log,/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_qbm_report_20260222T132155Z.json,/benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_qbm_runtime_trace_20260222T132155Z.log - Cold decomposition clean4 set:
/benchmarks/phase2_runtime/results/ - External cold canonical set:
/benchmarks/phase2_external_cold/results/ - Cross-host routing pilot matrix:
/benchmarks/phase2_internal_external/results/routing_matrix_crosshost_20260219T145513Z.md
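An inventory this long goes stale easily. A small checker like the following (a hypothetical helper, not part of the repo) can confirm that every cited artifact path still exists before claims are built on it:

```python
import tempfile
from pathlib import Path


def missing_artifacts(paths, root="."):
    """Return the artifact paths that do not exist under `root`.

    Inventory entries are written repo-rooted ("/benchmarks/...");
    the leading slash is stripped so they resolve relative to `root`.
    """
    base = Path(root)
    return [p for p in paths if not (base / p.lstrip("/")).exists()]


# Demo against a throwaway tree so the sketch is runnable anywhere.
with tempfile.TemporaryDirectory() as d:
    present = Path(d) / "benchmarks/paper_package/latest/package_summary.json"
    present.parent.mkdir(parents=True)
    present.write_text("{}")
    gone = missing_artifacts(
        ["/benchmarks/paper_package/latest/package_summary.json",
         "/benchmarks/phase9_missing/results/none.md"],
        root=d,
    )
    print(gone)  # only the nonexistent path remains
```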
Latest Qwen3.5 Status
- Prompt/token parity for the failing IFEval probe is confirmed against HF tokenization.
- The new batched prefill path is not the remaining cause of IFEval quality drift.
- Current strict one-host control lane state:
  - runtime is ahead on latency
  - runtime still trails vLLM slightly on IFEval-style instruction fidelity
- New evaluator-guided IFEval repair loop improves runtime quality over runtime control, but does not yet fully surpass vLLM control.
Latest Qwen Family Runtime Status (2026-03-10)
- Live AWS `qwen35` (`Qwen/Qwen3.5-0.8B`) is healthy again, and direct `/v1/chat/completions` now returns `inference.used=true`.
- Live AWS `qwen35_4b` (`Qwen/Qwen3.5-4B`) also performs real inference on the same host when launched with `runtime_pool_mb=15360`.
- Direct runtime smoke against the root runtime URL is now split clearly:
  - `qwen35` passes the direct tool-call smoke on AWS, including first-turn function calling and follow-up tool-result handling (`benchmarks/qwen35_smoke/results/live-qwen35-toolsmoke-root-20260310.json`).
  - `qwen35_4b` loads and infers, but still fails the same exact-output/tool-call smoke contract on the current harness (`benchmarks/qwen35_smoke/results/live-qwen35_4b-toolsmoke-root-20260310.json`).
- Backward compatibility is re-proven:
  - a fresh `Qwen/Qwen2.5-0.5B-Instruct` container was packed and proved on AWS before cleanup
  - the old `qwen2.5` host artifacts were then removed to free disk while preserving code-path compatibility
- Fresh `qwen35_9b` family wiring exists in `/Users/andrewcorrea/treni/scripts/qwen_runtime_env.py`, but the current AWS host does not yet have a packed `9B` container and is not the intended proof box for that model size.
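The smoke contract above boils down to a few checks on the response body. This sketch validates an already-parsed payload; the `inference.used` field is from the contract described above, while the `choices`/`message` shape is the usual OpenAI-style layout and is assumed here (the real harness may check more):

```python
def smoke_ok(response: dict) -> bool:
    """Check a /v1/chat/completions response body against the smoke contract:
    real inference was used, and the first choice carries either text
    content or at least one tool call."""
    if not response.get("inference", {}).get("used"):
        return False  # echo/stub path, not real inference
    choices = response.get("choices") or []
    if not choices:
        return False
    msg = choices[0].get("message", {})
    return bool(msg.get("content") or msg.get("tool_calls"))


# Minimal illustrative payload shaped like an OpenAI-style chat response.
sample = {
    "inference": {"used": True},
    "choices": [{"message": {"role": "assistant", "content": "4"}}],
}
print(smoke_ok(sample))  # True
```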
Same-VM Promotion Status (2026-03-10, 4B + 9B follow-up)
- The old negative `4B` status is now superseded.
- Root cause was a runtime parity bug in the cached linear-attention decode path for `Qwen3.5-4B`:
  - the step path repeated key heads before the depthwise-conv update instead of after it
  - Hugging Face on the same host already proved the model itself was fine
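The ordering is easiest to see in code. Below is a minimal numpy sketch of one cached decode step under assumed shapes (the runtime's real kernel, state layout, and filter differ); it shows the repaired order, where the grouped key heads go through the depthwise-conv update first and are only then repeated up to the full attention-head count:

```python
import numpy as np


def decode_step_keys(k_new, conv_state, conv_w, n_rep):
    """One cached decode step for grouped key heads (illustrative only).

    k_new:      (kv_heads, head_dim)        new key for this token
    conv_state: (kv_heads, taps, head_dim)  rolling per-kv-head window
    conv_w:     (taps,)                     depthwise filter taps
    n_rep:      heads / kv_heads repeat factor
    """
    # Shift the rolling window and append the new key.
    conv_state = np.concatenate([conv_state[:, 1:], k_new[:, None]], axis=1)
    # Depthwise conv over the time taps, still at kv_heads granularity.
    k_conv = np.einsum("t,htd->hd", conv_w, conv_state)
    # Expand grouped KV heads to full attention heads AFTER the update.
    return np.repeat(k_conv, n_rep, axis=0), conv_state
```

Performing the repeat before the conv update is the ordering the broken step path used, per the root-cause note above.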
- Repaired canonical 4B same-VM artifact: `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-full-aws-r4-qwen35_4b_20260310T184433Z.json`
- Repaired 4B result: `15/15`
  - direct runtime smoke passes
  - direct PDF RAG passes
  - direct embed/rerank passes
  - direct TTS/STT passes
  - Hermes runtime-status/RAG/SQLite/memory/`execute_code` all pass
- Current same-VM status on AWS A10G:
  - `qwen35` (0.8B) remains the speed-first lane
  - `qwen35_4b` (4B) is now a real end-to-end valid agent lane
- Lambda 9B state:
  - account auth and SSH key are valid
  - current launch attempts are blocked by provider-side capacity / rate limiting
  - no live Lambda 9B proof host exists yet from this sweep
Latest Same-VM Agent Compare Matrix (2026-03-10)
- New clean model-dependent comparison suite:
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35.json`
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35_4b.json`
- Scope of this lane:
  - runtime health
  - worker health
  - direct runtime smoke
  - Hermes runtime-status
  - Hermes RAG search
  - Hermes SQLite exec/query
  - Hermes memory add/read
  - Hermes `execute_code`
- Result:
  - `qwen35` (0.8B): `10/10` pass (1.0)
  - `qwen35_4b` (4B): `2/10` pass (0.2)
- Claim-safe interpretation:
  - this selector artifact is stale and predates the repaired `4B` decoder path
  - use the repaired full suite as the current source of truth for `4B`
- Isolated speed probes: `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/qwen35_model_speed_compare_20260310.md`
  - 0.8B warm steady state: about `113.7 tok/s`, `ttft ≈ 95.4 ms`
  - 4B repeated warm lane: about `38.5 tok/s`, `ttft ≈ 158.9 ms`
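For reference, tok/s and TTFT figures like those above can be derived from a per-token timestamp trace as follows (a generic sketch; the actual probe script's trace format and field names may differ):

```python
def speed_metrics(token_times, request_start):
    """Compute TTFT and warm steady-state throughput from a decode trace.

    token_times:   monotonically increasing completion times (seconds),
                   one entry per generated token
    request_start: time the request was issued (seconds)

    TTFT is first-token time minus request start. Steady-state tok/s
    excludes the first token, so prefill cost does not dilute decode speed.
    """
    ttft = token_times[0] - request_start
    decode_tokens = len(token_times) - 1
    decode_span = token_times[-1] - token_times[0]
    toks_per_s = decode_tokens / decode_span if decode_span > 0 else float("nan")
    return ttft, toks_per_s


# One token every 10 ms after a 95 ms TTFT -> about 100 tok/s steady state.
times = [0.095 + 0.010 * i for i in range(101)]
ttft, tps = speed_metrics(times, request_start=0.0)
print(round(ttft, 3), round(tps, 1))  # 0.095 100.0
```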
Stub Audit Clarification (2026-03-10)
- Direct Phase 5 runtime/vLLM comparisons do not rely on Hermes tool stubs.
- Same-VM Hermes wrapper had one localized optional-import shim in `/Users/andrewcorrea/treni/scripts/hermes_same_vm_mvp.py`; that path has been fixed.
- Same-VM Hermes now loads the real file/code tools (`read_file`, `write_file`, `search_files`, `patch`, `execute_code`) after the `tools` package import fix in the wrapper.
- Live Hermes single-tool validation on AWS now shows:
  - `qwen35` can execute real `samevm_rag_search` successfully against a raw-PDF-ingested local RAG store.
  - `qwen35` can call real `execute_code`; the current small-model weakness is argument/code quality, not missing tool plumbing.
  - `qwen35` also calls `samevm_sqlite_query`, but still tends to generate malformed SQL unless tightly guided.
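Given that malformed-SQL failure mode, one way to "tightly guide" a small model is a thin guard in front of the SQLite tool that rejects anything but a single complete `SELECT` and returns a structured error instead of raising. This is a hypothetical helper, not the real worker's interface:

```python
import sqlite3


def guarded_query(conn, sql):
    """Run model-generated SQL only if it is a single, syntactically
    complete SELECT; return a structured error instead of raising."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        return {"ok": False, "error": "multiple statements rejected"}
    if not stripped.lower().startswith("select"):
        return {"ok": False, "error": "only SELECT is allowed"}
    if not sqlite3.complete_statement(stripped + ";"):
        return {"ok": False, "error": "incomplete SQL"}
    try:
        rows = conn.execute(stripped).fetchall()
    except sqlite3.Error as exc:
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "rows": rows}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (k TEXT, v TEXT)")
conn.execute("INSERT INTO facts VALUES ('lane', 'qwen35')")
print(guarded_query(conn, "SELECT v FROM facts WHERE k = 'lane'"))
print(guarded_query(conn, "DROP TABLE facts"))
```

The structured-error shape mirrors the `400` JSON worker errors mentioned in the 2026-03-11 update, so a malformed query becomes a correctable tool result rather than a crashed turn.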
- Phase 3 loop studies still include synthetic fixture profiles by design. Those results remain useful, but they are not equivalent to the direct benchmark lane.