Benchmark Status
What was run already, what still remains, and why.
Direct Answers
- "Did we run the start benchmark (Phase 1 baseline)?" Yes.
- "Did we rerun after true TTFT and cold fixes?" Yes (2026-02-17 on G5).
- "Did we run all benchmarks in the full plan?" Not fully.
- "Is there anything else to run?" Yes (Phase 3 loops, A100, H100).
What Has Been Run
Phase 1 (Baseline, Python stack)
- T4 set: baseline JSON exists.
- G5 set: baseline JSON exists.
- Includes cold start breakdown, warm model runs, and pipeline runs.
Phase 2 (Minimal runtime benchmark)
- T4 set: runtime JSON exists.
- G5 set: runtime JSON exists.
- Includes cold starts, model run timing, and HTTP request latency.
- True TTFT rerun exists (runtime timing, not SSE proxy).
- Cold optimization rerun exists after tensor index-cache fix.
- Stage-level cold decomposition exists (tokenizer/index/upload/prefill/step0 timings).
- Fast tensor collect optimization rerun exists (clean4).
- External cold canonical run exists across four backends (runtime, PyTorch, vLLM, Ollama) on G5 (2026-02-18).
- External cold optimized run exists with runtime startup preload + tokenizer cache (2026-02-18).
- External cold token-parity rerun exists after decoder/sampling fixes; runtime now wins request and cold-total vs vLLM (2026-02-18).
Week 3 (Numerical parity)
- T4 parity: strict mode, 0 failures.
- G5 parity: strict mode, 0 failures.
- Donut is intentionally excluded from the parity check and reported as skipped.
Phase 3 comparison report
- T4 comparison report exists.
- G5 comparison report exists.
Latest Key Findings (2026-02-17)
- Warm path on G5 remains strong (~80.6 ms mean, ~90.4 ms p99 in latest clean7 sanity run).
- Internal routing is faster than external routing (1.032x external/internal ratio).
- Cold TTFT dropped further after stage decomposition + fast tensor collect:
  - qwen: 1.41 s -> 1.10 s (22.1% lower)
  - donut: 619 ms -> 150 ms (75.7% lower)
  - bart: 777 ms -> 125 ms (83.9% lower)
  - minilm: 23.4 ms -> 22.6 ms (3.4% lower)
- model_tensor_index_build is no longer dominant (~1-2.3 ms mean across models in clean4).
- An async pinned-upload experiment regressed Qwen cold TTFT and was reverted; clean4 remains the accepted cold-path reference.
- Revert validation set (clean7, 2026-02-18 UTC) confirms clean4 numbers are reproducible within noise.
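The "% lower" figures for cold TTFT can be rechecked with a few lines of Python. This is a minimal sketch that uses the rounded before/after values listed above; the helper name `percent_lower` is illustrative and not part of the benchmark tooling. Because the inputs are rounded, the last digit may differ slightly from the reported percentages, which were presumably computed from unrounded timings.

```python
def percent_lower(before_ms: float, after_ms: float) -> float:
    """Relative reduction from before_ms to after_ms, in percent."""
    return (before_ms - after_ms) / before_ms * 100.0


# Cold TTFT (before, after) in milliseconds, as reported above (rounded).
cold_ttft_ms = {
    "qwen": (1410.0, 1100.0),
    "donut": (619.0, 150.0),
    "bart": (777.0, 125.0),
    "minilm": (23.4, 22.6),
}

reductions = {
    model: percent_lower(before, after)
    for model, (before, after) in cold_ttft_ms.items()
}

for model, pct in reductions.items():
    print(f"{model}: {pct:.1f}% lower")
```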
Latest Key Findings (2026-02-18, External Cold Canonical)
- Runtime cold total first response: 2342.996 ms.
- PyTorch cold total first response: 8725.259 ms (3.724x runtime).
- vLLM cold total first response: 25069.018 ms (10.7x runtime).
- Ollama cold total first response: 3530.106 ms (1.507x runtime).
- vLLM has the fastest request-path TTFT once healthy (51.763 ms), but startup (24032.203 ms) dominates end-to-end cold in this run.
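The "x runtime" multipliers above are each backend's cold total divided by the runtime's cold total. A short sketch of that arithmetic, assuming the four totals listed above; the dictionary keys and the function name `ratio_vs` are illustrative only:

```python
from typing import Dict

# Cold total first response per backend, in milliseconds (from the list above).
cold_total_ms: Dict[str, float] = {
    "runtime": 2342.996,
    "pytorch": 8725.259,
    "vllm": 25069.018,
    "ollama": 3530.106,
}


def ratio_vs(baseline: str, totals: Dict[str, float]) -> Dict[str, float]:
    """Express every backend's total as a multiple of the baseline's total."""
    base = totals[baseline]
    return {name: value / base for name, value in totals.items()}


ratios = ratio_vs("runtime", cold_total_ms)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}x runtime")
```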
Latest Key Findings (2026-02-18, External Cold Optimized Runtime)
- Runtime request full latency: 271.346 ms (vs vLLM 1035.826 ms).
- Runtime cold total first response: 2276.081 ms (vs vLLM 28072.508 ms).
- Runtime still trails vLLM in request TTFT (91.596 ms vs 51.725 ms).
- This run was not token-parity yet (runtime decode steps still 4 while others used 48).
Latest Key Findings (2026-02-18, External Cold Token-Parity = 48, Pre-Fix)
- Runtime request full latency: 2518.142 ms (vLLM: 1075.404 ms).
- Runtime request TTFT: 91.207 ms (vLLM: 51.310 ms).
- Runtime cold total first response: 4522.345 ms (vLLM: 28111.652 ms, 6.216x runtime advantage).
- Request-path gap remains: runtime per-token decode is now the dominant issue.
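To see why per-token decode dominates the pre-fix request path, one rough estimate subtracts TTFT from full latency and spreads the remainder over the remaining decode steps. The sketch below makes two assumptions that this document does not confirm: that each request produced exactly 48 tokens, and that the first token arrives at TTFT with the other 47 evenly spaced after it. The function name is hypothetical.

```python
def mean_decode_ms_per_token(full_ms: float, ttft_ms: float, tokens: int) -> float:
    """Average time per token after the first one (assumes even spacing)."""
    if tokens < 2:
        raise ValueError("need at least two tokens to estimate decode time")
    return (full_ms - ttft_ms) / (tokens - 1)


# Pre-fix token-parity run (48 decode steps), values in milliseconds.
runtime_decode = mean_decode_ms_per_token(2518.142, 91.207, 48)
vllm_decode = mean_decode_ms_per_token(1075.404, 51.310, 48)

print(f"runtime: {runtime_decode:.2f} ms/token")
print(f"vLLM:    {vllm_decode:.2f} ms/token")
```

Under these assumptions the runtime spends more than twice as long per decoded token as vLLM in this run, which is consistent with the request-path gap noted above.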
Latest Key Findings (2026-02-18, External Cold Token-Parity = 48, Decoder/Sampling Fix)
- Runtime request TTFT: 5.022 ms (vLLM: 52.995 ms, runtime 10.553x faster).
- Runtime request full latency: 311.289 ms (vLLM: 1094.517 ms, runtime 3.516x faster).
- Runtime cold total first response: 2316.048 ms (vLLM: 25131.279 ms, runtime 10.851x better).
- Startup remained stable (~2004.8 ms) while request-path bottleneck was removed.
- Confirmation rerun (runtime+vLLM) matched the direction: runtime 5.021/310.376/2314.581 ms vs vLLM 51.655/1033.214/24065.623 ms (TTFT/full/cold-total).
- 3-run repeatability (runtime+vLLM only) mean ratios: TTFT 10.333x, full 3.380x, cold-total 10.688x in runtime’s favor.
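The per-run speedups and a mean ratio across runs can be derived as sketched below. Only two of the three repeatability runs are listed in this document, so the two-run mean computed here is not expected to reproduce the 3-run figures. The sketch also assumes the reported means are means of per-run ratios (not ratios of mean latencies), which the document does not state. All names are illustrative.

```python
from statistics import mean

METRICS = ("ttft", "full", "cold_total")

# (TTFT, full latency, cold total) in ms for the two runs listed above.
runs = [
    {"runtime": (5.022, 311.289, 2316.048), "vllm": (52.995, 1094.517, 25131.279)},
    {"runtime": (5.021, 310.376, 2314.581), "vllm": (51.655, 1033.214, 24065.623)},
]


def per_run_ratios(run):
    """vLLM / runtime for each metric; values above 1 favor the runtime."""
    return {
        metric: vllm_ms / runtime_ms
        for metric, runtime_ms, vllm_ms in zip(METRICS, run["runtime"], run["vllm"])
    }


def mean_ratios(all_runs):
    """Mean of the per-run ratios, metric by metric."""
    ratios = [per_run_ratios(run) for run in all_runs]
    return {metric: mean(r[metric] for r in ratios) for metric in METRICS}


first_run = per_run_ratios(runs[0])
two_run_mean = mean_ratios(runs)
print(first_run)
print(two_run_mean)
```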
What Is Still Missing Per Plan
If following the full sequence:
- Phase 3 agentic loop capability study.
- Track B failure-amplification routing tests (timeouts/retries under load).
- A100 run set.
- H100 run set.
- Final paper-grade figure/table package.
Canonical Clarification
- Full-system canonical set remains g5-20260216-foundation.
- Cold optimization is tracked as g5-20260217-cold-indexcache (latest cold-specific canonical evidence).
- Cold decomposition/collect optimization is tracked as phase2-runtime clean4 (latest cold-stage evidence).
Artifact Pointers
- True TTFT set: /benchmarks/g5-20260217-truettft/
- Cold index-cache set: /benchmarks/g5-20260217-cold-indexcache/
- Routing comparison set: /benchmarks/g5-20260217-routing/
- Cold decomposition clean4 set: /benchmarks/phase2_runtime/results/
- External cold canonical set: /benchmarks/phase2_external_cold/results/