Treni

Benchmark Status

What has already been run, what still remains, and why.

Direct Answers

  • "Did we run the start benchmark (Phase 1 baseline)?" Yes.
  • "Did we rerun after true TTFT and cold fixes?" Yes (2026-02-17 on G5).
  • "Did we run all benchmarks in the full plan?" Not fully.
  • "Is there anything else to run?" Yes (Phase 3 loops, A100, H100).

What Has Been Run

Phase 1 (Baseline, Python stack)

  • T4 set: baseline JSON exists.
  • G5 set: baseline JSON exists.
  • Includes cold start breakdown, warm model runs, and pipeline runs.

Phase 2 (Minimal runtime benchmark)

  • T4 set: runtime JSON exists.
  • G5 set: runtime JSON exists.
  • Includes cold starts, model run timing, and HTTP request latency.
  • True TTFT rerun exists (timed inside the runtime, not at the SSE proxy).
  • Cold optimization rerun exists after tensor index-cache fix.
  • Stage-level cold decomposition exists (tokenizer/index/upload/prefill/step0 timings; see the sketch after this list).
  • Fast tensor collect optimization rerun exists (clean4).
  • External cold canonical run exists across four backends (runtime, PyTorch, vLLM, Ollama) on G5 (2026-02-18).
  • External cold optimized run exists with runtime startup preload + tokenizer cache (2026-02-18).
  • External cold token-parity rerun exists after decoder/sampling fixes; the runtime now wins both request full latency and cold total first response vs vLLM (2026-02-18).
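
A minimal sketch of how the stage-level cold decomposition can be instrumented, using the stage names listed above; the runtime's actual hooks and stage boundaries are not shown, so treat this as illustrative timing scaffolding only.

```python
import time
from contextlib import contextmanager

class ColdTimer:
    """Collects per-stage wall-clock timings for one cold start."""

    def __init__(self):
        self.timings_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings_ms[name] = (time.perf_counter() - start) * 1e3

# Usage sketch: wrap each cold-path step (stage names from the list above),
# then report the breakdown. The stage bodies below are placeholders.
timer = ColdTimer()
with timer.stage("tokenizer"):
    pass  # load/build the tokenizer here
with timer.stage("index"):
    pass  # build or load the cached tensor index here
with timer.stage("upload"):
    pass  # move weights to the device here
with timer.stage("prefill"):
    pass  # run the prompt prefill here
with timer.stage("step0"):
    pass  # produce the first decode step here
for name, ms in timer.timings_ms.items():
    print(f"{name}: {ms:.3f} ms")
```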

Week 3 (Numerical parity)

  • T4 parity: strict mode, 0 failures.
  • G5 parity: strict mode, 0 failures.
  • Donut is intentionally excluded from the parity check and marked as skipped.
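
A minimal sketch of what a strict parity check can look like, assuming per-model reference and candidate output arrays; the actual strict-mode tolerances and comparison points are assumptions here, not taken from the harness.

```python
import numpy as np

def strict_parity(reference: np.ndarray, candidate: np.ndarray,
                  rtol: float = 0.0, atol: float = 1e-6) -> bool:
    """Strict elementwise comparison; the tolerance values are assumptions."""
    return (reference.shape == candidate.shape
            and bool(np.allclose(reference, candidate, rtol=rtol, atol=atol)))
```

A model like Donut can then simply be recorded as skipped instead of being pushed through this check.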

Phase 3 comparison report

  • T4 comparison report exists.
  • G5 comparison report exists.

Latest Key Findings (2026-02-17)

  • Warm path on G5 remains strong (~80.6 ms mean, ~90.4 ms p99 in the latest clean7 sanity run).
  • Internal routing is faster than external routing (1.032x external/internal ratio).
  • Cold TTFT dropped further after stage decomposition + fast tensor collect (reductions spot-checked in the snippet after this list):
    • qwen: 1.41s -> 1.10s (22.1% lower)
    • donut: 619ms -> 150ms (75.7% lower)
    • bart: 777ms -> 125ms (83.9% lower)
    • minilm: 23.4ms -> 22.6ms (3.4% lower)
  • model_tensor_index_build is no longer dominant (~1-2.3 ms mean across models in clean4).
  • An async pinned-upload experiment regressed Qwen cold TTFT and was reverted; clean4 remains the accepted cold-path reference.
  • Revert validation set (clean7, 2026-02-18 UTC) confirms clean4 numbers are reproducible within noise.
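
A quick arithmetic check of the reductions listed above; the before/after pairs are the figures from this list, and small drift vs the reported percentages presumably reflects higher-precision source measurements.

```python
# Before/after cold TTFT pairs (ms) from the list above.
pairs = {
    "qwen":   (1410.0, 1100.0),
    "donut":  (619.0, 150.0),
    "bart":   (777.0, 125.0),
    "minilm": (23.4, 22.6),
}

for model, (before, after) in pairs.items():
    reduction = (before - after) / before
    print(f"{model}: {before:g} -> {after:g} ms ({reduction:.1%} lower)")
```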

Latest Key Findings (2026-02-18, External Cold Canonical)

  • Runtime cold total first response: 2342.996 ms.
  • PyTorch cold total first response: 8725.259 ms (3.724x runtime).
  • vLLM cold total first response: 25069.018 ms (10.7x runtime).
  • Ollama cold total first response: 3530.106 ms (1.507x runtime).
  • vLLM has the fastest request-path TTFT once healthy (51.763 ms), but startup (24032.203 ms) dominates end-to-end cold in this run.
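
A minimal sketch of the shape of the end-to-end external cold measurement: spawn a backend cold, poll until healthy, then time the first request. The command, health endpoint, and request endpoint are hypothetical placeholders, not the harness's actual values.

```python
import subprocess
import time
import urllib.request

def measure_cold(cmd, health_url, request_url, payload: bytes):
    """Returns (startup_ms, request_ms, cold_total_ms) for one cold run."""
    t0 = time.perf_counter()
    proc = subprocess.Popen(cmd)
    try:
        while True:  # poll until the backend reports healthy
            try:
                urllib.request.urlopen(health_url, timeout=1)
                break
            except OSError:
                time.sleep(0.05)
        t_healthy = time.perf_counter()
        req = urllib.request.Request(request_url, data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
        t_done = time.perf_counter()
    finally:
        proc.terminate()
    return ((t_healthy - t0) * 1e3,       # startup
            (t_done - t_healthy) * 1e3,   # request path
            (t_done - t0) * 1e3)          # cold total first response
```

Under this decomposition, cold total is startup plus the request path, which matches the canonical run above: vLLM's startup dominates its end-to-end cold.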

Latest Key Findings (2026-02-18, External Cold Optimized Runtime)

  • Runtime request full latency: 271.346 ms (vs vLLM 1035.826 ms).
  • Runtime cold total first response: 2276.081 ms (vs vLLM 28072.508 ms).
  • Runtime still trails vLLM in request TTFT (91.596 ms vs 51.725 ms).
  • This run was not yet at token parity (the runtime still decoded only 4 steps while the other backends used 48).
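
A sketch of what token parity means here: every backend gets the same decode budget per request. The parameter names below are assumptions (vLLM's OpenAI-style max_tokens and Ollama's num_predict are standard fields, while decode_steps for the runtime is hypothetical).

```python
DECODE_STEPS = 48  # shared token-parity budget across backends

def request_body(backend: str, prompt: str) -> dict:
    """Pin the decode budget per backend; field names are assumptions."""
    if backend == "vllm":
        return {"prompt": prompt, "max_tokens": DECODE_STEPS}
    if backend == "ollama":
        return {"prompt": prompt, "options": {"num_predict": DECODE_STEPS}}
    return {"prompt": prompt, "decode_steps": DECODE_STEPS}  # runtime (hypothetical)
```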

Latest Key Findings (2026-02-18, External Cold Token-Parity = 48, Pre-Fix)

  • Runtime request full latency: 2518.142 ms (vLLM: 1075.404 ms).
  • Runtime request TTFT: 91.207 ms (vLLM: 51.310 ms).
  • Runtime cold total first response: 4522.345 ms (vLLM: 28111.652 ms, 6.216x runtime advantage).
  • Request-path gap remains: runtime per-token decode is now the dominant issue.

Latest Key Findings (2026-02-18, External Cold Token-Parity = 48, Decoder/Sampling Fix)

  • Runtime request TTFT: 5.022 ms (vLLM: 52.995 ms, runtime 10.553x faster).
  • Runtime request full latency: 311.289 ms (vLLM: 1094.517 ms, runtime 3.516x faster).
  • Runtime cold total first response: 2316.048 ms (vLLM: 25131.279 ms, runtime 10.851x better).
  • Startup remained stable (~2004.8 ms) while the request-path bottleneck was removed.
  • Confirmation rerun (runtime+vLLM) matched the direction: runtime 5.021/310.376/2314.581 ms vs vLLM 51.655/1033.214/24065.623 ms (TTFT/full/cold-total).
  • 3-run repeatability (runtime+vLLM only) mean ratios: TTFT 10.333x, full 3.380x, cold-total 10.688x in runtime’s favor.
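
A sketch of how the repeatability mean ratios can be aggregated, assuming per-run (runtime, vLLM) millisecond pairs per metric; whether the harness takes the arithmetic mean of per-run ratios (as below) or the ratio of means is an assumption.

```python
from statistics import mean

# Per-run (runtime_ms, vllm_ms) pairs for one metric, e.g. request TTFT.
# The first two pairs are from the runs above; the third is illustrative.
runs = [(5.022, 52.995), (5.021, 51.655), (5.0, 51.0)]

mean_ratio = mean(vllm / runtime for runtime, vllm in runs)
print(f"mean TTFT ratio: {mean_ratio:.3f}x in runtime's favor")
```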

What Is Still Missing Per Plan

If following the full sequence:

  1. Phase 3 agentic loop capability study.
  2. Track B failure-amplification routing tests (timeouts/retries under load).
  3. A100 run set.
  4. H100 run set.
  5. Final paper-grade figure/table package.

Canonical Clarification

  • Full-system canonical set remains g5-20260216-foundation.
  • Cold optimization is tracked as g5-20260217-cold-indexcache (latest cold-specific canonical evidence).
  • Cold decomposition/collect optimization is tracked as phase2-runtime clean4 (latest cold-stage evidence).

Artifact Pointers
