
Findings Changelog

Dated summary of major experiment findings and interpretation.

At A Glance

  • Warm request path on G5 is stable and fast in the current runtime.
  • Internal routing beats external routing on matched benchmark tasks.
  • Cold start bottlenecks were decomposed stage-by-stage; model_tensor_index_build is no longer a dominant stage.
  • Remaining cold cost is now concentrated mostly in Qwen decoder_tensor_upload.
  • External cold-start comparison (runtime vs PyTorch vs vLLM vs Ollama) now has a canonical G5 artifact with explicit request-path vs total-cold interpretation.
  • After the decoder-loop and sampling fixes, the parity-corrected 48-token request path beats vLLM on both TTFT and full latency in the latest G5 run.

Timeline

Latest Key Numbers

Warm Path (G5)

  • Warm steady-state request mean: ~80.8 ms
  • Warm steady-state p99: ~89.6 ms

Routing (Internal vs External, G5)

  • Internal mean: 94.849 ms
  • External mean: 97.927 ms
  • External/Internal: 1.032x (internal faster)

External Cold Comparison (G5, 2026-02-18, Qwen 3B family)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
| --- | --- | --- | --- | --- | --- |
| runtime | 1003.537 ms | 1108.979 ms | 1339.459 ms | 2112.516 ms | 2342.996 ms |
| pytorch_transformers | - | 528.483 ms | 2288.516 ms | 6965.227 ms | 8725.259 ms |
| vllm | 24032.203 ms | 51.763 ms | 1036.815 ms | 24083.966 ms | 25069.018 ms |
| ollama (GGUF) | 1002.695 ms | 2168.902 ms | 2527.411 ms | 3171.597 ms | 3530.106 ms |

Runtime-normalized ratios (runtime = 1.0x; values above 1.0x are slower than runtime):

  • PyTorch cold total first response: 3.724x runtime.
  • vLLM cold total first response: 10.7x runtime.
  • Ollama cold total first response: 1.507x runtime.

Interpretation:

  • vLLM has the fastest request-path TTFT once healthy, but its server startup dominates the cold total in this run.
  • Runtime is strongest on end-to-end cold total in this specific setup.
  • Ollama runs a quantized GGUF model and is kept with caveat tags (not precision-equivalent to the BF16 paths).
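
A minimal sketch of how the derived columns and normalized ratios above are computed from the measured stages. The ColdRun dataclass and its field names are illustrative assumptions, not the benchmark harness's actual schema; note that the pytorch_transformers rows report no separate startup stage, so their cold totals appear to fold in in-process model load and are not a simple sum of the shown columns.

```python
from dataclasses import dataclass


@dataclass
class ColdRun:
    """One backend's measured cold-start stages, in milliseconds (hypothetical schema)."""
    startup_to_healthy_ms: float  # server startup until the health check passes
    request_ttft_ms: float        # request-path time to first token
    request_full_ms: float        # request-path time to full response

    @property
    def cold_total_first_token_ms(self) -> float:
        # "Cold Total First Token" column: startup plus request-path TTFT.
        return self.startup_to_healthy_ms + self.request_ttft_ms

    @property
    def cold_total_first_response_ms(self) -> float:
        # "Cold Total First Response" column: startup plus full request latency.
        return self.startup_to_healthy_ms + self.request_full_ms


def normalized_to_runtime(other: ColdRun, runtime: ColdRun) -> float:
    """Ratio in the 'Runtime-normalized' bullets: values above 1.0 are slower than runtime."""
    return other.cold_total_first_response_ms / runtime.cold_total_first_response_ms


# Values taken from the 2026-02-18 Qwen 3B table above:
runtime = ColdRun(1003.537, 1108.979, 1339.459)
vllm = ColdRun(24032.203, 51.763, 1036.815)
print(f"vLLM cold total first response: {normalized_to_runtime(vllm, runtime):.1f}x runtime")  # ~10.7x
```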

External Cold Comparison (G5, 2026-02-18, preload + tokenizer cache)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
| --- | --- | --- | --- | --- | --- |
| runtime | 2004.735 ms | 91.596 ms | 271.346 ms | 2096.331 ms | 2276.081 ms |
| pytorch_transformers | - | 522.795 ms | 2252.382 ms | 6644.737 ms | 8374.324 ms |
| vllm | 27036.682 ms | 51.725 ms | 1035.826 ms | 27088.407 ms | 28072.508 ms |
| ollama (GGUF) | 1002.508 ms | 2182.542 ms | 2538.609 ms | 3185.050 ms | 3541.117 ms |

Runtime-normalized ratios (runtime = 1.0x; values above 1.0x are slower than runtime):

  • vLLM request full latency: 3.817x runtime.
  • vLLM cold total first response: 12.334x runtime.
  • Remaining gap: vLLM request TTFT is still lower (51.725 ms vs runtime 91.596 ms).

Important caveat:

  • The run above predates wiring the request-level max_tokens through runtime inference, so the runtime generated only 4 tokens there (no token parity with the other backends).
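
Token parity here simply means every backend is held to the same generation budget before latencies are compared. A minimal sketch of such a check, assuming the harness records the number of tokens each backend actually generated (the function and argument names are hypothetical):

```python
def assert_token_parity(tokens_generated: dict[str, int], budget: int = 48) -> None:
    """Fail a comparison run if any backend missed the shared generation budget."""
    mismatched = {backend: n for backend, n in tokens_generated.items() if n != budget}
    if mismatched:
        raise ValueError(f"token parity violated (expected {budget} tokens): {mismatched}")


# The pre-parity run above would have failed this check,
# since the runtime generated only 4 tokens there.
assert_token_parity({"runtime": 48, "pytorch_transformers": 48, "vllm": 48, "ollama": 48})
```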

External Cold Comparison (G5, 2026-02-18, token parity fixed at 48, before decoder fix)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
| --- | --- | --- | --- | --- | --- |
| runtime | 2004.203 ms | 91.207 ms | 2518.142 ms | 2095.410 ms | 4522.345 ms |
| pytorch_transformers | - | 501.946 ms | 2244.327 ms | 9530.450 ms | 11272.831 ms |
| vllm | 27036.248 ms | 51.310 ms | 1075.404 ms | 27087.558 ms | 28111.652 ms |
| ollama (GGUF) | 1002.560 ms | 2197.797 ms | 2556.652 ms | 3200.357 ms | 3559.212 ms |

Interpretation:

  • Runtime still wins cold-total first response vs vLLM (6.216x better).
  • Runtime request-path TTFT and full latency are still slower than vLLM at equal 48-token budget.
  • Residual bottleneck is decoder per-token step cost (not tensor upload anymore in this mode).

External Cold Comparison (G5, 2026-02-18, token parity + decoder/sampling fix)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
| --- | --- | --- | --- | --- | --- |
| runtime | 2004.759 ms | 5.022 ms | 311.289 ms | 2009.781 ms | 2316.048 ms |
| pytorch_transformers | - | 515.597 ms | 2291.854 ms | 6461.304 ms | 8237.561 ms |
| vllm | 24036.762 ms | 52.995 ms | 1094.517 ms | 24089.757 ms | 25131.279 ms |
| ollama (GGUF) | 1002.630 ms | 2184.383 ms | 2543.219 ms | 3187.013 ms | 3545.849 ms |

Interpretation:

  • Runtime now leads vLLM in request-path TTFT (10.553x faster) and full latency (3.516x faster) on this G5 token-parity run.
  • Runtime also remains much lower on cold-total first response (10.851x better vs vLLM).
  • The main bottleneck measured in the prior parity run (sampling plus per-step host sync) is no longer dominant.
  • A 3-run repeatability check (runtime and vLLM reruns) keeps the same direction: mean speedups of 10.333x on TTFT, 3.380x on full latency, and 10.688x on cold-total first response.

Cold TTFT Before vs After Index Cache (3-run means, G5)

| Model | Before | After | Speedup |
| --- | --- | --- | --- |
| qwen | 27574.564 ms | 1774.951 ms | 15.535x |
| donut | 67360.388 ms | 572.485 ms | 117.663x |
| bart | 77520.798 ms | 743.652 ms | 104.243x |
| minilm | 23.342 ms | 22.698 ms | 1.028x |

Cold TTFT: clean3 vs clean4 (3-run means, G5)

| Model | clean3 | clean4 | Improvement |
| --- | --- | --- | --- |
| qwen | 1411.831 ms | 1100.044 ms | 22.1% lower |
| donut | 619.499 ms | 150.322 ms | 75.7% lower |
| bart | 776.545 ms | 125.011 ms | 83.9% lower |
| minilm | 23.421 ms | 22.621 ms | 3.4% lower |
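
The two tables above use simple ratio and percent-reduction arithmetic; a minimal sketch that reproduces the qwen rows (values in ms):

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """Speedup column in the index-cache table: how many times faster the 'after' run is."""
    return before_ms / after_ms


def pct_lower(old_ms: float, new_ms: float) -> float:
    """Improvement column in the clean3 vs clean4 table: percent reduction of cold TTFT."""
    return 100.0 * (old_ms - new_ms) / old_ms


print(f"qwen index-cache speedup: {speedup(27574.564, 1774.951):.3f}x")        # 15.535x
print(f"qwen clean3 -> clean4:    {pct_lower(1411.831, 1100.044):.1f}% lower")  # 22.1% lower
```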

Dominant Cold Stages After clean4

  • model_tensor_index_build dropped to ~1-2.3 ms across models (down ~99.6% vs clean3 for Bart/Donut).
  • Qwen is still dominated by decoder_tensor_upload (~1015 ms mean).
  • Donut and Bart are now mostly in decoder setup/upload and no longer index-build bound.

Reverted Experiment (Transparency)

  • Tried an async pinned conversion-buffer upload strategy after clean4.
  • Result: Qwen decoder_tensor_upload regressed to ~1419 ms and TTFT regressed by ~37%.
  • Decision: reverted that path; clean4 remains the accepted cold-path baseline.
  • Follow-up validation run set (clean7) matched clean4 within run noise (Qwen TTFT delta -0.16%).

What Was Actually Tested

  1. Baseline (Python/dependency path) runs on T4 and G5.
  2. Runtime cold and warm request-path benchmarks.
  3. True runtime-reported TTFT (not SSE first-event proxy).
  4. Internal-vs-external routing comparison on matched tasks.
  5. Week 3 numerical parity checks (strict mode; donut intentionally skipped in parity harness).

What Is Not Finished Yet

  1. Phase 3 agentic loop capability study (retrieval correction, tool-state adaptation, confidence-gated branching).
  2. A100/H100 reruns from the original expansion phase.
  3. Paper-grade figures package.
