Findings Changelog
Dated summary of major experiment findings and interpretation.
At A Glance
- Warm request path on G5 is stable and fast in the current runtime.
- Internal routing beats external routing on matched benchmark tasks.
- Cold start bottlenecks were decomposed stage-by-stage; `model_tensor_index_build` is no longer a dominant stage.
- Remaining cold cost is now concentrated mostly in Qwen `decoder_tensor_upload`.
- External cold-start comparison (runtime vs PyTorch vs vLLM vs Ollama) now has a canonical G5 artifact with explicit request-path vs total-cold interpretation.
- After the decoder-loop and sampling fixes, the parity-corrected 48-token request path beats vLLM on both TTFT and full latency in the latest G5 run.
Timeline
Latest Key Numbers
Warm Path (G5)
- Warm steady-state request mean: ~80.8 ms
- Warm steady-state p99: ~89.6 ms (see the sketch below)
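For reference, these two summary statistics reduce to a mean and a 99th percentile over per-request latencies. A minimal sketch, with placeholder samples (not run data) and an assumed percentile method:

```python
import statistics

# Hypothetical per-request warm latencies in ms; placeholders, not run data.
latencies_ms = [79.4, 80.1, 80.9, 81.2, 80.3, 79.8, 88.9, 80.6]

mean_ms = statistics.mean(latencies_ms)
# p99 via inclusive quantiles; the real harness's percentile method may differ.
p99_ms = statistics.quantiles(latencies_ms, n=100, method="inclusive")[98]

print(f"mean={mean_ms:.1f} ms  p99={p99_ms:.1f} ms")
```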
Routing (Internal vs External, G5)
- Internal mean: 94.849 ms
- External mean: 97.927 ms
- External/Internal ratio: 1.032x (internal faster)
External Cold Comparison (G5, 2026-02-18, Qwen 3B family)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 1003.537 ms | 1108.979 ms | 1339.459 ms | 2112.516 ms | 2342.996 ms |
| pytorch_transformers | - | 528.483 ms | 2288.516 ms | 6965.227 ms | 8725.259 ms |
| vllm | 24032.203 ms | 51.763 ms | 1036.815 ms | 24083.966 ms | 25069.018 ms |
| ollama (GGUF) | 1002.695 ms | 2168.902 ms | 2527.411 ms | 3171.597 ms | 3530.106 ms |
Runtime-normalized ratios (backend time / runtime time; above 1x means the runtime is faster):
- PyTorch cold total first response: 3.724x runtime.
- vLLM cold total first response: 10.7x runtime.
- Ollama cold total first response: 1.507x runtime.
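The cold-total columns and these ratios compose from the measured stages. A minimal sketch using the runtime and vLLM rows above; the composition rule (cold total = startup-to-healthy + request-path time) is inferred from the table and applies to the server-style backends, while PyTorch runs in-process and folds model load into its cold totals instead:

```python
# Rows from the canonical G5 table above, in ms.
runtime = {"startup": 1003.537, "ttft": 1108.979, "full": 1339.459}
vllm = {"startup": 24032.203, "ttft": 51.763, "full": 1036.815}

def cold_totals(b: dict) -> tuple[float, float]:
    # Inferred rule: cold total = startup-to-healthy + request-path time.
    return b["startup"] + b["ttft"], b["startup"] + b["full"]

_, rt_resp = cold_totals(runtime)  # 2342.996 ms, matching the table
_, vl_resp = cold_totals(vllm)     # 25069.018 ms, matching the table

# Runtime-normalized ratio: backend time / runtime time (>1x => runtime faster).
print(f"vLLM cold total first response: {vl_resp / rt_resp:.1f}x runtime")  # ~10.7x
```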
Interpretation:
- vLLM request-path TTFT is fastest once healthy, but startup dominates total cold in this run.
- Runtime is strongest on end-to-end cold total in this specific setup.
- Ollama is quantized GGUF and kept with caveat tags (not precision-equivalent to BF16 paths).
External Cold Comparison (G5, 2026-02-18, preload + tokenizer cache)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 2004.735 ms | 91.596 ms | 271.346 ms | 2096.331 ms | 2276.081 ms |
| pytorch_transformers | - | 522.795 ms | 2252.382 ms | 6644.737 ms | 8374.324 ms |
| vllm | 27036.682 ms | 51.725 ms | 1035.826 ms | 27088.407 ms | 28072.508 ms |
| ollama (GGUF) | 1002.508 ms | 2182.542 ms | 2538.609 ms | 3185.050 ms | 3541.117 ms |
Runtime-normalized ratios (backend time / runtime time; above 1x means the runtime is faster):
- vLLM request full latency: 3.817x runtime.
- vLLM cold total first response: 12.334x runtime.
- Remaining gap: vLLM request TTFT is still lower (51.725 ms vs runtime 91.596 ms).
Important caveat:
- The run above predates wiring request `max_tokens` through runtime inference; the runtime still generated only 4 tokens there, so its request-full latency is not token-parity comparable. A hypothetical request shape is sketched below.
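For later token-parity runs, the generation budget has to travel with the request all the way into the decoder loop. A hypothetical request body illustrating the idea; the field names and values are assumptions, not the runtime's confirmed API:

```python
import json

# Hypothetical benchmark request enforcing the shared 48-token budget.
# Field names are illustrative assumptions, not the runtime's confirmed API.
request_body = {
    "prompt": "Summarize the following passage ...",
    "max_tokens": 48,    # parity budget applied identically to every backend
    "temperature": 0.0,  # deterministic decoding for comparability
}
print(json.dumps(request_body, indent=2))
```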
External Cold Comparison (G5, 2026-02-18, token parity fixed at 48, pre decoder fix)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 2004.203 ms | 91.207 ms | 2518.142 ms | 2095.410 ms | 4522.345 ms |
| pytorch_transformers | - | 501.946 ms | 2244.327 ms | 9530.450 ms | 11272.831 ms |
| vllm | 27036.248 ms | 51.310 ms | 1075.404 ms | 27087.558 ms | 28111.652 ms |
| ollama (GGUF) | 1002.560 ms | 2197.797 ms | 2556.652 ms | 3200.357 ms | 3559.212 ms |
Interpretation:
- Runtime still wins cold-total first response vs vLLM (6.216x better).
- Runtime request-path TTFT and full latency are still slower than vLLM at an equal 48-token budget.
- The residual bottleneck is decoder per-token step cost (no longer tensor upload in this mode).
External Cold Comparison (G5, 2026-02-18, token parity + decoder/sampling fix)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 2004.759 ms | 5.022 ms | 311.289 ms | 2009.781 ms | 2316.048 ms |
| pytorch_transformers | - | 515.597 ms | 2291.854 ms | 6461.304 ms | 8237.561 ms |
| vllm | 24036.762 ms | 52.995 ms | 1094.517 ms | 24089.757 ms | 25131.279 ms |
| ollama (GGUF) | 1002.630 ms | 2184.383 ms | 2543.219 ms | 3187.013 ms | 3545.849 ms |
Interpretation:
- Runtime now leads vLLM on request-path TTFT (10.553x faster) and full latency (3.516x faster) in this G5 token-parity run.
- Runtime also remains much lower on cold-total first response (10.851x better vs vLLM).
- The main measured bottleneck of the prior parity run (sampling + per-step host sync) is no longer dominant; the pattern is sketched below this list.
- 3-run repeatability (runtime + vLLM reruns) keeps the same direction: mean speedups of 10.333x TTFT, 3.380x full latency, and 10.688x cold-total first response.
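The fixed bottleneck class (per-step host synchronization in the decode loop) looks like this in PyTorch terms; the runtime is not PyTorch, so this only illustrates the pattern, not its actual code:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
logits = torch.randn(1, 32000, device=device)  # stand-in decoder output

# Slow pattern: .item() forces a device-to-host sync on every decode step,
# draining the GPU pipeline before the next step can be queued.
next_id_synced = torch.argmax(logits, dim=-1).item()

# Faster pattern: keep the sampled token id on-device and feed it straight
# into the next step, syncing with the host only once decoding finishes.
next_id = torch.argmax(logits, dim=-1)  # no host sync here
```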
Cold TTFT Before vs After Index Cache (3-run means, G5)
| Model | Before | After | Speedup |
|---|---|---|---|
| qwen | 27574.564 ms | 1774.951 ms | 15.535x |
| donut | 67360.388 ms | 572.485 ms | 117.663x |
| bart | 77520.798 ms | 743.652 ms | 104.243x |
| minilm | 23.342 ms | 22.698 ms | 1.028x |
Cold TTFT: clean3 vs clean4 (3-run means, G5)
| Model | clean3 | clean4 | Improvement |
|---|---|---|---|
| qwen | 1411.831 ms | 1100.044 ms | 22.1% lower |
| donut | 619.499 ms | 150.322 ms | 75.7% lower |
| bart | 776.545 ms | 125.011 ms | 83.9% lower |
| minilm | 23.421 ms | 22.621 ms | 3.4% lower |
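The derived columns in the two tables above are plain ratios; a worked sketch with the qwen rows as the example:

```python
# Qwen rows from the two tables above, cold TTFT in ms.
before, after = 27574.564, 1774.951
print(f"index-cache speedup: {before / after:.3f}x")  # 15.535x

clean3, clean4 = 1411.831, 1100.044
print(f"clean4 improvement: {(clean3 - clean4) / clean3:.1%} lower")  # 22.1%
```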
Dominant Cold Stages After clean4
- `model_tensor_index_build` dropped to ~1-2.3 ms across models (down ~99.6% vs clean3 for Bart/Donut).
- Qwen is still dominated by `decoder_tensor_upload` (~1015 ms mean).
- Donut and Bart are now mostly in decoder setup/upload and are no longer index-build bound.
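The attribution behind these bullets comes from a stage-by-stage wall-clock decomposition. A minimal sketch of that kind of instrumentation; the stage names match the report, but the timing helper itself is illustrative:

```python
import time
from contextlib import contextmanager

stage_ms: dict[str, float] = {}

@contextmanager
def stage(name: str):
    # Record wall-clock time for one named cold-start stage.
    t0 = time.perf_counter()
    try:
        yield
    finally:
        stage_ms[name] = (time.perf_counter() - t0) * 1000.0

# Stage names match the decomposition report; bodies here are stand-ins.
with stage("model_tensor_index_build"):
    pass  # build the tensor index over model files
with stage("decoder_tensor_upload"):
    pass  # copy decoder weights to the GPU

# Rank stages by cost, largest first.
print(sorted(stage_ms.items(), key=lambda kv: -kv[1]))
```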
Reverted Experiment (Transparency)
- Tried an async pinned conversion-buffer upload strategy after clean4 (general pattern sketched below this list).
- Result: Qwen `decoder_tensor_upload` regressed to ~1419 ms and TTFT regressed by ~37%.
- Decision: reverted that path; clean4 remains the accepted cold-path baseline.
- Follow-up validation run set (`clean7`) matched clean4 within run noise (Qwen TTFT delta of -0.16%).
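For context, the reverted strategy corresponds to this general pattern, shown in PyTorch terms purely as an illustration (the runtime's actual implementation differs):

```python
import torch

if torch.cuda.is_available():
    # Page-locked (pinned) staging buffer allows truly async host-to-device copies.
    host_buf = torch.empty(1024, 1024, pin_memory=True)
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        # non_blocking=True lets this copy overlap other work on the stream...
        dev_buf = host_buf.to("cuda", non_blocking=True)
    torch.cuda.current_stream().wait_stream(stream)
    # ...but the extra staging copy and stream synchronization can outweigh
    # the overlap, consistent with the regression recorded above.
```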
What Was Actually Tested
- Baseline (Python/dependency path) runs on T4 and G5.
- Runtime cold and warm request-path benchmarks.
- True runtime-reported TTFT (not the SSE first-event proxy; the distinction is sketched after this list).
- Internal-vs-external routing comparison on matched tasks.
- Week 3 numerical parity checks (strict mode; donut intentionally skipped in parity harness).
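The TTFT distinction in the list above comes down to where the clock is read. A hedged sketch; the endpoint, streaming format, and response field are assumptions, not the runtime's confirmed API:

```python
import json
import time
import urllib.request

t0 = time.perf_counter()
# Hypothetical streaming endpoint; path and payload are assumptions.
with urllib.request.urlopen("http://localhost:8000/v1/generate") as resp:
    first_event = resp.readline()  # first SSE event reaching the client
    # Proxy TTFT: client-side wall clock up to the first streamed event.
    proxy_ttft_ms = (time.perf_counter() - t0) * 1e3
    # True TTFT: the runtime's own first-token measurement reported in-band
    # (field name assumed for illustration).
    reported_ttft_ms = json.loads(first_event.removeprefix(b"data: "))["ttft_ms"]
```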
What Is Not Finished Yet
- Phase 3 agentic loop capability study (retrieval correction, tool-state adaptation, confidence-gated branching).
- A100/H100 reruns from the original expansion phase.
- Paper-grade figures package.
Raw Artifact Links
- Cold index-cache summary JSON
- Cold index-cache summary Markdown
- Routing comparison JSON
- True TTFT summary JSON
- Cold decomposition clean4 run 1 JSON
- Cold decomposition clean4 run 2 JSON
- Cold decomposition clean4 run 3 JSON
- Cold decomposition clean4 summary JSON
- Cold decomposition clean4 summary Markdown
- Cold decomposition clean7 run 1 JSON
- Cold decomposition clean7 run 2 JSON
- Cold decomposition clean7 run 3 JSON
- Cold decomposition clean7 summary JSON
- Cold decomposition clean7 summary Markdown
- Warm sanity clean7 JSON
- External cold canonical JSON
- External cold canonical Markdown
- External cold preload/tokenizer-cache JSON
- External cold preload/tokenizer-cache Markdown
- External cold token-parity JSON
- External cold token-parity Markdown
- External cold token-parity decoder-fix JSON
- External cold token-parity decoder-fix Markdown