Findings Changelog
Dated summary of major experiment findings and interpretation.
At A Glance
- Public GPU Agent console is now split cleanly between canonical and scratch surfaces:
  - public console now exposes:
    - a direct runtime test path for raw generation speed/logprobs/uncertainty
    - a separate agent test path for SQLite/RAG/memory/tool verification
  - docs link: https://treni-docs.pages.dev
  - deck link: https://monostate.com/pitch
  - docs navigation is now reorganized around:
    - canonical lanes
    - detailed logs
    - scratch experiments
  - interpretation:
    - the main experiment story is easier to read without mixing claim-safe lanes with random debugging work,
    - while the scratch bucket still preserves the noisy exploratory trail when needed.
- Native Hermes 4B same-VM conversation lane is now green for the split real-world persistence workflow:
  - artifact: benchmarks/same_vm_mvp/results/hemkesh-v22_20260311T020710Z.json
  - result:
    - local discovery works
    - exact facts are written to SQLite and queried back
    - broader context is ingested into RAG and retrieval-checked
    - a memory note is saved after persistence
    - final recall correctly points exact facts to SQLite and broader context to RAG
  - interpretation:
    - the earlier failures were a mix of duplicate tool-call IDs, over-long replayed tool traces, and opaque worker errors,
    - those are now fixed enough that the native Hermes 4B lane can complete a real multi-turn investor-style knowledge-building workflow on AWS,
    - the remaining open weakness is still the single-turn combined persistence prompt, not the split multi-turn workflow.
- Warm request path on G5 is stable and fast in the current runtime.
- Larger-N sampled strict confirmation (2026-03-08) now strengthens the post-fix Qwen3.5 non-thinking claim:
  - artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T235013Z.json
  - result (16 samples/task, gpqa_diamond + ifeval, 3 seeds):
    - overall: runtime 0.371528 vs vLLM 0.296875, runtime 1255.344 ms vs vLLM 1585.043 ms
    - gpqa_diamond: runtime 0.3750 vs vLLM 0.3125, runtime slower (801.900 ms vs 433.256 ms)
    - ifeval: runtime 0.368056 vs vLLM 0.281250, runtime faster (1708.789 ms vs 2736.831 ms)
  - interpretation:
    - the sampled strict runtime-vs-vLLM win survives beyond the 8-sample pilot,
    - the overall score and latency deltas stay positive with tighter confidence intervals.
- Finalized thinking strict lane (2026-03-08) is now measurable rather than all-zero:
  - artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T010353Z.json
  - result (8 samples/task, gpqa_diamond + ifeval, 3 seeds):
    - overall: runtime 0.250000 vs vLLM 0.194444, runtime 6823.816 ms vs vLLM 7503.000 ms
    - gpqa_diamond: runtime 0.166667 vs vLLM 0.166667, runtime 7727.880 ms vs 7741.028 ms
    - ifeval: runtime 0.333333 vs vLLM 0.222222, runtime 5919.753 ms vs 7264.973 ms
  - interpretation:
    - the old runtime 512 cap and long-decode corruption were real and are now fixed,
    - the closed-form finalize pass turns length-exhausted thinking traces into parseable answers on both backends,
    - reducing the GPQA first-pass reasoning budget to 256 preserves the new score lead while collapsing the old GPQA latency penalty,
    - the resulting finalized thinking lane now beats vLLM overall on both score and latency.
- GSM8K-only finalized thinking follow-up (2026-03-09) extends the same lane to another closed-form task family:
  - artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T022347Z.json
  - result (32 samples/task, 3 seeds):
    - runtime 0.197917 vs vLLM 0.177083, runtime 7174.829 ms vs vLLM 7643.231 ms
  - interpretation:
    - the same finalized thinking setup remains directionally runtime-positive on GSM8K,
    - but the score interval is still too wide for a strong claim, so this should be treated as exploratory support rather than canonical proof.
- AIME25 isolated finalized thinking pilot (2026-03-09) is a negative result:
  - artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T021732Z.json
  - result (8 samples, 1 seed, 512 tokens, patched AIME prompts):
    - runtime 0.0 vs vLLM 0.0, runtime 19776.254 ms vs vLLM 16092.718 ms
  - interpretation:
    - increasing the reasoning budget and adding AIME-specific prompt/finalize guidance still does not recover AIME25,
    - this should be treated as an explicit limitation of the current thinking harness and/or the model size.
- AIME25 second-thinking recovery attempt (2026-03-09) was also negative:
  - artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T021331Z.json
  - result:
    - runtime 0.0 vs vLLM 0.0, runtime 21409.322 ms vs vLLM 22110.402 ms
  - interpretation:
    - a second short thinking finalize pass increases cost and still does not recover AIME,
    - so this branch remains non-canonical.
- Late AWS sampled-lane fix (2026-03-08) resolved the last Qwen3.5 reproducibility blocker:
  - root cause was in scripts/phase5_awareness_realbench.py, not the runtime:
    - the shared first-pass arm_a_control request skipped the request seed and task-specific decode payload
  - post-fix sampled runtime-only reproducibility probes:
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r1.json
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r2.json
    - result: repeated sampled IFEval seed-7 runs are identical (score_mean=0.3125 both, 8/8 outputs identical)
  - post-fix sampled strict one-host matrix artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T220806Z.json
  - repeatability confirmation artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T221823Z.json
  - result:
    - overall: runtime 0.409722 vs vLLM 0.302083, runtime 1617.187 ms vs vLLM 2017.206 ms
    - gpqa_diamond: runtime 0.3750 vs vLLM 0.2500, runtime slower (710.693 ms vs 435.823 ms)
    - ifeval: runtime 0.4444 vs vLLM 0.3542, runtime faster (2523.680 ms vs 3598.588 ms)
    - repeatability check stayed aligned: overall runtime 0.409722 vs vLLM 0.281250, runtime 1607.757 ms vs vLLM 2008.759 ms
  - interpretation:
    - sampled-lane drift was a harness bug rather than runtime instability,
    - there is now a clean sampled strict AB3 lane where runtime wins overall on both score and latency,
    - and that result holds on an immediate second full-matrix rerun.
- First explicit thinking-mode strict parity lane is now measured (2026-03-08), but it is not yet promotable:
  - initial thinking matrix: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T223442Z.json
    - overall: runtime 0.166667 vs vLLM 0.111111, runtime 3589.124 ms vs vLLM 4635.395 ms
  - budget-fixed thinking matrix: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T224358Z.json
    - overall: runtime 0.166667 vs vLLM 0.111111, runtime 8678.709 ms vs vLLM 9041.981 ms
    - gpqa_diamond: runtime 0.0 vs vLLM 0.0
  - one-example long-budget probes:
    - benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_runtime_1024_20260308T230352Z.json
    - benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_vllm_1024_20260308T230352Z.json
  - interpretation:
    - under the raw thinking template, both backends can stay trapped in reasoning without emitting a usable final GPQA answer,
    - that exploratory lane should now be read as the pre-fix baseline for the finalized result above, not as the current canonical thinking state.
- Late AWS deterministic rerun (2026-03-08) is now the cleanest claim-safe Qwen3.5 one-host lane:
  - runtime-side reproducibility fix landed in monolith/server/http.c: request-scoped decode env overrides are now serialized instead of racing through process-global env state
  - direct runtime reproducibility probe:
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r1.json
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r2.json
    - result: identical outputs and identical score (0.5625) on repeated temperature=0 IFEval seed-7 runs
  - deterministic strict one-host matrix artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T204248Z.json
  - result:
    - overall: runtime 0.295139 vs vLLM 0.267361, runtime 824.714 ms vs vLLM 1572.529 ms
    - gpqa_diamond: score parity (0.166667 vs 0.166667), runtime slower (671.640 ms vs 436.583 ms)
    - ifeval: runtime leads on score (0.423611 vs 0.368055) and latency (977.787 ms vs 2708.475 ms)
  - interpretation:
    - there is now a reproducible deterministic strict lane where runtime wins overall on both score and latency.
- Historical note: the earlier sampled-lane reproducibility failure (2026-03-08) is now explained and non-canonical:
  - repeated runtime-only IFEval seed-7 sampled runs:
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_r1.json
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_r2.json
  - result:
    - the summary moved 0.375 -> 0.500 with the same seed/config
    - all 8/8 example outputs changed between reruns
  - interpretation:
    - these old drift artifacts came from the harness shared-first path skipping the request seed,
    - they should not be interpreted as runtime sampler instability.
- Late AWS sampler update (2026-03-08) materially changed the Qwen3.5 strict picture again:
  - chunked stop-check plus fast top-k sampling landed after the hybrid prefill work,
  - focused GPQA decode profile moved:
    - decoder_step0_logits_sample 40.701 -> 3.538 ms
    - decoder_stepN_sample_mean 37.090 -> 2.366 ms
    - decoder_stepN_total_mean 47.748 -> 12.721 ms
  - focused artifacts:
    - benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-stopchunk8_20260308T003422Z.json
    - benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-samplefast1_20260308T003727Z.json
  - fast-sampler AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T003749Z.json
    - overall: runtime 0.305556 vs vLLM 0.347222, runtime 1405.707 ms vs vLLM 1676.336 ms
  - tie-stable fast-sampler AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T004758Z.json
    - overall: runtime 0.315972 vs vLLM 0.347222, runtime 1422.818 ms vs vLLM 1659.878 ms
    - gpqa_diamond: runtime 0.291667 vs vLLM 0.208333, runtime 886.296 ms vs vLLM 515.171 ms
    - ifeval: runtime 0.340278 vs vLLM 0.486111, runtime 1959.340 ms vs vLLM 2804.584 ms
  - interpretation:
    - runtime now has a clean strict latency lead on the one-host Qwen3.5 matrix,
    - prompt prefill and sampled decode are both materially improved,
    - the remaining blocker is recovering the small score deficit without giving back that latency win.
- Late AWS update (2026-03-08) materially changed the Qwen3.5 strict picture:
  - new batched hybrid prefill landed in monolith/models/decoder.cu and monolith/main.c,
  - focused GPQA profile moved:
    - decoder_prefill 3263.527 -> 1341.628 -> 275.372 ms
    - decoder_ttft 3317.441 -> 1405.739 -> 1017.876 ms
  - latest strict AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T000429Z.json
  - result:
    - overall score: runtime 0.413195 vs vLLM 0.347222
    - overall latency: runtime 2940.172 ms vs vLLM 1686.263 ms
    - gpqa_diamond: runtime 0.458333 vs vLLM 0.208333, latency 1347.582 ms vs 512.075 ms
    - ifeval: runtime 0.368055 vs vLLM 0.486111, latency 4532.763 ms vs 2860.452 ms
  - interpretation: prompt prefill was a real architectural blocker and is no longer the main latency limiter; the remaining gap is narrower and now looks more like warm decode/request-path overhead.
- Late AWS update (2026-03-07) now has two new concrete results:
  - Qwen3.5 strict/AWS launcher drift is now fixed through a shared fast-path env:
    - code: scripts/qwen_runtime_env.py
    - consumers: scripts/qwen35_remote_isolated_ab.py, scripts/treni_local_tool_worker.py, scripts/hermes_same_vm_mvp.py
    - clean AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T231500Z.json
    - result:
      - overall score: runtime 0.335648 vs vLLM 0.291667
      - overall latency: runtime 3690.124 ms vs vLLM 1646.672 ms
      - gpqa_diamond: score parity (0.25 vs 0.25) but runtime remains much slower
      - ifeval: runtime higher score (0.421296 vs 0.333333) but still slower
    - interpretation: the old launcher mismatch was real, but fixing it does not close the latency gap; the remaining blocker is still long-prompt prefill.
    - the code-level reason for the remaining Qwen3.5 latency gap is now explicit:
      - monolith/models/decoder.cu currently returns invalid from treni_decoder_forward_f32(...) when ctx->is_linear_attn is true,
      - the comment in that path states that Qwen3.5 linear attention is implemented only in cached/token decode,
      - so Qwen3.5 prompt prefill still falls back to the token-by-token cached loop in monolith/main.c instead of a true batched prompt-prefill path.
      - interpretation: the remaining long-prompt latency gap is architectural, not just a missing launch flag.
  - ORPO self-reload loop is real end-to-end:
    - artifact: benchmarks/same_vm_mvp/results/samevm-orpo-reload-aws_20260307T222341Z.json
    - local ORPO output was merged, packed into a new monolith container, restarted as a second runtime, and answered a real chat request.
  - Qwen3.5 shared-prefix tiering (64 -> 112 runtime cache cap with quartile tiers + exact replay) yields a real clean latency win, but not a full fix:
    - sequential GPQA profile artifact: benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-seq2-cap112_20260307T222540Z.json
    - clean strict seed-7 spot A/B artifacts:
      - benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T223218Z.json
      - benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T223555Z.json
    - effect on runtime latency (112 vs 64):
      - overall -363.908 ms
      - gpqa_diamond -420.699 ms
      - ifeval -307.116 ms
    - runtime is still slower than vLLM overall even after this improvement.
- Qwen3.5 contract validation + one-host strict rerun are now updated on AWS (2026-03-07):
  - tokenizer audit artifact: benchmarks/qwen35_tokenizer_audit/results/qwen35-tokenizer-audit-active_20260307T173024Z.json
  - runtime smoke artifact: benchmarks/qwen35_smoke/results/qwen35-runtime-smoke-active2_20260307T173132Z.json
  - isolated semantic A/B artifact: benchmarks/qwen35_smoke/results/qwen35-isolated-ab-active_20260307T173228Z.json
  - strict one-host matrix artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T191653Z.json
  - new runner: scripts/phase5_qwen35_remote_strict_matrix.py
  - current state:
    - packed tokenizer exactly matches the HF full vocab for Qwen/Qwen3.5-0.8B (248077 tokens),
    - runtime extended non-thinking smoke passes 7/7 cases on the active AWS host,
    - runtime wins the isolated non-thinking probe suite overall, while matched vLLM still misses multimodal-placeholder and forced-thinking probe cases in that probe harness,
    - strict realbench score is no longer behind overall: runtime score 0.3333 vs vLLM 0.3160,
    - strict realbench latency is still far behind: runtime 3809.745 ms vs vLLM 1626.068 ms.
  - request-path fixes included in that rerun:
    - Qwen3.5 decoder prefix cache now defaults on with 64 prefix tokens,
    - timing.ttft_ms now measures request-path first-token timing instead of the decode-loop step-0 proxy,
    - a repeated prompt-family hot probe on AWS dropped from infer_ms ~1798.5 -> 842.4 ms and ttft_ms ~1531.9 -> 782.5 ms with a cache hit.
  - task split in the strict one-host matrix:
    - gpqa_diamond: score parity (0.2917 vs 0.2917) but runtime is still much slower (+2449.320 ms),
    - ifeval: runtime higher score (0.3750 vs 0.3403) but still slower (+1918.033 ms).
- Follow-up prefix-cache debugging on AWS (2026-03-07, non-canonical debug cycle) isolated a real short-prompt runtime bug:
  - a focused 2x gpqa + 2x ifeval profile with cache enabled showed:
    - GPQA does get a real 64-token prefix-cache hit,
    - that hit reduces prefill (~3075 ms -> ~2697 ms) but does not close the large prefill gap,
    - short IFEval requests were hitting CUDA invalid argument on the prefix-cache/store path and then poisoning the next request.
  - a focused no-cache rerun removed the short-prompt CUDA failures entirely.
  - safe fix landed in monolith/main.c: only long prompts are allowed to store into the prefix cache; short prompts skip the buggy store path.
  - a focused post-fix profile confirms:
    - the GPQA cache hit is preserved,
    - the short IFEval CUDA invalid-argument path is gone in the probe,
    - this is a stability/correctness fix, not yet a canonical strict-matrix win.
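The length-gated store policy above can be sketched in a few lines. This is an illustrative Python sketch, not the real fix: the actual gate lives in C in monolith/main.c, and the 64-token threshold and dict-based cache here are assumptions chosen to match the 64-token prefix-cache hit described above.

```python
MIN_STORE_TOKENS = 64  # assumed gate; the real constant lives in monolith/main.c

def maybe_store_prefix(cache: dict, prompt_tokens: list) -> bool:
    """Only long prompts may store into the prefix cache; short prompts
    skip the store path entirely (the path that was raising CUDA
    invalid-argument errors and poisoning the next request)."""
    if len(prompt_tokens) < MIN_STORE_TOKENS:
        return False  # short prompt: skip the buggy store path
    key = tuple(prompt_tokens[:MIN_STORE_TOKENS])
    cache[key] = True  # stand-in for storing real KV-cache state
    return True
```

The point of the fix is that skipping the store for short prompts loses almost nothing (short prompts have little prefill to save) while removing the crash path.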
- Same-VM Hermes wrapper recovery is now complete on AWS (2026-03-07):
  - artifact: benchmarks/same_vm_mvp/results/samevm-q35-stack_20260307T172158Z.json
  - result: the wrapper now auto-starts the local runtime + CPU tool worker, calls samevm_runtime_health, runs the real extended Qwen3.5 smoke suite, and emits a deterministic plain-text summary from tool outputs.
  - entrypoints: scripts/hermes_same_vm_mvp.py, scripts/run_samevm_qwen35_stack.sh
  - smoke sub-artifacts: benchmarks/qwen35_smoke/results/samevm-smoke-20260307T172124Z_20260307T172124Z.json, benchmarks/qwen35_smoke/results/samevm-smoke-20260307T172124Z_20260307T172124Z.md
  - current status: PASS across 7/7 smoke cases in the wrapper path; the remaining issue is an intermittent first tool-turn CUDA retry (compute/ops.cu:765, invalid argument) that recovered successfully in the observed run.
- Same-VM runtime-admin proof is now clean on AWS (2026-03-07):
  - artifact: benchmarks/same_vm_mvp/results/samevm-q35-runtime-admin-proof-v5_20260307T212852Z.json
  - result: Hermes calls the local samevm_runtime_status and samevm_multimodal_status tools, and the wrapper now rewrites partial/truncated model responses into a deterministic tool-derived summary.
  - current state in that artifact:
    - runtime is managed by the worker on http://127.0.0.1:18080,
    - the managed runtime PID is live (pid_running=yes),
    - the Qwen3.5 runtime uses the packed local container with prefix cache enabled,
    - multimodal defaults are loaded from the same local worker (embed, rerank, tts, stt).
- Same-VM ORPO control-plane proof is now complete on AWS (2026-03-07):
  - artifact: benchmarks/same_vm_mvp/results/samevm-orpo-probe-aws_20260307T215307Z.json
  - runner: scripts/samevm_orpo_probe.py
  - result:
    - the local preference dataset write succeeded,
    - a real background ORPO job launched through the worker,
    - the job completed with returncode=0,
    - current scope is training control, not hot-reload: adapter/container ingestion back into the monolith runtime is still not wired.
- Same-VM multimodal tool surface is now wired into the local worker + Hermes bridge (2026-03-07):
  - code: scripts/samevm_multimodal_models.py, scripts/treni_local_tool_worker.py, scripts/hermes_same_vm_mvp.py
  - new tool classes: samevm_multimodal_status, samevm_embed, samevm_rerank, samevm_tts, samevm_stt
  - default models:
    - Qwen/Qwen3-VL-Embedding-2B
    - Qwen/Qwen3-VL-Reranker-2B
    - Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
    - Qwen/Qwen3-ASR-0.6B
    - Whisper fallback STT if the requested model id contains whisper
  - a live worker status smoke confirms the new endpoints are reachable.
  - bootstrap entrypoint: scripts/bootstrap_samevm_multimodal.sh
  - MVP readme: benchmarks/same_vm_mvp/README.md
- First real same-VM stack proof now runs end-to-end on AWS (2026-03-07):
  - artifact: benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v3_20260307T213248Z.json
  - runner: scripts/samevm_stack_probe.py
  - confirmed in one local-worker pass:
    - runtime status: healthy managed Qwen3.5 runtime on the same VM,
    - SQLite exec/query: pass (1 row),
    - RAG ingest/search: pass (match_count=1, top hit "Same VM locality"),
    - TTS: pass with Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice,
    - STT: pass with Whisper fallback on the generated WAV,
    - embedding: pass with Qwen/Qwen3-VL-Embedding-2B (dim=2048),
    - reranking: pass with Qwen/Qwen3-VL-Reranker-2B.
  - caveat: the Whisper transcript is directionally correct but not exact on the synthetic audio ("Treni" misheard), so the current STT proof is functional, not quality-benchmarked.
- Same-VM multimodal cache retention bug is now explicit and mitigated on AWS (2026-03-07):
  - finding: after the multimodal proof, the local tool worker was holding about 13.3 GiB of GPU memory and starving the Qwen runtime path.
  - fix:
    - new worker endpoint: POST /v1/mm/clear_cache
    - new Hermes tool: samevm_multimodal_clear_cache
    - status now reports loaded_model_count, loaded_models, and CUDA allocation/reservation.
  - code: scripts/samevm_multimodal_models.py, scripts/treni_local_tool_worker.py, scripts/hermes_same_vm_mvp.py
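For reference, the cache-clear endpoint above can be exercised from any stdlib HTTP client. The endpoint path comes from the changelog; the base URL/port, empty-JSON body, and response shape are assumptions about the local worker deployment, not documented API contract.

```python
import urllib.request

def build_clear_cache_request(base_url: str) -> urllib.request.Request:
    """Build the POST /v1/mm/clear_cache request used to ask the local
    multimodal worker to release held GPU memory. base_url (host/port)
    is deployment-specific; the body is an assumed empty JSON object."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/mm/clear_cache",
        data=b"{}",
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Sending the request with urllib.request.urlopen(...) would then trigger the clear; the samevm_multimodal_clear_cache Hermes tool wraps the same call.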
- Canonical same-VM MVP proof was revalidated on AWS after the runtime compatibility fix (2026-03-10):
  - artifact: benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json
  - summary artifact: benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.md
  - result:
    - runtime health: pass
    - worker health: pass
    - Hermes runtime-status: pass
    - Hermes multimodal-status: pass
    - direct same-VM runtime smoke: pass on the basic non-thinking profile (all_ok=True, 5 cases, includes first-turn tool calling)
    - direct same-VM thinking smoke: pass on the extended/thinking profile with exact-match checks (all_ok=True)
    - direct same-VM stack probe: pass for SQLite, RAG, embedding, reranking, TTS, Qwen ASR STT
    - Qwen3.5 ORPO reload proof: pass via benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json
    - sidecar cleanup: pass (port=18081, stopped=true)
    - final multimodal cache clear: pass
  - implementation note:
    - the full demo runner is now scripts/samevm_full_mvp_demo.py
    - the one-command entrypoint is scripts/run_samevm_full_mvp.sh
  - compatibility fix:
    - monolith/server/http.c now accepts both POST /v1/chat/completions and POST /chat/completions
    - monolith/server/http.c now exposes both GET /v1/models and GET /models
    - this removed the live Hermes 404 failure against the root runtime URL on AWS
  - extra Hermes tool proofs after the rerun:
    - benchmarks/same_vm_mvp/results/hermes-demo-sqlite-query-v2.json
    - benchmarks/same_vm_mvp/results/hermes-demo-rag-search-v1.json
    - these confirm Hermes can use real SQLite and RAG tools on AWS beyond the status-only path
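The compatibility fix above is simple route aliasing: both the versioned and unversioned endpoint forms map onto one handler. A minimal sketch of that idea, in Python for illustration (the real server is C in monolith/server/http.c, and the alias table here is only the two pairs named in the changelog):

```python
# Alias table: unversioned paths map onto their /v1 canonical form.
ROUTE_ALIASES = {
    "/chat/completions": "/v1/chat/completions",
    "/models": "/v1/models",
}

def normalize_route(path: str) -> str:
    """Return the canonical handler key for a request path, so that
    clients hitting the root runtime URL (no /v1 prefix) no longer 404."""
    return ROUTE_ALIASES.get(path, path)
```

With this in place, a client configured with either base URL resolves to the same chat-completions handler, which is what removed the live Hermes 404 on AWS.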
- Live capability validation pass on AWS added concrete speed and multimodal proofs (2026-03-10):
  - direct generation speed on the current live Qwen3.5 runtime (0.8B, non-thinking lane):
    - 3 deterministic runs, 130 completion tokens each
    - mean end-to-end throughput: 112.37 tok/s
    - mean decode-only throughput: 121.90 tok/s
  - Hermes tool visibility proof: benchmarks/same_vm_mvp/results/hermes-tool-list-v1.json
    - loaded tools include runtime control, smoke, SQLite, RAG, embedding, reranking, TTS, STT, ORPO, and job status
  - Hermes audio roundtrip proofs:
    - TTS: benchmarks/same_vm_mvp/results/hermes-tts-v2.json
    - STT: benchmarks/same_vm_mvp/results/hermes-stt-v2.json
    - the current transcript roundtrip still shows the known synthetic-voice name drift (Treni -> Trinity)
  - PDF/RAG real-world proof:
    - extracted /Users/andrewcorrea/pncp-ata360/docs/manual-pncp-api.pdf to text
    - ingested the extracted text into the AWS same-VM RAG store
    - a search for "Protocolo de Comunicação PNCP" returned the correct manual section
  - reranker proof:
    - a direct Qwen reranker call correctly ranked the "Protocolo de Comunicação" candidate first
  - current product caveat:
    - same-VM RAG currently ingests plain text files and text payloads only; PDF parsing is still an external preprocessing step
  - current 4B feasibility note:
    - the live AWS host is an A10G 24 GB box with enough GPU headroom to try Qwen3.5-4B,
    - but only about 12 GB of root disk remains free, so model download/pack is the practical blocker on the current host
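On the end-to-end vs decode-only distinction in the throughput numbers above: a plausible reading (assumption, since the measurement script is not shown) is that end-to-end divides completion tokens by total request wall time, while decode-only excludes time-to-first-token, which is why decode-only is the higher figure.

```python
def throughput(completion_tokens: int, total_s: float, ttft_s: float):
    """Hypothetical throughput helper matching that reading:
    end-to-end uses the whole request wall time; decode-only excludes
    TTFT (the first token is produced at ttft_s, so decode covers the
    remaining completion_tokens - 1 tokens)."""
    end_to_end = completion_tokens / total_s
    decode_only = (completion_tokens - 1) / (total_s - ttft_s)
    return end_to_end, decode_only
```

Under this definition, a fast prefill narrows the gap between the two figures, consistent with the small 112.37 vs 121.90 tok/s spread reported for the warm 0.8B lane.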
- Qwen3.5 thinking-mode response salvage is now wired for unfinished reasoning traces (2026-03-08):
  - code: monolith/server/http.c
  - behavior:
    - unfinished "Thinking Process:" outputs on exact-output prompts now return a usable final message.content
    - the raw reasoning trace is preserved in message.reasoning_content
    - finish_reason remains length when the model still truncates
  - passing artifact: benchmarks/qwen35_smoke/results/postmvp-extended-thinking-r4_20260308T193858Z.json
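The salvage behavior above can be sketched as a response post-processor. This is an illustrative Python sketch under stated assumptions: the real logic is C in monolith/server/http.c, the "Thinking Process:" marker is taken from the changelog, and the last-line heuristic for recovering a usable answer is a guess at the mechanism, not a description of it.

```python
def salvage_thinking(raw: str) -> dict:
    """Split an unfinished 'Thinking Process:' trace into a usable final
    message.content plus a preserved message.reasoning_content."""
    marker = "Thinking Process:"
    if marker not in raw:
        # Normal completion: no reasoning trace to salvage.
        return {"content": raw.strip(), "reasoning_content": None}
    reasoning = raw.split(marker, 1)[1]
    # Heuristic stand-in: treat the last non-empty line of the trace as
    # the best available final answer when the model never closed its
    # reasoning block (finish_reason stays "length" in that case).
    lines = [ln.strip() for ln in reasoning.splitlines() if ln.strip()]
    content = lines[-1] if lines else ""
    return {"content": content, "reasoning_content": reasoning.strip()}
```

The key property is that exact-output scoring sees a parseable message.content while the raw trace survives in reasoning_content for debugging.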
- Qwen3.5 ORPO reload is now promoted to the target family (2026-03-08):
  - tokenizer parity for packed tokenizer.json + merges is now exact in the runtime
  - repacked ORPO sidecar containers now carry the correct runtime model kind (qwen3_5)
  - canonical proof: benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json
  - observed sidecar chat preview: "READY Local tools are useful because they offer immediate, offline access..."
- Same-VM multimodal STT is now promoted from Whisper fallback to Qwen ASR on AWS (2026-03-08):
  - wrapper fix: the local STT loader now initializes the forced aligner only when timestamps are explicitly requested
  - AWS disk cleanup removed obsolete ORPO runtime candidates and recovered the box from 100% to 88% root usage
  - passing Qwen ASR probe: benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v5.json
  - current observed transcript on the synthetic TTS probe: "Trinity runs its tools locally on the same machine."
  - remaining caveat:
    - timestamped STT still depends on the forced-aligner path and sufficient local disk to materialize that model
- Extended same-VM runtime smoke is now fully green (2026-03-08):
  - extended non-thinking profile: pass (benchmarks/qwen35_smoke/results/postmvp-extended_20260308T185130Z.json, 7/7 cases)
  - extended thinking profile: pass (benchmarks/qwen35_smoke/results/postmvp-extended-thinking-r4_20260308T193858Z.json)
- Clean direct GPQA runtime profile is now captured on AWS (2026-03-07):
  - artifact: benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-clean_20260307T220200Z.json
  - runner: scripts/q35_gpqa_profile_once.py
  - result:
    - on a fresh runtime, the same GPQA prompt run twice still spends most time in decoder_prefill,
    - first call: decoder_prefill=3263.527 ms, decoder_ttft=3317.441 ms,
    - second call with prefix-cache reuse: decoder_prefill=2690.001 ms, decoder_ttft=2750.672 ms,
    - step-0 decode itself is small (decoder_step0_layers ~8 ms, decoder_step0_logits_sample ~33-36 ms),
    - tensor upload improves sharply on the second call (218.091 -> 11.216 ms), but the remaining gap is still overwhelmingly prefill.
- Qwen3.5 tokenizer audit is now exact for the current packed target (2026-03-06):
  - artifact: benchmarks/qwen35_tokenizer_audit/results/runtime-q35-tokenizer-audit-r4_20260306T190418Z.json
  - the packed runtime tokenizer/full vocab exactly matches HF Qwen/Qwen3.5-0.8B (248077 tokens), including <think> and vision/control tokens.
- New Qwen3.5 probe matrix (2026-03-06) gives a cleaner functional picture than the earlier strict matrix alone:
  - artifact: benchmarks/qwen35_smoke/results/qwen35-probe-matrix-r2_20260306T200035Z.json
  - runtime non-thinking passes the full extended probe set (all_ok=true).
  - runtime thinking also completes all cases, but output discipline is weak and latency is very high.
  - vLLM non-thinking/thinking are not universal wins in this matrix:
    - the current text-only launch rejects multimodal placeholder input (400),
    - several thinking/exact-output cases end with finish_reason=length.
  - claim-safe interpretation:
    - runtime is functionally solid in non-thinking,
    - vLLM remains much faster on long-prompt/tool cases,
    - the current runtime blocker is long-prompt/tool-path infer latency, not basic Qwen3.5 tokenizer or tool-call plumbing.
- Same-VM Hermes MVP is now materially real (2026-03-06):
  - smoke artifact: benchmarks/same_vm_mvp/results/hermes-samevm-q35-smoke-r5_20260306T192703Z.json
  - ORPO smoke launch artifact: benchmarks/same_vm_mvp/results/hermes-samevm-q35-orpo-smoke-r1_20260306T194152Z.json
  - the local worker completed a real ORPO training smoke run and saved output under benchmarks/same_vm_mvp/trainings/samevm-orpo-qwen25-smoke3/.
- Phase 5 harness parse hardening landed (2026-03-04) for closed-form tasks:
  - long reasoning traces no longer get scored via accidental "last number" extraction,
  - the parser now requires explicit answer signals (ANSWER: / Final Answer: / boxed / strict numeric-only),
  - think-tag blocks are stripped before parse.
  - code: /Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py
  - sanity artifact (vLLM thinking mode, no parser): phase5_awareness_realbench_q35-parsefix-vllm-thinking1_20260304T032441Z.json now yields prediction_parsed=null instead of a false numeric parse.
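The hardening above amounts to a whitelist parser. A minimal sketch of that idea, assuming the signal list from the changelog; the real parser lives in scripts/phase5_awareness_realbench.py and its exact regexes and priority order are not shown here, so details below are illustrative.

```python
import re

def parse_closed_form(text: str):
    """Strict closed-form answer extraction: only explicit answer
    signals count, and there is deliberately no 'last number in the
    trace' fallback, so unfinished reasoning parses as None."""
    # Strip think-tag blocks before parsing.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Explicit answer signals, checked in an assumed priority order.
    for pat in (r"ANSWER:\s*(.+)",
                r"Final Answer:\s*(.+)",
                r"\\boxed\{([^}]*)\}"):
        m = re.search(pat, text)
        if m:
            return m.group(1).strip()
    # Strict numeric-only output (the whole reply is one number).
    stripped = text.strip()
    if re.fullmatch(r"-?\d+(?:\.\d+)?", stripped):
        return stripped
    return None  # unparsed: scored as wrong rather than guessed
```

Returning None here is what produces the prediction_parsed=null behavior noted in the sanity artifact, instead of a false numeric parse from a stray number in the trace.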
- New strict paired AB3 rerun (2026-03-04, gpqa_diamond + ifeval, Arm A only, request_logprobs=false, 16/task, seeds 7, 17, 27) is complete:
  - summary artifact: phase5_q35_runtime_vs_vllm_ifeval_gpqa_ab3_20260304T034227Z.json
  - overall: runtime score 0.3403 vs vLLM 0.3229 (delta +0.0174, CI includes 0), runtime latency 1772.931 ms vs vLLM 1553.034 ms (delta +219.897 ms).
  - stratified:
    - gpqa_diamond: score parity (0.2708 vs 0.2708), runtime much slower (+1881.776 ms).
    - ifeval: runtime better score (+0.0347) and much faster latency (-1441.983 ms).
- Runtime awareness retries remain non-promotable on this exact profile (2026-03-04):
  - Arm B/C retries did not improve gpqa_diamond,
  - both regressed ifeval and added latency in the tested settings (adaptive, summary-mode uncertainty, no token-logprobs).
- Strict benchmark guard is now enforced for runtime HTTP runs (2026-03-02): TRENI_HTTP_REQUIRE_INFERENCE=1 returns a hard failure (502 inference_required) when inference is unused or empty, eliminating silent heuristic-fallback/zero-filled artifact contamination.
- Phase 5 paper-loop harness bug fix landed (2026-03-03): in paper mode, retry now commits the refined pass directly (instead of passing through confidence-margin replacement gating).
  - code: /Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py
  - first live sanity artifacts:
    - vLLM (gpqa_diamond + ifeval, 8 samples/task): phase5_awareness_realbench_qwen35-paperfix-sanity1_20260303T201156Z.json (overall B-A=0.0, gpqa +0.125, ifeval -0.125).
    - runtime (same set): phase5_awareness_realbench_qwen35-paperfix-sanity2-runtime_20260303T201744Z.json (overall B-A=-0.125, retry rate 100%).
- Runtime "all-zero" sanity artifact (phase5_awareness_realbench_qwen35-paperfix-sanity1-runtime_20260303T201620Z.json) was an infra contamination case, not a scoring result:
  - vLLM and runtime were co-resident on a single A10G,
  - runtime hit GPU OOM on embedding upload, and the strict guard returned 502 inference_required for all requests.
- Paper trigger calibration issue is now isolated on runtime (`2026-03-03`):
  - with default paper thresholds, `max_entropy` triggered retries on all samples (`16/16`),
  - raising only the perplexity threshold (`1.4 -> 1.8 -> 2.2`) had no effect on behavior/outcome,
  - raising the entropy threshold (`1.5 -> 7.0`) reduced retries (`16 -> 9`) but still produced no score uplift (overall `B-A=0.0`) and added latency.
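Two of the multi-signal triggers can be sketched as follows; this is a simplified illustration, not the harness's actual implementation. Approximating "entropy" from the chosen-token logprob is an assumption here, and the default thresholds (`1.4`, `1.5`) are the only values taken from the log.

```python
import math

def should_retry(token_logprobs, ppl_max=1.4, entropy_max=1.5):
    """Return the list of fired trigger reasons for one completion.

    Sketch of perplexity / max_entropy style triggers: perplexity is
    exp(mean negative logprob); per-token uncertainty is approximated
    here by the chosen token's negative logprob.
    """
    ppl = math.exp(-sum(token_logprobs) / len(token_logprobs))
    max_uncertainty = max(-lp for lp in token_logprobs)
    reasons = []
    if ppl > ppl_max:
        reasons.append("perplexity")
    if max_uncertainty > entropy_max:
        reasons.append("max_entropy")
    return reasons
```

With a single very uncertain token, `max_entropy` fires even when overall perplexity is modest, which matches the observed over-triggering at default thresholds.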
- Runtime summary-mode calibration fix is now live (`2026-03-03`):
  - retry logic now detects `runtime_summary` uncertainty payloads and uses guarded vote triggering (`paper_summary_*` thresholds) instead of entropy-only firing.
  - artifact (`8`/task sanity): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-sanity2_20260303T204120Z.json`
    - retry dropped (`16 -> 9`),
    - quality moved from negative to parity (`overall B-A: -0.125 -> 0.0`),
    - latency overhead remains material (`~+1386 ms` mean).
  - higher-N confirmation (`32`/task): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32_20260303T204751Z.json`
    - mixed result (`gpqa +0.03125`, `ifeval -0.0625`, overall `B-A=-0.015626`).
- Task-aware summary policy is now the first repeatable positive awareness result on this Qwen3.5 runtime track (`2026-03-03`):
  - policy: keep summary-mode paper retries for `gpqa_diamond`, disable summary-mode retries for `ifeval`,
  - one larger run (`32`/task): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32-ifevaloff_20260303T222841Z.json`
    - overall `B-A=+0.015624`, latency delta `+618.068 ms`, `gpqa +0.03125`, `ifeval +0.0`.
  - 3-seed repeatability (`16`/task): `...ifevaloff-rpt-s7/s17/s27...`
    - overall delta mean `+0.020833` (range `0.0` to `+0.03125`),
    - retries constrained to `gpqa` only, mean retry rate `~0.2917`.
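The task-aware policy above reduces to a small gate. The function and set names here are illustrative assumptions; only the gpqa-on / ifeval-off split comes from the findings.

```python
def summary_retry_allowed(task, uncertainty_vote):
    """Allow summary-mode paper retries only on tasks where they helped.

    Retries stay enabled for gpqa_diamond and are disabled for ifeval,
    where they regressed score; the guarded uncertainty vote must still
    fire for an allowed task.
    """
    retry_tasks = {"gpqa_diamond"}
    disabled_tasks = {"ifeval"}
    if task in disabled_tasks or task not in retry_tasks:
        return False
    return bool(uncertainty_vote)
```

Splitting the policy per task is what turned a mixed overall delta into the first repeatable positive one.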
- Late Phase 5 policy pass (`2026-03-03`) reduced awareness latency overhead without losing quality signal:
  - root cause isolated: most `gpqa_diamond` retries were `invalid_parse` with high first-pass confidence, and those retries were usually non-productive.
  - harness changes:
    - compact invalid-parse recovery prompt (`build_format_recovery_messages`),
    - new confidence gate for invalid-parse retries on closed-form tasks (`--invalid-parse-retry-confidence-max`).
  - code: `/Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py`
  - 3-seed repeatability (`16`/task, `s7/s17/s27`) with `invalid_parse_retry_confidence_max=0.73`:
    - artifacts: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s16_20260303T232029Z.json`, `...-rpt-s17_20260303T232254Z.json`, `...-rpt-s27_20260303T232516Z.json`
    - quality delta unchanged vs prior baseline: `overall B-A mean = +0.020833`,
    - latency overhead reduced: `+712.276 ms -> +404.603 ms`,
    - GPQA retry rate reduced: `0.5833 -> 0.2917`.
  - higher-N confirmation (`32`/task, `s7`):
    - artifact: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s32_20260303T232755Z.json`
    - same quality delta as the prior `s32` baseline (overall `B-A=+0.015624`) with lower latency overhead (`+618.068 ms -> +326.187 ms`).
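The confidence gate above can be sketched in a few lines. The signature is an assumption for illustration; the `0.73` threshold mirrors the `--invalid-parse-retry-confidence-max` setting used in the repeatability runs.

```python
def gate_invalid_parse_retry(parse_ok, first_pass_confidence, confidence_max=0.73):
    """Decide whether an invalid-parse retry should run.

    Skip the retry when the first pass failed to parse but was highly
    confident: those retries were observed to be mostly non-productive,
    so gating them trades no quality for lower latency overhead.
    """
    if parse_ok:
        return False  # nothing to recover
    return first_pass_confidence <= confidence_max
```

This is why the quality delta stayed at `+0.020833` while mean retry latency roughly halved: only low-confidence parse failures still pay for a second pass.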
- Qwen3.5 strict runtime-vs-vLLM matrix has a new Arm A-only canonical rerun (`2026-03-03`, `phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.json`) after the decoder gate-layout fix:
  - score gap narrowed (runtime `0.15625` vs vLLM `0.19097`, delta `-0.03472`, CI includes near-parity),
  - latency is still behind (runtime `1723.685 ms` vs vLLM `958.757 ms`, delta `+764.928 ms`).
- Qwen3.5 strict runtime-vs-vLLM canonical matrix is now completed (`2026-03-02`) after the decoder-path unblock (`phase5_qwen35_runtime_vs_vllm_matrix_20260302T222013Z.json`): runtime is currently behind on both score (`0.0503` vs `0.2170`) and latency (`1881.188 ms` vs `178.093 ms`).
- Follow-up Q/K norm fix check (`qnorm-check1`, `2026-03-02`) did not resolve the Qwen3.5 gap (`phase5_qwen35_runtime_vs_vllm_matrix_20260302T225529Z.json`): runtime remained far slower and still produced malformed repetitive outputs on direct probes, narrowing the primary blocker to missing `linear_attn` (GatedDeltaNet) parity.
- Qwen3.5 serving path is now unblocked on AWS via vLLM nightly (`0.16.1rc1.dev...`) with `--language-model-only`; stable endpoint validated on `127.0.0.1:18081` (`2026-03-02`).
- Infra blocker/fix (`2026-03-02`): root disk hit `100%` and broke vLLM startup (`No usable temporary directory`); cache/venv cleanup restored ~21 GB free and launch stability.
- Phase 5 A/B/C fairness fix landed (`2026-03-02`): all arms now reuse the exact same first completion per example before awareness actions.
- Paper-loop alignment landed in the Phase 5 harness (`2026-03-02`): cloned the reference repo (`third_party/weave-logprobs-reasoning-loop`) and ported multi-signal retry triggers (`perplexity`, `max_entropy`, `low_confidence_tokens`) plus per-call uncertainty traces.
- Paper-mode smoke validation ran end-to-end on AWS Qwen3.5 nightly (`2026-03-02`): trigger reason fields and loop traces are present in artifact `phase5_awareness_realbench_qwen35-paper-smoke_20260302T191420Z.json`.
- Full Qwen3.5 paper-mode run (`r4`, `2026-03-02`) confirms integration but not uplift at current thresholds: overall `B-A=-0.046875`, `C-A=0.0`, with extra latency from retries.
- Adaptive uncertainty fix (`2026-03-02`) reduced over-triggering and improved tradeoffs on Qwen3.5:
  - `r5` (`...r5-adaptive...`): `B-A=-0.015625`, `C-A=0.0`, with substantially lower latency overhead than `r4`.
  - a stricter `r6` variant reached `B-A=0.0` but regressed `C-A` (`-0.03125`), so the `r5` adaptive defaults remain preferred.
- Qwen3.5 Phase 5 run (`r3`) after the fairness fix shows no awareness regressions (all `B-A` and `C-A` deltas are `0.0`), but no quality uplift yet.
- Decode-stop semantics are now aligned to end-marker stopping (not `im_start`), with token-level control-fragment filtering default-on and sanitize still opt-in (`2026-03-02`); AWS qwen05 probes no longer emit the prior `"<|im"` leak.
- Tokenizer encode parity fix landed for chat templates (`2026-03-02`): `<|...|>` control tokens are now emitted as atomic special tokens in the BPE path instead of punctuation fragments.
- HTTP fallback behavior fix landed (`2026-03-02`): when inference succeeds but content is empty, the API now returns empty assistant content instead of synthetic route-classifier text.
- qwen05 deterministic MCQ token-0 stop parity gap is now resolved (`2026-03-02`) via Qwen default system preamble injection for user-only chats in the runtime HTTP template build.
- Post-fix qwen05 external-cold validation is complete (`2026-03-02`):
  - runtime now returns non-empty completions on the prior failing path (`external_cold_qwen05_templatefix_20260302T154019Z.json`),
  - TTFT remains strongly ahead of vLLM in this profile (`1.703 ms` vs `49.759 ms`).
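The atomic special-token encode fix above amounts to splitting out registered control tokens before the ordinary BPE pass. This is a toy illustration under assumed token ids and an assumed fallback encoder; the real tokenizer's tables and ids differ.

```python
import re

# Hypothetical special-token table; real vocabularies assign different ids.
SPECIALS = {"<|im_start|>": 1, "<|im_end|>": 2}

def encode(text, bpe_fallback):
    """Encode text, keeping registered <|...|> control tokens atomic.

    Specials are split out first so they map to single ids; only the
    remaining spans go through the BPE path. Without this step, BPE
    fragments the markers into punctuation pieces and decode-stop
    matching on them breaks.
    """
    pattern = "(" + "|".join(re.escape(s) for s in SPECIALS) + ")"
    ids = []
    for part in re.split(pattern, text):
        if part in SPECIALS:
            ids.append(SPECIALS[part])      # atomic special token
        elif part:
            ids.extend(bpe_fallback(part))  # ordinary BPE path
    return ids
```

The capturing group in `re.split` is what keeps the delimiters in the output so they can be mapped to their ids.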
- Phase 5 real-benchmark first canonical diagnostic pack is now complete (`2026-03-01`, `r5`): after runtime prompt/tokenizer fixes, `gpqa_diamond` and `ifeval` improved materially (`A=0.500` and `A=0.5625` respectively), while `gsm8k`/`aime25` remain at `0.0` in this setup.
- Phase 5 matched-depth/matched-sample rerun on `qwen` after this fix (`r9`, `2026-03-02`) is now complete:
  - `gpqa_diamond` dropped (`A 0.500 -> 0.125`) vs `r5`,
  - `gsm8k` recovered materially (`A 0.000 -> 0.625`, `C 0.000 -> 0.750`),
  - `aime25` remains weak but Arm C is non-zero (`0.125`),
  - current interpretation stays mixed-by-task (not a universal quality win yet).
- Tokenizer/runtime root-cause fixes landed for Phase 5 quality debugging: message aggregation (`system+user`), prompt-cap default (`32 -> 256`), BPE merges, `added_tokens` load, and UTF-8 JSON escape handling.
- Qwen-template auto A/B run (`r6`, `2026-03-01`) regressed both quality and latency versus `r5`; the template path remains opt-in and non-canonical.
- Phase 5 HF-reference parity is now complete on the same sampled set (`phase5_hf_reference_qwen_r5_20260301T1900Z.json`): runtime is higher on GPQA, slightly lower on IFEval, and tied at `0.0` on GSM8K/AIME (so math-task zeros are not runtime-only breakage in this setup).
- Real-benchmark awareness A/B/C harness is now implemented (`2026-02-28`) for `gpqa_diamond`, `ifeval`, `gsm8k`, and `aime25` (`scripts/phase5_awareness_realbench.py` + run wrapper); the first canonical diagnostic run pack is now published (`r5`).
- Higher-N same-window runtime-vLLM rerun (`AB5`, `2026-02-28`) keeps runtime ahead on full request path (`1184.812 ms` vs `1318.675 ms`, `vLLM/runtime=1.113x`) and cold-total first response (`5.810x` ratio).
- Post-AB5 full-depth gate sweep (`AB2` + delayed-Lt `AB3`, `2026-02-28`) did not unlock a new canonical toggle: delayed-Lt failed mixed-load confirmation and `proj_fast` remained mixed/noise.
- Tuned delayed-Lt slow-gate rescue (`AB2`, `2026-02-28`) also stayed non-promotable: warm remained slightly positive but mixed stayed near-flat with a p99 regression, so delayed-Lt is still non-canonical.
- FFN-proj mixed-input fallback patch (`2026-02-28`) removes repeated failed batched2 GEMM attempts under forced-Lt stress, but the canonical re-gate still leaves `f32_input` non-promotable on the default path.
- Full-depth qwen-focused rerun on the clean inference path (`pool=16384`, no fallback) showed a positive `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1` signal, but the full foundation gate later rejected global promotion (canonical stays default-off).
- Internal routing beats external routing on matched benchmark tasks.
- Cold start bottlenecks were decomposed stage-by-stage; `model_tensor_index_build` is no longer a dominant stage.
- Remaining cold cost is now concentrated mostly in Qwen `decoder_tensor_upload`.
- External cold-start comparison (runtime vs PyTorch vs vLLM vs Ollama) now has a canonical G5 artifact with explicit request-path vs total-cold interpretation.
- After decoder loop and sampling fixes, parity-corrected 48-token request path now beats vLLM on TTFT and full latency in latest G5 run.
- First routing failure-amplification stress profile now shows external retry/timeout chains increase both latency and error rate vs internal.
- Routing matrix expansion on G5 confirms this trend across 6 profiles (baseline + escalating stress).
- Cross-host routing pilot (local client + SSH tunnel to G5 runtime/controller) now reproduces external-path degradation under stress.
- Split-host routing matrix (CPU router host + GPU runtime host) is now complete as canonical Track B evidence.
- Qwen cold upload now has a direct on/off ablation for GPU-side BF16/F16 conversion, showing large cold-path reduction.
- External-cold runtime-only rerun confirms the same fix improves startup+cold-total, not just first-hit request latency.
- Runtime-vLLM external-cold repeatability rerun now confirms the same direction with restored vLLM environment.
- External-cold all-backend repeatability (`runtime + PyTorch + vLLM + Ollama`) is now complete after the GPU-convert fix.
- Runtime host-prefetch cold fix now removes the intermittent preload upload outlier while preserving request-path TTFT/full latency.
- Staged H2D upload (`TRENI_TENSOR_H2D_STAGING`) is now benchmarked with a chunk-size follow-up and is currently regressed on G5, so the path is parked opt-in/default-off.
- Non-staging H2D chunk-size tuning (`TRENI_TENSOR_H2D_CHUNK_MB=0/64/128`) was initially near-neutral on this profile; later `2026-02-28` full-depth AB3 reruns promoted default `0` (see the newer entry).
- Host page-touch pre-fault upload path (`TRENI_TENSOR_HOST_TOUCH`) is now implemented and benchmarked; it shifts time from H2D to prefetch and regresses request latency in this profile, so it remains opt-in/default-off.
- Upload sync diagnostics now isolate cold upload composition: conversion is measurable when synchronized, but H2D transfer remains the dominant stage.
- Synchronized host-register diagnostics now confirm no meaningful transfer benefit on this profile, so that lane is currently deprioritized.
- Decoder logits u16 mixed-precision path is now implemented/benchmarked; despite slight cold-upload reduction, request-path latency regresses and the lane remains parked.
- Tensor-cache hash lookup lane (`TRENI_TENSOR_CACHE_HASH`) is now implemented/benchmarked and remains near-neutral in this profile with a slight warm `p99` regression, so it stays opt-in/default-off.
- Sampler direct-store lane (`TRENI_SAMPLE_DIRECT_STORE`) is now implemented/benchmarked and regresses warm request latency in this profile, so it stays opt-in/default-off.
- Decoder direct-out residual lane (`TRENI_DECODER_DIRECT_OUT_HIDDEN`) initially regressed on warm-profile A/B (`2026-02-24`), but was later revalidated and promoted for the current full-depth lane (`2026-02-27` late cycle).
- Multi-head seq1 attention lane (`TRENI_ATTN_SEQ1_USE_MULTIHEAD`) is now implemented/benchmarked and shows clear wins on qwen and bart request paths; it is now default-on with a bounded max-kv guard.
- External-cold repeatability after the seq1 multi-head default promotion is now complete (`2026-02-24`): runtime retained large margins vs PyTorch and vLLM while improving its own TTFT/full/cold-total vs the prior host-prefetch baseline.
- First step0 softmax/PV exp-reuse patch is now complete (`2026-02-24`) and validated on the same external-cold 3-run set: runtime remained ahead with small additional gains (sub-ms on full/cold-total).
- Second step0 shared-probability follow-up was tested on the same 3-run set and did not beat exp-reuse; it was reverted to keep the better path.
- Non-step0 decode-stage profiling is now wired (`TRENI_DECODE_STAGE_PROFILE`), and the first G5 run (`2026-02-25`) shows `decoder_stepN_logits_sample` is the dominant decode stage on qwen.
- Uncertainty capture ablation on the same run profile (`TRENI_DEMO_CAPTURE_UNCERTAINTY=1/0`) shows a measurable but secondary effect (`~6.5 ms` full-request delta at 64 tokens), confirming the main remaining hotspot is still logits+sample compute itself.
- Additional decode split profiling (`2026-02-25`) now isolates `decoder_stepN_logits_proj` from sampling; logits projection remains dominant (`~2.458 ms`), and three immediate optimization probes were near-neutral/regressed, so those code paths were reverted.
- Full-depth qwen check (`--layers 36`, `--pool-mb 16384`) is now explicitly validated; in this mode `decoder_stepN_layers` dominates, and runtime-vLLM results must be interpreted separately from the fast `--layers 2` profile.
- Runtime-vLLM cold rerun on the same host/profile (`2026-02-25`) still shows a clear runtime lead (TTFT/full/cold-total).
- Full-depth FFN u16 weight path (`TRENI_DECODER_FFN_U16_PATH=1`) now shows a material runtime uplift (`TTFT -8.8 ms`, `full -328 ms`, `cold-total full -1329 ms`) but still does not close the full request-path gap to vLLM in the latest G5 A/B.
- Full-depth `ATTN+FFN+LOGITS` u16 path now further improves runtime means (`TTFT -10.8 ms`, `full -372 ms`, `cold-total full -1374 ms` vs baseline), but request full is still slower than vLLM (`~1.365x` ratio in the latest 3-seed set).
- Post-rebuild full-depth sanity reruns (`2026-02-26`) remain aligned with the same residual-fused baseline (`~1720 ms` request full), confirming no hidden regression from recent code/instrumentation changes.
- Full-depth FFN sub-stage split (`2026-02-26`) now shows `ffn_proj` is dominated by gate/up GEMMs (`~0.101 + ~0.099 ms`) while cast is minor (`~0.005 ms`); a batched gate+up trial regressed and was reverted.
- Full-depth attention qkv fused-alias path (`TRENI_DECODER_ATTN_U16_QKV_FUSED`) is now implemented and default-on in this lane (`2026-02-26`), with 3-seed gains in runtime-only (`full -5.869 ms`) and the runtime-vLLM matrix (runtime `full -6.542 ms`).
- Full-depth FFN activation-to-u16 fused path (`TRENI_DECODER_FFN_ACT_U16_FUSED`) is now implemented and default-on in this lane (`2026-02-26`), with 3-seed gains in runtime-only (`full -10.713 ms`, `cold_full -10.696 ms`) and an improved runtime-vLLM full ratio (`1.3208x -> 1.3012x`), while strict parity remains clean (`checked=3`, `failed=0`).
- Full-depth follow-up probe cycle (`2026-02-27`) closed additional speculative lanes:
  - `TRENI_DECODER_FFN_PROJ_U16_FUSED=1` regressed slightly but consistently in both runtime-only and runtime-vLLM 3-seed sets.
  - `TRENI_LINEAR_U16_FAST_COMPUTE=1` was near-neutral/slightly regressed in the initial runtime-only 3-seed A/B (later superseded by `2026-02-28` AB5 promotion evidence).
  - `TRENI_LINEAR_LT_WORKSPACE_MB=64` and `TRENI_LINEAR_USE_LT=0` both regressed materially; canonical remains Lt on with zero workspace.
- Linear Lt runtime path is now shape-failure-scoped (no global disable on first Lt miss); 3-seed runtime-only and runtime-vLLM checks on the full-depth profile were near-neutral (`~0.05%` full-latency movement), so this is a robustness fix, not a performance unlock.
- Full-depth FFN projection batched2 lane (`TRENI_DECODER_FFN_PROJ_U16_BATCHED2`) is now implemented, benchmarked, and promoted default-on (`2026-02-27`):
  - runtime-only 3-seed: `TTFT 15.189 -> 15.018 ms`, `full 1702.190 -> 1689.991 ms`, `cold_full 4708.109 -> 4696.805 ms`.
  - runtime-vLLM 3-seed (runtime leg): `TTFT 15.207 -> 15.032 ms`, `full 1704.091 -> 1691.116 ms`, `cold_full 4710.111 -> 4697.207 ms`.
  - stage profile corroboration (`off` vs `on`): `decoder_step_profile_ffn_proj_mean 0.205 -> 0.196 ms/layer`, `decoder_stepN_layers_mean 19.140 -> 18.447 ms`.
  - strict parity remains clean in explicit-on and default-on reports (`checked=3`, `failed=0`).
- Full-depth direct-out hidden lane (`TRENI_DECODER_DIRECT_OUT_HIDDEN`) is now promoted default-on for this profile (`2026-02-27`, late cycle):
  - runtime-only 3-seed: `full 1690.855 -> 1684.908 ms`, `infer 1668.381 -> 1662.753 ms`.
  - strict parity remains clean (`week3_parity_report_directouthidden_default_20260227T184738Z.json`, `checked=3`, `failed=0`).
- Full-depth fused qkv split+bias lane (`TRENI_DECODER_QKV_SPLIT_BIAS_FUSED`) is now implemented and promoted default-on (`2026-02-27`, late cycle):
  - runtime-only 3-seed: `TTFT 14.951 -> 14.687 ms`, `full 1684.135 -> 1663.776 ms`, `cold_full 4690.132 -> 4669.847 ms`, `infer 1662.833 -> 1641.322 ms`.
  - strict parity remains clean (`week3_parity_report_qkvsplitbias_default_20260227T190739Z.json`, `checked=3`, `failed=0`).
- External-cold harness now captures completion-length signals and supports fixed-token vLLM fairness (`ignore_eos`, streamed usage capture):
  - the latest fixed-length runtime-vLLM set confirms matched `completion_tokens=64` for both sides.
  - results: runtime `TTFT=14.685 ms`, `full=1662.478 ms`; vLLM `TTFT=50.272 ms`, `full=1293.215 ms`.
  - interpretation: runtime keeps a strong TTFT advantage, but request full still trails in this full-depth configuration.
- Logits-only fast-compute hook follow-up (`TRENI_DECODER_LOGITS_U16_FAST_COMPUTE`, `2026-02-27` night) is now complete:
  - runtime-only 3-seed means: off `full=1661.945 ms` vs on `full=1662.713 ms` (no win; slight regression).
  - strict parity after hook integration remained clean (`week3_parity_report_logitsfast_hook_20260227T193756Z.json`, `checked=3`, `failed=0`).
  - decision: keep this knob disabled in the canonical full-depth lane.
- U16 tensor-cache unlock (`TRENI_TENSOR_CACHE_U16`, default-on) is now complete and claim-safe:
  - runtime-only 3-seed A/B (`off/on`) shows a large request-path drop: `full 1661.982 -> 1189.452 ms` (`-472.529 ms`), `infer 1640.118 -> 1168.883 ms`.
  - same-window runtime-vLLM A/B (`off/on`, 2 seeds each) flips the request-full ordering:
    - off: runtime `1663.314 ms` vs vLLM `1325.189 ms` (runtime slower)
    - on: runtime `1192.145 ms` vs vLLM `1290.816 ms` (runtime faster)
  - log mechanism check confirms the measured-request upload collapse:
    - off: `decoder_tensor_upload ~476 ms`, `decoder_tensor_h2d ~468 ms`
    - on: `decoder_tensor_upload ~5 ms`, `decoder_tensor_h2d 0 ms`
  - strict parity remains clean on the final default-on build (`week3_parity_report_u16cache_toggle_default_20260227T200652Z.json`, `checked=3`, `failed=0`).
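The mechanism behind the upload collapse above can be modeled as a keyed device cache: once a converted tensor is resident, repeat requests skip the host-to-device transfer entirely. This dict-based sketch is illustrative only and says nothing about the runtime's actual cache design.

```python
class TensorCache:
    """Toy model of a device-side tensor cache.

    upload_fn stands in for the expensive conversion + H2D transfer;
    after the first request for a tensor, lookups are hits and the
    measured-request upload cost goes to ~zero, which is the effect
    seen in the off/on decoder_tensor_upload numbers.
    """

    def __init__(self, upload_fn):
        self._upload = upload_fn
        self._device = {}
        self.upload_count = 0

    def get(self, name, host_tensor):
        if name not in self._device:
            self._device[name] = self._upload(host_tensor)
            self.upload_count += 1
        return self._device[name]
```

The strict-parity requirement is what makes this promotion claim-safe: cached u16 tensors must produce byte-identical checked outputs before the default flips on.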
- Late-night FFN retest cycle (`2026-02-27`) is complete with no new canonical promotion:
  - `TRENI_LINEAR_BATCHED2_USE_LT=1` regressed materially in runtime-only full-depth A/B (`full +12.469 ms`).
  - higher-N `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=1` + `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1` remained near-noise (`full -0.198 ms`).
  - fused-path bias-deferral expansion for `TRENI_DECODER_FFN_PROJ_U16_FUSED=1` produced only near-noise movement (`full -0.383 ms`).
  - consolidated artifact: `external_cold_layers36_ffn_followup_summary_20260227T223458Z`.
- Fast-profile logits fast-compute AB8 rerun (`2026-02-28`, `--layers 2`) remains near-noise (`full -0.299 ms`), with the stage profile unchanged (`decoder_stepN_logits_proj_mean ~1.261 ms`), so no promotion from this lane.
- Mixed-load repeatability rerun (`2026-02-28`, canonical lane, `3x120` requests) is stable: mean `122.247 ms`, p95 `198.518 ms`, p99 `199.608 ms`.
- Strict parity follow-up on the latest patched build (`2026-02-28`) passed (`checked=3`, `failed=0`).
- Runtime benchmark stage parser fix (`2026-02-28`): `phase2_runtime_benchmark.py` now correctly parses decimal `timing stage=... ms=...` values (the previous regex truncated to integer prefixes). Request-level TTFT/infer/full metrics were unaffected; stage telemetry is now reliable for hotspot ranking.
- Full-depth parser-fixed profile reruns (`cold_profile_qwen_layers36_fixparse_20260228T011037Z`, `warm_profile_qwen_layers36_fixparse_20260228T011037Z`) reconfirm FFN dominance in layer compute (`ffn_proj ~0.366 ms/layer`, `ffn_down_resid ~0.190 ms/layer`, `step_total ~0.705 ms/layer`).
- Full-depth warm AB3 for `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0/1` (`ffn_fast_compute_ab3_20260228T011146Z_summary`) regressed slightly (`request +0.317 ms`, `infer +0.305 ms`) with no stage win in that cycle; the lane stayed non-canonical until later clean-path reruns.
- Strided batched-Lt fallback for batched2 FFN was implemented and tested (`batched2lt_strided_ab3_20260228T011651Z_summary`); warm AB3 was near-noise and runtime-only external-cold sanity was slightly worse, so the path remains opt-in and not promoted.
- FFN gate/up dual-bias fused add (
`TRENI_DECODER_FFN_BIAS_PAIR_FUSED`) now has a full-depth A/B set (`2026-02-28`):
  - warm AB3 showed a small request-path improvement (`request -0.229 ms`, `p99 -0.390 ms`, `infer -0.090 ms`) with near-flat TTFT (`+0.009 ms`);
  - cold follow-up (3 seeds each) regressed slightly (`full +1.928 ms`, `infer +1.875 ms`), so the path is currently non-canonical.
seq1split-GEMM lane (TRENI_LINEAR_BATCHED2_SPLIT_SEQ1) now has a full-depth warm/cold AB3 set (2026-02-28):- warm AB3 was near-noise/slightly worse (
request +0.014 ms,infer +0.105 ms,p99 +0.124 ms); - cold AB3 improved slightly (
full -2.070 ms,infer -2.002 ms,ttft -0.021 ms); - decision: keep opt-in only, not canonical, because warm path does not improve.
- warm AB3 was near-noise/slightly worse (
- Batched2 dup-input strided lane (
TRENI_LINEAR_BATCHED2_DUP_INPUT) now has a full-depth warm/cold AB3 set (2026-02-28):- warm AB3 regressed slightly on means (
request +0.317 ms,infer +0.293 ms,ttft +0.009 ms) with minor p99 improvement (-0.208 ms); - cold AB3 also regressed (
full +1.307 ms,infer +1.388 ms,ttft +0.010 ms); - decision: keep opt-in and non-canonical.
- warm AB3 regressed slightly on means (
- Batched2 dup-input v2 kernel swap probe (
2026-02-28) was run as a warm AB2 gate set (batched2_dupinput_v2warm_ab2_20260228T032741Z) and regressed all warm means (request +0.438 ms,infer +0.381 ms,p99 +0.217 ms); probe was rejected and reverted before AB3 expansion. - FFN projection fused-lane gate rerun (
TRENI_DECODER_FFN_PROJ_U16_FUSED=0/1, warm AB2,2026-02-28) remained near-flat/slightly worse on means (request +0.149 ms,infer +0.173 ms); no AB3 expansion. - FFN projection batched2 f32-input gate rerun (
TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0/1, warm AB2,2026-02-28) regressed (request +0.236 ms,infer +0.248 ms,p99 +0.512 ms); no AB3 expansion. - Linear u16 compute16f gate probe (
2026-02-28, warm AB2,TRENI_LINEAR_U16_FORCE_COMPUTE_16F=0/1) regressed (request +0.210 ms,infer +0.240 ms,p99 +0.594 ms) and was rejected/reverted; no AB3 expansion. - Explicit-u16 full-depth warm rerun (
qwen,layers=36,2026-02-28) confirms active decode split in this lane:decoder_step_profile_total_mean ~0.402 ms,ffn_proj ~0.196 ms,ffn_down_resid ~0.099 ms. - Experimental FFN gate/up pair-pack lane (
TRENI_DECODER_FFN_PAIR_PACK_U16,ffn_pair_pack_gate_ab2_20260228T040616Z) now has AB3 results:- warm AB3 delta (
on-off): request-0.423 ms, infer-0.442 ms, p99-0.673 ms; - both off/on runs already had contiguous gate/up pair active, so this is not a causal promotion signal.
- decision: keep lane default-off and experimental.
- warm AB3 delta (
- Batched2 Lt rerun on the explicit-u16 lane (`TRENI_LINEAR_BATCHED2_USE_LT`) is now split by warm/cold evidence:
  - warm AB3 (`batched2_use_lt_u16lane_gate_ab2_20260228T041041Z`): request `-0.313 ms`, infer `-0.468 ms`, p99 `-0.511 ms`;
  - cold AB3 (`batched2_use_lt_u16lane_cold_ab2_20260228T041359Z`): full `+1.165 ms`, infer `+1.424 ms`.
  - fixed-on decision: keep non-canonical.
- Adaptive delayed batched2 Lt policy (`TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS`) has warm/cold wins but is not canonical (`2026-02-28`):
  - `5000ms` AB3 (`batched2_lt_enable_after_ms5000_warm_ab3_20260228T104525Z`, `batched2_lt_enable_after_ms5000_cold_ab3_20260228T104712Z`) stayed mixed (warm gain, cold full `+0.422 ms`).
  - `10000ms` AB3 (`batched2_lt_enable_after_ms10000_warm_ab3_20260228T105028Z`, `batched2_lt_enable_after_ms10000_cold_ab3_20260228T105213Z`) is net-positive:
    - warm delta: request `-0.363 ms`, infer `-0.326 ms`, p99 `-0.696 ms`;
    - cold delta: startup `-4.307 ms`, full `-0.635 ms`, infer `-0.347 ms`, TTFT `-0.070 ms`.
  - strict parity pass: `week3_parity_report_batched2_lt_delay10000_20260228T105329Z.json` (`checked=3`, `failed=0`).
  - default-path strict parity smoke (without explicit batched2 Lt env overrides) also passed: `week3_parity_report_batched2_lt_defaultdelay_20260228T110825Z.json`.
  - same-window mixed-load A/B (`mixed_load_defaultdelay_onoff_ab3_20260228T115010Z.json`) regressed with delayed-on (`mean +0.846 ms`, `p95 +1.627 ms`, `p99 +0.679 ms`).
  - parser defaults remain off: `TRENI_LINEAR_BATCHED2_USE_LT=0` and `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0`.
  - post-revert strict default-path parity pass: `week3_parity_report_postrevert_defaults_20260228T115543Z.json`.
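The delayed-enable idea behind `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS` can be sketched as a clock-gated toggle; this is a minimal illustration under assumed class and method names, and a `0` value keeping the path off mirrors the parser default above.

```python
import time

class DelayedToggle:
    """Keep a fast path off during startup, enable it after a fixed delay.

    Models the enable-after-ms policy: cold requests run without the Lt
    path (which regressed cold startup when fixed-on), while long-lived
    warm processes eventually switch it on.
    """

    def __init__(self, enable_after_ms):
        self.enable_after_ms = enable_after_ms
        self.start = time.monotonic()

    def lt_enabled(self, now=None):
        if self.enable_after_ms <= 0:
            return False  # default: delayed path disabled entirely
        now = time.monotonic() if now is None else now
        return (now - self.start) * 1000.0 >= self.enable_after_ms
```

The mixed-load regression above shows the limitation of any purely time-based gate: it cannot distinguish a warm steady state from interleaved cold-ish work.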
- Parser-default foundation rerun pack (`foundation_defaultdelay_pack_20260228T114315Z`) is now published:
  - warm AB3 means: request `147.258 ms`, p99 `247.617 ms`, infer `128.450 ms`, TTFT `16.999 ms`;
  - cold AB3 means: startup `425.532 ms`, full `598.787 ms`, infer `580.173 ms`, TTFT `12.210 ms`;
  - mixed repeatability stayed worse than the prior canonical summary (`mixed_load_repeatability_compare_defaultdelay_vs_prev_20260228T114748Z.json`: mean `+2.841 ms`, p95 `+5.587 ms`, p99 `+5.140 ms`), which aligns with keeping delayed-on non-canonical.
- Experimental FFN batched2 Lt prewarm path (`TRENI_DECODER_FFN_BATCHED2_LT_PREWARM`) is now implemented and benchmarked:
  - fixed-Lt warm AB2 (`batched2_lt_prewarm_warm_ab2_20260228T042453Z`): request `-0.328 ms`, infer `-0.394 ms`;
  - fixed-Lt cold AB3 (`batched2_lt_prewarm_cold_ab3_20260228T042649Z`): full `-1.497 ms`, infer `-1.406 ms`.
- Direct same-window combo A/B (`lt=0,prewarm=0` vs `lt=1,prewarm=1`) remains mixed:
  - the combined summary (`batched2_lt_prewarm_combo_summary_20260228T042733Z.json`) shows a warm AB3 regression (`request +0.198 ms`, `infer +0.178 ms`) despite a cold AB3 improvement (`full -1.099 ms`, `infer -0.819 ms`).
  - decision: keep the prewarm path default-off and non-canonical.
- `TRENI_DECODER_FFN_DOWN_U16_FAST_COMPUTE` now has canonical full-depth evidence and is promoted default-on (`2026-02-28`):
  - warm AB3 (`ffn_down_fast_compute_gate_ab3_20260228T044546Z`): request `-0.565 ms`, infer `-0.566 ms`, p99 `-1.405 ms`, TTFT `-0.030 ms`.
  - cold AB3 (`ffn_down_fast_compute_cold_ab3_20260228T044753Z`): startup `-8.405 ms`, full `-0.351 ms`, infer `-0.406 ms`, TTFT `-0.028 ms`.
  - strict parity pass (`week3_parity_report_ffn_down_fast_20260228T044846Z.json`): `checked=3`, `failed=0`.
- Post-promotion FFN retest matrix (`2026-02-28`) is complete and did not produce a second promotion:
  - the new structural stacked-GEMM lane (`TRENI_LINEAR_BATCHED2_STACKED_SEQ1`) regressed in warm AB3 (`request +1.259 ms`, `infer +1.229 ms`, `p99 +2.830 ms`) and stayed near-flat/slightly worse in cold AB3 (`full +0.030 ms`), so it remains experimental/default-off.
  - `TRENI_LINEAR_BATCHED2_SPLIT_SEQ1` AB3 regressed warm (`request +0.964 ms`) and cold (`full +1.496 ms`).
  - `TRENI_LINEAR_BATCHED2_USE_LT` fixed-on AB3 improved warm (`request -0.855 ms`) but still regressed cold startup/full (`startup +10.474 ms`, `full +0.330 ms`); delayed-on improved warm/cold but still regressed mixed-load, so the lane remains non-canonical.
  - the combo `lt=1 + prewarm=1` gave AB3 gains but failed AB5 cold confirmation (`startup +3.199 ms`, `full +1.152 ms`).
  - `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1` remained non-canonical at that stage.
- Follow-up rerun cycle (`2026-02-28` late) promoted `TRENI_LINEAR_U16_FAST_COMPUTE` after higher-N validation:
  - warm+mixed AB5 (`linearfast_ab5_20260228T124736Z/summary_ab5.json`) was positive in both modes:
    - warm `on-off`: request `-0.139 ms`, p95 `-0.128 ms`, p99 `-0.009 ms`;
    - mixed `on-off`: request `-0.139 ms`, p95 `-0.156 ms`, p99 `-0.208 ms`.
  - cold AB3 (`linearfast_cold_ab3_20260228T124510Z/summary_ab3.json`) stayed near-flat on full latency (`+0.302 ms`) with better startup (`-4.207 ms`) and TTFT (`-0.019 ms`).
  - strict parity passed (`week3_parity_report_linearfast_20260228T124557Z.json`, `checked=3`, `failed=0`).
  - post-default strict parity smoke also passed (`week3_parity_report_post_linearfast_default_20260228T125804Z.json`).
  - same-window default-vs-forced-off sanity (`linearfast_default_sanity_20260228T125957Z`) is directionally positive on the mixed request path (`mean -0.603 ms`, `p95 -0.984 ms`, `p99 +0.029 ms`).
  - the runtime parser default is now `TRENI_LINEAR_U16_FAST_COMPUTE=1`.
- Full-depth FFN projection fast-compute rerun (`2026-02-28`, late 8) completed on the clean path (`TRENI_POOL_MB=16384`, classifier-disabled HTTP lane):
  - profiled AB3 (`ffnprojfast_fullstep_ab3_20260228T160255Z/summary_ab3.json`), `on-off`: request `-0.370 ms`, infer `-0.348 ms`, p99 `-0.533 ms`, TTFT `-0.045 ms`.
  - non-profiled warm AB3 (`ffnprojfast_fullwarm_ab3_20260228T160358Z/summary_ab3.json`), `on-off`: request `-0.249 ms`, infer `-0.225 ms`, p99 `-0.328 ms`, TTFT `-0.015 ms`.
  - strict parity passed with the explicit candidate env and on the temporary promoted build:
    - `week3_parity_report_ffnprojfast_candidate_20260228T160459Z.json`
    - `week3_parity_report_ffnprojfast_default_20260228T160639Z.json`
  - post-promotion sanity AB3 (`ffnprojfast_default_sanity_ab3_20260228T160557Z/summary_ab3.json`) stayed near-flat and directionally positive on means (`default-force_off` request `-0.094 ms`, infer `-0.093 ms`), with a tiny p99 increase (`+0.057 ms`).
  - interim decision in that cycle: the candidate looked positive and moved to full foundation validation.
- Full foundation validation then rejected global promotion (`2026-02-28`, late 9):
  - the foundation pack (`foundation_ffnprojfastdefault_pack_20260228T194204Z/summary_ab3.json`) was slower versus the prior canonical in all modes (warm/cold/mixed means).
  - same-window foundation gate AB2 (`foundation_ffnprojfast_gate_ab2_20260228T195240Z/summary_gate_ab2.json`): `default-force_off` warm request `+0.489 ms`, cold full `+0.746 ms`, mixed mean `+0.004 ms` (tails improved).
  - final decision: keep the parser canonical default `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0` and retain the lane as opt-in.
- Canonical rerun on the promoted default (`2026-02-28` late 2) is now published:
  - foundation pack: `foundation_linearfastdefault_pack_20260228T134157Z` (`summary_ab3.json`).
  - versus the prior parser-default foundation (`20260228T114315Z`): warm/cold were near-flat/slightly slower, while mixed improved (`request -0.629 ms`, `p95 -1.281 ms`, `p99 -0.163 ms`).
- Same-window runtime-vLLM full-depth AB3 rerun on the updated canonical lane (`aws_speedpass_runtime_vllm_linearfastdefault_ab3_20260228T134630Z`) now shows runtime ahead on request full latency:
  - runtime `1185.186 ms` vs vLLM `1305.971 ms` (vLLM/runtime full = `1.102x`).
  - cold-total-first-response and cold-total-first-token remain dominated by vLLM process startup in this harness profile (`5.807x` and `7.648x` over runtime, respectively).
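The headline comparison above is a plain quotient of mean full-request latencies, with values above 1.0 meaning the runtime is ahead. A one-line sketch using the numbers from this rerun:

```python
# vLLM/runtime ratio convention used throughout this changelog:
# ratio > 1.0 means the runtime's mean latency is lower (runtime ahead).
def ratio(vllm_ms: float, runtime_ms: float) -> float:
    return round(vllm_ms / runtime_ms, 3)

runtime_full_ms = 1185.186
vllm_full_ms = 1305.971
print(ratio(vllm_full_ms, runtime_full_ms))  # -> 1.102
```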
- Batched2-Lt fast-fallback short-circuit experiment (2026-02-28) was evaluated and reverted:
  - isolation AB3 (`fastfallback_isolation_ab3_20260228T140122Z/summary_ab3.json`) showed a warm regression (request `+1.155 ms`, p95 `+2.124 ms`, p99 `+1.504 ms`) and mixed near-flat to slightly worse (mean `+0.144 ms`, p95 `+0.569 ms`), despite a cold full improvement (`-0.846 ms`).
  - decision: keep reverted (non-canonical).
  - post-revert strict parity remains clean (`week3_parity_report_post_fastfallback_revert_20260228T140626Z.json`).
- `TRENI_TENSOR_H2D_CHUNK_MB` was re-tested on the current full-depth canonical lane (2026-02-28) and the default was promoted to `0` (no chunking):
  - cold AB3 (`h2d_chunk_cold_ab3_20260228T142114Z/summary_ab3.json`), chunk0 - chunk64: startup `-4.022 ms`, full `-2.562 ms`, infer `-2.542 ms`, TTFT `-0.060 ms`; `decoder_tensor_h2d` `-3.347 ms`.
  - warm+mixed AB3 (`h2d_chunk_warm_mixed_ab3_20260228T142258Z/summary_ab3.json`), chunk0 - chunk64: warm request `-0.442 ms`; mixed request `-0.044 ms`.
  - strict parity after promotion passed (`week3_parity_report_h2dchunk0_default_20260228T142805Z.json`).
  - single-run sanity (`h2d_chunk_default_vs64_sanity_20260228T142845Z`) showed small mixed sensitivity (default-force64 mean `+0.340 ms`), so this remains on repeatability watch.
- Higher-N same-window runtime-vLLM rerun (AB5, updated defaults, 2026-02-28) now tightens the full-depth claim:
  - run root: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z`
  - summary: `summary_ab5.json` / `summary_ab5.md`
  - means:
    - runtime: full `1184.812 ms`, TTFT `14.640 ms`, cold-total full `4190.848 ms`
    - vLLM: full `1318.675 ms`, TTFT `50.309 ms`, cold-total full `24350.818 ms`
  - ratios (vLLM/runtime): full `1.113x`, TTFT `3.436x`, cold-total full `5.810x`.
  - compare vs prior AB3 (`compare_vs_prev_linearfastdefault_ab3.json`/`.md`): runtime full improved slightly (`-0.375 ms`) and the full-ratio direction strengthened (`1.102x -> 1.113x`).
- Post-AB5 full-depth gate sweep on current defaults (2026-02-28) is now complete and does not add a new canonical lane:
  - gate artifact: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_gate_newdefaults_20260228T150709Z/summary_gate_ab2.json`
  - delayed-Lt was directionally positive in AB2 and advanced to AB3 (warm request `-0.384 ms`, mixed request `-0.256 ms`), while `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1` remained mixed/noise (warm p99 `+0.129 ms`, mixed p99 `+0.022 ms`).
  - delayed-Lt AB3 confirmation artifact: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_delayedlt_ab3_20260228T151322Z/summary_ab3.json`
  - AB3 result split:
    - warm on-off: request `-0.330 ms`, infer `-0.270 ms`, p99 `-0.098 ms`;
    - mixed on-off: request `+0.173 ms`, infer `+0.191 ms`, p99 `+0.291 ms`.
  - decision at that stage: keep delayed-Lt non-canonical on defaults (`TRENI_LINEAR_BATCHED2_USE_LT=0`, `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0`); `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0` stayed canonical.
- Full-depth stage-profile refresh (`external_cold_layers36_stageprofile_20260227T175604Z`) reconfirms the remaining decode hotspot hierarchy:
  - `decoder_stepN_layers_mean ~19.107 ms` (dominant)
  - `decoder_stepN_logits_proj_mean ~1.260 ms`
  - `decoder_stepN_sample_mean ~0.331 ms`
  - layer split still led by FFN projection (`decoder_step_profile_ffn_proj_mean ~0.204 ms`/layer).
- Phase 3 loop-capability now has a canonical G5 set (baseline+stress, 3 seeds each) with consolidated summary artifacts.
- Canonical Phase 3 shows internal loops keep 100% success while external loops lose success in tool-state adaptation and pay large hop/retry latency amplification.
- Phase 3 uncertainty ablation now has a first baseline matrix (`runs=8`) showing success drops when uncertainty-aware branching is disabled on either the internal or external path.
- Phase 3 uncertainty ablation now has 3-seed baseline+stress repeatability with a consolidated baseline-vs-stress comparison.
- Runtime now exports request uncertainty in HTTP responses, and the Phase 3 C2 harness now supports the `runtime_native` uncertainty source.
- The runtime response contract now includes unified `awareness` (route+generation) while preserving legacy `uncertainty`; the Phase 3 runtime-native client now consumes the unified payload first.
- Runtime-native C2 calibrated rerun (`calib1`) is complete; with zero-fallback probes, uncertainty-on deltas are again positive in baseline+stress.
- Phase 4 kickoff on Lambda A100/H100 is complete for Track C loops; both hardware classes preserve 100% internal success and show the same external latency amplification pattern.
- Phase 4 full Lambda reruns are now complete (A100 + H100): Phase 2 cold/hot, routing matrix, and C2 runtime-native calibrated sets are locked with raw artifacts.
- A paper-grade package is now generated from the canonical G5 + Lambda A100 + Lambda H100 sets (`benchmarks/paper_package/latest`).
- The paper package now includes manuscript-ready assets (`manuscript/captions.md`, `manuscript/claims.md`, `manuscript/figure_manifest.json`, mermaid figure specs).
- Internet multi-hop commercial routing matrices are now available (Fly.io controller/tool hops to OpenAI + OpenRouter).
- Track B commercial control set has now been rerun with fairness-hardened harness controls (interleaved order + deterministic defaults + token normalization + strict tool parity on tool tasks), narrowing claim scope to task-family-stratified statements.
- AWS G5 speedpass validation is now complete for the new kernel/cold pass: disabling per-tensor upload sync delivers the measurable gain (`~1.03x` cold full, `~1.01x` warm mean, `~1.03x` warm p99), while initial `cublasLt` was near-parity on warm/full and did not improve TTFT.
- AWS G5 TTFT-focused kernel pass is now complete: softmax-only was near-parity, then row-parallel norm kernels (`rmsnorm`/`layernorm`) delivered a clear lift (`~1.20x` cold TTFT, `~1.18x` warm mean in the best `lt1_sync0` config).
- AWS G5 TTFT follow-up is now complete: `seq_q=1` tiny attention kernels plus direct K/V cache writes further improved the TTFT/warm path and materially reduced Bart TTFT (`16.573 -> 12.842 ms`).
- Week-3 parity is now fully locked on AWS after two fixes: the parser handles interleaved stderr/stdout runtime logs, and a rebuilt parity container (`qwen+bart+minilm`) removed invalid `minilm` offsets so strict external-HF parity passes.
- Runtime now has a strict attention backend selector (`TRENI_ATTN_BACKEND`) plus an A/B harness (`custom` vs `cudnn_sdpa` proxy), and Phase 3 now supports file-backed `realistic_v1` fixtures to reduce synthetic benchmark bias.
- AWS G5 attention backend A/B rerun with reversed call order confirms near-parity between the `custom` and `cudnn_sdpa` proxy paths; the earlier large cold delta was call-order/cache bias.
- Phase 3 realistic-v1 reruns are now complete (baseline+stress, 3 seeds each) with the strong internal-loop advantage preserved; the realistic-v1 uncertainty ablation baseline+stress pair is also published.
- Attention runtime now caches backend env config once per process and includes a seq1 hybrid tuning matrix (`custom`, `qk-cublas`, `pv-cublas`, `both-cublas`) with warm/cold tradeoff data.
- AWS G5 seq1 fused-softmax/PV follow-up is now complete: the default custom request path improved again on warm and cold (`seq1_hybrid_fused_20260222T192656Z`).
- H100 fused cuDNN SDPA probe pack is now published; the current backend descriptor path still yields no viable fused SDPA engine configs (`cudnn_sdpa_h100_probe_20260222T1935Z`).
- True fused cuDNN frontend SDPA path is now integrated and validated on G5 (`attn_backend_ab_frontend_20260222T220111Z`): the warm path is near parity on fixed warmed shapes, but cold/mixed still regress due to expensive frontend plan-build misses.
- Fused frontend profiling now quantifies the miss root cause (`cudnn_frontend_profile_probe_20260222T2204Z`): plan-build misses are `~705 ms` each on A10G, while pack/execute/unpack costs are negligible.
- The frontend A/B harness now hard-fails contamination when the fused marker is absent or the runtime was compiled with `TRENI_WITH_CUDNN=0`.
- Frontend repeatability matrix is now complete (`attn_backend_frontend_matrix_20260222T221948Z`, `repeats=3` for `warm_fixed`+`mixed_churn`): custom wins all tracked metrics (`3/3` per metric) in both profiles.
- Frontend claim-strength report is now published (`attn_backend_frontend_claim_report_20260222T222958Z`) with paired delta CI95 summaries for each latency metric/profile.
- Grouped commercial root-cause report is now published (`commercial_gap_root_cause_20260222T222958Z`) and indicates current fairness splits are still parity/noise dominated at present sample sizes.
- Fused frontend miss tracing is now explicit (`TRENI_ATTN_CUDNN_FRONTEND_TRACE_MISSES`) and confirms misses are concentrated in decode-step `seq_q=1` shape growth (`seq_kv=2..10` in the probe).
- Startup multi-prompt preload mitigation (`TRENI_HTTP_PRELOAD_PROMPTS`) is now benchmarked on G5 and materially reduces mixed-churn/full-latency spikes for the fused frontend while keeping custom faster overall.
- Hybrid shape-gated frontend policy is now validated on G5 (2026-02-23): startup prebuild overhead drops from `~7.0 s` to `~2.0 s` while no-preload fused TTFT/full remain low and strict inference-valid on the fixed harness profile; the bounded-gate follow-up removes broader-shape miss cascades by routing out-of-window shapes to custom.
- Coverage-instrumented frontend reruns are now published (2026-02-23): the runtime exports per-request attention backend counters/shares, and high fused-coverage profiles show the current fused path is still slower than custom on both warm and cold request paths.
- Execution decision (2026-02-23): park cuDNN/frontend optimization and prioritize custom-kernel best-path work.
- Custom lane implementation update (2026-02-23): added a seq1 microfused attention path (`TRENI_ATTN_SEQ1_USE_MICROFUSED`) plus cached cuBLAS stream binding.
- G5 seq1 microfused A/B (2026-02-23, qwen+bart, `max_kv=64` and `16`) shows no net win vs the custom baseline; warm mean/TTFT regress while only an isolated bart p99 improves in one profile. The path remains opt-in and defaults off.
  - summary artifact: `benchmarks/phase2_runtime/seq1_microfused_ab/seq1_microfused_ab_summary_20260223T014848Z.md`.
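The paired delta CI95 summaries mentioned for the claim-strength report can be sketched as follows. Pairing is per-repeat (custom vs fused latency from the same repeat); a normal approximation (`1.96 * SE`) is used here for brevity, whereas the real report may use a t-quantile, and the sample values below are hypothetical:

```python
# Sketch: paired-delta mean with a normal-approximation 95% CI.
# Hypothetical per-repeat latencies; the actual report's method may differ.
import statistics

def paired_delta_ci95(custom: list[float], fused: list[float]):
    deltas = [f - c for c, f in zip(custom, fused)]
    mean = statistics.mean(deltas)
    se = statistics.stdev(deltas) / len(deltas) ** 0.5
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

custom_ms = [19.324, 19.218, 19.271]
fused_ms = [21.503, 21.450, 21.468]
mean, (lo, hi) = paired_delta_ci95(custom_ms, fused_ms)
print(f"fused - custom: {mean:.3f} ms, CI95 [{lo:.3f}, {hi:.3f}]")
```

A CI that excludes zero (as here, with all deltas positive) is what backs a "custom wins this metric" claim.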
- G5 stream-cache A/B (2026-02-23, qwen+bart) for `TRENI_LINEAR_STREAM_CACHE` + `TRENI_ATTN_STREAM_CACHE` is near-neutral in short runs; keep enabled by default and focus on higher-impact kernel/cold-path work.
  - summary artifact: `benchmarks/phase2_runtime/results/stream_cache_ab_summary_20260223T015222Z.md`.
- G5 registry/model-index hash A/B (2026-02-23, qwen profile) for `TRENI_REGISTRY_LOOKUP_HASH` + `TRENI_MODEL_INDEX_NAME_HASH` showed no meaningful cold/setup improvement in this run set; kept as opt-in, defaults off.
  - summary artifact: `benchmarks/phase2_runtime/results/registry_hash_ab_summary_20260223T020353Z.md`.
- Cold-start harness fix (2026-02-23): startup health polling moved to a 50 ms cadence, removing the prior ~1 s quantization from `startup_to_healthy_ms`.
- Startup-smoke A/B with high-fidelity polling (`startup_smoke_ab_hf_20260223T030059Z`) shows skipping the startup smoke is a material cold win:
  - startup-to-healthy: `488.027 -> 404.184 ms` (`-17.18%`)
  - start-to-first-response (startup + first full): `705.454 -> 622.167 ms` (`-11.81%`)
  - request-path TTFT/full are near-flat (expected; this is startup-stage, not decoder-step optimization).
  - runtime default now matches this policy (`TRENI_SKIP_STARTUP_SMOKE=1` unless explicitly set false).
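The percentage wins quoted for the startup-smoke skip are plain relative improvements of the means; a quick check with the numbers from that entry:

```python
# Relative improvement of a mean latency, as a percentage.
def pct_improvement(before_ms: float, after_ms: float) -> float:
    return round(100.0 * (before_ms - after_ms) / before_ms, 2)

print(pct_improvement(488.027, 404.184))  # startup-to-healthy -> 17.18
print(pct_improvement(705.454, 622.167))  # start-to-first-response -> 11.81
```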
- Additional custom-cold knob probes (`TRENI_TENSOR_ENV_CACHE`, `TRENI_TENSOR_H2D_CHUNK_MB`, `TRENI_TENSOR_HOST_REGISTER`) were run on G5 and were near-neutral on this profile; no new canonical promotion from those knobs.
  - consolidated artifact: `benchmarks/phase2_runtime/results/cold_path_knob_probe_20260223T0303Z.md`.
- Per-tensor upload hotspot profiling (`TRENI_TENSOR_UPLOAD_TOPK`) is now wired into the runtime, and the first qwen cold probe shows `model.embed_tokens.weight` as the dominant cold upload stage contributor (`~79.3 ms`, `~63.8%` share in that probe).
- Container-level readahead hint (`TRENI_CONTAINER_WILLNEED`) is now benchmarked on G5 and shows a modest, repeatable cold-total improvement in an 8-run A/B (`~-1.94%` start-to-first-response).
- Runtime default now enables this readahead hint (`TRENI_CONTAINER_WILLNEED=1` unless explicitly disabled).
- Combined readahead + host-register (`TRENI_CONTAINER_WILLNEED=1`, `TRENI_TENSOR_HOST_REGISTER=1`) did not add clear gain beyond the readahead-only profile on current G5 runs.
  - consolidated artifact: `benchmarks/phase2_runtime/results/cold_upload_hotspot_summary_20260223T1915Z.md`.
- Staged H2D upload follow-up (`TRENI_TENSOR_H2D_STAGING`) is now complete on G5:
  - `min64/chunk32` (8-run A/B) regressed full latency `+21.22%` and `decoder_tensor_h2d` `+38.68%`.
  - `min64/chunk128` (3-run probe) regressed further (full `+44.43%`, `decoder_tensor_h2d` `+76.92%`).
  - decision: park the staged-H2D path for now; keep it opt-in/default-off and continue cold-path work on the non-staging custom upload/H2D.
  - consolidated artifact: `benchmarks/phase2_runtime/results/h2d_staging_followup_summary_20260224T101324Z.md`.
- Non-staging H2D chunk matrix (`TRENI_TENSOR_H2D_CHUNK_MB=0/64/128`, 8 runs each) is now complete on G5 and was near-neutral across request and upload metrics.
  - decision: keep the current chunk default policy; prioritize structural upload/H2D and `decoder_step0_layers` work.
  - consolidated artifact: `benchmarks/phase2_runtime/results/h2d_chunk_matrix_summary_20260224T101730Z.md`.
- Host page-touch pre-fault A/B (`TRENI_TENSOR_HOST_TOUCH=1`, `TRENI_TENSOR_HOST_TOUCH_MIN_MB=256`, 8 runs) is now complete on G5.
  - `decoder_tensor_h2d` improved (`-31.13 ms`) but prefetch/upload increased, causing a net request regression (full `+7.73%`, infer `+8.22%`).
  - decision: keep the host-touch path opt-in/default-off and continue cold-path work on non-regressing upload changes.
  - consolidated artifact: `benchmarks/phase2_runtime/results/host_touch_ab_summary_20260224T102444Z.md`.
- Upload sync probe (`TRENI_TENSOR_UPLOAD_SYNC=0/1`, 3 runs each) now quantifies upload composition under synchronized timing.
  - conversion rises to `~6 ms` with sync enabled, but H2D remains `~118 ms` and dominant.
  - decision: keep transfer-path optimization as the primary cold-upload focus.
  - consolidated artifact: `benchmarks/phase2_runtime/results/upload_sync_probe_summary_20260224T102618Z.md`.
- Synchronized host-register probe (`TRENI_TENSOR_HOST_REGISTER=0/1`, with `TRENI_TENSOR_UPLOAD_SYNC=1`) is now complete.
  - transfer-stage metrics stayed effectively flat and the request path slightly regressed.
  - decision: deprioritize the host-register optimization lane for current cold-upload work.
  - consolidated artifact: `benchmarks/phase2_runtime/results/host_register_sync_probe_summary_20260224T102915Z.md`.
- Decoder logits u16 A/B (`TRENI_DECODER_LOGITS_U16_PATH=0/1`) is now complete with valid inference in both arms.
  - upload/setup moved slightly in the right direction, but request-path metrics regressed materially (`ttft`, `infer`, `full`).
  - a fix2 pilot follow-up after the mixed-precision path adjustment still regressed the request path, confirming the same direction.
  - decision: keep the logits-u16 path opt-in/default-off and park it for now.
  - consolidated artifact: `benchmarks/phase2_runtime/results/logits_u16_ab_fix1_summary_20260224T105532Z.md`.
- Tensor-cache hash A/B (`TRENI_TENSOR_CACHE_HASH=0/1`) is now complete (mixed + warm 3-seed follow-up).
  - warm 3-seed request deltas are near-neutral, with a slight `p99` regression when enabled (`+0.149 ms`).
  - decision: keep the tensor-cache hash path opt-in/default-off.
  - artifacts: `benchmarks/phase2_runtime/results/tensor_cache_hash_ab_20260224T113911Z/`, `benchmarks/phase2_runtime/results/tensor_cache_hash_warm3_20260224T114126Z/`
- Sampler direct-store A/B (`TRENI_SAMPLE_DIRECT_STORE=0/1`) is now complete (3-seed warm).
  - the enabled path regressed warm request metrics (mean `+0.062 ms`, p95 `+0.076 ms`, p99 `+0.143 ms`).
  - decision: keep sampler direct-store opt-in/default-off.
  - artifact: `benchmarks/phase2_runtime/results/sample_direct_store_ab_20260224T114633Z/`.
- Decoder direct-out residual A/B (`TRENI_DECODER_DIRECT_OUT_HIDDEN=0/1`) is now complete (3-seed warm).
  - the enabled path regressed warm request and infer metrics (mean `+0.540 ms`, p95 `+0.495 ms`, p99 `+0.444 ms`, infer `+0.150 ms`).
  - decision at that time: keep the decoder direct-out path opt-in/default-off.
  - superseded for the current full-depth lane by the 2026-02-27 late-cycle rerun (direct-out promoted default-on there).
  - artifact: `benchmarks/phase2_runtime/results/direct_outhidden_ab_20260224T115051Z/`.
- Consolidated summary artifact for these custom-path probes: `benchmarks/phase2_runtime/results/custom_path_probe_summary_20260224T115602Z.md`.
- Multi-head seq1 attention A/B (`TRENI_ATTN_SEQ1_USE_MULTIHEAD=0/1`) is now complete and directionally strong.
  - qwen warm (3-seed): request mean `1.041x`, p99 `1.042x`, infer `1.074x`.
  - qwen mixed (3-seed): request mean `1.036x`, p99 `1.045x`, infer `1.074x`, cold wall `1.010x`.
  - bart warm (3-seed): request mean `1.097x`, p99 `1.112x`, TTFT `1.429x`, infer `1.185x`.
  - default sanity run (no env override) remains faster than forced-off.
  - decision: promote this path to default-on (`TRENI_ATTN_SEQ1_USE_MULTIHEAD=1`, `TRENI_ATTN_SEQ1_MULTIHEAD_MAX_KV=2048`) while retaining the off-switch fallback.
  - artifacts: `benchmarks/phase2_runtime/results/seq1_multihead_ab_20260224T125127Z/`, `benchmarks/phase2_runtime/results/seq1_multihead_bart_ab_20260224T125404Z/`, `benchmarks/phase2_runtime/results/seq1_multihead_step0_probe_20260224T125508Z/`, `benchmarks/phase2_runtime/results/seq1_multihead_default_sanity_20260224T125713Z/`, `benchmarks/phase2_runtime/results/seq1_multihead_ab_summary_20260224T125619Z.md`
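The promotion call above follows a simple rule: every tracked metric must show a speedup ratio above 1.0 in every profile before a path becomes default-on. The gating function below is an illustration of that rule, not the harness's actual code; the ratio values echo the entry:

```python
# Hypothetical promotion gate: promote only if every metric ratio (off/on,
# i.e. speedup of the enabled path) clears the threshold.
def promote(ratios: dict[str, float], threshold: float = 1.0) -> bool:
    return all(r > threshold for r in ratios.values())

qwen_warm = {"request_mean": 1.041, "p99": 1.042, "infer": 1.074}
qwen_mixed = {"request_mean": 1.036, "p99": 1.045, "infer": 1.074, "cold_wall": 1.010}
bart_warm = {"request_mean": 1.097, "p99": 1.112, "ttft": 1.429, "infer": 1.185}

print(all(promote(p) for p in (qwen_warm, qwen_mixed, bart_warm)))  # -> True
```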
- External-cold repeatability rerun after the seq1 multi-head default promotion (2026-02-24, 3 runs, runtime + PyTorch + vLLM) is now complete.
  - runtime means: startup `1003.315 ms`, TTFT `4.022 ms`, request full `239.277 ms`, cold-total first response `1242.592 ms`.
  - runtime-normalized ratios: PyTorch `127.900x` TTFT / `9.378x` full / `6.320x` cold-total; vLLM `12.350x` TTFT / `4.139x` full / `19.333x` cold-total.
  - runtime delta vs prior host-prefetch repeatability means (2026-02-19): TTFT `5.130 -> 4.022 ms`, full `316.403 -> 239.277 ms`, cold-total `1320.240 -> 1242.592 ms`.
  - note: Ollama was skipped for this rerun because the service/model were not installed in this host environment.
  - artifact: `benchmarks/phase2_external_cold/results/external_cold_seq1mh_default_repeatability_20260224T192020Z.md`.
- Step0 optimization follow-up (2026-02-24): seq1 multi-head softmax/PV now reuses normalized probabilities and avoids repeated `exp` in the inner PV accumulation loop.
  - 3-run external-cold repeatability (runtime + PyTorch + vLLM) runtime deltas vs the seq1mh baseline:
    - TTFT `4.022 -> 4.018 ms`
    - request full `239.277 -> 238.400 ms`
    - cold-total first response `1242.592 -> 1241.688 ms`
  - interpretation: a positive but small gain; further `decoder_step0_layers` work is still required for material uplift.
  - artifact: `benchmarks/phase2_external_cold/results/external_cold_step0expfix_repeatability_20260224T194226Z.md`.
- Step0 shared-probability follow-up (2026-02-24) was run and compared against the exp-reuse baseline:
  - runtime deltas vs exp-reuse means: TTFT `+0.001 ms`, request full `+0.278 ms`, cold-total first response `+0.282 ms`.
  - decision: revert this follow-up and keep exp-reuse as the current best state.
  - artifact: `benchmarks/phase2_external_cold/results/external_cold_step0shared_repeatability_20260224T194913Z.md`.
- Decode-stage and uncertainty update (2026-02-25):
  - first non-step0 profile artifact: `benchmarks/phase2_external_cold/results/external_cold_stepn_profile_20260225T001334Z.json`
  - key stage means (qwen, 64 tokens, no preload):
    - `decoder_stepN_logits_sample_mean=2.671 ms`
    - `decoder_stepN_layers_mean=1.360 ms`
  - uncertainty A/B artifacts: `benchmarks/phase2_external_cold/results/external_cold_uncert_on_20260225T001702Z.json`, `benchmarks/phase2_external_cold/results/external_cold_uncert_off_20260225T001704Z.json`
  - uncertainty A/B deltas (on -> off):
    - request full `479.889 -> 473.367 ms`
    - infer `461.771 -> 454.878 ms`
    - `decoder_stepN_logits_sample_mean 2.671 -> 2.562 ms`
  - interpretation: uncertainty overhead exists but is not the primary decode bottleneck in this profile.
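The "not the primary bottleneck" reading is just the arithmetic on the A/B above: the uncertainty path costs a few milliseconds out of a ~480 ms request.

```python
# Sizing the uncertainty overhead from the on -> off A/B numbers above.
full_on_ms, full_off_ms = 479.889, 473.367
overhead_ms = full_on_ms - full_off_ms
share = overhead_ms / full_on_ms
print(f"uncertainty overhead: {overhead_ms:.3f} ms ({share:.1%} of request full)")
# well under 2% of request full latency in this profile
```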
- Runtime-vLLM cold rerun (2026-02-25, same profile, uncertainty-off runtime):
  - artifact: `benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_uncertoff_20260225T001929Z.json`
  - runtime: TTFT `3.929 ms`, full `472.724 ms`, cold-total full `1476.116 ms`
  - vLLM: TTFT `49.577 ms`, full `1311.481 ms`, cold-total full `24344.013 ms`
  - interpretation: runtime remains decisively ahead in this cold-first-hit comparison.
- Decode stepN logits split + immediate kernel probes (2026-02-25, qwen, 64 tokens, no preload):
  - split artifacts: `benchmarks/phase2_external_cold/results/external_cold_stepn_split_20260225T081450Z.json`, `benchmarks/phase2_external_cold/results/external_cold_stepn_split_revert_20260225T082055Z.json`
  - split result:
    - `decoder_stepN_logits_proj_mean=2.458 ms`
    - `decoder_stepN_sample_mean=0.106 ms`
    - conclusion: the decode hotspot is logits projection, not sampling.
  - probe A/Bs (all near-neutral; no sustained gain):
    - `lt16` path (`external_cold_stepn_lt16_off/on_20260225T081717Z/081718Z`)
    - fast16/tensor-op GEMMEx (`external_cold_stepn_split_fast16_20260225T082158Z`)
    - direct-u16-input probe (`external_cold_stepn_u16direct_off/on_20260225T082445Z/082447Z`)
    - `lt_u16` workspace probe (`external_cold_stepn_ltu16ws_off/on_20260225T082735Z/082737Z`)
  - decision: all four experimental lanes reverted; the baseline path remains canonical while the next optimization focuses on deeper logits-projection architecture changes.
- Full-depth qwen rerun + runtime-vLLM comparison (2026-02-25, `--layers 36`, `--pool-mb 16384`, no preload):
  - runtime-only profile artifact: `benchmarks/phase2_external_cold/results/external_cold_stepn_split_layers36_pool16g_20260225T083216Z.json`
  - runtime-vLLM artifact: `benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_layers36_pool16g_20260225T083306Z.json`
  - full-depth runtime stage means (profiled run):
    - `decoder_stepN_layers_mean=24.306 ms`
    - `decoder_stepN_logits_proj_mean=2.458 ms`
    - `decoder_stepN_total_mean=26.875 ms`
  - runtime-vLLM request-path comparison (non-profiled run):
    - runtime: TTFT `26.775 ms`, full `2983.780 ms`, cold-total full `3987.092 ms`
    - vLLM: TTFT `49.998 ms`, full `1315.478 ms`, cold-total full `24346.938 ms`
  - interpretation: full-depth runtime still wins TTFT and cold-total, but currently loses first-request full latency to vLLM in this configuration.
- Full-depth preload follow-up (2026-02-25, `--layers 36`, `--pool-mb 16384`, preload on):
  - artifacts: `benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_layers36_pool16g_preload_20260225T150209Z.json`, `benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_layers36_pool16g_preload64_20260225T150410Z.json`
  - key result: preload converts request cache deltas from misses to hits (`cache_hit_delta=434`, `cache_miss_delta=0`) and drops runtime full latency to `~2135 ms`, but that is still above vLLM request full (`~1263-1280 ms` in these runs).
  - implication: the remaining gap is layer/decode compute, not upload misses.
- Full-depth hybrid/path probes (2026-02-25) did not improve the layer-compute gap:
  - seq1 hybrid matrix (`default` vs `qk` vs `pv` vs `both`) artifacts: `external_cold_layers36_hybrid_default_20260225T150806Z.json`, `external_cold_layers36_hybrid_qk_20260225T150811Z.json`, `external_cold_layers36_hybrid_pv_20260225T150816Z.json`, `external_cold_layers36_hybrid_both_20260225T150821Z.json`
  - result: the default custom path remained best (`infer ~2113 ms`); all hybrid variants regressed (`~2459-2556 ms`).
  - direct-u16-input full-depth A/B (`external_cold_layers36_preload_a2_u16direct_off/on_20260225T150710Z/150715Z`) was near-neutral/regressed and was reverted.
- Full-depth FFN u16 path follow-up (2026-02-25) is now complete:
  - artifacts: `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab2_base_20260225T1628Z.json`, `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab2_ffnu16_20260225T1628Z.json`
  - runtime deltas (`ffnu16 - base`):
    - TTFT `26.872 -> 18.077 ms` (`-8.795 ms`)
    - request full `2148.336 -> 1820.345 ms` (`-327.991 ms`)
    - cold-total full `6155.513 -> 4826.635 ms` (`-1328.878 ms`)
  - vLLM request full in matched runs: `1300.232/1317.144 ms`.
  - interpretation: the full-depth gap narrowed substantially, but remains open on request full latency.
- Full-depth attention/logits u16 expansion (2026-02-25) is now complete (3-seed matrix):
  - artifacts: `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab3_base_s{1,2,3}_20260225T1640Z.json`, `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab3_attnffnu16_s{1,2,3}_20260225T1640Z.json`, `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab3_attnffnlogitsu16_s{1,2,3}_20260225T1700Z.json`
  - mean results:
    - baseline runtime: `TTFT=26.863 ms`, `full=2147.754 ms`, `cold_full=6154.978 ms`
    - `ATTN+FFN u16`: `TTFT=17.080 ms`, `full=1791.873 ms`, `cold_full=4797.910 ms`
    - `ATTN+FFN+LOGITS u16`: `TTFT=16.104 ms`, `full=1775.313 ms`, `cold_full=4780.830 ms`
  - the runtime/vLLM full-latency ratio improved from `1.653x` (baseline) to `1.365x` (best), but request full latency still trails vLLM in this full-depth setup.
- Full-depth decode-input reuse + u16-Lt follow-up (2026-02-25) is now complete:
  - the regressing fused `gate+up` FFN trial was explicitly reverted after a measured slowdown.
  - shared decode-input pre-cast reuse for q/k/v + gate/up was implemented and validated.
  - a u16 cublasLt cached path (dtype-aware, safe fallback) was implemented and validated.
  - new 3-seed means (`precastreuse+u16lt`):
    - runtime `TTFT=15.522 ms`, `full=1729.351 ms`, `cold_full=4735.345 ms`
    - delta vs prior best (`ATTN+FFN+LOGITS u16`): `TTFT -0.582 ms`, `full -45.962 ms`, `cold_full -45.485 ms`
  - the runtime/vLLM full-latency ratio improved further to `~1.323x` in the latest matched 3-seed set.
- FAST_16 compute-mode follow-up (2026-02-25) was tested on top of u16-Lt:
  - strict Week 3 parity remained clean (`checked=3`, `failed=0`).
  - request-full changes were small, and one repeatability run showed a large startup outlier on both runtime and vLLM.
  - decision: do not promote FAST_16 as canonical yet; keep non-fast compute on the stable u16-Lt lane.
- Residual-fused u16-Lt follow-up (2026-02-26) is now complete:
  - implemented a u16 no-bias residual-accumulate path for decoder `o_proj` and `ffn_down` (Lt fused when available, safe fallback otherwise).
  - strict Week 3 parity remained clean (`checked=3`, `failed=0`).
  - new 3-seed means (`residfuse+u16lt`):
    - runtime `TTFT=15.400 ms`, `full=1719.302 ms`, `cold_full=4725.923 ms`
    - delta vs prior `precastreuse+u16lt`: `TTFT -0.122 ms`, `full -10.049 ms`, `cold_full -9.422 ms`
  - profiler corroboration: `decoder_step_profile_o_proj_resid_mean` and `decoder_step_profile_ffn_down_resid_mean` dropped, and `decoder_stepN_layers_mean` moved down accordingly.
- cuBLASLt workspace probe (`TRENI_LINEAR_LT_WORKSPACE_MB`) was run and rejected for this lane (2026-02-26):
  - trial artifacts: `benchmarks/phase2_external_cold/results/external_cold_layers36_trial_ltws0_20260226T105356Z.json`, `benchmarks/phase2_external_cold/results/external_cold_layers36_trial_ltws32_20260226T105401Z.json`
  - request full regressed with the workspace enabled (`1711.213 -> 1754.568 ms`), so no promotion.
- Full-depth FFN activation-to-u16 fused follow-up (`TRENI_DECODER_FFN_ACT_U16_FUSED`) is now complete and promoted default-on (2026-02-26):
  - runtime-only 3-seed A/B:
    - off: `TTFT=15.333 ms`, `full=1715.700 ms`, `cold_full=4721.653 ms`
    - on: `TTFT=15.193 ms`, `full=1704.987 ms`, `cold_full=4710.958 ms`
    - delta (on-off): `TTFT -0.140 ms`, `full -10.713 ms`, `cold_full -10.696 ms`
  - runtime-vLLM 3-seed A/B (same host/window):
    - off ratio: `runtime/vLLM full = 1.3208x`
    - on ratio: `runtime/vLLM full = 1.3012x`
  - strict parity remained clean in explicit-on and default-on runs: `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_ffnactu16_20260226T1100.json`, `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_ffnactu16_default_20260226T1108.json`
Timeline
Latest Key Numbers
Warm Path (G5)
- Warm steady-state request mean: `~80.8 ms`
- Warm steady-state p99: `~89.6 ms`
Frontend Coverage-Instrumented Reruns (G5, 2026-02-23)
- Matrix with bounded hybrid gate (`attn_backend_frontend_matrix_20260223T011158Z`):
  - warm fixed fused share: `~0.030303` (custom handles `~0.969697` of calls)
  - warm fixed TTFT: custom `4.194 ms` vs fused-profile `4.269 ms`
  - mixed warm mean: custom `48.302 ms` vs fused-profile `47.592 ms` (near parity in that bounded-coverage setting)
- High fused-coverage warm profile (`fused_coverage_profiles_20260223T011504Z`):
  - fused share `~0.878788`
  - request mean: custom `20.292 ms` vs fused `22.310 ms` (`~1.099x` slower on fused)
  - TTFT: custom `4.196 ms` vs fused `4.496 ms`
- High fused-coverage cold profile (`fused_coverage_cold_profiles_20260223T011534Z`):
  - fused share `~0.9`
  - cold TTFT: custom `4.215 ms` vs fused `704.176 ms`
  - cold full: custom `246.306 ms` vs fused `6595.157 ms`
- Plain interpretation:
  - bounded gating avoids most regressions by keeping the majority of calls on custom.
  - when fused is exercised heavily, the current frontend implementation still loses; dynamic shape-plan reuse/coverage remains the blocker.
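The coverage interpretation can be framed as a two-path mixture: expected request latency is roughly the fused share times the fused-profile latency plus the remainder on custom. This is only an illustrative approximation (it treats the per-call backend share as if it scaled whole-request means), using the warm-profile numbers above:

```python
# Illustrative two-path mixture model for backend-share coverage data.
def mixed_latency(share_fused: float, fused_ms: float, custom_ms: float) -> float:
    """Approximate mean request latency under a fused/custom call split."""
    return share_fused * fused_ms + (1.0 - share_fused) * custom_ms

# High fused coverage: the fused gap dominates the blend.
print(mixed_latency(0.878788, 22.310, 20.292))
# Bounded gate (~3% of calls on fused): overall latency stays near custom.
print(mixed_latency(0.030303, 22.310, 20.292))
```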
AWS G5 TTFT Kernel Pass (2026-02-22, lt0_sync0 baseline -> best norm+softmax+lt1_sync0)
- Cold TTFT: `16.738 ms -> 13.974 ms` (`1.198x` faster).
- Cold full latency: `424.685 ms -> 396.814 ms` (`1.070x` faster).
- Warm mean latency: `174.237 ms -> 147.269 ms` (`1.183x` faster).
- Warm p99 latency: `1035.823 ms -> 936.297 ms` (`1.106x` faster).
- Per-model cold TTFT deltas:
  - `qwen`: `39.537 -> 29.411 ms` (largest gain).
  - `donut`: `3.505 -> 2.619 ms`.
  - `bart`: near-flat (`16.523 -> 16.573 ms`), remaining hotspot to isolate.
AWS G5 TTFT Follow-Up (2026-02-22, best norm+softmax+lt1_sync0 -> default seq_q=1 tiny-kernel path)
- Cold TTFT: `13.974 ms -> 12.504 ms` (`1.118x` faster).
- Cold full latency: `396.814 ms -> 390.099 ms` (`1.017x` faster).
- Warm mean latency: `147.269 ms -> 143.230 ms` (`1.028x` faster).
- Warm p99 latency: `936.297 ms -> 924.276 ms` (`1.013x` faster).
- Bart cold TTFT: `16.573 ms -> 12.842 ms` (`1.29x` faster).
- Profiling signal (`TRENI_STEP0_PROFILE=1`) showed Bart step0 dominated by `decoder_step0_layers`, which motivated this path.
- 3-seed repeatability on the new default path:
  - cold TTFT `12.563 ± 0.037 ms`
  - cold full `390.961 ± 0.270 ms`
  - warm mean `143.297 ± 0.222 ms`
  - warm p99 `925.668 ± 1.070 ms`
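The "mean ± std" repeatability figures above are per-seed means aggregated with a sample standard deviation. A minimal sketch (the per-seed values below are hypothetical, chosen to reproduce the quoted cold-TTFT aggregate):

```python
# Aggregate per-seed means into the "mean ± std" form used in this log.
import statistics

def repeatability(samples_ms: list[float]) -> str:
    return f"{statistics.mean(samples_ms):.3f} ± {statistics.stdev(samples_ms):.3f} ms"

cold_ttft_seeds = [12.526, 12.563, 12.600]  # hypothetical per-seed values
print(repeatability(cold_ttft_seeds))  # -> 12.563 ± 0.037 ms
```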
- Week-3 parity status update (strict trace rerun):
  - the parser fix correctly classifies fallback/failure markers in interleaved runtime logs.
  - a debug rerun found the old-container root cause: an out-of-bounds `embeddings.word_embeddings.weight` offset for `minilm` in `monolith_phase3.bin`.
  - the rebuilt parity container (`monolith_phase3_qbm.bin`, qwen+bart+minilm) now passes strict external-HF parity: `checked=3`, `failed=0`, `missing=0`.
  - the runtime on/off Bart step0 logits A/B stayed numerically aligned (max abs diff `~2e-6`, cosine `~1.0`).
AWS G5 Attention Backend A/B (2026-02-22, deconfounded)
- First-order run (`attn_backend_ab_20260222T143605Z`) showed a large cold infer/full gap caused by run order (custom executed first after build/startup).
- Reverse-order rerun (`attn_backend_ab_rev_20260222T144736Z`) removed that bias:
  - cold TTFT: `6.460 ms` custom vs `6.447 ms` cudnn proxy (`1.002x` custom/cudnn).
  - cold full: `147.789 ms` custom vs `146.707 ms` cudnn proxy (`1.007x`).
  - warm mean: `53.545 ms` custom vs `53.341 ms` cudnn proxy (`1.004x`).
  - warm p99: `82.031 ms` custom vs `80.754 ms` cudnn proxy (`1.016x`).
- Interpretation:
  - the legacy `cudnn_sdpa` proxy path is near-parity/slightly faster by a small margin.
  - runtime now keeps proxy behavior explicit opt-in (`TRENI_ATTN_ALLOW_SDPA_PROXY=1`).
  - this section is proxy-only; true fused frontend results are reported in the next section.
AWS G5 True Fused cuDNN Frontend A/B (2026-02-22, fixed qwen, warmed query set)
- Artifact: `attn_backend_ab_frontend_20260222T220111Z`.
- Warm (`http_warmup_runs=8`, `http_runs=8`, `--http-model qwen`):
  - request mean: custom `19.324 ms` vs fused `21.503 ms` (custom/fused = `0.899`)
  - request p99: custom `22.087 ms` vs fused `24.875 ms` (custom/fused = `0.888`)
  - infer mean: custom `18.803 ms` vs fused `20.976 ms` (custom/fused = `0.896`)
  - TTFT mean: custom `4.199 ms` vs fused `4.498 ms` (custom/fused = `0.934`)
- Cold first hit:
  - TTFT: custom `4.220 ms` vs fused `710.641 ms`
  - full latency: custom `250.929 ms` vs fused `6610.148 ms`
- Profile probe (`TRENI_ATTN_CUDNN_FRONTEND_PROFILE=1`) showed:
  - `avg_build_ms_per_miss ~= 704.8 ms`
  - `avg_pack_ms ~= 0.010 ms`
  - `avg_exec_ms ~= 0.021-0.048 ms`
  - `avg_unpack_ms ~= 0.005 ms`
- Interpretation:
  - fused path is active and measurable.
  - warm path is close to custom when shapes are warmed.
  - cold/mixed penalty is dominated by shape-plan miss compilation, not kernel execution.
AWS G5 Frontend Repeatability Matrix (2026-02-22, repeats=3)
- Artifact: `attn_backend_frontend_matrix_20260222T221948Z`.
- Profiles:
  - `warm_fixed` (`http_warmup_runs=8`)
  - `mixed_churn` (`http_warmup_runs=0`)
- Warm-fixed aggregate:
  - custom request mean `19.271 +/- 0.050 ms` vs fused `21.468 +/- 0.018 ms`
  - custom infer `18.812 +/- 0.059 ms` vs fused `20.984 +/- 0.026 ms`
  - custom TTFT `4.198 +/- 0.001 ms` vs fused `4.498 +/- 0.001 ms`
- Mixed-churn aggregate:
  - custom request mean `47.864 +/- 0.018 ms` vs fused `843.141 +/- 0.735 ms`
  - custom infer `47.331 +/- 0.050 ms` vs fused `842.542 +/- 0.747 ms`
  - custom TTFT `4.197 +/- 0.002 ms` vs fused `179.744 +/- 0.263 ms`
- Win counts:
  - custom wins every tracked metric in both profiles (`3/3` each metric).
- Interpretation:
  - this provides repeatable evidence that the current custom path outperforms the current fused frontend path under both stable warmed traffic and shape churn.
AWS G5 Frontend Claim-Strength (2026-02-22)
- Artifact: `attn_backend_frontend_claim_report_20260222T222958Z`.
- Paired-delta CI95 summary (`frontend - custom`; positive means custom faster):
  - warm-fixed request mean delta: `+2.197 ms` CI95 `[2.125, 2.238]`.
  - warm-fixed TTFT delta: `+0.300 ms` CI95 `[0.299, 0.301]`.
  - mixed-churn request mean delta: `+795.277 ms` CI95 `[794.408, 795.747]`.
  - mixed-churn TTFT delta: `+175.546 ms` CI95 `[175.300, 175.820]`.
- Interpretation:
  - effect direction is consistent and large in both profiles.
  - repeat count is still small (`n=3`/profile), so this should be treated as strong directional evidence with low-variance replication, not final high-N significance.
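A paired-delta CI95 like the ones above can be formed several ways; the report's exact construction is not reproduced here, but a percentile bootstrap over per-run paired deltas is one common choice. A sketch under that assumption (all run values below are illustrative):

```python
import random

def paired_delta_ci95(frontend_ms, custom_ms, n_boot=10_000, seed=0):
    """Percentile-bootstrap CI95 of mean(frontend - custom) over paired runs.
    Positive deltas mean the custom path was faster on that run."""
    deltas = [f - c for f, c in zip(frontend_ms, custom_ms)]
    rng = random.Random(seed)
    boot_means = sorted(
        sum(rng.choices(deltas, k=len(deltas))) / len(deltas)
        for _ in range(n_boot)
    )
    return boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]

# Illustrative paired runs shaped like the warm-fixed request-mean data.
lo, hi = paired_delta_ci95([21.45, 21.47, 21.49], [19.22, 19.27, 19.32])
```

With only `n=3` pairs the bootstrap interval is very coarse, which is consistent with the low-N caveat above.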
AWS G5 Frontend Miss-Mitigation (Updated Canonical, 2026-02-22)
- Artifacts:
  - baseline matrix: `attn_backend_frontend_matrix_20260222T230445Z` (no_preload)
  - candidate matrix: `attn_backend_frontend_matrix_20260222T231139Z` (startup_preload_benchmark_queries)
  - compare report: `attn_backend_frontend_missmit_compare_20260222T231335Z`
  - exact cold-prompt probe: `preload_exact_prompt_probe_20260222T231050Z.json`
- Mitigation:
  - fixed a runtime splitter bug so `TRENI_HTTP_PRELOAD_PROMPTS` executes all prompts.
  - used preload prompts matched to the benchmark cold/warm query set.
- Mixed-churn improvements (`no_preload` -> `startup_preload_benchmark_queries`):
  - fused request mean: `843.242 -> 22.433 ms` (`37.590x`)
  - fused infer mean: `842.684 -> 21.965 ms` (`38.365x`)
  - fused warm TTFT: `179.541 -> 4.497 ms` (`39.928x`)
  - fused cold TTFT: `704.521 -> 4.495 ms` (`156.723x`)
  - fused cold full latency: `6593.495 -> 25.785 ms` (`255.707x`)
- Exact cold-prompt probe:
  - fused first-hit TTFT: `4.499 ms`
  - fused first-hit full latency: `26.090 ms`
- Interpretation:
  - with matched preload coverage, the previous fused cold/mixed miss penalty is removed for this harness.
  - custom still has a small warmed-path lead, but first-hit TTFT no longer regresses on the canonical prompt set.
  - still open: generalize this behavior without a curated prompt-list preload.
AWS G5 Frontend Shape-Prebuild Probe (No Preload Prompts, 2026-02-22)
- Artifacts:
  - cold probe (startup prebuild enabled): `prebuild_startup_nopreload_probe_20260222T232932Z.json`
  - matrix probe (`repeats=1`): `attn_backend_frontend_matrix_20260222T233003Z`
  - compare (`no_preload` -> `shape_prebuild_nopreload`): `attn_backend_frontend_missmit_compare_20260222T233116Z`
- Mitigation:
  - startup shape prebuild via:
    - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=16`
    - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128`
  - no prompt preload list used.
- Key numbers:
  - cold probe startup->healthy: `11017.541 ms`
  - cold probe fused TTFT: `5.814 ms`
  - cold probe fused full request latency: `255.434 ms`
  - mixed-churn fused deltas (`no_preload` -> `shape_prebuild_nopreload`):
    - cold TTFT: `704.521 -> 5.805 ms` (`121.364x`)
    - cold full: `6593.495 -> 255.267 ms` (`25.830x`)
    - warm request mean: `843.242 -> 51.482 ms` (`16.379x`)
    - warm TTFT: `179.541 -> 4.824 ms` (`37.218x`)
- Interpretation:
  - this is the first prompt-independent mitigation that removes fused request-path spikes.
  - the current tradeoff is a startup compile burst (`http_attn_prebuild`), so the next work is lowering startup overhead while preserving these request-path gains.
- Follow-up tuning (`seq_kv_max: 16 -> 10`) artifact: `prebuild_startup10_nopreload_probe_20260222T235944Z.json`
  - startup->healthy: `11017.541 -> 7011.472 ms` (`1.571x` faster startup)
  - request TTFT: `5.814 -> 5.826 ms` (near-identical)
  - request full latency: `255.434 -> 254.936 ms` (near-identical)
- Tuned matrix confirmation:
  - tuned matrix (`seq_kv_max=10`): `attn_backend_frontend_matrix_20260223T000256Z`
  - compare vs `seq_kv_max=16`: `attn_backend_frontend_missmit_compare_20260223T000343Z`
  - warm-fixed fused request mean: `22.556 -> 22.265 ms`
  - mixed fused request mean: `51.482 -> 50.974 ms`
- Lower-range startup probe (`seq_kv_max=8`): `prebuild_startup8_nopreload_probe_20260223T000600Z.json`
  - startup->healthy: `6010.381 ms`
  - request TTFT: `703.771 ms` (regression)
  - request full latency: `1660.576 ms` (regression)
  - interpretation: `seq_kv_max=8` under-covers this benchmark prompt profile; `10` is the current minimum safe tuned range.
- Heuristic probe (`TRENI_ATTN_CUDNN_FRONTEND_HEUR_MODE`):
  - `A` and `B` remained near-identical on startup/build behavior in this path.
  - `FALLBACK` produced no valid engine configs for the current frontend descriptor on `sm86`.
AWS G5 Frontend Hybrid Shape-Gate Follow-Up (2026-02-23)
- Artifacts:
  - startup probes (3 runs): `prebuild_hybrid10_nopreload_probe_r{1,2,3}_20260223T002214Z.json`
  - matrix (`repeats=3`): `attn_backend_frontend_matrix_20260223T001959Z`
  - compare vs prior tuned no-gate matrix: `attn_backend_frontend_missmit_compare_20260223T002153Z`
- Policy:
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV=10`
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=10`
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128`
  - `TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV=10`
- Startup probe aggregate (`qwen`, no preload prompts, fused frontend):
  - startup->healthy: `2004.840 +/- 0.146 ms`
  - request TTFT: `4.955 +/- 0.011 ms`
  - request full latency: `242.673 +/- 0.352 ms`
- Delta vs prior tuned no-gate probe (`prebuild_startup10_nopreload_probe_20260222T235944Z`):
  - startup->healthy: `7011.472 -> 2004.840 ms` (`3.497x` faster)
  - request TTFT: `5.826 -> 4.955 ms` (`1.176x` faster)
  - request full latency: `254.936 -> 242.673 ms` (`1.051x` faster)
- Matrix deltas vs prior tuned no-gate matrix (`attn_backend_frontend_matrix_20260223T000256Z`):
  - warm-fixed fused request mean: `22.265 -> 20.354 ms` (`1.094x` faster)
  - mixed fused request mean: `50.974 -> 47.904 ms` (`1.064x` faster)
  - cold fused TTFT: `5.819 -> 4.959 ms` (`1.173x` faster)
  - cold fused full latency: `254.146 -> 242.569 ms` (`1.048x` faster)
- Broader-shape sanity artifact: `hybrid_shape_sanity_20260223T002857Z.json`
  - startup->healthy stayed `~2004 ms`, `inference.used=true` for all 5 requests.
  - long-prompt/full-latency regressions were observed as seq1 shapes exceeded the prebuilt window (`seq_kv=11..30` miss lines in the log head), confirming remaining dynamic-shape work.
- Bounded-gate follow-up artifact: `hybrid_shape_sanity_maxgate_20260223T003453Z.json`
  - added `TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10`.
  - no fused miss lines were observed, and the same broader-shape set stayed inference-valid with low TTFT.
  - mean full latency over the 5-shape set dropped from `9974.576 ms` to `274.072 ms` (`36.395x` faster).
- Fixed-profile confirmation after max gate:
  - matrix (`repeats=3`): `attn_backend_frontend_matrix_20260223T003611Z`
  - compare vs prior hybrid: `attn_backend_frontend_missmit_compare_20260223T003734Z` (near-parity fixed-profile deltas).
- Interpretation:
  - hybrid shape gating materially reduces startup compile cost while preserving low no-preload request-path latency.
  - strict fused runs remain inference-valid with low-shape custom fallback in this harness.
  - the bounded max-gate removes broad-shape miss cascades now, but wider fused coverage without fallback still needs dynamic seq1 plan reuse/coverage.
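The min/max seq_kv gate described above can be modeled as a simple window check. An illustrative sketch (the function is hypothetical; the runtime's actual gate is driven by the `TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV` / `TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV` environment variables):

```python
def use_fused_frontend(seq_kv: int, min_kv: int = 10, max_kv: int = 10) -> bool:
    """Route a seq_q=1 attention call to the fused frontend only when its
    seq_kv lies inside the prebuilt plan window; anything outside falls back
    to the custom kernel, so no plan-miss compile can land on the request path."""
    return min_kv <= seq_kv <= max_kv

assert use_fused_frontend(10)        # inside the prebuilt window -> fused
assert not use_fused_frontend(30)    # beyond the max gate -> custom fallback
assert not use_fused_frontend(8)     # below the min gate -> custom fallback
```

The design tradeoff is exactly the one the interpretation names: a tighter window means a cheaper startup prebuild but more custom-fallback traffic outside it.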
Commercial Fairness Root-Cause Grouping (2026-02-22)
- Artifact: `commercial_gap_root_cause_20260222T222958Z`.
- OpenAI `gpt-5.2`, model-only (`paired_n=36`):
  - latency delta mean (external - internal): `-69.311 ms`, CI95 `[-193.985, 61.444]` -> parity/noise.
  - external controller overhead mean: `2.081 ms` vs model-hop mean `1406.971 ms`.
- OpenAI `gpt-5.2`, tool-only parity (`paired_n=12`):
  - latency delta mean: `+49.601 ms`, CI95 `[-162.047, 274.981]` -> parity/noise.
  - external controller overhead mean: `12.842 ms` vs model-hop mean `2456.108 ms`.
- OpenRouter Sonnet 4.6, model-only (`paired_n=24`):
  - latency delta mean: `+204.883 ms`, CI95 `[-148.517, 683.114]` -> parity/noise.
  - external controller overhead mean: `2.254 ms` vs model-hop mean `2220.251 ms`.
- Interpretation:
  - the current commercial "loss" is not statistically locked in this dataset.
  - dominant variance is upstream model-hop time; to claim directional differences, we need higher-N region/time-pinned reruns.
AWS G5 Seq1 Hybrid Tuning (2026-02-22)
- Warm matrix (`seq1_hybrid_20260222T1554Z`):
  - default: `54.505 ms` mean, `82.134 ms` p99.
  - qk-cublas: `54.572 ms` mean, `81.776 ms` p99.
  - pv-cublas: `54.281 ms` mean, `80.754 ms` p99.
  - both-cublas: `54.822 ms` mean, `79.947 ms` p99.
- Cold sanity (`seq1_hybrid_20260222T1558Z`):
  - default full `147.756 ms` vs pv-cublas full `149.293 ms`.
- Interpretation:
  - pv-cublas gives the best warm mean/p99 in this pass, but slightly worsens cold full latency.
  - default remains custom seq1 for the best overall cold/hot balance.
AWS G5 Seq1 Fused Follow-Up (2026-02-22)
- Follow-up artifact pack: `seq1_hybrid_fused_20260222T192656Z`.
- Code changes:
  - fused `seq_q=1` softmax+PV kernel in the custom path
  - seq1 QK kernel launch retune (`64/128/256` by `head_dim`)
- Warm deltas vs prior matrix (`seq1_hybrid_20260222T1554Z`):
  - default mean: `54.505 -> 52.535 ms` (`1.037x`)
  - default p99: `82.134 -> 80.554 ms` (`1.020x`)
  - pv-cublas mean: `54.281 -> 51.964 ms` (`1.045x`)
  - pv-cublas p99: `80.754 -> 78.519 ms` (`1.028x`)
- Cold deltas vs prior sanity (`seq1_hybrid_20260222T1558Z`):
  - default TTFT: `6.447 -> 6.209 ms` (`1.038x`)
  - default full: `147.756 -> 145.587 ms` (`1.015x`)
  - pv-cublas TTFT: `6.450 -> 6.215 ms` (`1.038x`)
  - pv-cublas full: `149.293 -> 147.937 ms` (`1.009x`)
- Interpretation:
  - this moved both warm and cold in the correct direction without changing model routing/task behavior.
  - the default custom path remains the balanced default; pv-cublas still leads warm-only in this slice.
H100 Fused cuDNN SDPA Probe (2026-02-22)
- Probe artifact pack: `cudnn_sdpa_h100_probe_20260222T1935Z`.
- Results:
  - alignment sweep: all `cnt=0` (`align={16,32,64,128,256}`).
  - shape/layout sweep: `tested=1440`, `supported=0`.
  - debug traces show candidate engines (`8/9/10/11`) but no viable configs after support checks:
    - `NOT_SUPPORTED_GRAPH_PATTERN` (`8/9/11`)
    - `NOT_SUPPORTED_ARCH_MISMATCH` (`10`, Blackwell-only).
- Interpretation:
  - this H100 probe formulation still finds no viable fused engines.
  - the proxy path remains explicit opt-in only for legacy A/B.
Phase 3 Realistic-v1 Loop Pack (2026-02-22, 3 seeds baseline + 3 seeds stress)
- Summary artifact: `phase3_realistic_v1_summary_20260222T143919Z.json`.
- Baseline means:
  - internal success `1.0000`
  - external success `0.9010`
  - external/internal latency ratio `15.8563x`
  - external/internal steps ratio `1.8037x`
- Stress means:
  - internal success `1.0000`
  - external success `0.9010`
  - external/internal latency ratio `75.3563x`
  - external/internal steps ratio `1.8037x`
- Interpretation:
  - moving to richer file-backed fixtures did not change direction; internal remains faster and more reliable.
  - stress again amplifies external hop latency substantially while internal remains stable.
Phase 3 Realistic-v1 Uncertainty Ablation (2026-02-22, seed 7 baseline+stress)
- Comparison artifact: `phase3_uncertainty_compare_realistic_v1_s7_20260222T144116Z.json`.
- Success deltas with uncertainty enabled (`int_on_ext_on` vs off-arms):
  - `normalized_logprob`:
    - baseline: internal `+0.2500`, external `+0.2500`
    - stress: internal `+0.2500`, external `+0.2344`
  - `raw_logit_margin`: same deltas as above.
  - `hybrid`: same deltas as above.
- Interpretation:
  - uncertainty-aware branching continues to provide positive success deltas on realistic-v1.
  - under stress, the external benefit remains positive but slightly reduced.
Routing (Internal vs External, G5)
- Internal mean: `94.849 ms`
- External mean: `97.927 ms`
- External/Internal: `1.032x` (internal faster)
Routing Failure-Amplification Stress (G5, 2026-02-18)
- Internal mean: `76.071 ms`
- External mean: `109.806 ms` (`1.443x` external/internal)
- Internal error rate: `0.0000`
- External error rate: `0.0833`
- Error-rate amplification: `inf` (external errors present, internal none)
- External retry/failure signal: tool retries mean `0.182`, taxonomy `tool_hop_failed=4`
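The `inf` amplification above falls out of a zero internal error rate. A small guard keeps the metric well-defined (the helper name and the both-clean convention are ours):

```python
import math

def error_amplification(external_rate: float, internal_rate: float) -> float:
    """External/internal error-rate ratio. When the internal path is
    error-free, report inf if the external path errored at all, else 1.0
    (a convention choice for the both-clean case)."""
    if internal_rate == 0.0:
        return math.inf if external_rate > 0.0 else 1.0
    return external_rate / internal_rate

# Stress run above: external 0.0833, internal 0.0000.
print(error_amplification(0.0833, 0.0))  # -> inf
```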
Routing Matrix Expansion (G5, 2026-02-19, 6 profiles)
- Baseline profile (`p00`): ratio `1.0420x`, external error `0.0000`.
- Mild fail profile (`p01`): ratio `1.0480x`, external error `0.0000`.
- Mild timeout profile (`p02`): ratio `1.1420x`, external error `0.0000`.
- Mixed moderate (`p03`): ratio `1.1640x`, external error `0.0417`.
- Mixed aggressive (`p04`): ratio `1.4360x`, external error `0.0833`.
- Mixed aggressive + retry2 (`p05`): ratio `1.4160x`, external error `0.0833`.
- Internal error rate stayed `0.0000` across all profiles.
- Matrix-wide mean ratio: `1.2080x` external/internal.
Routing Cross-Host Pilot (2026-02-19, local client -> SSH tunnel -> G5)
- Baseline profile (`crosshost-p00-baseline`, 12 runs):
  - internal mean: `1071.477 ms`
  - external mean: `1059.478 ms`
  - external/internal ratio: `0.989x`
  - internal error rate: `0.0000`
  - external error rate: `0.0000`
- Mild-timeout profile (`crosshost-p02-timeout-mild`, 12 runs):
  - internal mean: `1054.123 ms`
  - external mean: `1123.393 ms`
  - external/internal ratio: `1.066x`
  - internal error rate: `0.0000`
  - external error rate: `0.0000`
  - external tool retries mean: `0.083`
- Stress profile (`crosshost-p04-stress`, 12 runs):
  - internal mean: `1056.013 ms`
  - external mean: `1100.010 ms`
  - external/internal ratio: `1.042x`
  - internal error rate: `0.0000`
  - external error rate: `0.0833`
  - external tool retries mean: `0.182`
- Interpretation:
  - in cross-host conditions, stress again amplifies external-path latency and error rates while internal remains error-free.
Routing Split-Host Matrix (2026-02-19, canonical Track B)
- Topology:
  - GPU host: runtime endpoint
  - CPU host: external controller + tool services
  - controller/tool calls runtime over the VPC private network
- Profiles (`12 runs` each):
  - `splithost-p00-baseline`: ratio `0.995x`, ext error `0.0000`.
  - `splithost-p01_fail_mild`: ratio `0.998x`, ext error `0.0000`, tool retries `0.021`.
  - `splithost-p02_timeout_mild`: ratio `1.042x`, ext error `0.0000`, tool retries `0.021`.
  - `splithost-p03_mixed_moderate`: ratio `1.001x`, ext error `0.0417`, tool retries `0.065`.
  - `splithost-p04_mixed_aggressive`: ratio `1.087x`, ext error `0.0833`, tool retries `0.182`.
  - `splithost-p05_mixed_aggressive_retry2`: ratio `1.045x`, ext error `0.0833`, tool retries `0.091`.
- Matrix-wide:
  - external/internal latency ratio mean `1.028x`
  - internal error mean `0.0000`
  - external error mean `0.0347`
- Interpretation:
  - baseline remains near parity, while timeout/failure pressure amplifies external-path latency and error rate; the internal path remains error-free across all profiles.
Internet Multi-Hop Matrix (2026-02-20, Fly.io + Commercial APIs)
- Topology:
  - internal path: local client -> commercial API
  - external path: local client -> Fly controller/tool -> same commercial API
- OpenAI (`gpt-5.2`, `runs=3`, 3 profiles):
  - matrix ratio mean: `1.1123x` external/internal
  - baseline: `1.110x`
  - timeout-mild: `1.082x`
  - mixed-aggressive: `1.145x`
  - mixed-aggressive external error: `0.0833` (internal `0.0000`)
- OpenRouter (`openai/gpt-5.2`, `runs=3`, 3 profiles):
  - matrix ratio mean: `0.7553x` external/internal
  - baseline: `0.686x`
  - timeout-mild: `0.891x`
  - mixed-aggressive: `0.689x`
  - mixed-aggressive external error: `0.1667` (internal `0.0000`)
- OpenRouter (`anthropic/claude-sonnet-4.6`, `runs=3`, 3 profiles):
  - matrix ratio mean: `1.0277x` external/internal
  - baseline: `1.236x`
  - timeout-mild: `0.968x`
  - mixed-aggressive: `0.879x`
  - mixed-aggressive external error: `0.1667` (internal `0.0000`)
- Interpretation:
  - the OpenAI matrix supports the expected direction under internet hops: the external path is slower and less reliable in stress.
  - OpenRouter remains non-canonical for Track B direction claims in this topology due to mixed/inverted profile direction and elevated errors.
Local Control Matrix (No Fly Scheduler Path, 2026-02-20)
- Topology:
  - internal path: local client -> commercial API
  - external path: local client -> local standalone controller/tool -> same commercial API
- OpenAI (`gpt-5.2`, `runs=8`, 3 profiles):
  - matrix ratio mean: `0.9867x` external/internal
  - profiles: baseline `0.995x`, timeout-mild `0.977x`, mixed-aggressive `0.988x`
  - external error mean: `0.0313`
- OpenRouter (`anthropic/claude-sonnet-4.6`, `runs=8`, 3 profiles):
  - matrix ratio mean: `1.0663x` external/internal
  - profiles: baseline `1.055x`, timeout-mild `1.141x`, mixed-aggressive `1.003x`
  - external error mean: `0.0313`
- Interpretation:
  - higher-N controls reduce jitter and show mixed but informative behavior: OpenAI is near parity, while OpenRouter Sonnet trends `external > internal`.
  - external stress errors remain present while internal stayed error-free.
Task-Family Parity Split (Local Control, Higher-N, 2026-02-20)
- OpenAI `gpt-5.2` (`runs=8`):
  - `model_only`: external/internal `0.958x` (near parity, slight inversion).
  - `tool_only`: external/internal `1.136x` (external slower).
  - errors: internal `0.0`, external `0.0` on both runs.
- OpenRouter `anthropic/claude-sonnet-4.6` (`runs=8`):
  - `model_only`: external/internal `1.044x` (external slower).
  - `tool_only`: external/internal `1.051x` (external slower).
  - errors: internal `0.0`, external `0.0` on both runs.
- Interpretation:
  - the task-family split removes ambiguity from mixed task composition.
  - tool-required tasks consistently favor internal routing on both providers.
  - model-only behavior is provider-sensitive but remains close enough that architecture effects are small compared with provider/runtime variance.
Qwen Cold Upload GPU-Convert Ablation (2026-02-19, G5)
- A/B toggle:
  - off: `TRENI_TENSOR_CONVERT_GPU=0`
  - on: default GPU conversion path enabled
- Qwen first-hit metrics:
  - full latency: `1116.567 ms -> 238.740 ms` (`4.677x` faster).
  - decoder tensor upload: `1007 ms -> 129 ms` (`7.806x` faster).
  - decoder tensor convert: `862 ms -> 6 ms` (`143.667x` faster).
  - decoder tensor h2d: `143 ms -> 121 ms` (`1.182x` faster).
  - startup + full response total: `2119.906 ms -> 1242.057 ms` (`1.707x` faster).
- Interpretation:
  - this isolates CPU tensor conversion as the dominant cold bottleneck and shows that moving conversion to the GPU materially reduces Qwen cold-path latency.
- External-cold runtime-only confirmation (preload enabled, `max_tokens=48`):
  - startup-to-healthy: `2004.560 -> 1003.455 ms` (`1.997x` faster).
  - request full latency: `317.989 -> 317.276 ms` (no material change).
  - cold-total first response: `2322.549 -> 1320.731 ms` (`1.759x` faster).
  - cold-total first token: `2009.697 -> 1008.582 ms` (`1.993x` faster).
Runtime vs vLLM External-Cold Repeatability (2026-02-19, 3 runs)
- Matched setup:
  - same G5 host, same model family, token parity (`max_tokens=48`), runtime preload enabled.
- 3-run means:
  - runtime TTFT `5.135 ms` vs vLLM `84.390 ms` (`16.433x` speedup).
  - runtime request full `319.063 ms` vs vLLM `1111.463 ms` (`3.484x` speedup).
  - runtime cold-total first response `1656.573 ms` vs vLLM `31151.892 ms` (`18.805x` speedup).
- Runs 2-3 only (post-first-run stabilization):
  - TTFT `17.211x`, request full `3.416x`, cold-total `22.395x`.
- Interpretation:
  - the cold-path fix and request-path lead hold in a fresh runtime-vs-vLLM repeatability rerun after restoring vLLM in the benchmark env.
External Cold All-Backend Repeatability (2026-02-19, 3 runs, GPU-Convert Fix2)
- Setup:
  - same G5 host, same model family (`Qwen 3B`), token parity (`max_tokens=48`), runtime preload enabled.
  - backends: runtime + PyTorch + vLLM + Ollama.
- 3-run means (all runs):
  - runtime: startup `2339.131 ms`, TTFT `5.131 ms`, request full `318.315 ms`, cold-total `2657.447 ms`.
  - vLLM/runtime ratios: TTFT `16.091x`, request full `3.852x`, cold-total `10.887x`.
  - PyTorch/runtime ratios: TTFT `115.313x`, request full `7.508x`, cold-total `3.921x`.
  - Ollama/runtime ratios: TTFT `2108.743x`, request full `35.118x`, cold-total `4.584x`.
- Stable reference (runs 1-2):
  - runtime startup/cold-total: `1003.915/1321.205 ms`.
  - vLLM/runtime ratios: TTFT `18.275x`, request full `4.298x`, cold-total `21.875x`.
- Runtime-only stability sweep (`5` runs):
  - median runtime startup/cold-total: `1003.400/1320.952 ms`.
  - one run showed a preload upload outlier (`decoder_tensor_upload=1877.485 ms`, `decoder_tensor_h2d=1869.296 ms`), inflating the mean startup.
- Interpretation:
  - the request-path advantage is stable; residual cold variance is now a preload upload consistency problem, not a decoder compute bottleneck.
External Cold All-Backend Repeatability (2026-02-19, 3 runs, GPU-Convert + Host-Prefetch Fix)
- Setup:
  - same G5 host, same model family (`Qwen 3B`), token parity (`max_tokens=48`), runtime preload enabled.
  - backends: runtime + PyTorch + vLLM + Ollama.
  - runtime cold change: host-page `MADV_WILLNEED` prefetch for large tensor source ranges (`TRENI_TENSOR_HOST_PREFETCH=1`).
- 3-run means:
  - runtime: startup `1003.836 ms`, TTFT `5.130 ms`, request full `316.403 ms`, cold-total `1320.240 ms`.
  - vLLM/runtime ratios: TTFT `16.537x`, request full `3.896x`, cold-total `21.918x`.
  - PyTorch/runtime ratios: TTFT `108.567x`, request full `7.341x`, cold-total `14.601x`.
  - Ollama/runtime ratios: TTFT `514.414x`, request full `9.471x`, cold-total `3.029x`.
- Runtime-only stability compare (5 runs before vs after host-prefetch):
  - startup max: `3006.388 -> 1003.627 ms`.
  - cold-total max: `3324.212 -> 1322.338 ms`.
  - `decoder_tensor_h2d` max: `1869.296 -> 120.671 ms`.
  - `decoder_tensor_upload` max: `1877.485 -> 128.777 ms`.
- Interpretation:
  - cold preload upload variance is effectively removed in this sweep and the runtime's request-path lead remains intact.
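The host-prefetch idea behind `TRENI_TENSOR_HOST_PREFETCH=1` is, at its core, an `madvise(MADV_WILLNEED)` hint on the mapped tensor file so page faults are paid before the upload loop rather than inside it. A minimal Linux-oriented sketch (the file and range are stand-ins; the runtime's real ranges come from its tensor index):

```python
import mmap
import os
import tempfile

# Create a stand-in "tensor blob" file for the sketch.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * (mmap.PAGESIZE * 4))
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
    if hasattr(mmap, "MADV_WILLNEED"):      # Linux; guarded no-op elsewhere
        mm.madvise(mmap.MADV_WILLNEED)      # async readahead hint to the kernel
    # A later upload loop then reads mostly-resident pages instead of faulting.
    first_bytes = mm[:16]
    mm.close()
finally:
    os.close(fd)
    os.unlink(path)
```

The hint is advisory only, which matches the observation above that it removes variance (fault timing) rather than changing the steady-state request path.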
Phase 3 Agentic Loops (Canonical G5 Baseline, 2026-02-19, 3 seeds)
- Internal success rate mean: `1.0000`.
- External success rate mean: `0.9006`.
- External/internal latency ratio mean: `16.0603x`.
- External/internal steps ratio mean: `1.8147x`.
- Scenario signal (external success mean):
  - retrieval correction: `1.0000`
  - tool-state adaptation: `0.7417`
  - confidence-gated branching: `1.0000`
Phase 3 Agentic Loops (Canonical G5 Stress, 2026-02-19, 3 seeds)
- Stress profile: tool fail every `9`, timeout every `11` (`1.1s` sleep), controller timeout `0.35s`, retries `2`.
- Internal success rate mean: `1.0000`.
- External success rate mean: `0.8782`.
- External/internal latency ratio mean: `77.1703x`.
- External/internal steps ratio mean: `1.8240x`.
- Scenario signal (external success mean):
  - retrieval correction: `1.0000`
  - tool-state adaptation: `0.6833`
  - confidence-gated branching: `1.0000`
Phase 4 Kickoff (Lambda A100/H100, Phase 3 Canonical, 2026-02-20)
- A100 (3 baseline seeds + 3 stress seeds):
  - baseline: internal success `1.0000`, external success `0.9006`, external/internal latency `16.8790x`.
  - stress: internal success `1.0000`, external success `0.8782`, external/internal latency `77.8613x`.
- H100 (3 baseline seeds + 3 stress seeds):
  - baseline: internal success `1.0000`, external success `0.9006`, external/internal latency `18.6933x`.
  - stress: internal success `1.0000`, external success `0.8782`, external/internal latency `72.7407x`.
- Interpretation:
  - Track C behavior is hardware-stable: the internal path keeps perfect success while external remains weaker on tool-state adaptation and pays a large stress-amplified latency penalty.
  - This is now paired with completed Phase 2 + C2 reruns on the same Lambda hardware classes.
Phase 4 Full Reruns (Lambda A100/H100, Phase 2 + C2, 2026-02-20)
- A100:
  - cold first-hit summary: startup `1002.708 ms`, TTFT `29.657 ms`, full `32.008 ms`.
  - warm request latency: mean `10.356 ms`, p99 `14.536 ms`.
  - routing matrix overall: external/internal `2.4300x`, external error `0.0347`, internal error `0.0000`.
  - C2 runtime-native deltas: baseline `+0.2308/+0.2308` (internal/external), stress `+0.2308/+0.2212`.
- H100:
  - cold first-hit summary: startup `1004.890 ms`, TTFT `56.944 ms`, full `62.064 ms`.
  - warm request latency: mean `18.491 ms`, p99 `24.944 ms`.
  - routing matrix overall: external/internal `2.3972x`, external error `0.0347`, internal error `0.0000`.
  - C2 runtime-native deltas: baseline `+0.2308/+0.2308` (internal/external), stress `+0.2308/+0.2212`.
- Interpretation:
  - Cross-hardware direction stays consistent: the internal path remains more stable/reliable while external routing shows stress-amplified latency and error behavior.
  - Runtime-native uncertainty deltas hold their positive baseline/stress direction on both A100 and H100 in this calibrated setup.
Paper Package (2026-02-20)
- Generated outputs:
  - `/benchmarks/paper_package/latest/package_summary.json`
  - `/benchmarks/paper_package/latest/paper_package.md`
  - `/benchmarks/paper_package/latest/tables/*.csv`
  - `/benchmarks/paper_package/latest/manuscript/figure_manifest.json`
  - `/benchmarks/paper_package/latest/manuscript/captions.md`
  - `/benchmarks/paper_package/latest/manuscript/claims.md`
  - `/benchmarks/paper_package/latest/manuscript/figures/*.mmd`
- Scope:
  - consolidates canonical G5 + Lambda A100 + Lambda H100 into paper-ready tables for:
    - Phase 2 cold/hot summary
    - routing matrix summary
    - C2 runtime-native deltas
    - Phase 3 loops baseline/stress
    - external-cold backend comparison (G5)
  - provides manuscript-ready figure/caption/claim templates with direct table provenance.
Phase 3 Uncertainty Ablation (Baseline Matrix, 2026-02-19, runs=8)
Arms:
- `int_on_ext_on`
- `int_off_ext_on`
- `int_on_ext_off`
- `int_off_ext_off`
Sources:
- `normalized_logprob`
- `raw_logit_margin`
- `hybrid`
- `runtime_native` (canonical rerun now complete)
Observed deltas:
- Internal success delta (uncertainty on vs off, external fixed on): `+0.2308` across all three sources.
- External success delta (uncertainty on vs off, internal fixed on): `+0.2308` across all three sources.
- Direction is stable across source definitions, while latency ratios vary by source.
Interpretation:
- The loop benchmark now has first direct evidence that uncertainty-aware branching contributes to task success, not only narrative plausibility.
- This baseline matrix uses harness-level synthetic uncertainty signals.
- The runtime-native uncertainty rerun is now published and aligns directionally with this result.
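The on/off deltas above compare the four arms pairwise: the internal delta holds the external arm on and toggles internal uncertainty, and vice versa. A bookkeeping sketch (the per-arm success values are invented solely to illustrate the `+0.2308` delta shape; they are not the measured arm results):

```python
# Success rates per arm; all values here are hypothetical illustrations.
success = {
    "int_on_ext_on":  {"internal": 1.0000, "external": 0.9006},
    "int_off_ext_on": {"internal": 0.7692, "external": 0.9006},
    "int_on_ext_off": {"internal": 1.0000, "external": 0.6698},
}

# Internal delta: uncertainty on vs off with the external arm fixed on.
internal_delta = (success["int_on_ext_on"]["internal"]
                  - success["int_off_ext_on"]["internal"])
# External delta: uncertainty on vs off with the internal arm fixed on.
external_delta = (success["int_on_ext_on"]["external"]
                  - success["int_on_ext_off"]["external"])
print(round(internal_delta, 4), round(external_delta, 4))
```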
Phase 3 Uncertainty Ablation (Repeatability + Stress, 2026-02-19, 3 seeds each)
Baseline means:
- Internal uncertainty success delta: `+0.2308` (all sources).
- External uncertainty success delta: `+0.2308` (all sources).
Stress means:
- Internal uncertainty success delta: `+0.2308` (all sources).
- External uncertainty success delta: `+0.2212` (all sources).
Stress minus baseline:
- Internal uncertainty delta change: `0.0000`.
- External uncertainty delta change: `-0.0096`.
Interpretation:
- Uncertainty-aware branching gains are stable under both baseline and injected timeout/failure stress in this harness.
- This now has a canonical runtime-native corroboration run.
Phase 3 Uncertainty Ablation (Runtime-Native Canonical Rerun, 2026-02-19, 3 seeds each, Superseded)
- Pre-fix issue:
  - the greedy decode uncertainty path emitted flat zeros (`mean_logprob=0`, `mean_entropy=0`), making runtime-native C2 non-informative.
- Fix:
  - the greedy sampling kernel now computes logprob + entropy from logits using log-sum-exp.
- Baseline runtime-native uncertainty deltas:
  - internal: `+0.1026`
  - external: `+0.1155`
- Stress runtime-native uncertainty deltas:
  - internal: `+0.2308`
  - external: `+0.2212`
- Interpretation:
  - this run established runtime-native wiring, but part of the seed set later showed probe fallback contamination.
  - use the 2026-02-20 quality-gated rerun below for current canonical interpretation.
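The log-sum-exp fix described above is the standard way to get a stable log-softmax out of raw logits. A reference sketch of what the greedy path now computes (scalar Python for clarity, not the kernel code):

```python
import math

def greedy_logprob_entropy(logits):
    """Stable log-softmax via log-sum-exp; returns the greedy (argmax)
    token's logprob and the full-distribution entropy in nats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    logprobs = [x - lse for x in logits]
    greedy_logprob = max(logprobs)  # argmax token has the largest logprob
    entropy = -sum(math.exp(lp) * lp for lp in logprobs)
    return greedy_logprob, entropy

lp, ent = greedy_logprob_entropy([2.0, 1.0, 0.1])
assert lp < 0.0 and ent > 0.0  # no more flat zeros on the greedy path
```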
Phase 3 Uncertainty Ablation (Runtime-Native Quality-Gated Rerun, 2026-02-20, 3 seeds each)
- Rerun setup:
  - source: `runtime_native`
  - seeds: `7/11/19`
  - baseline + stress
  - runtime fast probe config: `TRENI_DEMO_LAYERS=2`
  - client consumes `awareness.generation` first (legacy `uncertainty` fallback preserved)
- Quality gate:
  - all runtime-native arm artifacts in this rerun have non-zero requests/ok and `fallback=0`, `errors=0`.
- Clean runtime-native uncertainty deltas:
  - baseline internal: `-0.1538`
  - baseline external: `-0.1217`
  - stress internal: `-0.1538`
  - stress external: `-0.1089`
- Interpretation:
  - with clean zero-fallback runtime-native probes, this awareness3 rerun showed uncertainty-on was harmful in this harness.
  - runtime-native transport/response wiring stayed validated and triggered the calibration pass below.
Phase 3 Uncertainty Ablation (Runtime-Native Calibrated Rerun calib1, 2026-02-20, 3 seeds each)
- Calibration update:
- runtime-native decision confidence now uses calibrated generation confidence (floor/ceil scaling) blended with prior confidence and optional route confidence.
- calibration knobs are now forwarded through the ablation runner for reproducible reruns.
- Rerun setup:
  - source: `runtime_native`
  - seeds: `7/11/19`
  - baseline + stress
  - calibration params: prior weight `0.75`, confidence floor `0.10`, confidence ceil `0.35`, route blend `0.10`
- Quality gate:
  - all runtime-native arm artifacts in this rerun have non-zero requests/ok and `fallback=0`, `errors=0`.
- Calibrated runtime-native uncertainty deltas:
  - baseline internal: `+0.1539`
  - baseline external: `+0.1058`
  - stress internal: `+0.1539`
  - stress external: `+0.1154`
- Interpretation:
  - calibrated runtime-native uncertainty recovers positive on/off gains in this harness under both baseline and stress.
  - C2 is now re-locked for this setup; optional next work is region-pinned commercial multi-hop controls (and higher-N only where needed).
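A minimal sketch of a floor/ceil-plus-blend calibration of the kind described; the exact formula and function name are assumptions, and only the knob values come from the calib1 rerun setup above:

```python
def calibrated_decision_confidence(gen_conf, prior_conf, route_conf=None,
                                   prior_weight=0.75, floor=0.10, ceil=0.35,
                                   route_blend=0.10):
    """Hypothetical floor/ceil-scaled generation confidence blended with
    prior and optional route confidence (knob values from the calib1 rerun)."""
    # Rescale raw generation confidence relative to the calibrated band, clamped.
    scaled = min(max((gen_conf - floor) / (ceil - floor), 0.0), 1.0)
    # Weighted blend with the prior confidence.
    conf = prior_weight * prior_conf + (1.0 - prior_weight) * scaled
    # Optional route-confidence mix-in.
    if route_conf is not None:
        conf = (1.0 - route_blend) * conf + route_blend * route_conf
    return conf
```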
External Cold Comparison (G5, 2026-02-18, Qwen 3B family)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 1003.537 ms | 1108.979 ms | 1339.459 ms | 2112.516 ms | 2342.996 ms |
| pytorch_transformers | - | 528.483 ms | 2288.516 ms | 6965.227 ms | 8725.259 ms |
| vllm | 24032.203 ms | 51.763 ms | 1036.815 ms | 24083.966 ms | 25069.018 ms |
| ollama (GGUF) | 1002.695 ms | 2168.902 ms | 2527.411 ms | 3171.597 ms | 3530.106 ms |
Runtime-normalized (lower is better for runtime):
- PyTorch cold total first response: `3.724x` runtime.
- vLLM cold total first response: `10.7x` runtime.
- Ollama cold total first response: `1.507x` runtime.
Interpretation:
- vLLM request-path TTFT is fastest once healthy, but startup dominates total cold in this run.
- Runtime is strongest on end-to-end cold total in this specific setup.
- Ollama is quantized GGUF and kept with caveat tags (not precision-equivalent to BF16 paths).
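The runtime-normalized multipliers are just each backend's cold-total first response divided by the runtime's; a quick check against the table values:

```python
# Cold-total first-response times (ms) from the canonical G5 table above.
cold_total_ms = {
    "runtime": 2342.996,
    "pytorch_transformers": 8725.259,
    "vllm": 25069.018,
    "ollama": 3530.106,
}
# Each backend as a multiple of runtime (lower is better for runtime).
ratios = {
    name: round(ms / cold_total_ms["runtime"], 3)
    for name, ms in cold_total_ms.items()
}
```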
External Cold Comparison (G5, 2026-02-18, preload + tokenizer cache)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 2004.735 ms | 91.596 ms | 271.346 ms | 2096.331 ms | 2276.081 ms |
| pytorch_transformers | - | 522.795 ms | 2252.382 ms | 6644.737 ms | 8374.324 ms |
| vllm | 27036.682 ms | 51.725 ms | 1035.826 ms | 27088.407 ms | 28072.508 ms |
| ollama (GGUF) | 1002.508 ms | 2182.542 ms | 2538.609 ms | 3185.050 ms | 3541.117 ms |
Runtime-normalized (lower is better for runtime):
- vLLM request full latency: `3.817x` runtime.
- vLLM cold total first response: `12.334x` runtime.
- Remaining gap: vLLM request TTFT is still lower (`51.725 ms` vs runtime `91.596 ms`).
Important caveat:
- The run above was before request `max_tokens` was wired through runtime inference (runtime still generated 4 tokens there).
External Cold Comparison (G5, 2026-02-18, token parity fixed at 48, pre decoder fix)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 2004.203 ms | 91.207 ms | 2518.142 ms | 2095.410 ms | 4522.345 ms |
| pytorch_transformers | - | 501.946 ms | 2244.327 ms | 9530.450 ms | 11272.831 ms |
| vllm | 27036.248 ms | 51.310 ms | 1075.404 ms | 27087.558 ms | 28111.652 ms |
| ollama (GGUF) | 1002.560 ms | 2197.797 ms | 2556.652 ms | 3200.357 ms | 3559.212 ms |
Interpretation:
- Runtime still wins cold-total first response vs vLLM (`6.216x` better).
- Runtime request-path TTFT and full latency are still slower than vLLM at the equal 48-token budget.
- Residual bottleneck is decoder per-token step cost (no longer tensor upload in this mode).
External Cold Comparison (G5, 2026-02-18, token parity + decoder/sampling fix)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 2004.759 ms | 5.022 ms | 311.289 ms | 2009.781 ms | 2316.048 ms |
| pytorch_transformers | - | 515.597 ms | 2291.854 ms | 6461.304 ms | 8237.561 ms |
| vllm | 24036.762 ms | 52.995 ms | 1094.517 ms | 24089.757 ms | 25131.279 ms |
| ollama (GGUF) | 1002.630 ms | 2184.383 ms | 2543.219 ms | 3187.013 ms | 3545.849 ms |
Interpretation:
- Runtime now leads vLLM in request-path TTFT (`10.553x` faster) and full latency (`3.516x` faster) on this G5 token-parity run.
- Runtime also remains much lower on cold-total first response (`10.851x` better vs vLLM).
- The main measured bottleneck in the prior parity run (sampling + per-step host sync) is no longer dominant.
- Initial repeatability set (`2026-02-18`) kept the same direction: mean speedups of `10.333x` TTFT, `3.380x` full latency, `10.688x` cold-total first response.
- Superseded by the `2026-02-19` rerun and all-backend repeatability with a stronger runtime advantage.
Cold TTFT Before vs After Index Cache (3-run means, G5)
| Model | Before | After | Speedup |
|---|---|---|---|
| qwen | 27574.564 ms | 1774.951 ms | 15.535x |
| donut | 67360.388 ms | 572.485 ms | 117.663x |
| bart | 77520.798 ms | 743.652 ms | 104.243x |
| minilm | 23.342 ms | 22.698 ms | 1.028x |
Cold TTFT: clean3 vs clean4 (3-run means, G5)
| Model | clean3 | clean4 | Improvement |
|---|---|---|---|
| qwen | 1411.831 ms | 1100.044 ms | 22.1% lower |
| donut | 619.499 ms | 150.322 ms | 75.7% lower |
| bart | 776.545 ms | 125.011 ms | 83.9% lower |
| minilm | 23.421 ms | 22.621 ms | 3.4% lower |
Dominant Cold Stages After clean4
- `model_tensor_index_build` dropped to `~1-2.3 ms` across models (down ~99.6% vs clean3 for Bart/Donut).
- Qwen is still dominated by `decoder_tensor_upload` (`~1015 ms` mean).
- Donut and Bart are now mostly in decoder setup/upload and no longer index-build bound.
Reverted Experiment (Transparency)
- Tried an async pinned conversion-buffer upload strategy after clean4.
- Result: Qwen `decoder_tensor_upload` regressed to `~1419 ms` and TTFT regressed by `~37%`.
- Decision: reverted that path; clean4 remains the accepted cold-path baseline.
- Follow-up validation run set (`clean7`) matched clean4 within run noise (Qwen TTFT delta `-0.16%`).
What Was Actually Tested
- Baseline (Python/dependency path) runs on T4 and G5.
- Runtime cold and warm request-path benchmarks.
- True runtime-reported TTFT (not SSE first-event proxy).
- Internal-vs-external routing comparison on matched tasks.
- Internal-vs-external routing failure-amplification stress run with injected timeouts/failures.
- Internal-vs-external routing matrix expansion (baseline + 5 stress profiles on G5).
- Internal-vs-external routing cross-host pilot (baseline + stress via SSH tunnel to G5).
- Internal-vs-external routing split-host matrix (CPU router host + GPU runtime host, 6 profiles).
- Week 3 numerical parity checks (strict mode; donut intentionally skipped in parity harness).
- Phase 3 loop-capability canonical G5 benchmark (baseline profile, 3 seeds).
- Phase 3 loop-capability canonical G5 stress benchmark (failure/timeout injection + retries, 3 seeds).
- Qwen cold upload GPU-convert on/off ablation (same host, same harness, env-toggle only).
- External-cold runtime-only GPU-convert on/off ablation (preload enabled, matched token budget).
- Runtime-vLLM external-cold repeatability rerun (3 runs) after vLLM env restore.
- External-cold all-backend repeatability set (runtime + PyTorch + vLLM + Ollama, 3 runs).
- Runtime-only cold stability sweep (5 runs) with preload upload sub-stage inspection.
- Runtime host-prefetch cold fix rerun: runtime-only 5-run stability sweep.
- Runtime host-prefetch cold fix rerun: external-cold all-backend repeatability (3 runs).
- Lambda A100 full rerun: Phase 2 cold/hot runtime set + 6-profile routing matrix.
- Lambda H100 full rerun: Phase 2 cold/hot runtime set + 6-profile routing matrix.
- Lambda A100 C2 runtime-native calibrated set: baseline+stress (3 seeds each).
- Lambda H100 C2 runtime-native calibrated set: baseline+stress (3 seeds each).
- Internet multi-hop commercial matrix (Fly hops + OpenAI `gpt-5.2`, 3 profiles).
- Internet multi-hop commercial repeatability matrix (Fly hops + OpenRouter `openai/gpt-5.2`, `runs=3`).
- Internet multi-hop repeatability matrix (Fly hops + OpenRouter `anthropic/claude-sonnet-4.6`, `runs=3`).
- Local-control matrix (no Fly scheduler path) for OpenAI `gpt-5.2` and OpenRouter `anthropic/claude-sonnet-4.6`.
- Higher-N local-control rerun (`runs=8`/profile) for the same OpenAI/OpenRouter Sonnet pair.
- Task-family parity split rerun (`model_only` + `tool_only`, `runs=8`) for OpenAI + OpenRouter Sonnet.
What Is Not Finished Yet
- Optional: add region-pinned/Fly-to-Fly control runs to reduce provider-path confounding in OpenRouter comparisons.
Raw Artifact Links
- Cold index-cache summary JSON
- Cold index-cache summary Markdown
- Routing comparison JSON
- Routing failure stress JSON
- Routing failure stress report JSON
- Routing failure stress report Markdown
- Routing matrix report JSON
- Routing matrix report Markdown
- Routing matrix baseline profile JSON
- Routing matrix mixed-aggressive profile JSON
- Routing cross-host baseline profile JSON
- Routing cross-host mild-timeout profile JSON
- Routing cross-host stress profile JSON
- Routing cross-host matrix JSON
- Routing cross-host matrix Markdown
- Routing split-host baseline profile JSON
- Routing split-host mild-timeout profile JSON
- Routing split-host mixed-aggressive profile JSON
- Routing split-host matrix JSON
- Routing split-host matrix Markdown
- Routing internet multi-hop matrix JSON (OpenAI, repeatability)
- Routing internet multi-hop matrix Markdown (OpenAI, repeatability)
- Routing internet multi-hop matrix JSON (OpenRouter, repeatability)
- Routing internet multi-hop matrix Markdown (OpenRouter, repeatability)
- Routing internet multi-hop matrix JSON (OpenRouter Claude Sonnet 4.6, repeatability)
- Routing internet multi-hop matrix Markdown (OpenRouter Claude Sonnet 4.6, repeatability)
- Routing local-control matrix JSON (OpenAI)
- Routing local-control matrix Markdown (OpenAI)
- Routing local-control matrix JSON (OpenRouter Claude Sonnet 4.6)
- Routing local-control matrix Markdown (OpenRouter Claude Sonnet 4.6)
- Routing local-control matrix JSON (OpenAI, higher-N)
- Routing local-control matrix Markdown (OpenAI, higher-N)
- Routing local-control matrix JSON (OpenRouter Claude Sonnet 4.6, higher-N)
- Routing local-control matrix Markdown (OpenRouter Claude Sonnet 4.6, higher-N)
- Routing task-family run JSON (OpenAI model-only, higher-N)
- Routing task-family run JSON (OpenAI tool-only, higher-N)
- Routing task-family run JSON (OpenRouter Sonnet model-only, higher-N)
- Routing task-family run JSON (OpenRouter Sonnet tool-only, higher-N)
- Routing internet multi-hop matrix JSON (OpenAI, initial exploratory)
- Routing internet multi-hop matrix Markdown (OpenAI, initial exploratory)
- Routing internet multi-hop matrix JSON (OpenRouter, initial exploratory)
- Routing internet multi-hop matrix Markdown (OpenRouter, initial exploratory)
- Qwen cold GPU-convert off JSON
- Qwen cold GPU-convert on JSON
- Qwen cold GPU-convert ablation summary JSON
- Qwen cold GPU-convert ablation summary Markdown
- External-cold runtime-only GPU-convert off JSON
- External-cold runtime-only GPU-convert on JSON
- External-cold runtime-only GPU-convert ablation summary JSON
- External-cold runtime-only GPU-convert ablation summary Markdown
- Runtime-vLLM rerun JSON
- Runtime-vLLM repeatability r1 JSON
- Runtime-vLLM repeatability r2 JSON
- Runtime-vLLM repeatability r3 JSON
- Runtime-vLLM repeatability summary JSON
- Runtime-vLLM repeatability summary Markdown
- External-cold all-backend repeatability run1 JSON
- External-cold all-backend repeatability run2 JSON
- External-cold all-backend repeatability run3 JSON
- External-cold all-backend repeatability summary JSON
- External-cold all-backend repeatability summary Markdown
- Runtime cold-stability sweep summary JSON
- Runtime cold-stability sweep summary Markdown
- Host-prefetch all-backend repeatability run1 JSON
- Host-prefetch all-backend repeatability run2 JSON
- Host-prefetch all-backend repeatability run3 JSON
- Host-prefetch all-backend repeatability summary JSON
- Host-prefetch all-backend repeatability summary Markdown
- Host-prefetch runtime stability compare JSON
- Host-prefetch runtime stability compare Markdown
- Phase 3 canonical summary JSON
- Phase 3 canonical summary Markdown
- Phase 3 realistic-v1 summary JSON
- Phase 3 realistic-v1 summary Markdown
- Phase 3 uncertainty ablation summary JSON
- Phase 3 uncertainty ablation summary Markdown
- Phase 3 uncertainty baseline-vs-stress comparison JSON
- Phase 3 uncertainty baseline-vs-stress comparison Markdown
- Phase 3 uncertainty runtime-native canonical comparison JSON
- Phase 3 uncertainty runtime-native canonical comparison Markdown
- Phase 3 uncertainty runtime-native awareness3 comparison JSON
- Phase 3 uncertainty runtime-native awareness3 comparison Markdown
- Phase 3 uncertainty runtime-native calib1 comparison JSON
- Phase 3 uncertainty runtime-native calib1 comparison Markdown
- Phase 4 Lambda A100 Phase 3 summary JSON
- Phase 4 Lambda A100 Phase 3 summary Markdown
- Phase 4 Lambda H100 Phase 3 summary JSON
- Phase 4 Lambda H100 Phase 3 summary Markdown
- Phase 4 Lambda A100 Phase 2 cold JSON
- Phase 4 Lambda A100 Phase 2 warm JSON
- Phase 4 Lambda A100 routing matrix JSON
- Phase 4 Lambda A100 routing matrix Markdown
- Phase 4 Lambda A100 C2 compare JSON
- Phase 4 Lambda A100 C2 compare Markdown
- Phase 4 Lambda H100 Phase 2 cold JSON
- Phase 4 Lambda H100 Phase 2 warm JSON
- Phase 4 Lambda H100 routing matrix JSON
- Phase 4 Lambda H100 routing matrix Markdown
- Phase 4 Lambda H100 C2 compare JSON
- Phase 4 Lambda H100 C2 compare Markdown
- Phase 4 Paper Package summary JSON
- Phase 4 Paper Package markdown
- Phase 4 Paper Package tables directory
- Phase 4 Paper Package manuscript figure manifest
- Phase 4 Paper Package manuscript captions
- Phase 4 Paper Package manuscript claims
- Phase 3 baseline seed s7 JSON
- Phase 3 baseline seed s11 JSON
- Phase 3 baseline seed s19 JSON
- Phase 3 stress seed s7 JSON
- Phase 3 stress seed s11 JSON
- Phase 3 stress seed s19 JSON
- Phase 3 realistic-v1 baseline seed s7 JSON
- Phase 3 realistic-v1 baseline seed s11 JSON
- Phase 3 realistic-v1 baseline seed s19 JSON
- Phase 3 realistic-v1 stress seed s7 JSON
- Phase 3 realistic-v1 stress seed s11 JSON
- Phase 3 realistic-v1 stress seed s19 JSON
- Phase 3 realistic-v1 uncertainty ablation baseline JSON
- Phase 3 realistic-v1 uncertainty ablation stress JSON
- Phase 3 realistic-v1 uncertainty comparison JSON
- True TTFT summary JSON
- Cold decomposition clean4 run 1 JSON
- Cold decomposition clean4 run 2 JSON
- Cold decomposition clean4 run 3 JSON
- Cold decomposition clean4 summary JSON
- Cold decomposition clean4 summary Markdown
- Cold decomposition clean7 run 1 JSON
- Cold decomposition clean7 run 2 JSON
- Cold decomposition clean7 run 3 JSON
- Cold decomposition clean7 summary JSON
- Cold decomposition clean7 summary Markdown
- Warm sanity clean7 JSON
- AWS G5 attention backend A/B first-order JSON
- AWS G5 attention backend A/B first-order Markdown
- AWS G5 attention backend A/B reverse-order canonical JSON
- AWS G5 attention backend A/B reverse-order canonical Markdown
- External cold canonical JSON
- External cold canonical Markdown
- External cold preload/tokenizer-cache JSON
- External cold preload/tokenizer-cache Markdown
- External cold token-parity JSON
- External cold token-parity Markdown
- External cold token-parity decoder-fix JSON
- External cold token-parity decoder-fix Markdown
2026-03-08
Qwen3.5 Prompt Parity And Remaining Fidelity Gap
- Confirmed on AWS that runtime Qwen3.5 prompt IDs match HF chat-template token IDs exactly on a failing IFEval case.
- Ran a four-way prefill A/B (`fast`, `no_linear`, `no_full`, `tokenwise`) and all four produced the same first token on that case.
- Conclusion: the remaining IFEval quality issue is not caused by prompt serialization or the new batched prefill path.
Step-0 Logit Comparison
- Runtime step-0 top-k on the failing prompt ranked `The` first and `I` second.
- vLLM top logprobs on the same prompt exposed `I` and `The` as tied top candidates.
- Interpretation: there is still a small decode/logit-distribution drift on the Qwen3.5 lane, even after prompt parity was confirmed.
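A drift check of this kind reduces to comparing top-k candidate lists and greedy picks from the two backends on the same prompt; a minimal sketch (function names are illustrative, not the harness's API):

```python
import numpy as np

def topk_ids(logits, k=5):
    """Token IDs of the k highest logits, best first."""
    return [int(i) for i in np.argsort(logits)[::-1][:k]]

def first_token_matches(logits_a, logits_b):
    """True when two backends would pick the same greedy first token."""
    return int(np.argmax(logits_a)) == int(np.argmax(logits_b))
```

Near-ties like `I` vs `The` show up as the same IDs appearing in both top-k lists but in swapped order.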
IFEval Repair Loop Progress
- Added evaluator-guided IFEval repair messaging in `scripts/phase5_awareness_realbench.py`.
- Repair loop now uses failed-check feedback instead of only a generic uncertainty retry.
- Added explicit repair hints for:
- forbidden words
- exact repeated text
- JSON-only output
- markdown title lines
- word-count limits
- two-response formatting
- 3-seed IFEval-only repair sweep (`phase5-q35-ifeval-aware-repair-ab3`) produced:
  - runtime `arm_a_control`: `0.361111` score, `1317.056 ms`
  - runtime `arm_b_awareness_retry`: `0.402778` score, `2895.902 ms`
  - runtime `arm_c_awareness_consistency`: `0.402778` score, `2910.412 ms`
  - vLLM `arm_a_control`: `0.430555` score, `3034.889 ms`
- Interpretation:
  - evaluator-guided repair improves runtime IFEval quality materially over control
  - the runtime repaired loop is still below vLLM control on IFEval quality
  - the runtime repaired loop remains slightly faster than vLLM control on this slice
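The evaluator-guided repair messaging amounts to mapping failed checks to targeted hints instead of issuing a generic retry. A sketch, with hypothetical check IDs and hint templates (the real mapping lives in `scripts/phase5_awareness_realbench.py` and may use different names):

```python
# Hypothetical IFEval check IDs -> repair hint templates.
REPAIR_HINTS = {
    "forbidden_words": "Remove every forbidden word: {detail}.",
    "exact_repeat": "Your answer must contain the exact text: {detail}.",
    "json_only": "Reply with valid JSON only, with no prose outside it.",
    "markdown_title": "Start the answer with a markdown title line.",
    "word_count": "Stay within the word-count limit: {detail}.",
    "two_responses": "Give exactly two responses in the required format.",
}

def build_repair_message(failed_checks):
    """Turn evaluator failures into one targeted repair instruction."""
    hints = [
        REPAIR_HINTS.get(check["id"], "Fix the failed constraint.")
        .format(detail=check.get("detail", ""))
        for check in failed_checks
    ]
    return "Previous answer failed these checks:\n- " + "\n- ".join(hints)
```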
2026-03-10
Stub Audit And Scope Lock
- Direct Phase 5 runtime/vLLM comparisons are not using Hermes stub tools.
- The only live wrapper-level stub issue found in the same-VM harness was in `/Users/andrewcorrea/treni/scripts/hermes_same_vm_mvp.py`, where optional `terminal_tool`/`browser_tool` cleanup shims could mask real imports under partial Hermes availability.
- That wrapper path is now fixed to prefer real Hermes imports and only install a stub when the import genuinely fails.
- Phase 3 remains partially synthetic by design:
  - `/Users/andrewcorrea/treni/scripts/phase3_agentic_loop_benchmark.py` still exposes a `synthetic` profile.
  - `realistic_v1` reduces stub bias with file-backed fixtures, but it is still not the same lane as direct runtime/vLLM benchmarking.
Qwen Family Compatibility Re-Proved On AWS
- Rebuilt the AWS runtime from the corrected `/Users/andrewcorrea/treni/monolith/main.c` and `/Users/andrewcorrea/treni/monolith/models/decoder.cu`.
- `qwen35` (Qwen/Qwen3.5-0.8B) direct inference is restored on AWS and again returns `inference.used=true`.
- `qwen35_4b` (Qwen/Qwen3.5-4B) now performs real inference on the same A10G host when launched with the correct `runtime_pool_mb=15360`.
- Packed and booted a fresh `Qwen/Qwen2.5-0.5B-Instruct` container (`monolith_qwen25_0p5b.bin`) to re-prove backward compatibility on the live host.
- `qwen35_9b` aliasing remains wired in `/Users/andrewcorrea/treni/scripts/qwen_runtime_env.py`, but the current AWS box still has no packed `monolith_qwen35_9b.bin` and is not the intended proof GPU for 9B.
AWS Storage Update
- The AWS Hugging Face cache is mostly active model state, not arbitrary duplicates:
  - `Qwen/Qwen3.5-4B`
  - `Qwen/Qwen3.5-0.8B`
  - `Qwen/Qwen3-ASR-0.6B`
  - `Qwen/Qwen3-VL-Embedding-2B`
  - `Qwen/Qwen3-VL-Reranker-2B`
- Removed stale Whisper fallback cache copies after keeping Qwen ASR as the primary STT path.
- Later host cleanup removed the stale `monolith_qwen05*` artifacts and the temporary `Qwen2.5-0.5B-Instruct` host cache/artifacts after backward compatibility had already been re-proven.
- AWS root disk moved from roughly `97%` used with about `3.7G` free to roughly `94%` used with about `6.5G` free.
Live Hermes Tooling And PDF/RAG Validation
- Native raw-PDF ingestion is now live in the worker:
  - `/Users/andrewcorrea/treni/scripts/treni_local_tool_worker.py` accepts `paths=[...]` and extracts PDF text natively via `pdftotext` when available, or `pypdf` as a fallback.
  - a live AWS proof ingested `/home/ubuntu/treni/benchmarks/same_vm_mvp/data/manual-pncp-api.pdf` directly into the local RAG store.
- Same-VM Hermes wrapper tool registration is now healthier:
  - the wrapper now loads the real Hermes `tools` package before installing any optional shims, so file/code tools are no longer masked by a synthetic top-level package.
  - live loaded-tool sets now include real `read_file`, `write_file`, `search_files`, `patch`, and `execute_code`.
- Live single-tool Hermes probes on AWS now show:
  - `qwen35` (0.8B) successfully uses real `samevm_rag_search` against the raw-PDF-ingested local RAG store.
  - `qwen35` successfully calls real `execute_code`; the current issue is model-generated code quality, not tool availability.
  - `qwen35` also calls real `samevm_sqlite_query`, but still tends to emit malformed SQL unless the prompt is tightly constrained.
  - `qwen35_4b` still lags `0.8B` on exact-output and tool-call contract fidelity in the current same-VM harness.
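The CLI-first extraction with a `pypdf` fallback described above can be sketched as follows; this is an illustration of the behavior, not the worker's actual implementation in `scripts/treni_local_tool_worker.py`:

```python
import shutil
import subprocess

def extract_pdf_text(path):
    """Prefer the pdftotext CLI; fall back to pypdf when the CLI is absent."""
    if shutil.which("pdftotext"):
        # "-" as the output argument sends extracted text to stdout.
        result = subprocess.run(
            ["pdftotext", path, "-"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    from pypdf import PdfReader  # third-party fallback dependency
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Checking `shutil.which` before shelling out keeps the fallback decision explicit instead of catching a `FileNotFoundError` from `subprocess`.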
Live Qwen Speed Snapshot
- Current live speed probe on AWS:
  - `qwen35` (0.8B): `128` completion tokens in about `1111.4 ms`, `ttft_ms` ≈ `103.1`, `decode_tps` ≈ `115.37`
  - `qwen35_4b` (4B): `128` completion tokens in about `3313.3 ms`, `ttft_ms` ≈ `170.7`, `decode_tps` ≈ `38.64`
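For readers reproducing the snapshot, the throughput figure is approximately completion tokens over total inference time; this sketch matches the 4B numbers to within rounding, though the probe's exact definition (e.g. how it treats TTFT) may differ slightly:

```python
def decode_tps(completion_tokens, infer_ms):
    """Approximate decode throughput in tokens/second.

    Assumed definition: completion tokens divided by total inference time.
    """
    return completion_tokens / (infer_ms / 1000.0)
```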
4B Same-VM Promotion Check And Lambda 9B Capacity Sweep
- Same-VM `qwen35_4b` parity debugging on AWS found the real runtime bug.
- Root cause:
  - in `/Users/andrewcorrea/treni/monolith/models/decoder.cu`, the cached linear-attention step path repeated key heads before the key depthwise-conv update when `q_proj_dim != attn_dim`
  - `Qwen3.5-4B` hits that shape regime, so first-token decode drifted even though tokenizer parity and prompt format were already correct
- Hugging Face on the same AWS host proved the model itself was fine:
  - exact-output prompts behaved correctly
  - tool-call prompts emitted valid `<tool_call>` structure
- After the decode fix, live AWS `4B` behavior changed from malformed outputs like `</think>\n\nREADY` and `1.` to:
  - exact `READY`
  - normal sentence answers
  - valid structured `tool_calls`
- Repaired canonical 4B full suite:
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-full-aws-r4-qwen35_4b_20260310T184433Z.json`
  - result: `15/15`
- Repaired 4B suite scope now passes end-to-end:
  - direct runtime smoke
  - SQLite
  - raw PDF ingest + RAG search
  - embedding + reranking
  - TTS + Qwen ASR STT
  - Hermes runtime-status
  - Hermes RAG
  - Hermes SQLite exec/query
  - Hermes memory add/read
  - Hermes `execute_code`
- AWS cleanup was deepened again:
  - removed the stale `q35-orpo-notemplate-1772992302` training tree
  - removed `checkpoint-1` from `samevm-orpo-reload-q35-fixed_20260308T182430Z`
  - pruned older same-VM debug WAV/debug-result artifacts and old worker logs
  - current root disk is still tight but improved to about `4.0G` free
- Lambda 9B provisioning is still blocked by real cloud-side capacity:
  - verified account auth and SSH key registration
  - repeated `launch` attempts across valid single-GPU types/regions returned either:
    - `instance-operations/launch/insufficient-capacity`, or
    - Cloudflare rate-limit `1015`
  - no Lambda instance was created in this sweep
Clean Same-VM Agent Selector Lane (2026-03-10)
- Added a real model-dependent comparison harness:
  - `/Users/andrewcorrea/treni/scripts/samevm_agent_regression_suite.py --mode agent_compare`
- Canonical comparison artifacts:
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35.json`
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35_4b.json`
- Scope of this selector lane:
  - runtime health
  - worker health
  - direct runtime smoke
  - Hermes runtime-status
  - Hermes RAG search
  - Hermes SQLite exec/query
  - Hermes memory add/read
  - Hermes execute_code
- Result on the AWS A10G host:
  - `qwen35` (0.8B) passed `10/10`
  - `qwen35_4b` (4B) passed `2/10`
- Key interpretation:
  - this selector artifact is now historical only
  - it predates the cached linear-attention decode fix for `4B`
  - the repaired full suite is the current source of truth for `4B`
- Tightened `0.8B` agent lane improvements that matter:
  - explicit `script=true` guidance fixed the SQLite exec scenario
  - memory recall is now validated through a new-session prompt, which matches Hermes memory semantics
  - execute-code validation now uses an explicit one-line Python task and passes cleanly
Isolated Speed Snapshot (2026-03-10)
- Current isolated `0.8B` speed probe:
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/qwen35_model_speed_compare_20260310.md`
  - cold first hit: `103` completion tokens, `ttft_ms=608.891`, `infer_ms=12444.005`, `tok/s=8.277`
  - warm steady-state: `103` completion tokens, `ttft_ms≈95.386`, `infer_ms≈905.594`, `tok/s≈113.738`
- Current repaired `4B` speed probe:
  - same prompt family, repeated live AWS requests
  - `119` completion tokens, `ttft_ms≈158.877`, `infer_ms≈3093.814`, `tok/s≈38.464`
- Current interpretation:
  - `0.8B` remains the speed-optimized lane
  - `4B` is now the repaired stronger-capability lane, but it is materially slower on A10G