Findings Changelog
Dated summary of major experiment findings and interpretation.
At A Glance
- Public GPU Agent console is now split cleanly between canonical and scratch surfaces:
  - public console now exposes:
    - a direct runtime test path for raw generation speed/logprobs/uncertainty
    - a separate agent test path for SQLite/RAG/memory/tool verification
  - docs link: https://treni-docs.pages.dev
  - deck link: https://monostate.com/pitch
  - docs navigation is now reorganized around:
    - canonical lanes
    - detailed logs
    - scratch experiments
  - interpretation:
    - the main experiment story is easier to read without mixing claim-safe lanes with random debugging work,
    - while the scratch bucket still preserves the noisy exploratory trail when needed.
- Native Hermes 4B same-VM conversation lane is now green for the split real-world persistence workflow:
  - artifact: benchmarks/same_vm_mvp/results/hemkesh-v22_20260311T020710Z.json
  - result:
    - local discovery works
    - exact facts are written to SQLite and queried back
    - broader context is ingested into RAG and retrieval-checked
    - a memory note is saved after persistence
    - final recall correctly points exact facts to SQLite and broader context to RAG
  - interpretation:
    - the earlier failures were a mix of duplicate tool-call IDs, over-long replayed tool traces, and opaque worker errors,
    - those are now fixed enough that the native Hermes 4B lane can complete a real multi-turn investor-style knowledge-building workflow on AWS,
    - the remaining open weakness is still the single-turn combined persistence prompt, not the split multi-turn workflow.
- Warm request path on G5 is stable and fast in the current runtime.
- Larger-N sampled strict confirmation (2026-03-08) now strengthens the post-fix Qwen3.5 non-thinking claim:
  - artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T235013Z.json
  - result (16 samples/task, gpqa_diamond + ifeval, 3 seeds):
    - overall: runtime 0.371528 vs vLLM 0.296875, runtime 1255.344 ms vs vLLM 1585.043 ms
    - gpqa_diamond: runtime 0.3750 vs vLLM 0.3125, runtime slower (801.900 ms vs 433.256 ms)
    - ifeval: runtime 0.368056 vs vLLM 0.281250, runtime faster (1708.789 ms vs 2736.831 ms)
  - interpretation:
    - the sampled strict runtime-vs-vLLM win survives beyond the 8-sample pilot,
    - the overall score and latency deltas stay positive with tighter confidence intervals.
- Finalized thinking strict lane (2026-03-08) is now measurable rather than all-zero:
  - artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T010353Z.json
  - result (8 samples/task, gpqa_diamond + ifeval, 3 seeds):
    - overall: runtime 0.250000 vs vLLM 0.194444, runtime 6823.816 ms vs vLLM 7503.000 ms
    - gpqa_diamond: runtime 0.166667 vs vLLM 0.166667, runtime 7727.880 ms vs 7741.028 ms
    - ifeval: runtime 0.333333 vs vLLM 0.222222, runtime 5919.753 ms vs 7264.973 ms
  - interpretation:
    - the old runtime 512 cap and long-decode corruption were real and are now fixed,
    - the closed-form finalize pass turns length-exhausted thinking traces into parseable answers on both backends,
    - reducing the GPQA first-pass reasoning budget to 256 preserves the new score lead while collapsing the old GPQA latency penalty,
    - the resulting finalized thinking lane now beats vLLM overall on both score and latency.
- GSM8K-only finalized thinking follow-up (2026-03-09) extends the same lane to another closed-form task family:
  - artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T022347Z.json
  - result (32 samples/task, 3 seeds):
    - runtime 0.197917 vs vLLM 0.177083, runtime 7174.829 ms vs vLLM 7643.231 ms
  - interpretation:
    - the same finalized thinking setup remains directionally runtime-positive on GSM8K,
    - but the score interval is still too wide for a strong claim, so this should be treated as exploratory support rather than canonical proof.
- AIME25 isolated finalized thinking pilot (2026-03-09) is a negative result:
  - artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T021732Z.json
  - result (8 samples, 1 seed, 512 tokens, patched AIME prompts):
    - runtime 0.0 vs vLLM 0.0, runtime 19776.254 ms vs vLLM 16092.718 ms
  - interpretation:
    - increasing the reasoning budget and adding AIME-specific prompt/finalize guidance still does not recover AIME25,
    - this should be treated as an explicit limitation of the current thinking harness and/or the model size.
- AIME25 second-thinking recovery attempt (2026-03-09) was also negative:
  - artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T021331Z.json
  - result:
    - runtime 0.0 vs vLLM 0.0, runtime 21409.322 ms vs vLLM 22110.402 ms
  - interpretation:
    - a second short thinking finalize pass increases cost and still does not recover AIME,
    - so this branch remains non-canonical.
- Late AWS sampled-lane fix (2026-03-08) resolved the last Qwen3.5 reproducibility blocker:
  - root cause was in scripts/phase5_awareness_realbench.py, not the runtime:
    - the shared first-pass arm_a_control request skipped the request seed and task-specific decode payload
  - post-fix sampled runtime-only reproducibility probes:
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r1.json
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r2.json
    - result: repeated sampled IFEval seed-7 runs are identical (score_mean=0.3125 both, 8/8 outputs identical)
  - post-fix sampled strict one-host matrix artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T220806Z.json
  - repeatability confirmation artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T221823Z.json
  - result:
    - overall: runtime 0.409722 vs vLLM 0.302083, runtime 1617.187 ms vs vLLM 2017.206 ms
    - gpqa_diamond: runtime 0.3750 vs vLLM 0.2500, runtime slower (710.693 ms vs 435.823 ms)
    - ifeval: runtime 0.4444 vs vLLM 0.3542, runtime faster (2523.680 ms vs 3598.588 ms)
    - repeatability check stayed aligned: overall runtime 0.409722 vs vLLM 0.281250, runtime 1607.757 ms vs vLLM 2008.759 ms
  - interpretation:
    - sampled-lane drift was a harness bug rather than runtime instability,
    - there is now a clean sampled strict AB3 lane where runtime wins overall on both score and latency,
    - and that result holds on an immediate second full-matrix rerun.
- First explicit thinking-mode strict parity lane is now measured (2026-03-08), but it is not yet promotable:
  - initial thinking matrix: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T223442Z.json
    - overall: runtime 0.166667 vs vLLM 0.111111, runtime 3589.124 ms vs vLLM 4635.395 ms
  - budget-fixed thinking matrix: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T224358Z.json
    - overall: runtime 0.166667 vs vLLM 0.111111, runtime 8678.709 ms vs vLLM 9041.981 ms
    - gpqa_diamond: runtime 0.0 vs vLLM 0.0
  - one-example long-budget probes:
    - benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_runtime_1024_20260308T230352Z.json
    - benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_vllm_1024_20260308T230352Z.json
  - interpretation:
    - under the raw thinking template, both backends can stay trapped in reasoning without emitting a usable final GPQA answer,
    - that exploratory lane should now be read as the pre-fix baseline for the finalized result above, not as the current canonical thinking state.
- Late AWS deterministic rerun (2026-03-08) is now the cleanest claim-safe Qwen3.5 one-host lane:
  - runtime-side reproducibility fix landed in monolith/server/http.c: request-scoped decode env overrides are now serialized instead of racing through process-global env state
  - direct runtime reproducibility probe:
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r1.json
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r2.json
    - result: identical outputs and identical score (0.5625) on repeated temperature=0 IFEval seed-7 runs
  - deterministic strict one-host matrix artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T204248Z.json
  - result:
    - overall: runtime 0.295139 vs vLLM 0.267361, runtime 824.714 ms vs vLLM 1572.529 ms
    - gpqa_diamond: score parity (0.166667 vs 0.166667), runtime slower (671.640 ms vs 436.583 ms)
    - ifeval: runtime leads on score (0.423611 vs 0.368055) and latency (977.787 ms vs 2708.475 ms)
  - interpretation:
    - there is now a reproducible deterministic strict lane where runtime wins overall on both score and latency.
- Historical note: the earlier sampled-lane reproducibility failure (2026-03-08) is now explained and non-canonical:
  - repeated runtime-only IFEval seed-7 sampled runs:
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_r1.json
    - benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_r2.json
  - result:
    - the summary moved 0.375 -> 0.500 with the same seed/config
    - all 8/8 example outputs changed between reruns
  - interpretation:
    - these old drift artifacts came from the harness shared-first path skipping the request seed,
    - they should not be interpreted as runtime sampler instability.
- Late AWS sampler update (2026-03-08) materially changed the Qwen3.5 strict picture again:
  - chunked stop-check plus fast top-k sampling landed after the hybrid prefill work,
  - focused GPQA decode profile moved:
    - decoder_step0_logits_sample 40.701 -> 3.538 ms
    - decoder_stepN_sample_mean 37.090 -> 2.366 ms
    - decoder_stepN_total_mean 47.748 -> 12.721 ms
  - focused artifacts:
    - benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-stopchunk8_20260308T003422Z.json
    - benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-samplefast1_20260308T003727Z.json
  - fast-sampler AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T003749Z.json
    - overall: runtime 0.305556 vs vLLM 0.347222, runtime 1405.707 ms vs vLLM 1676.336 ms
  - tie-stable fast-sampler AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T004758Z.json
    - overall: runtime 0.315972 vs vLLM 0.347222, runtime 1422.818 ms vs vLLM 1659.878 ms
    - gpqa_diamond: runtime 0.291667 vs vLLM 0.208333, runtime 886.296 ms vs vLLM 515.171 ms
    - ifeval: runtime 0.340278 vs vLLM 0.486111, runtime 1959.340 ms vs vLLM 2804.584 ms
  - interpretation:
    - runtime now has a clean strict latency lead on the one-host Qwen3.5 matrix,
    - prompt prefill and sampled decode are both materially improved,
    - the remaining blocker is recovering the small score deficit without giving back that latency win.
- Late AWS update (2026-03-08) materially changed the Qwen3.5 strict picture:
  - new batched hybrid prefill landed in monolith/models/decoder.cu and monolith/main.c,
  - focused GPQA profile moved:
    - decoder_prefill 3263.527 -> 1341.628 -> 275.372 ms
    - decoder_ttft 3317.441 -> 1405.739 -> 1017.876 ms
  - latest strict AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T000429Z.json
  - result:
    - overall score: runtime 0.413195 vs vLLM 0.347222
    - overall latency: runtime 2940.172 ms vs vLLM 1686.263 ms
    - gpqa_diamond: runtime 0.458333 vs vLLM 0.208333, latency 1347.582 ms vs 512.075 ms
    - ifeval: runtime 0.368055 vs vLLM 0.486111, latency 4532.763 ms vs 2860.452 ms
  - interpretation: prompt prefill was a real architectural blocker and is no longer the main latency limiter; the remaining gap is narrower and now looks more like warm decode/request-path overhead.
- Late AWS update (2026-03-07) now has two new concrete results:
  - Qwen3.5 strict/AWS launcher drift is now fixed through a shared fast-path env:
    - code: scripts/qwen_runtime_env.py
    - consumers: scripts/qwen35_remote_isolated_ab.py, scripts/treni_local_tool_worker.py, scripts/hermes_same_vm_mvp.py
    - clean AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T231500Z.json
    - result:
      - overall score: runtime 0.335648 vs vLLM 0.291667
      - overall latency: runtime 3690.124 ms vs vLLM 1646.672 ms
      - gpqa_diamond: score parity (0.25 vs 0.25) but runtime remains much slower
      - ifeval: runtime higher score (0.421296 vs 0.333333) but still slower
    - interpretation: the old launcher mismatch was real, but fixing it does not close the latency gap; the remaining blocker is still long-prompt prefill.
    - the code-level reason for the remaining Qwen3.5 latency gap is now explicit:
      - monolith/models/decoder.cu currently returns invalid from treni_decoder_forward_f32(...) when ctx->is_linear_attn is true,
      - the comment in that path states that Qwen3.5 linear attention is implemented only in cached/token decode,
      - so Qwen3.5 prompt prefill still falls back to the token-by-token cached loop in monolith/main.c instead of a true batched prompt-prefill path.
      - interpretation: the remaining long-prompt latency gap is architectural, not just a missing launch flag.
  - ORPO self-reload loop is real end-to-end:
    - artifact: benchmarks/same_vm_mvp/results/samevm-orpo-reload-aws_20260307T222341Z.json
    - local ORPO output was merged, packed into a new monolith container, restarted as a second runtime, and answered a real chat request.
  - Qwen3.5 shared-prefix tiering (64 -> 112 runtime cache cap with quartile tiers + exact replay) yields a real clean latency win, but not a full fix:
    - sequential GPQA profile artifact: benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-seq2-cap112_20260307T222540Z.json
    - clean strict seed-7 spot A/B artifacts:
      - benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T223218Z.json
      - benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T223555Z.json
    - effect on runtime latency (112 vs 64):
      - overall -363.908 ms
      - gpqa_diamond -420.699 ms
      - ifeval -307.116 ms
    - runtime is still slower than vLLM overall even after this improvement.
- Qwen3.5 contract validation + one-host strict rerun are now updated on AWS (2026-03-07):
  - tokenizer audit artifact: benchmarks/qwen35_tokenizer_audit/results/qwen35-tokenizer-audit-active_20260307T173024Z.json
  - runtime smoke artifact: benchmarks/qwen35_smoke/results/qwen35-runtime-smoke-active2_20260307T173132Z.json
  - isolated semantic A/B artifact: benchmarks/qwen35_smoke/results/qwen35-isolated-ab-active_20260307T173228Z.json
  - strict one-host matrix artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T191653Z.json
  - new runner: scripts/phase5_qwen35_remote_strict_matrix.py
  - current state:
    - packed tokenizer exactly matches the HF full vocab for Qwen/Qwen3.5-0.8B (248077 tokens),
    - runtime extended non-thinking smoke passes 7/7 cases on the active AWS host,
    - runtime wins the isolated non-thinking probe suite overall, while matched vLLM still misses multimodal-placeholder and forced-thinking probe cases in that probe harness,
    - strict realbench score is no longer behind overall: runtime score 0.3333 vs vLLM 0.3160,
    - strict realbench latency is still far behind: runtime 3809.745 ms vs vLLM 1626.068 ms.
  - request-path fixes included in that rerun:
    - Qwen3.5 decoder prefix cache now defaults on with 64 prefix tokens,
    - timing.ttft_ms now measures request-path first-token timing instead of the decode-loop step-0 proxy,
    - a repeated prompt-family hot probe on AWS dropped from infer_ms ~1798.5 -> 842.4 ms and ttft_ms ~1531.9 -> 782.5 ms with a cache hit.
  - task split in the strict one-host matrix:
    - gpqa_diamond: score parity (0.2917 vs 0.2917) but runtime is still much slower (+2449.320 ms),
    - ifeval: runtime higher score (0.3750 vs 0.3403) but still slower (+1918.033 ms).
- Follow-up prefix-cache debugging on AWS (2026-03-07, non-canonical debug cycle) isolated a real short-prompt runtime bug:
  - a focused 2x gpqa + 2x ifeval profile with cache enabled showed:
    - GPQA does get a real 64-token prefix-cache hit,
    - that hit reduces prefill (~3075 ms -> ~2697 ms) but does not close the large prefill gap,
    - short IFEval requests were hitting CUDA invalid argument on the prefix-cache/store path and then poisoning the next request.
  - a focused no-cache rerun removed the short-prompt CUDA failures entirely.
  - safe fix landed in monolith/main.c: only long prompts are allowed to store into the prefix cache; short prompts skip the buggy store path.
  - a focused post-fix profile confirms:
    - the GPQA cache hit is preserved,
    - the short IFEval CUDA invalid-argument path is gone in the probe,
    - this is a stability/correctness fix, not yet a canonical strict-matrix win.
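The length-gated store policy above can be sketched in a few lines. This is an illustrative Python sketch, not the real fix: the actual gate lives in C in monolith/main.c, and the 64-token threshold and dict-based cache here are assumptions chosen to match the 64-token prefix-cache hit described above.

```python
MIN_STORE_TOKENS = 64  # assumed gate; the real constant lives in monolith/main.c

def maybe_store_prefix(cache: dict, prompt_tokens: list) -> bool:
    """Only long prompts may store into the prefix cache; short prompts
    skip the store path entirely (the path that was raising CUDA
    invalid-argument errors and poisoning the next request)."""
    if len(prompt_tokens) < MIN_STORE_TOKENS:
        return False  # short prompt: skip the buggy store path
    key = tuple(prompt_tokens[:MIN_STORE_TOKENS])
    cache[key] = True  # stand-in for storing real KV-cache state
    return True
```

The point of the fix is that skipping the store for short prompts loses almost nothing (short prompts have little prefill to save) while removing the crash path.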
- Same-VM Hermes wrapper recovery is now complete on AWS (2026-03-07):
  - artifact: benchmarks/same_vm_mvp/results/samevm-q35-stack_20260307T172158Z.json
  - result: the wrapper now auto-starts the local runtime + CPU tool worker, calls samevm_runtime_health, runs the real extended Qwen3.5 smoke suite, and emits a deterministic plain-text summary from tool outputs.
  - entrypoints: scripts/hermes_same_vm_mvp.py, scripts/run_samevm_qwen35_stack.sh
  - smoke sub-artifacts: benchmarks/qwen35_smoke/results/samevm-smoke-20260307T172124Z_20260307T172124Z.json, benchmarks/qwen35_smoke/results/samevm-smoke-20260307T172124Z_20260307T172124Z.md
  - current status: PASS across 7/7 smoke cases in the wrapper path; the remaining issue is an intermittent first tool-turn CUDA retry (compute/ops.cu:765, invalid argument) that recovered successfully in the observed run.
- Same-VM runtime-admin proof is now clean on AWS (2026-03-07):
  - artifact: benchmarks/same_vm_mvp/results/samevm-q35-runtime-admin-proof-v5_20260307T212852Z.json
  - result: Hermes calls the local samevm_runtime_status and samevm_multimodal_status tools, and the wrapper now rewrites partial/truncated model responses into a deterministic tool-derived summary.
  - current state in that artifact:
    - runtime is managed by the worker on http://127.0.0.1:18080,
    - the managed runtime PID is live (pid_running=yes),
    - the Qwen3.5 runtime uses the packed local container with prefix cache enabled,
    - multimodal defaults are loaded from the same local worker (embed, rerank, tts, stt).
- Same-VM ORPO control-plane proof is now complete on AWS (2026-03-07):
  - artifact: benchmarks/same_vm_mvp/results/samevm-orpo-probe-aws_20260307T215307Z.json
  - runner: scripts/samevm_orpo_probe.py
  - result:
    - the local preference dataset write succeeded,
    - a real background ORPO job launched through the worker,
    - the job completed with returncode=0,
    - current scope is training control, not hot-reload: adapter/container ingestion back into the monolith runtime is still not wired.
- Same-VM multimodal tool surface is now wired into the local worker + Hermes bridge (2026-03-07):
  - code: scripts/samevm_multimodal_models.py, scripts/treni_local_tool_worker.py, scripts/hermes_same_vm_mvp.py
  - new tool classes: samevm_multimodal_status, samevm_embed, samevm_rerank, samevm_tts, samevm_stt
  - default models:
    - Qwen/Qwen3-VL-Embedding-2B
    - Qwen/Qwen3-VL-Reranker-2B
    - Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
    - Qwen/Qwen3-ASR-0.6B
    - Whisper fallback STT if the requested model id contains whisper
  - a live worker status smoke confirms the new endpoints are reachable.
  - bootstrap entrypoint: scripts/bootstrap_samevm_multimodal.sh
  - MVP readme: benchmarks/same_vm_mvp/README.md
- First real same-VM stack proof now runs end-to-end on AWS (2026-03-07):
  - artifact: benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v3_20260307T213248Z.json
  - runner: scripts/samevm_stack_probe.py
  - confirmed in one local-worker pass:
    - runtime status: healthy managed Qwen3.5 runtime on the same VM,
    - SQLite exec/query: pass (1 row),
    - RAG ingest/search: pass (match_count=1, top hit "Same VM locality"),
    - TTS: pass with Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice,
    - STT: pass with Whisper fallback on the generated WAV,
    - embedding: pass with Qwen/Qwen3-VL-Embedding-2B (dim=2048),
    - reranking: pass with Qwen/Qwen3-VL-Reranker-2B.
  - caveat: the Whisper transcript is directionally correct but not exact on the synthetic audio ("Treni" misheard), so the current STT proof is functional, not quality-benchmarked.
- Same-VM multimodal cache retention bug is now explicit and mitigated on AWS (2026-03-07):
  - finding: after the multimodal proof, the local tool worker was holding about 13.3 GiB of GPU memory and starving the Qwen runtime path.
  - fix:
    - new worker endpoint: POST /v1/mm/clear_cache
    - new Hermes tool: samevm_multimodal_clear_cache
    - status now reports loaded_model_count, loaded_models, and CUDA allocation/reservation.
  - code: scripts/samevm_multimodal_models.py, scripts/treni_local_tool_worker.py, scripts/hermes_same_vm_mvp.py
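For reference, the cache-clear endpoint above can be exercised from any stdlib HTTP client. The endpoint path comes from the changelog; the base URL/port, empty-JSON body, and response shape are assumptions about the local worker deployment, not documented API contract.

```python
import urllib.request

def build_clear_cache_request(base_url: str) -> urllib.request.Request:
    """Build the POST /v1/mm/clear_cache request used to ask the local
    multimodal worker to release held GPU memory. base_url (host/port)
    is deployment-specific; the body is an assumed empty JSON object."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/mm/clear_cache",
        data=b"{}",
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Sending the request with urllib.request.urlopen(...) would then trigger the clear; the samevm_multimodal_clear_cache Hermes tool wraps the same call.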
- Canonical same-VM MVP proof was revalidated on AWS after the runtime compatibility fix (2026-03-10):
  - artifact: benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json
  - summary artifact: benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.md
  - result:
    - runtime health: pass
    - worker health: pass
    - Hermes runtime-status: pass
    - Hermes multimodal-status: pass
    - direct same-VM runtime smoke: pass on the basic non-thinking profile (all_ok=True, 5 cases, includes first-turn tool calling)
    - direct same-VM thinking smoke: pass on the extended/thinking profile with exact-match checks (all_ok=True)
    - direct same-VM stack probe: pass for SQLite, RAG, embedding, reranking, TTS, Qwen ASR STT
    - Qwen3.5 ORPO reload proof: pass via benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json
    - sidecar cleanup: pass (port=18081, stopped=true)
    - final multimodal cache clear: pass
  - implementation note:
    - the full demo runner is now scripts/samevm_full_mvp_demo.py
    - the one-command entrypoint is scripts/run_samevm_full_mvp.sh
  - compatibility fix:
    - monolith/server/http.c now accepts both POST /v1/chat/completions and POST /chat/completions
    - monolith/server/http.c now exposes both GET /v1/models and GET /models
    - this removed the live Hermes 404 failure against the root runtime URL on AWS
  - extra Hermes tool proofs after the rerun:
    - benchmarks/same_vm_mvp/results/hermes-demo-sqlite-query-v2.json
    - benchmarks/same_vm_mvp/results/hermes-demo-rag-search-v1.json
    - these confirm Hermes can use real SQLite and RAG tools on AWS beyond the status-only path
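The compatibility fix above is simple route aliasing: both the versioned and unversioned endpoint forms map onto one handler. A minimal sketch of that idea, in Python for illustration (the real server is C in monolith/server/http.c, and the alias table here is only the two pairs named in the changelog):

```python
# Alias table: unversioned paths map onto their /v1 canonical form.
ROUTE_ALIASES = {
    "/chat/completions": "/v1/chat/completions",
    "/models": "/v1/models",
}

def normalize_route(path: str) -> str:
    """Return the canonical handler key for a request path, so that
    clients hitting the root runtime URL (no /v1 prefix) no longer 404."""
    return ROUTE_ALIASES.get(path, path)
```

With this in place, a client configured with either base URL resolves to the same chat-completions handler, which is what removed the live Hermes 404 on AWS.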
- Live capability validation pass on AWS added concrete speed and multimodal proofs (2026-03-10):
  - direct generation speed on the current live Qwen3.5 runtime (0.8B, non-thinking lane):
    - 3 deterministic runs, 130 completion tokens each
    - mean end-to-end throughput: 112.37 tok/s
    - mean decode-only throughput: 121.90 tok/s
  - Hermes tool visibility proof: benchmarks/same_vm_mvp/results/hermes-tool-list-v1.json
    - loaded tools include runtime control, smoke, SQLite, RAG, embedding, reranking, TTS, STT, ORPO, and job status
  - Hermes audio roundtrip proofs:
    - TTS: benchmarks/same_vm_mvp/results/hermes-tts-v2.json
    - STT: benchmarks/same_vm_mvp/results/hermes-stt-v2.json
    - the current transcript roundtrip still shows the known synthetic-voice name drift (Treni -> Trinity)
  - PDF/RAG real-world proof:
    - extracted /Users/andrewcorrea/pncp-ata360/docs/manual-pncp-api.pdf to text
    - ingested the extracted text into the AWS same-VM RAG store
    - a search for "Protocolo de Comunicação PNCP" returned the correct manual section
  - reranker proof:
    - a direct Qwen reranker call correctly ranked the "Protocolo de Comunicação" candidate first
  - current product caveat:
    - same-VM RAG currently ingests plain text files and text payloads only; PDF parsing is still an external preprocessing step
  - current 4B feasibility note:
    - the live AWS host is an A10G 24 GB box with enough GPU headroom to try Qwen3.5-4B,
    - but only about 12 GB of root disk remains free, so model download/pack is the practical blocker on the current host
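On the end-to-end vs decode-only distinction in the throughput numbers above: a plausible reading (assumption, since the measurement script is not shown) is that end-to-end divides completion tokens by total request wall time, while decode-only excludes time-to-first-token, which is why decode-only is the higher figure.

```python
def throughput(completion_tokens: int, total_s: float, ttft_s: float):
    """Hypothetical throughput helper matching that reading:
    end-to-end uses the whole request wall time; decode-only excludes
    TTFT (the first token is produced at ttft_s, so decode covers the
    remaining completion_tokens - 1 tokens)."""
    end_to_end = completion_tokens / total_s
    decode_only = (completion_tokens - 1) / (total_s - ttft_s)
    return end_to_end, decode_only
```

Under this definition, a fast prefill narrows the gap between the two figures, consistent with the small 112.37 vs 121.90 tok/s spread reported for the warm 0.8B lane.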
- Qwen3.5 thinking-mode response salvage is now wired for unfinished reasoning traces (2026-03-08):
  - code: monolith/server/http.c
  - behavior:
    - unfinished "Thinking Process:" outputs on exact-output prompts now return a usable final message.content
    - the raw reasoning trace is preserved in message.reasoning_content
    - finish_reason remains length when the model still truncates
  - passing artifact: benchmarks/qwen35_smoke/results/postmvp-extended-thinking-r4_20260308T193858Z.json
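The salvage behavior above can be sketched as a response post-processor. This is an illustrative Python sketch under stated assumptions: the real logic is C in monolith/server/http.c, the "Thinking Process:" marker is taken from the changelog, and the last-line heuristic for recovering a usable answer is a guess at the mechanism, not a description of it.

```python
def salvage_thinking(raw: str) -> dict:
    """Split an unfinished 'Thinking Process:' trace into a usable final
    message.content plus a preserved message.reasoning_content."""
    marker = "Thinking Process:"
    if marker not in raw:
        # Normal completion: no reasoning trace to salvage.
        return {"content": raw.strip(), "reasoning_content": None}
    reasoning = raw.split(marker, 1)[1]
    # Heuristic stand-in: treat the last non-empty line of the trace as
    # the best available final answer when the model never closed its
    # reasoning block (finish_reason stays "length" in that case).
    lines = [ln.strip() for ln in reasoning.splitlines() if ln.strip()]
    content = lines[-1] if lines else ""
    return {"content": content, "reasoning_content": reasoning.strip()}
```

The key property is that exact-output scoring sees a parseable message.content while the raw trace survives in reasoning_content for debugging.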
- Qwen3.5 ORPO reload is now promoted to the target family (2026-03-08):
  - tokenizer parity for packed tokenizer.json + merges is now exact in the runtime
  - repacked ORPO sidecar containers now carry the correct runtime model kind (qwen3_5)
  - canonical proof: benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json
  - observed sidecar chat preview: "READY Local tools are useful because they offer immediate, offline access..."
- Same-VM multimodal STT is now promoted from Whisper fallback to Qwen ASR on AWS (2026-03-08):
  - wrapper fix: the local STT loader now initializes the forced aligner only when timestamps are explicitly requested
  - AWS disk cleanup removed obsolete ORPO runtime candidates and recovered the box from 100% to 88% root usage
  - passing Qwen ASR probe: benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v5.json
  - current observed transcript on the synthetic TTS probe: "Trinity runs its tools locally on the same machine."
  - remaining caveat:
    - timestamped STT still depends on the forced-aligner path and sufficient local disk to materialize that model
- Extended same-VM runtime smoke is now fully green (2026-03-08):
  - extended non-thinking profile: pass (benchmarks/qwen35_smoke/results/postmvp-extended_20260308T185130Z.json, 7/7 cases)
  - extended thinking profile: pass (benchmarks/qwen35_smoke/results/postmvp-extended-thinking-r4_20260308T193858Z.json)
- Clean direct GPQA runtime profile is now captured on AWS (2026-03-07):
  - artifact: benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-clean_20260307T220200Z.json
  - runner: scripts/q35_gpqa_profile_once.py
  - result:
    - on a fresh runtime, the same GPQA prompt run twice still spends most time in decoder_prefill,
    - first call: decoder_prefill=3263.527 ms, decoder_ttft=3317.441 ms,
    - second call with prefix-cache reuse: decoder_prefill=2690.001 ms, decoder_ttft=2750.672 ms,
    - step-0 decode itself is small (decoder_step0_layers ~8 ms, decoder_step0_logits_sample ~33-36 ms),
    - tensor upload improves sharply on the second call (218.091 -> 11.216 ms), but the remaining gap is still overwhelmingly prefill.
- Qwen3.5 tokenizer audit is now exact for the current packed target (2026-03-06):
  - artifact: benchmarks/qwen35_tokenizer_audit/results/runtime-q35-tokenizer-audit-r4_20260306T190418Z.json
  - the packed runtime tokenizer/full vocab exactly matches HF Qwen/Qwen3.5-0.8B (248077 tokens), including <think> and vision/control tokens.
- New Qwen3.5 probe matrix (2026-03-06) gives a cleaner functional picture than the earlier strict matrix alone:
  - artifact: benchmarks/qwen35_smoke/results/qwen35-probe-matrix-r2_20260306T200035Z.json
  - runtime non-thinking passes the full extended probe set (all_ok=true).
  - runtime thinking also completes all cases, but output discipline is weak and latency is very high.
  - vLLM non-thinking/thinking are not universal wins in this matrix:
    - the current text-only launch rejects multimodal placeholder input (400),
    - several thinking/exact-output cases end with finish_reason=length.
  - claim-safe interpretation:
    - runtime is functionally solid in non-thinking,
    - vLLM remains much faster on long-prompt/tool cases,
    - the current runtime blocker is long-prompt/tool-path infer latency, not basic Qwen3.5 tokenizer or tool-call plumbing.
- Same-VM Hermes MVP is now materially real (2026-03-06):
  - smoke artifact: benchmarks/same_vm_mvp/results/hermes-samevm-q35-smoke-r5_20260306T192703Z.json
  - ORPO smoke launch artifact: benchmarks/same_vm_mvp/results/hermes-samevm-q35-orpo-smoke-r1_20260306T194152Z.json
  - the local worker completed a real ORPO training smoke run and saved output under benchmarks/same_vm_mvp/trainings/samevm-orpo-qwen25-smoke3/.
- Phase 5 harness parse hardening landed (2026-03-04) for closed-form tasks:
  - long reasoning traces no longer get scored via accidental "last number" extraction,
  - the parser now requires explicit answer signals (ANSWER: / Final Answer: / boxed / strict numeric-only),
  - think-tag blocks are stripped before parse.
  - code: /Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py
  - sanity artifact (vLLM thinking mode, no parser): phase5_awareness_realbench_q35-parsefix-vllm-thinking1_20260304T032441Z.json now yields prediction_parsed=null instead of a false numeric parse.
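The hardening above amounts to a whitelist parser. A minimal sketch of that idea, assuming the signal list from the changelog; the real parser lives in scripts/phase5_awareness_realbench.py and its exact regexes and priority order are not shown here, so details below are illustrative.

```python
import re

def parse_closed_form(text: str):
    """Strict closed-form answer extraction: only explicit answer
    signals count, and there is deliberately no 'last number in the
    trace' fallback, so unfinished reasoning parses as None."""
    # Strip think-tag blocks before parsing.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Explicit answer signals, checked in an assumed priority order.
    for pat in (r"ANSWER:\s*(.+)",
                r"Final Answer:\s*(.+)",
                r"\\boxed\{([^}]*)\}"):
        m = re.search(pat, text)
        if m:
            return m.group(1).strip()
    # Strict numeric-only output (the whole reply is one number).
    stripped = text.strip()
    if re.fullmatch(r"-?\d+(?:\.\d+)?", stripped):
        return stripped
    return None  # unparsed: scored as wrong rather than guessed
```

Returning None here is what produces the prediction_parsed=null behavior noted in the sanity artifact, instead of a false numeric parse from a stray number in the trace.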
- New strict paired AB3 rerun (2026-03-04, gpqa_diamond + ifeval, Arm A only, request_logprobs=false, 16/task, seeds 7, 17, 27) is complete:
  - summary artifact: phase5_q35_runtime_vs_vllm_ifeval_gpqa_ab3_20260304T034227Z.json
  - overall: runtime score 0.3403 vs vLLM 0.3229 (delta +0.0174, CI includes 0), runtime latency 1772.931 ms vs vLLM 1553.034 ms (delta +219.897 ms).
  - stratified:
    - gpqa_diamond: score parity (0.2708 vs 0.2708), runtime much slower (+1881.776 ms).
    - ifeval: runtime better score (+0.0347) and much faster latency (-1441.983 ms).
- Runtime awareness retries remain non-promotable on this exact profile (2026-03-04):
  - Arm B/C retries did not improve gpqa_diamond,
  - both regressed ifeval and added latency in the tested settings (adaptive, summary-mode uncertainty, no token-logprobs).
- Strict benchmark guard is now enforced for runtime HTTP runs (2026-03-02): TRENI_HTTP_REQUIRE_INFERENCE=1 returns a hard failure (502 inference_required) when inference is unused or empty, eliminating silent heuristic-fallback/zero-filled artifact contamination.
- Phase 5 paper-loop harness bug fix landed (2026-03-03): in paper mode, retry now commits the refined pass directly (instead of passing through confidence-margin replacement gating).
  - code: /Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py
  - first live sanity artifacts:
    - vLLM (gpqa_diamond + ifeval, 8 samples/task): phase5_awareness_realbench_qwen35-paperfix-sanity1_20260303T201156Z.json (overall B-A=0.0, gpqa +0.125, ifeval -0.125).
    - runtime (same set): phase5_awareness_realbench_qwen35-paperfix-sanity2-runtime_20260303T201744Z.json (overall B-A=-0.125, retry rate 100%).
- Runtime "all-zero" sanity artifact (phase5_awareness_realbench_qwen35-paperfix-sanity1-runtime_20260303T201620Z.json) was an infra contamination case, not a scoring result:
  - vLLM and runtime were co-resident on a single A10G,
  - runtime hit GPU OOM on embedding upload, and the strict guard returned 502 inference_required for all requests.
- Paper trigger calibration issue is now isolated on runtime (`2026-03-03`):
  - with default paper thresholds, `max_entropy` triggered retries on all samples (`16/16`),
  - raising only the perplexity threshold (`1.4 -> 1.8 -> 2.2`) had no effect on behavior/outcome,
  - raising the entropy threshold (`1.5 -> 7.0`) reduced retries (`16 -> 9`) but still produced no score uplift (overall `B-A=0.0`) and added latency.
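Two of the multi-signal triggers can be sketched as follows; this is a simplified illustration, not the harness's actual implementation. Approximating "entropy" from the chosen-token logprob is an assumption here, and the default thresholds (`1.4`, `1.5`) are the only values taken from the log.

```python
import math

def should_retry(token_logprobs, ppl_max=1.4, entropy_max=1.5):
    """Return the list of fired trigger reasons for one completion.

    Sketch of perplexity / max_entropy style triggers: perplexity is
    exp(mean negative logprob); per-token uncertainty is approximated
    here by the chosen token's negative logprob.
    """
    ppl = math.exp(-sum(token_logprobs) / len(token_logprobs))
    max_uncertainty = max(-lp for lp in token_logprobs)
    reasons = []
    if ppl > ppl_max:
        reasons.append("perplexity")
    if max_uncertainty > entropy_max:
        reasons.append("max_entropy")
    return reasons
```

With a single very uncertain token, `max_entropy` fires even when overall perplexity is modest, which matches the observed over-triggering at default thresholds.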
- Runtime summary-mode calibration fix is now live (`2026-03-03`):
  - retry logic now detects `runtime_summary` uncertainty payloads and uses guarded vote triggering (`paper_summary_*` thresholds) instead of entropy-only firing.
  - artifact (`8`/task sanity): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-sanity2_20260303T204120Z.json`
    - retry dropped (`16 -> 9`),
    - quality moved from negative to parity (`overall B-A: -0.125 -> 0.0`),
    - latency overhead remains material (`~+1386 ms` mean).
  - higher-N confirmation (`32`/task): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32_20260303T204751Z.json`
    - mixed result (`gpqa +0.03125`, `ifeval -0.0625`, overall `B-A=-0.015626`).
- Task-aware summary policy is now the first repeatable positive awareness result on this Qwen3.5 runtime track (`2026-03-03`):
  - policy: keep summary-mode paper retries for `gpqa_diamond`, disable summary-mode retries for `ifeval`,
  - one larger run (`32`/task): `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32-ifevaloff_20260303T222841Z.json`
    - overall `B-A=+0.015624`, latency delta `+618.068 ms`, `gpqa +0.03125`, `ifeval +0.0`.
  - 3-seed repeatability (`16`/task): `...ifevaloff-rpt-s7/s17/s27...`
    - overall delta mean `+0.020833` (range `0.0` to `+0.03125`),
    - retries constrained to `gpqa` only, mean retry rate `~0.2917`.
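The task-aware policy above reduces to a small gate. The function and set names here are illustrative assumptions; only the gpqa-on / ifeval-off split comes from the findings.

```python
def summary_retry_allowed(task, uncertainty_vote):
    """Allow summary-mode paper retries only on tasks where they helped.

    Retries stay enabled for gpqa_diamond and are disabled for ifeval,
    where they regressed score; the guarded uncertainty vote must still
    fire for an allowed task.
    """
    retry_tasks = {"gpqa_diamond"}
    disabled_tasks = {"ifeval"}
    if task in disabled_tasks or task not in retry_tasks:
        return False
    return bool(uncertainty_vote)
```

Splitting the policy per task is what turned a mixed overall delta into the first repeatable positive one.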
- Late Phase 5 policy pass (`2026-03-03`) reduced awareness latency overhead without losing quality signal:
  - root cause isolated: most `gpqa_diamond` retries were `invalid_parse` with high first-pass confidence, and those retries were usually non-productive.
  - harness changes:
    - compact invalid-parse recovery prompt (`build_format_recovery_messages`),
    - new confidence gate for invalid-parse retries on closed-form tasks (`--invalid-parse-retry-confidence-max`).
  - code: `/Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py`
  - 3-seed repeatability (`16`/task, `s7/s17/s27`) with `invalid_parse_retry_confidence_max=0.73`:
    - artifacts: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s16_20260303T232029Z.json`, `...-rpt-s17_20260303T232254Z.json`, `...-rpt-s27_20260303T232516Z.json`
    - quality delta unchanged vs prior baseline: `overall B-A mean = +0.020833`,
    - latency overhead reduced: `+712.276 ms -> +404.603 ms`,
    - GPQA retry rate reduced: `0.5833 -> 0.2917`.
  - higher-N confirmation (`32`/task, `s7`):
    - artifact: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s32_20260303T232755Z.json`
    - same quality delta as the prior `s32` baseline (overall `B-A=+0.015624`) with lower latency overhead (`+618.068 ms -> +326.187 ms`).
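The confidence gate above can be sketched in a few lines. The signature is an assumption for illustration; the `0.73` threshold mirrors the `--invalid-parse-retry-confidence-max` setting used in the repeatability runs.

```python
def gate_invalid_parse_retry(parse_ok, first_pass_confidence, confidence_max=0.73):
    """Decide whether an invalid-parse retry should run.

    Skip the retry when the first pass failed to parse but was highly
    confident: those retries were observed to be mostly non-productive,
    so gating them trades no quality for lower latency overhead.
    """
    if parse_ok:
        return False  # nothing to recover
    return first_pass_confidence <= confidence_max
```

This is why the quality delta stayed at `+0.020833` while mean retry latency roughly halved: only low-confidence parse failures still pay for a second pass.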
- Qwen3.5 strict runtime-vs-vLLM matrix has a new Arm A-only canonical rerun (`2026-03-03`, `phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.json`) after the decoder gate-layout fix:
  - score gap narrowed (runtime `0.15625` vs vLLM `0.19097`, delta `-0.03472`, CI includes near-parity),
  - latency is still behind (runtime `1723.685 ms` vs vLLM `958.757 ms`, delta `+764.928 ms`).
- Qwen3.5 strict runtime-vs-vLLM canonical matrix is now completed (`2026-03-02`) after the decoder-path unblock (`phase5_qwen35_runtime_vs_vllm_matrix_20260302T222013Z.json`): runtime is currently behind on both score (`0.0503` vs `0.2170`) and latency (`1881.188 ms` vs `178.093 ms`).
- Follow-up Q/K norm fix check (`qnorm-check1`, `2026-03-02`) did not resolve the Qwen3.5 gap (`phase5_qwen35_runtime_vs_vllm_matrix_20260302T225529Z.json`): runtime remained far slower and still produced malformed repetitive outputs on direct probes, narrowing the primary blocker to missing `linear_attn` (GatedDeltaNet) parity.
- Qwen3.5 serving path is now unblocked on AWS via vLLM nightly (`0.16.1rc1.dev...`) with `--language-model-only`; stable endpoint validated on `127.0.0.1:18081` (`2026-03-02`).
- Infra blocker/fix (`2026-03-02`): root disk hit `100%` and broke vLLM startup (`No usable temporary directory`); cache/venv cleanup restored ~21 GB free and launch stability.
- Phase 5 A/B/C fairness fix landed (`2026-03-02`): all arms now reuse the exact same first completion per example before awareness actions.
- Paper-loop alignment landed in the Phase 5 harness (`2026-03-02`): cloned the reference repo (`third_party/weave-logprobs-reasoning-loop`) and ported multi-signal retry triggers (`perplexity`, `max_entropy`, `low_confidence_tokens`) plus per-call uncertainty traces.
- Paper-mode smoke validation ran end-to-end on AWS Qwen3.5 nightly (`2026-03-02`): trigger reason fields and loop traces are present in artifact `phase5_awareness_realbench_qwen35-paper-smoke_20260302T191420Z.json`.
- Full Qwen3.5 paper-mode run (`r4`, `2026-03-02`) confirms integration but not uplift at current thresholds: overall `B-A=-0.046875`, `C-A=0.0`, with extra latency from retries.
- Adaptive uncertainty fix (`2026-03-02`) reduced over-triggering and improved tradeoffs on Qwen3.5:
  - `r5` (`...r5-adaptive...`): `B-A=-0.015625`, `C-A=0.0`, with substantially lower latency overhead than `r4`.
  - a stricter `r6` variant reached `B-A=0.0` but regressed `C-A` (`-0.03125`), so the `r5` adaptive defaults remain preferred.
- Qwen3.5 Phase 5 run (`r3`) after the fairness fix shows no awareness regressions (all `B-A` and `C-A` deltas are `0.0`), but no quality uplift yet.
- Decode-stop semantics are now aligned to end-marker stopping (not `im_start`), with token-level control-fragment filtering default-on and sanitize still opt-in (`2026-03-02`); AWS qwen05 probes no longer emit the prior `"<|im"` leak.
- Tokenizer encode parity fix landed for chat templates (`2026-03-02`): `<|...|>` control tokens are now emitted as atomic special tokens in the BPE path instead of punctuation fragments.
- HTTP fallback behavior fix landed (`2026-03-02`): when inference succeeds but content is empty, the API now returns empty assistant content instead of synthetic route-classifier text.
- qwen05 deterministic MCQ token-0 stop parity gap is now resolved (`2026-03-02`) via Qwen default system preamble injection for user-only chats in the runtime HTTP template build.
- Post-fix qwen05 external-cold validation is complete (`2026-03-02`):
  - runtime now returns non-empty completions on the prior failing path (`external_cold_qwen05_templatefix_20260302T154019Z.json`),
  - TTFT remains strongly ahead of vLLM in this profile (`1.703 ms` vs `49.759 ms`).
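The atomic special-token encode fix above amounts to splitting out registered control tokens before the ordinary BPE pass. This is a toy illustration under assumed token ids and an assumed fallback encoder; the real tokenizer's tables and ids differ.

```python
import re

# Hypothetical special-token table; real vocabularies assign different ids.
SPECIALS = {"<|im_start|>": 1, "<|im_end|>": 2}

def encode(text, bpe_fallback):
    """Encode text, keeping registered <|...|> control tokens atomic.

    Specials are split out first so they map to single ids; only the
    remaining spans go through the BPE path. Without this step, BPE
    fragments the markers into punctuation pieces and decode-stop
    matching on them breaks.
    """
    pattern = "(" + "|".join(re.escape(s) for s in SPECIALS) + ")"
    ids = []
    for part in re.split(pattern, text):
        if part in SPECIALS:
            ids.append(SPECIALS[part])      # atomic special token
        elif part:
            ids.extend(bpe_fallback(part))  # ordinary BPE path
    return ids
```

The capturing group in `re.split` is what keeps the delimiters in the output so they can be mapped to their ids.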
- Phase 5 real-benchmark first canonical diagnostic pack is now complete (`2026-03-01`, `r5`): after runtime prompt/tokenizer fixes, `gpqa_diamond` and `ifeval` improved materially (`A=0.500` and `A=0.5625` respectively), while `gsm8k`/`aime25` remain at `0.0` in this setup.
- Phase 5 matched-depth/matched-sample rerun on `qwen` after this fix (`r9`, `2026-03-02`) is now complete:
  - `gpqa_diamond` dropped (`A 0.500 -> 0.125`) vs `r5`,
  - `gsm8k` recovered materially (`A 0.000 -> 0.625`, `C 0.000 -> 0.750`),
  - `aime25` remains weak but Arm C is non-zero (`0.125`),
  - current interpretation stays mixed-by-task (not a universal quality win yet).
- Tokenizer/runtime root-cause fixes landed for Phase 5 quality debugging: message aggregation (`system+user`), prompt-cap default (`32 -> 256`), BPE merges, `added_tokens` load, and UTF-8 JSON escape handling.
- Qwen-template auto A/B run (`r6`, `2026-03-01`) regressed both quality and latency versus `r5`; the template path remains opt-in and non-canonical.
- Phase 5 HF-reference parity is now complete on the same sampled set (`phase5_hf_reference_qwen_r5_20260301T1900Z.json`): runtime is higher on GPQA, slightly lower on IFEval, and tied at `0.0` on GSM8K/AIME (so math-task zeros are not runtime-only breakage in this setup).
- Real-benchmark awareness A/B/C harness is now implemented (`2026-02-28`) for `gpqa_diamond`, `ifeval`, `gsm8k`, and `aime25` (`scripts/phase5_awareness_realbench.py` + run wrapper); the first canonical diagnostic run pack is now published (`r5`).
- Higher-N same-window runtime-vLLM rerun (`AB5`, `2026-02-28`) keeps runtime ahead on full request path (`1184.812 ms` vs `1318.675 ms`, `vLLM/runtime=1.113x`) and cold-total first response (`5.810x` ratio).
- Post-AB5 full-depth gate sweep (`AB2` + delayed-Lt `AB3`, `2026-02-28`) did not unlock a new canonical toggle: delayed-Lt failed mixed-load confirmation and `proj_fast` remained mixed/noise.
- Tuned delayed-Lt slow-gate rescue (`AB2`, `2026-02-28`) also stayed non-promotable: warm remained slightly positive but mixed stayed near-flat with a p99 regression, so delayed-Lt is still non-canonical.
- FFN-proj mixed-input fallback patch (`2026-02-28`) removes repeated failed batched2 GEMM attempts under forced-Lt stress, but the canonical re-gate still leaves `f32_input` non-promotable on the default path.
- Full-depth qwen-focused rerun on the clean inference path (`pool=16384`, no fallback) showed a positive `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1` signal, but the full foundation gate later rejected global promotion (canonical stays default-off).
- Internal routing beats external routing on matched benchmark tasks.
- Cold start bottlenecks were decomposed stage-by-stage; `model_tensor_index_build` is no longer a dominant stage.
- Remaining cold cost is now concentrated mostly in Qwen `decoder_tensor_upload`.
- External cold-start comparison (runtime vs PyTorch vs vLLM vs Ollama) now has a canonical G5 artifact with explicit request-path vs total-cold interpretation.
- After decoder loop and sampling fixes, parity-corrected 48-token request path now beats vLLM on TTFT and full latency in latest G5 run.
- First routing failure-amplification stress profile now shows external retry/timeout chains increase both latency and error rate vs internal.
- Routing matrix expansion on G5 confirms this trend across 6 profiles (baseline + escalating stress).
- Cross-host routing pilot (local client + SSH tunnel to G5 runtime/controller) now reproduces external-path degradation under stress.
- Split-host routing matrix (CPU router host + GPU runtime host) is now complete as canonical Track B evidence.
- Qwen cold upload now has a direct on/off ablation for GPU-side BF16/F16 conversion, showing large cold-path reduction.
- External-cold runtime-only rerun confirms the same fix improves startup+cold-total, not just first-hit request latency.
- Runtime-vLLM external-cold repeatability rerun now confirms the same direction with restored vLLM environment.
- External-cold all-backend repeatability (`runtime + PyTorch + vLLM + Ollama`) is now complete after the GPU-convert fix.
- Runtime host-prefetch cold fix now removes the intermittent preload upload outlier while preserving request-path TTFT/full latency.
- Staged H2D upload (`TRENI_TENSOR_H2D_STAGING`) is now benchmarked with a chunk-size follow-up and is currently regressed on G5, so the path is parked opt-in/default-off.
- Non-staging H2D chunk-size tuning (`TRENI_TENSOR_H2D_CHUNK_MB=0/64/128`) was initially near-neutral on this profile; later `2026-02-28` full-depth AB3 reruns promoted default `0` (see the newer entry).
- Host page-touch pre-fault upload path (`TRENI_TENSOR_HOST_TOUCH`) is now implemented and benchmarked; it shifts time from H2D to prefetch and regresses request latency in this profile, so it remains opt-in/default-off.
- Upload sync diagnostics now isolate cold upload composition: conversion is measurable when synchronized, but H2D transfer remains the dominant stage.
- Synchronized host-register diagnostics now confirm no meaningful transfer benefit on this profile, so that lane is currently deprioritized.
- Decoder logits u16 mixed-precision path is now implemented/benchmarked; despite slight cold-upload reduction, request-path latency regresses and the lane remains parked.
- Tensor-cache hash lookup lane (`TRENI_TENSOR_CACHE_HASH`) is now implemented/benchmarked and remains near-neutral in this profile with a slight warm `p99` regression, so it stays opt-in/default-off.
- Sampler direct-store lane (`TRENI_SAMPLE_DIRECT_STORE`) is now implemented/benchmarked and regresses warm request latency in this profile, so it stays opt-in/default-off.
- Decoder direct-out residual lane (`TRENI_DECODER_DIRECT_OUT_HIDDEN`) initially regressed on warm-profile A/B (`2026-02-24`), but was later revalidated and promoted for the current full-depth lane (`2026-02-27` late cycle).
- Multi-head seq1 attention lane (`TRENI_ATTN_SEQ1_USE_MULTIHEAD`) is now implemented/benchmarked and shows clear wins on qwen and bart request paths; it is now default-on with a bounded max-kv guard.
- External-cold repeatability after the seq1 multi-head default promotion is now complete (`2026-02-24`): runtime retained large margins vs PyTorch and vLLM while improving its own TTFT/full/cold-total vs the prior host-prefetch baseline.
- First step0 softmax/PV exp-reuse patch is now complete (`2026-02-24`) and validated on the same external-cold 3-run set: runtime remained ahead with small additional gains (sub-ms on full/cold-total).
- Second step0 shared-probability follow-up was tested on the same 3-run set and did not beat exp-reuse; it was reverted to keep the better path.
- Non-step0 decode-stage profiling is now wired (`TRENI_DECODE_STAGE_PROFILE`), and the first G5 run (`2026-02-25`) shows `decoder_stepN_logits_sample` is the dominant decode stage on qwen.
- Uncertainty capture ablation on the same run profile (`TRENI_DEMO_CAPTURE_UNCERTAINTY=1/0`) shows a measurable but secondary effect (`~6.5 ms` full-request delta at 64 tokens), confirming the main remaining hotspot is still logits+sample compute itself.
- Additional decode split profiling (`2026-02-25`) now isolates `decoder_stepN_logits_proj` from sampling; logits projection remains dominant (`~2.458 ms`), and three immediate optimization probes were near-neutral/regressed, so those code paths were reverted.
- Full-depth qwen check (`--layers 36`, `--pool-mb 16384`) is now explicitly validated; in this mode `decoder_stepN_layers` dominates, and runtime-vLLM results must be interpreted separately from the fast `--layers 2` profile.
- Runtime-vLLM cold rerun on the same host/profile (`2026-02-25`) still shows a clear runtime lead (TTFT/full/cold-total).
- Full-depth FFN u16 weight path (`TRENI_DECODER_FFN_U16_PATH=1`) now shows a material runtime uplift (`TTFT -8.8 ms`, `full -328 ms`, `cold-total full -1329 ms`) but still does not close the full request-path gap to vLLM in the latest G5 A/B.
- Full-depth `ATTN+FFN+LOGITS` u16 path now further improves runtime means (`TTFT -10.8 ms`, `full -372 ms`, `cold-total full -1374 ms` vs baseline), but request full is still slower than vLLM (`~1.365x` ratio in the latest 3-seed set).
- Post-rebuild full-depth sanity reruns (`2026-02-26`) remain aligned with the same residual-fused baseline (`~1720 ms` request full), confirming no hidden regression from recent code/instrumentation changes.
- Full-depth FFN sub-stage split (`2026-02-26`) now shows `ffn_proj` is dominated by gate/up GEMMs (`~0.101 + ~0.099 ms`) while cast is minor (`~0.005 ms`); a batched gate+up trial regressed and was reverted.
- Full-depth attention qkv fused-alias path (`TRENI_DECODER_ATTN_U16_QKV_FUSED`) is now implemented and default-on in this lane (`2026-02-26`), with 3-seed gains in runtime-only (`full -5.869 ms`) and the runtime-vLLM matrix (runtime `full -6.542 ms`).
- Full-depth FFN activation-to-u16 fused path (`TRENI_DECODER_FFN_ACT_U16_FUSED`) is now implemented and default-on in this lane (`2026-02-26`), with 3-seed gains in runtime-only (`full -10.713 ms`, `cold_full -10.696 ms`) and an improved runtime-vLLM full ratio (`1.3208x -> 1.3012x`), while strict parity remains clean (`checked=3`, `failed=0`).
- Full-depth follow-up probe cycle (`2026-02-27`) closed additional speculative lanes:
  - `TRENI_DECODER_FFN_PROJ_U16_FUSED=1` regressed slightly but consistently in both runtime-only and runtime-vLLM 3-seed sets.
  - `TRENI_LINEAR_U16_FAST_COMPUTE=1` was near-neutral/slightly regressed in the initial runtime-only 3-seed A/B (later superseded by `2026-02-28` AB5 promotion evidence).
  - `TRENI_LINEAR_LT_WORKSPACE_MB=64` and `TRENI_LINEAR_USE_LT=0` both regressed materially; canonical remains Lt on with zero workspace.
- Linear Lt runtime path is now shape-failure-scoped (no global disable on first Lt miss); 3-seed runtime-only and runtime-vLLM checks on the full-depth profile were near-neutral (`~0.05%` full-latency movement), so this is a robustness fix, not a performance unlock.
- Full-depth FFN projection batched2 lane (`TRENI_DECODER_FFN_PROJ_U16_BATCHED2`) is now implemented, benchmarked, and promoted default-on (`2026-02-27`):
  - runtime-only 3-seed: `TTFT 15.189 -> 15.018 ms`, `full 1702.190 -> 1689.991 ms`, `cold_full 4708.109 -> 4696.805 ms`.
  - runtime-vLLM 3-seed (runtime leg): `TTFT 15.207 -> 15.032 ms`, `full 1704.091 -> 1691.116 ms`, `cold_full 4710.111 -> 4697.207 ms`.
  - stage profile corroboration (`off` vs `on`): `decoder_step_profile_ffn_proj_mean 0.205 -> 0.196 ms/layer`, `decoder_stepN_layers_mean 19.140 -> 18.447 ms`.
  - strict parity remains clean in explicit-on and default-on reports (`checked=3`, `failed=0`).
- Full-depth direct-out hidden lane (`TRENI_DECODER_DIRECT_OUT_HIDDEN`) is now promoted default-on for this profile (`2026-02-27`, late cycle):
  - runtime-only 3-seed: `full 1690.855 -> 1684.908 ms`, `infer 1668.381 -> 1662.753 ms`.
  - strict parity remains clean (`week3_parity_report_directouthidden_default_20260227T184738Z.json`, `checked=3`, `failed=0`).
- Full-depth fused qkv split+bias lane (`TRENI_DECODER_QKV_SPLIT_BIAS_FUSED`) is now implemented and promoted default-on (`2026-02-27`, late cycle):
  - runtime-only 3-seed: `TTFT 14.951 -> 14.687 ms`, `full 1684.135 -> 1663.776 ms`, `cold_full 4690.132 -> 4669.847 ms`, `infer 1662.833 -> 1641.322 ms`.
  - strict parity remains clean (`week3_parity_report_qkvsplitbias_default_20260227T190739Z.json`, `checked=3`, `failed=0`).
- External-cold harness now captures completion-length signals and supports fixed-token vLLM fairness (`ignore_eos`, streamed usage capture):
  - the latest fixed-length runtime-vLLM set confirms matched `completion_tokens=64` for both sides.
  - results: runtime `TTFT=14.685 ms`, `full=1662.478 ms`; vLLM `TTFT=50.272 ms`, `full=1293.215 ms`.
  - interpretation: runtime keeps a strong TTFT advantage, but request full still trails in this full-depth configuration.
- Logits-only fast-compute hook follow-up (`TRENI_DECODER_LOGITS_U16_FAST_COMPUTE`, `2026-02-27` night) is now complete:
  - runtime-only 3-seed means: off `full=1661.945 ms` vs on `full=1662.713 ms` (no win; slight regression).
  - strict parity after hook integration remained clean (`week3_parity_report_logitsfast_hook_20260227T193756Z.json`, `checked=3`, `failed=0`).
  - decision: keep this knob disabled in the canonical full-depth lane.
- U16 tensor-cache unlock (`TRENI_TENSOR_CACHE_U16`, default-on) is now complete and claim-safe:
  - runtime-only 3-seed A/B (`off/on`) shows a large request-path drop: `full 1661.982 -> 1189.452 ms` (`-472.529 ms`), `infer 1640.118 -> 1168.883 ms`.
  - same-window runtime-vLLM A/B (`off/on`, 2 seeds each) flips the request-full ordering:
    - off: runtime `1663.314 ms` vs vLLM `1325.189 ms` (runtime slower)
    - on: runtime `1192.145 ms` vs vLLM `1290.816 ms` (runtime faster)
  - log mechanism check confirms the measured-request upload collapse:
    - off: `decoder_tensor_upload ~476 ms`, `decoder_tensor_h2d ~468 ms`
    - on: `decoder_tensor_upload ~5 ms`, `decoder_tensor_h2d 0 ms`
  - strict parity remains clean on the final default-on build (`week3_parity_report_u16cache_toggle_default_20260227T200652Z.json`, `checked=3`, `failed=0`).
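The mechanism behind the upload collapse above can be modeled as a keyed device cache: once a converted tensor is resident, repeat requests skip the host-to-device transfer entirely. This dict-based sketch is illustrative only and says nothing about the runtime's actual cache design.

```python
class TensorCache:
    """Toy model of a device-side tensor cache.

    upload_fn stands in for the expensive conversion + H2D transfer;
    after the first request for a tensor, lookups are hits and the
    measured-request upload cost goes to ~zero, which is the effect
    seen in the off/on decoder_tensor_upload numbers.
    """

    def __init__(self, upload_fn):
        self._upload = upload_fn
        self._device = {}
        self.upload_count = 0

    def get(self, name, host_tensor):
        if name not in self._device:
            self._device[name] = self._upload(host_tensor)
            self.upload_count += 1
        return self._device[name]
```

The strict-parity requirement is what makes this promotion claim-safe: cached u16 tensors must produce byte-identical checked outputs before the default flips on.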
- Late-night FFN retest cycle (`2026-02-27`) is complete with no new canonical promotion:
  - `TRENI_LINEAR_BATCHED2_USE_LT=1` regressed materially in runtime-only full-depth A/B (`full +12.469 ms`).
  - higher-N `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=1` + `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1` remained near-noise (`full -0.198 ms`).
  - fused-path bias-deferral expansion for `TRENI_DECODER_FFN_PROJ_U16_FUSED=1` produced only near-noise movement (`full -0.383 ms`).
  - consolidated artifact: `external_cold_layers36_ffn_followup_summary_20260227T223458Z`.
- Fast-profile logits fast-compute AB8 rerun (`2026-02-28`, `--layers 2`) remains near-noise (`full -0.299 ms`), with the stage profile unchanged (`decoder_stepN_logits_proj_mean ~1.261 ms`), so no promotion from this lane.
- Mixed-load repeatability rerun (`2026-02-28`, canonical lane, `3x120` requests) is stable: mean `122.247 ms`, p95 `198.518 ms`, p99 `199.608 ms`.
- Strict parity follow-up on the latest patched build (`2026-02-28`) passed (`checked=3`, `failed=0`).
- Runtime benchmark stage parser fix (`2026-02-28`): `phase2_runtime_benchmark.py` now correctly parses decimal `timing stage=... ms=...` values (the previous regex truncated to integer prefixes). Request-level TTFT/infer/full metrics were unaffected; stage telemetry is now reliable for hotspot ranking.
- Full-depth parser-fixed profile reruns (`cold_profile_qwen_layers36_fixparse_20260228T011037Z`, `warm_profile_qwen_layers36_fixparse_20260228T011037Z`) reconfirm FFN dominance in layer compute (`ffn_proj ~0.366 ms/layer`, `ffn_down_resid ~0.190 ms/layer`, `step_total ~0.705 ms/layer`).
- Full-depth warm AB3 for `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0/1` (`ffn_fast_compute_ab3_20260228T011146Z_summary`) regressed slightly (`request +0.317 ms`, `infer +0.305 ms`) with no stage win in that cycle; the lane stayed non-canonical until later clean-path reruns.
- Strided batched-Lt fallback for batched2 FFN was implemented and tested (`batched2lt_strided_ab3_20260228T011651Z_summary`); warm AB3 was near-noise and runtime-only external-cold sanity was slightly worse, so the path remains opt-in and not promoted.
- FFN gate/up dual-bias fused add (
`TRENI_DECODER_FFN_BIAS_PAIR_FUSED`) now has a full-depth A/B set (`2026-02-28`):
  - warm AB3 showed a small request-path improvement (`request -0.229 ms`, `p99 -0.390 ms`, `infer -0.090 ms`) with near-flat TTFT (`+0.009 ms`);
  - cold follow-up (3 seeds each) regressed slightly (`full +1.928 ms`, `infer +1.875 ms`), so the path is currently non-canonical.
seq1split-GEMM lane (TRENI_LINEAR_BATCHED2_SPLIT_SEQ1) now has a full-depth warm/cold AB3 set (2026-02-28):- warm AB3 was near-noise/slightly worse (
request +0.014 ms,infer +0.105 ms,p99 +0.124 ms); - cold AB3 improved slightly (
full -2.070 ms,infer -2.002 ms,ttft -0.021 ms); - decision: keep opt-in only, not canonical, because warm path does not improve.
- warm AB3 was near-noise/slightly worse (
- Batched2 dup-input strided lane (
TRENI_LINEAR_BATCHED2_DUP_INPUT) now has a full-depth warm/cold AB3 set (2026-02-28):- warm AB3 regressed slightly on means (
request +0.317 ms,infer +0.293 ms,ttft +0.009 ms) with minor p99 improvement (-0.208 ms); - cold AB3 also regressed (
full +1.307 ms,infer +1.388 ms,ttft +0.010 ms); - decision: keep opt-in and non-canonical.
- warm AB3 regressed slightly on means (
- Batched2 dup-input v2 kernel swap probe (
2026-02-28) was run as a warm AB2 gate set (batched2_dupinput_v2warm_ab2_20260228T032741Z) and regressed all warm means (request +0.438 ms,infer +0.381 ms,p99 +0.217 ms); probe was rejected and reverted before AB3 expansion. - FFN projection fused-lane gate rerun (
TRENI_DECODER_FFN_PROJ_U16_FUSED=0/1, warm AB2,2026-02-28) remained near-flat/slightly worse on means (request +0.149 ms,infer +0.173 ms); no AB3 expansion. - FFN projection batched2 f32-input gate rerun (
TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0/1, warm AB2,2026-02-28) regressed (request +0.236 ms,infer +0.248 ms,p99 +0.512 ms); no AB3 expansion. - Linear u16 compute16f gate probe (
2026-02-28, warm AB2,TRENI_LINEAR_U16_FORCE_COMPUTE_16F=0/1) regressed (request +0.210 ms,infer +0.240 ms,p99 +0.594 ms) and was rejected/reverted; no AB3 expansion. - Explicit-u16 full-depth warm rerun (
qwen,layers=36,2026-02-28) confirms active decode split in this lane:decoder_step_profile_total_mean ~0.402 ms,ffn_proj ~0.196 ms,ffn_down_resid ~0.099 ms. - Experimental FFN gate/up pair-pack lane (
TRENI_DECODER_FFN_PAIR_PACK_U16,ffn_pair_pack_gate_ab2_20260228T040616Z) now has AB3 results:- warm AB3 delta (
on-off): request-0.423 ms, infer-0.442 ms, p99-0.673 ms; - both off/on runs already had contiguous gate/up pair active, so this is not a causal promotion signal.
- decision: keep lane default-off and experimental.
- warm AB3 delta (
- Batched2 Lt rerun on the explicit-u16 lane (`TRENI_LINEAR_BATCHED2_USE_LT`) is now split by warm/cold evidence:
  - warm AB3 (`batched2_use_lt_u16lane_gate_ab2_20260228T041041Z`): request `-0.313 ms`, infer `-0.468 ms`, p99 `-0.511 ms`;
  - cold AB3 (`batched2_use_lt_u16lane_cold_ab2_20260228T041359Z`): full `+1.165 ms`, infer `+1.424 ms`.
  - fixed-on decision: keep non-canonical.
- Adaptive delayed batched2 Lt policy (`TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS`) has warm/cold wins but is not canonical (`2026-02-28`):
  - `5000ms` AB3 (`batched2_lt_enable_after_ms5000_warm_ab3_20260228T104525Z`, `batched2_lt_enable_after_ms5000_cold_ab3_20260228T104712Z`) stayed mixed (warm gain, cold full `+0.422 ms`).
  - `10000ms` AB3 (`batched2_lt_enable_after_ms10000_warm_ab3_20260228T105028Z`, `batched2_lt_enable_after_ms10000_cold_ab3_20260228T105213Z`) is net-positive:
    - warm delta: request `-0.363 ms`, infer `-0.326 ms`, p99 `-0.696 ms`;
    - cold delta: startup `-4.307 ms`, full `-0.635 ms`, infer `-0.347 ms`, TTFT `-0.070 ms`.
  - strict parity pass: `week3_parity_report_batched2_lt_delay10000_20260228T105329Z.json` (`checked=3`, `failed=0`).
  - default-path strict parity smoke (without explicit batched2 Lt env overrides) also passed: `week3_parity_report_batched2_lt_defaultdelay_20260228T110825Z.json`.
  - same-window mixed-load A/B (`mixed_load_defaultdelay_onoff_ab3_20260228T115010Z.json`) regressed with delayed-on (`mean +0.846 ms`, `p95 +1.627 ms`, `p99 +0.679 ms`).
  - parser defaults remain off: `TRENI_LINEAR_BATCHED2_USE_LT=0` and `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0`.
  - post-revert strict default-path parity pass: `week3_parity_report_postrevert_defaults_20260228T115543Z.json`.
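The delayed-enable idea behind `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS` can be sketched as a clock-gated toggle; this is a minimal illustration under assumed class and method names, and a `0` value keeping the path off mirrors the parser default above.

```python
import time

class DelayedToggle:
    """Keep a fast path off during startup, enable it after a fixed delay.

    Models the enable-after-ms policy: cold requests run without the Lt
    path (which regressed cold startup when fixed-on), while long-lived
    warm processes eventually switch it on.
    """

    def __init__(self, enable_after_ms):
        self.enable_after_ms = enable_after_ms
        self.start = time.monotonic()

    def lt_enabled(self, now=None):
        if self.enable_after_ms <= 0:
            return False  # default: delayed path disabled entirely
        now = time.monotonic() if now is None else now
        return (now - self.start) * 1000.0 >= self.enable_after_ms
```

The mixed-load regression above shows the limitation of any purely time-based gate: it cannot distinguish a warm steady state from interleaved cold-ish work.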
- Parser-default foundation rerun pack (`foundation_defaultdelay_pack_20260228T114315Z`) is now published:
  - warm AB3 means: request `147.258 ms`, p99 `247.617 ms`, infer `128.450 ms`, TTFT `16.999 ms`;
  - cold AB3 means: startup `425.532 ms`, full `598.787 ms`, infer `580.173 ms`, TTFT `12.210 ms`;
  - mixed repeatability stayed worse than the prior canonical summary (`mixed_load_repeatability_compare_defaultdelay_vs_prev_20260228T114748Z.json`: mean `+2.841 ms`, p95 `+5.587 ms`, p99 `+5.140 ms`), which aligns with keeping delayed-on non-canonical.
- Experimental FFN batched2 Lt prewarm path (`TRENI_DECODER_FFN_BATCHED2_LT_PREWARM`) is now implemented and benchmarked:
  - fixed-Lt warm AB2 (`batched2_lt_prewarm_warm_ab2_20260228T042453Z`): request `-0.328 ms`, infer `-0.394 ms`;
  - fixed-Lt cold AB3 (`batched2_lt_prewarm_cold_ab3_20260228T042649Z`): full `-1.497 ms`, infer `-1.406 ms`.
- Direct same-window combo A/B (`lt=0,prewarm=0` vs `lt=1,prewarm=1`) remains mixed:
  - the combined summary (`batched2_lt_prewarm_combo_summary_20260228T042733Z.json`) shows a warm AB3 regression (`request +0.198 ms`, `infer +0.178 ms`) despite a cold AB3 improvement (`full -1.099 ms`, `infer -0.819 ms`).
  - decision: keep the prewarm path default-off and non-canonical.
- `TRENI_DECODER_FFN_DOWN_U16_FAST_COMPUTE` now has canonical full-depth evidence and is promoted default-on (`2026-02-28`):
  - warm AB3 (`ffn_down_fast_compute_gate_ab3_20260228T044546Z`): request `-0.565 ms`, infer `-0.566 ms`, p99 `-1.405 ms`, TTFT `-0.030 ms`.
  - cold AB3 (`ffn_down_fast_compute_cold_ab3_20260228T044753Z`): startup `-8.405 ms`, full `-0.351 ms`, infer `-0.406 ms`, TTFT `-0.028 ms`.
  - strict parity pass (`week3_parity_report_ffn_down_fast_20260228T044846Z.json`): `checked=3`, `failed=0`.
- Post-promotion FFN retest matrix (`2026-02-28`) is complete and did not produce a second promotion:
  - the new structural stacked-GEMM lane (`TRENI_LINEAR_BATCHED2_STACKED_SEQ1`) regressed in warm AB3 (`request +1.259 ms`, `infer +1.229 ms`, `p99 +2.830 ms`) and stayed near-flat/slightly worse in cold AB3 (`full +0.030 ms`), so it remains experimental/default-off.
  - `TRENI_LINEAR_BATCHED2_SPLIT_SEQ1` AB3 regressed warm (`request +0.964 ms`) and cold (`full +1.496 ms`).
  - `TRENI_LINEAR_BATCHED2_USE_LT` fixed-on AB3 improved warm (`request -0.855 ms`) but still regressed cold startup/full (`startup +10.474 ms`, `full +0.330 ms`); delayed-on improved warm/cold but still regressed mixed-load, so the lane remains non-canonical.
  - the combo `lt=1 + prewarm=1` gave AB3 gains but failed AB5 cold confirmation (`startup +3.199 ms`, `full +1.152 ms`).
  - `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1` remained non-canonical at that stage.
- Follow-up rerun cycle (`2026-02-28` late) promoted `TRENI_LINEAR_U16_FAST_COMPUTE` after higher-N validation:
  - warm+mixed AB5 (`linearfast_ab5_20260228T124736Z/summary_ab5.json`) was positive in both modes:
    - warm `on-off`: request `-0.139 ms`, p95 `-0.128 ms`, p99 `-0.009 ms`;
    - mixed `on-off`: request `-0.139 ms`, p95 `-0.156 ms`, p99 `-0.208 ms`.
  - cold AB3 (`linearfast_cold_ab3_20260228T124510Z/summary_ab3.json`) stayed near-flat on full latency (`+0.302 ms`) with better startup (`-4.207 ms`) and TTFT (`-0.019 ms`).
  - strict parity passed (`week3_parity_report_linearfast_20260228T124557Z.json`, `checked=3`, `failed=0`).
  - post-default strict parity smoke also passed (`week3_parity_report_post_linearfast_default_20260228T125804Z.json`).
  - same-window default-vs-forced-off sanity (`linearfast_default_sanity_20260228T125957Z`) is directionally positive on the mixed request path (`mean -0.603 ms`, `p95 -0.984 ms`, `p99 +0.029 ms`).
  - the runtime parser default is now `TRENI_LINEAR_U16_FAST_COMPUTE=1`.
- Full-depth FFN projection fast-compute rerun (`2026-02-28`, late 8) completed on the clean path (`TRENI_POOL_MB=16384`, classifier-disabled HTTP lane):
  - profiled AB3 (`ffnprojfast_fullstep_ab3_20260228T160255Z/summary_ab3.json`), `on-off`: request `-0.370 ms`, infer `-0.348 ms`, p99 `-0.533 ms`, TTFT `-0.045 ms`.
  - non-profiled warm AB3 (`ffnprojfast_fullwarm_ab3_20260228T160358Z/summary_ab3.json`), `on-off`: request `-0.249 ms`, infer `-0.225 ms`, p99 `-0.328 ms`, TTFT `-0.015 ms`.
  - strict parity passed with the explicit candidate env and on the temporary promoted build:
    - `week3_parity_report_ffnprojfast_candidate_20260228T160459Z.json`
    - `week3_parity_report_ffnprojfast_default_20260228T160639Z.json`
  - post-promotion sanity AB3 (`ffnprojfast_default_sanity_ab3_20260228T160557Z/summary_ab3.json`) stayed near-flat and directionally positive on means (`default-force_off` request `-0.094 ms`, infer `-0.093 ms`), with a tiny p99 increase (`+0.057 ms`).
  - interim decision in that cycle: the candidate looked positive and moved to full foundation validation.
- Full foundation validation then rejected global promotion (`2026-02-28`, late 9):
  - the foundation pack (`foundation_ffnprojfastdefault_pack_20260228T194204Z/summary_ab3.json`) was slower versus the prior canonical in all modes (warm/cold/mixed means).
  - same-window foundation gate AB2 (`foundation_ffnprojfast_gate_ab2_20260228T195240Z/summary_gate_ab2.json`): `default-force_off` warm request `+0.489 ms`, cold full `+0.746 ms`, mixed mean `+0.004 ms` (tails improved).
  - final decision: keep the parser canonical default `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0` and retain the lane as opt-in.
- Canonical rerun on the promoted default (`2026-02-28` late 2) is now published:
  - foundation pack: `foundation_linearfastdefault_pack_20260228T134157Z` (`summary_ab3.json`).
  - versus the prior parser-default foundation (`20260228T114315Z`): warm/cold were near-flat/slightly slower, while mixed improved (`request -0.629 ms`, `p95 -1.281 ms`, `p99 -0.163 ms`).
- Same-window runtime-vLLM full-depth AB3 rerun on the updated canonical lane (`aws_speedpass_runtime_vllm_linearfastdefault_ab3_20260228T134630Z`) now shows runtime ahead on request full latency:
  - runtime `1185.186 ms` vs vLLM `1305.971 ms` (vLLM/runtime full = `1.102x`).
  - cold-total-first-response and cold-total-first-token remain dominated by vLLM process startup in this harness profile (`5.807x` and `7.648x` over runtime, respectively).
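The headline comparison above is a plain quotient of mean full-request latencies, with values above 1.0 meaning the runtime is ahead. A one-line sketch using the numbers from this rerun:

```python
# vLLM/runtime ratio convention used throughout this changelog:
# ratio > 1.0 means the runtime's mean latency is lower (runtime ahead).
def ratio(vllm_ms: float, runtime_ms: float) -> float:
    return round(vllm_ms / runtime_ms, 3)

runtime_full_ms = 1185.186
vllm_full_ms = 1305.971
print(ratio(vllm_full_ms, runtime_full_ms))  # -> 1.102
```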
- Batched2-Lt fast-fallback short-circuit experiment (2026-02-28) was evaluated and reverted:
  - isolation AB3 (`fastfallback_isolation_ab3_20260228T140122Z/summary_ab3.json`) showed a warm regression (request `+1.155 ms`, p95 `+2.124 ms`, p99 `+1.504 ms`) and mixed near-flat to slightly worse (mean `+0.144 ms`, p95 `+0.569 ms`), despite a cold full improvement (`-0.846 ms`).
  - decision: keep reverted (non-canonical).
  - post-revert strict parity remains clean (`week3_parity_report_post_fastfallback_revert_20260228T140626Z.json`).
- `TRENI_TENSOR_H2D_CHUNK_MB` was re-tested on the current full-depth canonical lane (2026-02-28) and the default was promoted to `0` (no chunking):
  - cold AB3 (`h2d_chunk_cold_ab3_20260228T142114Z/summary_ab3.json`), chunk0 - chunk64: startup `-4.022 ms`, full `-2.562 ms`, infer `-2.542 ms`, TTFT `-0.060 ms`; `decoder_tensor_h2d` `-3.347 ms`.
  - warm+mixed AB3 (`h2d_chunk_warm_mixed_ab3_20260228T142258Z/summary_ab3.json`), chunk0 - chunk64: warm request `-0.442 ms`; mixed request `-0.044 ms`.
  - strict parity after promotion passed (`week3_parity_report_h2dchunk0_default_20260228T142805Z.json`).
  - single-run sanity (`h2d_chunk_default_vs64_sanity_20260228T142845Z`) showed small mixed sensitivity (default-force64 mean `+0.340 ms`), so this remains on repeatability watch.
- Higher-N same-window runtime-vLLM rerun (AB5, updated defaults, 2026-02-28) now tightens the full-depth claim:
  - run root: `benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z`
  - summary: `summary_ab5.json` / `summary_ab5.md`
  - means:
    - runtime: full `1184.812 ms`, TTFT `14.640 ms`, cold-total full `4190.848 ms`
    - vLLM: full `1318.675 ms`, TTFT `50.309 ms`, cold-total full `24350.818 ms`
  - ratios (vLLM/runtime): full `1.113x`, TTFT `3.436x`, cold-total full `5.810x`.
  - compare vs prior AB3 (`compare_vs_prev_linearfastdefault_ab3.json`/`.md`): runtime full improved slightly (`-0.375 ms`) and the full-ratio direction strengthened (`1.102x -> 1.113x`).
- Post-AB5 full-depth gate sweep on current defaults (2026-02-28) is now complete and does not add a new canonical lane:
  - gate artifact: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_gate_newdefaults_20260228T150709Z/summary_gate_ab2.json`
  - delayed-Lt was directionally positive in AB2 and advanced to AB3 (warm request `-0.384 ms`, mixed request `-0.256 ms`), while `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1` remained mixed/noise (warm p99 `+0.129 ms`, mixed p99 `+0.022 ms`).
  - delayed-Lt AB3 confirmation artifact: `benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_delayedlt_ab3_20260228T151322Z/summary_ab3.json`
  - AB3 result split:
    - warm on-off: request `-0.330 ms`, infer `-0.270 ms`, p99 `-0.098 ms`;
    - mixed on-off: request `+0.173 ms`, infer `+0.191 ms`, p99 `+0.291 ms`.
  - decision at that stage: keep delayed-Lt non-canonical on defaults (`TRENI_LINEAR_BATCHED2_USE_LT=0`, `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0`); `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0` stayed canonical.
- Full-depth stage-profile refresh (`external_cold_layers36_stageprofile_20260227T175604Z`) reconfirms the remaining decode hotspot hierarchy:
  - `decoder_stepN_layers_mean ~19.107 ms` (dominant)
  - `decoder_stepN_logits_proj_mean ~1.260 ms`
  - `decoder_stepN_sample_mean ~0.331 ms`
  - layer split still led by FFN projection (`decoder_step_profile_ffn_proj_mean ~0.204 ms`/layer).
- Phase 3 loop-capability now has a canonical G5 set (baseline+stress, 3 seeds each) with consolidated summary artifacts.
- Canonical Phase 3 shows internal loops keep 100% success while external loops lose success in tool-state adaptation and pay large hop/retry latency amplification.
- Phase 3 uncertainty ablation now has a first baseline matrix (`runs=8`) showing success drops when uncertainty-aware branching is disabled on either the internal or external path.
- Phase 3 uncertainty ablation now has 3-seed baseline+stress repeatability with a consolidated baseline-vs-stress comparison.
- Runtime now exports request uncertainty in HTTP responses, and the Phase 3 C2 harness now supports the `runtime_native` uncertainty source.
- The runtime response contract now includes unified `awareness` (route+generation) while preserving legacy `uncertainty`; the Phase 3 runtime-native client now consumes the unified payload first.
- Runtime-native C2 calibrated rerun (`calib1`) is complete; with zero-fallback probes, uncertainty-on deltas are again positive in baseline+stress.
- Phase 4 kickoff on Lambda A100/H100 is complete for Track C loops; both hardware classes preserve 100% internal success and show the same external latency amplification pattern.
- Phase 4 full Lambda reruns are now complete (A100 + H100): Phase 2 cold/hot, routing matrix, and C2 runtime-native calibrated sets are locked with raw artifacts.
- A paper-grade package is now generated from the canonical G5 + Lambda A100 + Lambda H100 sets (`benchmarks/paper_package/latest`).
- The paper package now includes manuscript-ready assets (`manuscript/captions.md`, `manuscript/claims.md`, `manuscript/figure_manifest.json`, mermaid figure specs).
- Internet multi-hop commercial routing matrices are now available (Fly.io controller/tool hops to OpenAI + OpenRouter).
- Track B commercial control set has now been rerun with fairness-hardened harness controls (interleaved order + deterministic defaults + token normalization + strict tool parity on tool tasks), narrowing claim scope to task-family-stratified statements.
- AWS G5 speedpass validation is now complete for the new kernel/cold pass: disabling per-tensor upload sync delivers the measurable gain (`~1.03x` cold full, `~1.01x` warm mean, `~1.03x` warm p99), while initial `cublasLt` was near-parity on warm/full and did not improve TTFT.
- AWS G5 TTFT-focused kernel pass is now complete: softmax-only was near-parity, then row-parallel norm kernels (`rmsnorm`/`layernorm`) delivered a clear lift (`~1.20x` cold TTFT, `~1.18x` warm mean in the best `lt1_sync0` config).
- AWS G5 TTFT follow-up is now complete: `seq_q=1` tiny attention kernels plus direct K/V cache writes further improved the TTFT/warm path and materially reduced Bart TTFT (`16.573 -> 12.842 ms`).
- Week-3 parity is now fully locked on AWS after two fixes: the parser handles interleaved stderr/stdout runtime logs, and a rebuilt parity container (`qwen+bart+minilm`) removed invalid `minilm` offsets so strict external-HF parity passes.
- Runtime now has a strict attention backend selector (`TRENI_ATTN_BACKEND`) plus an A/B harness (`custom` vs `cudnn_sdpa` proxy), and Phase 3 now supports file-backed `realistic_v1` fixtures to reduce synthetic benchmark bias.
- AWS G5 attention backend A/B rerun with reversed call order confirms near-parity between the `custom` and `cudnn_sdpa` proxy paths; the earlier large cold delta was call-order/cache bias.
- Phase 3 realistic-v1 reruns are now complete (baseline+stress, 3 seeds each) with the strong internal-loop advantage preserved; the realistic-v1 uncertainty ablation baseline+stress pair is also published.
- Attention runtime now caches backend env config once per process and includes a seq1 hybrid tuning matrix (`custom`, `qk-cublas`, `pv-cublas`, `both-cublas`) with warm/cold tradeoff data.
- AWS G5 seq1 fused-softmax/PV follow-up is now complete: the default custom request path improved again on warm and cold (`seq1_hybrid_fused_20260222T192656Z`).
- H100 fused cuDNN SDPA probe pack is now published; the current backend descriptor path still yields no viable fused SDPA engine configs (`cudnn_sdpa_h100_probe_20260222T1935Z`).
- True fused cuDNN frontend SDPA path is now integrated and validated on G5 (`attn_backend_ab_frontend_20260222T220111Z`): the warm path is near parity on fixed warmed shapes, but cold/mixed still regress due to expensive frontend plan-build misses.
- Fused frontend profiling now quantifies the miss root cause (`cudnn_frontend_profile_probe_20260222T2204Z`): plan-build misses are `~705 ms` each on A10G, while pack/execute/unpack costs are negligible.
- The frontend A/B harness now hard-fails contamination when the fused marker is absent or the runtime was compiled with `TRENI_WITH_CUDNN=0`.
- Frontend repeatability matrix is now complete (`attn_backend_frontend_matrix_20260222T221948Z`, `repeats=3` for `warm_fixed`+`mixed_churn`): custom wins all tracked metrics (`3/3` per metric) in both profiles.
- Frontend claim-strength report is now published (`attn_backend_frontend_claim_report_20260222T222958Z`) with paired delta CI95 summaries for each latency metric/profile.
- Grouped commercial root-cause report is now published (`commercial_gap_root_cause_20260222T222958Z`) and indicates current fairness splits are still parity/noise dominated at present sample sizes.
- Fused frontend miss tracing is now explicit (`TRENI_ATTN_CUDNN_FRONTEND_TRACE_MISSES`) and confirms misses are concentrated in decode-step `seq_q=1` shape growth (`seq_kv=2..10` in the probe).
- Startup multi-prompt preload mitigation (`TRENI_HTTP_PRELOAD_PROMPTS`) is now benchmarked on G5 and materially reduces mixed-churn/full-latency spikes for the fused frontend while keeping custom faster overall.
- Hybrid shape-gated frontend policy is now validated on G5 (2026-02-23): startup prebuild overhead drops from `~7.0 s` to `~2.0 s` while no-preload fused TTFT/full remain low and strict inference-valid on the fixed harness profile; the bounded-gate follow-up removes broader-shape miss cascades by routing out-of-window shapes to custom.
- Coverage-instrumented frontend reruns are now published (2026-02-23): the runtime exports per-request attention backend counters/shares, and high fused-coverage profiles show the current fused path is still slower than custom on both warm and cold request paths.
- Execution decision (2026-02-23): park cuDNN/frontend optimization and prioritize custom-kernel best-path work.
- Custom lane implementation update (2026-02-23): added a seq1 microfused attention path (`TRENI_ATTN_SEQ1_USE_MICROFUSED`) plus cached cuBLAS stream binding.
- G5 seq1 microfused A/B (2026-02-23, qwen+bart, `max_kv=64` and `16`) shows no net win vs the custom baseline; warm mean/TTFT regress while only an isolated bart p99 improves in one profile. The path remains opt-in and defaults off.
  - summary artifact: `benchmarks/phase2_runtime/seq1_microfused_ab/seq1_microfused_ab_summary_20260223T014848Z.md`.
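The paired delta CI95 summaries mentioned for the claim-strength report can be sketched as follows. Pairing is per-repeat (custom vs fused latency from the same repeat); a normal approximation (`1.96 * SE`) is used here for brevity, whereas the real report may use a t-quantile, and the sample values below are hypothetical:

```python
# Sketch: paired-delta mean with a normal-approximation 95% CI.
# Hypothetical per-repeat latencies; the actual report's method may differ.
import statistics

def paired_delta_ci95(custom: list[float], fused: list[float]):
    deltas = [f - c for c, f in zip(custom, fused)]
    mean = statistics.mean(deltas)
    se = statistics.stdev(deltas) / len(deltas) ** 0.5
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

custom_ms = [19.324, 19.218, 19.271]
fused_ms = [21.503, 21.450, 21.468]
mean, (lo, hi) = paired_delta_ci95(custom_ms, fused_ms)
print(f"fused - custom: {mean:.3f} ms, CI95 [{lo:.3f}, {hi:.3f}]")
```

A CI that excludes zero (as here, with all deltas positive) is what backs a "custom wins this metric" claim.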
- G5 stream-cache A/B (2026-02-23, qwen+bart) for `TRENI_LINEAR_STREAM_CACHE` + `TRENI_ATTN_STREAM_CACHE` is near-neutral in short runs; keep enabled by default and focus on higher-impact kernel/cold-path work.
  - summary artifact: `benchmarks/phase2_runtime/results/stream_cache_ab_summary_20260223T015222Z.md`.
- G5 registry/model-index hash A/B (2026-02-23, qwen profile) for `TRENI_REGISTRY_LOOKUP_HASH` + `TRENI_MODEL_INDEX_NAME_HASH` showed no meaningful cold/setup improvement in this run set; kept as opt-in, defaults off.
  - summary artifact: `benchmarks/phase2_runtime/results/registry_hash_ab_summary_20260223T020353Z.md`.
- Cold-start harness fix (2026-02-23): startup health polling moved to a 50 ms cadence, removing the prior ~1 s quantization from `startup_to_healthy_ms`.
- Startup-smoke A/B with high-fidelity polling (`startup_smoke_ab_hf_20260223T030059Z`) shows skipping the startup smoke is a material cold win:
  - startup-to-healthy: `488.027 -> 404.184 ms` (`-17.18%`)
  - start-to-first-response (startup + first full): `705.454 -> 622.167 ms` (`-11.81%`)
  - request-path TTFT/full are near-flat (expected; this is startup-stage, not decoder-step optimization).
  - runtime default now matches this policy (`TRENI_SKIP_STARTUP_SMOKE=1` unless explicitly set false).
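The percentage wins quoted for the startup-smoke skip are plain relative improvements of the means; a quick check with the numbers from that entry:

```python
# Relative improvement of a mean latency, as a percentage.
def pct_improvement(before_ms: float, after_ms: float) -> float:
    return round(100.0 * (before_ms - after_ms) / before_ms, 2)

print(pct_improvement(488.027, 404.184))  # startup-to-healthy -> 17.18
print(pct_improvement(705.454, 622.167))  # start-to-first-response -> 11.81
```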
- Additional custom-cold knob probes (`TRENI_TENSOR_ENV_CACHE`, `TRENI_TENSOR_H2D_CHUNK_MB`, `TRENI_TENSOR_HOST_REGISTER`) were run on G5 and were near-neutral on this profile; no new canonical promotion from those knobs.
  - consolidated artifact: `benchmarks/phase2_runtime/results/cold_path_knob_probe_20260223T0303Z.md`.
- Per-tensor upload hotspot profiling (`TRENI_TENSOR_UPLOAD_TOPK`) is now wired into the runtime, and the first qwen cold probe shows `model.embed_tokens.weight` as the dominant cold upload stage contributor (`~79.3 ms`, `~63.8%` share in that probe).
- Container-level readahead hint (`TRENI_CONTAINER_WILLNEED`) is now benchmarked on G5 and shows a modest, repeatable cold-total improvement in an 8-run A/B (`~-1.94%` start-to-first-response).
- Runtime default now enables this readahead hint (`TRENI_CONTAINER_WILLNEED=1` unless explicitly disabled).
- Combined readahead + host-register (`TRENI_CONTAINER_WILLNEED=1`, `TRENI_TENSOR_HOST_REGISTER=1`) did not add clear gain beyond the readahead-only profile on current G5 runs.
  - consolidated artifact: `benchmarks/phase2_runtime/results/cold_upload_hotspot_summary_20260223T1915Z.md`.
- Staged H2D upload follow-up (`TRENI_TENSOR_H2D_STAGING`) is now complete on G5:
  - `min64/chunk32` (8-run A/B) regressed full latency `+21.22%` and `decoder_tensor_h2d` `+38.68%`.
  - `min64/chunk128` (3-run probe) regressed further (full `+44.43%`, `decoder_tensor_h2d` `+76.92%`).
  - decision: park the staged-H2D path for now; keep it opt-in/default-off and continue cold-path work on the non-staging custom upload/H2D.
  - consolidated artifact: `benchmarks/phase2_runtime/results/h2d_staging_followup_summary_20260224T101324Z.md`.
- Non-staging H2D chunk matrix (`TRENI_TENSOR_H2D_CHUNK_MB=0/64/128`, 8 runs each) is now complete on G5 and was near-neutral across request and upload metrics.
  - decision: keep the current chunk default policy; prioritize structural upload/H2D and `decoder_step0_layers` work.
  - consolidated artifact: `benchmarks/phase2_runtime/results/h2d_chunk_matrix_summary_20260224T101730Z.md`.
- Host page-touch pre-fault A/B (`TRENI_TENSOR_HOST_TOUCH=1`, `TRENI_TENSOR_HOST_TOUCH_MIN_MB=256`, 8 runs) is now complete on G5.
  - `decoder_tensor_h2d` improved (`-31.13 ms`) but prefetch/upload increased, causing a net request regression (full `+7.73%`, infer `+8.22%`).
  - decision: keep the host-touch path opt-in/default-off and continue cold-path work on non-regressing upload changes.
  - consolidated artifact: `benchmarks/phase2_runtime/results/host_touch_ab_summary_20260224T102444Z.md`.
- Upload sync probe (`TRENI_TENSOR_UPLOAD_SYNC=0/1`, 3 runs each) now quantifies upload composition under synchronized timing.
  - conversion rises to `~6 ms` with sync enabled, but H2D remains `~118 ms` and dominant.
  - decision: keep transfer-path optimization as the primary cold-upload focus.
  - consolidated artifact: `benchmarks/phase2_runtime/results/upload_sync_probe_summary_20260224T102618Z.md`.
- Synchronized host-register probe (`TRENI_TENSOR_HOST_REGISTER=0/1`, with `TRENI_TENSOR_UPLOAD_SYNC=1`) is now complete.
  - transfer-stage metrics stayed effectively flat and the request path slightly regressed.
  - decision: deprioritize the host-register optimization lane for current cold-upload work.
  - consolidated artifact: `benchmarks/phase2_runtime/results/host_register_sync_probe_summary_20260224T102915Z.md`.
- Decoder logits u16 A/B (`TRENI_DECODER_LOGITS_U16_PATH=0/1`) is now complete with valid inference in both arms.
  - upload/setup moved slightly in the right direction, but request-path metrics regressed materially (`ttft`, `infer`, `full`).
  - a fix2 pilot follow-up after the mixed-precision path adjustment still regressed the request path, confirming the same direction.
  - decision: keep the logits-u16 path opt-in/default-off and park it for now.
  - consolidated artifact: `benchmarks/phase2_runtime/results/logits_u16_ab_fix1_summary_20260224T105532Z.md`.
- Tensor-cache hash A/B (`TRENI_TENSOR_CACHE_HASH=0/1`) is now complete (mixed + warm 3-seed follow-up).
  - warm 3-seed request deltas are near-neutral, with a slight `p99` regression when enabled (`+0.149 ms`).
  - decision: keep the tensor-cache hash path opt-in/default-off.
  - artifacts: `benchmarks/phase2_runtime/results/tensor_cache_hash_ab_20260224T113911Z/`, `benchmarks/phase2_runtime/results/tensor_cache_hash_warm3_20260224T114126Z/`
- Sampler direct-store A/B (`TRENI_SAMPLE_DIRECT_STORE=0/1`) is now complete (3-seed warm).
  - the enabled path regressed warm request metrics (mean `+0.062 ms`, p95 `+0.076 ms`, p99 `+0.143 ms`).
  - decision: keep sampler direct-store opt-in/default-off.
  - artifact: `benchmarks/phase2_runtime/results/sample_direct_store_ab_20260224T114633Z/`.
- Decoder direct-out residual A/B (`TRENI_DECODER_DIRECT_OUT_HIDDEN=0/1`) is now complete (3-seed warm).
  - the enabled path regressed warm request and infer metrics (mean `+0.540 ms`, p95 `+0.495 ms`, p99 `+0.444 ms`, infer `+0.150 ms`).
  - decision at that time: keep the decoder direct-out path opt-in/default-off.
  - superseded for the current full-depth lane by the 2026-02-27 late-cycle rerun (direct-out promoted default-on there).
  - artifact: `benchmarks/phase2_runtime/results/direct_outhidden_ab_20260224T115051Z/`.
- Consolidated summary artifact for these custom-path probes: `benchmarks/phase2_runtime/results/custom_path_probe_summary_20260224T115602Z.md`.
- Multi-head seq1 attention A/B (`TRENI_ATTN_SEQ1_USE_MULTIHEAD=0/1`) is now complete and directionally strong.
  - qwen warm (3-seed): request mean `1.041x`, p99 `1.042x`, infer `1.074x`.
  - qwen mixed (3-seed): request mean `1.036x`, p99 `1.045x`, infer `1.074x`, cold wall `1.010x`.
  - bart warm (3-seed): request mean `1.097x`, p99 `1.112x`, TTFT `1.429x`, infer `1.185x`.
  - default sanity run (no env override) remains faster than forced-off.
  - decision: promote this path to default-on (`TRENI_ATTN_SEQ1_USE_MULTIHEAD=1`, `TRENI_ATTN_SEQ1_MULTIHEAD_MAX_KV=2048`) while retaining the off-switch fallback.
  - artifacts: `benchmarks/phase2_runtime/results/seq1_multihead_ab_20260224T125127Z/`, `benchmarks/phase2_runtime/results/seq1_multihead_bart_ab_20260224T125404Z/`, `benchmarks/phase2_runtime/results/seq1_multihead_step0_probe_20260224T125508Z/`, `benchmarks/phase2_runtime/results/seq1_multihead_default_sanity_20260224T125713Z/`, `benchmarks/phase2_runtime/results/seq1_multihead_ab_summary_20260224T125619Z.md`
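The promotion call above follows a simple rule: every tracked metric must show a speedup ratio above 1.0 in every profile before a path becomes default-on. The gating function below is an illustration of that rule, not the harness's actual code; the ratio values echo the entry:

```python
# Hypothetical promotion gate: promote only if every metric ratio (off/on,
# i.e. speedup of the enabled path) clears the threshold.
def promote(ratios: dict[str, float], threshold: float = 1.0) -> bool:
    return all(r > threshold for r in ratios.values())

qwen_warm = {"request_mean": 1.041, "p99": 1.042, "infer": 1.074}
qwen_mixed = {"request_mean": 1.036, "p99": 1.045, "infer": 1.074, "cold_wall": 1.010}
bart_warm = {"request_mean": 1.097, "p99": 1.112, "ttft": 1.429, "infer": 1.185}

print(all(promote(p) for p in (qwen_warm, qwen_mixed, bart_warm)))  # -> True
```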
- External-cold repeatability rerun after the seq1 multi-head default promotion (2026-02-24, 3 runs, runtime + PyTorch + vLLM) is now complete.
  - runtime means: startup `1003.315 ms`, TTFT `4.022 ms`, request full `239.277 ms`, cold-total first response `1242.592 ms`.
  - runtime-normalized ratios: PyTorch `127.900x` TTFT / `9.378x` full / `6.320x` cold-total; vLLM `12.350x` TTFT / `4.139x` full / `19.333x` cold-total.
  - runtime delta vs prior host-prefetch repeatability means (2026-02-19): TTFT `5.130 -> 4.022 ms`, full `316.403 -> 239.277 ms`, cold-total `1320.240 -> 1242.592 ms`.
  - note: Ollama was skipped for this rerun because the service/model were not installed in this host environment.
  - artifact: `benchmarks/phase2_external_cold/results/external_cold_seq1mh_default_repeatability_20260224T192020Z.md`.
- Step0 optimization follow-up (2026-02-24): seq1 multi-head softmax/PV now reuses normalized probabilities and avoids repeated `exp` in the inner PV accumulation loop.
  - 3-run external-cold repeatability (runtime + PyTorch + vLLM) runtime deltas vs the seq1mh baseline:
    - TTFT `4.022 -> 4.018 ms`
    - request full `239.277 -> 238.400 ms`
    - cold-total first response `1242.592 -> 1241.688 ms`
  - interpretation: a positive but small gain; further `decoder_step0_layers` work is still required for material uplift.
  - artifact: `benchmarks/phase2_external_cold/results/external_cold_step0expfix_repeatability_20260224T194226Z.md`.
- Step0 shared-probability follow-up (2026-02-24) was run and compared against the exp-reuse baseline:
  - runtime deltas vs exp-reuse means: TTFT `+0.001 ms`, request full `+0.278 ms`, cold-total first response `+0.282 ms`.
  - decision: revert this follow-up and keep exp-reuse as the current best state.
  - artifact: `benchmarks/phase2_external_cold/results/external_cold_step0shared_repeatability_20260224T194913Z.md`.
- Decode-stage and uncertainty update (2026-02-25):
  - first non-step0 profile artifact: `benchmarks/phase2_external_cold/results/external_cold_stepn_profile_20260225T001334Z.json`
  - key stage means (qwen, 64 tokens, no preload):
    - `decoder_stepN_logits_sample_mean=2.671 ms`
    - `decoder_stepN_layers_mean=1.360 ms`
  - uncertainty A/B artifacts: `benchmarks/phase2_external_cold/results/external_cold_uncert_on_20260225T001702Z.json`, `benchmarks/phase2_external_cold/results/external_cold_uncert_off_20260225T001704Z.json`
  - uncertainty A/B deltas (on -> off):
    - request full `479.889 -> 473.367 ms`
    - infer `461.771 -> 454.878 ms`
    - `decoder_stepN_logits_sample_mean 2.671 -> 2.562 ms`
  - interpretation: uncertainty overhead exists but is not the primary decode bottleneck in this profile.
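The "not the primary bottleneck" reading is just the arithmetic on the A/B above: the uncertainty path costs a few milliseconds out of a ~480 ms request.

```python
# Sizing the uncertainty overhead from the on -> off A/B numbers above.
full_on_ms, full_off_ms = 479.889, 473.367
overhead_ms = full_on_ms - full_off_ms
share = overhead_ms / full_on_ms
print(f"uncertainty overhead: {overhead_ms:.3f} ms ({share:.1%} of request full)")
# well under 2% of request full latency in this profile
```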
- Runtime-vLLM cold rerun (2026-02-25, same profile, uncertainty-off runtime):
  - artifact: `benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_uncertoff_20260225T001929Z.json`
  - runtime: TTFT `3.929 ms`, full `472.724 ms`, cold-total full `1476.116 ms`
  - vLLM: TTFT `49.577 ms`, full `1311.481 ms`, cold-total full `24344.013 ms`
  - interpretation: runtime remains decisively ahead in this cold-first-hit comparison.
- Decode stepN logits split + immediate kernel probes (2026-02-25, qwen, 64 tokens, no preload):
  - split artifacts: `benchmarks/phase2_external_cold/results/external_cold_stepn_split_20260225T081450Z.json`, `benchmarks/phase2_external_cold/results/external_cold_stepn_split_revert_20260225T082055Z.json`
  - split result:
    - `decoder_stepN_logits_proj_mean=2.458 ms`
    - `decoder_stepN_sample_mean=0.106 ms`
    - conclusion: the decode hotspot is logits projection, not sampling.
  - probe A/Bs (all near-neutral; no sustained gain):
    - `lt16` path (`external_cold_stepn_lt16_off/on_20260225T081717Z/081718Z`)
    - fast16/tensor-op GEMMEx (`external_cold_stepn_split_fast16_20260225T082158Z`)
    - direct-u16-input probe (`external_cold_stepn_u16direct_off/on_20260225T082445Z/082447Z`)
    - `lt_u16` workspace probe (`external_cold_stepn_ltu16ws_off/on_20260225T082735Z/082737Z`)
  - decision: all four experimental lanes reverted; the baseline path remains canonical while the next optimization focuses on deeper logits-projection architecture changes.
- Full-depth qwen rerun + runtime-vLLM comparison (2026-02-25, `--layers 36`, `--pool-mb 16384`, no preload):
  - runtime-only profile artifact: `benchmarks/phase2_external_cold/results/external_cold_stepn_split_layers36_pool16g_20260225T083216Z.json`
  - runtime-vLLM artifact: `benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_layers36_pool16g_20260225T083306Z.json`
  - full-depth runtime stage means (profiled run):
    - `decoder_stepN_layers_mean=24.306 ms`
    - `decoder_stepN_logits_proj_mean=2.458 ms`
    - `decoder_stepN_total_mean=26.875 ms`
  - runtime-vLLM request-path comparison (non-profiled run):
    - runtime: TTFT `26.775 ms`, full `2983.780 ms`, cold-total full `3987.092 ms`
    - vLLM: TTFT `49.998 ms`, full `1315.478 ms`, cold-total full `24346.938 ms`
  - interpretation: full-depth runtime still wins TTFT and cold-total, but currently loses first-request full latency to vLLM in this configuration.
- Full-depth preload follow-up (2026-02-25, `--layers 36`, `--pool-mb 16384`, preload on):
  - artifacts: `benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_layers36_pool16g_preload_20260225T150209Z.json`, `benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_layers36_pool16g_preload64_20260225T150410Z.json`
  - key result: preload converts request cache deltas from misses to hits (`cache_hit_delta=434`, `cache_miss_delta=0`) and drops runtime full latency to `~2135 ms`, but that is still above vLLM request full (`~1263-1280 ms` in these runs).
  - implication: the remaining gap is layer/decode compute, not upload misses.
- Full-depth hybrid/path probes (2026-02-25) did not improve the layer-compute gap:
  - seq1 hybrid matrix (`default` vs `qk` vs `pv` vs `both`) artifacts: `external_cold_layers36_hybrid_default_20260225T150806Z.json`, `external_cold_layers36_hybrid_qk_20260225T150811Z.json`, `external_cold_layers36_hybrid_pv_20260225T150816Z.json`, `external_cold_layers36_hybrid_both_20260225T150821Z.json`
  - result: the default custom path remained best (`infer ~2113 ms`); all hybrid variants regressed (`~2459-2556 ms`).
  - direct-u16-input full-depth A/B (`external_cold_layers36_preload_a2_u16direct_off/on_20260225T150710Z/150715Z`) was near-neutral/regressed and was reverted.
- Full-depth FFN u16 path follow-up (2026-02-25) is now complete:
  - artifacts: `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab2_base_20260225T1628Z.json`, `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab2_ffnu16_20260225T1628Z.json`
  - runtime deltas (`ffnu16 - base`):
    - TTFT `26.872 -> 18.077 ms` (`-8.795 ms`)
    - request full `2148.336 -> 1820.345 ms` (`-327.991 ms`)
    - cold-total full `6155.513 -> 4826.635 ms` (`-1328.878 ms`)
  - vLLM request full in matched runs: `1300.232/1317.144 ms`.
  - interpretation: the full-depth gap narrowed substantially, but remains open on request full latency.
- Full-depth attention/logits u16 expansion (2026-02-25) is now complete (3-seed matrix):
  - artifacts: `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab3_base_s{1,2,3}_20260225T1640Z.json`, `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab3_attnffnu16_s{1,2,3}_20260225T1640Z.json`, `benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab3_attnffnlogitsu16_s{1,2,3}_20260225T1700Z.json`
  - mean results:
    - baseline runtime: `TTFT=26.863 ms`, `full=2147.754 ms`, `cold_full=6154.978 ms`
    - `ATTN+FFN u16`: `TTFT=17.080 ms`, `full=1791.873 ms`, `cold_full=4797.910 ms`
    - `ATTN+FFN+LOGITS u16`: `TTFT=16.104 ms`, `full=1775.313 ms`, `cold_full=4780.830 ms`
  - the runtime/vLLM full-latency ratio improved from `1.653x` (baseline) to `1.365x` (best), but request full latency still trails vLLM in this full-depth setup.
- Full-depth decode-input reuse + u16-Lt follow-up (2026-02-25) is now complete:
  - the regressing fused `gate+up` FFN trial was explicitly reverted after a measured slowdown.
  - shared decode-input pre-cast reuse for q/k/v + gate/up was implemented and validated.
  - a u16 cublasLt cached path (dtype-aware, safe fallback) was implemented and validated.
  - new 3-seed means (`precastreuse+u16lt`):
    - runtime `TTFT=15.522 ms`, `full=1729.351 ms`, `cold_full=4735.345 ms`
    - delta vs prior best (`ATTN+FFN+LOGITS u16`): `TTFT -0.582 ms`, `full -45.962 ms`, `cold_full -45.485 ms`
  - the runtime/vLLM full-latency ratio improved further to `~1.323x` in the latest matched 3-seed set.
- FAST_16 compute-mode follow-up (2026-02-25) was tested on top of u16-Lt:
  - strict Week 3 parity remained clean (`checked=3`, `failed=0`).
  - request-full changes were small, and one repeatability run showed a large startup outlier on both runtime and vLLM.
  - decision: do not promote FAST_16 as canonical yet; keep non-fast compute on the stable u16-Lt lane.
- Residual-fused u16-Lt follow-up (2026-02-26) is now complete:
  - implemented a u16 no-bias residual-accumulate path for decoder `o_proj` and `ffn_down` (Lt fused when available, safe fallback otherwise).
  - strict Week 3 parity remained clean (`checked=3`, `failed=0`).
  - new 3-seed means (`residfuse+u16lt`):
    - runtime `TTFT=15.400 ms`, `full=1719.302 ms`, `cold_full=4725.923 ms`
    - delta vs prior `precastreuse+u16lt`: `TTFT -0.122 ms`, `full -10.049 ms`, `cold_full -9.422 ms`
  - profiler corroboration: `decoder_step_profile_o_proj_resid_mean` and `decoder_step_profile_ffn_down_resid_mean` dropped, and `decoder_stepN_layers_mean` moved down accordingly.
- cuBLASLt workspace probe (`TRENI_LINEAR_LT_WORKSPACE_MB`) was run and rejected for this lane (2026-02-26):
  - trial artifacts: `benchmarks/phase2_external_cold/results/external_cold_layers36_trial_ltws0_20260226T105356Z.json`, `benchmarks/phase2_external_cold/results/external_cold_layers36_trial_ltws32_20260226T105401Z.json`
  - request full regressed with the workspace enabled (`1711.213 -> 1754.568 ms`), so no promotion.
- Full-depth FFN activation-to-u16 fused follow-up (`TRENI_DECODER_FFN_ACT_U16_FUSED`) is now complete and promoted default-on (2026-02-26):
  - runtime-only 3-seed A/B:
    - off: `TTFT=15.333 ms`, `full=1715.700 ms`, `cold_full=4721.653 ms`
    - on: `TTFT=15.193 ms`, `full=1704.987 ms`, `cold_full=4710.958 ms`
    - delta (on-off): `TTFT -0.140 ms`, `full -10.713 ms`, `cold_full -10.696 ms`
  - runtime-vLLM 3-seed A/B (same host/window):
    - off ratio: `runtime/vLLM full = 1.3208x`
    - on ratio: `runtime/vLLM full = 1.3012x`
  - strict parity remained clean in explicit-on and default-on runs: `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_ffnactu16_20260226T1100.json`, `benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_ffnactu16_default_20260226T1108.json`
Timeline
Latest Key Numbers
Warm Path (G5)
- Warm steady-state request mean: `~80.8 ms`
- Warm steady-state p99: `~89.6 ms`
Frontend Coverage-Instrumented Reruns (G5, 2026-02-23)
- Matrix with bounded hybrid gate (`attn_backend_frontend_matrix_20260223T011158Z`):
  - warm fixed fused share: `~0.030303` (custom handles `~0.969697` of calls)
  - warm fixed TTFT: custom `4.194 ms` vs fused-profile `4.269 ms`
  - mixed warm mean: custom `48.302 ms` vs fused-profile `47.592 ms` (near parity in that bounded-coverage setting)
- High fused-coverage warm profile (`fused_coverage_profiles_20260223T011504Z`):
  - fused share `~0.878788`
  - request mean: custom `20.292 ms` vs fused `22.310 ms` (`~1.099x` slower on fused)
  - TTFT: custom `4.196 ms` vs fused `4.496 ms`
- High fused-coverage cold profile (`fused_coverage_cold_profiles_20260223T011534Z`):
  - fused share `~0.9`
  - cold TTFT: custom `4.215 ms` vs fused `704.176 ms`
  - cold full: custom `246.306 ms` vs fused `6595.157 ms`
- Plain interpretation:
  - bounded gating avoids most regressions by keeping the majority of calls on custom.
  - when fused is exercised heavily, the current frontend implementation still loses; dynamic shape-plan reuse/coverage remains the blocker.
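The coverage interpretation can be framed as a two-path mixture: expected request latency is roughly the fused share times the fused-profile latency plus the remainder on custom. This is only an illustrative approximation (it treats the per-call backend share as if it scaled whole-request means), using the warm-profile numbers above:

```python
# Illustrative two-path mixture model for backend-share coverage data.
def mixed_latency(share_fused: float, fused_ms: float, custom_ms: float) -> float:
    """Approximate mean request latency under a fused/custom call split."""
    return share_fused * fused_ms + (1.0 - share_fused) * custom_ms

# High fused coverage: the fused gap dominates the blend.
print(mixed_latency(0.878788, 22.310, 20.292))
# Bounded gate (~3% of calls on fused): overall latency stays near custom.
print(mixed_latency(0.030303, 22.310, 20.292))
```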
AWS G5 TTFT Kernel Pass (2026-02-22, lt0_sync0 baseline -> best norm+softmax+lt1_sync0)
- Cold TTFT: `16.738 ms -> 13.974 ms` (`1.198x` faster).
- Cold full latency: `424.685 ms -> 396.814 ms` (`1.070x` faster).
- Warm mean latency: `174.237 ms -> 147.269 ms` (`1.183x` faster).
- Warm p99 latency: `1035.823 ms -> 936.297 ms` (`1.106x` faster).
- Per-model cold TTFT deltas:
  - `qwen`: `39.537 -> 29.411 ms` (largest gain).
  - `donut`: `3.505 -> 2.619 ms`.
  - `bart`: near-flat (`16.523 -> 16.573 ms`), remaining hotspot to isolate.
AWS G5 TTFT Follow-Up (2026-02-22, best norm+softmax+lt1_sync0 -> default seq_q=1 tiny-kernel path)
- Cold TTFT: `13.974 ms -> 12.504 ms` (`1.118x` faster).
- Cold full latency: `396.814 ms -> 390.099 ms` (`1.017x` faster).
- Warm mean latency: `147.269 ms -> 143.230 ms` (`1.028x` faster).
- Warm p99 latency: `936.297 ms -> 924.276 ms` (`1.013x` faster).
- Bart cold TTFT: `16.573 ms -> 12.842 ms` (`1.29x` faster).
- Profiling signal (`TRENI_STEP0_PROFILE=1`) showed Bart step0 dominated by `decoder_step0_layers`, which motivated this path.
- 3-seed repeatability on the new default path:
  - cold TTFT `12.563 ± 0.037 ms`
  - cold full `390.961 ± 0.270 ms`
  - warm mean `143.297 ± 0.222 ms`
  - warm p99 `925.668 ± 1.070 ms`
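The "mean ± std" repeatability figures above are per-seed means aggregated with a sample standard deviation. A minimal sketch (the per-seed values below are hypothetical, chosen to reproduce the quoted cold-TTFT aggregate):

```python
# Aggregate per-seed means into the "mean ± std" form used in this log.
import statistics

def repeatability(samples_ms: list[float]) -> str:
    return f"{statistics.mean(samples_ms):.3f} ± {statistics.stdev(samples_ms):.3f} ms"

cold_ttft_seeds = [12.526, 12.563, 12.600]  # hypothetical per-seed values
print(repeatability(cold_ttft_seeds))  # -> 12.563 ± 0.037 ms
```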
- Week-3 parity status update (strict trace rerun):
  - the parser fix correctly classifies fallback/failure markers in interleaved runtime logs.
  - a debug rerun found the old-container root cause: an out-of-bounds `embeddings.word_embeddings.weight` offset for `minilm` in `monolith_phase3.bin`.
  - the rebuilt parity container (`monolith_phase3_qbm.bin`, qwen+bart+minilm) now passes strict external-HF parity: `checked=3`, `failed=0`, `missing=0`.
  - the runtime on/off Bart step0 logits A/B stayed numerically aligned (max abs diff `~2e-6`, cosine `~1.0`).
AWS G5 Attention Backend A/B (2026-02-22, deconfounded)
- First-order run (`attn_backend_ab_20260222T143605Z`) showed a large cold infer/full gap caused by run order (custom executed first after build/startup).
- Reverse-order rerun (`attn_backend_ab_rev_20260222T144736Z`) removed that bias:
  - cold TTFT: `6.460 ms` custom vs `6.447 ms` cudnn proxy (`1.002x` custom/cudnn).
  - cold full: `147.789 ms` custom vs `146.707 ms` cudnn proxy (`1.007x`).
  - warm mean: `53.545 ms` custom vs `53.341 ms` cudnn proxy (`1.004x`).
  - warm p99: `82.031 ms` custom vs `80.754 ms` cudnn proxy (`1.016x`).
- Interpretation:
  - the legacy `cudnn_sdpa` proxy path is near-parity/slightly faster by a small margin.
  - runtime now keeps proxy behavior explicit opt-in (`TRENI_ATTN_ALLOW_SDPA_PROXY=1`).
  - this section is proxy-only; true fused frontend results are reported in the next section.
AWS G5 True Fused cuDNN Frontend A/B (2026-02-22, fixed qwen, warmed query set)
- Artifact: `attn_backend_ab_frontend_20260222T220111Z`.
- Warm (`http_warmup_runs=8`, `http_runs=8`, `--http-model qwen`):
  - request mean: custom `19.324 ms` vs fused `21.503 ms` (custom/fused = `0.899`)
  - request p99: custom `22.087 ms` vs fused `24.875 ms` (custom/fused = `0.888`)
  - infer mean: custom `18.803 ms` vs fused `20.976 ms` (custom/fused = `0.896`)
  - TTFT mean: custom `4.199 ms` vs fused `4.498 ms` (custom/fused = `0.934`)
- Cold first hit:
  - TTFT: custom `4.220 ms` vs fused `710.641 ms`
  - full latency: custom `250.929 ms` vs fused `6610.148 ms`
- Profile probe (`TRENI_ATTN_CUDNN_FRONTEND_PROFILE=1`) showed:
  - `avg_build_ms_per_miss ~= 704.8 ms`
  - `avg_pack_ms ~= 0.010 ms`
  - `avg_exec_ms ~= 0.021-0.048 ms`
  - `avg_unpack_ms ~= 0.005 ms`
- Interpretation:
  - fused path is active and measurable.
  - warm path is close to custom when shapes are warmed.
  - cold/mixed penalty is dominated by shape-plan miss compilation, not kernel execution.
AWS G5 Frontend Repeatability Matrix (2026-02-22, repeats=3)
- Artifact: `attn_backend_frontend_matrix_20260222T221948Z`.
- Profiles:
  - `warm_fixed` (`http_warmup_runs=8`)
  - `mixed_churn` (`http_warmup_runs=0`)
- Warm-fixed aggregate:
  - custom request mean `19.271 +/- 0.050 ms` vs fused `21.468 +/- 0.018 ms`
  - custom infer `18.812 +/- 0.059 ms` vs fused `20.984 +/- 0.026 ms`
  - custom TTFT `4.198 +/- 0.001 ms` vs fused `4.498 +/- 0.001 ms`
- Mixed-churn aggregate:
  - custom request mean `47.864 +/- 0.018 ms` vs fused `843.141 +/- 0.735 ms`
  - custom infer `47.331 +/- 0.050 ms` vs fused `842.542 +/- 0.747 ms`
  - custom TTFT `4.197 +/- 0.002 ms` vs fused `179.744 +/- 0.263 ms`
- Win counts:
  - custom wins every tracked metric in both profiles (`3/3` each metric).
- Interpretation:
  - this provides repeatable evidence that the current custom path outperforms the current fused frontend path under both stable warmed traffic and shape churn.
AWS G5 Frontend Claim-Strength (2026-02-22)
- Artifact: `attn_backend_frontend_claim_report_20260222T222958Z`.
- Paired-delta CI95 summary (`frontend - custom`; positive means custom faster):
  - warm-fixed request mean delta: `+2.197 ms` CI95 `[2.125, 2.238]`.
  - warm-fixed TTFT delta: `+0.300 ms` CI95 `[0.299, 0.301]`.
  - mixed-churn request mean delta: `+795.277 ms` CI95 `[794.408, 795.747]`.
  - mixed-churn TTFT delta: `+175.546 ms` CI95 `[175.300, 175.820]`.
- Interpretation:
  - effect direction is consistent and large in both profiles.
  - repeat count is still small (`n=3`/profile), so this should be treated as strong directional evidence with low-variance replication, not final high-N significance.
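A paired-delta CI95 like the ones above can be formed several ways; the report's exact construction is not reproduced here, but a percentile bootstrap over per-run paired deltas is one common choice. A sketch under that assumption (all run values below are illustrative):

```python
import random

def paired_delta_ci95(frontend_ms, custom_ms, n_boot=10_000, seed=0):
    """Percentile-bootstrap CI95 of mean(frontend - custom) over paired runs.
    Positive deltas mean the custom path was faster on that run."""
    deltas = [f - c for f, c in zip(frontend_ms, custom_ms)]
    rng = random.Random(seed)
    boot_means = sorted(
        sum(rng.choices(deltas, k=len(deltas))) / len(deltas)
        for _ in range(n_boot)
    )
    return boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]

# Illustrative paired runs shaped like the warm-fixed request-mean data.
lo, hi = paired_delta_ci95([21.45, 21.47, 21.49], [19.22, 19.27, 19.32])
```

With only `n=3` pairs the bootstrap interval is very coarse, which is consistent with the low-N caveat above.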
AWS G5 Frontend Miss-Mitigation (Updated Canonical, 2026-02-22)
- Artifacts:
  - baseline matrix: `attn_backend_frontend_matrix_20260222T230445Z` (no_preload)
  - candidate matrix: `attn_backend_frontend_matrix_20260222T231139Z` (startup_preload_benchmark_queries)
  - compare report: `attn_backend_frontend_missmit_compare_20260222T231335Z`
  - exact cold-prompt probe: `preload_exact_prompt_probe_20260222T231050Z.json`
- Mitigation:
  - fixed a runtime splitter bug so `TRENI_HTTP_PRELOAD_PROMPTS` executes all prompts.
  - used preload prompts matched to the benchmark cold/warm query set.
- Mixed-churn improvements (`no_preload` -> `startup_preload_benchmark_queries`):
  - fused request mean: `843.242 -> 22.433 ms` (`37.590x`)
  - fused infer mean: `842.684 -> 21.965 ms` (`38.365x`)
  - fused warm TTFT: `179.541 -> 4.497 ms` (`39.928x`)
  - fused cold TTFT: `704.521 -> 4.495 ms` (`156.723x`)
  - fused cold full latency: `6593.495 -> 25.785 ms` (`255.707x`)
- Exact cold-prompt probe:
  - fused first-hit TTFT: `4.499 ms`
  - fused first-hit full latency: `26.090 ms`
- Interpretation:
  - with matched preload coverage, the previous fused cold/mixed miss penalty is removed for this harness.
  - custom still has a small warmed-path lead, but first-hit TTFT no longer regresses on the canonical prompt set.
  - still open: generalize this behavior without a curated prompt-list preload.
AWS G5 Frontend Shape-Prebuild Probe (No Preload Prompts, 2026-02-22)
- Artifacts:
  - cold probe (startup prebuild enabled): `prebuild_startup_nopreload_probe_20260222T232932Z.json`
  - matrix probe (`repeats=1`): `attn_backend_frontend_matrix_20260222T233003Z`
  - compare (`no_preload` -> `shape_prebuild_nopreload`): `attn_backend_frontend_missmit_compare_20260222T233116Z`
- Mitigation:
  - startup shape prebuild via:
    - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=16`
    - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128`
  - no prompt preload list used.
- Key numbers:
  - cold probe startup->healthy: `11017.541 ms`
  - cold probe fused TTFT: `5.814 ms`
  - cold probe fused full request latency: `255.434 ms`
  - mixed-churn fused deltas (`no_preload` -> `shape_prebuild_nopreload`):
    - cold TTFT: `704.521 -> 5.805 ms` (`121.364x`)
    - cold full: `6593.495 -> 255.267 ms` (`25.830x`)
    - warm request mean: `843.242 -> 51.482 ms` (`16.379x`)
    - warm TTFT: `179.541 -> 4.824 ms` (`37.218x`)
- Interpretation:
  - this is the first prompt-independent mitigation that removes fused request-path spikes.
  - the current tradeoff is a startup compile burst (`http_attn_prebuild`), so the next work is lowering startup overhead while preserving these request-path gains.
- Follow-up tuning (`seq_kv_max: 16 -> 10`) artifact: `prebuild_startup10_nopreload_probe_20260222T235944Z.json`
  - startup->healthy: `11017.541 -> 7011.472 ms` (`1.571x` faster startup)
  - request TTFT: `5.814 -> 5.826 ms` (near-identical)
  - request full latency: `255.434 -> 254.936 ms` (near-identical)
- Tuned matrix confirmation:
  - tuned matrix (`seq_kv_max=10`): `attn_backend_frontend_matrix_20260223T000256Z`
  - compare vs `seq_kv_max=16`: `attn_backend_frontend_missmit_compare_20260223T000343Z`
  - warm-fixed fused request mean: `22.556 -> 22.265 ms`
  - mixed fused request mean: `51.482 -> 50.974 ms`
- Lower-range startup probe (`seq_kv_max=8`): `prebuild_startup8_nopreload_probe_20260223T000600Z.json`
  - startup->healthy: `6010.381 ms`
  - request TTFT: `703.771 ms` (regression)
  - request full latency: `1660.576 ms` (regression)
  - interpretation: `seq_kv_max=8` under-covers this benchmark prompt profile; `10` is the current minimum safe tuned range.
- Heuristic probe (`TRENI_ATTN_CUDNN_FRONTEND_HEUR_MODE`):
  - `A` and `B` remained near-identical on startup/build behavior in this path.
  - `FALLBACK` produced no valid engine configs for the current frontend descriptor on `sm86`.
AWS G5 Frontend Hybrid Shape-Gate Follow-Up (2026-02-23)
- Artifacts:
  - startup probes (3 runs): `prebuild_hybrid10_nopreload_probe_r{1,2,3}_20260223T002214Z.json`
  - matrix (`repeats=3`): `attn_backend_frontend_matrix_20260223T001959Z`
  - compare vs prior tuned no-gate matrix: `attn_backend_frontend_missmit_compare_20260223T002153Z`
- Policy:
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV=10`
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=10`
  - `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128`
  - `TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV=10`
- Startup probe aggregate (`qwen`, no preload prompts, fused frontend):
  - startup->healthy: `2004.840 +/- 0.146 ms`
  - request TTFT: `4.955 +/- 0.011 ms`
  - request full latency: `242.673 +/- 0.352 ms`
- Delta vs prior tuned no-gate probe (`prebuild_startup10_nopreload_probe_20260222T235944Z`):
  - startup->healthy: `7011.472 -> 2004.840 ms` (`3.497x` faster)
  - request TTFT: `5.826 -> 4.955 ms` (`1.176x` faster)
  - request full latency: `254.936 -> 242.673 ms` (`1.051x` faster)
- Matrix deltas vs prior tuned no-gate matrix (`attn_backend_frontend_matrix_20260223T000256Z`):
  - warm-fixed fused request mean: `22.265 -> 20.354 ms` (`1.094x` faster)
  - mixed fused request mean: `50.974 -> 47.904 ms` (`1.064x` faster)
  - cold fused TTFT: `5.819 -> 4.959 ms` (`1.173x` faster)
  - cold fused full latency: `254.146 -> 242.569 ms` (`1.048x` faster)
- Broader-shape sanity artifact: `hybrid_shape_sanity_20260223T002857Z.json`
  - startup->healthy stayed `~2004 ms`, `inference.used=true` for all 5 requests.
  - long-prompt/full-latency regressions were observed as seq1 shapes exceeded the prebuilt window (`seq_kv=11..30` miss lines in the log head), confirming remaining dynamic-shape work.
- Bounded-gate follow-up artifact: `hybrid_shape_sanity_maxgate_20260223T003453Z.json`
  - added `TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10`.
  - no fused miss lines were observed, and the same broader-shape set stayed inference-valid with low TTFT.
  - mean full latency over the 5-shape set dropped from `9974.576 ms` to `274.072 ms` (`36.395x` faster).
- Fixed-profile confirmation after max gate:
  - matrix (`repeats=3`): `attn_backend_frontend_matrix_20260223T003611Z`
  - compare vs prior hybrid: `attn_backend_frontend_missmit_compare_20260223T003734Z` (near-parity fixed-profile deltas).
- Interpretation:
  - hybrid shape gating materially reduces startup compile cost while preserving low no-preload request-path latency.
  - strict fused runs remain inference-valid with low-shape custom fallback in this harness.
  - the bounded max-gate removes broad-shape miss cascades now, but wider fused coverage without fallback still needs dynamic seq1 plan reuse/coverage.
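The min/max seq_kv gate described above can be modeled as a simple window check. An illustrative sketch (the function is hypothetical; the runtime's actual gate is driven by the `TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV` / `TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV` environment variables):

```python
def use_fused_frontend(seq_kv: int, min_kv: int = 10, max_kv: int = 10) -> bool:
    """Route a seq_q=1 attention call to the fused frontend only when its
    seq_kv lies inside the prebuilt plan window; anything outside falls back
    to the custom kernel, so no plan-miss compile can land on the request path."""
    return min_kv <= seq_kv <= max_kv

assert use_fused_frontend(10)        # inside the prebuilt window -> fused
assert not use_fused_frontend(30)    # beyond the max gate -> custom fallback
assert not use_fused_frontend(8)     # below the min gate -> custom fallback
```

The design tradeoff is exactly the one the interpretation names: a tighter window means a cheaper startup prebuild but more custom-fallback traffic outside it.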
Commercial Fairness Root-Cause Grouping (2026-02-22)
- Artifact: `commercial_gap_root_cause_20260222T222958Z`.
- OpenAI `gpt-5.2`, model-only (`paired_n=36`):
  - latency delta mean (external - internal): `-69.311 ms`, CI95 `[-193.985, 61.444]` -> parity/noise.
  - external controller overhead mean: `2.081 ms` vs model-hop mean `1406.971 ms`.
- OpenAI `gpt-5.2`, tool-only parity (`paired_n=12`):
  - latency delta mean: `+49.601 ms`, CI95 `[-162.047, 274.981]` -> parity/noise.
  - external controller overhead mean: `12.842 ms` vs model-hop mean `2456.108 ms`.
- OpenRouter Sonnet 4.6, model-only (`paired_n=24`):
  - latency delta mean: `+204.883 ms`, CI95 `[-148.517, 683.114]` -> parity/noise.
  - external controller overhead mean: `2.254 ms` vs model-hop mean `2220.251 ms`.
- Interpretation:
  - the current commercial "loss" is not statistically locked in this dataset.
  - dominant variance is upstream model-hop time; to claim directional differences, we need higher-N region/time-pinned reruns.
AWS G5 Seq1 Hybrid Tuning (2026-02-22)
- Warm matrix (`seq1_hybrid_20260222T1554Z`):
  - default: `54.505 ms` mean, `82.134 ms` p99.
  - qk-cublas: `54.572 ms` mean, `81.776 ms` p99.
  - pv-cublas: `54.281 ms` mean, `80.754 ms` p99.
  - both-cublas: `54.822 ms` mean, `79.947 ms` p99.
- Cold sanity (`seq1_hybrid_20260222T1558Z`):
  - default full `147.756 ms` vs pv-cublas full `149.293 ms`.
- Interpretation:
  - pv-cublas gives the best warm mean/p99 in this pass, but slightly worsens cold full latency.
  - default remains custom seq1 for the best overall cold/hot balance.
AWS G5 Seq1 Fused Follow-Up (2026-02-22)
- Follow-up artifact pack: `seq1_hybrid_fused_20260222T192656Z`.
- Code changes:
  - fused `seq_q=1` softmax+PV kernel in the custom path
  - seq1 QK kernel launch retune (`64/128/256` by `head_dim`)
- Warm deltas vs prior matrix (`seq1_hybrid_20260222T1554Z`):
  - default mean: `54.505 -> 52.535 ms` (`1.037x`)
  - default p99: `82.134 -> 80.554 ms` (`1.020x`)
  - pv-cublas mean: `54.281 -> 51.964 ms` (`1.045x`)
  - pv-cublas p99: `80.754 -> 78.519 ms` (`1.028x`)
- Cold deltas vs prior sanity (`seq1_hybrid_20260222T1558Z`):
  - default TTFT: `6.447 -> 6.209 ms` (`1.038x`)
  - default full: `147.756 -> 145.587 ms` (`1.015x`)
  - pv-cublas TTFT: `6.450 -> 6.215 ms` (`1.038x`)
  - pv-cublas full: `149.293 -> 147.937 ms` (`1.009x`)
- Interpretation:
  - this moved both warm and cold in the correct direction without changing model routing/task behavior.
  - the default custom path remains the balanced default; pv-cublas still leads warm-only in this slice.
H100 Fused cuDNN SDPA Probe (2026-02-22)
- Probe artifact pack: `cudnn_sdpa_h100_probe_20260222T1935Z`.
- Results:
  - alignment sweep: all `cnt=0` (`align={16,32,64,128,256}`).
  - shape/layout sweep: `tested=1440`, `supported=0`.
  - debug traces show candidate engines (`8/9/10/11`) but no viable configs after support checks:
    - `NOT_SUPPORTED_GRAPH_PATTERN` (`8/9/11`)
    - `NOT_SUPPORTED_ARCH_MISMATCH` (`10`, Blackwell-only).
- Interpretation:
  - this H100 probe formulation still finds no viable fused engines.
  - the proxy path remains explicit opt-in only for legacy A/B.
Phase 3 Realistic-v1 Loop Pack (2026-02-22, 3 seeds baseline + 3 seeds stress)
- Summary artifact: `phase3_realistic_v1_summary_20260222T143919Z.json`.
- Baseline means:
  - internal success `1.0000`
  - external success `0.9010`
  - external/internal latency ratio `15.8563x`
  - external/internal steps ratio `1.8037x`
- Stress means:
  - internal success `1.0000`
  - external success `0.9010`
  - external/internal latency ratio `75.3563x`
  - external/internal steps ratio `1.8037x`
- Interpretation:
  - moving to richer file-backed fixtures did not change direction; internal remains faster and more reliable.
  - stress again amplifies external hop latency substantially while internal remains stable.
Phase 3 Realistic-v1 Uncertainty Ablation (2026-02-22, seed 7 baseline+stress)
- Comparison artifact: `phase3_uncertainty_compare_realistic_v1_s7_20260222T144116Z.json`.
- Success deltas with uncertainty enabled (`int_on_ext_on` vs off-arms):
  - `normalized_logprob`:
    - baseline: internal `+0.2500`, external `+0.2500`
    - stress: internal `+0.2500`, external `+0.2344`
  - `raw_logit_margin`: same deltas as above.
  - `hybrid`: same deltas as above.
- Interpretation:
  - uncertainty-aware branching continues to provide positive success deltas on realistic-v1.
  - under stress, the external benefit remains positive but slightly reduced.
Routing (Internal vs External, G5)
- Internal mean: `94.849 ms`
- External mean: `97.927 ms`
- External/Internal: `1.032x` (internal faster)
Routing Failure-Amplification Stress (G5, 2026-02-18)
- Internal mean: `76.071 ms`
- External mean: `109.806 ms` (`1.443x` external/internal)
- Internal error rate: `0.0000`
- External error rate: `0.0833`
- Error-rate amplification: `inf` (external errors present, internal none)
- External retry/failure signal: tool retries mean `0.182`, taxonomy `tool_hop_failed=4`
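The `inf` amplification above falls out of a zero internal error rate. A small guard keeps the metric well-defined (the helper name and the both-clean convention are ours):

```python
import math

def error_amplification(external_rate: float, internal_rate: float) -> float:
    """External/internal error-rate ratio. When the internal path is
    error-free, report inf if the external path errored at all, else 1.0
    (a convention choice for the both-clean case)."""
    if internal_rate == 0.0:
        return math.inf if external_rate > 0.0 else 1.0
    return external_rate / internal_rate

# Stress run above: external 0.0833, internal 0.0000.
print(error_amplification(0.0833, 0.0))  # -> inf
```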
Routing Matrix Expansion (G5, 2026-02-19, 6 profiles)
- Baseline profile (`p00`): ratio `1.0420x`, external error `0.0000`.
- Mild fail profile (`p01`): ratio `1.0480x`, external error `0.0000`.
- Mild timeout profile (`p02`): ratio `1.1420x`, external error `0.0000`.
- Mixed moderate (`p03`): ratio `1.1640x`, external error `0.0417`.
- Mixed aggressive (`p04`): ratio `1.4360x`, external error `0.0833`.
- Mixed aggressive + retry2 (`p05`): ratio `1.4160x`, external error `0.0833`.
- Internal error rate stayed `0.0000` across all profiles.
- Matrix-wide mean ratio: `1.2080x` external/internal.
Routing Cross-Host Pilot (2026-02-19, local client -> SSH tunnel -> G5)
- Baseline profile (`crosshost-p00-baseline`, 12 runs):
  - internal mean: `1071.477 ms`
  - external mean: `1059.478 ms`
  - external/internal ratio: `0.989x`
  - internal error rate: `0.0000`
  - external error rate: `0.0000`
- Mild-timeout profile (`crosshost-p02-timeout-mild`, 12 runs):
  - internal mean: `1054.123 ms`
  - external mean: `1123.393 ms`
  - external/internal ratio: `1.066x`
  - internal error rate: `0.0000`
  - external error rate: `0.0000`
  - external tool retries mean: `0.083`
- Stress profile (`crosshost-p04-stress`, 12 runs):
  - internal mean: `1056.013 ms`
  - external mean: `1100.010 ms`
  - external/internal ratio: `1.042x`
  - internal error rate: `0.0000`
  - external error rate: `0.0833`
  - external tool retries mean: `0.182`
- Interpretation:
  - in cross-host conditions, stress again amplifies external-path latency and error rates while internal remains error-free.
Routing Split-Host Matrix (2026-02-19, canonical Track B)
- Topology:
  - GPU host: runtime endpoint
  - CPU host: external controller + tool services
  - controller/tool calls runtime over the VPC private network
- Profiles (`12 runs` each):
  - `splithost-p00-baseline`: ratio `0.995x`, ext error `0.0000`.
  - `splithost-p01_fail_mild`: ratio `0.998x`, ext error `0.0000`, tool retries `0.021`.
  - `splithost-p02_timeout_mild`: ratio `1.042x`, ext error `0.0000`, tool retries `0.021`.
  - `splithost-p03_mixed_moderate`: ratio `1.001x`, ext error `0.0417`, tool retries `0.065`.
  - `splithost-p04_mixed_aggressive`: ratio `1.087x`, ext error `0.0833`, tool retries `0.182`.
  - `splithost-p05_mixed_aggressive_retry2`: ratio `1.045x`, ext error `0.0833`, tool retries `0.091`.
- Matrix-wide:
  - external/internal latency ratio mean `1.028x`
  - internal error mean `0.0000`
  - external error mean `0.0347`
- Interpretation:
  - baseline remains near parity, while timeout/failure pressure amplifies external-path latency and error rate; the internal path remains error-free across all profiles.
Internet Multi-Hop Matrix (2026-02-20, Fly.io + Commercial APIs)
- Topology:
  - internal path: local client -> commercial API
  - external path: local client -> Fly controller/tool -> same commercial API
- OpenAI (`gpt-5.2`, `runs=3`, 3 profiles):
  - matrix ratio mean: `1.1123x` external/internal
  - baseline: `1.110x`
  - timeout-mild: `1.082x`
  - mixed-aggressive: `1.145x`
  - mixed-aggressive external error: `0.0833` (internal `0.0000`)
- OpenRouter (`openai/gpt-5.2`, `runs=3`, 3 profiles):
  - matrix ratio mean: `0.7553x` external/internal
  - baseline: `0.686x`
  - timeout-mild: `0.891x`
  - mixed-aggressive: `0.689x`
  - mixed-aggressive external error: `0.1667` (internal `0.0000`)
- OpenRouter (`anthropic/claude-sonnet-4.6`, `runs=3`, 3 profiles):
  - matrix ratio mean: `1.0277x` external/internal
  - baseline: `1.236x`
  - timeout-mild: `0.968x`
  - mixed-aggressive: `0.879x`
  - mixed-aggressive external error: `0.1667` (internal `0.0000`)
- Interpretation:
  - the OpenAI matrix supports the expected direction under internet hops: the external path is slower and less reliable in stress.
  - OpenRouter remains non-canonical for Track B direction claims in this topology due to mixed/inverted profile direction and elevated errors.
Local Control Matrix (No Fly Scheduler Path, 2026-02-20)
- Topology:
  - internal path: local client -> commercial API
  - external path: local client -> local standalone controller/tool -> same commercial API
- OpenAI (`gpt-5.2`, `runs=8`, 3 profiles):
  - matrix ratio mean: `0.9867x` external/internal
  - profiles: baseline `0.995x`, timeout-mild `0.977x`, mixed-aggressive `0.988x`
  - external error mean: `0.0313`
- OpenRouter (`anthropic/claude-sonnet-4.6`, `runs=8`, 3 profiles):
  - matrix ratio mean: `1.0663x` external/internal
  - profiles: baseline `1.055x`, timeout-mild `1.141x`, mixed-aggressive `1.003x`
  - external error mean: `0.0313`
- Interpretation:
  - higher-N controls reduce jitter and show mixed but informative behavior: OpenAI is near parity, while OpenRouter Sonnet trends `external > internal`.
  - external stress errors remain present while internal stayed error-free.
Task-Family Parity Split (Local Control, Higher-N, 2026-02-20)
- OpenAI `gpt-5.2` (`runs=8`):
  - `model_only`: external/internal `0.958x` (near parity, slight inversion).
  - `tool_only`: external/internal `1.136x` (external slower).
  - errors: internal `0.0`, external `0.0` on both runs.
- OpenRouter `anthropic/claude-sonnet-4.6` (`runs=8`):
  - `model_only`: external/internal `1.044x` (external slower).
  - `tool_only`: external/internal `1.051x` (external slower).
  - errors: internal `0.0`, external `0.0` on both runs.
- Interpretation:
  - the task-family split removes ambiguity from mixed task composition.
  - tool-required tasks consistently favor internal routing on both providers.
  - model-only behavior is provider-sensitive but remains close enough that architecture effects are small compared with provider/runtime variance.
Qwen Cold Upload GPU-Convert Ablation (2026-02-19, G5)
- A/B toggle:
  - off: `TRENI_TENSOR_CONVERT_GPU=0`
  - on: default GPU conversion path enabled
- Qwen first-hit metrics:
  - full latency: `1116.567 ms -> 238.740 ms` (`4.677x` faster).
  - decoder tensor upload: `1007 ms -> 129 ms` (`7.806x` faster).
  - decoder tensor convert: `862 ms -> 6 ms` (`143.667x` faster).
  - decoder tensor h2d: `143 ms -> 121 ms` (`1.182x` faster).
  - startup + full response total: `2119.906 ms -> 1242.057 ms` (`1.707x` faster).
- Interpretation:
  - this isolates CPU tensor conversion as the dominant cold bottleneck and shows that moving conversion to the GPU materially reduces Qwen cold-path latency.
- External-cold runtime-only confirmation (preload enabled, `max_tokens=48`):
  - startup-to-healthy: `2004.560 -> 1003.455 ms` (`1.997x` faster).
  - request full latency: `317.989 -> 317.276 ms` (no material change).
  - cold-total first response: `2322.549 -> 1320.731 ms` (`1.759x` faster).
  - cold-total first token: `2009.697 -> 1008.582 ms` (`1.993x` faster).
Runtime vs vLLM External-Cold Repeatability (2026-02-19, 3 runs)
- Matched setup:
  - same G5 host, same model family, token parity (`max_tokens=48`), runtime preload enabled.
- 3-run means:
  - runtime TTFT `5.135 ms` vs vLLM `84.390 ms` (`16.433x` speedup).
  - runtime request full `319.063 ms` vs vLLM `1111.463 ms` (`3.484x` speedup).
  - runtime cold-total first response `1656.573 ms` vs vLLM `31151.892 ms` (`18.805x` speedup).
- Runs 2-3 only (post-first-run stabilization):
  - TTFT `17.211x`, request full `3.416x`, cold-total `22.395x`.
- Interpretation:
  - the cold-path fix and request-path lead hold in a fresh runtime-vs-vLLM repeatability rerun after restoring vLLM in the benchmark env.
External Cold All-Backend Repeatability (2026-02-19, 3 runs, GPU-Convert Fix2)
- Setup:
  - same G5 host, same model family (`Qwen 3B`), token parity (`max_tokens=48`), runtime preload enabled.
  - backends: runtime + PyTorch + vLLM + Ollama.
- 3-run means (all runs):
  - runtime: startup `2339.131 ms`, TTFT `5.131 ms`, request full `318.315 ms`, cold-total `2657.447 ms`.
  - vLLM/runtime ratios: TTFT `16.091x`, request full `3.852x`, cold-total `10.887x`.
  - PyTorch/runtime ratios: TTFT `115.313x`, request full `7.508x`, cold-total `3.921x`.
  - Ollama/runtime ratios: TTFT `2108.743x`, request full `35.118x`, cold-total `4.584x`.
- Stable reference (runs 1-2):
  - runtime startup/cold-total: `1003.915/1321.205 ms`.
  - vLLM/runtime ratios: TTFT `18.275x`, request full `4.298x`, cold-total `21.875x`.
- Runtime-only stability sweep (`5` runs):
  - median runtime startup/cold-total: `1003.400/1320.952 ms`.
  - one run showed a preload upload outlier (`decoder_tensor_upload=1877.485 ms`, `decoder_tensor_h2d=1869.296 ms`), inflating the mean startup.
- Interpretation:
  - the request-path advantage is stable; residual cold variance is now a preload upload consistency problem, not a decoder compute bottleneck.
External Cold All-Backend Repeatability (2026-02-19, 3 runs, GPU-Convert + Host-Prefetch Fix)
- Setup:
  - same G5 host, same model family (`Qwen 3B`), token parity (`max_tokens=48`), runtime preload enabled.
  - backends: runtime + PyTorch + vLLM + Ollama.
  - runtime cold change: host-page `MADV_WILLNEED` prefetch for large tensor source ranges (`TRENI_TENSOR_HOST_PREFETCH=1`).
- 3-run means:
  - runtime: startup `1003.836 ms`, TTFT `5.130 ms`, request full `316.403 ms`, cold-total `1320.240 ms`.
  - vLLM/runtime ratios: TTFT `16.537x`, request full `3.896x`, cold-total `21.918x`.
  - PyTorch/runtime ratios: TTFT `108.567x`, request full `7.341x`, cold-total `14.601x`.
  - Ollama/runtime ratios: TTFT `514.414x`, request full `9.471x`, cold-total `3.029x`.
- Runtime-only stability compare (5 runs before vs after host-prefetch):
  - startup max: `3006.388 -> 1003.627 ms`.
  - cold-total max: `3324.212 -> 1322.338 ms`.
  - `decoder_tensor_h2d` max: `1869.296 -> 120.671 ms`.
  - `decoder_tensor_upload` max: `1877.485 -> 128.777 ms`.
- Interpretation:
  - cold preload upload variance is effectively removed in this sweep and the runtime's request-path lead remains intact.
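The host-prefetch idea behind `TRENI_TENSOR_HOST_PREFETCH=1` is, at its core, an `madvise(MADV_WILLNEED)` hint on the mapped tensor file so page faults are paid before the upload loop rather than inside it. A minimal Linux-oriented sketch (the file and range are stand-ins; the runtime's real ranges come from its tensor index):

```python
import mmap
import os
import tempfile

# Create a stand-in "tensor blob" file for the sketch.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * (mmap.PAGESIZE * 4))
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
    if hasattr(mmap, "MADV_WILLNEED"):      # Linux; guarded no-op elsewhere
        mm.madvise(mmap.MADV_WILLNEED)      # async readahead hint to the kernel
    # A later upload loop then reads mostly-resident pages instead of faulting.
    first_bytes = mm[:16]
    mm.close()
finally:
    os.close(fd)
    os.unlink(path)
```

The hint is advisory only, which matches the observation above that it removes variance (fault timing) rather than changing the steady-state request path.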
Phase 3 Agentic Loops (Canonical G5 Baseline, 2026-02-19, 3 seeds)
- Internal success rate mean: `1.0000`.
- External success rate mean: `0.9006`.
- External/internal latency ratio mean: `16.0603x`.
- External/internal steps ratio mean: `1.8147x`.
- Scenario signal (external success mean):
  - retrieval correction: `1.0000`
  - tool-state adaptation: `0.7417`
  - confidence-gated branching: `1.0000`
Phase 3 Agentic Loops (Canonical G5 Stress, 2026-02-19, 3 seeds)
- Stress profile: tool fail every `9`, timeout every `11` (`1.1s` sleep), controller timeout `0.35s`, retries `2`.
- Internal success rate mean: `1.0000`.
- External success rate mean: `0.8782`.
- External/internal latency ratio mean: `77.1703x`.
- External/internal steps ratio mean: `1.8240x`.
- Scenario signal (external success mean):
  - retrieval correction: `1.0000`
  - tool-state adaptation: `0.6833`
  - confidence-gated branching: `1.0000`
Phase 4 Kickoff (Lambda A100/H100, Phase 3 Canonical, 2026-02-20)
- A100 (3 baseline seeds + 3 stress seeds):
  - baseline: internal success `1.0000`, external success `0.9006`, external/internal latency `16.8790x`.
  - stress: internal success `1.0000`, external success `0.8782`, external/internal latency `77.8613x`.
- H100 (3 baseline seeds + 3 stress seeds):
  - baseline: internal success `1.0000`, external success `0.9006`, external/internal latency `18.6933x`.
  - stress: internal success `1.0000`, external success `0.8782`, external/internal latency `72.7407x`.
- Interpretation:
  - Track C behavior is hardware-stable: the internal path keeps perfect success while external remains weaker on tool-state adaptation and pays a large stress-amplified latency penalty.
  - This is now paired with completed Phase 2 + C2 reruns on the same Lambda hardware classes.
Phase 4 Full Reruns (Lambda A100/H100, Phase 2 + C2, 2026-02-20)
- A100:
  - cold first-hit summary: startup `1002.708 ms`, TTFT `29.657 ms`, full `32.008 ms`.
  - warm request latency: mean `10.356 ms`, p99 `14.536 ms`.
  - routing matrix overall: external/internal `2.4300x`, external error `0.0347`, internal error `0.0000`.
  - C2 runtime-native deltas: baseline `+0.2308/+0.2308` (internal/external), stress `+0.2308/+0.2212`.
- H100:
  - cold first-hit summary: startup `1004.890 ms`, TTFT `56.944 ms`, full `62.064 ms`.
  - warm request latency: mean `18.491 ms`, p99 `24.944 ms`.
  - routing matrix overall: external/internal `2.3972x`, external error `0.0347`, internal error `0.0000`.
  - C2 runtime-native deltas: baseline `+0.2308/+0.2308` (internal/external), stress `+0.2308/+0.2212`.
- Interpretation:
  - Cross-hardware direction stays consistent: the internal path remains more stable/reliable while external routing shows stress-amplified latency and error behavior.
  - Runtime-native uncertainty deltas hold their positive baseline/stress direction on both A100 and H100 in this calibrated setup.
Paper Package (2026-02-20)
- Generated outputs:
  - `/benchmarks/paper_package/latest/package_summary.json`
  - `/benchmarks/paper_package/latest/paper_package.md`
  - `/benchmarks/paper_package/latest/tables/*.csv`
  - `/benchmarks/paper_package/latest/manuscript/figure_manifest.json`
  - `/benchmarks/paper_package/latest/manuscript/captions.md`
  - `/benchmarks/paper_package/latest/manuscript/claims.md`
  - `/benchmarks/paper_package/latest/manuscript/figures/*.mmd`
- Scope:
  - consolidates canonical G5 + Lambda A100 + Lambda H100 into paper-ready tables for:
    - Phase 2 cold/hot summary
    - routing matrix summary
    - C2 runtime-native deltas
    - Phase 3 loops baseline/stress
    - external-cold backend comparison (G5)
  - provides manuscript-ready figure/caption/claim templates with direct table provenance.
Phase 3 Uncertainty Ablation (Baseline Matrix, 2026-02-19, runs=8)
Arms:
- `int_on_ext_on`
- `int_off_ext_on`
- `int_on_ext_off`
- `int_off_ext_off`
Sources:
- `normalized_logprob`
- `raw_logit_margin`
- `hybrid`
- `runtime_native` (canonical rerun now complete)
Observed deltas:
- Internal success delta (uncertainty on vs off, external fixed on): `+0.2308` across all three sources.
- External success delta (uncertainty on vs off, internal fixed on): `+0.2308` across all three sources.
- Direction is stable across source definitions, while latency ratios vary by source.
Interpretation:
- The loop benchmark now has first direct evidence that uncertainty-aware branching contributes to task success, not only narrative plausibility.
- This baseline matrix uses harness-level synthetic uncertainty signals.
- The runtime-native uncertainty rerun is now published and aligns directionally with this result.
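The on/off deltas above compare the four arms pairwise: the internal delta holds the external arm on and toggles internal uncertainty, and vice versa. A bookkeeping sketch (the per-arm success values are invented solely to illustrate the `+0.2308` delta shape; they are not the measured arm results):

```python
# Success rates per arm; all values here are hypothetical illustrations.
success = {
    "int_on_ext_on":  {"internal": 1.0000, "external": 0.9006},
    "int_off_ext_on": {"internal": 0.7692, "external": 0.9006},
    "int_on_ext_off": {"internal": 1.0000, "external": 0.6698},
}

# Internal delta: uncertainty on vs off with the external arm fixed on.
internal_delta = (success["int_on_ext_on"]["internal"]
                  - success["int_off_ext_on"]["internal"])
# External delta: uncertainty on vs off with the internal arm fixed on.
external_delta = (success["int_on_ext_on"]["external"]
                  - success["int_on_ext_off"]["external"])
print(round(internal_delta, 4), round(external_delta, 4))
```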
Phase 3 Uncertainty Ablation (Repeatability + Stress, 2026-02-19, 3 seeds each)
Baseline means:
- Internal uncertainty success delta: `+0.2308` (all sources).
- External uncertainty success delta: `+0.2308` (all sources).
Stress means:
- Internal uncertainty success delta: `+0.2308` (all sources).
- External uncertainty success delta: `+0.2212` (all sources).
Stress minus baseline:
- Internal uncertainty delta change: `0.0000`.
- External uncertainty delta change: `-0.0096`.
Interpretation:
- Uncertainty-aware branching gains are stable under both baseline and injected timeout/failure stress in this harness.
- This now has a canonical runtime-native corroboration run.
Phase 3 Uncertainty Ablation (Runtime-Native Canonical Rerun, 2026-02-19, 3 seeds each, Superseded)
- Pre-fix issue:
  - the greedy decode uncertainty path emitted flat zeros (`mean_logprob=0`, `mean_entropy=0`), making runtime-native C2 non-informative.
- Fix:
  - the greedy sampling kernel now computes logprob + entropy from logits using log-sum-exp.
- Baseline runtime-native uncertainty deltas:
  - internal: `+0.1026`
  - external: `+0.1155`
- Stress runtime-native uncertainty deltas:
  - internal: `+0.2308`
  - external: `+0.2212`
- Interpretation:
  - this run established runtime-native wiring, but part of the seed set later showed probe fallback contamination.
  - use the 2026-02-20 quality-gated rerun below for current canonical interpretation.
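The log-sum-exp fix described above is the standard way to get a stable log-softmax out of raw logits. A reference sketch of what the greedy path now computes (scalar Python for clarity, not the kernel code):

```python
import math

def greedy_logprob_entropy(logits):
    """Stable log-softmax via log-sum-exp; returns the greedy (argmax)
    token's logprob and the full-distribution entropy in nats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    logprobs = [x - lse for x in logits]
    greedy_logprob = max(logprobs)  # argmax token has the largest logprob
    entropy = -sum(math.exp(lp) * lp for lp in logprobs)
    return greedy_logprob, entropy

lp, ent = greedy_logprob_entropy([2.0, 1.0, 0.1])
assert lp < 0.0 and ent > 0.0  # no more flat zeros on the greedy path
```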
Phase 3 Uncertainty Ablation (Runtime-Native Quality-Gated Rerun, 2026-02-20, 3 seeds each)
- Rerun setup:
  - source: `runtime_native`
  - seeds: `7/11/19`
  - baseline + stress
  - runtime fast probe config: `TRENI_DEMO_LAYERS=2`
  - client consumes `awareness.generation` first (legacy `uncertainty` fallback preserved)
- Quality gate:
  - all runtime-native arm artifacts in this rerun have non-zero requests/ok and `fallback=0`, `errors=0`.
- Clean runtime-native uncertainty deltas:
  - baseline internal: `-0.1538`
  - baseline external: `-0.1217`
  - stress internal: `-0.1538`
  - stress external: `-0.1089`
- Interpretation:
  - with clean zero-fallback runtime-native probes, this awareness3 rerun showed uncertainty-on was harmful in this harness.
  - runtime-native transport/response wiring stayed validated and triggered the calibration pass below.
Phase 3 Uncertainty Ablation (Runtime-Native Calibrated Rerun calib1, 2026-02-20, 3 seeds each)
- Calibration update:
- runtime-native decision confidence now uses calibrated generation confidence (floor/ceil scaling) blended with prior confidence and optional route confidence.
- calibration knobs are now forwarded through the ablation runner for reproducible reruns.
- Rerun setup:
  - source: `runtime_native`
  - seeds: `7/11/19`
  - baseline + stress
  - calibration params: prior weight `0.75`, confidence floor `0.10`, confidence ceil `0.35`, route blend `0.10`
- Quality gate:
  - all runtime-native arm artifacts in this rerun have non-zero requests/ok and `fallback=0`, `errors=0`.
- Calibrated runtime-native uncertainty deltas:
  - baseline internal: `+0.1539`
  - baseline external: `+0.1058`
  - stress internal: `+0.1539`
  - stress external: `+0.1154`
- Interpretation:
  - calibrated runtime-native uncertainty recovers positive on/off gains in this harness under both baseline and stress.
  - C2 is now re-locked for this setup; optional next work is region-pinned commercial multi-hop controls (and higher-N only where needed).
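A minimal sketch of a floor/ceil-plus-blend calibration of the kind described; the exact formula and function name are assumptions, and only the knob values come from the calib1 rerun setup above:

```python
def calibrated_decision_confidence(gen_conf, prior_conf, route_conf=None,
                                   prior_weight=0.75, floor=0.10, ceil=0.35,
                                   route_blend=0.10):
    """Hypothetical floor/ceil-scaled generation confidence blended with
    prior and optional route confidence (knob values from the calib1 rerun)."""
    # Rescale raw generation confidence relative to the calibrated band, clamped.
    scaled = min(max((gen_conf - floor) / (ceil - floor), 0.0), 1.0)
    # Weighted blend with the prior confidence.
    conf = prior_weight * prior_conf + (1.0 - prior_weight) * scaled
    # Optional route-confidence mix-in.
    if route_conf is not None:
        conf = (1.0 - route_blend) * conf + route_blend * route_conf
    return conf
```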
External Cold Comparison (G5, 2026-02-18, Qwen 3B family)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 1003.537 ms | 1108.979 ms | 1339.459 ms | 2112.516 ms | 2342.996 ms |
| pytorch_transformers | - | 528.483 ms | 2288.516 ms | 6965.227 ms | 8725.259 ms |
| vllm | 24032.203 ms | 51.763 ms | 1036.815 ms | 24083.966 ms | 25069.018 ms |
| ollama (GGUF) | 1002.695 ms | 2168.902 ms | 2527.411 ms | 3171.597 ms | 3530.106 ms |
Runtime-normalized (lower is better for runtime):
- PyTorch cold total first response: `3.724x` runtime.
- vLLM cold total first response: `10.7x` runtime.
- Ollama cold total first response: `1.507x` runtime.
Interpretation:
- vLLM request-path TTFT is fastest once healthy, but startup dominates total cold in this run.
- Runtime is strongest on end-to-end cold total in this specific setup.
- Ollama is quantized GGUF and kept with caveat tags (not precision-equivalent to BF16 paths).
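The runtime-normalized multipliers are just each backend's cold-total first response divided by the runtime's; a quick check against the table values:

```python
# Cold-total first-response times (ms) from the canonical G5 table above.
cold_total_ms = {
    "runtime": 2342.996,
    "pytorch_transformers": 8725.259,
    "vllm": 25069.018,
    "ollama": 3530.106,
}
# Each backend as a multiple of runtime (lower is better for runtime).
ratios = {
    name: round(ms / cold_total_ms["runtime"], 3)
    for name, ms in cold_total_ms.items()
}
```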
External Cold Comparison (G5, 2026-02-18, preload + tokenizer cache)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 2004.735 ms | 91.596 ms | 271.346 ms | 2096.331 ms | 2276.081 ms |
| pytorch_transformers | - | 522.795 ms | 2252.382 ms | 6644.737 ms | 8374.324 ms |
| vllm | 27036.682 ms | 51.725 ms | 1035.826 ms | 27088.407 ms | 28072.508 ms |
| ollama (GGUF) | 1002.508 ms | 2182.542 ms | 2538.609 ms | 3185.050 ms | 3541.117 ms |
Runtime-normalized (lower is better for runtime):
- vLLM request full latency: `3.817x` runtime.
- vLLM cold total first response: `12.334x` runtime.
- Remaining gap: vLLM request TTFT is still lower (`51.725 ms` vs runtime `91.596 ms`).
Important caveat:
- The run above was before request `max_tokens` was wired through runtime inference (runtime still generated 4 tokens there).
External Cold Comparison (G5, 2026-02-18, token parity fixed at 48, pre decoder fix)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 2004.203 ms | 91.207 ms | 2518.142 ms | 2095.410 ms | 4522.345 ms |
| pytorch_transformers | - | 501.946 ms | 2244.327 ms | 9530.450 ms | 11272.831 ms |
| vllm | 27036.248 ms | 51.310 ms | 1075.404 ms | 27087.558 ms | 28111.652 ms |
| ollama (GGUF) | 1002.560 ms | 2197.797 ms | 2556.652 ms | 3200.357 ms | 3559.212 ms |
Interpretation:
- Runtime still wins cold-total first response vs vLLM (`6.216x` better).
- Runtime request-path TTFT and full latency are still slower than vLLM at the equal 48-token budget.
- Residual bottleneck is decoder per-token step cost (no longer tensor upload in this mode).
External Cold Comparison (G5, 2026-02-18, token parity + decoder/sampling fix)
| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
|---|---|---|---|---|---|
| runtime | 2004.759 ms | 5.022 ms | 311.289 ms | 2009.781 ms | 2316.048 ms |
| pytorch_transformers | - | 515.597 ms | 2291.854 ms | 6461.304 ms | 8237.561 ms |
| vllm | 24036.762 ms | 52.995 ms | 1094.517 ms | 24089.757 ms | 25131.279 ms |
| ollama (GGUF) | 1002.630 ms | 2184.383 ms | 2543.219 ms | 3187.013 ms | 3545.849 ms |
Interpretation:
- Runtime now leads vLLM in request-path TTFT (`10.553x` faster) and full latency (`3.516x` faster) on this G5 token-parity run.
- Runtime also remains much lower on cold-total first response (`10.851x` better vs vLLM).
- The main measured bottleneck in the prior parity run (sampling + per-step host sync) is no longer dominant.
- Initial repeatability set (`2026-02-18`) kept the same direction: mean speedups of `10.333x` TTFT, `3.380x` full latency, `10.688x` cold-total first response.
- Superseded by the `2026-02-19` rerun and all-backend repeatability with a stronger runtime advantage.
Cold TTFT Before vs After Index Cache (3-run means, G5)
| Model | Before | After | Speedup |
|---|---|---|---|
| qwen | 27574.564 ms | 1774.951 ms | 15.535x |
| donut | 67360.388 ms | 572.485 ms | 117.663x |
| bart | 77520.798 ms | 743.652 ms | 104.243x |
| minilm | 23.342 ms | 22.698 ms | 1.028x |
Cold TTFT: clean3 vs clean4 (3-run means, G5)
| Model | clean3 | clean4 | Improvement |
|---|---|---|---|
| qwen | 1411.831 ms | 1100.044 ms | 22.1% lower |
| donut | 619.499 ms | 150.322 ms | 75.7% lower |
| bart | 776.545 ms | 125.011 ms | 83.9% lower |
| minilm | 23.421 ms | 22.621 ms | 3.4% lower |
Dominant Cold Stages After clean4
- `model_tensor_index_build` dropped to `~1-2.3 ms` across models (down ~99.6% vs clean3 for Bart/Donut).
- Qwen is still dominated by `decoder_tensor_upload` (`~1015 ms` mean).
- Donut and Bart are now mostly in decoder setup/upload and no longer index-build bound.
Reverted Experiment (Transparency)
- Tried an async pinned conversion-buffer upload strategy after clean4.
- Result: Qwen `decoder_tensor_upload` regressed to `~1419 ms` and TTFT regressed by `~37%`.
- Decision: reverted that path; clean4 remains the accepted cold-path baseline.
- Follow-up validation run set (`clean7`) matched clean4 within run noise (Qwen TTFT delta `-0.16%`).
What Was Actually Tested
- Baseline (Python/dependency path) runs on T4 and G5.
- Runtime cold and warm request-path benchmarks.
- True runtime-reported TTFT (not SSE first-event proxy).
- Internal-vs-external routing comparison on matched tasks.
- Internal-vs-external routing failure-amplification stress run with injected timeouts/failures.
- Internal-vs-external routing matrix expansion (baseline + 5 stress profiles on G5).
- Internal-vs-external routing cross-host pilot (baseline + stress via SSH tunnel to G5).
- Internal-vs-external routing split-host matrix (CPU router host + GPU runtime host, 6 profiles).
- Week 3 numerical parity checks (strict mode; donut intentionally skipped in parity harness).
- Phase 3 loop-capability canonical G5 benchmark (baseline profile, 3 seeds).
- Phase 3 loop-capability canonical G5 stress benchmark (failure/timeout injection + retries, 3 seeds).
- Qwen cold upload GPU-convert on/off ablation (same host, same harness, env-toggle only).
- External-cold runtime-only GPU-convert on/off ablation (preload enabled, matched token budget).
- Runtime-vLLM external-cold repeatability rerun (3 runs) after vLLM env restore.
- External-cold all-backend repeatability set (runtime + PyTorch + vLLM + Ollama, 3 runs).
- Runtime-only cold stability sweep (5 runs) with preload upload sub-stage inspection.
- Runtime host-prefetch cold fix rerun: runtime-only 5-run stability sweep.
- Runtime host-prefetch cold fix rerun: external-cold all-backend repeatability (3 runs).
- Lambda A100 full rerun: Phase 2 cold/hot runtime set + 6-profile routing matrix.
- Lambda H100 full rerun: Phase 2 cold/hot runtime set + 6-profile routing matrix.
- Lambda A100 C2 runtime-native calibrated set: baseline+stress (3 seeds each).
- Lambda H100 C2 runtime-native calibrated set: baseline+stress (3 seeds each).
- Internet multi-hop commercial matrix (Fly hops + OpenAI `gpt-5.2`, 3 profiles).
- Internet multi-hop commercial repeatability matrix (Fly hops + OpenRouter `openai/gpt-5.2`, `runs=3`).
- Internet multi-hop repeatability matrix (Fly hops + OpenRouter `anthropic/claude-sonnet-4.6`, `runs=3`).
- Local-control matrix (no Fly scheduler path) for OpenAI `gpt-5.2` and OpenRouter `anthropic/claude-sonnet-4.6`.
- Higher-N local-control rerun (`runs=8`/profile) for the same OpenAI/OpenRouter Sonnet pair.
- Task-family parity split rerun (`model_only` + `tool_only`, `runs=8`) for OpenAI + OpenRouter Sonnet.
What Is Not Finished Yet
- Optional: add region-pinned/Fly-to-Fly control runs to reduce provider-path confounding in OpenRouter comparisons.
Raw Artifact Links
- Cold index-cache summary JSON
- Cold index-cache summary Markdown
- Routing comparison JSON
- Routing failure stress JSON
- Routing failure stress report JSON
- Routing failure stress report Markdown
- Routing matrix report JSON
- Routing matrix report Markdown
- Routing matrix baseline profile JSON
- Routing matrix mixed-aggressive profile JSON
- Routing cross-host baseline profile JSON
- Routing cross-host mild-timeout profile JSON
- Routing cross-host stress profile JSON
- Routing cross-host matrix JSON
- Routing cross-host matrix Markdown
- Routing split-host baseline profile JSON
- Routing split-host mild-timeout profile JSON
- Routing split-host mixed-aggressive profile JSON
- Routing split-host matrix JSON
- Routing split-host matrix Markdown
- Routing internet multi-hop matrix JSON (OpenAI, repeatability)
- Routing internet multi-hop matrix Markdown (OpenAI, repeatability)
- Routing internet multi-hop matrix JSON (OpenRouter, repeatability)
- Routing internet multi-hop matrix Markdown (OpenRouter, repeatability)
- Routing internet multi-hop matrix JSON (OpenRouter Claude Sonnet 4.6, repeatability)
- Routing internet multi-hop matrix Markdown (OpenRouter Claude Sonnet 4.6, repeatability)
- Routing local-control matrix JSON (OpenAI)
- Routing local-control matrix Markdown (OpenAI)
- Routing local-control matrix JSON (OpenRouter Claude Sonnet 4.6)
- Routing local-control matrix Markdown (OpenRouter Claude Sonnet 4.6)
- Routing local-control matrix JSON (OpenAI, higher-N)
- Routing local-control matrix Markdown (OpenAI, higher-N)
- Routing local-control matrix JSON (OpenRouter Claude Sonnet 4.6, higher-N)
- Routing local-control matrix Markdown (OpenRouter Claude Sonnet 4.6, higher-N)
- Routing task-family run JSON (OpenAI model-only, higher-N)
- Routing task-family run JSON (OpenAI tool-only, higher-N)
- Routing task-family run JSON (OpenRouter Sonnet model-only, higher-N)
- Routing task-family run JSON (OpenRouter Sonnet tool-only, higher-N)
- Routing internet multi-hop matrix JSON (OpenAI, initial exploratory)
- Routing internet multi-hop matrix Markdown (OpenAI, initial exploratory)
- Routing internet multi-hop matrix JSON (OpenRouter, initial exploratory)
- Routing internet multi-hop matrix Markdown (OpenRouter, initial exploratory)
- Qwen cold GPU-convert off JSON
- Qwen cold GPU-convert on JSON
- Qwen cold GPU-convert ablation summary JSON
- Qwen cold GPU-convert ablation summary Markdown
- External-cold runtime-only GPU-convert off JSON
- External-cold runtime-only GPU-convert on JSON
- External-cold runtime-only GPU-convert ablation summary JSON
- External-cold runtime-only GPU-convert ablation summary Markdown
- Runtime-vLLM rerun JSON
- Runtime-vLLM repeatability r1 JSON
- Runtime-vLLM repeatability r2 JSON
- Runtime-vLLM repeatability r3 JSON
- Runtime-vLLM repeatability summary JSON
- Runtime-vLLM repeatability summary Markdown
- External-cold all-backend repeatability run1 JSON
- External-cold all-backend repeatability run2 JSON
- External-cold all-backend repeatability run3 JSON
- External-cold all-backend repeatability summary JSON
- External-cold all-backend repeatability summary Markdown
- Runtime cold-stability sweep summary JSON
- Runtime cold-stability sweep summary Markdown
- Host-prefetch all-backend repeatability run1 JSON
- Host-prefetch all-backend repeatability run2 JSON
- Host-prefetch all-backend repeatability run3 JSON
- Host-prefetch all-backend repeatability summary JSON
- Host-prefetch all-backend repeatability summary Markdown
- Host-prefetch runtime stability compare JSON
- Host-prefetch runtime stability compare Markdown
- Phase 3 canonical summary JSON
- Phase 3 canonical summary Markdown
- Phase 3 realistic-v1 summary JSON
- Phase 3 realistic-v1 summary Markdown
- Phase 3 uncertainty ablation summary JSON
- Phase 3 uncertainty ablation summary Markdown
- Phase 3 uncertainty baseline-vs-stress comparison JSON
- Phase 3 uncertainty baseline-vs-stress comparison Markdown
- Phase 3 uncertainty runtime-native canonical comparison JSON
- Phase 3 uncertainty runtime-native canonical comparison Markdown
- Phase 3 uncertainty runtime-native awareness3 comparison JSON
- Phase 3 uncertainty runtime-native awareness3 comparison Markdown
- Phase 3 uncertainty runtime-native calib1 comparison JSON
- Phase 3 uncertainty runtime-native calib1 comparison Markdown
- Phase 4 Lambda A100 Phase 3 summary JSON
- Phase 4 Lambda A100 Phase 3 summary Markdown
- Phase 4 Lambda H100 Phase 3 summary JSON
- Phase 4 Lambda H100 Phase 3 summary Markdown
- Phase 4 Lambda A100 Phase 2 cold JSON
- Phase 4 Lambda A100 Phase 2 warm JSON
- Phase 4 Lambda A100 routing matrix JSON
- Phase 4 Lambda A100 routing matrix Markdown
- Phase 4 Lambda A100 C2 compare JSON
- Phase 4 Lambda A100 C2 compare Markdown
- Phase 4 Lambda H100 Phase 2 cold JSON
- Phase 4 Lambda H100 Phase 2 warm JSON
- Phase 4 Lambda H100 routing matrix JSON
- Phase 4 Lambda H100 routing matrix Markdown
- Phase 4 Lambda H100 C2 compare JSON
- Phase 4 Lambda H100 C2 compare Markdown
- Phase 4 Paper Package summary JSON
- Phase 4 Paper Package markdown
- Phase 4 Paper Package tables directory
- Phase 4 Paper Package manuscript figure manifest
- Phase 4 Paper Package manuscript captions
- Phase 4 Paper Package manuscript claims
- Phase 3 baseline seed s7 JSON
- Phase 3 baseline seed s11 JSON
- Phase 3 baseline seed s19 JSON
- Phase 3 stress seed s7 JSON
- Phase 3 stress seed s11 JSON
- Phase 3 stress seed s19 JSON
- Phase 3 realistic-v1 baseline seed s7 JSON
- Phase 3 realistic-v1 baseline seed s11 JSON
- Phase 3 realistic-v1 baseline seed s19 JSON
- Phase 3 realistic-v1 stress seed s7 JSON
- Phase 3 realistic-v1 stress seed s11 JSON
- Phase 3 realistic-v1 stress seed s19 JSON
- Phase 3 realistic-v1 uncertainty ablation baseline JSON
- Phase 3 realistic-v1 uncertainty ablation stress JSON
- Phase 3 realistic-v1 uncertainty comparison JSON
- True TTFT summary JSON
- Cold decomposition clean4 run 1 JSON
- Cold decomposition clean4 run 2 JSON
- Cold decomposition clean4 run 3 JSON
- Cold decomposition clean4 summary JSON
- Cold decomposition clean4 summary Markdown
- Cold decomposition clean7 run 1 JSON
- Cold decomposition clean7 run 2 JSON
- Cold decomposition clean7 run 3 JSON
- Cold decomposition clean7 summary JSON
- Cold decomposition clean7 summary Markdown
- Warm sanity clean7 JSON
- AWS G5 attention backend A/B first-order JSON
- AWS G5 attention backend A/B first-order Markdown
- AWS G5 attention backend A/B reverse-order canonical JSON
- AWS G5 attention backend A/B reverse-order canonical Markdown
- External cold canonical JSON
- External cold canonical Markdown
- External cold preload/tokenizer-cache JSON
- External cold preload/tokenizer-cache Markdown
- External cold token-parity JSON
- External cold token-parity Markdown
- External cold token-parity decoder-fix JSON
- External cold token-parity decoder-fix Markdown
2026-03-08
Qwen3.5 Prompt Parity And Remaining Fidelity Gap
- Confirmed on AWS that runtime Qwen3.5 prompt IDs match HF chat-template token IDs exactly on a failing IFEval case.
- Ran a four-way prefill A/B (`fast`, `no_linear`, `no_full`, `tokenwise`) and all four produced the same first token on that case.
- Conclusion: the remaining IFEval quality issue is not caused by prompt serialization or the new batched prefill path.
Step-0 Logit Comparison
- Runtime step-0 top-k on the failing prompt ranked `The` first and `I` second.
- vLLM top logprobs on the same prompt exposed `I` and `The` as tied top candidates.
- Interpretation: there is still a small decode/logit-distribution drift on the Qwen3.5 lane, even after prompt parity was confirmed.
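A drift check of this kind reduces to comparing top-k candidate lists and greedy picks from the two backends on the same prompt; a minimal sketch (function names are illustrative, not the harness's API):

```python
import numpy as np

def topk_ids(logits, k=5):
    """Token IDs of the k highest logits, best first."""
    return [int(i) for i in np.argsort(logits)[::-1][:k]]

def first_token_matches(logits_a, logits_b):
    """True when two backends would pick the same greedy first token."""
    return int(np.argmax(logits_a)) == int(np.argmax(logits_b))
```

Near-ties like `I` vs `The` show up as the same IDs appearing in both top-k lists but in swapped order.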
IFEval Repair Loop Progress
- Added evaluator-guided IFEval repair messaging in `scripts/phase5_awareness_realbench.py`.
- Repair loop now uses failed-check feedback instead of only a generic uncertainty retry.
- Added explicit repair hints for:
- forbidden words
- exact repeated text
- JSON-only output
- markdown title lines
- word-count limits
- two-response formatting
- 3-seed IFEval-only repair sweep (`phase5-q35-ifeval-aware-repair-ab3`) produced:
  - runtime `arm_a_control`: `0.361111` score, `1317.056 ms`
  - runtime `arm_b_awareness_retry`: `0.402778` score, `2895.902 ms`
  - runtime `arm_c_awareness_consistency`: `0.402778` score, `2910.412 ms`
  - vLLM `arm_a_control`: `0.430555` score, `3034.889 ms`
- Interpretation:
  - evaluator-guided repair improves runtime IFEval quality materially over control
  - the runtime repaired loop is still below vLLM control on IFEval quality
  - the runtime repaired loop remains slightly faster than vLLM control on this slice
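The evaluator-guided repair messaging amounts to mapping failed checks to targeted hints instead of issuing a generic retry. A sketch, with hypothetical check IDs and hint templates (the real mapping lives in `scripts/phase5_awareness_realbench.py` and may use different names):

```python
# Hypothetical IFEval check IDs -> repair hint templates.
REPAIR_HINTS = {
    "forbidden_words": "Remove every forbidden word: {detail}.",
    "exact_repeat": "Your answer must contain the exact text: {detail}.",
    "json_only": "Reply with valid JSON only, with no prose outside it.",
    "markdown_title": "Start the answer with a markdown title line.",
    "word_count": "Stay within the word-count limit: {detail}.",
    "two_responses": "Give exactly two responses in the required format.",
}

def build_repair_message(failed_checks):
    """Turn evaluator failures into one targeted repair instruction."""
    hints = [
        REPAIR_HINTS.get(check["id"], "Fix the failed constraint.")
        .format(detail=check.get("detail", ""))
        for check in failed_checks
    ]
    return "Previous answer failed these checks:\n- " + "\n- ".join(hints)
```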
2026-03-10
Stub Audit And Scope Lock
- Direct Phase 5 runtime/vLLM comparisons are not using Hermes stub tools.
- The only live wrapper-level stub issue found in the same-VM harness was in `/Users/andrewcorrea/treni/scripts/hermes_same_vm_mvp.py`, where optional `terminal_tool`/`browser_tool` cleanup shims could mask real imports under partial Hermes availability.
- That wrapper path is now fixed to prefer real Hermes imports and only install a stub when the import genuinely fails.
- Phase 3 remains partially synthetic by design:
  - `/Users/andrewcorrea/treni/scripts/phase3_agentic_loop_benchmark.py` still exposes a `synthetic` profile.
  - `realistic_v1` reduces stub bias with file-backed fixtures, but it is still not the same lane as direct runtime/vLLM benchmarking.
Qwen Family Compatibility Re-Proved On AWS
- Rebuilt the AWS runtime from the corrected `/Users/andrewcorrea/treni/monolith/main.c` and `/Users/andrewcorrea/treni/monolith/models/decoder.cu`.
- `qwen35` (Qwen/Qwen3.5-0.8B) direct inference is restored on AWS and again returns `inference.used=true`.
- `qwen35_4b` (Qwen/Qwen3.5-4B) now performs real inference on the same A10G host when launched with the correct `runtime_pool_mb=15360`.
- Packed and booted a fresh `Qwen/Qwen2.5-0.5B-Instruct` container (`monolith_qwen25_0p5b.bin`) to re-prove backward compatibility on the live host.
- `qwen35_9b` aliasing remains wired in `/Users/andrewcorrea/treni/scripts/qwen_runtime_env.py`, but the current AWS box still has no packed `monolith_qwen35_9b.bin` and is not the intended proof GPU for 9B.
AWS Storage Update
- The AWS Hugging Face cache is mostly active model state, not arbitrary duplicates:
  - `Qwen/Qwen3.5-4B`
  - `Qwen/Qwen3.5-0.8B`
  - `Qwen/Qwen3-ASR-0.6B`
  - `Qwen/Qwen3-VL-Embedding-2B`
  - `Qwen/Qwen3-VL-Reranker-2B`
- Removed stale Whisper fallback cache copies after keeping Qwen ASR as the primary STT path.
- Later host cleanup removed the stale `monolith_qwen05*` artifacts and the temporary `Qwen2.5-0.5B-Instruct` host cache/artifacts after backward compatibility had already been re-proven.
- AWS root disk moved from roughly `97%` used with about `3.7G` free to roughly `94%` used with about `6.5G` free.
Live Hermes Tooling And PDF/RAG Validation
- Native raw-PDF ingestion is now live in the worker:
  - `/Users/andrewcorrea/treni/scripts/treni_local_tool_worker.py` accepts `paths=[...]` and extracts PDF text natively via `pdftotext` when available, or `pypdf` as a fallback.
  - a live AWS proof ingested `/home/ubuntu/treni/benchmarks/same_vm_mvp/data/manual-pncp-api.pdf` directly into the local RAG store.
- Same-VM Hermes wrapper tool registration is now healthier:
  - the wrapper now loads the real Hermes `tools` package before installing any optional shims, so file/code tools are no longer masked by a synthetic top-level package.
  - live loaded-tool sets now include real `read_file`, `write_file`, `search_files`, `patch`, and `execute_code`.
- Live single-tool Hermes probes on AWS now show:
  - `qwen35` (0.8B) successfully uses real `samevm_rag_search` against the raw-PDF-ingested local RAG store.
  - `qwen35` successfully calls real `execute_code`; the current issue is model-generated code quality, not tool availability.
  - `qwen35` also calls real `samevm_sqlite_query`, but still tends to emit malformed SQL unless the prompt is tightly constrained.
  - `qwen35_4b` still lags `0.8B` on exact-output and tool-call contract fidelity in the current same-VM harness.
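The CLI-first extraction with a `pypdf` fallback described above can be sketched as follows; this is an illustration of the behavior, not the worker's actual implementation in `scripts/treni_local_tool_worker.py`:

```python
import shutil
import subprocess

def extract_pdf_text(path):
    """Prefer the pdftotext CLI; fall back to pypdf when the CLI is absent."""
    if shutil.which("pdftotext"):
        # "-" as the output argument sends extracted text to stdout.
        result = subprocess.run(
            ["pdftotext", path, "-"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    from pypdf import PdfReader  # third-party fallback dependency
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Checking `shutil.which` before shelling out keeps the fallback decision explicit instead of catching a `FileNotFoundError` from `subprocess`.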
Live Qwen Speed Snapshot
- Current live speed probe on AWS:
  - `qwen35` (0.8B): `128` completion tokens in about `1111.4 ms`, `ttft_ms` ≈ `103.1`, `decode_tps` ≈ `115.37`
  - `qwen35_4b` (4B): `128` completion tokens in about `3313.3 ms`, `ttft_ms` ≈ `170.7`, `decode_tps` ≈ `38.64`
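For readers reproducing the snapshot, the throughput figure is approximately completion tokens over total inference time; this sketch matches the 4B numbers to within rounding, though the probe's exact definition (e.g. how it treats TTFT) may differ slightly:

```python
def decode_tps(completion_tokens, infer_ms):
    """Approximate decode throughput in tokens/second.

    Assumed definition: completion tokens divided by total inference time.
    """
    return completion_tokens / (infer_ms / 1000.0)
```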
4B Same-VM Promotion Check And Lambda 9B Capacity Sweep
- Same-VM `qwen35_4b` parity debugging on AWS found the real runtime bug.
- Root cause:
  - in `/Users/andrewcorrea/treni/monolith/models/decoder.cu`, the cached linear-attention step path repeated key heads before the key depthwise-conv update when `q_proj_dim != attn_dim`
  - `Qwen3.5-4B` hits that shape regime, so first-token decode drifted even though tokenizer parity and prompt format were already correct
- Hugging Face on the same AWS host proved the model itself was fine:
  - exact-output prompts behaved correctly
  - tool-call prompts emitted valid `<tool_call>` structure
- After the decode fix, live AWS `4B` behavior changed from malformed outputs like `</think>\n\nREADY` and `1.` to:
  - exact `READY`
  - normal sentence answers
  - valid structured `tool_calls`
- Repaired canonical 4B full suite:
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-full-aws-r4-qwen35_4b_20260310T184433Z.json`
  - result: `15/15`
- Repaired 4B suite scope now passes end-to-end:
  - direct runtime smoke
  - SQLite
  - raw PDF ingest + RAG search
  - embedding + reranking
  - TTS + Qwen ASR STT
  - Hermes runtime-status
  - Hermes RAG
  - Hermes SQLite exec/query
  - Hermes memory add/read
  - Hermes `execute_code`
- AWS cleanup was deepened again:
  - removed the stale `q35-orpo-notemplate-1772992302` training tree
  - removed `checkpoint-1` from `samevm-orpo-reload-q35-fixed_20260308T182430Z`
  - pruned older same-VM debug WAV/debug-result artifacts and old worker logs
  - current root disk is still tight but improved to about `4.0G` free
- Lambda 9B provisioning is still blocked by real cloud-side capacity:
  - verified account auth and SSH key registration
  - repeated `launch` attempts across valid single-GPU types/regions returned either:
    - `instance-operations/launch/insufficient-capacity`, or
    - Cloudflare rate-limit `1015`
  - no Lambda instance was created in this sweep
Clean Same-VM Agent Selector Lane (2026-03-10)
- Added a real model-dependent comparison harness:
  - `/Users/andrewcorrea/treni/scripts/samevm_agent_regression_suite.py --mode agent_compare`
- Canonical comparison artifacts:
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35.json`
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35_4b.json`
- Scope of this selector lane:
  - runtime health
  - worker health
  - direct runtime smoke
  - Hermes runtime-status
  - Hermes RAG search
  - Hermes SQLite exec/query
  - Hermes memory add/read
  - Hermes execute_code
- Result on the AWS A10G host:
  - `qwen35` (0.8B) passed `10/10`
  - `qwen35_4b` (4B) passed `2/10`
- Key interpretation:
  - this selector artifact is now historical only
  - it predates the cached linear-attention decode fix for `4B`
  - the repaired full suite is the current source of truth for `4B`
- Tightened `0.8B` agent lane improvements that matter:
  - explicit `script=true` guidance fixed the SQLite exec scenario
  - memory recall is now validated through a new-session prompt, which matches Hermes memory semantics
  - execute-code validation now uses an explicit one-line Python task and passes cleanly
Isolated Speed Snapshot (2026-03-10)
- Current isolated `0.8B` speed probe:
  - `/Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/qwen35_model_speed_compare_20260310.md`
  - cold first hit: `103` completion tokens, `ttft_ms=608.891`, `infer_ms=12444.005`, `tok/s=8.277`
  - warm steady-state: `103` completion tokens, `ttft_ms≈95.386`, `infer_ms≈905.594`, `tok/s≈113.738`
- Current repaired `4B` speed probe:
  - same prompt family, repeated live AWS requests
  - `119` completion tokens, `ttft_ms≈158.877`, `infer_ms≈3093.814`, `tok/s≈38.464`
- Current interpretation:
  - `0.8B` remains the speed-optimized lane
  - `4B` is now the repaired stronger-capability lane, but it is materially slower on A10G