Treni

Findings Changelog

Dated summary of major experiment findings and interpretation.

At A Glance

  • Public GPU Agent console is now split cleanly between canonical and scratch surfaces:
    • public console now exposes:
      • a direct runtime test path for raw generation speed/logprobs/uncertainty
      • a separate agent test path for SQLite/RAG/memory/tool verification
      • docs link: https://treni-docs.pages.dev
      • deck link: https://monostate.com/pitch
    • docs navigation is now reorganized around:
      • canonical lanes
      • detailed logs
      • scratch experiments
    • interpretation:
      • the main experiment story is easier to read without mixing claim-safe lanes with random debugging work,
      • while the scratch bucket still preserves the noisy exploratory trail when needed.
  • Native Hermes 4B same-VM conversation lane is now green for the split real-world persistence workflow:
    • artifact:
      • benchmarks/same_vm_mvp/results/hemkesh-v22_20260311T020710Z.json
    • result:
      • local discovery works
      • exact facts are written to SQLite and queried back
      • broader context is ingested into RAG and retrieval-checked
      • a memory note is saved after persistence
      • final recall correctly points exact facts to SQLite and broader context to RAG
    • interpretation:
      • the earlier failures were a mix of duplicate tool-call IDs, over-long replayed tool traces, and opaque worker errors,
      • those are now fixed enough that the native Hermes 4B lane can complete a real multi-turn investor-style knowledge-building workflow on AWS,
      • the remaining open weakness is still the single-turn combined persistence prompt, not the split multi-turn workflow.
  • Warm request path on G5 is stable and fast in the current runtime.
  • Larger-N sampled strict confirmation (2026-03-08) now strengthens the post-fix Qwen3.5 non-thinking claim:
    • artifact:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T235013Z.json
    • result (16 samples/task, gpqa_diamond+ifeval, 3 seeds):
      • overall: score runtime 0.371528 vs vLLM 0.296875, latency runtime 1255.344 ms vs vLLM 1585.043 ms
      • gpqa_diamond: runtime 0.3750 vs vLLM 0.3125, runtime slower (801.900 ms vs 433.256 ms)
      • ifeval: runtime 0.368056 vs vLLM 0.281250, runtime faster (1708.789 ms vs 2736.831 ms)
    • interpretation:
      • the sampled strict runtime-vs-vLLM win survives beyond the 8-sample pilot,
      • the overall score and latency deltas stay positive with tighter confidence intervals.
  • Finalized thinking strict lane (2026-03-08) is now measurable rather than all-zero:
    • artifact:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T010353Z.json
    • result (8 samples/task, gpqa_diamond+ifeval, 3 seeds):
      • overall: score runtime 0.250000 vs vLLM 0.194444, latency runtime 6823.816 ms vs vLLM 7503.000 ms
      • gpqa_diamond: score tied (0.166667 vs 0.166667), latency runtime 7727.880 ms vs vLLM 7741.028 ms
      • ifeval: score runtime 0.333333 vs vLLM 0.222222, latency runtime 5919.753 ms vs vLLM 7264.973 ms
    • interpretation:
      • the old runtime 512 cap and long-decode corruption were real and are now fixed,
      • the close-form finalize pass turns length-exhausted thinking traces into parseable answers on both backends,
      • reducing the GPQA first-pass reasoning budget to 256 preserves the new score lead while collapsing the old GPQA latency penalty,
      • the resulting finalized thinking lane now beats vLLM overall on both score and latency.
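The close-form finalize pass above can be sketched as follows. This is a minimal illustration under stated assumptions: the real pass lives in the Phase 5 harness, and the follow-up prompt wording, `finish_reason` check, and `generate` signature here are hypothetical, not the harness's exact API.

```python
def finalize_thinking(first_pass: str, finish_reason: str, generate):
    """Sketch of a close-form finalize pass: when the first thinking pass
    exhausts its token budget (finish_reason == "length") without an
    explicit answer, issue one short follow-up generation that asks for
    the final answer only, instead of scoring the truncated trace."""
    if finish_reason != "length":
        # the first pass terminated normally; use it as-is
        return first_pass
    # hypothetical follow-up prompt; the real harness wording may differ
    prompt = first_pass + "\n\nGiven the reasoning above, state only the final answer:"
    return generate(prompt, max_tokens=32)
```

The point of the design is that a length-exhausted trace becomes parseable on both backends without inflating the first-pass reasoning budget.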
  • GSM8K-only finalized thinking follow-up (2026-03-09) extends the same lane to another closed-form task family:
    • artifact:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T022347Z.json
    • result (32 samples/task, 3 seeds):
      • score runtime 0.197917 vs vLLM 0.177083, latency runtime 7174.829 ms vs vLLM 7643.231 ms
    • interpretation:
      • the same finalized thinking setup remains directionally runtime-positive on GSM8K,
      • but the score interval is still too wide for a strong claim, so this should be treated as exploratory support rather than canonical proof.
  • AIME25 isolated finalized thinking pilot (2026-03-09) is a negative result:
    • artifact:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260310T021732Z.json
    • result (8 samples, 1 seed, 512 tokens, patched AIME prompts):
      • score 0.0 for both backends, latency runtime 19776.254 ms vs vLLM 16092.718 ms
    • interpretation:
      • increasing the reasoning budget and adding AIME-specific prompt/finalize guidance still does not recover AIME25,
      • this should be treated as an explicit limitation of the current thinking harness and/or the model size.
  • AIME25 second-thinking recovery attempt (2026-03-09) was also negative:
    • artifact:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T021331Z.json
    • result:
      • score 0.0 for both backends, latency runtime 21409.322 ms vs vLLM 22110.402 ms
    • interpretation:
      • a second short thinking finalize pass increases cost and still does not recover AIME,
      • so this branch remains non-canonical.
  • Late AWS sampled-lane fix (2026-03-08) resolved the last Qwen3.5 reproducibility blocker:
    • root cause was in scripts/phase5_awareness_realbench.py, not the runtime:
      • the shared first-pass arm_a_control request skipped the request seed and task-specific decode payload
    • post-fix sampled runtime-only reproducibility probes:
      • benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r1.json
      • benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_raw_seedfix_r2.json
      • result: repeated sampled IFEval seed-7 runs are identical (score_mean=0.3125 both, 8/8 outputs identical)
    • post-fix sampled strict one-host matrix artifact:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T220806Z.json
    • repeatability confirmation artifact:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T221823Z.json
    • result:
      • overall: score runtime 0.409722 vs vLLM 0.302083, latency runtime 1617.187 ms vs vLLM 2017.206 ms
      • gpqa_diamond: runtime 0.3750 vs vLLM 0.2500, runtime slower (710.693 ms vs 435.823 ms)
      • ifeval: runtime 0.4444 vs vLLM 0.3542, runtime faster (2523.680 ms vs 3598.588 ms)
      • repeatability check stayed aligned:
        • overall: score runtime 0.409722 vs vLLM 0.281250, latency runtime 1607.757 ms vs vLLM 2008.759 ms
    • interpretation:
      • sampled-lane drift was a harness bug rather than runtime instability,
      • there is now a clean sampled strict AB3 lane where runtime wins overall on both score and latency,
      • and that result holds on an immediate second full-matrix rerun.
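The harness bug behind the old drift is easy to state in code. A minimal sketch of the fix in scripts/phase5_awareness_realbench.py (the field names `seed` and `temperature` and the dict shapes here are assumptions, not the script's exact structures):

```python
def build_decode_payload(base: dict, task_cfg: dict, seed: int) -> dict:
    """Sketch of the sampled-lane fix: the shared first-pass (arm_a_control)
    request must carry both the request seed and the task-specific decode
    settings. Before the fix, the shared path skipped these, so repeated
    sampled runs with the same nominal seed produced different outputs."""
    payload = dict(base)
    payload.update(task_cfg)   # task-specific decode settings (e.g. temperature, top_k)
    payload["seed"] = seed     # previously dropped on the shared arm_a_control path
    return payload
```

With the seed and decode settings threaded through every request, repeated sampled runs become byte-identical, which is exactly what the seedfix_r1/r2 probes confirm.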
  • First explicit thinking-mode strict parity lane is now measured (2026-03-08), but it is not yet promotable:
    • initial thinking matrix:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T223442Z.json
      • overall: score runtime 0.166667 vs vLLM 0.111111, latency runtime 3589.124 ms vs vLLM 4635.395 ms
    • budget-fixed thinking matrix:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T224358Z.json
      • overall: score runtime 0.166667 vs vLLM 0.111111, latency runtime 8678.709 ms vs vLLM 9041.981 ms
      • gpqa_diamond: runtime 0.0 vs vLLM 0.0
    • one-example long-budget probes:
      • benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_runtime_1024_20260308T230352Z.json
      • benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_vllm_1024_20260308T230352Z.json
    • interpretation:
      • under the raw thinking template, both backends can stay trapped in reasoning without emitting a usable final GPQA answer,
      • that exploratory lane should now be read as the pre-fix baseline for the finalized result above, not as the current canonical thinking state.
  • Late AWS deterministic rerun (2026-03-08) is now the cleanest claim-safe Qwen3.5 one-host lane:
    • runtime-side reproducibility fix landed in monolith/server/http.c: request-scoped decode env overrides are now serialized instead of racing through process-global env state
    • direct runtime reproducibility probe:
      • benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r1.json
      • benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_t0_r2.json
      • result: identical outputs and identical score (0.5625) on repeated temperature=0 IFEval seed-7 runs
    • deterministic strict one-host matrix artifact:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T204248Z.json
    • result:
      • overall: score runtime 0.295139 vs vLLM 0.267361, latency runtime 824.714 ms vs vLLM 1572.529 ms
      • gpqa_diamond: score parity (0.166667 vs 0.166667), runtime slower (671.640 ms vs 436.583 ms)
      • ifeval: runtime leads score (0.423611 vs 0.368055) and latency (977.787 ms vs 2708.475 ms)
    • interpretation:
      • there is now a reproducible deterministic strict lane where runtime wins overall on both score and latency.
  • Historical note: the earlier sampled-lane reproducibility failure (2026-03-08) is now explained and non-canonical:
    • repeated runtime-only IFEval seed-7 sampled runs:
      • benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_r1.json
      • benchmarks/phase5_awareness_realbench/results/phase5_repro_runtime_ifeval_s7_r2.json
    • result:
      • summary moved 0.375 -> 0.500 with the same seed/config
      • all 8/8 example outputs changed between reruns
    • interpretation:
      • these old drift artifacts came from the harness shared-first path skipping the request seed,
      • they should not be interpreted as runtime sampler instability.
  • Late AWS sampler update (2026-03-08) materially changed the Qwen3.5 strict picture again:
    • chunked stop-check plus fast top-k sampling landed after the hybrid prefill work,
    • focused GPQA decode profile moved:
      • decoder_step0_logits_sample 40.701 -> 3.538 ms
      • decoder_stepN_sample_mean 37.090 -> 2.366 ms
      • decoder_stepN_total_mean 47.748 -> 12.721 ms
    • focused artifacts:
      • benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-stopchunk8_20260308T003422Z.json
      • benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-samplefast1_20260308T003727Z.json
    • fast-sampler AB3 artifact:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T003749Z.json
      • overall: score runtime 0.305556 vs vLLM 0.347222, latency runtime 1405.707 ms vs vLLM 1676.336 ms
    • tie-stable fast-sampler AB3 artifact:
      • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T004758Z.json
      • overall: score runtime 0.315972 vs vLLM 0.347222, latency runtime 1422.818 ms vs vLLM 1659.878 ms
      • gpqa_diamond: score runtime 0.291667 vs vLLM 0.208333, latency runtime 886.296 ms vs vLLM 515.171 ms
      • ifeval: score runtime 0.340278 vs vLLM 0.486111, latency runtime 1959.340 ms vs vLLM 2804.584 ms
    • interpretation:
      • runtime now has a clean strict latency lead on the one-host Qwen3.5 matrix,
      • prompt prefill and sampled decode are both materially improved,
      • the remaining blocker is recovering the small score deficit without giving back that latency win.
  • Late AWS update (2026-03-08) materially changed the Qwen3.5 strict picture:
    • new batched hybrid prefill landed in monolith/models/decoder.cu and monolith/main.c,
    • focused GPQA profile moved:
      • decoder_prefill 3263.527 -> 1341.628 -> 275.372 ms
      • decoder_ttft 3317.441 -> 1405.739 -> 1017.876 ms
    • latest strict AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T000429Z.json
    • result:
      • overall score: runtime 0.413195 vs vLLM 0.347222
      • overall latency: runtime 2940.172 ms vs vLLM 1686.263 ms
      • gpqa_diamond: runtime 0.458333 vs vLLM 0.208333, latency 1347.582 ms vs 512.075 ms
      • ifeval: runtime 0.368055 vs vLLM 0.486111, latency 4532.763 ms vs 2860.452 ms
    • interpretation: prompt-prefill was a real architectural blocker and is no longer the main latency limiter; the remaining gap is narrower and now looks more like warm decode/request-path overhead.
  • Late AWS update (2026-03-07) now has two new concrete results:
    • Qwen3.5 strict/AWS launcher drift is now fixed through a shared fast-path env:
      • code: scripts/qwen_runtime_env.py
      • consumers: scripts/qwen35_remote_isolated_ab.py, scripts/treni_local_tool_worker.py, scripts/hermes_same_vm_mvp.py
      • clean AB3 artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T231500Z.json
      • result:
        • overall score: runtime 0.335648 vs vLLM 0.291667
        • overall latency: runtime 3690.124 ms vs vLLM 1646.672 ms
        • gpqa_diamond: score parity (0.25 vs 0.25) but runtime remains much slower
        • ifeval: runtime higher score (0.421296 vs 0.333333) but still slower
      • interpretation: the old launcher mismatch was real, but fixing it does not close the latency gap; the remaining blocker is still long-prompt prefill.
    • The code-level reason for the remaining Qwen3.5 latency gap is now explicit:
      • monolith/models/decoder.cu currently returns an invalid-status error from treni_decoder_forward_f32(...) when ctx->is_linear_attn is true,
      • the comment in that path states that Qwen3.5 linear-attention is implemented only in cached/token decode,
      • so Qwen3.5 prompt prefill still falls back to the token-by-token cached loop in monolith/main.c instead of a true batched prompt-prefill path.
      • interpretation: the remaining long-prompt latency gap is architectural, not just a missing launch flag.
    • ORPO self-reload loop is real end-to-end:
      • artifact: benchmarks/same_vm_mvp/results/samevm-orpo-reload-aws_20260307T222341Z.json
      • local ORPO output was merged, packed into a new monolith container, restarted as a second runtime, and answered a real chat request.
    • Qwen3.5 shared-prefix tiering (64 -> 112 runtime cache cap with quartile tiers + exact replay) yields a real clean latency win, but not a full fix:
      • sequential GPQA profile artifact: benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-seq2-cap112_20260307T222540Z.json
      • clean strict seed-7 spot A/B artifacts:
        • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T223218Z.json
        • benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T223555Z.json
      • effect on runtime latency (112 vs 64):
        • overall -363.908 ms
        • gpqa_diamond -420.699 ms
        • ifeval -307.116 ms
      • runtime is still slower than vLLM overall even after this improvement.
  • Qwen3.5 contract validation + one-host strict rerun are now updated on AWS (2026-03-07):
    • tokenizer audit artifact: benchmarks/qwen35_tokenizer_audit/results/qwen35-tokenizer-audit-active_20260307T173024Z.json
    • runtime smoke artifact: benchmarks/qwen35_smoke/results/qwen35-runtime-smoke-active2_20260307T173132Z.json
    • isolated semantic A/B artifact: benchmarks/qwen35_smoke/results/qwen35-isolated-ab-active_20260307T173228Z.json
    • strict one-host matrix artifact: benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T191653Z.json
    • new runner: scripts/phase5_qwen35_remote_strict_matrix.py
    • current state:
      • packed tokenizer exactly matches HF full vocab for Qwen/Qwen3.5-0.8B (248077 tokens),
      • runtime extended non-thinking smoke passes 7/7 cases on the active AWS host,
      • runtime wins the isolated non-thinking probe suite overall, while the matched vLLM launch still misses the multimodal-placeholder and forced-thinking probe cases in that harness,
      • strict realbench score is no longer behind overall: runtime score 0.3333 vs vLLM 0.3160,
      • strict realbench latency is still far behind: runtime 3809.745 ms vs vLLM 1626.068 ms.
    • request-path fixes included in that rerun:
      • Qwen3.5 decoder prefix cache now defaults on with 64 prefix tokens,
      • timing.ttft_ms now measures request-path first-token timing instead of the decode-loop step-0 proxy,
      • repeated prompt-family hot probe on AWS dropped from infer_ms ~1798.5 -> 842.4 ms and ttft_ms ~1531.9 -> 782.5 ms with a cache hit.
    • task split in the strict one-host matrix:
      • gpqa_diamond: score parity (0.2917 vs 0.2917) but runtime is still much slower (+2449.320 ms),
      • ifeval: runtime higher score (0.3750 vs 0.3403) but still slower (+1918.033 ms).
  • Follow-up prefix-cache debugging on AWS (2026-03-07, non-canonical debug cycle) isolated a real short-prompt runtime bug:
    • focused 2x gpqa + 2x ifeval profile with cache enabled showed:
      • GPQA does get a real 64-token prefix-cache hit,
      • that hit reduces prefill (~3075 ms -> ~2697 ms) but does not close the large prefill gap,
      • short IFEval requests were hitting CUDA invalid argument on the prefix-cache/store path and then poisoning the next request.
    • focused no-cache rerun removed the short-prompt CUDA failures entirely.
    • safe fix landed in monolith/main.c: only long prompts are allowed to store into the prefix cache; short prompts skip the buggy store path.
    • focused post-fix profile confirms:
      • GPQA cache hit is preserved,
      • short IFEval CUDA invalid-argument path is gone in the probe,
      • this is a stability/correctness fix, not yet a canonical strict-matrix win.
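The safe fix in monolith/main.c is a length gate on the store path. A hedged Python sketch (the real code is C; the 64-token threshold is inferred from the 64-prefix-token cache default elsewhere in this log, and the function name is hypothetical):

```python
# assumed threshold: the prefix cache defaults to 64 prefix tokens, so a
# prompt shorter than that cannot populate a useful cache entry anyway
MIN_PREFIX_STORE_TOKENS = 64

def should_store_prefix(prompt_token_count: int) -> bool:
    """Sketch of the mitigation: only long prompts are allowed to store
    into the prefix cache. Short IFEval-style requests skip the store
    path that triggered CUDA invalid-argument errors and poisoned the
    next request; cache *hits* for long GPQA prompts are unaffected."""
    return prompt_token_count >= MIN_PREFIX_STORE_TOKENS
```

This is deliberately a stability gate rather than a root-cause fix of the short-prompt store path, matching the log's framing of it as a correctness fix, not a strict-matrix win.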
  • Same-VM Hermes wrapper recovery is now complete on AWS (2026-03-07):
    • artifact: benchmarks/same_vm_mvp/results/samevm-q35-stack_20260307T172158Z.json
    • result: wrapper now auto-starts the local runtime + CPU tool worker, calls samevm_runtime_health, runs the real extended Qwen3.5 smoke suite, and emits a deterministic plain-text summary from tool outputs.
    • entrypoints: scripts/hermes_same_vm_mvp.py, scripts/run_samevm_qwen35_stack.sh
    • smoke sub-artifacts: benchmarks/qwen35_smoke/results/samevm-smoke-20260307T172124Z_20260307T172124Z.json, benchmarks/qwen35_smoke/results/samevm-smoke-20260307T172124Z_20260307T172124Z.md
    • current status: PASS across 7/7 smoke cases in the wrapper path; remaining issue is an intermittent first tool-turn CUDA retry (compute/ops.cu:765, invalid argument) that recovered successfully in the observed run.
  • Same-VM runtime-admin proof is now clean on AWS (2026-03-07):
    • artifact: benchmarks/same_vm_mvp/results/samevm-q35-runtime-admin-proof-v5_20260307T212852Z.json
    • result: Hermes calls the local samevm_runtime_status and samevm_multimodal_status tools, and the wrapper now rewrites partial/truncated model responses into a deterministic tool-derived summary.
    • current state in that artifact:
      • runtime is managed by the worker on http://127.0.0.1:18080,
      • managed runtime PID is live (pid_running=yes),
      • Qwen3.5 runtime uses the packed local container with prefix cache enabled,
      • multimodal defaults are loaded from the same local worker (embed, rerank, tts, stt).
  • Same-VM ORPO control-plane proof is now complete on AWS (2026-03-07):
    • artifact: benchmarks/same_vm_mvp/results/samevm-orpo-probe-aws_20260307T215307Z.json
    • runner: scripts/samevm_orpo_probe.py
    • result:
      • local preference dataset write succeeded,
      • real background ORPO job launched through the worker,
      • job completed with returncode=0,
      • current scope is training control, not hot-reload: adapter/container ingestion back into the monolith runtime is still not wired.
  • Same-VM multimodal tool surface is now wired into the local worker + Hermes bridge (2026-03-07):
    • code: scripts/samevm_multimodal_models.py, scripts/treni_local_tool_worker.py, scripts/hermes_same_vm_mvp.py
    • new tool classes: samevm_multimodal_status, samevm_embed, samevm_rerank, samevm_tts, samevm_stt
    • default models:
      • Qwen/Qwen3-VL-Embedding-2B
      • Qwen/Qwen3-VL-Reranker-2B
      • Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
      • Qwen/Qwen3-ASR-0.6B
      • Whisper fallback STT if the requested model id contains whisper
    • live worker status smoke confirms the new endpoints are reachable.
    • bootstrap entrypoint: scripts/bootstrap_samevm_multimodal.sh
    • MVP readme: benchmarks/same_vm_mvp/README.md
  • First real same-VM stack proof now runs end-to-end on AWS (2026-03-07):
    • artifact: benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v3_20260307T213248Z.json
    • runner: scripts/samevm_stack_probe.py
    • confirmed in one local-worker pass:
      • runtime status: healthy managed Qwen3.5 runtime on the same VM,
      • SQLite exec/query: pass (1 row),
      • RAG ingest/search: pass (match_count=1, top hit Same VM locality),
      • TTS: pass with Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice,
      • STT: pass with Whisper fallback on the generated WAV,
      • embedding: pass with Qwen/Qwen3-VL-Embedding-2B (dim=2048),
      • reranking: pass with Qwen/Qwen3-VL-Reranker-2B.
    • caveat: Whisper transcript is directionally correct but not exact on the synthetic audio (Treni misheard), so current STT proof is functional, not quality-benchmarked.
  • Same-VM multimodal cache retention bug is now explicit and mitigated on AWS (2026-03-07):
    • finding: after the multimodal proof, the local tool worker was holding about 13.3 GiB of GPU memory and starving the Qwen runtime path.
    • fix:
      • new worker endpoint: POST /v1/mm/clear_cache
      • new Hermes tool: samevm_multimodal_clear_cache
      • status now reports loaded_model_count, loaded_models, and CUDA allocation/reservation.
    • code: scripts/samevm_multimodal_models.py, scripts/treni_local_tool_worker.py, scripts/hermes_same_vm_mvp.py
  • Canonical same-VM MVP proof was revalidated on AWS after the runtime compatibility fix (2026-03-10):
    • artifact: benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json
    • summary artifact: benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.md
    • result:
      • runtime health: pass
      • worker health: pass
      • Hermes runtime-status: pass
      • Hermes multimodal-status: pass
      • direct same-VM runtime smoke: pass on the basic non-thinking profile (all_ok=True, 5 cases, includes first-turn tool calling)
      • direct same-VM thinking smoke: pass on the extended/thinking profile with exact-match checks (all_ok=True)
      • direct same-VM stack probe: pass for SQLite, RAG, embedding, reranking, TTS, Qwen ASR STT
      • Qwen3.5 ORPO reload proof: pass via benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json
      • sidecar cleanup: pass (port=18081, stopped=true)
      • final multimodal cache clear: pass
    • implementation note:
      • the full demo runner is now scripts/samevm_full_mvp_demo.py
      • one-command entrypoint is scripts/run_samevm_full_mvp.sh
    • compatibility fix:
      • monolith/server/http.c now accepts both POST /v1/chat/completions and POST /chat/completions
      • monolith/server/http.c now exposes both GET /v1/models and GET /models
      • this removed the live Hermes 404 failure against the root runtime URL on AWS
    • extra Hermes tool proofs after the rerun:
      • benchmarks/same_vm_mvp/results/hermes-demo-sqlite-query-v2.json
      • benchmarks/same_vm_mvp/results/hermes-demo-rag-search-v1.json
      • these confirm Hermes can use real SQLite and RAG tools on AWS beyond the status-only path
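The compatibility fix is route aliasing in the C server. An illustrative Python sketch of the same idea (the real change is in monolith/server/http.c; this helper and its alias table are hypothetical):

```python
def normalize_route(path: str) -> str:
    """Sketch of the http.c compatibility fix: treat the bare and
    /v1-prefixed chat/model routes as the same endpoint, so clients
    pointed at the root runtime URL (like the live Hermes agent on AWS)
    no longer hit 404s."""
    aliases = {
        "/chat/completions": "/v1/chat/completions",
        "/models": "/v1/models",
    }
    return aliases.get(path, path)
```

Accepting both spellings matches how OpenAI-compatible clients vary in whether they include the `/v1` prefix in the base URL.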
  • Live capability validation pass on AWS added concrete speed and multimodal proofs (2026-03-10):
    • direct generation speed on the current live Qwen3.5 runtime (0.8B, non-thinking lane):
      • 3 deterministic runs
      • 130 completion tokens each
      • mean end-to-end throughput: 112.37 tok/s
      • mean decode-only throughput: 121.90 tok/s
    • Hermes tool visibility proof:
      • benchmarks/same_vm_mvp/results/hermes-tool-list-v1.json
      • loaded tools include runtime control, smoke, SQLite, RAG, embedding, reranking, TTS, STT, ORPO, and job status
    • Hermes audio roundtrip proofs:
      • TTS: benchmarks/same_vm_mvp/results/hermes-tts-v2.json
      • STT: benchmarks/same_vm_mvp/results/hermes-stt-v2.json
      • current transcript roundtrip still shows the known synthetic-voice name drift (Treni -> Trinity)
    • PDF/RAG real-world proof:
      • extracted /Users/andrewcorrea/pncp-ata360/docs/manual-pncp-api.pdf to text
      • ingested extracted text into the AWS same-VM RAG store
      • search for Protocolo de Comunicação PNCP returned the correct manual section
    • reranker proof:
      • direct Qwen reranker call correctly ranked the Protocolo de Comunicação candidate first
    • current product caveat:
      • same-VM RAG currently ingests plain text files and text payloads only; PDF parsing is still an external preprocessing step
    • current 4B feasibility note:
      • the live AWS host is an A10G 24 GB box with enough GPU headroom to try Qwen3.5-4B,
      • but only about 12 GB root disk remains free, so model download/pack is the practical blocker on the current host
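The two throughput figures in the speed probe are related but distinct. A sketch of how such numbers are typically derived (the probe script's exact field names are not in this log, so the inputs here are assumptions):

```python
def throughput_stats(completion_tokens: int, total_ms: float, ttft_ms: float):
    """End-to-end tok/s divides completion tokens by the full request
    wall time; decode-only tok/s excludes time-to-first-token, so it
    isolates steady-state decode speed from prefill/startup cost."""
    end_to_end = completion_tokens / (total_ms / 1000.0)
    decode_only = completion_tokens / ((total_ms - ttft_ms) / 1000.0)
    return end_to_end, decode_only
```

Under this reading, the reported 112.37 tok/s end-to-end vs 121.90 tok/s decode-only gap is the TTFT share of each 130-token request.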
  • Qwen3.5 thinking-mode response salvage is now wired for unfinished reasoning traces (2026-03-08):
    • code: monolith/server/http.c
    • behavior:
      • unfinished Thinking Process: outputs on exact-output prompts now return a usable final message.content
      • the raw reasoning trace is preserved in message.reasoning_content
      • finish_reason remains length when the model still truncates
    • passing artifact: benchmarks/qwen35_smoke/results/postmvp-extended-thinking-r4_20260308T193858Z.json
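The salvage behavior can be illustrated in Python (the real logic is C in monolith/server/http.c; the last-non-empty-line heuristic below is an assumption about how a usable final content is chosen, not the runtime's exact rule):

```python
def salvage_thinking_response(raw: str, marker: str = "Thinking Process:"):
    """Sketch of thinking-mode response salvage: when a length-truncated
    output is still inside its reasoning trace, preserve the raw trace
    as reasoning_content and promote a usable fragment to final content
    instead of returning an empty message. finish_reason handling is
    unchanged (still "length" when the model truncates)."""
    if marker not in raw:
        # normal completion: nothing to salvage
        return {"content": raw, "reasoning_content": None}
    reasoning = raw.split(marker, 1)[1].strip()
    lines = [line for line in reasoning.splitlines() if line.strip()]
    final = lines[-1].strip() if lines else ""
    return {"content": final, "reasoning_content": reasoning}
```

Splitting content from reasoning_content mirrors the common OpenAI-compatible convention for reasoning models, which keeps exact-output scorers from choking on the trace.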
  • Qwen3.5 ORPO reload is now promoted to the target family (2026-03-08):
    • tokenizer parity for packed tokenizer.json + merges is now exact in the runtime
    • repacked ORPO sidecar containers now carry the correct runtime model kind (qwen3_5)
    • canonical proof: benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json
    • observed sidecar chat preview:
      • READY Local tools are useful because they offer immediate, offline access...
  • Same-VM multimodal STT is now promoted from Whisper fallback to Qwen ASR on AWS (2026-03-08):
    • wrapper fix: the local STT loader now initializes the forced aligner only when timestamps are explicitly requested
    • AWS disk cleanup removed obsolete ORPO runtime candidates and recovered the box from 100% to 88% root usage
    • passing Qwen ASR probe: benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v5.json
    • current observed transcript on the synthetic TTS probe:
      • Trinity runs its tools locally on the same machine.
    • remaining caveat:
      • timestamped STT still depends on the forced-aligner path and sufficient local disk to materialize that model
  • Extended same-VM runtime smoke is now fully green (2026-03-08):
    • extended non-thinking profile: pass (benchmarks/qwen35_smoke/results/postmvp-extended_20260308T185130Z.json, 7/7 cases)
    • extended thinking profile: pass (benchmarks/qwen35_smoke/results/postmvp-extended-thinking-r4_20260308T193858Z.json)
  • Clean direct GPQA runtime profile is now captured on AWS (2026-03-07):
    • artifact: benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-clean_20260307T220200Z.json
    • runner: scripts/q35_gpqa_profile_once.py
    • result:
      • on a fresh runtime, the same GPQA prompt run twice still spends most time in decoder_prefill,
      • first call: decoder_prefill=3263.527 ms, decoder_ttft=3317.441 ms,
      • second call with prefix-cache reuse: decoder_prefill=2690.001 ms, decoder_ttft=2750.672 ms,
      • step-0 decode itself is small (decoder_step0_layers ~8 ms, decoder_step0_logits_sample ~33-36 ms),
      • tensor upload improves sharply on the second call (218.091 -> 11.216 ms), but the remaining gap is still overwhelmingly prefill.
  • Qwen3.5 tokenizer audit is now exact for the current packed target (2026-03-06):
    • artifact: benchmarks/qwen35_tokenizer_audit/results/runtime-q35-tokenizer-audit-r4_20260306T190418Z.json
    • packed runtime tokenizer/full vocab exactly matches HF Qwen/Qwen3.5-0.8B (248077 tokens), including <think> and vision/control tokens.
  • New Qwen3.5 probe matrix (2026-03-06) gives a cleaner functional picture than the earlier strict matrix alone:
    • artifact: benchmarks/qwen35_smoke/results/qwen35-probe-matrix-r2_20260306T200035Z.json
    • runtime non-thinking passes the full extended probe set (all_ok=true).
    • runtime thinking also completes all cases, but output discipline is weak and latency is very high.
    • vLLM non-thinking / thinking are not universal wins in this matrix:
      • current text-only launch rejects multimodal placeholder input (400),
      • several thinking/exact-output cases end with finish_reason=length.
    • claim-safe interpretation:
      • runtime is functionally solid in non-thinking,
      • vLLM remains much faster on long-prompt/tool cases,
      • the current runtime blocker is long-prompt/tool-path infer latency, not basic Qwen3.5 tokenizer or tool-call plumbing.
  • Same-VM Hermes MVP is now materially real (2026-03-06):
    • smoke artifact: benchmarks/same_vm_mvp/results/hermes-samevm-q35-smoke-r5_20260306T192703Z.json
    • ORPO smoke launch artifact: benchmarks/same_vm_mvp/results/hermes-samevm-q35-orpo-smoke-r1_20260306T194152Z.json
    • local worker completed a real ORPO training smoke run and saved output under benchmarks/same_vm_mvp/trainings/samevm-orpo-qwen25-smoke3/.
  • Phase 5 harness parse hardening landed (2026-03-04) for closed-form tasks:
    • long reasoning traces no longer get scored via accidental "last number" extraction,
    • parser now requires explicit answer signals (ANSWER: / Final Answer: / boxed / strict numeric-only),
    • think-tag blocks are stripped before parse.
    • code: /Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py
    • sanity artifact (vLLM thinking mode, no parser): phase5_awareness_realbench_q35-parsefix-vllm-thinking1_20260304T032441Z.json now yields prediction_parsed=null instead of a false numeric parse.
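The hardened parse rule can be sketched as follows. This is a minimal Python illustration; the real logic lives in scripts/phase5_awareness_realbench.py, and the exact regex patterns here are assumptions consistent with the described signals (ANSWER: / Final Answer: / boxed / strict numeric-only), not the script's code:

```python
import re

def parse_closed_form_answer(text: str):
    """Extract a final answer only from explicit signals; return None
    (i.e. prediction_parsed=null) rather than falling back to the
    accidental "last number in the trace" extraction."""
    # strip think-tag reasoning blocks before parsing
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

    # explicit textual answer markers
    m = re.search(r"(?:ANSWER|Final Answer)\s*:\s*(.+)", text, flags=re.IGNORECASE)
    if m:
        return m.group(1).strip()

    # LaTeX \boxed{...} answers
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    if m:
        return m.group(1).strip()

    # strict numeric-only completion
    if re.fullmatch(r"-?\d+(?:\.\d+)?", text):
        return text

    # no explicit signal: refuse to guess
    return None
```

Returning None on unsignaled traces is what keeps long reasoning outputs from being scored as spurious numeric matches.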
  • New strict paired AB3 rerun (2026-03-04, gpqa_diamond+ifeval, Arm A only, request_logprobs=false, 16/task, seeds 7,17,27) is complete:
    • summary artifact: phase5_q35_runtime_vs_vllm_ifeval_gpqa_ab3_20260304T034227Z.json
    • overall: runtime score 0.3403 vs vLLM 0.3229 (delta +0.0174, CI includes 0), runtime latency 1772.931 ms vs vLLM 1553.034 ms (delta +219.897 ms).
    • stratified:
      • gpqa_diamond: score parity (0.2708 vs 0.2708), runtime much slower (+1881.776 ms).
      • ifeval: runtime better score (+0.0347) and much faster latency (-1441.983 ms).
  • Runtime awareness retries remain non-promotable on this exact profile (2026-03-04):
    • Arm B/C retries did not improve gpqa_diamond,
    • both regressed ifeval and added latency in tested settings (adaptive, summary-mode uncertainty, no token-logprobs).
  • Strict benchmark guard is now enforced for runtime HTTP runs (2026-03-02): with TRENI_HTTP_REQUIRE_INFERENCE=1, the server returns a hard failure (502 inference_required) whenever inference is unused or empty, eliminating silent heuristic-fallback and zero-filled artifact contamination.
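The guard's contract can be sketched in Python (illustrative only; the real enforcement is in the runtime's HTTP layer, and the payload shape assumed here is hypothetical):

```python
def guard_inference_payload(payload: dict, require_inference: bool = True):
    """Sketch of the TRENI_HTTP_REQUIRE_INFERENCE=1 guard: when the
    inference output is missing or empty, fail hard with a 502-style
    inference_required error instead of letting heuristic or
    zero-filled fallbacks leak into benchmark artifacts."""
    text = (payload.get("inference") or {}).get("text", "")
    if require_inference and not text.strip():
        # mirror the runtime's hard failure: 502 inference_required
        return 502, {"error": "inference_required"}
    return 200, payload
```

Failing loudly at the request boundary is what makes downstream all-zero artifacts attributable to infrastructure rather than to model quality.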
  • Phase 5 paper-loop harness bug fix landed (2026-03-03): in paper mode, retry now commits the refined pass directly (instead of passing through confidence-margin replacement gating).
    • Code: /Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py
    • First live sanity artifacts:
      • vLLM (gpqa_diamond+ifeval, 8 samples/task): phase5_awareness_realbench_qwen35-paperfix-sanity1_20260303T201156Z.json (overall B-A=0.0, gpqa +0.125, ifeval -0.125).
      • runtime (same set): phase5_awareness_realbench_qwen35-paperfix-sanity2-runtime_20260303T201744Z.json (overall B-A=-0.125, retry rate 100%).
  • Runtime "all-zero" sanity artifact (phase5_awareness_realbench_qwen35-paperfix-sanity1-runtime_20260303T201620Z.json) was an infra contamination case, not a scoring result:
    • vLLM and runtime were co-resident on single A10G,
    • runtime hit GPU OOM on embedding upload and strict guard returned 502 inference_required for all requests.
  • Paper trigger calibration issue is now isolated on runtime (2026-03-03):
    • with default paper thresholds, max_entropy triggered retries on all samples (16/16),
    • raising only the perplexity threshold (1.4 -> 1.8 -> 2.2) had no effect on behavior or outcome,
    • raising the entropy threshold (1.5 -> 7.0) reduced retries (16 -> 9) but still produced no score uplift (overall B-A=0.0) and added latency.
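    This behavior is consistent with an any-signal-fires trigger, sketched below (signal and parameter names are assumptions; the defaults are the thresholds quoted above):

```python
def should_retry(perplexity, max_entropy, low_conf_frac,
                 ppl_max=1.4, entropy_max=1.5, low_conf_max=0.2):
    """Paper-style multi-signal retry trigger (sketch): any single signal
    over its threshold fires a retry, which is why raising only the
    perplexity threshold changed nothing while max_entropy kept firing
    on every sample."""
    reasons = []
    if perplexity > ppl_max:
        reasons.append("perplexity")
    if max_entropy > entropy_max:
        reasons.append("max_entropy")
    if low_conf_frac > low_conf_max:
        reasons.append("low_confidence_tokens")
    return bool(reasons), reasons
```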
  • Runtime summary-mode calibration fix is now live (2026-03-03):
    • retry logic now detects runtime_summary uncertainty payloads and uses guarded vote triggering (paper_summary_* thresholds) instead of entropy-only firing.
    • artifact (8/task sanity): phase5_awareness_realbench_qwen35-papersummaryfix-runtime-sanity2_20260303T204120Z.json
      • retry dropped (16 -> 9),
      • quality moved from negative to parity (overall B-A: -0.125 -> 0.0),
      • latency overhead remains material (~+1386 ms mean).
    • higher-N confirmation (32/task): phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32_20260303T204751Z.json
      • mixed result (gpqa +0.03125, ifeval -0.0625, overall B-A=-0.015626).
  • Task-aware summary policy is now the first repeatable positive awareness result on this Qwen3.5 runtime track (2026-03-03):
    • policy: keep summary-mode paper retries for gpqa_diamond, disable summary-mode retries for ifeval,
    • one larger run (32/task): phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32-ifevaloff_20260303T222841Z.json
      • overall B-A=+0.015624, latency delta +618.068 ms,
      • gpqa +0.03125, ifeval +0.0.
    • 3-seed repeatability (16/task): ...ifevaloff-rpt-s7/s17/s27...
      • overall delta mean +0.020833 (range 0.0 to +0.03125),
      • retries constrained to gpqa only, mean retry rate ~0.2917.
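    The policy itself is tiny; a sketch under the assumption that only gpqa_diamond and ifeval are in scope (as in the runs above) and that the function name is hypothetical:

```python
def summary_retry_allowed(task, triggered):
    """Task-aware summary-retry policy: keep paper-mode summary retries
    for gpqa_diamond, disable them for ifeval, where retries only
    regressed score and added latency."""
    if task == "ifeval":
        return False
    return task == "gpqa_diamond" and triggered
```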
  • Late Phase 5 policy pass (2026-03-03) reduced awareness latency overhead without losing quality signal:
    • root cause isolated: most gpqa_diamond retries were invalid_parse with high first-pass confidence, and those retries were usually non-productive.
    • harness changes:
      • compact invalid-parse recovery prompt (build_format_recovery_messages),
      • new confidence gate for invalid-parse retries on closed-form tasks (--invalid-parse-retry-confidence-max).
    • code: /Users/andrewcorrea/treni/scripts/phase5_awareness_realbench.py
    • 3-seed repeatability (16/task, s7/s17/s27) with invalid_parse_retry_confidence_max=0.73:
      • artifacts: phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s16_20260303T232029Z.json, ...-rpt-s17_20260303T232254Z.json, ...-rpt-s27_20260303T232516Z.json
      • quality delta unchanged vs prior baseline: overall B-A mean = +0.020833,
      • latency overhead reduced: +712.276 ms -> +404.603 ms,
      • GPQA retry rate reduced: 0.5833 -> 0.2917.
    • higher-N confirmation (32/task, s7):
      • artifact: phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s32_20260303T232755Z.json
      • same quality delta as prior s32 baseline (overall B-A=+0.015624) with lower latency overhead (+618.068 ms -> +326.187 ms).
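    The invalid-parse confidence gate can be sketched as below (function and argument names are illustrative; only the flag semantics and the 0.73 cap come from the runs above):

```python
def retry_invalid_parse(parsed, first_pass_confidence, confidence_max=0.73):
    """Confidence gate for invalid-parse retries on closed-form tasks
    (--invalid-parse-retry-confidence-max): only fire a format-recovery
    retry when first-pass confidence is at or below the cap, since
    high-confidence invalid parses were usually non-productive retries."""
    if parsed is not None:
        return False  # parsed fine; normal paper gating applies instead
    return first_pass_confidence <= confidence_max
```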
  • Qwen3.5 strict runtime-vs-vLLM matrix has a new Arm A-only canonical rerun (2026-03-03, phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.json) after decoder gate-layout fix:
    • score gap narrowed (runtime 0.15625 vs vLLM 0.19097, delta -0.03472, with a CI consistent with near-parity),
    • latency is still behind (runtime 1723.685 ms vs vLLM 958.757 ms, delta +764.928 ms).
  • Qwen3.5 strict runtime-vs-vLLM canonical matrix is now completed (2026-03-02) after decoder-path unblock (phase5_qwen35_runtime_vs_vllm_matrix_20260302T222013Z.json): runtime is currently behind on both score (0.0503 vs 0.2170) and latency (1881.188 ms vs 178.093 ms).
  • Follow-up Q/K norm fix check (qnorm-check1, 2026-03-02) did not resolve the Qwen3.5 gap (phase5_qwen35_runtime_vs_vllm_matrix_20260302T225529Z.json): runtime remained far slower and still produced malformed repetitive outputs on direct probes, narrowing the primary blocker to missing linear_attn (GatedDeltaNet) parity.
  • Qwen3.5 serving path is now unblocked on AWS via vLLM nightly (0.16.1rc1.dev...) with --language-model-only; stable endpoint validated on 127.0.0.1:18081 (2026-03-02).
  • Infra blocker/fix (2026-03-02): root disk hit 100% and broke vLLM startup (No usable temporary directory); cache/venv cleanup restored ~21GB free and launch stability.
  • Phase 5 A/B/C fairness fix landed (2026-03-02): all arms now reuse the exact same first completion per example before awareness actions.
  • Paper-loop alignment landed in Phase 5 harness (2026-03-02): cloned reference repo (third_party/weave-logprobs-reasoning-loop) and ported multi-signal retry triggers (perplexity, max_entropy, low_confidence_tokens) plus per-call uncertainty traces.
  • Paper-mode smoke validation ran end-to-end on AWS Qwen3.5 nightly (2026-03-02): trigger reason fields and loop traces are present in artifact phase5_awareness_realbench_qwen35-paper-smoke_20260302T191420Z.json.
  • Full Qwen3.5 paper-mode run (r4, 2026-03-02) confirms integration but not uplift at current thresholds: overall B-A=-0.046875, C-A=0.0, with extra latency from retries.
  • Adaptive uncertainty fix (2026-03-02) reduced over-triggering and improved tradeoffs on Qwen3.5:
    • r5 (...r5-adaptive...): B-A=-0.015625, C-A=0.0, with substantially lower latency overhead than r4.
    • stricter r6 variant reached B-A=0.0 but regressed C-A (-0.03125), so r5 adaptive defaults remain preferred.
  • Qwen3.5 Phase 5 run (r3) after fairness fix shows no awareness regressions (all B-A and C-A deltas are 0.0), but no quality uplift yet.
  • Decode-stop semantics are now aligned to end-marker stopping (not im_start), with token-level control-fragment filtering default-on and sanitize still opt-in (2026-03-02); AWS qwen05 probes no longer emit the prior "<|im" leak.
  • Tokenizer encode parity fix landed for chat templates (2026-03-02): <|...|> control tokens are now emitted as atomic special tokens in BPE path instead of punctuation fragments.
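    A sketch of the atomic special-token encode path (`encode_bpe` and the special-token table are hypothetical stand-ins for the runtime's BPE internals):

```python
import re

def encode_with_specials(text, encode_bpe, special_tokens):
    """Chat-template encode parity sketch: <|...|> control tokens are
    emitted as single atomic special-token IDs instead of being run
    through BPE merges as punctuation fragments."""
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if piece in special_tokens:
            ids.append(special_tokens[piece])  # one atomic ID
        elif piece:
            ids.extend(encode_bpe(piece))      # ordinary BPE path
    return ids
```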
  • HTTP fallback behavior fix landed (2026-03-02): when inference succeeds but content is empty, API now returns empty assistant content instead of synthetic route-classifier text.
  • qwen05 deterministic MCQ token-0 stop parity gap is now resolved (2026-03-02) via Qwen default system preamble injection for user-only chats in runtime HTTP template build.
  • Post-fix qwen05 external-cold validation is complete (2026-03-02):
    • runtime now returns non-empty completions on the prior failing path (external_cold_qwen05_templatefix_20260302T154019Z.json),
    • TTFT remains strongly ahead of vLLM in this profile (1.703 ms vs 49.759 ms).
  • Phase 5 real-benchmark first canonical diagnostic pack is now complete (2026-03-01, r5): after runtime prompt/tokenizer fixes, gpqa_diamond and ifeval improved materially (A=0.500 and A=0.5625, respectively), while gsm8k/aime25 remain at 0.0 in this setup.
  • Phase 5 matched-depth/matched-sample rerun on qwen after this fix (r9, 2026-03-02) is now complete:
    • gpqa_diamond dropped (A 0.500 -> 0.125) vs r5,
    • gsm8k recovered materially (A 0.000 -> 0.625, C 0.000 -> 0.750),
    • aime25 remains weak but Arm C is non-zero (0.125),
    • current interpretation stays mixed-by-task (not a universal quality win yet).
  • Tokenizer/runtime root-cause fixes landed for Phase 5 quality debugging: message aggregation (system+user), prompt-cap default (32 -> 256), BPE merges, added_tokens load, and UTF-8 JSON escape handling.
  • Qwen-template auto A/B run (r6, 2026-03-01) regressed both quality and latency versus r5; template path remains opt-in and non-canonical.
  • Phase 5 HF-reference parity is now complete on the same sampled set (phase5_hf_reference_qwen_r5_20260301T1900Z.json): runtime is higher on GPQA, slightly lower on IFEval, and tied at 0.0 on GSM8K/AIME (so math-task zeros are not runtime-only breakage in this setup).
  • Real-benchmark awareness A/B/C harness is now implemented (2026-02-28) for gpqa_diamond, ifeval, gsm8k, and aime25 (scripts/phase5_awareness_realbench.py + run wrapper); first canonical diagnostic run pack is now published (r5).
  • Higher-N same-window runtime-vLLM rerun (AB5, 2026-02-28) keeps runtime ahead on full request path (1184.812 ms vs 1318.675 ms, vLLM/runtime=1.113x) and cold-total first response (5.810x ratio).
  • Post-AB5 full-depth gate sweep (AB2 + delayed-Lt AB3, 2026-02-28) did not unlock a new canonical toggle: delayed-Lt failed mixed-load confirmation and proj_fast remained mixed/noise.
  • Tuned delayed-Lt slow-gate rescue (AB2, 2026-02-28) also stayed non-promotable: warm remained slightly positive but mixed stayed near-flat with p99 regression, so delayed-Lt is still non-canonical.
  • FFN-proj mixed-input fallback patch (2026-02-28) removes repeated failed batched2 GEMM attempts under forced-Lt stress, but canonical re-gate still leaves f32_input non-promotable on default path.
  • Full-depth qwen-focused rerun on clean inference path (pool=16384, no fallback) showed a positive TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 signal, but full foundation gate later rejected global promotion (canonical stays default-off).
  • Internal routing beats external routing on matched benchmark tasks.
  • Cold start bottlenecks were decomposed stage-by-stage; model_tensor_index_build is no longer a dominant stage.
  • Remaining cold cost is now concentrated mostly in Qwen decoder_tensor_upload.
  • External cold-start comparison (runtime vs PyTorch vs vLLM vs Ollama) now has a canonical G5 artifact with explicit request-path vs total-cold interpretation.
  • After decoder loop and sampling fixes, parity-corrected 48-token request path now beats vLLM on TTFT and full latency in latest G5 run.
  • First routing failure-amplification stress profile now shows external retry/timeout chains increase both latency and error rate vs internal.
  • Routing matrix expansion on G5 confirms this trend across 6 profiles (baseline + escalating stress).
  • Cross-host routing pilot (local client + SSH tunnel to G5 runtime/controller) now reproduces external-path degradation under stress.
  • Split-host routing matrix (CPU router host + GPU runtime host) is now complete as canonical Track B evidence.
  • Qwen cold upload now has a direct on/off ablation for GPU-side BF16/F16 conversion, showing large cold-path reduction.
  • External-cold runtime-only rerun confirms the same fix improves startup+cold-total, not just first-hit request latency.
  • Runtime-vLLM external-cold repeatability rerun now confirms the same direction with restored vLLM environment.
  • External-cold all-backend repeatability (runtime + PyTorch + vLLM + Ollama) is now complete after GPU-convert fix.
  • Runtime host-prefetch cold fix now removes the intermittent preload upload outlier while preserving request-path TTFT/full latency.
  • Staged H2D upload (TRENI_TENSOR_H2D_STAGING) is now benchmarked with chunk-size follow-up and is currently regressed on G5, so the path is parked opt-in/default-off.
  • Non-staging H2D chunk-size tuning (TRENI_TENSOR_H2D_CHUNK_MB=0/64/128) was initially near-neutral on this profile; later 2026-02-28 full-depth AB3 reruns promoted default 0 (see newer entry).
  • Host page-touch pre-fault upload path (TRENI_TENSOR_HOST_TOUCH) is now implemented and benchmarked; it shifts time from H2D to prefetch and regresses request latency in this profile, so it remains opt-in/default-off.
  • Upload sync diagnostics now isolate cold upload composition: conversion is measurable when synchronized, but H2D transfer remains the dominant stage.
  • Synchronized host-register diagnostics now confirm no meaningful transfer benefit on this profile, so that lane is currently deprioritized.
  • Decoder logits u16 mixed-precision path is now implemented/benchmarked; despite slight cold-upload reduction, request-path latency regresses and the lane remains parked.
  • Tensor-cache hash lookup lane (TRENI_TENSOR_CACHE_HASH) is now implemented/benchmarked and remains near-neutral in this profile with slight warm p99 regression, so it stays opt-in/default-off.
  • Sampler direct-store lane (TRENI_SAMPLE_DIRECT_STORE) is now implemented/benchmarked and regresses warm request latency in this profile, so it stays opt-in/default-off.
  • Decoder direct-out residual lane (TRENI_DECODER_DIRECT_OUT_HIDDEN) initially regressed on warm-profile A/B (2026-02-24), but was later revalidated and promoted for the current full-depth lane (2026-02-27 late cycle).
  • Multi-head seq1 attention lane (TRENI_ATTN_SEQ1_USE_MULTIHEAD) is now implemented/benchmarked and shows clear wins on qwen and bart request paths; it is now default-on with a bounded max-kv guard.
  • External-cold repeatability after seq1 multi-head default promotion is now complete (2026-02-24): runtime retained large margins vs PyTorch and vLLM while improving its own TTFT/full/cold-total vs prior host-prefetch baseline.
  • First step0 softmax/PV exp-reuse patch is now complete (2026-02-24) and validated on the same external-cold 3-run set: runtime remained ahead with small additional gains (sub-ms on full/cold-total).
  • Second step0 shared-probability follow-up was tested on the same 3-run set and did not beat exp-reuse; it was reverted to keep the better path.
  • Non-step0 decode-stage profiling is now wired (TRENI_DECODE_STAGE_PROFILE) and first G5 run (2026-02-25) shows decoder_stepN_logits_sample is the dominant decode stage on qwen.
  • Uncertainty capture ablation on the same run profile (TRENI_DEMO_CAPTURE_UNCERTAINTY=1/0) shows a measurable but secondary effect (~6.5 ms full-request delta at 64 tokens), confirming the main remaining hotspot is still logits+sample compute itself.
  • Additional decode split profiling (2026-02-25) now isolates decoder_stepN_logits_proj from sampling; logits projection remains dominant (~2.458 ms) and three immediate optimization probes were near-neutral/regressed, so those code paths were reverted.
  • Full-depth qwen check (--layers 36, --pool-mb 16384) is now explicitly validated; in this mode decoder_stepN_layers dominates, and runtime-vLLM results must be interpreted separately from the fast --layers 2 profile.
  • Runtime-vLLM cold rerun on the same host/profile (2026-02-25) still shows clear runtime lead (TTFT/full/cold-total).
  • Full-depth FFN u16 weight path (TRENI_DECODER_FFN_U16_PATH=1) now shows a material runtime uplift (TTFT -8.8 ms, full -328 ms, cold-total full -1329 ms) but still does not close the full request-path gap to vLLM in latest G5 A/B.
  • Full-depth ATTN+FFN+LOGITS u16 path now further improves runtime means (TTFT -10.8 ms, full -372 ms, cold-total full -1374 ms vs baseline), but request full is still slower than vLLM (~1.365x ratio in latest 3-seed set).
  • Post-rebuild full-depth sanity reruns (2026-02-26) remain aligned with the same residual-fused baseline (~1720 ms request full), confirming no hidden regression from recent code/instrumentation changes.
  • Full-depth FFN sub-stage split (2026-02-26) now shows ffn_proj is dominated by gate/up GEMMs (~0.101 + ~0.099 ms) while cast is minor (~0.005 ms); a batched gate+up trial regressed and was reverted.
  • Full-depth attention qkv fused-alias path (TRENI_DECODER_ATTN_U16_QKV_FUSED) is now implemented and default-on in this lane (2026-02-26), with 3-seed gains in runtime-only (full -5.869 ms) and runtime-vLLM matrix (runtime full -6.542 ms).
  • Full-depth FFN activation-to-u16 fused path (TRENI_DECODER_FFN_ACT_U16_FUSED) is now implemented and default-on in this lane (2026-02-26), with 3-seed gains in runtime-only (full -10.713 ms, cold_full -10.696 ms) and improved runtime-vLLM full ratio (1.3208x -> 1.3012x), while strict parity remains clean (checked=3, failed=0).
  • Full-depth follow-up probe cycle (2026-02-27) closed additional speculative lanes:
    • TRENI_DECODER_FFN_PROJ_U16_FUSED=1 regressed slightly but consistently in both runtime-only and runtime-vLLM 3-seed sets.
    • TRENI_LINEAR_U16_FAST_COMPUTE=1 was near-neutral/slightly regressed in the initial runtime-only 3-seed A/B (later superseded by 2026-02-28 AB5 promotion evidence).
    • TRENI_LINEAR_LT_WORKSPACE_MB=64 and TRENI_LINEAR_USE_LT=0 both regressed materially; canonical remains Lt on with zero workspace.
  • Linear Lt runtime path is now shape-failure-scoped (no global disable on first Lt miss); 3-seed runtime-only and runtime-vLLM checks on the full-depth profile were near-neutral (~0.05% full-latency movement), so this is a robustness fix, not a performance unlock.
  • Full-depth FFN projection batched2 lane (TRENI_DECODER_FFN_PROJ_U16_BATCHED2) is now implemented, benchmarked, and promoted default-on (2026-02-27):
    • runtime-only 3-seed: TTFT 15.189 -> 15.018 ms, full 1702.190 -> 1689.991 ms, cold_full 4708.109 -> 4696.805 ms.
    • runtime-vLLM 3-seed (runtime leg): TTFT 15.207 -> 15.032 ms, full 1704.091 -> 1691.116 ms, cold_full 4710.111 -> 4697.207 ms.
    • stage profile corroboration (off vs on): decoder_step_profile_ffn_proj_mean 0.205 -> 0.196 ms/layer, decoder_stepN_layers_mean 19.140 -> 18.447 ms.
    • strict parity remains clean in explicit-on and default-on reports (checked=3, failed=0).
  • Full-depth direct-out hidden lane (TRENI_DECODER_DIRECT_OUT_HIDDEN) is now promoted default-on for this profile (2026-02-27, late cycle):
    • runtime-only 3-seed: full 1690.855 -> 1684.908 ms, infer 1668.381 -> 1662.753 ms.
    • strict parity remains clean (week3_parity_report_directouthidden_default_20260227T184738Z.json, checked=3, failed=0).
  • Full-depth fused qkv split+bias lane (TRENI_DECODER_QKV_SPLIT_BIAS_FUSED) is now implemented and promoted default-on (2026-02-27, late cycle):
    • runtime-only 3-seed: TTFT 14.951 -> 14.687 ms, full 1684.135 -> 1663.776 ms, cold_full 4690.132 -> 4669.847 ms, infer 1662.833 -> 1641.322 ms.
    • strict parity remains clean (week3_parity_report_qkvsplitbias_default_20260227T190739Z.json, checked=3, failed=0).
  • External-cold harness now captures completion-length signals and supports fixed-token vLLM fairness (ignore_eos, streamed usage capture):
    • latest fixed-length runtime-vLLM set confirms matched completion_tokens=64 for both sides.
    • results: runtime TTFT=14.685 ms, full=1662.478 ms; vLLM TTFT=50.272 ms, full=1293.215 ms.
    • interpretation: runtime keeps a strong TTFT advantage, but request full still trails in this full-depth configuration.
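    An illustrative fixed-length fairness request against vLLM's OpenAI-compatible server (the model name is a placeholder; the harness's actual request construction may differ):

```python
# Fixed-token fairness: cap and force exactly 64 completion tokens on the
# vLLM side, and capture usage from the stream so matched completion_tokens
# can be verified in the artifact.
payload = {
    "model": "qwen-placeholder",                 # placeholder model name
    "prompt": "...",
    "max_tokens": 64,
    "ignore_eos": True,                          # do not stop early at EOS
    "stream": True,
    "stream_options": {"include_usage": True},   # streamed usage capture
}
```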
  • Logits-only fast-compute hook follow-up (TRENI_DECODER_LOGITS_U16_FAST_COMPUTE, 2026-02-27 night) is now complete:
    • runtime-only 3-seed means: off full=1661.945 ms vs on full=1662.713 ms (no win; slight regression).
    • strict parity after hook integration remained clean (week3_parity_report_logitsfast_hook_20260227T193756Z.json, checked=3, failed=0).
    • decision: keep this knob disabled in canonical full-depth lane.
  • U16 tensor-cache unlock (TRENI_TENSOR_CACHE_U16, default-on) is now complete and claim-safe:
    • runtime-only 3-seed A/B (off/on) shows a large request-path drop: full 1661.982 -> 1189.452 ms (-472.529 ms), infer 1640.118 -> 1168.883 ms.
    • same-window runtime-vLLM A/B (off/on, 2 seeds each) flips request-full ordering:
      • off: runtime 1663.314 ms vs vLLM 1325.189 ms (runtime slower)
      • on: runtime 1192.145 ms vs vLLM 1290.816 ms (runtime faster)
    • log mechanism check confirms measured-request upload collapse:
      • off: decoder_tensor_upload ~476 ms, decoder_tensor_h2d ~468 ms
      • on: decoder_tensor_upload ~5 ms, decoder_tensor_h2d 0 ms
    • strict parity remains clean on final default-on build (week3_parity_report_u16cache_toggle_default_20260227T200652Z.json, checked=3, failed=0).
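    The mechanism can be sketched as a keyed device-side cache (helper names are hypothetical; only the collapse in decoder_tensor_upload is from the logs above):

```python
_u16_cache = {}

def device_tensor(name, load_host, convert_upload):
    """TRENI_TENSOR_CACHE_U16 mechanism sketch: the first request converts
    and uploads a tensor, then the device-resident u16 copy is reused,
    which is why measured-request decoder_tensor_upload collapses from
    ~476 ms to ~5 ms and decoder_tensor_h2d drops to 0 ms."""
    if name not in _u16_cache:
        _u16_cache[name] = convert_upload(load_host(name))
    return _u16_cache[name]
```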
  • Late-night FFN retest cycle (2026-02-27) is complete with no new canonical promotion:
    • TRENI_LINEAR_BATCHED2_USE_LT=1 regressed materially in runtime-only full-depth A/B (full +12.469 ms).
    • higher-N TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=1 + TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 remained near-noise (full -0.198 ms).
    • fused-path bias-deferral expansion for TRENI_DECODER_FFN_PROJ_U16_FUSED=1 produced only near-noise movement (full -0.383 ms).
    • consolidated artifact: external_cold_layers36_ffn_followup_summary_20260227T223458Z.
  • Fast-profile logits fast-compute AB8 rerun (2026-02-28, --layers 2) remains near-noise (full -0.299 ms), with stage profile unchanged (decoder_stepN_logits_proj_mean ~1.261 ms), so no promotion from this lane.
  • Mixed-load repeatability rerun (2026-02-28, canonical lane, 3x120 requests) is stable: mean 122.247 ms, p95 198.518 ms, p99 199.608 ms.
  • Strict parity follow-up on the latest patched build (2026-02-28) passed (checked=3, failed=0).
  • Runtime benchmark stage parser fix (2026-02-28): phase2_runtime_benchmark.py now correctly parses decimal timing stage=... ms=... values (previous regex truncated to integer prefixes). Request-level TTFT/infer/full metrics were unaffected; stage telemetry is now reliable for hotspot ranking.
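    A decimal-aware stage parse looks like this (an illustrative reconstruction of the fix, not the exact pattern in phase2_runtime_benchmark.py):

```python
import re

# The previous pattern truncated "ms=12.875" to its integer prefix,
# skewing stage telemetry; this variant captures the full decimal value.
STAGE_RE = re.compile(r"stage=(\S+)\s+ms=(\d+(?:\.\d+)?)")

def parse_stage(line):
    m = STAGE_RE.search(line)
    return (m.group(1), float(m.group(2))) if m else None
```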
  • Full-depth parser-fixed profile reruns (cold_profile_qwen_layers36_fixparse_20260228T011037Z, warm_profile_qwen_layers36_fixparse_20260228T011037Z) reconfirm FFN dominance in layer compute (ffn_proj ~0.366 ms/layer, ffn_down_resid ~0.190 ms/layer, step_total ~0.705 ms/layer).
  • Full-depth warm AB3 for TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0/1 (ffn_fast_compute_ab3_20260228T011146Z_summary) regressed slightly (request +0.317 ms, infer +0.305 ms) with no stage win in that cycle; lane stayed non-canonical until later clean-path reruns.
  • Strided batched-Lt fallback for batched2 FFN was implemented and tested (batched2lt_strided_ab3_20260228T011651Z_summary); warm AB3 was near-noise and runtime-only external-cold sanity was slightly worse, so the path remains opt-in and not promoted.
  • FFN gate/up dual-bias fused add (TRENI_DECODER_FFN_BIAS_PAIR_FUSED) now has a full-depth A/B set (2026-02-28):
    • warm AB3 showed a small request-path improvement (request -0.229 ms, p99 -0.390 ms, infer -0.090 ms) with near-flat TTFT (+0.009 ms);
    • cold follow-up (3 seeds each) regressed slightly (full +1.928 ms, infer +1.875 ms), so the path is currently non-canonical.
  • Batched2 seq1 split-GEMM lane (TRENI_LINEAR_BATCHED2_SPLIT_SEQ1) now has a full-depth warm/cold AB3 set (2026-02-28):
    • warm AB3 was near-noise/slightly worse (request +0.014 ms, infer +0.105 ms, p99 +0.124 ms);
    • cold AB3 improved slightly (full -2.070 ms, infer -2.002 ms, ttft -0.021 ms);
    • decision: keep opt-in only, not canonical, because warm path does not improve.
  • Batched2 dup-input strided lane (TRENI_LINEAR_BATCHED2_DUP_INPUT) now has a full-depth warm/cold AB3 set (2026-02-28):
    • warm AB3 regressed slightly on means (request +0.317 ms, infer +0.293 ms, ttft +0.009 ms) with minor p99 improvement (-0.208 ms);
    • cold AB3 also regressed (full +1.307 ms, infer +1.388 ms, ttft +0.010 ms);
    • decision: keep opt-in and non-canonical.
  • Batched2 dup-input v2 kernel swap probe (2026-02-28) was run as a warm AB2 gate set (batched2_dupinput_v2warm_ab2_20260228T032741Z) and regressed all warm means (request +0.438 ms, infer +0.381 ms, p99 +0.217 ms); probe was rejected and reverted before AB3 expansion.
  • FFN projection fused-lane gate rerun (TRENI_DECODER_FFN_PROJ_U16_FUSED=0/1, warm AB2, 2026-02-28) remained near-flat/slightly worse on means (request +0.149 ms, infer +0.173 ms); no AB3 expansion.
  • FFN projection batched2 f32-input gate rerun (TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0/1, warm AB2, 2026-02-28) regressed (request +0.236 ms, infer +0.248 ms, p99 +0.512 ms); no AB3 expansion.
  • Linear u16 compute16f gate probe (2026-02-28, warm AB2, TRENI_LINEAR_U16_FORCE_COMPUTE_16F=0/1) regressed (request +0.210 ms, infer +0.240 ms, p99 +0.594 ms) and was rejected/reverted; no AB3 expansion.
  • Explicit-u16 full-depth warm rerun (qwen, layers=36, 2026-02-28) confirms active decode split in this lane: decoder_step_profile_total_mean ~0.402 ms, ffn_proj ~0.196 ms, ffn_down_resid ~0.099 ms.
  • Experimental FFN gate/up pair-pack lane (TRENI_DECODER_FFN_PAIR_PACK_U16, ffn_pair_pack_gate_ab2_20260228T040616Z) now has AB3 results:
    • warm AB3 delta (on-off): request -0.423 ms, infer -0.442 ms, p99 -0.673 ms;
    • both off/on runs already had contiguous gate/up pair active, so this is not a causal promotion signal.
    • decision: keep lane default-off and experimental.
  • Batched2 Lt rerun on explicit-u16 lane (TRENI_LINEAR_BATCHED2_USE_LT) is now split by warm/cold evidence:
    • warm AB3 (batched2_use_lt_u16lane_gate_ab2_20260228T041041Z): request -0.313 ms, infer -0.468 ms, p99 -0.511 ms;
    • cold AB3 (batched2_use_lt_u16lane_cold_ab2_20260228T041359Z): full +1.165 ms, infer +1.424 ms.
    • fixed-on decision: keep non-canonical.
  • Adaptive delayed batched2 Lt policy (TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS) has warm/cold wins but is not canonical (2026-02-28):
    • 5000ms AB3 (batched2_lt_enable_after_ms5000_warm_ab3_20260228T104525Z, batched2_lt_enable_after_ms5000_cold_ab3_20260228T104712Z) stayed mixed (warm gain, cold full +0.422 ms).
    • 10000ms AB3 (batched2_lt_enable_after_ms10000_warm_ab3_20260228T105028Z, batched2_lt_enable_after_ms10000_cold_ab3_20260228T105213Z) is net-positive:
      • warm delta: request -0.363 ms, infer -0.326 ms, p99 -0.696 ms;
      • cold delta: startup -4.307 ms, full -0.635 ms, infer -0.347 ms, TTFT -0.070 ms.
    • strict parity pass: week3_parity_report_batched2_lt_delay10000_20260228T105329Z.json (checked=3, failed=0).
    • default-path strict parity smoke (without explicit batched2 Lt env overrides) also passed: week3_parity_report_batched2_lt_defaultdelay_20260228T110825Z.json.
    • same-window mixed-load A/B (mixed_load_defaultdelay_onoff_ab3_20260228T115010Z.json) regressed with delayed-on (mean +0.846 ms, p95 +1.627 ms, p99 +0.679 ms).
    • parser defaults remain off: TRENI_LINEAR_BATCHED2_USE_LT=0 and TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0.
    • post-revert strict default-path parity pass: week3_parity_report_postrevert_defaults_20260228T115543Z.json.
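    The delayed-enable semantics can be sketched as a small uptime gate (class and method names are hypothetical; the threshold meaning and the 0 = never-enable default come from the entry above):

```python
import time

class DelayedLt:
    """TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS sketch: run the plain
    batched GEMM during the cold window, switch to the Lt path once
    process uptime crosses the threshold; 0 means never enable."""
    def __init__(self, enable_after_ms):
        self.enable_after_ms = enable_after_ms
        self.start = time.monotonic()

    def use_lt(self):
        if self.enable_after_ms <= 0:
            return False
        return (time.monotonic() - self.start) * 1000.0 >= self.enable_after_ms
```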
  • Parser-default foundation rerun pack (foundation_defaultdelay_pack_20260228T114315Z) is now published:
    • warm AB3 means: request 147.258 ms, p99 247.617 ms, infer 128.450 ms, TTFT 16.999 ms;
    • cold AB3 means: startup 425.532 ms, full 598.787 ms, infer 580.173 ms, TTFT 12.210 ms;
    • mixed repeatability stayed worse than prior canonical summary (mixed_load_repeatability_compare_defaultdelay_vs_prev_20260228T114748Z.json: mean +2.841 ms, p95 +5.587 ms, p99 +5.140 ms), which aligns with keeping delayed-on non-canonical.
  • Experimental FFN batched2 Lt prewarm path (TRENI_DECODER_FFN_BATCHED2_LT_PREWARM) is now implemented and benchmarked:
    • fixed-Lt warm AB2 (batched2_lt_prewarm_warm_ab2_20260228T042453Z): request -0.328 ms, infer -0.394 ms;
    • fixed-Lt cold AB3 (batched2_lt_prewarm_cold_ab3_20260228T042649Z): full -1.497 ms, infer -1.406 ms.
  • Direct same-window combo A/B (lt=0,prewarm=0 vs lt=1,prewarm=1) remains mixed:
    • combined summary (batched2_lt_prewarm_combo_summary_20260228T042733Z.json) shows warm AB3 regression (request +0.198 ms, infer +0.178 ms) despite cold AB3 improvement (full -1.099 ms, infer -0.819 ms).
    • decision: keep prewarm path default-off and non-canonical.
  • TRENI_DECODER_FFN_DOWN_U16_FAST_COMPUTE now has canonical full-depth evidence and is promoted default-on (2026-02-28):
    • warm AB3 (ffn_down_fast_compute_gate_ab3_20260228T044546Z): request -0.565 ms, infer -0.566 ms, p99 -1.405 ms, TTFT -0.030 ms.
    • cold AB3 (ffn_down_fast_compute_cold_ab3_20260228T044753Z): startup -8.405 ms, full -0.351 ms, infer -0.406 ms, TTFT -0.028 ms.
    • strict parity pass (week3_parity_report_ffn_down_fast_20260228T044846Z.json): checked=3, failed=0.
  • Post-promotion FFN retest matrix (2026-02-28) is complete and did not produce a second promotion:
    • new structural stacked-GEMM lane (TRENI_LINEAR_BATCHED2_STACKED_SEQ1) regressed in warm AB3 (request +1.259 ms, infer +1.229 ms, p99 +2.830 ms) and stayed near-flat/slightly worse in cold AB3 (full +0.030 ms), so it remains experimental/default-off.
    • TRENI_LINEAR_BATCHED2_SPLIT_SEQ1 AB3 regressed warm (request +0.964 ms) and cold (full +1.496 ms).
    • TRENI_LINEAR_BATCHED2_USE_LT fixed-on AB3 improved warm (request -0.855 ms) but still regressed cold startup/full (startup +10.474 ms, full +0.330 ms); delayed-on improved warm/cold but still regressed mixed-load, so lane remains non-canonical.
    • combo lt=1 + prewarm=1 gave AB3 gains but failed AB5 cold confirmation (startup +3.199 ms, full +1.152 ms).
    • TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 remained non-canonical at that stage.
  • Follow-up rerun cycle (2026-02-28 late) promoted TRENI_LINEAR_U16_FAST_COMPUTE after higher-N validation:
    • warm+mixed AB5 (linearfast_ab5_20260228T124736Z/summary_ab5.json) was positive in both modes:
      • warm on-off: request -0.139 ms, p95 -0.128 ms, p99 -0.009 ms;
      • mixed on-off: request -0.139 ms, p95 -0.156 ms, p99 -0.208 ms.
    • cold AB3 (linearfast_cold_ab3_20260228T124510Z/summary_ab3.json) stayed near-flat on full latency (+0.302 ms) with better startup (-4.207 ms) and TTFT (-0.019 ms).
    • strict parity passed (week3_parity_report_linearfast_20260228T124557Z.json, checked=3, failed=0).
    • post-default strict parity smoke also passed (week3_parity_report_post_linearfast_default_20260228T125804Z.json).
    • same-window default-vs-forced-off sanity (linearfast_default_sanity_20260228T125957Z) is directionally positive on mixed request path (mean -0.603 ms, p95 -0.984 ms, p99 +0.029 ms).
    • runtime parser default is now TRENI_LINEAR_U16_FAST_COMPUTE=1.
  • Full-depth FFN projection fast-compute rerun (2026-02-28, late 8) completed on clean path (TRENI_POOL_MB=16384, classifier-disabled HTTP lane):
    • profiled AB3 (ffnprojfast_fullstep_ab3_20260228T160255Z/summary_ab3.json), on-off: request -0.370 ms, infer -0.348 ms, p99 -0.533 ms, TTFT -0.045 ms.
    • non-profiled warm AB3 (ffnprojfast_fullwarm_ab3_20260228T160358Z/summary_ab3.json), on-off: request -0.249 ms, infer -0.225 ms, p99 -0.328 ms, TTFT -0.015 ms.
    • strict parity passed with explicit candidate env and on temporary promoted build:
      • week3_parity_report_ffnprojfast_candidate_20260228T160459Z.json
      • week3_parity_report_ffnprojfast_default_20260228T160639Z.json
    • post-promotion sanity AB3 (ffnprojfast_default_sanity_ab3_20260228T160557Z/summary_ab3.json) stayed near-flat and directionally positive on means (default-force_off request -0.094 ms, infer -0.093 ms), with tiny p99 increase (+0.057 ms).
    • interim decision in that cycle: candidate looked positive and moved to full foundation validation.
  • Full foundation validation then rejected global promotion (2026-02-28, late 9):
    • foundation pack (foundation_ffnprojfastdefault_pack_20260228T194204Z/summary_ab3.json) was slower versus prior canonical in all modes (warm/cold/mixed means).
    • same-window foundation gate AB2 (foundation_ffnprojfast_gate_ab2_20260228T195240Z/summary_gate_ab2.json):
      • default-force_off warm request +0.489 ms, cold full +0.746 ms, mixed mean +0.004 ms (tails improved).
    • final decision: keep parser canonical default TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0 and retain the lane as opt-in.
  • Canonical rerun on the promoted default (2026-02-28, late 2) is now published:
    • foundation pack: foundation_linearfastdefault_pack_20260228T134157Z (summary_ab3.json).
    • versus prior parser-default foundation (20260228T114315Z): warm/cold were near-flat/slightly slower, while mixed improved (request -0.629 ms, p95 -1.281 ms, p99 -0.163 ms).
  • Same-window runtime-vLLM full-depth AB3 rerun on updated canonical lane (aws_speedpass_runtime_vllm_linearfastdefault_ab3_20260228T134630Z) now shows runtime ahead on request full latency:
    • runtime 1185.186 ms vs vLLM 1305.971 ms (vLLM/runtime full = 1.102x).
    • cold-total-first-response and cold-total-first-token remain dominated by vLLM process startup in this harness profile (5.807x, 7.648x over runtime respectively).
  • Batched2-Lt fast-fallback short-circuit experiment (2026-02-28) was evaluated and reverted:
    • isolation AB3 (fastfallback_isolation_ab3_20260228T140122Z/summary_ab3.json) showed warm regression (request +1.155 ms, p95 +2.124 ms, p99 +1.504 ms) and mixed near-flat/slightly worse (mean +0.144 ms, p95 +0.569 ms), despite cold full improvement (-0.846 ms).
    • decision: keep reverted (non-canonical).
    • post-revert strict parity remains clean (week3_parity_report_post_fastfallback_revert_20260228T140626Z.json).
  • TRENI_TENSOR_H2D_CHUNK_MB was re-tested on current full-depth canonical lane (2026-02-28) and promoted default to 0 (no chunking):
    • cold AB3 (h2d_chunk_cold_ab3_20260228T142114Z/summary_ab3.json), chunk0 - chunk64: startup -4.022 ms, full -2.562 ms, infer -2.542 ms, TTFT -0.060 ms; decoder_tensor_h2d -3.347 ms.
    • warm+mixed AB3 (h2d_chunk_warm_mixed_ab3_20260228T142258Z/summary_ab3.json), chunk0 - chunk64: warm request -0.442 ms; mixed request -0.044 ms.
    • strict parity after promotion passed (week3_parity_report_h2dchunk0_default_20260228T142805Z.json).
    • single-run sanity (h2d_chunk_default_vs64_sanity_20260228T142845Z) showed small mixed sensitivity (default-force64 mean +0.340 ms), so this remains on repeatability watch.
  • Higher-N same-window runtime-vLLM rerun (AB5, updated defaults, 2026-02-28) now tightens the full-depth claim:
    • run root: benchmarks/phase2_external_cold/results/aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z
    • summary: summary_ab5.json / summary_ab5.md
    • means:
      • runtime: full 1184.812 ms, TTFT 14.640 ms, cold-total full 4190.848 ms
      • vLLM: full 1318.675 ms, TTFT 50.309 ms, cold-total full 24350.818 ms
    • ratios (vLLM/runtime): full 1.113x, TTFT 3.436x, cold-total full 5.810x.
    • compare vs prior AB3 (compare_vs_prev_linearfastdefault_ab3.json / .md): runtime full improved slightly (-0.375 ms) and full-ratio direction strengthened (1.102x -> 1.113x).
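The ratio lines above are straight quotients of paired same-window means. A minimal sketch, reproducing the AB5 ratios from the summary numbers quoted above (the dict keys are illustrative, not the summary schema):

```python
def ratio(competitor_ms: float, runtime_ms: float) -> float:
    """vLLM/runtime ratio: values > 1.0 mean the runtime is faster."""
    return competitor_ms / runtime_ms

# Means copied from the AB5 summary above (ms).
runtime = {"full": 1184.812, "ttft": 14.640, "cold_total_full": 4190.848}
vllm = {"full": 1318.675, "ttft": 50.309, "cold_total_full": 24350.818}

ratios = {k: round(ratio(vllm[k], runtime[k]), 3) for k in runtime}
# vLLM/runtime: full ~1.113x, TTFT ~3.436x, cold-total full ~5.81x
```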
  • Post-AB5 full-depth gate sweep on current defaults (2026-02-28) is now complete and does not add a new canonical lane:
    • gate artifact: benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_gate_newdefaults_20260228T150709Z/summary_gate_ab2.json
    • delayed-Lt was directionally positive in AB2 and advanced to AB3 (warm request -0.384 ms, mixed request -0.256 ms), while TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1 remained mixed/noise (warm p99 +0.129 ms, mixed p99 +0.022 ms).
    • delayed-Lt AB3 confirmation artifact: benchmarks/phase2_runtime/results/aws_speedpass/fulldepth_delayedlt_ab3_20260228T151322Z/summary_ab3.json
    • AB3 result split:
      • warm on-off: request -0.330 ms, infer -0.270 ms, p99 -0.098 ms;
      • mixed on-off: request +0.173 ms, infer +0.191 ms, p99 +0.291 ms.
    • decision at that stage: keep delayed-Lt non-canonical on defaults (TRENI_LINEAR_BATCHED2_USE_LT=0, TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0); TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0 stayed canonical.
  • Full-depth stage-profile refresh (external_cold_layers36_stageprofile_20260227T175604Z) reconfirms remaining decode hotspot hierarchy:
    • decoder_stepN_layers_mean ~19.107 ms (dominant)
    • decoder_stepN_logits_proj_mean ~1.260 ms
    • decoder_stepN_sample_mean ~0.331 ms
    • layer split still led by FFN projection (decoder_step_profile_ffn_proj_mean ~0.204 ms/layer).
  • Phase 3 loop-capability now has a canonical G5 set (baseline+stress, 3 seeds each) with consolidated summary artifacts.
  • Canonical Phase 3 shows internal loops keep 100% success, while external loops lose success in tool-state adaptation and incur large hop/retry latency amplification.
  • Phase 3 uncertainty ablation now has a first baseline matrix (runs=8) showing success drops when uncertainty-aware branching is disabled on either internal or external path.
  • Phase 3 uncertainty ablation now has 3-seed baseline+stress repeatability with consolidated baseline-vs-stress comparison.
  • Runtime now exports request uncertainty in HTTP responses, and Phase 3 C2 harness now supports runtime_native uncertainty source.
  • Runtime response contract now includes unified awareness (route + generation) while preserving legacy uncertainty; Phase 3 runtime-native client now consumes the unified payload first.
  • Runtime-native C2 calibrated rerun (calib1) is complete; with zero-fallback probes, uncertainty-on deltas are again positive in baseline+stress.
  • Phase 4 kickoff on Lambda A100/H100 is complete for Track C loops; both hardware classes preserve 100% internal success and show the same external latency amplification pattern.
  • Phase 4 full Lambda reruns are now complete (A100 + H100): Phase 2 cold/hot, routing matrix, and C2 runtime-native calibrated sets are locked with raw artifacts.
  • Paper-grade package is now generated from the canonical G5 + Lambda A100 + Lambda H100 sets (benchmarks/paper_package/latest).
  • Paper package now includes manuscript-ready assets (manuscript/captions.md, manuscript/claims.md, manuscript/figure_manifest.json, mermaid figure specs).
  • Internet multi-hop commercial routing matrices are now available (Fly.io controller/tool hops to OpenAI + OpenRouter).
  • Track B commercial control set has now been rerun with fairness-hardened harness controls (interleaved order + deterministic defaults + token normalization + strict tool parity on tool tasks), narrowing claim scope to task-family-stratified statements.
  • AWS G5 speedpass validation is now complete for the new kernel/cold pass: disabling per-tensor upload sync delivers the measurable gain (~1.03x cold full, ~1.01x warm mean, ~1.03x warm p99), while initial cublasLt was near-parity on warm/full and did not improve TTFT.
  • AWS G5 TTFT-focused kernel pass is now complete: softmax-only was near-parity, then row-parallel norm kernels (rmsnorm/layernorm) delivered a clear lift (~1.20x cold TTFT, ~1.18x warm mean in best lt1_sync0 config).
  • AWS G5 TTFT follow-up is now complete: seq_q=1 tiny attention kernels plus direct K/V cache writes further improved TTFT/warm path and materially reduced Bart TTFT (16.573 -> 12.842 ms).
  • Week-3 parity is now fully locked on AWS after two fixes: parser handles interleaved stderr/stdout runtime logs, and a rebuilt parity container (qwen+bart+minilm) removed invalid minilm offsets so strict external-HF parity passes.
  • Runtime now has a strict attention backend selector (TRENI_ATTN_BACKEND) plus an A/B harness (custom vs cudnn_sdpa proxy) and Phase 3 now supports file-backed realistic_v1 fixtures to reduce synthetic benchmark bias.
  • AWS G5 attention backend A/B rerun with reversed call order confirms near-parity between custom and cudnn_sdpa proxy paths; earlier large cold delta was call-order/cache bias.
  • Phase 3 realistic-v1 reruns are now complete (baseline+stress, 3 seeds each) with strong internal-loop advantage preserved; realistic-v1 uncertainty ablation baseline+stress pair is also published.
  • Attention runtime now caches backend env config once per process and includes a seq1 hybrid tuning matrix (custom, qk-cublas, pv-cublas, both-cublas) with warm/cold tradeoff data.
  • AWS G5 seq1 fused-softmax/PV follow-up is now complete: default custom request path improved again on warm and cold (seq1_hybrid_fused_20260222T192656Z).
  • H100 fused cuDNN SDPA probe pack is now published; current backend descriptor path still yields no viable fused SDPA engine configs (cudnn_sdpa_h100_probe_20260222T1935Z).
  • True fused cuDNN frontend SDPA path is now integrated and validated on G5 (attn_backend_ab_frontend_20260222T220111Z): warm path is near parity on fixed warmed shapes, but cold/mixed still regress due to expensive frontend plan-build misses.
  • Fused frontend profiling now quantifies miss root cause (cudnn_frontend_profile_probe_20260222T2204Z): plan-build misses are ~705 ms each on A10G, while pack/execute/unpack costs are negligible.
  • Frontend A/B harness now hard-fails contamination when fused marker is absent or runtime was compiled with TRENI_WITH_CUDNN=0.
  • Frontend repeatability matrix is now complete (attn_backend_frontend_matrix_20260222T221948Z, repeats=3 for warm_fixed + mixed_churn): custom wins all tracked metrics (3/3 per metric) in both profiles.
  • Frontend claim-strength report is now published (attn_backend_frontend_claim_report_20260222T222958Z) with paired delta CI95 summaries for each latency metric/profile.
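Paired delta CI95 summaries of the kind referenced above pair each metric sample across arms and interval-estimate the mean difference. A minimal sketch using the large-sample normal approximation (a claim-strength report would typically use a t critical value at small n); the sample values are hypothetical:

```python
from math import sqrt
from statistics import mean, stdev

def paired_delta_ci95(a: list, b: list):
    """Mean paired delta (a - b) with a normal-approximation 95% CI.

    1.96 is the large-sample z critical value; at small n a t value
    would widen the interval slightly.
    """
    if len(a) != len(b):
        raise ValueError("arms must have equal sample counts")
    deltas = [x - y for x, y in zip(a, b)]
    m = mean(deltas)
    half = 1.96 * stdev(deltas) / sqrt(len(deltas))
    return m, m - half, m + half

# Hypothetical paired request-mean samples (ms): custom vs fused arms.
custom = [19.1, 19.4, 19.3, 19.5, 19.2]
fused = [21.6, 21.4, 21.5, 21.3, 21.7]
d, lo, hi = paired_delta_ci95(custom, fused)
# Entire CI below zero -> custom faster with ~95% confidence.
```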
  • Grouped commercial root-cause report is now published (commercial_gap_root_cause_20260222T222958Z) and indicates current fairness splits are still parity/noise dominated at present sample sizes.
  • Fused frontend miss tracing is now explicit (TRENI_ATTN_CUDNN_FRONTEND_TRACE_MISSES) and confirms misses are concentrated in decode-step seq_q=1 shape growth (seq_kv=2..10 in probe).
  • Startup multi-prompt preload mitigation (TRENI_HTTP_PRELOAD_PROMPTS) is now benchmarked on G5 and materially reduces mixed-churn/full-latency spikes for fused frontend while keeping custom faster overall.
  • Hybrid shape-gated frontend policy is now validated on G5 (2026-02-23): startup prebuild overhead drops from ~7.0 s to ~2.0 s while no-preload fused TTFT/full remain low and strict inference-valid on the fixed harness profile; bounded-gate follow-up removes broader-shape miss cascades by routing out-of-window shapes to custom.
  • Coverage-instrumented frontend reruns are now published (2026-02-23): runtime exports per-request attention backend counters/shares, and high fused-coverage profiles show current fused path is still slower than custom on both warm and cold request paths.
  • Execution decision (2026-02-23): park cuDNN/frontend optimization and prioritize custom-kernel best-path work.
  • Custom lane implementation update (2026-02-23): added seq1 microfused attention path (TRENI_ATTN_SEQ1_USE_MICROFUSED) plus cached cuBLAS stream binding.
  • G5 seq1 microfused A/B (2026-02-23, qwen+bart, max_kv=64 and 16) shows no net win vs custom baseline; warm mean/TTFT regress while only isolated bart p99 improves in one profile. Path remains opt-in and defaults off.
    • summary artifact: benchmarks/phase2_runtime/seq1_microfused_ab/seq1_microfused_ab_summary_20260223T014848Z.md.
  • G5 stream-cache A/B (2026-02-23, qwen+bart) for TRENI_LINEAR_STREAM_CACHE + TRENI_ATTN_STREAM_CACHE is near-neutral in short runs; keep enabled by default and focus on higher-impact kernel/cold-path work.
    • summary artifact: benchmarks/phase2_runtime/results/stream_cache_ab_summary_20260223T015222Z.md.
  • G5 registry/model-index hash A/B (2026-02-23, qwen profile) for TRENI_REGISTRY_LOOKUP_HASH + TRENI_MODEL_INDEX_NAME_HASH showed no meaningful cold/setup improvement in this run set; kept as opt-in and defaults off.
    • summary artifact: benchmarks/phase2_runtime/results/registry_hash_ab_summary_20260223T020353Z.md.
  • Cold-start harness fix (2026-02-23): startup health polling moved to 50ms cadence, removing prior ~1s quantization from startup_to_healthy_ms.
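The harness fix above matters because polling cadence sets a floor on measurement resolution: a 1 s cadence quantizes startup_to_healthy_ms into ~1 s buckets, while 50 ms bounds the error to ~50 ms. A minimal sketch (the `probe` callable is a stand-in for the harness's actual HTTP health check):

```python
import time

def wait_healthy(probe, poll_s: float = 0.05, timeout_s: float = 30.0) -> float:
    """Poll `probe()` (True when healthy) and return elapsed seconds.

    The measured startup time can overshoot the true value by up to one
    poll interval, so a finer cadence gives a higher-fidelity number.
    """
    start = time.monotonic()
    deadline = start + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return time.monotonic() - start
        time.sleep(poll_s)
    raise TimeoutError(f"service not healthy within {timeout_s}s")
```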
  • Startup-smoke A/B with high-fidelity polling (startup_smoke_ab_hf_20260223T030059Z) shows skipping startup smoke is a material cold win:
    • startup-to-healthy: 488.027 -> 404.184 ms (-17.18%)
    • start-to-first-response (startup + first full): 705.454 -> 622.167 ms (-11.81%)
    • request-path TTFT/full are near-flat (expected; this is startup-stage, not decoder-step optimization).
    • runtime default now matches this policy (TRENI_SKIP_STARTUP_SMOKE=1 unless explicitly set false).
  • Additional custom-cold knob probes (TRENI_TENSOR_ENV_CACHE, TRENI_TENSOR_H2D_CHUNK_MB, TRENI_TENSOR_HOST_REGISTER) were run on G5 and were near-neutral on this profile; no new canonical promotion from those knobs.
    • consolidated artifact: benchmarks/phase2_runtime/results/cold_path_knob_probe_20260223T0303Z.md.
  • Per-tensor upload hotspot profiling (TRENI_TENSOR_UPLOAD_TOPK) is now wired into runtime and first qwen cold probe shows model.embed_tokens.weight as the dominant cold upload stage contributor (~79.3 ms, ~63.8% share in that probe).
  • Container-level readahead hint (TRENI_CONTAINER_WILLNEED) is now benchmarked on G5 and shows a modest, repeatable cold-total improvement in 8-run A/B (~-1.94% start-to-first-response).
  • Runtime default now enables this readahead hint (TRENI_CONTAINER_WILLNEED=1 unless explicitly disabled).
  • Combined readahead + host-register (TRENI_CONTAINER_WILLNEED=1, TRENI_TENSOR_HOST_REGISTER=1) did not add clear gain beyond readahead-only profile on current G5 runs.
    • consolidated artifact: benchmarks/phase2_runtime/results/cold_upload_hotspot_summary_20260223T1915Z.md.
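A container-level readahead hint of this kind typically asks the kernel to start prefetching file pages before the first read. A minimal POSIX-only sketch of what such a WILLNEED hint looks like; this is illustrative, not the runtime's actual TRENI_CONTAINER_WILLNEED implementation:

```python
import os

def hint_willneed(path: str) -> None:
    """Advise the kernel to begin readahead for an entire file.

    posix_fadvise with POSIX_FADV_WILLNEED is advisory: the kernel may
    start asynchronous readahead, overlapping disk I/O with the rest
    of cold startup. Not available on Windows.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, length=0 means "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```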
  • Staged H2D upload follow-up (TRENI_TENSOR_H2D_STAGING) is now complete on G5:
    • min64/chunk32 (8-run A/B) regressed full latency +21.22% and decoder_tensor_h2d +38.68%.
    • min64/chunk128 (3-run probe) regressed further (full +44.43%, decoder_tensor_h2d +76.92%).
    • decision: park staged-H2D path for now; keep it opt-in/default-off and continue cold-path work on non-staging custom upload/H2D.
    • consolidated artifact: benchmarks/phase2_runtime/results/h2d_staging_followup_summary_20260224T101324Z.md.
  • Non-staging H2D chunk matrix (TRENI_TENSOR_H2D_CHUNK_MB=0/64/128, 8 runs each) is now complete on G5 and was near-neutral across request and upload metrics.
    • decision: keep current chunk default policy; prioritize structural upload/H2D and decoder_step0_layers work.
    • consolidated artifact: benchmarks/phase2_runtime/results/h2d_chunk_matrix_summary_20260224T101730Z.md.
  • Host page-touch pre-fault A/B (TRENI_TENSOR_HOST_TOUCH=1, TRENI_TENSOR_HOST_TOUCH_MIN_MB=256, 8 runs) is now complete on G5.
    • decoder_tensor_h2d improved (-31.13 ms) but prefetch/upload increased, causing net request regression (full +7.73%, infer +8.22%).
    • decision: keep host-touch path opt-in/default-off and continue cold-path work on non-regressing upload changes.
    • consolidated artifact: benchmarks/phase2_runtime/results/host_touch_ab_summary_20260224T102444Z.md.
  • Upload sync probe (TRENI_TENSOR_UPLOAD_SYNC=0/1, 3 runs each) now quantifies upload composition under synchronized timing.
    • conversion rises to ~6 ms with sync enabled, but H2D remains ~118 ms and dominant.
    • decision: keep transfer-path optimization as the primary cold-upload focus.
    • consolidated artifact: benchmarks/phase2_runtime/results/upload_sync_probe_summary_20260224T102618Z.md.
  • Synchronized host-register probe (TRENI_TENSOR_HOST_REGISTER=0/1, with TRENI_TENSOR_UPLOAD_SYNC=1) is now complete.
    • transfer-stage metrics stayed effectively flat and request path slightly regressed.
    • decision: deprioritize host-register optimization lane for current cold-upload work.
    • consolidated artifact: benchmarks/phase2_runtime/results/host_register_sync_probe_summary_20260224T102915Z.md.
  • Decoder logits u16 A/B (TRENI_DECODER_LOGITS_U16_PATH=0/1) is now complete with valid inference in both arms.
    • upload/setup moved slightly in the right direction, but request-path metrics regressed materially (TTFT, infer, full).
    • fix2 pilot follow-up after mixed-precision path adjustment still regressed request path, confirming the same direction.
    • decision: keep logits-u16 path opt-in/default-off and park for now.
    • consolidated artifact: benchmarks/phase2_runtime/results/logits_u16_ab_fix1_summary_20260224T105532Z.md.
  • Tensor-cache hash A/B (TRENI_TENSOR_CACHE_HASH=0/1) is now complete (mixed + warm 3-seed follow-up).
    • warm 3-seed request deltas are near-neutral, with slight p99 regression when enabled (+0.149 ms).
    • decision: keep tensor-cache hash path opt-in/default-off.
    • artifacts:
      • benchmarks/phase2_runtime/results/tensor_cache_hash_ab_20260224T113911Z/
      • benchmarks/phase2_runtime/results/tensor_cache_hash_warm3_20260224T114126Z/
  • Sampler direct-store A/B (TRENI_SAMPLE_DIRECT_STORE=0/1) is now complete (3-seed warm).
    • enabled path regressed warm request metrics (mean +0.062 ms, p95 +0.076 ms, p99 +0.143 ms).
    • decision: keep sampler direct-store opt-in/default-off.
    • artifact: benchmarks/phase2_runtime/results/sample_direct_store_ab_20260224T114633Z/.
  • Decoder direct-out residual A/B (TRENI_DECODER_DIRECT_OUT_HIDDEN=0/1) is now complete (3-seed warm).
    • enabled path regressed warm request and infer metrics (mean +0.540 ms, p95 +0.495 ms, p99 +0.444 ms, infer +0.150 ms).
    • decision at that time: keep decoder direct-out path opt-in/default-off.
    • superseded for current full-depth lane by 2026-02-27 late-cycle rerun (direct-out promoted default-on there).
    • artifact: benchmarks/phase2_runtime/results/direct_outhidden_ab_20260224T115051Z/.
  • Consolidated summary artifact for these custom-path probes:
    • benchmarks/phase2_runtime/results/custom_path_probe_summary_20260224T115602Z.md.
  • Multi-head seq1 attention A/B (TRENI_ATTN_SEQ1_USE_MULTIHEAD=0/1) is now complete and directionally strong.
    • qwen warm (3-seed): request mean 1.041x, p99 1.042x, infer 1.074x.
    • qwen mixed (3-seed): request mean 1.036x, p99 1.045x, infer 1.074x, cold wall 1.010x.
    • bart warm (3-seed): request mean 1.097x, p99 1.112x, TTFT 1.429x, infer 1.185x.
    • default sanity run (no env override) remains faster than forced-off.
    • decision: promote this path to default-on (TRENI_ATTN_SEQ1_USE_MULTIHEAD=1, TRENI_ATTN_SEQ1_MULTIHEAD_MAX_KV=2048) while retaining off-switch fallback.
    • artifacts:
      • benchmarks/phase2_runtime/results/seq1_multihead_ab_20260224T125127Z/
      • benchmarks/phase2_runtime/results/seq1_multihead_bart_ab_20260224T125404Z/
      • benchmarks/phase2_runtime/results/seq1_multihead_step0_probe_20260224T125508Z/
      • benchmarks/phase2_runtime/results/seq1_multihead_default_sanity_20260224T125713Z/
      • benchmarks/phase2_runtime/results/seq1_multihead_ab_summary_20260224T125619Z.md
  • External-cold repeatability rerun after seq1 multi-head default promotion (2026-02-24, 3 runs, runtime + PyTorch + vLLM) is now complete.
    • runtime means: startup 1003.315 ms, TTFT 4.022 ms, request full 239.277 ms, cold-total first response 1242.592 ms.
    • runtime-normalized ratios: PyTorch 127.900x TTFT / 9.378x full / 6.320x cold-total; vLLM 12.350x TTFT / 4.139x full / 19.333x cold-total.
    • runtime delta vs prior host-prefetch repeatability means (2026-02-19): TTFT 5.130 -> 4.022 ms, full 316.403 -> 239.277 ms, cold-total 1320.240 -> 1242.592 ms.
    • note: Ollama was skipped for this rerun because service/model were not installed on this host environment.
    • artifact: benchmarks/phase2_external_cold/results/external_cold_seq1mh_default_repeatability_20260224T192020Z.md.
  • Step0 optimization follow-up (2026-02-24): seq1 multi-head softmax/PV now reuses normalized probabilities and avoids repeated exp in the inner PV accumulation loop.
    • 3-run external-cold repeatability (runtime + PyTorch + vLLM) runtime deltas vs seq1mh baseline:
      • TTFT 4.022 -> 4.018 ms
      • request full 239.277 -> 238.400 ms
      • cold-total first response 1242.592 -> 1241.688 ms
    • interpretation: positive but small gain; further decoder_step0_layers work is still required for material uplift.
    • artifact: benchmarks/phase2_external_cold/results/external_cold_step0expfix_repeatability_20260224T194226Z.md.
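The exp-reuse idea above can be sketched as follows: for a seq_q=1 step, compute the normalized softmax probabilities once and reuse them in the PV accumulation, instead of re-evaluating exp(score) inside the inner loop. This is an illustrative pure-Python model of the optimization, not the runtime's kernel:

```python
from math import exp

def seq1_attention_pv(scores, values):
    """seq_q=1 attention output: softmax(scores) @ V.

    Probabilities are materialized once (with standard max-subtraction
    for numerical stability) and reused across every output dimension
    of the PV accumulation, so exp() runs once per key, not once per
    (key, dim) pair.
    """
    m = max(scores)
    exps = [exp(s - m) for s in scores]      # one exp per key
    inv_total = 1.0 / sum(exps)
    probs = [e * inv_total for e in exps]    # reused below
    dim = len(values[0])
    out = [0.0] * dim
    for p, v in zip(probs, values):          # PV accumulation
        for j in range(dim):
            out[j] += p * v[j]
    return out
```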
  • Step0 shared-probability follow-up (2026-02-24) was run and compared against exp-reuse baseline:
    • runtime deltas vs exp-reuse means: TTFT +0.001 ms, request full +0.278 ms, cold-total first response +0.282 ms.
    • decision: revert this follow-up and keep exp-reuse as current best state.
    • artifact: benchmarks/phase2_external_cold/results/external_cold_step0shared_repeatability_20260224T194913Z.md.
  • Decode-stage and uncertainty update (2026-02-25):
    • first non-step0 profile artifact: benchmarks/phase2_external_cold/results/external_cold_stepn_profile_20260225T001334Z.json
    • key stage means (qwen, 64 tokens, no preload):
      • decoder_stepN_logits_sample_mean=2.671 ms
      • decoder_stepN_layers_mean=1.360 ms
    • uncertainty A/B artifacts:
      • benchmarks/phase2_external_cold/results/external_cold_uncert_on_20260225T001702Z.json
      • benchmarks/phase2_external_cold/results/external_cold_uncert_off_20260225T001704Z.json
    • uncertainty A/B deltas (on -> off):
      • request full 479.889 -> 473.367 ms
      • infer 461.771 -> 454.878 ms
      • decoder_stepN_logits_sample_mean 2.671 -> 2.562 ms
    • interpretation: uncertainty overhead exists but is not the primary decode bottleneck in this profile.
  • Runtime-vLLM cold rerun (2026-02-25, same profile, uncertainty-off runtime):
    • artifact: benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_uncertoff_20260225T001929Z.json
    • runtime: TTFT 3.929 ms, full 472.724 ms, cold-total full 1476.116 ms
    • vLLM: TTFT 49.577 ms, full 1311.481 ms, cold-total full 24344.013 ms
    • interpretation: runtime remains decisively ahead in this cold-first-hit comparison.
  • Decode stepN logits split + immediate kernel probes (2026-02-25, qwen, 64 tokens, no preload):
    • split artifacts:
      • benchmarks/phase2_external_cold/results/external_cold_stepn_split_20260225T081450Z.json
      • benchmarks/phase2_external_cold/results/external_cold_stepn_split_revert_20260225T082055Z.json
    • split result:
      • decoder_stepN_logits_proj_mean=2.458 ms
      • decoder_stepN_sample_mean=0.106 ms
      • conclusion: decode hotspot is logits projection, not sampling.
    • probe A/Bs (all near-neutral; no sustained gain):
      • lt16 path (external_cold_stepn_lt16_off/on_20260225T081717Z/081718Z)
      • fast16/tensor-op GEMMEx (external_cold_stepn_split_fast16_20260225T082158Z)
      • direct-u16-input probe (external_cold_stepn_u16direct_off/on_20260225T082445Z/082447Z)
      • lt_u16 workspace probe (external_cold_stepn_ltu16ws_off/on_20260225T082735Z/082737Z)
    • decision: all four experimental lanes reverted; baseline path remains canonical while next optimization focuses on deeper logits-projection architecture changes.
  • Full-depth qwen rerun + runtime-vLLM comparison (2026-02-25, --layers 36, --pool-mb 16384, no preload):
    • runtime-only profile artifact:
      • benchmarks/phase2_external_cold/results/external_cold_stepn_split_layers36_pool16g_20260225T083216Z.json
    • runtime-vLLM artifact:
      • benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_layers36_pool16g_20260225T083306Z.json
    • full-depth runtime stage means (profiled run):
      • decoder_stepN_layers_mean=24.306 ms
      • decoder_stepN_logits_proj_mean=2.458 ms
      • decoder_stepN_total_mean=26.875 ms
    • runtime-vLLM request-path comparison (non-profiled run):
      • runtime: TTFT 26.775 ms, full 2983.780 ms, cold-total full 3987.092 ms
      • vLLM: TTFT 49.998 ms, full 1315.478 ms, cold-total full 24346.938 ms
    • interpretation: full-depth runtime still wins TTFT and cold-total, but currently loses first-request full latency to vLLM in this configuration.
  • Full-depth preload follow-up (2026-02-25, --layers 36, --pool-mb 16384, preload on):
    • artifacts:
      • benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_layers36_pool16g_preload_20260225T150209Z.json
      • benchmarks/phase2_external_cold/results/external_cold_runtime_vllm_layers36_pool16g_preload64_20260225T150410Z.json
    • key result: preload converts request cache deltas from misses to hits (cache_hit_delta=434, cache_miss_delta=0) and drops runtime full latency to ~2135 ms, but still above vLLM request full (~1263-1280 ms in these runs).
    • implication: remaining gap is layer/decode compute, not upload misses.
  • Full-depth hybrid/path probes (2026-02-25) did not improve the layer-compute gap:
    • seq1 hybrid matrix (default vs qk vs pv vs both) artifacts:
      • external_cold_layers36_hybrid_default_20260225T150806Z.json
      • external_cold_layers36_hybrid_qk_20260225T150811Z.json
      • external_cold_layers36_hybrid_pv_20260225T150816Z.json
      • external_cold_layers36_hybrid_both_20260225T150821Z.json
    • result: default custom path remained best (infer ~2113 ms); all hybrid variants regressed (~2459-2556 ms).
    • direct-u16-input full-depth A/B (external_cold_layers36_preload_a2_u16direct_off/on_20260225T150710Z/150715Z) was near-neutral/regressed and was reverted.
  • Full-depth FFN u16 path follow-up (2026-02-25) is now complete:
    • artifacts:
      • benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab2_base_20260225T1628Z.json
      • benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab2_ffnu16_20260225T1628Z.json
    • runtime deltas (ffnu16 - base):
      • TTFT 26.872 -> 18.077 ms (-8.795 ms)
      • request full 2148.336 -> 1820.345 ms (-327.991 ms)
      • cold-total full 6155.513 -> 4826.635 ms (-1328.878 ms)
    • vLLM request full in matched runs: 1300.232 / 1317.144 ms.
    • interpretation: full-depth gap narrowed substantially, but remains open on request full latency.
  • Full-depth attention/logits u16 expansion (2026-02-25) is now complete (3-seed matrix):
    • artifacts:
      • benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab3_base_s{1,2,3}_20260225T1640Z.json
      • benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab3_attnffnu16_s{1,2,3}_20260225T1640Z.json
      • benchmarks/phase2_external_cold/results/external_cold_layers36_preload64_ab3_attnffnlogitsu16_s{1,2,3}_20260225T1700Z.json
    • mean results:
      • baseline runtime: TTFT=26.863 ms, full=2147.754 ms, cold_full=6154.978 ms
      • ATTN+FFN u16: TTFT=17.080 ms, full=1791.873 ms, cold_full=4797.910 ms
      • ATTN+FFN+LOGITS u16: TTFT=16.104 ms, full=1775.313 ms, cold_full=4780.830 ms
    • runtime/vLLM full-latency ratio improved from 1.653x (baseline) to 1.365x (best), but request full latency still trails vLLM in this full-depth setup.
  • Full-depth decode-input reuse + u16-Lt follow-up (2026-02-25) is now complete:
    • regressing fused gate+up FFN trial was explicitly reverted after measured slowdown.
    • shared decode-input pre-cast reuse for q/k/v + gate/up was implemented and validated.
    • u16 cublasLt cached path (dtype-aware, safe fallback) was implemented and validated.
    • new 3-seed means (precastreuse+u16lt):
      • runtime TTFT=15.522 ms, full=1729.351 ms, cold_full=4735.345 ms
      • delta vs prior best (ATTN+FFN+LOGITS u16): TTFT -0.582 ms, full -45.962 ms, cold_full -45.485 ms
    • runtime/vLLM full-latency ratio improved further to ~1.323x in latest matched 3-seed set.
  • FAST_16 compute-mode follow-up (2026-02-25) was tested on top of u16-Lt:
    • strict Week 3 parity remained clean (checked=3, failed=0).
    • request-full changes were small and one repeatability run showed a large startup outlier on both runtime and vLLM.
    • decision: do not promote FAST_16 as canonical yet; keep non-fast compute on the stable u16-Lt lane.
  • Residual-fused u16-Lt follow-up (2026-02-26) is now complete:
    • implemented u16 no-bias residual-accumulate path for decoder o_proj and ffn_down (Lt fused when available, safe fallback otherwise).
    • strict Week 3 parity remained clean (checked=3, failed=0).
    • new 3-seed means (residfuse+u16lt):
      • runtime TTFT=15.400 ms, full=1719.302 ms, cold_full=4725.923 ms
      • delta vs prior precastreuse+u16lt: TTFT -0.122 ms, full -10.049 ms, cold_full -9.422 ms
    • profiler corroboration: decoder_step_profile_o_proj_resid_mean and decoder_step_profile_ffn_down_resid_mean dropped, and decoder_stepN_layers_mean moved down accordingly.
  • cuBLASLt workspace probe (TRENI_LINEAR_LT_WORKSPACE_MB) was run and rejected for this lane (2026-02-26):
    • trial artifacts:
      • benchmarks/phase2_external_cold/results/external_cold_layers36_trial_ltws0_20260226T105356Z.json
      • benchmarks/phase2_external_cold/results/external_cold_layers36_trial_ltws32_20260226T105401Z.json
    • request full regressed with workspace enabled (1711.213 -> 1754.568 ms), so no promotion.
  • Full-depth FFN activation-to-u16 fused follow-up (TRENI_DECODER_FFN_ACT_U16_FUSED) is now complete and promoted default-on (2026-02-26):
    • runtime-only 3-seed A/B:
      • off: TTFT=15.333 ms, full=1715.700 ms, cold_full=4721.653 ms
      • on: TTFT=15.193 ms, full=1704.987 ms, cold_full=4710.958 ms
      • delta (on-off): TTFT -0.140 ms, full -10.713 ms, cold_full -10.696 ms
    • runtime-vLLM 3-seed A/B (same host/window):
      • off ratio: runtime/vLLM full = 1.3208x
      • on ratio: runtime/vLLM full = 1.3012x
    • strict parity remained clean in explicit-on and default-on runs:
      • benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_ffnactu16_20260226T1100.json
      • benchmarks/phase2_runtime/results/aws_speedpass/week3_parity_report_ffnactu16_default_20260226T1108.json

Timeline

Latest Key Numbers

Warm Path (G5)

  • Warm steady-state request mean: ~80.8 ms
  • Warm steady-state p99: ~89.6 ms

Frontend Coverage-Instrumented Reruns (G5, 2026-02-23)

  • Matrix with bounded hybrid gate (attn_backend_frontend_matrix_20260223T011158Z):
    • warm fixed fused share: ~0.030303 (custom handles ~0.969697 of calls)
    • warm fixed TTFT: custom 4.194 ms vs fused-profile 4.269 ms
    • mixed warm mean: custom 48.302 ms vs fused-profile 47.592 ms (near parity in that bounded-coverage setting)
  • High fused-coverage warm profile (fused_coverage_profiles_20260223T011504Z):
    • fused share ~0.878788
    • request mean: custom 20.292 ms vs fused 22.310 ms (~1.099x slower on fused)
    • TTFT: custom 4.196 ms vs fused 4.496 ms
  • High fused-coverage cold profile (fused_coverage_cold_profiles_20260223T011534Z):
    • fused share ~0.9
    • cold TTFT: custom 4.215 ms vs fused 704.176 ms
    • cold full: custom 246.306 ms vs fused 6595.157 ms
  • Plain interpretation:
    • bounded gating avoids most regressions by keeping the majority of calls on custom.
    • when fused is exercised heavily, current frontend implementation still loses; dynamic shape-plan reuse/coverage remains the blocker.
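The bounded gating described above can be sketched as a shape router: only shapes with a prebuilt fused plan are sent to the fused frontend, and everything else stays on custom kernels so out-of-window shapes never trigger a ~705 ms plan-build miss. The function name and policy shape here are illustrative, not the runtime's actual gate:

```python
def pick_attention_backend(seq_q: int, seq_kv: int,
                           prebuilt_kv=range(2, 11)) -> str:
    """Bounded shape gate for attention backend selection.

    The window (seq_q == 1, seq_kv in 2..10) mirrors the decode-step
    miss concentration seen in the trace probe; any shape outside it
    is routed to the custom kernels, avoiding plan-build miss cascades.
    """
    if seq_q == 1 and seq_kv in prebuilt_kv:
        return "cudnn_frontend_fused"
    return "custom"
```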

AWS G5 TTFT Kernel Pass (2026-02-22, lt0_sync0 baseline -> best norm+softmax+lt1_sync0)

  • Cold TTFT: 16.738 ms -> 13.974 ms (1.198x faster).
  • Cold full latency: 424.685 ms -> 396.814 ms (1.070x faster).
  • Warm mean latency: 174.237 ms -> 147.269 ms (1.183x faster).
  • Warm p99 latency: 1035.823 ms -> 936.297 ms (1.106x faster).
  • Per-model cold TTFT deltas:
    • qwen: 39.537 -> 29.411 ms (largest gain).
    • donut: 3.505 -> 2.619 ms.
    • bart: near-flat (16.523 -> 16.573 ms), remaining hotspot to isolate.

AWS G5 TTFT Follow-Up (2026-02-22, best norm+softmax+lt1_sync0 -> default seq_q=1 tiny-kernel path)

  • Cold TTFT: 13.974 ms -> 12.504 ms (1.118x faster).
  • Cold full latency: 396.814 ms -> 390.099 ms (1.017x faster).
  • Warm mean latency: 147.269 ms -> 143.230 ms (1.028x faster).
  • Warm p99 latency: 936.297 ms -> 924.276 ms (1.013x faster).
  • Bart cold TTFT: 16.573 ms -> 12.842 ms (1.290x faster).
  • Profiling signal (TRENI_STEP0_PROFILE=1) showed Bart step0 dominated by decoder_step0_layers, which motivated this path.
  • 3-seed repeatability on the new default path:
    • cold TTFT 12.563 ± 0.037 ms
    • cold full 390.961 ± 0.270 ms
    • warm mean 143.297 ± 0.222 ms
    • warm p99 925.668 ± 1.070 ms
  • Week-3 parity status update (strict trace rerun):
    • parser fix correctly classifies fallback/failure markers in interleaved runtime logs.
    • debug rerun found old-container root cause: out-of-bounds embeddings.word_embeddings.weight offset for minilm in monolith_phase3.bin.
    • rebuilt parity container (monolith_phase3_qbm.bin, qwen+bart+minilm) now passes strict external-HF parity: checked=3, failed=0, missing=0.
    • runtime on/off Bart step0 logits A/B stayed numerically aligned (max abs diff ~2e-6, cosine ~1.0).
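
The on/off logits alignment quoted above reduces to two standard numeric comparisons; a pure-Python sketch of those metrics (illustrative of the A/B check, not the actual parity harness):

```python
import math

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two logit vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two logit vectors (~1.0 means aligned)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

on_logits = [1.00, -2.50, 0.30]
off_logits = [1.000002, -2.500001, 0.300002]
assert max_abs_diff(on_logits, off_logits) < 1e-5
assert cosine(on_logits, off_logits) > 0.999999
```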

AWS G5 Attention Backend A/B (2026-02-22, deconfounded)

  • First-order run (attn_backend_ab_20260222T143605Z) showed large cold infer/full gap caused by run order (custom executed first after build/startup).
  • Reverse-order rerun (attn_backend_ab_rev_20260222T144736Z) removed that bias:
    • cold TTFT: 6.460 ms custom vs 6.447 ms cudnn proxy (1.002x custom/cudnn).
    • cold full: 147.789 ms custom vs 146.707 ms cudnn proxy (1.007x).
    • warm mean: 53.545 ms custom vs 53.341 ms cudnn proxy (1.004x).
    • warm p99: 82.031 ms custom vs 80.754 ms cudnn proxy (1.016x).
  • Interpretation:
    • the legacy cudnn_sdpa proxy path is at near parity with custom, faster only by a small margin.
    • runtime now keeps proxy behavior explicit opt-in (TRENI_ATTN_ALLOW_SDPA_PROXY=1).
    • this section is proxy-only; true fused frontend results are reported in the next section.
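
The deconfounding step above corrects for run-order bias by rerunning with arm order reversed; a generalized sketch is to alternate which arm runs first across repeats so neither arm systematically absorbs post-startup warmup cost (illustrative scheduling logic, not the actual benchmark harness):

```python
def order_schedule(arms, repeats):
    """Alternate arm order per repeat so each arm takes the cold first
    slot equally often, removing the run-order bias described above."""
    schedule = []
    for i in range(repeats):
        order = list(arms) if i % 2 == 0 else list(reversed(arms))
        schedule.append(order)
    return schedule

sched = order_schedule(["custom", "cudnn_proxy"], 4)
firsts = [run[0] for run in sched]
assert firsts.count("custom") == firsts.count("cudnn_proxy") == 2
```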

AWS G5 True Fused cuDNN Frontend A/B (2026-02-22, fixed qwen, warmed query set)

  • Artifact: attn_backend_ab_frontend_20260222T220111Z.
  • Warm (http_warmup_runs=8, http_runs=8, --http-model qwen):
    • request mean: custom 19.324 ms vs fused 21.503 ms (custom/fused=0.899)
    • request p99: custom 22.087 ms vs fused 24.875 ms (custom/fused=0.888)
    • infer mean: custom 18.803 ms vs fused 20.976 ms (custom/fused=0.896)
    • TTFT mean: custom 4.199 ms vs fused 4.498 ms (custom/fused=0.934)
  • Cold first hit:
    • TTFT: custom 4.220 ms vs fused 710.641 ms
    • full latency: custom 250.929 ms vs fused 6610.148 ms
  • Profile probe (TRENI_ATTN_CUDNN_FRONTEND_PROFILE=1) showed:
    • avg_build_ms_per_miss ~= 704.8 ms
    • avg_pack_ms ~= 0.010 ms
    • avg_exec_ms ~= 0.021-0.048 ms
    • avg_unpack_ms ~= 0.005 ms
  • Interpretation:
    • fused path is active and measurable.
    • warm path is close to custom when shapes are warmed.
    • cold/mixed penalty is dominated by shape-plan miss compilation, not kernel execution.
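
The profile probe explains the cold penalty mechanically: each unseen shape pays one expensive plan build (~704.8 ms) while pack/exec/unpack stay in the tens of microseconds. A toy shape-keyed plan cache captures the dynamic (names and the build callable are illustrative, not the runtime API):

```python
class PlanCache:
    """Shape-keyed execution-plan cache: each unseen
    (seq_q, seq_kv, head_dim) pays build_fn once; later hits are
    nearly free, which is why warmed fused traffic is close to custom
    while cold/mixed traffic is dominated by build misses."""
    def __init__(self, build_fn):
        self.build_fn = build_fn
        self.plans = {}
        self.misses = 0

    def get(self, seq_q, seq_kv, head_dim):
        key = (seq_q, seq_kv, head_dim)
        if key not in self.plans:
            self.misses += 1
            # In the real frontend this compile costs ~700 ms per miss.
            self.plans[key] = self.build_fn(key)
        return self.plans[key]

cache = PlanCache(build_fn=lambda key: f"plan{key}")
cache.get(1, 10, 128)  # miss: pays the build cost
cache.get(1, 10, 128)  # hit: ~free
assert cache.misses == 1
```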

AWS G5 Frontend Repeatability Matrix (2026-02-22, repeats=3)

  • Artifact: attn_backend_frontend_matrix_20260222T221948Z.
  • Profiles:
    • warm_fixed (http_warmup_runs=8)
    • mixed_churn (http_warmup_runs=0)
  • Warm-fixed aggregate:
    • custom request mean 19.271 +/- 0.050 ms vs fused 21.468 +/- 0.018 ms
    • custom infer 18.812 +/- 0.059 ms vs fused 20.984 +/- 0.026 ms
    • custom TTFT 4.198 +/- 0.001 ms vs fused 4.498 +/- 0.001 ms
  • Mixed-churn aggregate:
    • custom request mean 47.864 +/- 0.018 ms vs fused 843.141 +/- 0.735 ms
    • custom infer 47.331 +/- 0.050 ms vs fused 842.542 +/- 0.747 ms
    • custom TTFT 4.197 +/- 0.002 ms vs fused 179.744 +/- 0.263 ms
  • Win counts:
    • custom wins every tracked metric in both profiles (3/3 each metric).
  • Interpretation:
    • this now provides repeatable evidence that current custom path outperforms current fused frontend path under both stable warmed traffic and shape churn.
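
The "a +/- b" figures above are per-profile means with a spread over the 3 repeats; a minimal aggregation sketch (whether the harness uses sample stdev or another spread measure is an assumption; sample stdev is shown):

```python
import statistics

def aggregate(samples_ms):
    """Mean +/- spread over repeat runs, as reported in the matrix
    entries above. Sample stdev is assumed as the spread measure."""
    return statistics.mean(samples_ms), statistics.stdev(samples_ms)

mean, spread = aggregate([19.221, 19.271, 19.321])
assert round(mean, 3) == 19.271
assert spread > 0
```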

AWS G5 Frontend Claim-Strength (2026-02-22)

  • Artifact: attn_backend_frontend_claim_report_20260222T222958Z.
  • Paired-delta CI95 summary (frontend - custom; positive means custom faster):
    • warm-fixed request mean delta: +2.197 ms CI95 [2.125, 2.238].
    • warm-fixed TTFT delta: +0.300 ms CI95 [0.299, 0.301].
    • mixed-churn request mean delta: +795.277 ms CI95 [794.408, 795.747].
    • mixed-churn TTFT delta: +175.546 ms CI95 [175.300, 175.820].
  • Interpretation:
    • effect direction is consistent and large in both profiles.
    • repeat count is still small (n=3/profile), so this should be treated as strong directional evidence with low-variance replication, not final high-N significance.
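
The paired-delta summary above pairs each frontend repeat with its custom counterpart before computing an interval; a sketch of one common construction (the changelog does not state how its CI95 was computed, so the 1.96 * sem normal approximation here is an assumption):

```python
import math
import statistics

def paired_ci95(frontend_ms, custom_ms):
    """Paired-delta mean and a normal-approximation 95% interval
    (delta = frontend - custom; positive means custom is faster)."""
    deltas = [f - c for f, c in zip(frontend_ms, custom_ms)]
    mean = statistics.mean(deltas)
    sem = statistics.stdev(deltas) / math.sqrt(len(deltas))
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)

mean, (lo, hi) = paired_ci95([21.4, 21.5, 21.6], [19.3, 19.3, 19.3])
assert lo < mean < hi
```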

AWS G5 Frontend Miss-Mitigation (Updated Canonical, 2026-02-22)

  • Artifacts:
    • baseline matrix: attn_backend_frontend_matrix_20260222T230445Z (no_preload)
    • candidate matrix: attn_backend_frontend_matrix_20260222T231139Z (startup_preload_benchmark_queries)
    • compare report: attn_backend_frontend_missmit_compare_20260222T231335Z
    • exact cold-prompt probe: preload_exact_prompt_probe_20260222T231050Z.json
  • Mitigation:
    • fixed runtime splitter bug so TRENI_HTTP_PRELOAD_PROMPTS executes all prompts.
    • used preload prompts matched to benchmark cold/warm query set.
  • Mixed-churn improvements (no_preload -> startup_preload_benchmark_queries):
    • fused request mean: 843.242 -> 22.433 ms (37.590x)
    • fused infer mean: 842.684 -> 21.965 ms (38.365x)
    • fused warm TTFT: 179.541 -> 4.497 ms (39.928x)
    • fused cold TTFT: 704.521 -> 4.495 ms (156.723x)
    • fused cold full latency: 6593.495 -> 25.785 ms (255.707x)
  • Exact cold-prompt probe:
    • fused first-hit TTFT: 4.499 ms
    • fused first-hit full latency: 26.090 ms
  • Interpretation:
    • with matched preload coverage, the previous fused cold/mixed miss penalty is removed for this harness.
    • custom still has a small warmed-path lead, but first-hit TTFT no longer regresses on the canonical prompt set.
    • still open: generalize this behavior without curated prompt-list preload.

AWS G5 Frontend Shape-Prebuild Probe (No Preload Prompts, 2026-02-22)

  • Artifacts:
    • cold probe (startup prebuild enabled): prebuild_startup_nopreload_probe_20260222T232932Z.json
    • matrix probe (repeats=1): attn_backend_frontend_matrix_20260222T233003Z
    • compare (no_preload -> shape_prebuild_nopreload): attn_backend_frontend_missmit_compare_20260222T233116Z
  • Mitigation:
    • startup shape prebuild via:
      • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=16
      • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128
    • no prompt preload list used.
  • Key numbers:
    • cold probe startup->healthy: 11017.541 ms
    • cold probe fused TTFT: 5.814 ms
    • cold probe fused full request latency: 255.434 ms
    • mixed-churn fused deltas (no_preload -> shape_prebuild_nopreload):
      • cold TTFT: 704.521 -> 5.805 ms (121.364x)
      • cold full: 6593.495 -> 255.267 ms (25.830x)
      • warm request mean: 843.242 -> 51.482 ms (16.379x)
      • warm TTFT: 179.541 -> 4.824 ms (37.218x)
  • Interpretation:
    • this is the first prompt-independent mitigation that removes fused request-path spikes.
    • current tradeoff is startup compile burst (http_attn_prebuild), so next work is lowering startup overhead while preserving these request-path gains.
  • Follow-up tuning (seq_kv_max: 16 -> 10) artifact: prebuild_startup10_nopreload_probe_20260222T235944Z.json
    • startup->healthy: 11017.541 -> 7011.472 ms (1.571x faster startup)
    • request TTFT: 5.814 -> 5.826 ms (near-identical)
    • request full latency: 255.434 -> 254.936 ms (near-identical)
  • Tuned matrix confirmation:
    • tuned matrix (seq_kv_max=10): attn_backend_frontend_matrix_20260223T000256Z
    • compare vs seq_kv_max=16: attn_backend_frontend_missmit_compare_20260223T000343Z
    • warm-fixed fused request mean: 22.556 -> 22.265 ms
    • mixed fused request mean: 51.482 -> 50.974 ms
  • Lower-range startup probe (seq_kv_max=8): prebuild_startup8_nopreload_probe_20260223T000600Z.json
    • startup->healthy: 6010.381 ms
    • request TTFT: 703.771 ms (regression)
    • request full latency: 1660.576 ms (regression)
    • interpretation: seq_kv_max=8 under-covers this benchmark prompt profile; 10 is the current safe minimum for this tuning.
  • Heuristic probe (TRENI_ATTN_CUDNN_FRONTEND_HEUR_MODE):
    • A and B remained near-identical on startup/build behavior in this path.
    • FALLBACK produced no valid engine configs for the current frontend descriptor on sm86.
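
The startup prebuild walks a small seq_q=1 shape window ahead of traffic; a minimal sketch of the enumeration implied by the TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_* knobs above (that the runtime expands exactly this window is an assumption):

```python
def prebuild_shapes(env):
    """Enumerate seq_q=1 attention shapes to compile at startup, driven
    by the prebuild knobs quoted above. Defaults here are illustrative."""
    min_kv = int(env.get("TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV", "1"))
    max_kv = int(env.get("TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV", "0"))
    head_dim = int(env.get("TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM", "128"))
    return [(1, kv, head_dim) for kv in range(min_kv, max_kv + 1)]

# seq_kv_max=10 covers 10 shapes; lowering it shrinks the startup
# compile burst, which is why 16 -> 10 cut startup->healthy time.
shapes = prebuild_shapes({"TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV": "10"})
assert len(shapes) == 10 and shapes[-1] == (1, 10, 128)
```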

AWS G5 Frontend Hybrid Shape-Gate Follow-Up (2026-02-23)

  • Artifacts:
    • startup probes (3 runs): prebuild_hybrid10_nopreload_probe_r{1,2,3}_20260223T002214Z.json
    • matrix (repeats=3): attn_backend_frontend_matrix_20260223T001959Z
    • compare vs prior tuned no-gate matrix: attn_backend_frontend_missmit_compare_20260223T002153Z
  • Policy:
    • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV=10
    • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV=10
    • TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM=128
    • TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV=10
  • Startup probe aggregate (qwen, no preload prompts, fused frontend):
    • startup->healthy: 2004.840 +/- 0.146 ms
    • request TTFT: 4.955 +/- 0.011 ms
    • request full latency: 242.673 +/- 0.352 ms
  • Delta vs prior tuned no-gate probe (prebuild_startup10_nopreload_probe_20260222T235944Z):
    • startup->healthy: 7011.472 -> 2004.840 ms (3.497x faster)
    • request TTFT: 5.826 -> 4.955 ms (1.176x faster)
    • request full latency: 254.936 -> 242.673 ms (1.051x faster)
  • Matrix deltas vs prior tuned no-gate matrix (attn_backend_frontend_matrix_20260223T000256Z):
    • warm-fixed fused request mean: 22.265 -> 20.354 ms (1.094x faster)
    • mixed fused request mean: 50.974 -> 47.904 ms (1.064x faster)
    • cold fused TTFT: 5.819 -> 4.959 ms (1.173x faster)
    • cold fused full latency: 254.146 -> 242.569 ms (1.048x faster)
  • Broader-shape sanity artifact: hybrid_shape_sanity_20260223T002857Z.json
    • startup->healthy stayed ~2004 ms, inference.used=true for all 5 requests.
    • long-prompt/full-latency regressions were observed as seq1 shapes exceeded the prebuilt window (seq_kv=11..30 miss lines in log head), confirming remaining dynamic-shape work.
  • Bounded-gate follow-up artifact: hybrid_shape_sanity_maxgate_20260223T003453Z.json
    • added TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV=10.
    • no fused miss lines were observed, and the same broader-shape set stayed inference-valid with low TTFT.
    • mean full latency over the 5-shape set dropped from 9974.576 ms to 274.072 ms (36.395x faster).
  • Fixed-profile confirmation after max gate:
    • matrix (repeats=3): attn_backend_frontend_matrix_20260223T003611Z
    • compare vs prior hybrid: attn_backend_frontend_missmit_compare_20260223T003734Z (near-parity fixed-profile deltas).
  • Interpretation:
    • hybrid shape gating materially reduces startup compile cost while preserving low no-preload request-path latency.
    • strict fused runs remain inference-valid with low-shape custom fallback in this harness.
    • bounded max-gate removes broad-shape miss cascades now, but wider fused coverage without fallback still needs dynamic seq1 plan reuse/coverage.
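
The bounded hybrid gate above routes to the fused frontend only inside the prebuilt shape window and otherwise falls back to custom, which is what eliminates miss cascades; a decision-function sketch (mirrors TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV / MAX_SEQ_KV, but the exact gate predicate is an assumption):

```python
def attention_backend(seq_q, seq_kv, min_kv=10, max_kv=10):
    """Bounded hybrid shape gate: fused only when the shape falls inside
    the prebuilt seq_q=1 window, otherwise custom fallback (no rebuild)."""
    if seq_q == 1 and min_kv <= seq_kv <= max_kv:
        return "fused"
    return "custom"

assert attention_backend(1, 10) == "fused"
assert attention_backend(1, 30) == "custom"  # broad shape: no miss cascade
```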

Commercial Fairness Root-Cause Grouping (2026-02-22)

  • Artifact: commercial_gap_root_cause_20260222T222958Z.
  • OpenAI gpt-5.2, model-only (paired_n=36):
    • latency delta mean (external-internal): -69.311 ms, CI95 [-193.985, 61.444] -> parity/noise.
    • external controller overhead mean: 2.081 ms vs model-hop mean 1406.971 ms.
  • OpenAI gpt-5.2, tool-only parity (paired_n=12):
    • latency delta mean: +49.601 ms, CI95 [-162.047, 274.981] -> parity/noise.
    • external controller overhead mean: 12.842 ms vs model-hop mean 2456.108 ms.
  • OpenRouter Sonnet 4.6, model-only (paired_n=24):
    • latency delta mean: +204.883 ms, CI95 [-148.517, 683.114] -> parity/noise.
    • external controller overhead mean: 2.254 ms vs model-hop mean 2220.251 ms.
  • Interpretation:
    • the current commercial "loss" is not statistically established in this dataset.
    • dominant variance is upstream model-hop time; to claim directional differences, we need higher-N region/time-pinned reruns.
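
The parity/noise calls above follow from each CI95 straddling zero; a one-line check applied to the reported intervals:

```python
def is_parity(ci_lo_ms, ci_hi_ms):
    """A paired latency delta is read as parity/noise here when its 95%
    interval contains zero (the reading used in the grouping above)."""
    return ci_lo_ms <= 0.0 <= ci_hi_ms

# All three commercial comparisons above straddle zero:
assert is_parity(-193.985, 61.444)
assert is_parity(-162.047, 274.981)
assert is_parity(-148.517, 683.114)
```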

AWS G5 Seq1 Hybrid Tuning (2026-02-22)

  • Warm matrix (seq1_hybrid_20260222T1554Z):
    • default: 54.505 ms mean, 82.134 ms p99.
    • qk-cublas: 54.572 ms mean, 81.776 ms p99.
    • pv-cublas: 54.281 ms mean, 80.754 ms p99.
    • both-cublas: 54.822 ms mean, 79.947 ms p99.
  • Cold sanity (seq1_hybrid_20260222T1558Z):
    • default full 147.756 ms vs pv-cublas full 149.293 ms.
  • Interpretation:
    • pv-cublas gives the best warm mean/p99 in this pass, but slightly worsens cold full latency.
    • default remains custom seq1 for the best overall cold/hot balance.

AWS G5 Seq1 Fused Follow-Up (2026-02-22)

  • Follow-up artifact pack: seq1_hybrid_fused_20260222T192656Z.
  • Code changes:
    • fused seq_q=1 softmax+PV kernel in custom path
    • seq1 QK kernel launch retune (64/128/256 by head_dim)
  • Warm deltas vs prior matrix (seq1_hybrid_20260222T1554Z):
    • default mean: 54.505 -> 52.535 ms (1.037x)
    • default p99: 82.134 -> 80.554 ms (1.020x)
    • pv-cublas mean: 54.281 -> 51.964 ms (1.045x)
    • pv-cublas p99: 80.754 -> 78.519 ms (1.028x)
  • Cold deltas vs prior sanity (seq1_hybrid_20260222T1558Z):
    • default TTFT: 6.447 -> 6.209 ms (1.038x)
    • default full: 147.756 -> 145.587 ms (1.015x)
    • pv-cublas TTFT: 6.450 -> 6.215 ms (1.038x)
    • pv-cublas full: 149.293 -> 147.937 ms (1.009x)
  • Interpretation:
    • this moved both warm and cold in the correct direction without changing model routing/task behavior.
    • default custom path remains the balanced default; pv-cublas still leads warm-only in this slice.

H100 Fused cuDNN SDPA Probe (2026-02-22)

  • Probe artifact pack: cudnn_sdpa_h100_probe_20260222T1935Z.
  • Results:
    • alignment sweep: all cnt=0 (align={16,32,64,128,256}).
    • shape/layout sweep: tested=1440, supported=0.
    • debug traces show candidate engines (8/9/10/11) but no viable configs after support checks:
      • NOT_SUPPORTED_GRAPH_PATTERN (8/9/11)
      • NOT_SUPPORTED_ARCH_MISMATCH (10, Blackwell-only).
  • Interpretation:
    • this H100 probe formulation still finds no viable fused engines.
    • proxy path remains explicit opt-in only for legacy A/B.

Phase 3 Realistic-v1 Loop Pack (2026-02-22, 3 seeds baseline + 3 seeds stress)

  • Summary artifact: phase3_realistic_v1_summary_20260222T143919Z.json.
  • Baseline means:
    • internal success 1.0000
    • external success 0.9010
    • external/internal latency ratio 15.8563x
    • external/internal steps ratio 1.8037x
  • Stress means:
    • internal success 1.0000
    • external success 0.9010
    • external/internal latency ratio 75.3563x
    • external/internal steps ratio 1.8037x
  • Interpretation:
    • moving to richer file-backed fixtures did not change direction; internal remains faster and more reliable.
    • stress again amplifies external hop latency substantially while internal remains stable.

Phase 3 Realistic-v1 Uncertainty Ablation (2026-02-22, seed 7 baseline+stress)

  • Comparison artifact: phase3_uncertainty_compare_realistic_v1_s7_20260222T144116Z.json.
  • Success deltas with uncertainty enabled (int_on_ext_on vs off-arms):
    • normalized_logprob:
      • baseline: internal +0.2500, external +0.2500
      • stress: internal +0.2500, external +0.2344
    • raw_logit_margin: same deltas as above.
    • hybrid: same deltas as above.
  • Interpretation:
    • uncertainty-aware branching continues to provide positive success deltas on realistic-v1.
    • under stress, external benefit remains positive but slightly reduced.

Routing (Internal vs External, G5)

  • Internal mean: 94.849 ms
  • External mean: 97.927 ms
  • External/Internal: 1.032x (internal faster)

Routing Failure-Amplification Stress (G5, 2026-02-18)

  • Internal mean: 76.071 ms
  • External mean: 109.806 ms (1.443x external/internal)
  • Internal error rate: 0.0000
  • External error rate: 0.0833
  • Error-rate amplification: inf (external errors present, internal none)
  • External retry/failure signal: tool retries mean 0.182, taxonomy tool_hop_failed=4
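
The "inf" amplification above falls out of a zero internal error rate with nonzero external errors; a sketch of that ratio (the 0/0 convention shown is an assumption, not necessarily the harness behavior):

```python
import math

def error_amplification(external_rate, internal_rate):
    """External/internal error-rate ratio; inf when the internal path is
    error-free but external errors are present. Returning 1.0 for the
    0/0 case is an illustrative choice."""
    if internal_rate == 0.0:
        return math.inf if external_rate > 0.0 else 1.0
    return external_rate / internal_rate

assert error_amplification(0.0833, 0.0) == math.inf
```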

Routing Matrix Expansion (G5, 2026-02-19, 6 profiles)

  • Baseline profile (p00): ratio 1.0420x, external error 0.0000.
  • Mild fail profile (p01): ratio 1.0480x, external error 0.0000.
  • Mild timeout profile (p02): ratio 1.1420x, external error 0.0000.
  • Mixed moderate (p03): ratio 1.1640x, external error 0.0417.
  • Mixed aggressive (p04): ratio 1.4360x, external error 0.0833.
  • Mixed aggressive + retry2 (p05): ratio 1.4160x, external error 0.0833.
  • Internal error rate stayed 0.0000 across all profiles.
  • Matrix-wide mean ratio: 1.2080x external/internal.

Routing Cross-Host Pilot (2026-02-19, local client -> SSH tunnel -> G5)

  • Baseline profile (crosshost-p00-baseline, 12 runs):
    • internal mean: 1071.477 ms
    • external mean: 1059.478 ms
    • external/internal ratio: 0.989x
    • internal error rate: 0.0000
    • external error rate: 0.0000
  • Mild-timeout profile (crosshost-p02-timeout-mild, 12 runs):
    • internal mean: 1054.123 ms
    • external mean: 1123.393 ms
    • external/internal ratio: 1.066x
    • internal error rate: 0.0000
    • external error rate: 0.0000
    • external tool retries mean: 0.083
  • Stress profile (crosshost-p04-stress, 12 runs):
    • internal mean: 1056.013 ms
    • external mean: 1100.010 ms
    • external/internal ratio: 1.042x
    • internal error rate: 0.0000
    • external error rate: 0.0833
    • external tool retries mean: 0.182
  • Interpretation:
    • in cross-host conditions, stress again amplifies external-path latency and error rates while internal remains error-free.

Routing Split-Host Matrix (2026-02-19, canonical Track B)

  • Topology:
    • GPU host: runtime endpoint
    • CPU host: external controller + tool services
    • controller/tool calls runtime over VPC private network
  • Profiles (12 runs each):
    • splithost-p00-baseline: ratio 0.995x, ext error 0.0000.
    • splithost-p01_fail_mild: ratio 0.998x, ext error 0.0000, tool retries 0.021.
    • splithost-p02_timeout_mild: ratio 1.042x, ext error 0.0000, tool retries 0.021.
    • splithost-p03_mixed_moderate: ratio 1.001x, ext error 0.0417, tool retries 0.065.
    • splithost-p04_mixed_aggressive: ratio 1.087x, ext error 0.0833, tool retries 0.182.
    • splithost-p05_mixed_aggressive_retry2: ratio 1.045x, ext error 0.0833, tool retries 0.091.
  • Matrix-wide:
    • external/internal latency ratio mean 1.028x
    • internal error mean 0.0000
    • external error mean 0.0347
  • Interpretation:
    • baseline remains near parity, while timeout/failure pressure amplifies external-path latency and error rate; internal path remains error-free across all profiles.

Internet Multi-Hop Matrix (2026-02-20, Fly.io + Commercial APIs)

  • Topology:
    • internal path: local client -> commercial API
    • external path: local client -> Fly controller/tool -> same commercial API
  • OpenAI (gpt-5.2, runs=3, 3 profiles):
    • matrix ratio mean: 1.1123x external/internal
    • baseline: 1.110x
    • timeout-mild: 1.082x
    • mixed-aggressive: 1.145x
    • mixed-aggressive external error: 0.0833 (internal 0.0000)
  • OpenRouter (openai/gpt-5.2, runs=3, 3 profiles):
    • matrix ratio mean: 0.7553x external/internal
    • baseline: 0.686x
    • timeout-mild: 0.891x
    • mixed-aggressive: 0.689x
    • mixed-aggressive external error: 0.1667 (internal 0.0000)
  • OpenRouter (anthropic/claude-sonnet-4.6, runs=3, 3 profiles):
    • matrix ratio mean: 1.0277x external/internal
    • baseline: 1.236x
    • timeout-mild: 0.968x
    • mixed-aggressive: 0.879x
    • mixed-aggressive external error: 0.1667 (internal 0.0000)
  • Interpretation:
    • OpenAI matrix supports the expected direction under internet hops: external path is slower and less reliable in stress.
    • OpenRouter remains non-canonical for Track B direction claims in this topology due to its mixed/inverted profile direction and elevated errors.

Local Control Matrix (No Fly Scheduler Path, 2026-02-20)

  • Topology:
    • internal path: local client -> commercial API
    • external path: local client -> local standalone controller/tool -> same commercial API
  • OpenAI (gpt-5.2, runs=8, 3 profiles):
    • matrix ratio mean: 0.9867x external/internal
    • profiles: baseline 0.995x, timeout-mild 0.977x, mixed-aggressive 0.988x
    • external error mean: 0.0313
  • OpenRouter (anthropic/claude-sonnet-4.6, runs=8, 3 profiles):
    • matrix ratio mean: 1.0663x external/internal
    • profiles: baseline 1.055x, timeout-mild 1.141x, mixed-aggressive 1.003x
    • external error mean: 0.0313
  • Interpretation:
    • higher-N controls reduce jitter and show mixed but informative behavior: OpenAI near parity, OpenRouter Sonnet trending external > internal.
    • external stress errors remain present while internal stayed error-free.

Task-Family Parity Split (Local Control, Higher-N, 2026-02-20)

  • OpenAI gpt-5.2 (runs=8):
    • model_only: external/internal 0.958x (near parity, slight inversion).
    • tool_only: external/internal 1.136x (external slower).
    • errors: internal 0.0, external 0.0 in both task families.
  • OpenRouter anthropic/claude-sonnet-4.6 (runs=8):
    • model_only: external/internal 1.044x (external slower).
    • tool_only: external/internal 1.051x (external slower).
    • errors: internal 0.0, external 0.0 in both task families.
  • Interpretation:
    • task-family split removes ambiguity from mixed task composition.
    • tool-required tasks consistently favor internal routing on both providers.
    • model-only behavior is provider-sensitive but remains close enough that architecture effects are small compared with provider/runtime variance.

Qwen Cold Upload GPU-Convert Ablation (2026-02-19, G5)

  • A/B toggle:
    • off: TRENI_TENSOR_CONVERT_GPU=0
    • on: default GPU conversion path enabled
  • Qwen first-hit metrics:
    • full latency: 1116.567 ms -> 238.740 ms (4.677x faster).
    • decoder tensor upload: 1007 ms -> 129 ms (7.806x faster).
    • decoder tensor convert: 862 ms -> 6 ms (143.667x faster).
    • decoder tensor h2d: 143 ms -> 121 ms (1.182x faster).
    • startup + full response total: 2119.906 ms -> 1242.057 ms (1.707x faster).
  • Interpretation:
    • this isolates CPU tensor conversion as the dominant cold bottleneck and shows that moving conversion to GPU materially reduces Qwen cold path latency.
  • External-cold runtime-only confirmation (preload enabled, max_tokens=48):
    • startup-to-healthy: 2004.560 -> 1003.455 ms (1.997x faster).
    • request full latency: 317.989 -> 317.276 ms (no material change).
    • cold-total first response: 2322.549 -> 1320.731 ms (1.759x faster).
    • cold-total first token: 2009.697 -> 1008.582 ms (1.993x faster).

Runtime vs vLLM External-Cold Repeatability (2026-02-19, 3 runs)

  • Matched setup:
    • same G5 host, same model family, token parity (max_tokens=48), runtime preload enabled.
  • 3-run means:
    • runtime TTFT 5.135 ms vs vLLM 84.390 ms (16.433x speedup).
    • runtime request full 319.063 ms vs vLLM 1111.463 ms (3.484x speedup).
    • runtime cold-total first response 1656.573 ms vs vLLM 31151.892 ms (18.805x speedup).
  • Runs 2-3 only (post-first-run stabilization):
    • TTFT 17.211x, request full 3.416x, cold-total 22.395x.
  • Interpretation:
    • the cold-path fix and request-path lead hold in a fresh runtime-vLLM repeatability rerun after restoring vLLM in the benchmark env.

External Cold All-Backend Repeatability (2026-02-19, 3 runs, GPU-Convert Fix2)

  • Setup:
    • same G5 host, same model family (Qwen 3B), token parity (max_tokens=48), runtime preload enabled.
    • backends: runtime + PyTorch + vLLM + Ollama.
  • 3-run means (all runs):
    • runtime: startup 2339.131 ms, TTFT 5.131 ms, request full 318.315 ms, cold-total 2657.447 ms.
    • vLLM/runtime ratios: TTFT 16.091x, request full 3.852x, cold-total 10.887x.
    • PyTorch/runtime ratios: TTFT 115.313x, request full 7.508x, cold-total 3.921x.
    • Ollama/runtime ratios: TTFT 2108.743x, request full 35.118x, cold-total 4.584x.
  • Stable reference (runs 1-2):
    • runtime startup/cold-total: 1003.915/1321.205 ms.
    • vLLM/runtime ratios: TTFT 18.275x, request full 4.298x, cold-total 21.875x.
  • Runtime-only stability sweep (5 runs):
    • median runtime startup/cold-total: 1003.400/1320.952 ms.
    • one run showed preload upload outlier (decoder_tensor_upload=1877.485 ms, decoder_tensor_h2d=1869.296 ms), inflating mean startup.
  • Interpretation:
    • request-path advantage is stable; residual cold variance is now a preload upload consistency problem, not a decoder compute bottleneck.

External Cold All-Backend Repeatability (2026-02-19, 3 runs, GPU-Convert + Host-Prefetch Fix)

  • Setup:
    • same G5 host, same model family (Qwen 3B), token parity (max_tokens=48), runtime preload enabled.
    • backends: runtime + PyTorch + vLLM + Ollama.
    • runtime cold change: host-page MADV_WILLNEED prefetch for large tensor source ranges (TRENI_TENSOR_HOST_PREFETCH=1).
  • 3-run means:
    • runtime: startup 1003.836 ms, TTFT 5.130 ms, request full 316.403 ms, cold-total 1320.240 ms.
    • vLLM/runtime ratios: TTFT 16.537x, request full 3.896x, cold-total 21.918x.
    • PyTorch/runtime ratios: TTFT 108.567x, request full 7.341x, cold-total 14.601x.
    • Ollama/runtime ratios: TTFT 514.414x, request full 9.471x, cold-total 3.029x.
  • Runtime-only stability compare (5 runs before vs after host-prefetch):
    • startup max: 3006.388 -> 1003.627 ms.
    • cold-total max: 3324.212 -> 1322.338 ms.
    • decoder_tensor_h2d max: 1869.296 -> 120.671 ms.
    • decoder_tensor_upload max: 1877.485 -> 128.777 ms.
  • Interpretation:
    • cold preload upload variance is effectively removed in this sweep and runtime’s request-path lead remains intact.
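
The host-prefetch change above hints the kernel to page in large tensor source ranges before upload, so the later copy does not stall on page faults. The same madvise(MADV_WILLNEED) idea can be sketched in Python (illustrative only; the runtime does this in native code behind TRENI_TENSOR_HOST_PREFETCH=1):

```python
import mmap
import os

def prefetch_then_read(path, offset, length):
    """Map a file, issue a MADV_WILLNEED hint over a byte range so the
    kernel pages it in ahead of use, then read it. The hint start is
    aligned down to a page boundary as madvise requires."""
    size = os.path.getsize(path)
    length = min(length, size - offset)
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as mm:
        if hasattr(mmap, "MADV_WILLNEED"):  # POSIX-only constant
            start = (offset // mmap.PAGESIZE) * mmap.PAGESIZE
            mm.madvise(mmap.MADV_WILLNEED, start, offset + length - start)
        return mm[offset:offset + length]
```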

Phase 3 Agentic Loops (Canonical G5 Baseline, 2026-02-19, 3 seeds)

  • Internal success rate mean: 1.0000.
  • External success rate mean: 0.9006.
  • External/Internal latency ratio mean: 16.0603x.
  • External/Internal steps ratio mean: 1.8147x.
  • Scenario signal (external success mean):
    • retrieval correction: 1.0000
    • tool-state adaptation: 0.7417
    • confidence-gated branching: 1.0000

Phase 3 Agentic Loops (Canonical G5 Stress, 2026-02-19, 3 seeds)

  • Stress profile: tool fail every 9, timeout every 11 (1.1s sleep), controller timeout 0.35s, retries 2.
  • Internal success rate mean: 1.0000.
  • External success rate mean: 0.8782.
  • External/Internal latency ratio mean: 77.1703x.
  • External/Internal steps ratio mean: 1.8240x.
  • Scenario signal (external success mean):
    • retrieval correction: 1.0000
    • tool-state adaptation: 0.6833
    • confidence-gated branching: 1.0000

Phase 4 Kickoff (Lambda A100/H100, Phase 3 Canonical, 2026-02-20)

  • A100 (3 baseline seeds + 3 stress seeds):
    • baseline: internal success 1.0000, external success 0.9006, external/internal latency 16.8790x.
    • stress: internal success 1.0000, external success 0.8782, external/internal latency 77.8613x.
  • H100 (3 baseline seeds + 3 stress seeds):
    • baseline: internal success 1.0000, external success 0.9006, external/internal latency 18.6933x.
    • stress: internal success 1.0000, external success 0.8782, external/internal latency 72.7407x.
  • Interpretation:
    • Track C behavior is hardware-stable: internal path keeps perfect success while external remains weaker on tool-state adaptation and pays a large stress-amplified latency penalty.
    • This is now paired with completed Phase 2 + C2 reruns on the same Lambda hardware classes.

Phase 4 Full Reruns (Lambda A100/H100, Phase 2 + C2, 2026-02-20)

  • A100:
    • cold first-hit summary: startup 1002.708 ms, TTFT 29.657 ms, full 32.008 ms.
    • warm request latency: mean 10.356 ms, p99 14.536 ms.
    • routing matrix overall: external/internal 2.4300x, external error 0.0347, internal error 0.0000.
    • C2 runtime-native deltas: baseline +0.2308/+0.2308 (internal/external), stress +0.2308/+0.2212.
  • H100:
    • cold first-hit summary: startup 1004.890 ms, TTFT 56.944 ms, full 62.064 ms.
    • warm request latency: mean 18.491 ms, p99 24.944 ms.
    • routing matrix overall: external/internal 2.3972x, external error 0.0347, internal error 0.0000.
    • C2 runtime-native deltas: baseline +0.2308/+0.2308 (internal/external), stress +0.2308/+0.2212.
  • Interpretation:
    • Cross-hardware direction stays consistent: internal path remains more stable/reliable while external routing shows stress-amplified latency and error behavior.
    • Runtime-native uncertainty deltas hold their positive baseline/stress direction on both A100 and H100 in this calibrated setup.

Paper Package (2026-02-20)

  • Generated outputs:
    • /benchmarks/paper_package/latest/package_summary.json
    • /benchmarks/paper_package/latest/paper_package.md
    • /benchmarks/paper_package/latest/tables/*.csv
    • /benchmarks/paper_package/latest/manuscript/figure_manifest.json
    • /benchmarks/paper_package/latest/manuscript/captions.md
    • /benchmarks/paper_package/latest/manuscript/claims.md
    • /benchmarks/paper_package/latest/manuscript/figures/*.mmd
  • Scope:
    • consolidates canonical G5 + Lambda A100 + Lambda H100 into paper-ready tables for:
      • Phase 2 cold/hot summary
      • routing matrix summary
      • C2 runtime-native deltas
      • Phase 3 loops baseline/stress
      • external-cold backend comparison (G5)
    • provides manuscript-ready figure/caption/claim templates with direct table provenance.

Phase 3 Uncertainty Ablation (Baseline Matrix, 2026-02-19, runs=8)

Arms:

  • int_on_ext_on
  • int_off_ext_on
  • int_on_ext_off
  • int_off_ext_off

Sources:

  • normalized_logprob
  • raw_logit_margin
  • hybrid
  • runtime_native (canonical rerun now complete)

Observed deltas:

  • Internal success delta (uncertainty on vs off, external fixed on): +0.2308 across all three sources.
  • External success delta (uncertainty on vs off, internal fixed on): +0.2308 across all three sources.
  • Direction is stable across source definitions, while latency ratios vary by source.

Interpretation:

  • The loop benchmark now has first direct evidence that uncertainty-aware branching contributes to task success, not only narrative plausibility.
  • This baseline matrix uses harness-level synthetic uncertainty signals.
  • Runtime-native uncertainty rerun is now published and aligns directionally with this result.

Phase 3 Uncertainty Ablation (Repeatability + Stress, 2026-02-19, 3 seeds each)

Baseline means:

  • Internal uncertainty success delta: +0.2308 (all sources).
  • External uncertainty success delta: +0.2308 (all sources).

Stress means:

  • Internal uncertainty success delta: +0.2308 (all sources).
  • External uncertainty success delta: +0.2212 (all sources).

Stress minus baseline:

  • Internal uncertainty delta change: 0.0000.
  • External uncertainty delta change: -0.0096.

Interpretation:

  • Uncertainty-aware branching gains are stable under both baseline and injected timeout/failure stress in this harness.
  • This now has a canonical runtime-native corroboration run.

Phase 3 Uncertainty Ablation (Runtime-Native Canonical Rerun, 2026-02-19, 3 seeds each, Superseded)

  • Pre-fix issue:
    • greedy decode uncertainty path emitted flat zeros (mean_logprob=0, mean_entropy=0), making runtime-native C2 non-informative.
  • Fix:
    • greedy sampling kernel now computes logprob + entropy from logits using log-sum-exp.
  • Baseline runtime-native uncertainty deltas:
    • internal: +0.1026
    • external: +0.1155
  • Stress runtime-native uncertainty deltas:
    • internal: +0.2308
    • external: +0.2212
  • Interpretation:
    • this run established runtime-native wiring, but part of the seed set later showed probe fallback contamination.
    • use the 2026-02-20 quality-gated rerun below for current canonical interpretation.
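
The log-sum-exp fix described above can be illustrated in a few lines; this sketch shows the shape of the computation (stable logprob and entropy from raw logits), not the runtime's greedy sampling kernel:

```python
import math

def greedy_logprob_entropy(logits):
    """Log-probability of the argmax token plus distribution entropy,
    computed from raw logits via log-sum-exp. Subtracting the max before
    exponentiating keeps the computation stable for large logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    log_probs = [l - log_z for l in logits]
    probs = [math.exp(lp) for lp in log_probs]
    entropy = -sum(p * lp for p, lp in zip(probs, log_probs))
    return max(log_probs), entropy
```

For a two-way uniform distribution (logits [0, 0]) this yields logprob log(0.5) and entropy log(2), i.e. non-zero values where the pre-fix path emitted flat zeros.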

Phase 3 Uncertainty Ablation (Runtime-Native Quality-Gated Rerun, 2026-02-20, 3 seeds each)

  • Rerun setup:
    • source: runtime_native
    • seeds: 7/11/19
    • baseline + stress
    • runtime fast probe config: TRENI_DEMO_LAYERS=2
    • client consumes awareness.generation first (legacy uncertainty fallback preserved)
  • Quality gate:
    • all runtime-native arm artifacts in this rerun have non-zero requests/ok and fallback=0, errors=0.
  • Clean runtime-native uncertainty deltas:
    • baseline internal: -0.1538
    • baseline external: -0.1217
    • stress internal: -0.1538
    • stress external: -0.1089
  • Interpretation:
    • with clean zero-fallback runtime-native probes, this awareness3 rerun showed uncertainty-on was harmful in this harness.
    • runtime-native transport/response wiring stayed validated and triggered the calibration pass below.
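The awareness.generation-first consumption with a preserved legacy fallback can be sketched as below; the field names are assumptions inferred from the description above, not a documented schema:

```python
def extract_uncertainty(resp: dict):
    """Prefer the newer awareness.generation payload; fall back to the
    legacy top-level uncertainty field only when it is absent.
    Field names here are illustrative assumptions."""
    gen = resp.get("awareness", {}).get("generation")
    if gen is not None:
        return gen
    return resp.get("uncertainty")  # legacy fallback path
```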

Phase 3 Uncertainty Ablation (Runtime-Native Calibrated Rerun calib1, 2026-02-20, 3 seeds each)

  • Calibration update:
    • runtime-native decision confidence now uses calibrated generation confidence (floor/ceil scaling) blended with prior confidence and optional route confidence.
    • calibration knobs are now forwarded through the ablation runner for reproducible reruns.
  • Rerun setup:
    • source: runtime_native
    • seeds: 7/11/19
    • baseline + stress
    • calibration params: prior weight 0.75, confidence floor 0.10, confidence ceil 0.35, route blend 0.10
  • Quality gate:
    • all runtime-native arm artifacts in this rerun have non-zero requests/ok and fallback=0, errors=0.
  • Calibrated runtime-native uncertainty deltas:
    • baseline internal: +0.1539
    • baseline external: +0.1058
    • stress internal: +0.1539
    • stress external: +0.1154
  • Interpretation:
    • calibrated runtime-native uncertainty recovers positive on/off gains in this harness under both baseline and stress.
    • C2 is now re-locked for this setup; optional next work is region-pinned commercial multi-hop controls (and higher-N only where needed).
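A minimal sketch of the calib1 blend, assuming floor/ceil min-max rescaling and linear mixing. The runtime's exact formula is not shown in this log; the defaults below simply mirror the listed knobs:

```python
def calibrated_confidence(gen_conf, prior_conf, route_conf=None,
                          prior_weight=0.75, floor=0.10, ceil=0.35,
                          route_blend=0.10):
    """Hypothetical calib1-style blend: rescale raw generation confidence
    into [0, 1] via the floor/ceil window, mix with the prior confidence,
    then optionally blend in a route confidence."""
    scaled = min(max((gen_conf - floor) / (ceil - floor), 0.0), 1.0)
    blended = prior_weight * prior_conf + (1.0 - prior_weight) * scaled
    if route_conf is not None:
        blended = (1.0 - route_blend) * blended + route_blend * route_conf
    return blended
```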

External Cold Comparison (G5, 2026-02-18, Qwen 3B family)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
| --- | --- | --- | --- | --- | --- |
| runtime | 1003.537 ms | 1108.979 ms | 1339.459 ms | 2112.516 ms | 2342.996 ms |
| pytorch_transformers | - | 528.483 ms | 2288.516 ms | 6965.227 ms | 8725.259 ms |
| vllm | 24032.203 ms | 51.763 ms | 1036.815 ms | 24083.966 ms | 25069.018 ms |
| ollama (GGUF) | 1002.695 ms | 2168.902 ms | 2527.411 ms | 3171.597 ms | 3530.106 ms |

Runtime-normalized (each backend's latency divided by runtime's; higher means runtime is faster):

  • PyTorch cold total first response: 3.724x runtime.
  • vLLM cold total first response: 10.7x runtime.
  • Ollama cold total first response: 1.507x runtime.

Interpretation:

  • vLLM request-path TTFT is fastest once healthy, but startup dominates total cold in this run.
  • Runtime is strongest on end-to-end cold total in this specific setup.
  • Ollama is quantized GGUF and kept with caveat tags (not precision-equivalent to BF16 paths).
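The runtime-normalized ratios above are simply each backend's cold-total first response divided by the runtime's, using the values from the Qwen 3B family table:

```python
# Cold-total first-response times (ms) from the table above.
cold_total = {
    "runtime": 2342.996,
    "pytorch_transformers": 8725.259,
    "vllm": 25069.018,
    "ollama": 3530.106,
}

# Dividing by the runtime baseline reproduces the 3.724x / 10.7x / 1.507x figures.
ratios = {k: v / cold_total["runtime"] for k, v in cold_total.items()}
```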

External Cold Comparison (G5, 2026-02-18, preload + tokenizer cache)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
| --- | --- | --- | --- | --- | --- |
| runtime | 2004.735 ms | 91.596 ms | 271.346 ms | 2096.331 ms | 2276.081 ms |
| pytorch_transformers | - | 522.795 ms | 2252.382 ms | 6644.737 ms | 8374.324 ms |
| vllm | 27036.682 ms | 51.725 ms | 1035.826 ms | 27088.407 ms | 28072.508 ms |
| ollama (GGUF) | 1002.508 ms | 2182.542 ms | 2538.609 ms | 3185.050 ms | 3541.117 ms |

Runtime-normalized (each backend's latency divided by runtime's; higher means runtime is faster):

  • vLLM request full latency: 3.817x runtime.
  • vLLM cold total first response: 12.334x runtime.
  • Remaining gap: vLLM request TTFT is still lower (51.725 ms vs runtime 91.596 ms).

Important caveat:

  • The run above predates wiring request max_tokens through runtime inference, so runtime generated only 4 tokens there and its request-path latencies are not token-matched against the other backends.

External Cold Comparison (G5, 2026-02-18, token parity fixed at 48, pre decoder fix)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
| --- | --- | --- | --- | --- | --- |
| runtime | 2004.203 ms | 91.207 ms | 2518.142 ms | 2095.410 ms | 4522.345 ms |
| pytorch_transformers | - | 501.946 ms | 2244.327 ms | 9530.450 ms | 11272.831 ms |
| vllm | 27036.248 ms | 51.310 ms | 1075.404 ms | 27087.558 ms | 28111.652 ms |
| ollama (GGUF) | 1002.560 ms | 2197.797 ms | 2556.652 ms | 3200.357 ms | 3559.212 ms |

Interpretation:

  • Runtime still wins cold-total first response vs vLLM (6.216x better).
  • Runtime request-path TTFT and full latency are still slower than vLLM at equal 48-token budget.
  • Residual bottleneck is decoder per-token step cost (not tensor upload anymore in this mode).

External Cold Comparison (G5, 2026-02-18, token parity + decoder/sampling fix)

| Backend | Startup->Healthy | Request TTFT | Request Full | Cold Total First Token | Cold Total First Response |
| --- | --- | --- | --- | --- | --- |
| runtime | 2004.759 ms | 5.022 ms | 311.289 ms | 2009.781 ms | 2316.048 ms |
| pytorch_transformers | - | 515.597 ms | 2291.854 ms | 6461.304 ms | 8237.561 ms |
| vllm | 24036.762 ms | 52.995 ms | 1094.517 ms | 24089.757 ms | 25131.279 ms |
| ollama (GGUF) | 1002.630 ms | 2184.383 ms | 2543.219 ms | 3187.013 ms | 3545.849 ms |

Interpretation:

  • Runtime now leads vLLM in request-path TTFT (10.553x faster) and full latency (3.516x faster) on this G5 token-parity run.
  • Runtime also remains much lower on cold-total first response (10.851x better vs vLLM).
  • Main measured bottleneck in prior parity run (sampling + per-step host sync) is no longer dominant.
  • Initial repeatability set (2026-02-18) kept the same direction: mean speedups 10.333x TTFT, 3.380x full latency, 10.688x cold-total first response.
  • Superseded by 2026-02-19 rerun and all-backend repeatability with stronger runtime advantage.

Cold TTFT Before vs After Index Cache (3-run means, G5)

| Model | Before | After | Speedup |
| --- | --- | --- | --- |
| qwen | 27574.564 ms | 1774.951 ms | 15.535x |
| donut | 67360.388 ms | 572.485 ms | 117.663x |
| bart | 77520.798 ms | 743.652 ms | 104.243x |
| minilm | 23.342 ms | 22.698 ms | 1.028x |

Cold TTFT: clean3 vs clean4 (3-run means, G5)

| Model | clean3 | clean4 | Improvement |
| --- | --- | --- | --- |
| qwen | 1411.831 ms | 1100.044 ms | 22.1% lower |
| donut | 619.499 ms | 150.322 ms | 75.7% lower |
| bart | 776.545 ms | 125.011 ms | 83.9% lower |
| minilm | 23.421 ms | 22.621 ms | 3.4% lower |
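The Speedup and Improvement columns in the two tables above follow directly from the before/after means:

```python
# Speedup column: before divided by after (index-cache table).
qwen_speedup = 27574.564 / 1774.951                       # ~15.535x

# Improvement column: percent reduction from clean3 to clean4.
bart_improvement = (776.545 - 125.011) / 776.545 * 100.0  # ~83.9% lower
```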

Dominant Cold Stages After clean4

  • model_tensor_index_build dropped to ~1-2.3 ms across models (down ~99.6% vs clean3 for Bart/Donut).
  • Qwen still dominated by decoder_tensor_upload (~1015 ms mean).
  • Donut and Bart are now mostly in decoder setup/upload and no longer index-build bound.

Reverted Experiment (Transparency)

  • Tried an async pinned conversion-buffer upload strategy after clean4.
  • Result: Qwen decoder_tensor_upload regressed to ~1419 ms and TTFT regressed by ~37%.
  • Decision: reverted that path; clean4 remains the accepted cold-path baseline.
  • Follow-up validation run set (clean7) matched clean4 within run noise (Qwen TTFT delta -0.16%).

What Was Actually Tested

  1. Baseline (Python/dependency path) runs on T4 and G5.
  2. Runtime cold and warm request-path benchmarks.
  3. True runtime-reported TTFT (not SSE first-event proxy).
  4. Internal-vs-external routing comparison on matched tasks.
  5. Internal-vs-external routing failure-amplification stress run with injected timeouts/failures.
  6. Internal-vs-external routing matrix expansion (baseline + 5 stress profiles on G5).
  7. Internal-vs-external routing cross-host pilot (baseline + stress via SSH tunnel to G5).
  8. Internal-vs-external routing split-host matrix (CPU router host + GPU runtime host, 6 profiles).
  9. Week 3 numerical parity checks (strict mode; donut intentionally skipped in parity harness).
  10. Phase 3 loop-capability canonical G5 benchmark (baseline profile, 3 seeds).
  11. Phase 3 loop-capability canonical G5 stress benchmark (failure/timeout injection + retries, 3 seeds).
  12. Qwen cold upload GPU-convert on/off ablation (same host, same harness, env-toggle only).
  13. External-cold runtime-only GPU-convert on/off ablation (preload enabled, matched token budget).
  14. Runtime-vLLM external-cold repeatability rerun (3 runs) after vLLM env restore.
  15. External-cold all-backend repeatability set (runtime + PyTorch + vLLM + Ollama, 3 runs).
  16. Runtime-only cold stability sweep (5 runs) with preload upload sub-stage inspection.
  17. Runtime host-prefetch cold fix rerun: runtime-only 5-run stability sweep.
  18. Runtime host-prefetch cold fix rerun: external-cold all-backend repeatability (3 runs).
  19. Lambda A100 full rerun: Phase 2 cold/hot runtime set + 6-profile routing matrix.
  20. Lambda H100 full rerun: Phase 2 cold/hot runtime set + 6-profile routing matrix.
  21. Lambda A100 C2 runtime-native calibrated set: baseline+stress (3 seeds each).
  22. Lambda H100 C2 runtime-native calibrated set: baseline+stress (3 seeds each).
  23. Internet multi-hop commercial matrix (Fly hops + OpenAI gpt-5.2, 3 profiles).
  24. Internet multi-hop commercial repeatability matrix (Fly hops + OpenRouter openai/gpt-5.2, runs=3).
  25. Internet multi-hop repeatability matrix (Fly hops + OpenRouter anthropic/claude-sonnet-4.6, runs=3).
  26. Local-control matrix (no Fly scheduler path) for OpenAI gpt-5.2 and OpenRouter anthropic/claude-sonnet-4.6.
  27. Higher-N local-control rerun (runs=8/profile) for the same OpenAI/OpenRouter Sonnet pair.
  28. Task-family parity split rerun (model_only + tool_only, runs=8) for OpenAI + OpenRouter Sonnet.

What Is Not Finished Yet

  1. Optional: add region-pinned/Fly-to-Fly control runs to reduce provider-path confounding in OpenRouter comparisons.

2026-03-08

Qwen3.5 Prompt Parity And Remaining Fidelity Gap

  • Confirmed on AWS that runtime Qwen3.5 prompt IDs match HF chat-template token IDs exactly on a failing IFEval case.
  • Ran a four-way prefill A/B (fast, no_linear, no_full, tokenwise) and all four produced the same first token on that case.
  • Conclusion: the remaining IFEval quality issue is not caused by prompt serialization or the new batched prefill path.

Step-0 Logit Comparison

  • Runtime step-0 top-k on the failing prompt ranked:
    • "The" first
    • "I" second
  • vLLM top logprobs on the same prompt exposed "I" and "The" as tied top candidates.
  • Interpretation: there is still a small decode/logit-distribution drift on the Qwen3.5 lane, even after prompt parity was confirmed.

IFEval Repair Loop Progress

  • Added evaluator-guided IFEval repair messaging in scripts/phase5_awareness_realbench.py.
  • Repair loop now uses failed-check feedback instead of only a generic uncertainty retry.
  • Added explicit repair hints for:
    • forbidden words
    • exact repeated text
    • JSON-only output
    • markdown title lines
    • word-count limits
    • two-response formatting
  • 3-seed IFEval-only repair sweep (phase5-q35-ifeval-aware-repair-ab3) produced:
    • runtime arm_a_control: 0.361111 score, 1317.056 ms
    • runtime arm_b_awareness_retry: 0.402778 score, 2895.902 ms
    • runtime arm_c_awareness_consistency: 0.402778 score, 2910.412 ms
    • vLLM arm_a_control: 0.430555 score, 3034.889 ms
  • Interpretation:
    • evaluator-guided repair improves runtime IFEval quality materially over control
    • runtime repaired loop is still below vLLM control on IFEval quality
    • runtime repaired loop remains slightly faster than vLLM control on this slice
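The evaluator-guided repair loop can be sketched as below; generate and run_checks are hypothetical stand-ins for the harness's model call and IFEval checkers in scripts/phase5_awareness_realbench.py:

```python
def repair_loop(generate, run_checks, prompt, max_repairs=1):
    """Sketch of evaluator-guided repair: rather than a generic
    uncertainty retry, feed the concrete failed checks back to the
    model as an explicit repair instruction."""
    messages = [{"role": "user", "content": prompt}]
    text = generate(messages)
    for _ in range(max_repairs):
        failures = run_checks(text)            # e.g. ["forbidden word used"]
        if not failures:
            break
        hint = ("Your answer failed these checks: " + "; ".join(failures)
                + ". Rewrite the answer so every check passes.")
        messages += [{"role": "assistant", "content": text},
                     {"role": "user", "content": hint}]
        text = generate(messages)
    return text
```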

2026-03-10

Stub Audit And Scope Lock

  • Direct Phase 5 runtime/vLLM comparisons are not using Hermes stub tools.
  • The only live wrapper-level stub issue found in the same-VM harness was in /Users/andrewcorrea/treni/scripts/hermes_same_vm_mvp.py, where optional terminal_tool / browser_tool cleanup shims could mask real imports under partial Hermes availability.
  • That wrapper path is now fixed to prefer real Hermes imports and only install a stub when the import genuinely fails.
  • Phase 3 remains partially synthetic by design:
    • /Users/andrewcorrea/treni/scripts/phase3_agentic_loop_benchmark.py still exposes a synthetic profile.
    • realistic_v1 reduces stub bias with file-backed fixtures, but it is still not the same lane as direct runtime/vLLM benchmarking.
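The prefer-real-import-then-stub pattern from the wrapper fix can be sketched as follows; make_stub is a hypothetical factory for the fallback shim:

```python
import importlib
import sys
import types

def load_tool_module(name, make_stub):
    """Prefer the real module; install a stub only when the import
    genuinely fails, so real Hermes tools are never masked by a
    synthetic package."""
    try:
        return importlib.import_module(name)  # real tool package wins
    except ImportError:
        stub = make_stub()                    # synthetic fallback only now
        sys.modules[name] = stub              # later imports see the stub too
        return stub
```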

Qwen Family Compatibility Re-Proved On AWS

  • Rebuilt the AWS runtime from the corrected /Users/andrewcorrea/treni/monolith/main.c and /Users/andrewcorrea/treni/monolith/models/decoder.cu.
  • qwen35 (Qwen/Qwen3.5-0.8B) direct inference is restored on AWS and again returns inference.used=true.
  • qwen35_4b (Qwen/Qwen3.5-4B) now performs real inference on the same A10G host when launched with the correct runtime_pool_mb=15360.
  • Packed and booted a fresh Qwen/Qwen2.5-0.5B-Instruct container (monolith_qwen25_0p5b.bin) to re-prove backward compatibility on the live host.
  • qwen35_9b aliasing remains wired in /Users/andrewcorrea/treni/scripts/qwen_runtime_env.py, but the current AWS box still has no packed monolith_qwen35_9b.bin and is not the intended proof GPU for 9B.

AWS Storage Update

  • The AWS Hugging Face cache is mostly active model state, not arbitrary duplicates:
    • Qwen/Qwen3.5-4B
    • Qwen/Qwen3.5-0.8B
    • Qwen/Qwen3-ASR-0.6B
    • Qwen/Qwen3-VL-Embedding-2B
    • Qwen/Qwen3-VL-Reranker-2B
  • Removed stale Whisper fallback cache copies after keeping Qwen ASR as the primary STT path.
  • Later host cleanup removed the stale monolith_qwen05* artifacts and the temporary Qwen2.5-0.5B-Instruct host cache/artifacts after backward compatibility had already been re-proven.
  • AWS root disk moved from roughly 97% used with about 3.7G free to roughly 94% used with about 6.5G free.

Live Hermes Tooling And PDF/RAG Validation

  • Native raw-PDF ingestion is now live in the worker:
    • /Users/andrewcorrea/treni/scripts/treni_local_tool_worker.py accepts paths=[...] and extracts PDF text natively via pdftotext when available or pypdf as fallback.
    • live AWS proof ingested /home/ubuntu/treni/benchmarks/same_vm_mvp/data/manual-pncp-api.pdf directly into the local RAG store.
  • Same-VM Hermes wrapper tool registration is now healthier:
    • the wrapper now loads the real Hermes tools package before installing any optional shims, so file/code tools are no longer masked by a synthetic top-level package.
    • live loaded-tool sets now include real read_file, write_file, search_files, patch, and execute_code.
  • Live single-tool Hermes probes on AWS now show:
    • qwen35 (0.8B) successfully uses real samevm_rag_search against the raw-PDF-ingested local RAG store.
    • qwen35 successfully calls real execute_code; the current issue is model-generated code quality, not tool availability.
    • qwen35 also calls real samevm_sqlite_query, but still tends to emit malformed SQL unless the prompt is tightly constrained.
    • qwen35_4b still lags 0.8B on exact-output and tool-call contract fidelity in the current same-VM harness.
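The worker's pdftotext-first, pypdf-fallback extraction can be sketched as below. This is a simplified model of scripts/treni_local_tool_worker.py, not its actual code:

```python
import shutil
import subprocess

def pick_extractor() -> str:
    """Choose the extraction path the way the worker does: the pdftotext
    CLI when it is on PATH, otherwise the pypdf library."""
    return "pdftotext" if shutil.which("pdftotext") else "pypdf"

def extract_pdf_text(path: str) -> str:
    """Simplified sketch of native raw-PDF ingestion for the RAG store."""
    if pick_extractor() == "pdftotext":
        # "-" writes the extracted text to stdout.
        out = subprocess.run(["pdftotext", path, "-"],
                             capture_output=True, check=True)
        return out.stdout.decode("utf-8", errors="replace")
    from pypdf import PdfReader               # fallback dependency
    reader = PdfReader(path)
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```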

Live Qwen Speed Snapshot

  • Current live speed probe on AWS:
    • qwen35 (0.8B): 128 completion tokens in about 1111.4 ms, ttft_ms≈103.1, decode_tps≈115.37
    • qwen35_4b (4B): 128 completion tokens in about 3313.3 ms, ttft_ms≈170.7, decode_tps≈38.64

4B Same-VM Promotion Check And Lambda 9B Capacity Sweep

  • Same-VM qwen35_4b parity debugging on AWS found the real runtime bug.
  • Root cause:
    • in /Users/andrewcorrea/treni/monolith/models/decoder.cu, the cached linear-attention step path repeated key heads before the key depthwise-conv update when q_proj_dim != attn_dim
    • Qwen3.5-4B hits that shape regime, so first-token decode drifted even though tokenizer parity and prompt format were already correct
  • Hugging Face on the same AWS host proved the model itself was fine:
    • exact-output prompts behaved correctly
    • tool-call prompts emitted valid <tool_call> structure
  • After the decode fix, live AWS 4B behavior changed from malformed outputs like </think>\n\nREADY and 1. to:
    • exact READY
    • normal sentence answers
    • valid structured tool_calls
  • Repaired canonical 4B full suite:
    • /Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-full-aws-r4-qwen35_4b_20260310T184433Z.json
    • result: 15/15
  • Repaired 4B suite scope now passes end-to-end:
    • direct runtime smoke
    • SQLite
    • raw PDF ingest + RAG search
    • embedding + reranking
    • TTS + Qwen ASR STT
    • Hermes runtime-status
    • Hermes RAG
    • Hermes SQLite exec/query
    • Hermes memory add/read
    • Hermes execute_code
  • AWS cleanup was deepened again:
    • removed the stale q35-orpo-notemplate-1772992302 training tree
    • removed checkpoint-1 from samevm-orpo-reload-q35-fixed_20260308T182430Z
    • pruned older same-VM debug WAV/debug-result artifacts and old worker logs
    • current root disk is still tight but improved to about 4.0G free
  • Lambda 9B provisioning is still blocked by real cloud-side capacity:
    • verified account auth and SSH key registration
    • repeated launch attempts across valid single-GPU types/regions returned either:
      • instance-operations/launch/insufficient-capacity, or
      • Cloudflare rate-limit 1015
    • no Lambda instance was created in this sweep
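The fixed ordering (depthwise-conv update on the real kv heads before GQA head repetition) can be illustrated with a toy scalar-head sketch; this models the described fix, not the actual decoder.cu kernel:

```python
def repeat_kv(heads, n_rep):
    """Expand kv heads to the query-head count for grouped-query attention."""
    out = []
    for h in heads:
        out.extend([h] * n_rep)
    return out

def cached_step_keys(k_heads, conv_states, conv_weights, n_rep):
    """Fixed ordering: update the per-kv-head depthwise conv FIRST, then
    repeat heads. The pre-fix path repeated first when q_proj_dim !=
    attn_dim, so the conv update saw duplicated heads and first-token
    decode drifted. Toy 1-D scalar heads for illustration only."""
    convolved = []
    for i, k in enumerate(k_heads):
        conv_states[i] = conv_states[i][1:] + [k]  # shift the causal window
        convolved.append(sum(s * w
                             for s, w in zip(conv_states[i], conv_weights[i])))
    return repeat_kv(convolved, n_rep)             # GQA expansion happens last
```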

Clean Same-VM Agent Selector Lane (2026-03-10)

  • Added a real model-dependent comparison harness:
    • /Users/andrewcorrea/treni/scripts/samevm_agent_regression_suite.py --mode agent_compare
  • Canonical comparison artifacts:
    • /Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35.json
    • /Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35_4b.json
  • Scope of this selector lane:
    • runtime health
    • worker health
    • direct runtime smoke
    • Hermes runtime-status
    • Hermes RAG search
    • Hermes SQLite exec/query
    • Hermes memory add/read
    • Hermes execute_code
  • Result on the AWS A10G host:
    • qwen35 (0.8B) passed 10/10
    • qwen35_4b (4B) passed 2/10
  • Key interpretation:
    • this selector artifact is now historical only
    • it predates the cached linear-attention decode fix for 4B
    • the repaired full suite is the current source of truth for 4B
  • Tightened 0.8B agent lane improvements that matter:
    • explicit script=true guidance fixed the SQLite exec scenario
    • memory recall is now validated through a new session prompt, which matches Hermes memory semantics
    • execute-code validation now uses an explicit one-line Python task and passes cleanly

Isolated Speed Snapshot (2026-03-10)

  • Current isolated 0.8B speed probe:
    • /Users/andrewcorrea/treni/benchmarks/same_vm_mvp/results/qwen35_model_speed_compare_20260310.md
    • cold first hit:
      • 103 completion tokens
      • ttft_ms=608.891
      • infer_ms=12444.005
      • tok/s=8.277
    • warm steady-state:
      • 103 completion tokens
      • ttft_ms≈95.386
      • infer_ms≈905.594
      • tok/s≈113.738
  • Current repaired 4B speed probe:
    • same prompt family, repeated live AWS requests
    • 119 completion tokens
    • ttft_ms≈158.877
    • infer_ms≈3093.814
    • tok/s≈38.464
  • Current interpretation:
    • 0.8B remains the speed-optimized lane
    • 4B is now the repaired stronger-capability lane, but it is materially slower on A10G
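The tok/s figures in these probes are consistent with completion tokens divided by inference wall time:

```python
def tok_per_s(completion_tokens: int, infer_ms: float) -> float:
    """Decode throughput: completion tokens over inference time."""
    return completion_tokens / (infer_ms / 1000.0)

# Reproducing the snapshot numbers above.
cold_08b = tok_per_s(103, 12444.005)  # ~8.277
warm_08b = tok_per_s(103, 905.594)    # ~113.738
warm_4b  = tok_per_s(119, 3093.814)   # ~38.464
```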
