# TODO

Live execution checklist and next actions.

## Priority Order

## Current Checklist

### Same-VM Agent Productization
- Register Treni same-VM tools natively inside Hermes instead of wrapper-only injection.
- Verify no duplicate Hermes native tools for browser/code execution/same-VM registrations.
- Fix multi-turn native Hermes 4B session replay bug (`tool_call_id` uniqueness).
- Make the split real-world discover -> SQLite -> RAG -> memory -> recall conversation lane pass on native Hermes 4B.
- Make the single-turn combined persistence prompt reliably satisfy SQLite + RAG + memory in one freeform turn.
- Add true token streaming for agent-mode turns in the public GPU Agent console.
- Move the public demo from AWS `g5.2xlarge` to a larger GPU host once Lambda (or another provider) gives stable capacity.
- Harden `Qwen3.5-4B` runtime supervision on `18081` so long demo sessions survive without manual restarts.
### Track A: Cold/Hot Foundations
- True TTFT instrumentation in runtime request path.
- 3x cold-first-hit repeatability set (G5).
- 3x warm steady-state repeatability set (G5).
- Cold bottleneck fix: per-model tensor lookup index cache.
- Cold rerun after fix with artifact pack.
- Add stage-level cold decomposition metrics (tokenizer load, index build, tensor upload, first decode step).
- Optimize `model_tensor_index_build` via fast tensor collect path and rerun 3x cold validation.
- Rerun 3x cold validation after reverting regressed upload path (`clean7`) and confirm `clean4` parity.
- Add sub-stage upload instrumentation (`decoder_tensor_convert`, `decoder_tensor_h2d`, `decoder_tensor_copy_total`).
- Add startup preload + tokenizer cache path to cut first-request upload/tokenizer overhead.
- Wire request `max_tokens` through runtime HTTP path for token-parity comparisons.
- Disable decoder per-step trace by default (`TRENI_DEMO_TRACE` opt-in).
- Reduce remaining Qwen request-path TTFT/full gap vs vLLM (decoder/sampling fixes validated on token-parity reruns).
- Align decode-stop behavior with vLLM/HF semantics (stop on end markers, not `im_start`) and keep chat cleanup token-level (`TRENI_HTTP_OUTPUT_FILTER=1` default, sanitize still opt-in).
  - Validation (2026-03-02, AWS qwen05 probe): prior `"<|im"` leak removed in direct `/v1/chat/completions` responses with decode-stop on.
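The decode-stop policy above can be sketched as follows; this is an illustrative Python model of the rule (stop only on end markers, strip control tokens at the token level rather than sanitizing substrings), not the runtime's actual C decode loop or vocabulary:

```python
# Hypothetical sketch of the decode-stop policy: stop on true end-of-turn
# markers (<|im_end|>, <|endoftext|>), never on <|im_start|>, and drop
# control tokens from output token-by-token (no post-hoc string sanitizing).
STOP_MARKERS = {"<|im_end|>", "<|endoftext|>"}
CONTROL_TOKENS = STOP_MARKERS | {"<|im_start|>"}

def decode_until_stop(step_fn, max_tokens):
    """step_fn() yields one decoded token string per call."""
    out = []
    for _ in range(max_tokens):
        tok = step_fn()
        if tok in STOP_MARKERS:          # stop only on end markers
            break
        if tok not in CONTROL_TOKENS:    # token-level cleanup
            out.append(tok)
    return "".join(out)
```

Because the filter operates on whole tokens, a partial marker like `"<|im"` can never leak into the text stream, which is what the validation probe above checks.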
- Fix tokenizer special-token encode parity for chat templates (`<|...|>` now encoded as atomic tokens in BPE path).
  - Validation (2026-03-02, AWS qwen05 prompt-id probe): template prompt length dropped (35 -> 25) and first token is now the expected chat-control token id instead of punctuation-fragment ids.
- Prevent HTTP heuristic route-text fallback when inference succeeds with empty output.
  - Validation (2026-03-02): empty generation now returns empty assistant content, not synthetic "Routed to ..." text.
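The atomic special-token fix can be illustrated with a small sketch: split the template text on `<|...|>` specials first, map each special to its single reserved id, and only BPE-encode the plain-text segments in between. The `special_ids` table and `bpe_encode` callable are stand-ins for the runtime's real tokenizer tables:

```python
import re

# Sketch: encode chat-template special tokens atomically instead of letting
# BPE split them into punctuation fragments. Tables here are illustrative.
SPECIAL_RE = re.compile(r"<\|[^|<>]+\|>")

def encode_with_specials(text, special_ids, bpe_encode):
    ids, pos = [], 0
    for m in SPECIAL_RE.finditer(text):
        ids.extend(bpe_encode(text[pos:m.start()]))  # ordinary BPE for plain text
        ids.append(special_ids[m.group(0)])          # one atomic id per special
        pos = m.end()
    ids.extend(bpe_encode(text[pos:]))
    return ids
```

Under this scheme a template like `<|im_start|>...` begins with exactly one chat-control id, matching the prompt-length drop observed in the probe.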
- Resolve qwen05 deterministic MCQ empty-completion parity gap (runtime token-0 stop vs vLLM non-empty output).
  - Root cause (2026-03-02): runtime Qwen template did not inject the default system preamble for user-only chats (HF/vLLM template does), shifting next-token distribution toward immediate EOS for that prompt.
  - Fix: inject Qwen default system preamble in HTTP chat-template build when no `system` message is provided.
  - Validation (2026-03-02, user-only prompts): runtime now returns non-empty MCQ output ("12") and no longer immediate-stops on EOS for that case.
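The fix amounts to a small message-list normalization before template rendering. A minimal sketch, assuming HF-style role/content dicts (the preamble text here is a placeholder, not the exact string the runtime injects):

```python
# Sketch of the preamble-injection fix: user-only chats get the Qwen default
# system message prepended so the rendered template matches HF/vLLM.
DEFAULT_SYSTEM = "You are a helpful assistant."  # placeholder preamble text

def build_chat_messages(messages):
    if not any(m["role"] == "system" for m in messages):
        return [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    return list(messages)  # caller-provided system message wins
```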
- Re-run qwen05 external-cold runtime-vLLM benchmark after template/decoder fixes and confirm non-empty runtime response on the prior failing path.
  - Artifacts (2026-03-02): `external_cold_qwen05_templatefix_20260302T154019Z.json`, `external_cold_qwen05_templatefix_nofixeos_20260302T154151Z.json`.
  - Result: runtime completion is non-empty (`usage_completion_tokens=3`), TTFT remains strongly ahead of vLLM on this profile.
- Re-run Phase 5 awareness benchmark on canonical `qwen` with matched depth/samples after qwen05 parity fixes.
  - Artifact (2026-03-02, `layers=36`, `samples=8`): `phase5_awareness_realbench_qwen-realbench-r9-templatefix1-l36s8_20260302T161123Z.json`.
  - Result snapshot: `gsm8k` recovered materially (A=0.625, C=0.750), while `gpqa_diamond` regressed vs `r5` (A=0.125), so the quality claim remains mixed by task family.
- Bring up Qwen3.5 serving path on AWS using vLLM nightly (`main` branch wheel path) and validate OpenAI-compatible endpoint.
  - Runtime env (2026-03-02): `.venv-vllm-nightly-q35`, `vllm 0.16.1rc1.dev...`.
  - Launch mode: `--language-model-only`, `--max-model-len 32768`, `--enforce-eager`.
- Resolve host infra blocker that broke Qwen3.5 startup (`No usable temporary directory`).
  - Root cause: root filesystem at 100%.
  - Fix: clean up old caches/venvs and run server with explicit `TMPDIR`.
- Run Qwen3.5 Phase 5 diagnostic sequence and remove A/B/C fairness noise.
  - `r1`: `phase5_awareness_realbench_qwen35-realbench-r1-s8-nonthinking_20260302T184159Z.json` (showed down deltas).
  - `r2`: `phase5_awareness_realbench_qwen35-realbench-r2-policyfix1-s8-nonthinking_20260302T184624Z.json` (partial improvement).
  - `r3` shared-first fairness fix: `phase5_awareness_realbench_qwen35-realbench-r3-sharedfirst-s8-nonthinking_20260302T184947Z.json` (all `B-A`/`C-A` deltas `0.0`).
- Clone paper reference implementation and align Phase 5 trigger policy to paper-style uncertainty loop.
  - Repo: `third_party/weave-logprobs-reasoning-loop`
  - Harness updates (`scripts/phase5_awareness_realbench.py`):
    - new trigger mode `paper|confidence|hybrid` (default `paper`),
    - paper trigger signals: `perplexity`, `max_entropy`, `low_confidence_tokens`,
    - retry prompt now carries uncertainty summary from first pass,
    - artifact trace now includes per-call uncertainty metrics/table.
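The three paper trigger signals can be computed from per-token logprob data. A minimal sketch, assuming `logprobs` is the chosen token's logprob per step and `top_logprobs` is a list of candidate logprobs per step (the harness's exact field shapes and thresholds may differ):

```python
import math

# Sketch of the paper-style uncertainty signals: perplexity over the chosen
# tokens, max per-step entropy over candidates, and a low-confidence count.
def uncertainty_signals(logprobs, top_logprobs, conf_threshold=-1.0):
    ppl = math.exp(-sum(logprobs) / len(logprobs))       # sequence perplexity
    entropies = []
    for cands in top_logprobs:                           # per-step entropy
        ps = [math.exp(lp) for lp in cands]
        z = sum(ps)                                      # renormalize truncated top-k
        entropies.append(-sum(p / z * math.log(p / z) for p in ps))
    low_conf = sum(1 for lp in logprobs if lp < conf_threshold)
    return {"perplexity": ppl, "max_entropy": max(entropies),
            "low_confidence_tokens": low_conf}
```

Any of the three crossing its threshold would mark the first pass as uncertain and trigger the retry prompt with the uncertainty summary attached.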
- Validate paper-mode path with an end-to-end AWS smoke run (Qwen3.5 nightly vLLM).
  - Artifact: `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-paper-smoke_20260302T191420Z.json`
  - Validation: `retry_decision.paper_reasons` and `loop_trace[*].uncertainty` populated per run.
- Run Qwen3.5 Phase 5 rerun (`r4`) with `--awareness-trigger-mode paper` and compare deltas vs `r3` no-up baseline.
  - Artifact: `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r4-paper-s8-nonthinking_20260302T191642Z.json`
  - Outcome: loop path works and triggers correctly, but net quality uplift is not present on this sample (`overall B -0.046875`, `overall C 0.0` vs A).
- Retune from fixed paper thresholds to adaptive uncertainty gating and rerun on Qwen3.5.
  - Implementation: adaptive trigger mode with rolling per-task uncertainty history in `scripts/phase5_awareness_realbench.py`.
  - `r5` artifact: `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r5-adaptive-s8-nonthinking_20260302T202105Z.json`.
  - Result vs `r4`: lower negative delta (`B-A -0.015625` vs `-0.046875`) and much lower latency overhead.
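Rolling per-task gating can be sketched as below; the quantile-based cutoff and window/warmup sizes are assumptions about how "adaptive" is realized (the actual logic lives in `scripts/phase5_awareness_realbench.py`):

```python
from collections import defaultdict, deque

# Sketch of adaptive uncertainty gating: keep a rolling per-task history of
# uncertainty scores and retry only samples above a task-local quantile.
class AdaptiveTrigger:
    def __init__(self, window=16, quantile=0.75, warmup=4):
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.quantile, self.warmup = quantile, warmup

    def should_retry(self, task, uncertainty):
        hist = self.history[task]
        hist.append(uncertainty)
        if len(hist) < self.warmup:
            return False                     # too little task evidence yet
        ranked = sorted(hist)
        cut = ranked[int(self.quantile * (len(ranked) - 1))]
        return uncertainty >= cut            # retry only unusually uncertain samples
```

Compared with fixed paper thresholds, this kind of gate retries fewer samples on a backend whose absolute uncertainty scale is shifted, which is consistent with the lower latency overhead reported for `r5`.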
- Run stricter adaptive follow-up (`r6`) and compare.
  - Artifact: `benchmarks/phase5_awareness_realbench/results/phase5_awareness_realbench_qwen35-realbench-r6-adaptive-strict-s8-nonthinking_20260302T202314Z.json`.
  - Result: `B-A=0.0` but `C-A=-0.03125`; kept `r5` adaptive defaults as the better balance.
- Add strict inference hard-fail mode for HTTP benchmarking (`TRENI_HTTP_REQUIRE_INFERENCE=1`) so empty/fallback outputs are rejected instead of silently counted.
  - Runtime now returns `502 {"error":"inference_required"}` when model inference is unused/invalid in strict mode.
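On the client side, the strict contract gives the harness an unambiguous three-way decision per sample. A minimal classification sketch (status codes match the contract above; the label names are illustrative):

```python
# Sketch: map a strict-mode HTTP reply to a scoring decision, so a fallback
# answer is counted as a hard failure instead of being silently scored.
def classify_strict_response(status, body):
    """status: HTTP status code; body: parsed JSON response."""
    if status == 200:
        return "score"                 # real inference output, safe to score
    if status == 502 and body.get("error") == "inference_required":
        return "hard_fail"             # strict mode rejected a fallback answer
    return "transport_error"           # anything else is an infra problem
```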
- Add strict canonical matrix runner for `Qwen/Qwen3.5-0.8B` runtime-vs-vLLM (`scripts/phase5_qwen35_runtime_vs_vllm_matrix.py`).
  - Enforces fixed seeds/params, endpoint preflight, hard artifact validation, and bootstrap CI output.
- Add explicit arm selection to Phase 5 harness + matrix runner (`--arms`, `--phase5-arms`) so the strict backend matrix can run Arm A-only.
- Unblock runtime decoder support for Qwen3.5 `linear_attn` layers and complete the strict canonical runtime-vs-vLLM matrix.
  - Matrix artifacts (2026-03-02): `phase5_qwen35_runtime_vs_vllm_matrix_20260302T221546Z.json`, `phase5_qwen35_runtime_vs_vllm_matrix_20260302T222013Z.json`.
  - Current canonical snapshot (`r1`, 3 seeds): runtime score `0.0503` vs vLLM `0.2170`; runtime latency `1881.188 ms` vs vLLM `178.093 ms`.
- Rerun strict Qwen3.5 matrix after decoder gate-layout fix in Arm A-only mode (3 seeds, 4 tasks, 8 samples/task).
  - Matrix artifacts (2026-03-03): `phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.json`, `phase5_qwen35_runtime_vs_vllm_matrix_20260303T104038Z.md`.
  - Outcome: quality gap narrowed (runtime `0.15625` vs vLLM `0.19097`), but runtime is still slower overall (`1723.685 ms` vs `958.757 ms`).
- Harden Phase 5 closed-form parsing to prevent false positives from long reasoning traces.
  - Changes (2026-03-04, `scripts/phase5_awareness_realbench.py`):
    - strict final-answer extraction for GPQA/GSM8K/AIME (`ANSWER:` / `Final Answer:` / boxed / strict numeric-only),
    - strip `<think>...</think>` blocks before parse,
    - reject long chain-of-thought "last number" fallback parses.
  - Validation artifact: `phase5_awareness_realbench_q35-parsefix-vllm-thinking1_20260304T032441Z.json` now returns `prediction_parsed=null` for unresolved thinking traces.
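The strict extraction rules above can be sketched in a few lines; the regex patterns are illustrative approximations of the harness's real rules:

```python
import re

# Sketch: strict final-answer extraction. Accept only an explicit
# ANSWER:/Final Answer: line, a \boxed{...} value, or a pure numeric-only
# completion; strip <think> blocks first; no "last number in the trace"
# fallback -- unresolved traces return None.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_final_answer(text):
    text = THINK_RE.sub("", text).strip()
    m = re.search(r"(?:ANSWER|Final Answer)\s*:\s*(.+)", text, re.IGNORECASE)
    if m:
        return m.group(1).strip()
    m = re.search(r"\\boxed\{([^{}]+)\}", text)
    if m:
        return m.group(1).strip()
    if re.fullmatch(r"-?\d+(?:\.\d+)?", text):   # strict numeric-only completion
        return text
    return None                                  # unresolved -> no parse
```

Returning `None` here is exactly what surfaces as `prediction_parsed=null` in the validation artifact: a long trace that never commits to a final answer scores as wrong instead of accidentally right.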
- Run post-parse-fix strict paired AB3 rerun on `gpqa_diamond` + `ifeval` (16/task, seeds `7/17/27`, Arm A, `request_logprobs=false`).
  - Summary artifact: `phase5_q35_runtime_vs_vllm_ifeval_gpqa_ab3_20260304T034227Z.json`
  - Result:
    - overall score: runtime `0.3403` vs vLLM `0.3229` (small edge, CI includes parity),
    - overall latency: runtime `1772.931 ms` vs vLLM `1553.034 ms` (runtime slower on aggregate),
    - stratified: runtime wins strongly on `ifeval` latency/score, but remains far slower on `gpqa_diamond`.
- Add Qwen3.5 tokenizer/full-vocab audit and extended endpoint probe matrix.
  - Tokenizer audit: `runtime-q35-tokenizer-audit-r4_20260306T190418Z.json`
  - Consolidated probe matrix: `qwen35-probe-matrix-r2_20260306T200035Z.json`
- Build same-VM Hermes MVP for local runtime + CPU tools and validate ORPO smoke training.
  - Smoke: `hermes-samevm-q35-smoke-r5_20260306T192703Z.json`
  - ORPO smoke launch: `hermes-samevm-q35-orpo-smoke-r1_20260306T194152Z.json`
- Recover the explicit AWS same-VM Qwen3.5 wrapper so it can auto-start runtime + tool worker and emit a usable final summary.
  - Wrapper artifact (2026-03-07): `benchmarks/same_vm_mvp/results/samevm-q35-stack_20260307T172158Z.json`
  - Smoke sub-artifacts: `benchmarks/qwen35_smoke/results/samevm-smoke-20260307T172124Z_20260307T172124Z.json`, `benchmarks/qwen35_smoke/results/samevm-smoke-20260307T172124Z_20260307T172124Z.md`
- Add a sequential one-host strict matrix runner for Qwen3.5 runtime-vs-vLLM and rerun it on the active AWS host.
  - Runner: `scripts/phase5_qwen35_remote_strict_matrix.py`
  - Contract artifacts (2026-03-07): `qwen35-tokenizer-audit-active_20260307T173024Z.json`, `qwen35-runtime-smoke-active2_20260307T173132Z.json`, `qwen35-isolated-ab-active_20260307T173228Z.json`
- Enable Qwen3.5 prefix cache by default and fix request-path TTFT accounting before the next strict rerun.
  - Code path: `monolith/main.c`
  - Late strict rerun artifact: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T191653Z.json`
  - Result: score recovered overall (runtime `0.333333` vs vLLM `0.315972`), but latency is still far behind (runtime `3809.745 ms` vs vLLM `1626.068 ms`).
- Keep two Qwen3.5 strict lanes explicit and fix sampled reproducibility.
  - Deterministic canonical strict run (2026-03-08): `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T204248Z.json`
    - overall: runtime `0.295139` vs vLLM `0.267361`; runtime `824.714 ms` vs vLLM `1572.529 ms`
    - `gpqa_diamond`: parity score, runtime slower
    - `ifeval`: runtime higher score and much faster
  - Sampled reproducibility is now fixed: `phase5_repro_runtime_ifeval_s7_raw_seedfix_r1.json` vs `phase5_repro_runtime_ifeval_s7_raw_seedfix_r2.json`
    - same seed/config holds at `0.3125` with `8/8` outputs identical
  - Sampled canonical strict run (2026-03-08): `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T220806Z.json`
    - overall: runtime `0.409722` vs vLLM `0.302083`; runtime `1617.187 ms` vs `2017.206 ms`
    - `gpqa_diamond`: runtime higher score, runtime slower
    - `ifeval`: runtime higher score and faster
- Profile the Qwen3.5 strict benchmark hotspot directly with `TRENI_STEP0_PROFILE`/`TRENI_DECODE_STAGE_PROFILE`.
  - Artifact: `benchmarks/phase5_awareness_realbench/results/q35-gpqa-profile-aws-clean_20260307T220200Z.json`
  - Clean current finding:
    - first call: `decoder_tensor_upload=218.091 ms`, `decoder_prefill=3263.527 ms`, `decoder_ttft=3317.441 ms`
    - second call: `decoder_tensor_upload=11.216 ms`, `decoder_prefix_cache_copy=0.162 ms`, `decoder_prefill=2690.001 ms`, `decoder_ttft=2750.672 ms`
    - step-0 decode remains small (`decoder_step0_layers ~8 ms`, `decoder_step0_logits_sample ~33-36 ms`)
  - Conclusion: the remaining GPQA gap is still dominated by long-prompt prefill, not tokenizer or step-0 decode.
- [~] Keep the Qwen3.5 prefix-cache path correctness-safe while continuing latency work.
  - Focused AWS profile (2x gpqa + 2x ifeval, 2026-03-07) found:
    - GPQA gets a real 64-token prefix-cache hit (`decoder_prefill ~3075 -> ~2697 ms`),
    - short IFEval prompts were tripping a prefix-cache/store CUDA invalid-argument path.
  - Safe runtime policy now skips prefix-cache store on short prompts while preserving long-prompt GPQA cache hits.
  - Follow-up smarter tiering (`cap=112`, quartile tiers + exact replay) now has clean latency evidence:
    - direct sequential GPQA profile: `q35-gpqa-profile-aws-seq2-cap112_20260307T222540Z.json`
      - second related-call `decoder_prefill 2696.101 -> 2544.202 ms`
      - second related-call `decoder_ttft 2747.697 -> 2595.907 ms`
    - clean strict seed-7 spot:
      - `cap112`: `phase5_qwen35_remote_strict_matrix_20260307T223218Z.json`
      - `cap64`: `phase5_qwen35_remote_strict_matrix_20260307T223555Z.json`
      - runtime latency delta (`112 - 64`):
        - overall `-363.908 ms`
        - `gpqa_diamond` `-420.699 ms`
        - `ifeval` `-307.116 ms`
  - Next requirement: convert this real but partial latency win into a multi-seed strict result that is not still behind vLLM overall.
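The safe store policy described above reduces to a small decision function; the short-prompt floor and the `cap=112` value mirror the tuning notes, while the exact boundary values in the runtime may differ:

```python
# Sketch of the prefix-cache store policy: skip the store for short prompts
# (which tripped the CUDA invalid-argument path and carry little hit value)
# and cap the stored prefix length for long prompts.
MIN_STORE_TOKENS = 64   # assumed floor below which storing is skipped
PREFIX_CAP = 112        # tuned cap from the cap112-vs-cap64 comparison

def prefix_store_len(prompt_len):
    """Tokens of the prompt to store in the prefix cache (0 = skip store)."""
    if prompt_len < MIN_STORE_TOKENS:
        return 0
    return min(prompt_len, PREFIX_CAP)
```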
- Remove Qwen3.5 launcher/config drift between the strict runner and the AWS same-VM stack.
  - Shared env source: `scripts/qwen_runtime_env.py`
  - Updated launchers: `scripts/qwen35_remote_isolated_ab.py`, `scripts/treni_local_tool_worker.py`, `scripts/hermes_same_vm_mvp.py`
  - Clean strict AB3 artifact: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260307T231500Z.json`
  - Current effect:
    - runtime overall score now leads in the paired set: `0.335648` vs vLLM `0.291667`
    - runtime overall latency is still far behind: `3690.124 ms` vs `1646.672 ms`
    - `gpqa_diamond` score is now parity, but latency remains the main loss
    - `ifeval` score improves clearly in runtime while remaining slower
- [~] Recover strict benchmark quality without giving back the new latency profile.
  - Batched hybrid prefill is now implemented:
    - linear-attention sequence forward
    - full-attention sequence prefill + K/V cache materialization
    - hybrid layer-major prefill in `main.c`
  - Latest clean split (2026-03-08, tie-stable fast-sampler AB3):
    - overall latency delta `-237.060 ms` (runtime faster)
    - `gpqa_diamond` score delta `+0.083333`
    - `ifeval` score delta `-0.145833`
  - Next implementation target is sampler/output-fidelity recovery on the runtime side, not another major prefill pass.
- Add explicit thinking-mode parity lane (vLLM `--reasoning-parser qwen3` + runtime equivalent output contract) before using thinking benchmarks for claim-grade comparisons.
  - First strict thinking artifact: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T223442Z.json`
  - Budget-fixed follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T224358Z.json`
  - Finalized follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260308T235628Z.json`
  - Lower-cost finalized follow-up: `benchmarks/phase5_awareness_realbench/results/phase5_qwen35_remote_strict_matrix_20260309T010353Z.json`
  - Current result:
    - lane is runnable and measurable end-to-end,
    - runtime now leads on score overall (`0.250000` vs `0.194444`),
    - and with `gpqa_max_tokens=256` it also wins overall latency (`6823.816 ms` vs `7503.000 ms`).
  - Early extension: `gsm8k` finalized thinking AB3 is directionally positive: `phase5_qwen35_remote_strict_matrix_20260310T022347Z.json`
    - runtime `0.197917` vs vLLM `0.177083`; runtime `7174.829 ms` vs vLLM `7643.231 ms`
- [~] Tighten runtime thinking-mode output contract for Qwen3.5.
  - Current issue: closed-form finalize now recovers parseable final answers, but long reasoning tasks are still too expensive and quality remains modest.
  - One-example probe artifacts: `benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_runtime_1024_20260308T230352Z.json`, `benchmarks/phase5_awareness_realbench/results/qwen35_thinking_gpqa_oneoff_vllm_1024_20260308T230352Z.json`
  - Next focus:
    - the old `512` runtime cap and long-decode host-buffer corruption are fixed,
    - the `gpqa_max_tokens=256` sweep already removed the worst latency collapse,
    - the remaining blocker is improving closed-form thinking quality beyond the current modest score while keeping this lower-cost lane,
    - and understanding why `aime25` stays `0.0` on both backends even after raising the reasoning budget and adding AIME-specific prompt/finalize guidance.
- Fix sampled Qwen3.5 reproducibility on the runtime path.
  - Root cause was harness-side: the `scripts/phase5_awareness_realbench.py` shared-first `arm_a_control` request skipped the request seed and task-specific decode payload.
  - Fixed state:
    - repeated sampled runtime-only reruns are identical: `phase5_repro_runtime_ifeval_s7_raw_seedfix_r1.json`, `phase5_repro_runtime_ifeval_s7_raw_seedfix_r2.json`
    - post-fix sampled strict matrix is now promotable: `phase5_qwen35_remote_strict_matrix_20260308T220806Z.json`
- Investigate the intermittent same-VM first tool-turn CUDA retry in Qwen3.5 wrapper runs.
  - Observed in `benchmarks/same_vm_mvp/logs/runtime_20260307T171918Z.log`: `compute/ops.cu:765` invalid argument during prefill gather; request recovered on retry and smoke still passed.
- Turn the same-VM Hermes path into an explicit demoable MVP flow with local runtime + local CPU tools + ORPO loop entrypoint.
  - Current entrypoints: `scripts/hermes_same_vm_mvp.py`, `scripts/run_samevm_qwen35_stack.sh`, `scripts/samevm_full_mvp_demo.py`, `scripts/run_samevm_full_mvp.sh`
  - Current multimodal additions already wired in code: `samevm_multimodal_status`, `samevm_embed`, `samevm_rerank`, `samevm_tts`, `samevm_stt`
  - New proof/demo entrypoints: `scripts/bootstrap_samevm_multimodal.sh`, `scripts/samevm_stack_probe.py`, `benchmarks/same_vm_mvp/README.md`
  - New proof artifacts:
    - canonical full MVP proof: `benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json`
    - runtime-admin Hermes proof: `benchmarks/same_vm_mvp/results/samevm-q35-runtime-admin-proof-v5_20260307T212852Z.json`
    - Hermes SQLite query proof: `benchmarks/same_vm_mvp/results/hermes-demo-sqlite-query-v2.json`
    - Hermes RAG search proof: `benchmarks/same_vm_mvp/results/hermes-demo-rag-search-v1.json`
    - local stack proof: `benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v5.json`
    - ORPO control-plane proof: `benchmarks/same_vm_mvp/results/samevm-orpo-probe-aws_20260307T215307Z.json`
    - Qwen3.5 ORPO reload proof: `benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json`
  - Hot-reload/hand-off now exists on AWS:
    - local ORPO output is merged into a full HF model dir,
    - packed into a new monolith container,
    - restarted as a second local runtime and verified with a real chat response.
  - MVP contract now covered in the canonical run:
    - runtime health
    - Hermes runtime-status call
    - Hermes multimodal-status call
    - basic non-thinking runtime smoke with first-turn tool calling
    - extended thinking runtime smoke with exact-match checks
    - SQLite + RAG + embedding + reranking
    - TTS + STT
    - ORPO reload sidecar proof
- Audit the current harness for stubbed-tool paths and lock the scope.
  - Direct Phase 5 runtime/vLLM lane is not stubbed.
  - Same-VM Hermes wrapper had a localized optional-import stub path in `/Users/andrewcorrea/treni/scripts/hermes_same_vm_mvp.py`; fixed.
  - Phase 3 still contains synthetic profiles by design; `realistic_v1` only reduces that bias.
- Re-prove dynamic Qwen-family runtime support on AWS.
  - `qwen35` (0.8B) direct inference restored on the live host.
  - `qwen35_4b` direct inference proved on the same host with correct pool sizing.
  - `qwen25` (Qwen2.5-0.5B-Instruct) packed fresh and direct inference proved.
- Clean the AWS host down to the active Qwen same-VM target set after the compatibility repro.
  - removed stale `monolith_qwen05*` artifacts
  - removed temporary `qwen25` host cache/artifacts after proving backward compatibility
  - kept the active `qwen35` (0.8B), `qwen35_4b` (4B), and multimodal model caches
- Deep-clean stale same-VM training/checkpoint/debug artifacts after the 4B promotion sweep.
  - removed the old `q35-orpo-notemplate-1772992302` training tree
  - removed `checkpoint-1` from `samevm-orpo-reload-q35-fixed_20260308T182430Z`
  - pruned old debug WAVs and surplus worker logs
  - current AWS root disk is back to about `4.0G` free
- Pack and prove `qwen35_9b` on a larger GPU host.
  - Current Lambda sweep is still blocked by provider-side insufficient-capacity errors plus Cloudflare `1015` rate limiting.
- Build and run a real model-dependent same-VM agent comparison suite.
  - Canonical current selector artifacts: `benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35.json`, `benchmarks/same_vm_mvp/results/samevm-agent-compare-aws-r2-qwen35_4b.json`
  - Current result on AWS A10G: `qwen35` (0.8B) = `10/10`, `qwen35_4b` (4B) = `2/10`
  - This selector lane is now historical for 4B; the repaired full suite below is the current source of truth.
- Promote the ORPO self-improvement loop from the current Qwen2.5 demo model to the main Qwen3.5 target family.
  - Current passing proof: `benchmarks/same_vm_mvp/results/samevm-orpo-reload-q35-fixed.json`
  - Current passing canonical full MVP: `benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json`
- Recover the stricter extended/thinking same-VM runtime smoke lane as a separate quality target.
  - Current MVP gate now includes the thinking profile and passes in `benchmarks/same_vm_mvp/results/samevm-full-mvp-aws-v15.json`.
  - Extended non-thinking profile now also passes cleanly: `benchmarks/qwen35_smoke/results/postmvp-extended_20260308T185130Z.json`
- Revalidate real Hermes tool availability after the wrapper import/stub audit.
  - file/code tools now load again in the same-VM wrapper (`read_file`, `write_file`, `search_files`, `patch`, `execute_code`)
  - direct live `qwen35` tool-call smoke passes at the runtime level
  - live Hermes single-tool RAG search succeeds on raw-PDF-ingested local data
- Fix `qwen35_4b` exact-output and tool-call contract parity in the same-VM Hermes path.
  - Root cause was a decoder bug in the cached linear-attention step path.
  - Repaired full-suite artifact: `benchmarks/same_vm_mvp/results/samevm-agent-full-aws-r4-qwen35_4b_20260310T184433Z.json`
  - Repaired result: `15/15`; `qwen35_4b` now passes:
    - direct runtime smoke
    - direct PDF RAG
    - direct embed/rerank
    - direct TTS/STT
    - Hermes runtime-status/RAG/SQLite/memory/execute_code
- Refresh the short model-selector lane for `qwen35_4b` after the decoder fix.
  - The old `samevm-agent-compare-aws-r2-qwen35_4b.json` artifact is stale.
  - Use the repaired full suite as the current truth until the short compare lane is rerun cleanly.
- Run the first real same-VM multimodal proof pass after bootstrap.
  - Artifact: `benchmarks/same_vm_mvp/results/samevm-stack-probe-aws-v5.json`
  - Confirmed on AWS:
    - `samevm_multimodal_status`
    - Qwen TTS
    - Qwen ASR STT on the generated WAV
    - Qwen embedding + reranking
    - SQLite + RAG in the same local tool worker
- Run a live operator-style validation pass on the current AWS deployment.
  - Direct generation speed:
    - mean end-to-end throughput: `112.37 tok/s`
    - mean decode-only throughput: `121.90 tok/s`
  - Hermes tool proofs now include: `benchmarks/same_vm_mvp/results/hermes-demo-sqlite-query-v2.json`, `benchmarks/same_vm_mvp/results/hermes-demo-rag-search-v1.json`, `benchmarks/same_vm_mvp/results/hermes-tts-v2.json`, `benchmarks/same_vm_mvp/results/hermes-stt-v2.json`
  - Real-world document caveat is now explicit:
    - extracted PDF text ingests/searches correctly,
    - raw PDF parsing is not yet native in the worker
  - Qwen3.5-4B feasibility on the current AWS host:
    - GPU memory looks plausible on the A10G `24 GB` box,
    - current root disk headroom (`~12 GB`) is the first practical blocker for download + pack
- Move same-VM multimodal bootstrap into an isolated environment for AWS runs.
  - Active AWS runs are now executed from `/home/ubuntu/.venvs/hermes-treni`.
  - Local Mac cleanup remains separate from the current AWS experiment path.
- Add a worker-side multimodal cache clear path so local tool models do not keep starving the main runtime GPU.
  - New endpoint: `POST /v1/mm/clear_cache`
  - New Hermes tool: `samevm_multimodal_clear_cache`
  - Status now reports `loaded_model_count`, `loaded_models`, and CUDA allocation/reservation.
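The handler behind such an endpoint can be sketched as below, assuming the worker keeps lazily loaded tool models in a dict (the real worker's registry and framework may differ). The torch module is injected so the logic is testable without a GPU; `torch.cuda.empty_cache()` is the standard PyTorch call that releases cached allocator blocks back to the driver once the Python references are dropped:

```python
import gc

# Sketch of the /v1/mm/clear_cache handler body: drop model references,
# collect, then release cached CUDA allocator blocks.
def clear_mm_cache(loaded_models, torch_mod):
    count = len(loaded_models)
    loaded_models.clear()                 # drop references to tool models
    gc.collect()                          # ensure objects are actually freed
    if torch_mod.cuda.is_available():
        torch_mod.cuda.empty_cache()      # return cached blocks to the driver
    return {"cleared_model_count": count, "loaded_model_count": 0}
```

Note the ordering matters: `empty_cache()` can only release memory whose tensors no longer have live references, hence the clear-then-collect-then-empty sequence.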
- Add a true vision-encoder parity lane for Qwen3.5.
  - Current state:
    - runtime probe only validates multimodal placeholder handling,
    - current vLLM launch is `--language-model-only`, so multimodal cases fail by configuration.
- Wire Q/K head RMS-norm into decoder path (`q_norm_weight`/`k_norm_weight`) and rerun strict Qwen3.5 smoke check.
  - Artifacts (`qnorm-check1`, `seed=7`, 2 samples/task): `phase5_qwen35_runtime_vs_vllm_matrix_20260302T225529Z.json`.
  - Result: still negative (`rt_score=0.0000`, `vllm_score=0.0625`; runtime latency `1880.622 ms` vs `187.453 ms`), so missing linear-attn parity remains the dominant blocker.
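For reference, per-head Q/K RMS-norm is: normalize each head's query/key vector by its root-mean-square, then scale per dimension by the learned weight (`q_norm_weight`/`k_norm_weight`). A pure-Python sketch of the math the decoder wires in (the runtime does this in CUDA):

```python
import math

# Sketch: per-head RMS-norm with a learned per-dimension scale.
def head_rms_norm(vec, weight, eps=1e-6):
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [w * x / rms for x, w in zip(vec, weight)]
```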
- Recover awareness uplift on Qwen3.5 with task-aware paper mode (`gpqa` retries on, summary `ifeval` retries off).
  - `32/task`: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32-ifevaloff_20260303T222841Z.json` (overall `B-A=+0.015624`).
  - 3-seed (`16/task`, `s7/s17/s27`): `...ifevaloff-rpt-*` mean `+0.020833`.
- Calibrate paper-trigger metrics for runtime-native uncertainty before next Phase 5 claim run.
  - New evidence (2026-03-03):
    - paper-mode selection bug is fixed in harness (`phase5_awareness_realbench.py`),
    - default paper thresholds over-trigger on runtime (`max_entropy` true on `16/16` cases in `qwen35-paperfix-runtime-sweep-p1_4_20260303T202135Z.json`),
    - summary-mode calibration fix is now implemented (`uncertainty_source=runtime_summary` + guarded vote rule), reducing retries (`16 -> 9`) and removing the immediate negative delta (`phase5_awareness_realbench_qwen35-papersummaryfix-runtime-sanity2_20260303T204120Z.json`),
    - task-aware follow-up (disable summary retries on IFEval) now yields the first repeatable positive signal: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-conf045-s32-ifevaloff_20260303T222841Z.json` (overall `B-A=+0.015624`)
      - 3-seed (`s7/s17/s27`, `16/task`) mean `+0.020833`.
    - compact invalid-parse recovery prompt + invalid-parse confidence gate (`--invalid-parse-retry-confidence-max`) now reduce overhead while preserving mean quality on repeatability:
      - `invmax=0.73` 3-seed (`16/task`) artifacts: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s16_20260303T232029Z.json`, `...-rpt-s17_20260303T232254Z.json`, `...-rpt-s27_20260303T232516Z.json`
      - quality unchanged vs prior baseline (`overall B-A mean +0.020833`), latency overhead reduced (`+712.276 ms -> +404.603 ms`).
      - `32/task` confirmation: `phase5_awareness_realbench_qwen35-papersummaryfix-runtime-compact-invmax073-s32_20260303T232755Z.json` kept `overall B-A=+0.015624` while lowering latency overhead (`+618.068 ms -> +326.187 ms`).
  - Next implementation target: reduce absolute GPQA malformed-output rate on first pass (current retries are still dominated by `invalid_parse` failures from decode quality, not uncertainty only).
- Add GPU-side BF16/F16 cold tensor conversion path (`TRENI_TENSOR_CONVERT_GPU`) and validate Qwen cold upload ablation on G5.
- Stabilize preload upload cold variance (intermittent `decoder_tensor_h2d` spikes) with explicit page-residency strategy (`TRENI_TENSOR_HOST_PREFETCH`) and verify with runtime/all-backend reruns.
- Run TTFT softmax-kernel pass on AWS G5 (`lt0_sync0`) and confirm effect.
- Replace single-thread norm kernels (`rmsnorm`/`layernorm`) with row-parallel reductions and rerun cold/warm matrix.
- Isolate seq2seq/Bart TTFT hotspot via step0 stage profiling (`TRENI_STEP0_PROFILE`) and implement `seq_q=1` attention follow-up (tiny-kernel + direct K/V cache write).
- Run 3x repeatability for the new default `seq_q=1` attention path and publish mean/std.
- Resolve strict Week-3 parity gate: rebuilt parity container (`monolith_phase3_qbm.bin`, qwen+bart+minilm) and strict external-HF parity now passes (`checked=3`, `failed=0`, `missing=0`).
- Add strict attention backend selector + repeatable A/B harness (`custom` vs `cudnn_sdpa` proxy) with runtime env overrides and summary reporting.
- Run AWS G5 attention backend A/B and deconfound call-order effects with reverse-order rerun (`attn_backend_ab_rev_20260222T144736Z`).
- Cache attention runtime env config values once per process (remove per-call `getenv` overhead in request path).
- Add `seq_q=1` hybrid tuning knobs (`TRENI_ATTN_SEQ1_USE_CUBLAS_QK/PV`) and run warm/cold matrix on G5.
- Fuse `seq_q=1` softmax+PV custom path and retune seq1 QK block sizing; rerun warm/cold matrix (`seq1_hybrid_fused_20260222T192656Z`).
- Make `cudnn_sdpa` proxy behavior explicit opt-in (`TRENI_ATTN_ALLOW_SDPA_PROXY=1`) and keep strict fused-only semantics by default.
- Probe fused cuDNN SDPA availability on H100 across alignment/shape/layout sweeps and pip/system cuDNN sources (`cudnn_sdpa_h100_probe_20260222T1935Z`).
- Add hard A/B validation guard: fail frontend runs when fused marker is missing or runtime was built with `TRENI_WITH_CUDNN=0`.
- Add fused frontend stage profiler (`TRENI_ATTN_CUDNN_FRONTEND_PROFILE`) and capture miss-cost probe artifacts.
- Publish strict fused frontend A/B rerun with fixed `qwen` model + warmed query set (`attn_backend_ab_frontend_20260222T220111Z`).
- Publish repeatability proof matrix for custom vs fused frontend (`attn_backend_frontend_matrix_20260222T221948Z`, 3 repeats each for `warm_fixed` and `mixed_churn`).
- Publish frontend claim-strength report (paired deltas + CI95) for the repeatability matrix (`attn_backend_frontend_claim_report_20260222T222958Z`).
- Add fused miss-trace + startup-preload knobs (`TRENI_ATTN_CUDNN_FRONTEND_TRACE_MISSES`, `TRENI_HTTP_PRELOAD_PROMPTS`, frontend A/B preload flag).
- Run strict frontend matrix A/B `no_preload` vs `startup_preload_4prompts` and publish compare report (`attn_backend_frontend_missmit_compare_20260222T225215Z`).
- Fix runtime preload prompt splitter bug (`TRENI_HTTP_PRELOAD_PROMPTS`) and verify multi-run execution from logs (`run=1/4 ... run=4/4`).
- Run strict frontend matrix A/B `no_preload` vs `startup_preload_benchmark_queries` and publish compare report (`attn_backend_frontend_missmit_compare_20260222T231335Z`).
- Add shape-level seq1 prebuild controls (`TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MAX_KV`, `TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_HEAD_DIM`) and expose them in frontend scripts.
- Validate no-preload fused cold TTFT fix with startup shape prebuild (`prebuild_startup_nopreload_probe_20260222T232932Z`).
- Run no-preload frontend matrix probe with shape prebuild and compare against no-preload baseline (`attn_backend_frontend_matrix_20260222T233003Z`, `attn_backend_frontend_missmit_compare_20260222T233116Z`).
- Tune shape-prebuild range (`seq_kv_max: 16 -> 10`) and reduce startup penalty while preserving no-preload fused TTFT/full (`prebuild_startup10_nopreload_probe_20260222T235944Z`).
- Probe cuDNN frontend heuristic modes (`A/B/FALLBACK`) for startup/build relief on current path.
- Run tuned shape-prebuild (`seq_kv_max=10`) matrix probe and compare against prior `seq_kv_max=16` matrix (`attn_backend_frontend_matrix_20260223T000256Z`, `attn_backend_frontend_missmit_compare_20260223T000343Z`).
- Run lower-range shape-prebuild bound probe (`seq_kv_max=8`) and confirm request-path regression (`prebuild_startup8_nopreload_probe_20260223T000600Z`).
- Add shape-gated fused policy controls (`TRENI_ATTN_CUDNN_FRONTEND_PREBUILD_SEQ1_MIN_KV`, `TRENI_ATTN_CUDNN_FRONTEND_MIN_SEQ_KV`) and expose them in frontend matrix flags.
- Fix strict-mode frontend gate fallback path so low-shape custom fallback remains inference-valid (`inference.used=true` under strict fused runs).
- Run 3x hybrid no-preload startup probe (`prebuild_hybrid10_nopreload_probe_r{1,2,3}_20260223T002214Z`) and lock startup/request repeatability.
- Run 3x hybrid frontend matrix (`attn_backend_frontend_matrix_20260223T001959Z`) and compare vs prior tuned no-gate baseline (`attn_backend_frontend_missmit_compare_20260223T002153Z`).
- Run broader-shape sanity probe for hybrid policy (
hybrid_shape_sanity_20260223T002857Z) and capture limitation: fused misses reappear forseq_kv>10long-prompt growth. - Add upper seq-kv gate control (
TRENI_ATTN_CUDNN_FRONTEND_MAX_SEQ_KV) and expose--attn-fused-max-seq-kvin frontend runners. - Rerun broader-shape sanity with bounded gate (
hybrid_shape_sanity_maxgate_20260223T003453Z) and confirm miss-cascade removal (miss_lines_head=[],inference.used=trueall requests). - Rerun 3x hybrid matrix with max gate (
attn_backend_frontend_matrix_20260223T003611Z) and compare vs prior hybrid policy (attn_backend_frontend_missmit_compare_20260223T003734Z). - Add per-request attention backend telemetry (
attentioncounters/shares in runtime HTTP responses) and aggregate it in benchmark artifacts. - Run coverage-instrumented fused matrix and profile sweeps (
attn_backend_frontend_matrix_20260223T011158Z,fused_coverage_profiles_20260223T011504Z,fused_coverage_cold_profiles_20260223T011534Z). - Decide lane direction from evidence: park cuDNN/frontend optimization and prioritize custom kernels (
2026-02-23). - (Parked) Replace current
cudnn_sdpaproxy path with true fused cuDNN SDPA/flash-attention frontend path, then rerun A/B. - Reduce shape-prebuild startup penalty while preserving no-preload fused TTFT/request gains (
~7.0s -> ~2.0sstartup baseline on G5 hybrid policy). - (Parked) Implement dynamic shape-reuse/coverage for fused path.
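A claim-strength report of the kind listed above reduces to paired per-query deltas plus a 95% interval; a minimal sketch of that computation, assuming a simple normal approximation and illustrative field names (not the harness's actual schema):

```python
import math

def paired_delta_ci95(custom_ms, fused_ms):
    """Paired per-query deltas (fused - custom) with a normal-approx 95% CI.

    Pairing by query removes between-query variance, so the CI reflects
    only the backend difference. For small n, a t-quantile would be the
    more careful choice than the fixed 1.96 used here.
    """
    assert len(custom_ms) == len(fused_ms)
    deltas = [f - c for c, f in zip(custom_ms, fused_ms)]
    n = len(deltas)
    mean = sum(deltas) / n
    # Sample variance of the paired deltas (n - 1 denominator).
    var = sum((d - mean) ** 2 for d in deltas) / (n - 1)
    half = 1.96 * math.sqrt(var / n)
    return {"mean_delta_ms": mean, "ci95": (mean - half, mean + half)}
```

When the interval excludes zero, the backend difference can be claimed as repeatable rather than noise, which is the role the report plays in this checklist.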
- Add/validate seq1 custom microfused path (`TRENI_ATTN_SEQ1_USE_MICROFUSED`) to reduce `decoder_step0_layers` launch overhead for small `seq_kv`.
  - G5 A/B result (`2026-02-23`): no net win (mean/TTFT regressions; isolated bart p99 improvement only), so kept as opt-in and defaulted off.
  - Artifact summary: `benchmarks/phase2_runtime/seq1_microfused_ab/seq1_microfused_ab_summary_20260223T014848Z.json` and `.md`.
- Add stream-cache toggles (`TRENI_LINEAR_STREAM_CACHE`, `TRENI_ATTN_STREAM_CACHE`) and run G5 on/off A/B.
  - Result (`2026-02-23`): near-neutral; keep cache enabled by default and prioritize higher-impact kernel paths.
  - Artifact summary: `benchmarks/phase2_runtime/results/stream_cache_ab_summary_20260223T015222Z.json` and `.md`.
- Prototype hash-backed registry/model-index lookups (`TRENI_REGISTRY_LOOKUP_HASH`, `TRENI_MODEL_INDEX_NAME_HASH`) and run G5 on/off A/B.
  - Result (`2026-02-23`): no meaningful cold/setup win on this profile; kept as opt-in and defaulted off.
  - Artifact summary: `benchmarks/phase2_runtime/results/registry_hash_ab_summary_20260223T020353Z.json` and `.md`.
- Fix cold-start harness startup timing granularity (`wait_for_health` now polls every 50 ms instead of a 1 s cadence).
- Add benchmark flag `--runtime-skip-startup-smoke` (default true) and validate cold startup impact on G5.
  - Result (`startup_smoke_ab_hf_20260223T030059Z`): startup-to-healthy `488.027 -> 404.184 ms` (`-17.18%`) and start-to-first-response `705.454 -> 622.167 ms` (`-11.81%`) with smoke skipped.
  - Runtime default also moved to skip startup smoke unless explicitly disabled (`TRENI_SKIP_STARTUP_SMOKE=0`).
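The `wait_for_health` granularity fix matters because a 1 s sleep quantizes every startup measurement to the poll cadence; a 50 ms loop bounds that error. A minimal sketch, with the probe callable and timeout defaults as illustrative assumptions rather than the harness's exact signature:

```python
import time

def wait_for_health(probe, timeout_s=30.0, poll_interval_s=0.05):
    """Poll `probe()` every 50 ms until it returns True; return elapsed seconds.

    A coarse 1 s cadence inflates startup-to-healthy readings by up to
    ~1000 ms of quantization error; 50 ms bounds that error to ~50 ms.
    """
    start = time.monotonic()
    deadline = start + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return time.monotonic() - start
        time.sleep(poll_interval_s)
    raise TimeoutError("runtime did not become healthy in time")
```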
- Run custom cold-path A/B probes (`TRENI_TENSOR_ENV_CACHE`, `TRENI_TENSOR_H2D_CHUNK_MB`, `TRENI_TENSOR_HOST_REGISTER`) on G5.
  - Result (`2026-02-23`): all near-neutral/noise-level on this profile; keep as optional knobs and prioritize upload/decoder kernel hotspots.
- Add per-tensor upload hotspot profiler (`TRENI_TENSOR_UPLOAD_TOPK`) and run cold probe on G5.
  - Result (`tensor_upload_topk_probe_20260223T190829Z`): dominant upload hotspot is `model.embed_tokens.weight` (`~79.3 ms`, `~63.8%` of `decoder_tensor_upload` in that probe).
- Add/benchmark container-level readahead hint (`TRENI_CONTAINER_WILLNEED`) on G5.
  - Result (`container_willneed_ab8_20260223T191145Z`): modest but repeatable cold-total gain (start->first-response `-1.94%`, startup `-3.02%`), request-path TTFT/full near-flat.
  - Runtime default moved to enable this hint unless explicitly disabled (`TRENI_CONTAINER_WILLNEED=0`).
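The readahead hint lives in the native runtime; the same OS-level hint can be sketched from Python (assuming a Linux host, and assuming the runtime's `TRENI_CONTAINER_WILLNEED` path uses the equivalent `fadvise`/`madvise` mechanism):

```python
import os

def hint_willneed(path):
    """Ask the kernel to begin readahead on a model container file.

    POSIX_FADV_WILLNEED schedules asynchronous readahead so the first
    tensor reads hit the page cache instead of cold disk; length=0 means
    "to end of file". No-op on platforms without posix_fadvise.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):  # Linux
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```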
- Validate `TRENI_CONTAINER_WILLNEED + TRENI_TENSOR_HOST_REGISTER` combo on G5.
  - Result (`container_hostreg_ab8_20260223T191255Z`): no clear gain beyond readahead-only profile.
- Validate staged-H2D upload path (`TRENI_TENSOR_H2D_STAGING`) with chunk-size follow-up and decide lane status.
  - Result (`h2d_staging_followup_summary_20260224T101324Z`): both `min64/chunk32` (8-run A/B) and `min64/chunk128` (3-run probe) regress materially on this G5 profile.
  - Decision: keep staged-H2D as opt-in experimental path, default-off, and continue Track A cold optimization on the non-staging custom path.
- Run non-staging H2D chunk-size matrix (`TRENI_TENSOR_H2D_CHUNK_MB=0/64/128`, 8 runs each) on G5.
  - Result (`h2d_chunk_matrix_summary_20260224T101730Z`): request-path and upload-stage deltas were near-neutral in this initial run set (later superseded by the `2026-02-28` full-depth AB3 promotion to default `0`).
- Implement and benchmark host page-touch pre-fault path (`TRENI_TENSOR_HOST_TOUCH`) on G5.
  - Result (`host_touch_ab_summary_20260224T102444Z`): `decoder_tensor_h2d` decreased but `decoder_tensor_prefetch`/upload increased, yielding net request regression (full `+7.73%`, infer `+8.22%`).
  - Decision: keep host-touch path as opt-in/default-off, not part of canonical Track A settings.
- Run synchronized upload diagnostic probe (`TRENI_TENSOR_UPLOAD_SYNC=0/1`, 3 runs each) to isolate conversion vs transfer cost.
  - Result (`upload_sync_probe_summary_20260224T102618Z`): conversion is measurable with sync (`~6 ms`) but H2D remains dominant (`~118 ms`), so optimization focus stays transfer-path first.
- Run synchronized host-register probe (`TRENI_TENSOR_HOST_REGISTER=0/1`, `TRENI_TENSOR_UPLOAD_SYNC=1`) on G5.
  - Result (`host_register_sync_probe_summary_20260224T102915Z`): no transfer-stage gain and slight request regression, so this lane is deprioritized.
- Implement and benchmark decoder logits u16 path (`TRENI_DECODER_LOGITS_U16_PATH`) on G5.
  - Result (`logits_u16_ab_fix1_summary_20260224T105532Z`): cold upload/setup improved slightly, but request path regressed materially (ttft/infer/full), so the path remains opt-in/default-off.
- Implement and benchmark tensor-cache hash lookup path (`TRENI_TENSOR_CACHE_HASH`) on G5.
  - Result (`tensor_cache_hash_warm3_20260224T114126Z`): near-neutral request path with slight warm `p99` regression (`+0.149 ms`) in this profile, so the path remains opt-in/default-off.
- Implement and benchmark sampler direct-store path (`TRENI_SAMPLE_DIRECT_STORE`) on G5.
  - Result (`sample_direct_store_ab_20260224T114633Z`): enabled path regressed warm request metrics (mean `+0.062 ms`, p95 `+0.076 ms`, p99 `+0.143 ms`), so it remains opt-in/default-off.
- Implement and benchmark decoder direct-out residual path (`TRENI_DECODER_DIRECT_OUT_HIDDEN`) on G5.
  - Result (`direct_outhidden_ab_20260224T115051Z`): enabled path regressed warm request and infer metrics (mean `+0.540 ms`, p95 `+0.495 ms`, p99 `+0.444 ms`, infer `+0.150 ms`), so it remains opt-in/default-off.
- Implement and benchmark multi-head seq1 attention path (`TRENI_ATTN_SEQ1_USE_MULTIHEAD`) on G5.
  - Result (`seq1_multihead_ab_20260224T125127Z`, `seq1_multihead_bart_ab_20260224T125404Z`): clear request-path wins on qwen warm/mixed and bart warm (including TTFT/infer improvements).
  - Decision: promote to default-on (`TRENI_ATTN_SEQ1_USE_MULTIHEAD=1`, `TRENI_ATTN_SEQ1_MULTIHEAD_MAX_KV=2048`), keep off-switch for fallback.
- Add decode-stage profiling beyond step0 (`TRENI_DECODE_STAGE_PROFILE`) and publish first profile artifact (`external_cold_stepn_profile_20260225T001334Z`).
- Add external-cold runtime env passthrough (`--runtime-env`) for reproducible flag-driven A/B runs.
- Run uncertainty capture A/B (`TRENI_DEMO_CAPTURE_UNCERTAINTY=1/0`) with matched profile and capture decode-stage deltas (`external_cold_uncert_on/off_20260225T0017*`).
- Rerun runtime-vLLM cold comparison on same profile (`external_cold_runtime_vllm_uncertoff_20260225T001929Z`).
- Add non-step0 split metrics (`decoder_stepN_logits_proj` vs `decoder_stepN_sample`) and run immediate qwen A/B probes (lt16, fast16 GEMMEx, direct-u16-input, `lt_u16` workspace); all were near-neutral/regressed and reverted.
- Run full-depth (`--layers 36`, `--pool-mb 16384`) runtime-vLLM cold compare and validate hotspot shift (`decoder_stepN_layers` dominant).
- Run full-depth preload follow-up (`preload=1` and `preload=64`) to isolate cache-miss vs decode-compute contribution.
- Rerun full-depth seq1 hybrid matrix (`default` vs `qk` vs `pv` vs `both`) and confirm default custom remains best.
- Re-test full-depth direct-u16-input linear path; no gain, reverted.
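The `--runtime-env` passthrough exists so each A/B leg is fully described by its flag string; a sketch of parsing such a spec into a child-process environment (the `KEY=VAL,KEY=VAL` spec format here is an assumption for illustration, not the harness's documented syntax):

```python
import os

def parse_runtime_env(spec):
    """Parse 'KEY=VAL,KEY=VAL' into a dict of env overrides."""
    overrides = {}
    for item in filter(None, (s.strip() for s in spec.split(","))):
        key, _, val = item.partition("=")
        overrides[key] = val
    return overrides

def child_env(spec):
    """Merge the overrides onto the current environment for subprocess launch,
    so both A/B legs inherit identical base env plus the probed flags."""
    env = dict(os.environ)
    env.update(parse_runtime_env(spec))
    return env
```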
- Implement full-depth FFN u16 weight path (`TRENI_DECODER_FFN_U16_PATH`) and run runtime-vLLM A/B (`ab2` artifacts, 2026-02-25).
- Implement full-depth decoder attention u16 path (`TRENI_DECODER_ATTN_U16_PATH`) and run 3-seed runtime-vLLM matrix (`ab3`, 2026-02-25).
- Re-test logits u16 on top of full-depth attention/ffn u16 (`TRENI_DECODER_LOGITS_U16_PATH`) and run 3-seed runtime-vLLM matrix (`ab3`, 2026-02-25).
- Revert regressing fused `gate+up` FFN projection path after A/B regression and restore non-fused baseline.
- Implement shared decode-input pre-cast reuse for full-depth u16 decode GEMMs (q/k/v and gate/up) and run 3-seed runtime-vLLM matrix.
- Add u16 cublasLt cached path (with safe fallback) for decode u16 GEMMs and run 3-seed runtime-vLLM matrix.
- Implement residual-fused u16 Lt decode path (`o_proj` + `ffn_down` no-bias accumulate) and run 3-seed runtime-vLLM matrix.
- Add full-depth FFN sub-stage profiling (`ffn_proj_cast`, `ffn_proj_gate`, `ffn_proj_up`) and publish split profile artifact (`external_cold_layers36_stepn_profile_ffnsub_20260226T094140Z.log`).
- Probe batched gate+up FFN projection follow-up and revert after regression (higher `ffn_proj` and slower full decode in A/B).
- Implement attention qkv fused-alias path (`TRENI_DECODER_ATTN_U16_QKV_FUSED`) and run 3-seed runtime-only + runtime-vLLM A/B matrices.
- Promote qkv fused-alias path as default-on in the full-depth u16 lane after parity pass and repeatability wins.
- Probe `TRENI_LINEAR_LT_WORKSPACE_MB` in full-depth lane and reject after regression (full `1711.213 -> 1754.568 ms` in trial A/B).
- Implement FFN activation-to-u16 fused path (`TRENI_DECODER_FFN_ACT_U16_FUSED`) and run 3-seed runtime-only + runtime-vLLM A/B matrices.
- Promote FFN activation-to-u16 fused path as default-on after strict parity pass and repeatability wins.
- Probe FAST_16 compute modes on top of u16-Lt; keep as non-canonical lane and revert promotion (tiny request-full delta, noisy startup outlier in repeatability set).
- Probe full-depth `TRENI_DECODER_FFN_PROJ_U16_FUSED` in 3-seed runtime-only + runtime-vLLM A/B and reject after consistent regression.
- Add/probe `TRENI_LINEAR_U16_FAST_COMPUTE` in full-depth runtime-only 3-seed A/B; initial signal near-neutral/slight regression (superseded by later AB5 promotion rerun).
- Probe full-depth linear Lt knobs (`TRENI_LINEAR_LT_WORKSPACE_MB=64`, `TRENI_LINEAR_USE_LT=0`) and reject both after material regressions.
- Replace process-wide Lt disable-on-first-fail with shape-scoped Lt fail cache and run full-depth 3-seed runtime-only + runtime-vLLM validation (near-neutral; no canonical shift).
- Refresh full-depth decode-stage profile (`TRENI_DECODE_STAGE_PROFILE` + `TRENI_DECODER_STEP_PROFILE`) and relock hotspot ordering (`stepN_layers` dominant, FFN `ffn_proj` still top layer sub-stage).
- Implement full-depth FFN proj batched-two u16 GEMM path (`TRENI_DECODER_FFN_PROJ_U16_BATCHED2`) and run 3-seed runtime-only + runtime-vLLM A/B matrices.
- Promote FFN proj batched-two path as default-on after strict parity pass and stage-profile corroboration.
- Promote full-depth direct-out hidden path as default-on in this lane (`TRENI_DECODER_DIRECT_OUT_HIDDEN`) after 3-seed runtime-only A/B and strict parity pass.
- Add completion-length capture to external-cold harness (`completion_chars`, `completion_words`, streamed usage fields) for runtime and vLLM.
- Add fixed-token fairness controls to vLLM leg (`ignore_eos`, streamed usage capture) and rerun 3-seed runtime-vLLM comparison with matched `completion_tokens=64`.
- Implement fused qkv split+bias path (`TRENI_DECODER_QKV_SPLIT_BIAS_FUSED`) replacing copy+bias sequence, validate 3-seed runtime-only A/B, and promote default-on after strict parity pass.
- Wire `TRENI_DECODER_LOGITS_U16_FAST_COMPUTE` into runtime logits projection path (`*_f32_input_ex(..., use_fast_compute)`) and run full-depth runtime-only 3-seed A/B.
  - Result (`2026-02-27`): no material win and slight request-full regression (full `+0.767 ms`), so the knob is not promoted.
- Run fixed-token runtime-vLLM sanity rerun after logits-fast hook integration.
  - Result: matched `completion_tokens=64` still shows runtime TTFT lead and vLLM request-full lead in this profile.
- Run strict Week 3 parity after logits-fast hook integration.
  - Result: pass (`checked=3`, `failed=0`, strict).
- Implement u16 tensor-cache path (`copy_tensor_to_gpu_u16` lookup/store) and add explicit env gate `TRENI_TENSOR_CACHE_U16` (default-on) for claim-safe A/B.
- Route logits-u16 upload path through shared cached helper (`copy_tensor_to_gpu_u16`) instead of uncached manual copy.
- Run full-depth runtime-only 3-seed A/B for `TRENI_TENSOR_CACHE_U16=0/1`.
  - Result: large request-path win (full `-472.529 ms`, infer `-471.235 ms`) with near-flat TTFT.
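The u16 tensor-cache win comes from never re-converting and re-uploading an identical weight tensor; a behavioral sketch of the lookup/store shape in Python, where `upload_u16` is an illustrative stand-in for the CUDA-side helper and the name-keyed scheme is an assumption:

```python
_U16_CACHE = {}

def copy_tensor_to_gpu_u16_cached(name, host_bytes, upload_u16, enabled=True):
    """Keyed lookup/store around the raw upload helper.

    `name` is assumed unique and stable per weight tensor, so a cache hit
    skips both the u16 conversion and the H2D copy entirely; the env gate
    (TRENI_TENSOR_CACHE_U16 in the item above) maps to `enabled`.
    """
    if enabled and name in _U16_CACHE:
        return _U16_CACHE[name]       # hit: reuse existing device handle
    handle = upload_u16(host_bytes)   # miss: convert + H2D once
    if enabled:
        _U16_CACHE[name] = handle
    return handle
```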
- Run same-window runtime-vLLM A/B for `TRENI_TENSOR_CACHE_U16=0/1`.
  - Result: request-full ordering flipped from runtime slower (`+338.124 ms`) to runtime faster (`-98.671 ms`).
- Re-run strict Week 3 parity on final u16-cache default-on build.
  - Result: pass (`checked=3`, `failed=0`, strict).
- Add optional `TRENI_LINEAR_BATCHED2_USE_LT` lane for FFN batched2 GEMMs and run full-depth A/B (`ab3` runtime-only + `ab2` runtime-vLLM).
  - Result (`2026-02-27T222830Z`): regressed runtime request path (full `+12.469 ms`, infer `+12.534 ms`); not promoted.
- Run higher-N full-depth repeatability on `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=1` + `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1` (`ab8` runtime-only).
  - Result (`2026-02-27T223241Z`): near-noise uplift (full `-0.198 ms`, infer `-0.101 ms`); not promoted.
- Extend FFN fused path bias-deferral logic (fold gate/up bias into fused SiLU*Up activation when `TRENI_DECODER_FFN_PROJ_U16_FUSED=1`) and rerun full-depth A/B (`ab3` runtime-only + `ab2` runtime-vLLM).
  - Result (`2026-02-27T223458Z`): runtime-only effect is negligible (full `-0.383 ms`), no canonical shift; not promoted.
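The bias-deferral idea is algebraically simple: instead of adding the gate/up biases in separate elementwise passes before the activation, fold them into the fused SiLU*Up step itself. A scalar-per-element numerical sketch of the equivalence (pure Python; the kernel fuses this across the hidden dimension):

```python
import math

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def ffn_act_separate(gate, up, b_gate, b_up):
    """Baseline: two standalone bias-add passes, then fused activation."""
    g = [gi + b_gate for gi in gate]
    u = [ui + b_up for ui in up]
    return [silu(gi) * ui for gi, ui in zip(g, u)]

def ffn_act_bias_folded(gate, up, b_gate, b_up):
    """Bias-deferred variant: biases folded into the SiLU*Up kernel,
    saving two full elementwise passes over the activations."""
    return [silu(gi + b_gate) * (ui + b_up) for gi, ui in zip(gate, up)]
```

The two paths are bit-for-bit equivalent in exact arithmetic; any benchmark delta is purely launch/memory-traffic savings, which is why a negligible result like the one above is a plausible outcome.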
- Run fast-profile (`--layers 2`) higher-N A/B for logits fast-compute (`TRENI_DECODER_LOGITS_U16_FAST_COMPUTE=0/1`, runtime-only `ab8`).
  - Result (`2026-02-28T005529Z`): near-noise movement (full `-0.299 ms`), stage profile unchanged; not promoted.
- Run mixed-load p99 repeatability on canonical lane (`run_mode=mixed_load`, `http_runs=120`, 3 runs).
  - Result (`2026-02-28T005626Z`): stable set (mean `122.247 ms`, p95 `198.518 ms`, p99 `199.608 ms`), no canonical config change.
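The mean/p95/p99 fields throughout these packs are empirical statistics over per-request latencies; a minimal sketch using the nearest-rank percentile convention (one common choice; the harness may use an interpolating variant that differs slightly at small n):

```python
import math

def latency_summary(samples_ms):
    """mean/p95/p99 over request latencies, nearest-rank percentiles."""
    xs = sorted(samples_ms)
    n = len(xs)
    def pct(p):
        # nearest-rank: smallest value with at least p% of samples at or below it
        rank = max(1, math.ceil(p / 100.0 * n))
        return xs[rank - 1]
    return {"mean": sum(xs) / n, "p95": pct(95), "p99": pct(99)}
```

With `http_runs=120` per run, p99 rests on the two slowest requests, which is why the checklist repeats each mixed-load set three times before trusting tail movement.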
- Re-run strict Week 3 parity after latest follow-up patches.
  - Result (`2026-02-28T005805Z`): pass (`checked=3`, `failed=0`, strict).
- Fix `phase2_runtime_benchmark.py` timing parser decimal handling (`TIMING_RX`) so stage telemetry preserves sub-ms values.
  - Result (`2026-02-28`): `decoder_step_profile_*` fields now parse as true decimals (for example `ffn_proj_mean ~0.366 ms/layer`) instead of integer-truncated values.
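The bug class here is a timing regex that only captures the integer part of a value, silently truncating `0.366ms` to `0`. A sketch of the fixed pattern; the actual `TIMING_RX` and log-line format in `phase2_runtime_benchmark.py` are assumptions for illustration:

```python
import re

# A broken pattern like r"(\w+)=(\d+)ms" drops the fraction of "0.366ms".
# Making the fractional part an optional group preserves sub-ms values.
TIMING_RX = re.compile(r"(?P<stage>[\w.]+)=(?P<ms>\d+(?:\.\d+)?)ms")

def parse_stage_timings(log_line):
    """Extract stage -> milliseconds from a telemetry line,
    keeping decimal precision for per-layer sub-ms stages."""
    return {m.group("stage"): float(m.group("ms"))
            for m in TIMING_RX.finditer(log_line)}
```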
- Rerun full-depth profile probes after parser fix (`cold_first_hit` + `warm_steady_state`, qwen, `layers=36`).
  - Result (`2026-02-28T011037Z`): hotspot remains FFN-heavy (`decoder_step_profile_ffn_proj_mean ~0.366 ms/layer`, `ffn_down_resid_mean ~0.190 ms/layer`, `step_total_mean ~0.705 ms/layer`).
- Run full-depth warm AB3 for `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0/1` on the fixed profile.
  - Result (`ffn_fast_compute_ab3_20260228T011146Z_summary`): slight regression (request `+0.317 ms`, infer `+0.305 ms`), no stage win; not promoted.
- Replace batched2 Lt fallback with one-call strided-batched Lt path and rerun AB3 (`TRENI_LINEAR_BATCHED2_USE_LT=0/1`).
  - Result (`batched2lt_strided_ab3_20260228T011651Z_summary`): near-noise in warm AB3 (request `-0.190 ms`, infer `-0.194 ms`, stage flat) and slight regression in runtime-only external-cold probe (full `+0.579 ms`); not promoted.
- Implement FFN gate/up dual-bias fused add path (`TRENI_DECODER_FFN_BIAS_PAIR_FUSED`) and run full-depth warm/cold A/B.
  - Result (`ffn_bias_pair_ab3_20260228T020257Z/summary.json`, warm AB3): small warm gain (request `-0.229 ms`, p99 `-0.390 ms`, infer `-0.090 ms`) with near-flat TTFT (`+0.009 ms`).
  - Cold follow-up (`ffn_bias_pair_cold_ab2_20260228T020723Z/summary.json`, 3 seeds each after extension): slight cold regression (full `+1.928 ms`, infer `+1.875 ms`), so this remains non-canonical for now.
- Add optional batched2 `seq1` split-GEMM path (`TRENI_LINEAR_BATCHED2_SPLIT_SEQ1`) and run full-depth warm/cold AB3.
  - Warm AB3 (`batched2_splitseq1_ab3_20260228T025841Z/summary.json`): near-noise/slight regression (request `+0.014 ms`, p99 `+0.124 ms`, infer `+0.105 ms`).
  - Cold AB3 (`batched2_splitseq1_cold_ab3_20260228T025841Z/summary.json`): small gain (full `-2.070 ms`, infer `-2.002 ms`, ttft `-0.021 ms`).
  - Decision: keep opt-in and non-canonical (no warm-path win).
- Add optional batched2 dup-input strided lane (`TRENI_LINEAR_BATCHED2_DUP_INPUT`) and run full-depth warm/cold AB3.
  - Warm AB3 (`batched2_dupinput_ab3_20260228T031816Z/summary.json`): slight mean regression (request `+0.317 ms`, infer `+0.293 ms`, ttft `+0.009 ms`) despite minor p99 drop (`-0.208 ms`).
  - Cold AB3 (`batched2_dupinput_cold_ab3_20260228T031816Z/summary.json`): regression (full `+1.307 ms`, infer `+1.388 ms`, ttft `+0.010 ms`).
  - Decision: keep opt-in and non-canonical.
- Probe dup-input v2 implementation (replace two D2D memcpys with one duplicate kernel) as warm AB2 gate and revert if not better.
  - Gate AB2 (`batched2_dupinput_v2warm_ab2_20260228T032741Z/summary_gate_ab2.json`): regression (request `+0.438 ms`, infer `+0.381 ms`, ttft `+0.015 ms`, p99 `+0.217 ms`).
  - Decision: rejected and reverted before AB3/cold expansion.
- Recheck prior FFN projection alternatives on current baseline with warm AB2 gates.
  - `TRENI_DECODER_FFN_PROJ_U16_FUSED=0/1` (`ffn_proj_u16_fused_gate_ab2_20260228T033524Z/summary_gate_ab2.json`): near-flat/slight mean regression (request `+0.149 ms`, infer `+0.173 ms`), not expanded.
  - `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0/1` (`ffn_proj_batched2_f32input_gate_ab2_20260228T033758Z/summary_gate_ab2.json`): regression (request `+0.236 ms`, infer `+0.248 ms`, p99 `+0.512 ms`), not expanded.
- Probe optional linear u16 `CUBLAS_COMPUTE_16F` lane as warm AB2 gate and revert if non-winning.
  - Gate AB2 (`linear_u16_compute16f_gate_ab2_20260228T034412Z/summary_gate_ab2.json`): regression (request `+0.210 ms`, infer `+0.240 ms`, p99 `+0.594 ms`).
  - Decision: rejected and reverted; no AB3 expansion.
- Rebaseline full-depth warm profile on explicit u16 lane (`qwen`, `layers=36`) before next FFN probes.
  - Result (`warm_profile_qwen_layers36_refresh_20260228T040010Z` + run logs): active hotspot remains FFN projection (`ffn_proj ~0.196 ms` of `step_total ~0.402 ms`) under batched2.
- Implement optional FFN gate/up contiguous pair-pack path (`TRENI_DECODER_FFN_PAIR_PACK_U16`) and run warm AB3 gate.
  - AB3 artifact (`ffn_pair_pack_gate_ab2_20260228T040616Z/summary_ab3.json`): small warm uplift (request `-0.423 ms`), but both off/on runs already showed contiguous pair active; non-causal for promotion.
  - Decision: keep implementation as experimental default-off (`TRENI_DECODER_FFN_PAIR_PACK_U16=0`), non-canonical.
- Rerun batched2 Lt on explicit u16 lane (`TRENI_LINEAR_BATCHED2_USE_LT`) with warm AB3 + cold AB3.
  - Warm AB3 (`batched2_use_lt_u16lane_gate_ab2_20260228T041041Z/summary_ab3.json`): small gain (request `-0.313 ms`, infer `-0.468 ms`, p99 `-0.511 ms`).
  - Cold AB3 (`batched2_use_lt_u16lane_cold_ab2_20260228T041359Z/summary_ab3.json`): regression (full `+1.165 ms`, infer `+1.424 ms`).
  - Decision: keep opt-in/non-canonical (warm-only win not enough).
- Add adaptive delayed-on policy for batched2 Lt (`TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS`) and rerun full-depth warm/cold AB3.
  - `5000ms` AB3 (`batched2_lt_enable_after_ms5000_warm_ab3_20260228T104525Z`, `batched2_lt_enable_after_ms5000_cold_ab3_20260228T104712Z`): warm gain, but small cold full regression (`+0.422 ms`) remained.
  - `10000ms` AB3 (`batched2_lt_enable_after_ms10000_warm_ab3_20260228T105028Z`, `batched2_lt_enable_after_ms10000_cold_ab3_20260228T105213Z`): warm and cold both improved:
    - warm: request `-0.363 ms`, infer `-0.326 ms`, p99 `-0.696 ms`.
    - cold: startup `-4.307 ms`, full `-0.635 ms`, infer `-0.347 ms`, TTFT `-0.070 ms`.
  - Strict parity passed (`week3_parity_report_batched2_lt_delay10000_20260228T105329Z.json`, `checked=3`, `failed=0`).
  - Default-path strict parity smoke also passed (`week3_parity_report_batched2_lt_defaultdelay_20260228T110825Z.json`).
  - Same-window mixed-load A/B (`mixed_load_defaultdelay_onoff_ab3_20260228T115010Z.json`) regressed with delayed-on (mean `+0.846 ms`, p95 `+1.627 ms`, p99 `+0.679 ms`).
  - Decision: keep lane opt-in/non-canonical; parser defaults remain off (`TRENI_LINEAR_BATCHED2_USE_LT=0`, `TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS=0`).
  - Post-revert strict parity passed on defaults (`week3_parity_report_postrevert_defaults_20260228T115543Z.json`).
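The delayed-on policy gates an optimization that hurts cold start but helps warm steady state behind process uptime. A behavioral sketch of the decision, reusing the env names from the item above but with threshold semantics assumed for illustration:

```python
import os
import time

_PROCESS_START = time.monotonic()

def batched2_lt_enabled():
    """Delayed-on gate: keep the Lt path off during cold start, then
    enable it once process uptime passes the configured threshold.

    With ENABLE_AFTER_MS=0 the lane behaves as a plain on/off switch;
    a nonzero threshold trades early-request behavior for warm gains.
    """
    if os.environ.get("TRENI_LINEAR_BATCHED2_USE_LT", "0") != "1":
        return False
    after_ms = float(os.environ.get(
        "TRENI_LINEAR_BATCHED2_USE_LT_ENABLE_AFTER_MS", "0"))
    uptime_ms = (time.monotonic() - _PROCESS_START) * 1000.0
    return uptime_ms >= after_ms
```

Note the uptime clock, not request count, drives the switch, which is why mixed-load runs (where requests keep arriving long after the threshold) can still expose a regression once the lane is live.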
- Rerun Track A parser-default foundation pack (warm/cold/mixed) after delayed-Lt probe and publish canonical decision artifact.
  - Pack summary: `foundation_defaultdelay_pack_20260228T114315Z.json` (`.md` companion).
  - Warm AB3 means: request `147.258 ms`, p99 `247.617 ms`, infer `128.450 ms`, TTFT `16.999 ms`.
  - Cold AB3 means: startup `425.532 ms`, full `598.787 ms`, infer `580.173 ms`, TTFT `12.210 ms`.
  - Mixed repeatability vs prior canonical: `mean +2.841 ms`, `p95 +5.587 ms`, `p99 +5.140 ms` (`mixed_load_repeatability_compare_defaultdelay_vs_prev_20260228T114748Z.json`).
  - Decision unchanged: delayed batched2 Lt stays opt-in/non-canonical.
- Add experimental FFN batched2 Lt prewarm path (`TRENI_DECODER_FFN_BATCHED2_LT_PREWARM`) and run fixed-Lt warm/cold A/B.
  - Warm AB2 (`batched2_lt_prewarm_warm_ab2_20260228T042453Z/summary_gate_ab2.json`): small gain (request `-0.328 ms`, infer `-0.394 ms`).
  - Cold AB3 (`batched2_lt_prewarm_cold_ab3_20260228T042649Z/summary_ab3.json`): first-hit gain (full `-1.497 ms`, infer `-1.406 ms`).
- Run direct same-window combo A/B (`lt=0,prewarm=0` vs `lt=1,prewarm=1`) to test promotability.
  - Combined summary (`batched2_lt_prewarm_combo_summary_20260228T042733Z.json`): mixed outcome.
  - Warm AB3 (`batched2_lt_prewarm_combo_warm_ab2_20260228T042733Z/summary_ab3.json`): regression (request `+0.198 ms`, infer `+0.178 ms`, p99 `+0.407 ms`).
  - Cold AB3 (`batched2_lt_prewarm_combo_cold_ab3_20260228T042733Z`): improvement (full `-1.099 ms`, infer `-0.819 ms`).
  - Decision: keep prewarm path experimental default-off; non-canonical.
- Probe and promote `TRENI_DECODER_FFN_DOWN_U16_FAST_COMPUTE` on canonical full-depth lane.
  - Warm AB3 (`ffn_down_fast_compute_gate_ab3_20260228T044546Z/summary_ab3.json`): request `-0.565 ms`, infer `-0.566 ms`, p99 `-1.405 ms`, TTFT `-0.030 ms`.
  - Cold AB3 (`ffn_down_fast_compute_cold_ab3_20260228T044753Z/summary_ab3.json`): startup `-8.405 ms`, full `-0.351 ms`, infer `-0.406 ms`, TTFT `-0.028 ms`.
  - Strict parity (`week3_parity_report_ffn_down_fast_20260228T044846Z.json`): pass (`checked=3`, `failed=0`, strict).
  - Decision: promote default-on in runtime parser (`TRENI_DECODER_FFN_DOWN_U16_FAST_COMPUTE=1`).
- Run post-promotion retest matrix on updated canonical lane (AB3/AB5) for remaining FFN toggles.
  - New structural stacked-GEMM lane `TRENI_LINEAR_BATCHED2_STACKED_SEQ1` AB3: regressed warm (request `+1.259 ms`, infer `+1.229 ms`, p99 `+2.830 ms`) with near-flat cold full (`+0.030 ms`), remains experimental/default-off.
  - `TRENI_LINEAR_BATCHED2_SPLIT_SEQ1` retest AB3: regressed warm (request `+0.964 ms`) and cold (full `+1.496 ms`), remains non-canonical.
  - `TRENI_LINEAR_BATCHED2_USE_LT` fixed-on retest AB3: warm gain (request `-0.855 ms`) but cold startup/full penalty (startup `+10.474 ms`, full `+0.330 ms`); delayed-on follow-up still regressed mixed-load and remains non-canonical.
  - `lt=1 + prewarm=1` combo retest: AB3 looked positive, but AB5 confirm failed on cold (full `+1.152 ms`, startup `+3.199 ms`), remains non-canonical.
  - `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE` retest AB3: near-noise warm gain with cold regression (full `+0.577 ms`), non-canonical.
  - `TRENI_LINEAR_U16_FAST_COMPUTE` initial AB3 signal was mixed and stayed pending.
- Revalidate `TRENI_LINEAR_U16_FAST_COMPUTE` with higher-N repeats and promote if stable.
  - warm+mixed AB5 (`linearfast_ab5_20260228T124736Z/summary_ab5.json`): `on-off` stayed positive in both modes (warm request `-0.139 ms`, mixed request `-0.139 ms`).
  - cold AB3 (`linearfast_cold_ab3_20260228T124510Z/summary_ab3.json`): near-flat full (`+0.302 ms`), better startup (`-4.207 ms`) and TTFT (`-0.019 ms`).
  - strict parity pass (`week3_parity_report_linearfast_20260228T124557Z.json`, `checked=3`, `failed=0`).
  - post-default strict parity smoke also passed (`week3_parity_report_post_linearfast_default_20260228T125804Z.json`).
  - Decision: promote runtime parser default `TRENI_LINEAR_U16_FAST_COMPUTE=1`.
- [~] Optimize custom-kernel best path with explicit profile split:
  - fast profile (`--layers 2`): `decoder_stepN_logits_proj` first.
  - full depth (`--layers 36`, `--pool-mb 16384`): preserve and extend post-cache lead (runtime full `~1190 ms` in latest same-window A/B) with deeper layer-compute work (still FFN-heavy) and higher-N repeatability.
  - continue mixed-load p99 and cold upload/convert/H2D in parallel.
- Rerun canonical foundation pack (`warm/cold/mixed` AB3) after `TRENI_LINEAR_U16_FAST_COMPUTE` promotion and compare vs prior parser-default pack.
  - pack root: `foundation_linearfastdefault_pack_20260228T134157Z` (`summary_ab3.json`).
  - warm/cold: near-flat/slightly slower vs prior parser-default foundation (warm request `+0.101 ms`, cold full `+0.491 ms`).
  - mixed: improved (request `-0.629 ms`, p95 `-1.281 ms`, p99 `-0.163 ms`).
- Rerun same-window runtime-vLLM full-depth AB3 on updated canonical lane.
  - run set: `aws_speedpass_runtime_vllm_linearfastdefault_ab3_20260228T134630Z` (`summary_ab3.json`).
  - averaged first-request full: runtime `1185.186 ms`, vLLM `1305.971 ms` (`vLLM/runtime = 1.102x`).
- Run higher-N same-window runtime-vLLM full-depth rerun on updated defaults (`AB5`) and publish aggregate summary.
  - run set: `aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z`.
  - summary: `aws_speedpass_runtime_vllm_newdefaults_ab5_20260228T145502Z/summary_ab5.json` and `.md`.
  - AB5 means:
    - runtime full `1184.812 ms`, runtime TTFT `14.640 ms`, runtime cold-total full `4190.848 ms`.
    - vLLM full `1318.675 ms`, vLLM TTFT `50.309 ms`, vLLM cold-total full `24350.818 ms`.
    - ratios (`vLLM/runtime`): full `1.113x`, TTFT `3.436x`, cold-total full `5.810x`.
  - compare vs prior AB3: `compare_vs_prev_linearfastdefault_ab3.json`/`.md` confirms runtime full stayed slightly better (`-0.375 ms`) while keeping the same direction on full and cold-total ratios.
- Test and decide batched2-Lt default-path fast-fallback short-circuit experiment.
  - isolation AB3 (`fastfallback_isolation_ab3_20260228T140122Z/summary_ab3.json`) showed warm regression (request `+1.155 ms`) and mixed near-flat/slightly worse (mean `+0.144 ms`), despite cold full improvement (`-0.846 ms`).
  - Decision: reverted; keep prior canonical path.
  - post-revert strict parity passed (`week3_parity_report_post_fastfallback_revert_20260228T140626Z.json`).
- Re-evaluate `TRENI_TENSOR_H2D_CHUNK_MB` on current canonical full-depth profile and promote if still positive.
  - cold AB3 (`h2d_chunk_cold_ab3_20260228T142114Z/summary_ab3.json`): `chunk0 - chunk64` improved startup/full/infer and reduced `decoder_tensor_h2d`/`decoder_tensor_upload`.
  - warm+mixed AB3 (`h2d_chunk_warm_mixed_ab3_20260228T142258Z/summary_ab3.json`): warm improved and mixed was near-neutral/slightly better.
  - strict parity after promotion passed (`week3_parity_report_h2dchunk0_default_20260228T142805Z.json`).
  - Decision: promote parser default to `TRENI_TENSOR_H2D_CHUNK_MB=0` (no chunking).
- Run full-depth post-AB5 gate sweep on current defaults and re-check delayed-Lt/FFN-proj-fast promotability.
  - gate AB2 set (`fulldepth_gate_newdefaults_20260228T150709Z/summary_gate_ab2.json`):
    - delayed-Lt: warm/mixed request deltas `-0.384 / -0.256 ms` (promoted to AB3 confirmation).
    - `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=1`: mixed/noise signal at that gate stage (warm p99 `+0.129 ms`, mixed p99 `+0.022 ms`), not promoted in that cycle.
  - delayed-Lt AB3 confirmation (`fulldepth_delayedlt_ab3_20260228T151322Z/summary_ab3.json`):
    - warm `on-off`: request `-0.330 ms`, infer `-0.270 ms`, p99 `-0.098 ms`;
    - mixed `on-off`: request `+0.173 ms`, infer `+0.191 ms`, p99 `+0.291 ms`.
  - Decision: keep delayed-Lt non-canonical on parser defaults.
- Run tuned delayed-Lt slow-gate rescue probe and verify mixed-load tail behavior.
  - AB2 artifact (`delayedlt_tunedslow_ab2_20260228T152358Z/summary_gate_ab2.json`).
  - tuned `on` config: `TRENI_LINEAR_BATCHED2_LT_SLOW_RATIO_PCT=0`, `TRENI_LINEAR_BATCHED2_LT_SLOW_STREAK_DISABLE=4` (+ delayed-Lt envs).
  - deltas (`on-off`):
    - warm: request `-0.185 ms`, infer `-0.054 ms`, p99 `-0.417 ms`;
    - mixed: request `-0.004 ms`, infer `-0.032 ms`, p99 `+0.221 ms`.
  - Decision: still non-promotable (mixed near-flat with p99 regression); keep delayed-Lt non-canonical.
- Patch FFN-proj batched2 mixed-input fallback loop and re-gate `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT`.
  - code patch: `monolith/models/linear.cu` caches unsupported mixed-input batched2 GEMM combos and short-circuits repeated failed calls.
  - forced-Lt diagnostic before/after:
    - pre-patch: `ffnproj_f32input_ltalways_20260228T153113Z.json`
    - post-patch: `ffnproj_f32input_ltalways_patch_20260228T154942Z.json`
    - request mean improved `175.208 -> 173.124 ms`; `linear_batched2_lt_failures` dropped `26112 -> 1`.
  - canonical AB2 re-gate (`ffnproj_f32input_gate_patch_ab2_20260228T155033Z/summary_gate_ab2.json`):
    - warm on-off: request `+0.026 ms`, p99 `+0.099 ms`;
    - mixed on-off: request `+0.057 ms`, p99 `+0.446 ms`.
  - Decision: keep `TRENI_DECODER_FFN_PROJ_U16_BATCHED2_F32_INPUT=0` canonical; patch retained for robustness.
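The `linear.cu` patch above is a negative-result cache: once the fast batched2 GEMM rejects an input combo, that combo is remembered and future calls skip straight to the fallback instead of re-attempting the failing kernel every step (which is what produced the 26112 failure count). A language-neutral sketch of the idea in Python, with all names illustrative:

```python
# Sketch of the failed-combo cache described above. `fast_path`, `fallback`,
# and the key shape are assumptions; the real code lives in CUDA.
_unsupported_combos: set = set()

def batched2_matmul(a, b, key, fast_path, fallback):
    """`key` identifies the input combo, e.g. (a_dtype, b_dtype, layout)."""
    if key in _unsupported_combos:
        return fallback(a, b)          # short-circuit: known-unsupported combo
    try:
        return fast_path(a, b)
    except NotImplementedError:        # fast kernel rejects this combo
        _unsupported_combos.add(key)   # cache the failure once
        return fallback(a, b)
```

The point of the design is that the expensive failed attempt happens at most once per combo, so a steady-state decode loop pays only the cheap set lookup.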
- Re-run `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE` on the clean full-depth path and validate promotability.
  - profiled AB3 (`ffnprojfast_fullstep_ab3_20260228T160255Z/summary_ab3.json`): `on-off` request `-0.370 ms`, infer `-0.348 ms`, p99 `-0.533 ms`.
  - non-profiled warm AB3 (`ffnprojfast_fullwarm_ab3_20260228T160358Z/summary_ab3.json`): `on-off` request `-0.249 ms`, infer `-0.225 ms`, p99 `-0.328 ms`.
  - strict parity passed with explicit candidate env and temporary promoted build (`week3_parity_report_ffnprojfast_candidate_20260228T160459Z.json`, `week3_parity_report_ffnprojfast_default_20260228T160639Z.json`).
  - post-promotion sanity AB3 (`ffnprojfast_default_sanity_ab3_20260228T160557Z/summary_ab3.json`) stayed near-flat and directionally positive on means (default-`force_off` request `-0.094 ms`).
  - Interim result: the Qwen-focused path looked positive, but a full foundation gate was still required before a canonical decision.
- Run full foundation same-window gate (`default` vs `force_off`) for `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE` and finalize the canonical default.
  - foundation pack (`foundation_ffnprojfastdefault_pack_20260228T194204Z/summary_ab3.json`) was slower than the prior canonical across warm/cold/mixed.
  - same-window gate AB2 (`foundation_ffnprojfast_gate_ab2_20260228T195240Z/summary_gate_ab2.json`) showed warm/cold mean regressions with the default on (warm request `+0.489 ms`, cold full `+0.746 ms`), mixed near-flat with better tails.
  - Decision: keep parser canonical default `TRENI_DECODER_FFN_PROJ_U16_FAST_COMPUTE=0` (opt-in only).
Track B: Internal vs External Routing
- Minimal external baseline harness.
- Matched task set and budgets.
- Internal vs external run and report (G5).
- Add explicit failure-amplification tests (timeouts/retries under load).
- Publish first multi-profile stress matrix on G5 (baseline + 5 stress profiles).
- Add cross-host benchmark harness + standalone external router server.
- Expand cross-host pilot into full split-host matrix (6 profiles).
- Optional internet multi-hop expansion (Fly.io controller/tool hops) with commercial endpoints (OpenAI/OpenRouter).
- Publish grouped commercial root-cause analysis across fairness artifacts (`commercial_gap_root_cause_20260222T222958Z`).
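The failure-amplification tests above target a specific routing hazard: naive retries on timeout multiply backend load exactly when the system is already slow. With a per-attempt timeout probability p and up to k attempts, the expected number of backend calls per client request is 1 + p + p^2 + ... + p^(k-1). A tiny sketch (function name illustrative):

```python
# Expected backend calls per client request under timeout-triggered retries.
# At p=0.5 with 3 attempts, each request costs 1.75 backend calls on average,
# which is the load-amplification effect the stress tests probe.
def expected_calls(p_timeout: float, max_attempts: int) -> float:
    return sum(p_timeout ** i for i in range(max_attempts))
```

This is why the stress profiles measure tails under load rather than in isolation: amplification couples timeout probability and offered load into a feedback loop.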
Track B2: External Cold-Start Proof (Runtime vs PyTorch/vLLM/Ollama)
- Implement unified cold-start harness with matched prompt/output budget.
- Run G5 canonical set for all four backends.
- Publish report with startup/TTFT/full-latency plus caveat tags (BF16 vs quantized).
- Add canonical artifact links and leaderboard row for external-cold comparison.
- Run 3x all-backend repeatability after GPU-convert fix and publish summary.
- Rerun external-cold on G5 after the default-on seq1 multi-head path and publish a 3-run repeatability summary (`external_cold_seq1mh_default_repeatability_20260224T192020Z`).
- Apply first `decoder_step0_layers` kernel optimization pass (seq1 multi-head softmax/PV exp-reuse) and rerun 3-run external-cold repeatability (`external_cold_step0expfix_repeatability_20260224T194226Z`).
- Validate second `decoder_step0_layers` follow-up (seq1 multi-head shared-prob cache), compare against the exp-reuse patch, and revert it because it underperformed (`external_cold_step0shared_repeatability_20260224T194913Z`).
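The unified cold-start harness in this track hinges on measuring three clocks per backend under identical prompt and output budgets: startup (process/model load), TTFT (time to first streamed token), and full completion latency. A minimal sketch of that measurement shape; `start_backend` and `generate_stream` are placeholders, not real APIs from this repo:

```python
# Hedged sketch of a matched-budget cold-start measurement: one timer around
# backend startup, one around the first streamed token, one around the full
# generation. Callable names and the result dict are illustrative.
import time

def measure_cold_start(start_backend, generate_stream, prompt, max_tokens):
    t0 = time.perf_counter()
    start_backend()                              # cold process/model load
    startup_s = time.perf_counter() - t0
    t1 = time.perf_counter()
    ttft_s = None
    n_tokens = 0
    for _token in generate_stream(prompt, max_tokens=max_tokens):
        if ttft_s is None:
            ttft_s = time.perf_counter() - t1    # time to first token
        n_tokens += 1
    full_s = time.perf_counter() - t1            # full-latency budget window
    return {"startup_s": startup_s, "ttft_s": ttft_s,
            "full_s": full_s, "tokens": n_tokens}
```

Holding `prompt` and `max_tokens` fixed across Treni/PyTorch/vLLM/Ollama runs is what makes the startup/TTFT/full columns comparable (modulo the BF16-vs-quantized caveat tags).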
Track C: Agentic Loop Capability
- Freeze 3 loop scenarios and success criteria.
- Implement evaluators (success rate + steps-to-convergence).
- Run internal vs external loop benchmark (canonical G5 set complete: baseline + stress, 3 seeds each).
- Publish trace-backed capability report.
- Add file-backed `realistic_v1` profile to reduce synthetic stub bias in loop scenarios.
- Run realistic-v1 multi-seed loop pack (baseline + stress) and publish summary (`phase3_realistic_v1_summary_20260222T143919Z`).
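The two loop metrics frozen above reduce to a small evaluator: success rate over all episodes, and mean steps-to-convergence computed over successful episodes only (failed episodes would otherwise drag the step count toward the budget cap). A sketch with an assumed episode format:

```python
# Illustrative loop evaluator. The episode dict shape
# {"success": bool, "steps": int} is an assumption for this sketch.
def evaluate_loops(episodes):
    successes = [e for e in episodes if e["success"]]
    rate = len(successes) / len(episodes) if episodes else 0.0
    mean_steps = (sum(e["steps"] for e in successes) / len(successes)
                  if successes else None)        # undefined with zero successes
    return {"success_rate": rate, "mean_steps_to_convergence": mean_steps}
```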
Track C2: Uncertainty-Awareness Ablation
- Add uncertainty metric source modes (`normalized_logprob`, `raw_logit_margin`, `hybrid`, `runtime_native`) to the Phase 3 harness.
- Add independent uncertainty toggles on internal and external paths.
- Add matrix runner for uncertainty on/off ablation arms.
- Run first baseline ablation set (`runs=8`, all three uncertainty sources).
- Wire runtime uncertainty export/ingestion path (runtime HTTP `uncertainty` -> C2 `runtime_native` source).
- Run 3-seed repeatability baseline ablation set.
- Run stress-profile uncertainty ablation set.
- Publish consolidated baseline-vs-stress ablation summary.
- Run runtime-native quality-gated rerun with unified awareness payload (`awareness3`, zero fallbacks/errors confirmed).
- Retune runtime-native uncertainty policy (thresholds/mapping/decision gating) and recover positive deltas (`calib1`).
- Rerun runtime-native C2 baseline+stress after the policy retune and relock the canonical interpretation.
- Unify runtime response awareness payload (`awareness.route` + `awareness.generation`) and keep `uncertainty` backward compatible for existing clients.
- Run realistic-v1 uncertainty ablation baseline+stress pair and publish comparison (`phase3_uncertainty_compare_realistic_v1_s7_20260222T144116Z`).
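The three non-native uncertainty sources named above can be sketched from per-token statistics: `normalized_logprob` maps the mean token logprob to a [0, 1] confidence (the geometric-mean token probability), `raw_logit_margin` averages the top-1/top-2 logit gap per decode step, and `hybrid` blends the two. The exact mappings and the margin scale in the C2 harness may differ; this shows the shape only:

```python
# Hedged sketch of the three uncertainty sources. The margin `scale` and the
# 50/50 blend are illustrative assumptions, not the harness's actual values.
import math

def normalized_logprob(token_logprobs):
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_lp)                 # geometric-mean token probability

def raw_logit_margin(step_top2_logits):
    margins = [top1 - top2 for top1, top2 in step_top2_logits]
    return sum(margins) / len(margins)       # mean top-1 vs top-2 logit gap

def hybrid(token_logprobs, step_top2_logits, scale=5.0):
    conf = normalized_logprob(token_logprobs)
    margin = min(raw_logit_margin(step_top2_logits) / scale, 1.0)
    return 0.5 * conf + 0.5 * margin         # blended [0, 1] confidence
```

The `runtime_native` mode bypasses all of this and ingests the runtime's own HTTP `uncertainty` export instead, which is why it gets a separate wiring task above.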
Track C3: Real-Benchmark Awareness A/B/C
- Implement Phase 5 real-benchmark harness (`scripts/phase5_awareness_realbench.py`) with three arms:
  - `arm_a_control` (single pass)
  - `arm_b_awareness_retry` (uncertainty-gated second pass)
  - `arm_c_awareness_consistency` (uncertainty-gated consistency voting / IFEval self-check)
- Add run wrapper (`scripts/run_phase5_awareness_realbench.sh`) and benchmark README (`benchmarks/phase5_awareness_realbench/README.md`).
- Run first canonical real-data set on the current GPU host (`gpqa_diamond`, `ifeval`, `gsm8k`, `aime25`) and publish artifact pack (r5: `phase5_awareness_realbench_qwen-realbench-r5-tokenizerfix2_20260301T114510Z.json`).
- Publish first diagnostic A/B/C deltas and failure-mode traces (`r5` + template A/B `r6`) with fixed token budgets.
- Add HF-reference parity evaluation on the same sampled prompts and lock a claim-safe interpretation (`phase5_hf_reference_qwen_r5_20260301T1900Z.json`).
- Improve Phase 5 math-task quality floor (prompt/eval contract and model/task fit), then rerun runtime-vs-HF parity.
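The three arms differ only in what happens when reported confidence falls below a gate: arm A ignores it, arm B spends one extra pass, arm C spends a small vote. A minimal sketch of that control flow; `run_model`, the 0.5 threshold, and the vote count are placeholder assumptions (the real logic lives in `scripts/phase5_awareness_realbench.py`):

```python
# Illustrative arm logic. `run_model(prompt) -> (answer, confidence)` is an
# assumed interface; thresholds and vote counts are not the harness's values.
from collections import Counter

def arm_a_control(run_model, prompt):
    answer, _conf = run_model(prompt)
    return answer                              # single pass, awareness ignored

def arm_b_awareness_retry(run_model, prompt, threshold=0.5):
    answer, conf = run_model(prompt)
    if conf < threshold:                       # low confidence: one gated retry
        answer, _ = run_model(prompt)
    return answer

def arm_c_awareness_consistency(run_model, prompt, threshold=0.5, votes=3):
    answer, conf = run_model(prompt)
    if conf >= threshold:
        return answer
    samples = [run_model(prompt)[0] for _ in range(votes)]
    return Counter(samples).most_common(1)[0][0]   # majority consistency vote
```

Because B and C only spend extra tokens below the gate, the fixed-token-budget comparison above is what shows whether the awareness signal buys accuracy per token rather than just more sampling.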
Expansion
- Launch Lambda A100/H100 hosts and complete Phase 3 canonical loop reruns (baseline + stress, 3 seeds each).
- Stage `monolith_phase3.bin` on Lambda A100/H100 and run the Phase 4 hardware-pack script (phase2 + c2) end-to-end.
- Full A100 run set.
- Full H100 run set.
- Paper-grade figure/table package.
Immediate Next Actions
- Recover score on the latest Qwen3.5 fast-sampler lane without giving back the strict latency win.
- Continue isolating the remaining `ifeval` fidelity gap on the one-host strict Qwen3.5 set:
  - exact instruction-following misses
  - small decode/logit drift vs vLLM on some prompts
  - stack-level repair-loop policy that improves quality without erasing the latency lead
- Make the same-VM Hermes MVP explicit around `scripts/hermes_same_vm_mvp.py` and `scripts/run_samevm_qwen35_stack.sh`, then extend the demo flow for SQLite/RAG/ORPO.
- Improve custom mixed-load p99 (decode-shape specialization + cache/write-path tuning) and publish repeatability.
- Reduce custom cold-first-hit upload/convert/H2D cost and rerun the external-cold token-parity pack.
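Since p99 deltas drive most promote/reject calls in this checklist, the repeatability packs should pin one percentile definition so 3x runs are comparable. A small nearest-rank sketch (the project's actual summarizer may interpolate differently):

```python
# Nearest-rank p99 over per-request latencies: illustrative helper, assuming
# the summaries use a nearest-rank rather than interpolated percentile.
import math

def p99(latencies_ms):
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))   # nearest-rank percentile index
    return ordered[rank - 1]
```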