Treni

TODO

Live execution checklist and next actions.

Priority Order

Current Checklist

Track A: Cold/Hot Foundations

  • True TTFT instrumentation in runtime request path.
  • 3x cold-first-hit repeatability set (G5).
  • 3x warm steady-state repeatability set (G5).
  • Cold bottleneck fix: per-model tensor lookup index cache.
  • Cold rerun after fix with artifact pack.
  • Add stage-level cold decomposition metrics (tokenizer load, index build, tensor upload, first decode step).
  • Optimize model_tensor_index_build via fast tensor collect path and rerun 3x cold validation.
  • Rerun 3x cold validation after reverting regressed upload path (clean7) and confirm clean4 parity.
  • Add sub-stage upload instrumentation (decoder_tensor_convert, decoder_tensor_h2d, decoder_tensor_copy_total).
  • Add startup preload + tokenizer cache path to cut first-request upload/tokenizer overhead.
  • Wire request max_tokens through runtime HTTP path for token-parity comparisons.
  • Disable decoder per-step trace by default (TRENI_DEMO_TRACE opt-in).
  • Reduce remaining Qwen request-path TTFT/full gap vs vLLM (decoder per-token path).

Track B: Internal vs External Routing

  • Minimal external baseline harness.
  • Matched task set and budgets.
  • Internal vs external run and report (G5).
  • Add explicit failure-amplification tests (timeouts/retries under load).

Track B2: External Cold-Start Proof (Runtime vs PyTorch/vLLM/Ollama)

  • Implement unified cold-start harness with matched prompt/output budget.
  • Run G5 canonical set for all four backends.
  • Publish report with startup/TTFT/full-latency plus caveat tags (BF16 vs quantized).
  • Add canonical artifact links and leaderboard row for external-cold comparison.

Track C: Agentic Loop Capability

  • Freeze 3 loop scenarios and success criteria.
  • Implement evaluators (success rate + steps-to-convergence).
  • Run internal vs external loop benchmark.
  • Publish trace-backed capability report.

Expansion

  • Full A100 run set.
  • Full H100 run set.
  • Paper-grade figure/table package.

Immediate Next Actions

  1. Run 3x repeatability for token-parity external cold benchmark (max_tokens=48) on G5.
  2. Optimize Qwen decoder per-token step path to close request-path gap vs vLLM.
  3. Add timeout/retry failure-amplification tests to internal-vs-external routing.

On this page