# Treni Experiment Docs
A unified GPU agent runtime that can feel, not just be told.
## What Is Treni
Treni is a single C/CUDA binary that packs multiple ML models, routing, tokenization, and tool execution into one GPU process. Instead of an SDK agent calling remote tools and receiving serialized responses, the agent lives inside the runtime — it can see logprobs, feel tokenization quality, inspect model state, and adapt its next step inline.
The core hypothesis: an agent that can observe its own execution context makes better decisions than one that gets sent results.
This is not just about speed. It's about aware generation.
## Fast Facts (Canonical)
- G5 warm path: `80.602 ms` mean, `90.350 ms` p99.
- External-cold (G5) runtime vs vLLM:
  - TTFT: `5.130 ms` vs `84.837 ms` (`16.537x`).
  - Request full: `316.403 ms` vs `1232.660 ms` (`3.896x`).
  - Cold total first response: `1320.240 ms` vs `28937.430 ms` (`21.918x`).
- Frontend A/B repeatability (`repeats=3`): current custom path wins all tracked metrics in both `warm_fixed` and `mixed_churn` profiles.
- Runtime-native uncertainty (calibrated) is positive in both baseline and stress:
  - Baseline: internal `+0.1539`, external `+0.1058`.
  - Stress: internal `+0.1539`, external `+0.1154`.
- Commercial control snapshot (OpenAI `gpt-5.2`, model-only): current external-internal delta is near parity/noise at present sample size.
## What We've Proven So Far
| Claim | Status | Key Number |
|---|---|---|
| Faster than Python baseline | Proven | 29x on warm path |
| Sub-100ms steady-state | Proven | 80.602 ms mean, 90.350 ms p99 |
| Internal routing beats external | Proven | G5 overall ext/int 1.208x (higher is worse for external) |
| Cold start manageable | Proven (after staged fixes) | 25-620x speedup vs early true-TTFT set |
| Identical numerical outputs | Proven | 0 parity failures, strict mode |
| Aware generation improves loops | Proven (harness) | runtime-native C2 deltas positive after calibration |
| Phase 4 cross-hardware reruns | Proven | A100/H100 full sets complete + paper package generated |
## What Is Still Open
- Reduce startup overhead for shape-level fused miss mitigation.
- Run higher-N, region-pinned commercial reruns for tighter confidence intervals.
## Reading Order
Start here, then follow the links in order:
- Objectives and thesis — the core claim and why a GPU agent that can feel beats one that gets told
- Paper — Entropy-Guided Loop — the research foundation for uncertainty-aware generation
- Experiment logbook — the clean map of canonical lanes vs scratch work
- Benchmark status — detailed completion status by lane
- Leaderboard — the main benchmark numbers
- Track B claim-safe table — scoped commercial claims (`model_only` vs `tool_only`)
- Routing comparison — internal vs external routing breakdown
- Canonical G5 artifact set — the official reference run set
- Paper package — paper-ready cross-hardware tables
- Findings changelog — the long chronological record
- Raw artifacts — every JSON and report file
- Scratch experiments — exploratory and non-canonical work
- TODO and next actions — what's coming next