Treni

Track B Claim-Safe Table

Paper-ready, scoped Track B commercial claims with explicit model-only vs tool-only splits.

Purpose

This page is the claim-safe summary for Track B commercial comparisons.

Interpretation rule:

  • An external/internal (Ext/Int) ratio above 1 means internal routing is faster; a ratio below 1 means external routing is faster.
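The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the harness code: it assumes the ratio is external time divided by internal time (the page implies this but does not state the unit), and the function names `ext_int_ratio` and `faster_route` are hypothetical.

```python
def ext_int_ratio(external_ms: float, internal_ms: float) -> float:
    """Return external time divided by internal time (assumed definition)."""
    if internal_ms <= 0:
        raise ValueError("internal_ms must be positive")
    return external_ms / internal_ms


def faster_route(ratio: float) -> str:
    """Apply the interpretation rule: a ratio above 1 means internal is faster."""
    if ratio > 1.0:
        return "internal"
    if ratio < 1.0:
        return "external"
    return "tie"
```

Under this reading, `faster_route(0.987)` returns `"external"` and `faster_route(1.066)` returns `"internal"`, which matches how the mixed task set rows below are read.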

Canonical Inputs

Mixed Task Set (Local Control, runs=8/profile)

| Provider/Model | Ext/Int Ratio | Internal Error | External Error |
|---|---|---|---|
| OpenAI gpt-5.2 | 0.987x | 0.0000 | 0.0313 |
| OpenRouter anthropic/claude-sonnet-4.6 | 1.066x | 0.0000 | 0.0313 |

Task-Family Split (Local Control, runs=8)

| Provider/Model | Model-Only Ext/Int | Tool-Only Ext/Int |
|---|---|---|
| OpenAI gpt-5.2 | 0.958x | 1.136x |
| OpenRouter anthropic/claude-sonnet-4.6 | 1.044x | 1.051x |

Task-Family Split (Fairness-Hardened, Local Control, runs=8)

Harness controls:

  • execution_mode=interleaved
  • pair_order=alternate
  • deterministic generation defaults (temperature=0)
  • strict tool parity enabled on tool_only
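One way to picture these controls is the sketch below. The page does not show how the harness implements them, so everything here is an assumption: `interleaved` is read as running both routes back to back within each run, `alternate` as flipping which route goes first on every run, and the `strict_tool_parity` key name and the `interleaved_schedule` helper are hypothetical.

```python
from typing import List, Tuple

# Hypothetical settings mirroring the controls listed above.
HARNESS_CONTROLS = {
    "execution_mode": "interleaved",
    "pair_order": "alternate",
    "temperature": 0,
    "strict_tool_parity": True,  # assumed key name; applies to tool_only
}


def interleaved_schedule(runs: int) -> List[Tuple[str, str]]:
    """Build the (first, second) route order for each run.

    Both routes execute in every run (interleaved), and the route that
    goes first flips from one run to the next (alternate).
    """
    schedule = []
    for i in range(runs):
        if i % 2 == 0:
            schedule.append(("internal", "external"))
        else:
            schedule.append(("external", "internal"))
    return schedule
```

With `runs=8`, each route goes first four times, so warm-up or drift effects that depend on position should not systematically favor one side.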

| Provider/Model | Model-Only Ext/Int | Tool-Only Ext/Int | Model-Only Int ms/token | Model-Only Ext ms/token | Tool-Only Int ms/token | Tool-Only Ext ms/token |
|---|---|---|---|---|---|---|
| OpenAI gpt-5.2 | 0.971x | 1.038x | 57.657 | 57.663 | 37.553 | 38.990 |
| OpenRouter anthropic/claude-sonnet-4.6 | 1.102x | 1.063x | 61.606 | 70.054 | 41.212 | 43.791 |

Claim-Safe Reading

  1. After fairness hardening, tool_only favors internal routing on both providers (Ext/Int 1.038x for OpenAI gpt-5.2, 1.063x for Sonnet).
  2. model_only remains provider-sensitive: OpenAI stays near parity with a slight inversion (0.971x), while Sonnet favors internal routing (1.102x).
  3. Commercial Track B claims must stay stratified by task family (model_only vs tool_only) and be paired with token-normalized metrics (ms/token).
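The token-normalized pairing in item 3 can be illustrated by dividing the external ms/token by the internal ms/token for each provider and task family. The sketch below uses the ms/token values from the fairness-hardened table; the dictionary layout and the `token_normalized_ratios` name are illustrative assumptions, not part of the harness.

```python
# ms/token values copied from the fairness-hardened table (runs=8).
MS_PER_TOKEN = {
    ("OpenAI gpt-5.2", "model_only"): {"int": 57.657, "ext": 57.663},
    ("OpenAI gpt-5.2", "tool_only"): {"int": 37.553, "ext": 38.990},
    ("OpenRouter anthropic/claude-sonnet-4.6", "model_only"): {"int": 61.606, "ext": 70.054},
    ("OpenRouter anthropic/claude-sonnet-4.6", "tool_only"): {"int": 41.212, "ext": 43.791},
}


def token_normalized_ratios(table):
    """Return external ms/token divided by internal ms/token per (provider, family)."""
    return {key: row["ext"] / row["int"] for key, row in table.items()}


RATIOS = token_normalized_ratios(MS_PER_TOKEN)

# Print one line per stratum so claims stay split by task family.
for (provider, family), ratio in sorted(RATIOS.items()):
    print(f"{provider:45s} {family:11s} ext/int ms/token = {ratio:.3f}")
```

Read this way, OpenAI model_only sits at roughly 1.000 on ms/token while its headline Ext/Int is 0.971x, which suggests the two views can diverge and is one reason to report them together rather than quoting either alone.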
