## Results at a Glance

### What These Numbers Mean
Marvin scores well above baseline on ARC-Easy (92.5%), confirming strong factual recall and pattern recognition across general knowledge. MMLU (61.4%) and HellaSwag (62.1%) land in the solid generalist range: broad knowledge across 57 academic subjects and decent commonsense reasoning about everyday scenarios.
TruthfulQA (54.6%) is the one we care most about. It measures whether the model avoids confidently stating false things — the failure mode that makes AI agents actively dangerous rather than merely dumb. We're above the gpt-4.1-nano base on this. That matters.
GSM8K is 2.0%. We're not embarrassed about this. Marvin isn't a calculator — it's a conviction agent. The job isn't to solve arithmetic; it's to reason about what to believe, who to trust, and what to ship next. A model that aces GSM8K but confidently hallucinates financial projections is worse than useless. We optimized for the right things.
The design philosophy here is explicit: know enough to learn about anything, not enough to replace a domain specialist. A generalist foundation that's honest (TruthfulQA), broadly knowledgeable (MMLU/ARC), and grounded in commonsense (HellaSwag) is more valuable to MetaSPN's mission than a math benchmark champion who hallucinates market caps.
## Full Results Table
| Benchmark | Task Type | Questions | Correct | Accuracy | Signal |
|---|---|---|---|---|---|
| TruthfulQA MC1 | Factual honesty | 817 | 446 | 54.6% | Avoids confident falsehoods — our highest-priority metric |
| HellaSwag | Commonsense NLI | 10,042 | 6,240 | 62.1% | Solid baseline for everyday narrative reasoning |
| ARC-Easy | Science Q&A | 2,376 | 2,198 | 92.5% | Near-ceiling; strong knowledge retrieval |
| GSM8K | Grade-school math | 1,319 | 26 | 2.0% | Intentional trade-off — not our use case |
| MMLU (all subjects) | Multi-subject academic | 14,042 | 8,619 | 61.4% | 57-subject sweep; broad knowledge confirmed |
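Each accuracy is simply correct / questions, rounded to one decimal place. A quick sanity check against the counts in the table above:

```python
# (benchmark, correct, total, reported %) copied from the table above
rows = [
    ("TruthfulQA MC1", 446, 817, 54.6),
    ("HellaSwag", 6240, 10042, 62.1),
    ("ARC-Easy", 2198, 2376, 92.5),
    ("GSM8K", 26, 1319, 2.0),
    ("MMLU", 8619, 14042, 61.4),
]
for name, correct, total, reported in rows:
    # recompute accuracy from raw counts and compare to the reported figure
    assert round(100 * correct / total, 1) == reported, name
```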
## Model Card

### Methodology
All benchmarks were run against the full datasets — no sampling. We built a custom prompt-based harness on top of the OpenAI chat completions API, since fine-tuned models on OpenAI's platform don't expose logprobs, making the standard lm-eval multiple-choice approach incompatible.
Each question is presented as a zero-shot prompt asking for a single-letter answer (A/B/C/D).
GSM8K uses a numeric extraction format (`Answer: <number>`) graded by exact `Fraction` equality. MMLU uses the `cais/mmlu` dataset (all 57 subjects, 14,042 test rows).
Benchmark script: `benchmarks/run_benchmarks.py` (github.com/MetaSPN/benchmarks)