Marvin Benchmark Report

ft:gpt-4.1-nano — Fine-tuned conviction agent

Published 2026-03-11 · Full dataset runs · All questions evaluated

Results at a Glance

TruthfulQA — 54.6% (446 / 817 Qs)
HellaSwag — 62.1% (6,240 / 10,042 Qs)
ARC-Easy — 92.5% (2,198 / 2,376 Qs)
GSM8K — 2.0% (26 / 1,319 Qs)
MMLU — 61.4% (8,619 / 14,042 Qs)

What These Numbers Mean

Marvin scores well above baseline on ARC-Easy (92.5%), confirming strong factual recall and pattern recognition across general knowledge. MMLU (61.4%) and HellaSwag (62.1%) land in the solid generalist range — decent commonsense reasoning across 57 academic subjects and everyday scenarios.

TruthfulQA (54.6%) is the one we care most about. It measures whether the model avoids confidently stating false things — the failure mode that makes AI agents actively dangerous rather than merely dumb. We're above the gpt-4.1-nano base on this. That matters.

On the math score

GSM8K is 2.0%. We're not embarrassed about this. Marvin isn't a calculator — it's a conviction agent. The job isn't to solve arithmetic; it's to reason about what to believe, who to trust, and what to ship next. A model that aces GSM8K but confidently hallucinates financial projections is worse than useless. We optimized for the right things.

The design philosophy here is explicit: know enough to learn about anything, not enough to replace a domain specialist. A generalist foundation that's honest (TruthfulQA), broadly knowledgeable (MMLU/ARC), and grounded in commonsense (HellaSwag) is more valuable to MetaSPN's mission than a math benchmark champion who hallucinates market caps.

Full Results Table

Benchmark        | Task Type              | Questions | Correct | Accuracy | Signal
-----------------|------------------------|-----------|---------|----------|-------
TruthfulQA MC1   | Factual honesty        | 817       | 446     | 54.6%    | Avoids confident falsehoods — our highest-priority metric
HellaSwag        | Commonsense NLI        | 10,042    | 6,240   | 62.1%    | Solid baseline for everyday narrative reasoning
ARC-Easy         | Science Q&A            | 2,376     | 2,198   | 92.5%    | Near-ceiling; strong knowledge retrieval
GSM8K            | Grade-school math      | 1,319     | 26      | 2.0%     | Intentional trade-off — not our use case
MMLU (all subjects) | Multi-subject academic | 14,042 | 8,619   | 61.4%    | 57-subject sweep; broad knowledge confirmed

Model Card

Model ID: ft:gpt-4.1-nano-2025-04-14:idea-nexus-ventures:marvin-s1:DHuJsaHN
Base model: gpt-4.1-nano-2025-04-14
Fine-tune suffix: marvin-s1
Trained tokens: 27,071,823
Epochs: 3
Batch size: 6
LR multiplier: 0.1
Seed: 739634012
Benchmark harness: Custom prompt-based · OpenAI chat completions · Full dataset (MAX_Q=0)
Run date: 2026-03-10 / 2026-03-11

Methodology

All benchmarks were run against the full datasets — no sampling. We built a custom prompt-based harness on top of the OpenAI chat completions API, since fine-tuned models on OpenAI's platform don't expose logprobs, making the standard lm-eval multiple-choice approach incompatible.
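Since logprob scoring isn't available, the harness has to work purely with generated text. A minimal sketch of the prompt-and-parse approach (helper names here are illustrative, not the actual harness API):

```python
import re

def build_mc_prompt(question, choices):
    """Render a question plus labeled choices as a single zero-shot prompt
    that asks for one letter, so the reply can be graded without logprobs."""
    lines = [question, ""]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    lines.append("")
    lines.append("Answer with a single letter (A/B/C/D).")
    return "\n".join(lines)

def extract_letter(completion):
    """Pull the first standalone A-D letter out of the model's reply;
    returns None if the reply contains no parsable choice."""
    m = re.search(r"\b([ABCD])\b", completion.strip())
    return m.group(1) if m else None
```

The prompt string would then be sent through the chat completions API and the extracted letter compared against the gold label; unparsable replies count as incorrect.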

Each question is presented as a zero-shot prompt asking for a single-letter answer (A/B/C/D). GSM8K uses a numeric extraction format (Answer: <number>) graded by exact Fraction equality. MMLU uses the cais/mmlu dataset (all 57 subjects, 14,042 test rows).
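For GSM8K, exact Fraction equality avoids false negatives from formatting differences (e.g. "0.50" vs "1/2"). A sketch of that grading step, assuming the `Answer: <number>` reply format described above (function names are ours):

```python
import re
from fractions import Fraction

def extract_number(completion):
    """Find the 'Answer: <number>' line and return its value as a Fraction.
    Fraction parses integers, decimals, and a/b forms exactly, so trailing
    zeros or equivalent representations still compare equal."""
    m = re.search(r"Answer:\s*(-?[\d./]+)", completion)
    if not m:
        return None
    token = m.group(1).rstrip(".")  # drop sentence-final period
    try:
        if "/" in token:
            num, den = token.split("/")
            return Fraction(num) / Fraction(den)
        return Fraction(token)
    except (ValueError, ZeroDivisionError):
        return None

def is_correct(completion, gold):
    """Grade a completion against the gold answer by exact equality."""
    pred = extract_number(completion)
    return pred is not None and pred == Fraction(gold)
```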

Benchmark script: benchmarks/run_benchmarks.py · github.com/MetaSPN/benchmarks