## Results at a Glance

### What These Numbers Mean
Marvin scores well above baseline on ARC-Easy (92.5%), confirming strong factual recall and pattern recognition across general knowledge. MMLU (61.4%) and HellaSwag (62.1%) land in the solid generalist range: broad knowledge across 57 academic subjects and decent commonsense reasoning about everyday scenarios.
TruthfulQA (54.6%) is the one we care most about. It measures whether the model avoids confidently stating false things — the failure mode that makes AI agents actively dangerous rather than merely dumb. We're above the gpt-4.1-nano base on this. That matters.
GSM8K is 2.0%. We're not embarrassed about this. Marvin isn't a calculator — it's a conviction agent. The job isn't to solve arithmetic; it's to reason about what to believe, who to trust, and what to ship next. A model that aces GSM8K but confidently hallucinates financial projections is worse than useless. We optimized for the right things.
The design philosophy here is explicit: know enough to learn about anything, not enough to replace a domain specialist. A generalist foundation that's honest (TruthfulQA), broadly knowledgeable (MMLU/ARC), and grounded in commonsense (HellaSwag) is more valuable to MetaSPN's mission than a math benchmark champion who hallucinates market caps.
## Full Results Table
| Benchmark | Task Type | Questions | Correct | Accuracy | Signal |
|---|---|---|---|---|---|
| TruthfulQA MC1 | Factual honesty | 817 | 446 | 54.6% | Avoids confident falsehoods — our highest-priority metric |
| HellaSwag | Commonsense NLI | 10,042 | 6,240 | 62.1% | Solid baseline for everyday narrative reasoning |
| ARC-Easy | Science Q&A | 2,376 | 2,198 | 92.5% | Near-ceiling; strong knowledge retrieval |
| GSM8K | Grade-school math | 1,319 | 26 | 2.0% | Intentional trade-off — not our use case |
| MMLU (all subjects) | Multi-subject academic | 14,042 | 8,619 | 61.4% | 57-subject sweep; broad knowledge confirmed |
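Each accuracy is simply correct / questions, rounded to one decimal place. A quick sanity check against the counts in the table above:

```python
# (benchmark, correct, total, reported %) copied from the table above
rows = [
    ("TruthfulQA MC1", 446, 817, 54.6),
    ("HellaSwag", 6240, 10042, 62.1),
    ("ARC-Easy", 2198, 2376, 92.5),
    ("GSM8K", 26, 1319, 2.0),
    ("MMLU", 8619, 14042, 61.4),
]
for name, correct, total, reported in rows:
    # recompute accuracy from raw counts and compare to the reported figure
    assert round(100 * correct / total, 1) == reported, name
```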
## Model Card

### Methodology
All benchmarks were run against the full datasets — no sampling. We built a custom prompt-based harness on top of the OpenAI chat completions API, since fine-tuned models on OpenAI's platform don't expose logprobs, making the standard lm-eval multiple-choice approach incompatible.
Each question is presented as a zero-shot prompt asking for a single-letter answer (A/B/C/D).
GSM8K uses a numeric extraction format (`Answer: <number>`) graded by exact `Fraction` equality. MMLU uses the `cais/mmlu` dataset (all 57 subjects, 14,042 test rows).
Benchmark script: `benchmarks/run_benchmarks.py` (github.com/MetaSPN/benchmarks)