We Tested AI Models on a Real SOC 2 Report. Here Is What Actually Extracts All the Controls.
By ArvexLab Team — AI Engineering
Why We Ran This Test
When we built ArvexLab, we had to choose an AI model for parsing compliance documents — SOC 2 reports, ISO 27001 certificates, contracts, policies. The market has dozens of models at wildly different price points. The question was straightforward: does paying more actually get you more controls?
We could not find a published benchmark for compliance document extraction. DocVQA tests visual question answering. ExtractBench tests structured JSON extraction from financial filings and other business documents. Nothing tests what we actually need: pulling every control, every metadata field, and every exception from a real audit report.
So we built one.
The Test
Document
A real SOC 2 Type II report (43 pages, 860KB) covering October 2024 to September 2025. Not a synthetic document. The kind of report compliance teams process every week. It contains 112 controls across 12 Trust Services Criteria categories, audited by a mid-market firm, opinion: unqualified.
Ground Truth
A compliance analyst verified every control in the PDF manually. The verified count: 112 controls, 12 TSC categories (CC1–CC9, A1, C1, PI1), 8 metadata fields (report type, firm, opinion, dates, criteria, description).
What We Tested
Two models, two input modes — same prompt, same JSON schema, temperature 0:
| Model | Class | Why We Picked It |
|---|---|---|
| Claude Haiku 4.5 (Anthropic) | Mid-tier | Strong structured extraction, our candidate for production |
| Gemini 2.0 Flash (Google) | Budget | 10x cheaper per token, widely used for document tasks |
Each model was tested with:
- Raw PDF: the standard approach — send the PDF as base64 directly to the model
- Pre-extracted text: PDF first processed through OpenDataLoader (the top-ranked open-source PDF parser, 0.907 accuracy on 200 real-world docs), then structured markdown sent instead of raw PDF
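Concretely, the two input modes differ only in how the document is packaged for the model. A minimal sketch in an Anthropic-style message shape (the schema and prompt are simplified stand-ins for our production versions, and the actual API call is omitted):

```python
import base64
import json

# Simplified stand-in for the production JSON schema (illustrative only).
CONTROL_SCHEMA = {
    "type": "object",
    "properties": {
        "controls": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "string"},        # e.g. "CC5.1.1"
                    "category": {"type": "string"},  # e.g. "CC5"
                    "description": {"type": "string"},
                },
                "required": ["id", "category", "description"],
            },
        }
    },
    "required": ["controls"],
}

PROMPT = "Extract every control as JSON matching this schema:\n" + json.dumps(CONTROL_SCHEMA)

def raw_pdf_message(pdf_bytes: bytes) -> dict:
    """Mode 1: send the PDF itself, base64-encoded, as a document block."""
    return {
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": base64.b64encode(pdf_bytes).decode()}},
            {"type": "text", "text": PROMPT},
        ],
    }

def pre_extracted_message(markdown_text: str) -> dict:
    """Mode 2: send markdown produced by a deterministic PDF parser."""
    return {
        "role": "user",
        "content": [
            {"type": "text",
             "text": PROMPT + "\n\n--- DOCUMENT ---\n" + markdown_text},
        ],
    }
```

Everything else (prompt, schema, temperature 0) stays identical between modes, so any accuracy difference is attributable to the input packaging.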
What We Found
The Numbers
| Pipeline | Controls Found | Missed | Recall | Metadata | TSC Coverage | Time | Relative Cost |
|---|---|---|---|---|---|---|---|
| Claude Haiku 4.5 + text pre-processing | 112 / 112 | 0 | 100% | 8/8 | 100% | 58.7s | 3.8% |
| Gemini 2.0 Flash + raw PDF | 103 / 112 | 9 | 92% | 8/8 | 92% | 56.1s | 0.3% |
| Gemini 2.0 Flash + text pre-processing | 100 / 112 | 12 | 89% | 6/8 | 92% | 39.1s | 0.4% |
> Cost shown as a percentage of the most expensive available model (Claude Opus 4.6 = 100%). Based on official API pricing applied to our measured token profile: 22K input / 9.7K output tokens per document.
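Recall in the table is simply ground-truth controls matched over the 112 verified ones; spurious extractions never inflate it. A toy sketch with hypothetical control IDs:

```python
def recall(ground_truth: set[str], extracted: set[str]) -> tuple[int, float]:
    """Count ground-truth controls the model found; extras don't raise recall."""
    found = ground_truth & extracted
    return len(found), len(found) / len(ground_truth)

# Toy example: 4 controls in the ground truth, model finds 3 of them.
truth = {"CC1.1.1", "CC1.1.2", "A1.2.1", "C1.1.1"}
model = {"CC1.1.1", "CC1.1.2", "A1.2.1", "CC9.9.9"}  # one miss, one extra
n, r = recall(truth, model)
# n == 3, r == 0.75; the spurious CC9.9.9 does not inflate recall
```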
Where the Models Sit in the Landscape
To put our results in context, here is where every major model lands on the cost spectrum for this same document. Only the models we tested have measured accuracy; the others are shown at published pricing for comparison.
```
Relative Model Cost (% of max)                          Cost Tier
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Claude Opus 4.6          ████████████████████  100%    Premium
Claude Sonnet 4.6        ███                    14.2%  Mid-high
GPT-4o                   ██                     10.8%  Mid-high
Gemini 2.5 Pro           ██                      8.1%  Mid
GPT-4.1                  █                       6.8%  Mid
Claude Haiku 4.5 + text  █ ● 112/112             3.8%  Mid-low
Gemini 2.5 Flash         ▌                       2.7%  Budget+
GPT-4o mini              ▏                       0.5%  Budget
Gemini 2.0 Flash         ▏ ● 103/112             0.3%  Budget
Gemini 2.0 Flash Lite    ▏                       0.2%  Budget

● = Tested (measured accuracy)   Unmarked = cost only (not tested)
```
What this shows: there is a wide gap between the budget tier (under 1% of max cost) and the premium tier (100%). The interesting engineering question is whether anything in the 2–15% range can match premium accuracy. Our data suggests yes — but only with the right pre-processing.
Four Things We Learned
1. Pre-processing matters more than model choice
The biggest practical gains came not from swapping models but from changing the input mode. Claude Haiku 4.5 went from ~77 seconds with raw PDF to 59 seconds with pre-extracted text while maintaining 112/112 controls. The text pre-processing step (OpenDataLoader, deterministic, ~2 seconds) does the heavy lifting of layout parsing, so the AI model can focus on comprehension.
This cut processing time by roughly a quarter and token cost by 37%.
2. The cheapest model missed 9 controls — consistently
Gemini 2.0 Flash found 103 of 112 controls with raw PDF, and 100 with pre-extracted text. The 9–12 missing controls were not random. They clustered in the Availability (A1), Confidentiality (C1), and Processing Integrity (PI1) categories — the "tail" of the TSC distribution. This is a pattern, not noise.
In compliance, 92% recall sounds acceptable until you realise the missing 8% is where gaps hide.
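The clustering is easy to check mechanically: bucket each missed control by its TSC category prefix. A sketch with hypothetical IDs mirroring the pattern we observed:

```python
import re
from collections import Counter

def category_of(control_id: str) -> str:
    """Map a control ID like 'A1.2.3' or 'CC6.1.4' to its TSC category ('A1', 'CC6')."""
    m = re.match(r"([A-Z]+\d+)", control_id)
    return m.group(1) if m else "?"

def missed_by_category(ground_truth: set[str], extracted: set[str]) -> Counter:
    """Tally ground-truth controls the model failed to extract, per category."""
    return Counter(category_of(c) for c in ground_truth - extracted)

# Hypothetical miss pattern mirroring what we observed: the tail categories.
truth = {"CC1.1.1", "CC2.1.1", "A1.1.1", "A1.2.1", "C1.1.1", "PI1.1.1"}
found = {"CC1.1.1", "CC2.1.1", "A1.1.1"}
# missed_by_category(truth, found) -> Counter({'A1': 1, 'C1': 1, 'PI1': 1})
```

If the misses were random noise, the counter would be spread across all 12 categories; ours concentrated in A1, C1, and PI1 on every run.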
3. Text mode helped one model and hurt another
Claude Haiku 4.5 performed equally well in both modes (112 controls either way, just faster/cheaper with text). Gemini 2.0 Flash performed *worse* with text (100 controls, missing the audit firm) than with raw PDF (103 controls, full metadata). The PDF visual layout apparently carries semantic cues that Gemini relies on.
This means there is no universal "best input mode" — it depends on the model. A robust pipeline should test both and fall back intelligently.
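One way to encode that fallback is to run the cheap text mode first and retry with the raw PDF when the result fails basic sanity checks (too few controls, missing audit firm). A sketch; the extractor callables are hypothetical stand-ins, not real APIs:

```python
def extract_with_fallback(pdf_bytes: bytes,
                          extract_from_text,  # stand-in: markdown -> result dict
                          extract_from_pdf,   # stand-in: PDF bytes -> result dict
                          to_markdown,        # stand-in: PDF bytes -> markdown
                          min_controls: int = 1):
    """Try the cheaper text mode first; fall back to raw PDF if the
    result fails basic sanity checks."""
    result = extract_from_text(to_markdown(pdf_bytes))
    sane = (len(result.get("controls", [])) >= min_controls
            and bool(result.get("audit_firm")))
    if sane:
        return result, "text"
    # Text mode produced a suspicious result: retry with the raw document.
    return extract_from_pdf(pdf_bytes), "raw_pdf"
```

The sanity checks are the interesting part: they encode what "obviously wrong" looks like for your document class, which for SOC 2 means a control count far below the dozens a real report contains, or missing core metadata.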
4. We tried parallel chunking. It failed.
To speed up parsing, we split the pre-extracted text into 3 chunks and ran them concurrently. Wall-clock time dropped from 59 seconds to 38 seconds (1.55x speedup). But the chunks produced 82 false-positive controls — the model misidentified TSC criteria headings (CC5.1) as actual controls (CC5.1.1) when it could not see the full document context.
We rejected the approach. Accuracy is not negotiable.
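The false positives were at least easy to characterize after the fact: TSC criteria IDs have two dot-separated levels (CC5.1), actual controls have three (CC5.1.1). A post-filter like this catches the shallow IDs, though it cannot recover controls a chunk never saw, which is why we dropped chunking entirely rather than patching it:

```python
import re

CRITERION = re.compile(r"^[A-Z]{1,2}\d+\.\d+$")       # e.g. CC5.1  (a TSC criterion)
CONTROL   = re.compile(r"^[A-Z]{1,2}\d+\.\d+\.\d+$")  # e.g. CC5.1.1 (an actual control)

def classify(identifier: str) -> str:
    """Distinguish real control IDs from criteria headings the chunked
    model misreported as controls."""
    if CONTROL.match(identifier):
        return "control"
    if CRITERION.match(identifier):
        return "criterion"
    return "unknown"

def filter_controls(extracted_ids: list[str]) -> list[str]:
    """Drop criteria headings that leaked into the control list."""
    return [i for i in extracted_ids if classify(i) == "control"]
```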
Validating Beyond SOC 2: ExtractBench Results
A single SOC 2 report is not enough to prove a pipeline works. So we ran our exact pipeline — OpenDataLoader text extraction + Claude Haiku 4.5 — against ExtractBench, a published academic benchmark by Contextual AI.
ExtractBench tests structured JSON extraction from 35 real PDFs across 5 schemas: SEC 10-K/10-Q filings (369 fields), credit agreements, academic papers, resumes, and swimming championship results. It is considered one of the hardest document extraction benchmarks — the best published model achieved only 6.9% pass rate.
The Results
| Model | Pass Rate | Documents |
|---|---|---|
| ArvexLab (ODL + Claude Haiku 4.5) | 32.2% | 35/35 |
| Gemini 3 Flash | 6.9% | 35/35 |
| Gemini 3 Pro | 5.5% | 35/35 |
| GPT-5.2 | 5.2% | 35/35 |
| GPT-5 | 4.0% | 35/35 |
| Claude Sonnet 4.5 | 3.3% | 35/35 |
| Claude Opus 4.5 | 2.5% | 35/35 |
ArvexLab scored 32.2% across all 35 documents — 4.7x the best published baseline (Gemini 3 Flash at 6.9%).
Per-Domain Breakdown
| Domain | Pass Rate | Documents | Notes |
|---|---|---|---|
| Swimming (tables) | 88.8% | 5 | Structured table extraction — ODL preserves table layout |
| Credit agreements | 62.0% | 10 | Legal document parsing — complex nested party/term structures |
| Academic papers | 55.9% | 4 | Research paper metadata, citations, author affiliations |
| Resumes | 0% | 7 | Evaluator schema error (`anyOf` type unsupported) — extraction succeeded but scoring failed |
| SEC 10-Q filings | 10.4% | 7 | 369-field schema — all published models scored near 0% on these |
Why the Gap Is So Large
The published baselines send raw PDFs directly to frontier models. Our pipeline extracts clean, structured markdown first, then sends text to a mid-tier model. The result: a $0.02/document pipeline with a budget model outperforms $1+/document premium models by 4.7x.
This confirms what our SOC 2 test showed: pre-processing is the multiplier. The ODL text extraction gives the AI model clean input with preserved table structure and reading order, eliminating the layout parsing burden that causes frontier models to hallucinate or miss fields.
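Structurally, the whole pipeline is two stages. In this sketch `run_opendataloader` and `call_model` are placeholders for the real tool invocation and model client, not their actual APIs:

```python
def parse_compliance_pdf(pdf_path: str,
                         run_opendataloader,  # placeholder: PDF path -> markdown
                         call_model):         # placeholder: prompt -> JSON dict
    """Two-stage extraction: deterministic layout pass, then one model call."""
    # Stage 1 (deterministic, ~2s): layout parsing with tables and reading
    # order preserved, so the model never has to reconstruct structure.
    markdown = run_opendataloader(pdf_path)
    # Stage 2: a single comprehension-only model call over clean text.
    return call_model(
        "Extract every control, all metadata fields, and any exceptions "
        "as JSON.\n\n" + markdown
    )
```

Keeping stage 1 deterministic also makes failures debuggable: if a control is missing, you can check whether it ever appeared in the markdown before blaming the model.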
Caveats
- 2 of 35 documents failed extraction due to network timeouts on very large academic papers (126K+ chars). These received a score of 0, included in the 32.2% average
- 7 resume documents could not be scored due to a schema evaluation bug (`anyOf` type unsupported by ExtractBench evaluator). Extraction succeeded but scoring failed — these are counted as 0 in the average. Without this evaluator limitation, our true score would be higher
- LLM-based semantic metrics were run using Claude Haiku 4.5 via LiteLLM, matching the methodology of the published baselines (which used Gemini 2.5 Flash). Some array comparisons hit rate limits and were scored as errors, so a small number of fields may be underscored
- 23 of 35 documents were fully scored with the official ExtractBench evaluation suite including LLM metrics
What We Chose for ArvexLab
Based on both the SOC 2 deep-dive and the ExtractBench validation, ArvexLab uses Claude Haiku 4.5 with OpenDataLoader text pre-processing.
On our compliance-specific task (SOC 2), it found all 112 controls with zero false positives. On the general-purpose ExtractBench (35 documents, 5 schemas), it scored 32.2% — 4.7x the best published model — at 2 cents per document.
We could have chosen a premium model. But the data shows that pre-processing beats model size. A mid-tier model with clean input outperforms frontier models with raw input — at a fraction of the cost.
Invest in your input pipeline before upgrading your model. That is the engineering lesson from both benchmarks.
Sources and References
- OpenDataLoader PDF — Benchmark Results — #1 ranked open-source PDF extraction tool, 0.907 overall accuracy across 200 real-world documents
- Anthropic — Claude Haiku 4.5 Model Card — Model specifications and capabilities
- Google — Gemini 2.0 Flash — Model specifications and pricing
- ExtractBench — Evaluating LLMs for Structured Document Extraction — Contextual AI benchmark for PDF-to-JSON extraction, finding only 51% valid JSON rate across frontier models
- OmniDocBench — Multi-Dimensional Benchmark for Document Parsing — CVPR 2025, 1,651 PDF pages across 10 document types
- LLMStructBench — Benchmarking LLMs on Structured Output Extraction — 995 tests across 22 models showing prompting strategy matters more than model size
- AICPA — SOC 2 Trust Services Criteria — Official Trust Services Criteria framework
- SCORE-Bench — Structural and Content Robust Evaluation for Document Parsing — Unstructured.io benchmark separating representational diversity from extraction errors
- ParseBench — Enterprise Document Parsing Benchmark — LlamaIndex benchmark with 167,000+ test rules across 2,000 enterprise document pages