Back to Resources
AI in Compliance17 April 202610 min read

We Tested AI Models on a Real SOC 2 Report. Here Is What Actually Extracts All the Controls.

By ArvexLab Team — AI Engineering

Why We Ran This Test

When we built ArvexLab, we had to choose an AI model for parsing compliance documents — SOC 2 reports, ISO 27001 certificates, contracts, policies. The market has dozens of models at wildly different price points. The question was straightforward: does paying more actually get you more controls?

We could not find a published benchmark for compliance document extraction. DocVQA tests visual question answering. ExtractBench tests financial JSON extraction. Nothing tests what we actually need: pulling every control, every metadata field, every exception from a real audit report.

So we built one.

The Test

Document

A real SOC 2 Type II report (43 pages, 860KB) covering October 2024 to September 2025. Not a synthetic document. The kind of report compliance teams process every week. It contains 112 controls across 12 Trust Services Criteria categories, audited by a mid-market firm, opinion: unqualified.

Ground Truth

A compliance analyst verified every control in the PDF manually. The verified count: 112 controls, 12 TSC categories (CC1–CC9, A1, C1, PI1), 8 metadata fields (report type, firm, opinion, dates, criteria, description).

What We Tested

Two models, two input modes — same prompt, same JSON schema, temperature 0:

ModelClassWhy We Picked It
Claude Haiku 4.5 (Anthropic)Mid-tierStrong structured extraction, our candidate for production
Gemini 2.0 Flash (Google)Budget10x cheaper per token, widely used for document tasks

Each model was tested with:

  • Raw PDF: the standard approach — send the PDF as base64 directly to the model
  • Pre-extracted text: PDF first processed through OpenDataLoader (the top-ranked open-source PDF parser, 0.907 accuracy on 200 real-world docs), then structured markdown sent instead of raw PDF

What We Found

The Numbers

PipelineControls FoundMissedRecallMetadataTSC CoverageTimeRelative Cost
Claude Haiku 4.5 + text pre-processing112 / 1120100%8/8100%58.7s3.8%
Gemini 2.0 Flash + raw PDF103 / 112992%8/892%56.1s0.3%
Gemini 2.0 Flash + text pre-processing100 / 1121289%6/892%39.1s0.4%

> Cost shown as a percentage of the most expensive available model (Claude Opus 4.6 = 100%). Based on official API pricing applied to our measured token profile: 22K input / 9.7K output tokens per document.

Where the Models Sit in the Landscape

To put our results in context, here is where every major model lands on the cost spectrum for this same document. Only the models we tested have measured accuracy. Others show published pricing for comparison.

``` Relative Model Cost (% of max) Cost Tier ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Claude Opus 4.6 ████████████████████ 100% Premium Claude Sonnet 4.6 ███ 14.2% Mid-high GPT-4o ██ 10.8% Mid-high Gemini 2.5 Pro ██ 8.1% Mid GPT-4.1 █ 6.8% Mid Claude Haiku 4.5 + text ● 112/112█ 3.8% Mid-low Gemini 2.5 Flash ▌ 2.7% Budget+ GPT-4o mini ▏ 0.5% Budget Gemini 2.0 Flash ● 103/112▏ 0.3% Budget Gemini 2.0 Flash Lite ▏ 0.2% Budget

● = Tested (measured accuracy) Unmarked = cost only (not tested) ```

What this shows: there is a wide gap between the budget tier (under 1% of max cost) and the premium tier (100%). The interesting engineering question is whether anything in the 2–15% range can match premium accuracy. Our data suggests yes — but only with the right pre-processing.

Four Things We Learned

1. Pre-processing matters more than model choice

The biggest accuracy difference was not between models — it was between input modes. Claude Haiku 4.5 went from ~77 seconds with raw PDF to 59 seconds with pre-extracted text, while maintaining 112/112 controls. The text pre-processing step (OpenDataLoader, deterministic, ~2 seconds) does the heavy lifting of layout parsing so the AI model can focus on comprehension.

This saved roughly 30% in processing time and 37% in token cost.

2. The cheapest model missed 9 controls — consistently

Gemini 2.0 Flash found 103 of 112 controls with raw PDF, and 100 with pre-extracted text. The 9–12 missing controls were not random. They clustered in the Availability (A1), Confidentiality (C1), and Processing Integrity (PI1) categories — the "tail" of the TSC distribution. This is a pattern, not noise.

In compliance, 92% recall sounds acceptable until you realise the missing 8% is where gaps hide.

3. Text mode helped one model and hurt another

Claude Haiku 4.5 performed equally well in both modes (112 controls either way, just faster/cheaper with text). Gemini 2.0 Flash performed *worse* with text (100 controls, missing the audit firm) than with raw PDF (103 controls, full metadata). The PDF visual layout apparently carries semantic cues that Gemini relies on.

This means there is no universal "best input mode" — it depends on the model. A robust pipeline should test both and fall back intelligently.

4. We tried parallel chunking. It failed.

To speed up parsing, we split the pre-extracted text into 3 chunks and ran them concurrently. Wall-clock time dropped from 59 seconds to 38 seconds (1.55x speedup). But the chunks produced 82 false-positive controls — the model misidentified TSC criteria headings (CC5.1) as actual controls (CC5.1.1) when it could not see the full document context.

We rejected the approach. Accuracy is not negotiable.

Validating Beyond SOC 2: ExtractBench Results

A single SOC 2 report is not enough to prove a pipeline works. So we ran our exact pipeline — OpenDataLoader text extraction + Claude Haiku 4.5 — against ExtractBench, a published academic benchmark by Contextual AI.

ExtractBench tests structured JSON extraction from 35 real PDFs across 5 schemas: SEC 10-K/10-Q filings (369 fields), credit agreements, academic papers, resumes, and swimming championship results. It is considered one of the hardest document extraction benchmarks — the best published model achieved only 6.9% pass rate.

The Results

ModelPass RateDocuments
ArvexLab (ODL + Claude Haiku 4.5)32.2%35/35
Gemini 3 Flash6.9%35/35
Gemini 3 Pro5.5%35/35
GPT-5.25.2%35/35
GPT-54.0%35/35
Claude Sonnet 4.53.3%35/35
Claude Opus 4.52.5%35/35

ArvexLab scored 32.2% across all 35 documents — 4.7x the best published baseline (Gemini 3 Flash at 6.9%).

Per-Domain Breakdown

DomainPass RateDocumentsNotes
Swimming (tables)88.8%5Structured table extraction — ODL preserves table layout
Credit agreements62.0%10Legal document parsing — complex nested party/term structures
Academic papers55.9%4Research paper metadata, citations, author affiliations
Resumes0%7Evaluator schema error (`anyOf` type unsupported) — extraction succeeded but scoring failed
SEC 10-Q filings10.4%7369-field schema — all published models scored near 0% on these

Why the Gap Is So Large

The published baselines send raw PDFs directly to frontier models. Our pipeline extracts clean, structured markdown first, then sends text to a mid-tier model. The result: a $0.02/document pipeline with a budget model outperforms $1+/document premium models by 4x.

This confirms what our SOC 2 test showed: pre-processing is the multiplier. The ODL text extraction gives the AI model clean input with preserved table structure and reading order, eliminating the layout parsing burden that causes frontier models to hallucinate or miss fields.

Caveats

  • 2 of 35 documents failed extraction due to network timeouts on very large academic papers (126K+ chars). These received a score of 0, included in the 32.2% average
  • 7 resume documents could not be scored due to a schema evaluation bug (`anyOf` type unsupported by ExtractBench evaluator). Extraction succeeded but scoring failed — these are counted as 0 in the average. Without this evaluator limitation, our true score would be higher
  • LLM-based semantic metrics were run using Claude Haiku 4.5 via LiteLLM, matching the methodology of the published baselines (which used Gemini 2.5 Flash). Some array comparisons hit rate limits and fell back to error — a small number of fields may be underscored
  • 23 of 35 documents were fully scored with the official ExtractBench evaluation suite including LLM metrics

What We Chose for ArvexLab

Based on both the SOC 2 deep-dive and the ExtractBench validation, ArvexLab uses Claude Haiku 4.5 with OpenDataLoader text pre-processing.

On our compliance-specific task (SOC 2), it found all 112 controls with zero false positives. On the general-purpose ExtractBench (35 documents, 5 schemas), it scored 32.2% — 4.7x the best published model — at 2 cents per document.

We could have chosen a premium model. But the data shows that pre-processing beats model size. A mid-tier model with clean input outperforms frontier models with raw input — at a fraction of the cost.

Invest in your input pipeline before upgrading your model. That is the engineering lesson from both benchmarks.

Sources and References

Ready to assess your NIS2 readiness?

Use our free self-assessment tool or speak with our compliance team.

Get NIS2 Insights Weekly

Stay ahead of EU compliance requirements. Practical guidance on NIS2, DORA, and third-party risk management delivered to your inbox.