We Tested AI Models on a Real SOC 2 Report. Here Is What Actually Extracts All the Controls.
By ArvexLab Team — AI Engineering
Why We Ran This Test
When we built ArvexLab, we had to choose an AI model for parsing compliance documents — SOC 2 reports, ISO 27001 certificates, contracts, policies. The market has dozens of models at wildly different price points. The question was straightforward: does paying more actually get you more controls?
We could not find a published benchmark for compliance document extraction. DocVQA tests visual question answering. ExtractBench tests structured JSON extraction from financial filings and other business documents. Nothing tests what we actually need: pulling every control, every metadata field, and every exception from a real audit report.
So we built one.
The Test
Document
A real SOC 2 Type II report (43 pages, 860KB) covering October 2024 to September 2025. Not a synthetic document. The kind of report compliance teams process every week. It contains 112 controls across 12 Trust Services Criteria categories, audited by a mid-market firm, opinion: unqualified.
Ground Truth
A compliance analyst verified every control in the PDF manually. The verified count: 112 controls, 12 TSC categories (CC1–CC9, A1, C1, PI1), 8 metadata fields (report type, firm, opinion, dates, criteria, description).
What We Tested
Two models, two input modes — same prompt, same JSON schema, temperature 0:
| Model | Class | Why We Picked It |
|---|---|---|
| Claude Haiku 4.5 (Anthropic) | Mid-tier | Strong structured extraction, our candidate for production |
| Gemini 2.0 Flash (Google) | Budget | 10x cheaper per token, widely used for document tasks |
Each model was tested with:
- Raw PDF: the standard approach — send the PDF as base64 directly to the model
- Pre-extracted text: PDF first processed through OpenDataLoader (the top-ranked open-source PDF parser, 0.907 accuracy on 200 real-world docs), then structured markdown sent instead of raw PDF
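Concretely, the two input modes differ only in how the document is packaged for the model. A minimal sketch in an Anthropic-style message shape (the schema and prompt are simplified stand-ins for our production versions, and the actual API call is omitted):

```python
import base64
import json

# Simplified stand-in for the production JSON schema (illustrative only).
CONTROL_SCHEMA = {
    "type": "object",
    "properties": {
        "controls": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "string"},        # e.g. "CC5.1.1"
                    "category": {"type": "string"},  # e.g. "CC5"
                    "description": {"type": "string"},
                },
                "required": ["id", "category", "description"],
            },
        }
    },
    "required": ["controls"],
}

PROMPT = "Extract every control as JSON matching this schema:\n" + json.dumps(CONTROL_SCHEMA)

def raw_pdf_message(pdf_bytes: bytes) -> dict:
    """Mode 1: send the PDF itself, base64-encoded, as a document block."""
    return {
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": base64.b64encode(pdf_bytes).decode()}},
            {"type": "text", "text": PROMPT},
        ],
    }

def pre_extracted_message(markdown_text: str) -> dict:
    """Mode 2: send markdown produced by a deterministic PDF parser."""
    return {
        "role": "user",
        "content": [
            {"type": "text",
             "text": PROMPT + "\n\n--- DOCUMENT ---\n" + markdown_text},
        ],
    }
```

Everything else (prompt, schema, temperature 0) stays identical between modes, so any accuracy difference is attributable to the input packaging.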
What We Found
The Numbers
| Pipeline | Controls Found | Missed | Recall | Metadata | TSC Coverage | Time | Relative Cost |
|---|---|---|---|---|---|---|---|
| Claude Haiku 4.5 + text pre-processing | 112 / 112 | 0 | 100% | 8/8 | 100% | 58.7s | 3.8% |
| Gemini 2.0 Flash + raw PDF | 103 / 112 | 9 | 92% | 8/8 | 92% | 56.1s | 0.3% |
| Gemini 2.0 Flash + text pre-processing | 100 / 112 | 12 | 89% | 6/8 | 92% | 39.1s | 0.4% |
> Cost shown as a percentage of the most expensive available model (Claude Opus 4.6 = 100%). Based on official API pricing applied to our measured token profile: 22K input / 9.7K output tokens per document.
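Recall in the table is simply ground-truth controls matched over the 112 verified ones; spurious extractions never inflate it. A toy sketch with hypothetical control IDs:

```python
def recall(ground_truth: set[str], extracted: set[str]) -> tuple[int, float]:
    """Count ground-truth controls the model found; extras don't raise recall."""
    found = ground_truth & extracted
    return len(found), len(found) / len(ground_truth)

# Toy example: 4 controls in the ground truth, model finds 3 of them.
truth = {"CC1.1.1", "CC1.1.2", "A1.2.1", "C1.1.1"}
model = {"CC1.1.1", "CC1.1.2", "A1.2.1", "CC9.9.9"}  # one miss, one extra
n, r = recall(truth, model)
# n == 3, r == 0.75; the spurious CC9.9.9 does not inflate recall
```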
Where the Models Sit in the Landscape
To put our results in context, here is where every major model lands on the cost spectrum for this same document. Only the models we tested have measured accuracy; the others are shown at published pricing for comparison.
```
Relative Model Cost (% of max)                          Cost Tier
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Claude Opus 4.6          ████████████████████  100%    Premium
Claude Sonnet 4.6        ███                    14.2%  Mid-high
GPT-4o                   ██                     10.8%  Mid-high
Gemini 2.5 Pro           ██                      8.1%  Mid
GPT-4.1                  █                       6.8%  Mid
Claude Haiku 4.5 + text  █ ● 112/112             3.8%  Mid-low
Gemini 2.5 Flash         ▌                       2.7%  Budget+
GPT-4o mini              ▏                       0.5%  Budget
Gemini 2.0 Flash         ▏ ● 103/112             0.3%  Budget
Gemini 2.0 Flash Lite    ▏                       0.2%  Budget

● = Tested (measured accuracy)   Unmarked = cost only (not tested)
```
What this shows: there is a wide gap between the budget tier (under 1% of max cost) and the premium tier (100%). The interesting engineering question is whether anything in the 2–15% range can match premium accuracy. Our data suggests yes — but only with the right pre-processing.
Four Things We Learned
1. Pre-processing matters more than model choice
The biggest practical gains came not from swapping models but from changing the input mode. Claude Haiku 4.5 went from ~77 seconds with raw PDF to 59 seconds with pre-extracted text while maintaining 112/112 controls. The text pre-processing step (OpenDataLoader, deterministic, ~2 seconds) does the heavy lifting of layout parsing, so the AI model can focus on comprehension.
This cut processing time by roughly a quarter and token cost by 37%.
2. The cheapest model missed 9 controls — consistently
Gemini 2.0 Flash found 103 of 112 controls with raw PDF, and 100 with pre-extracted text. The 9–12 missing controls were not random. They clustered in the Availability (A1), Confidentiality (C1), and Processing Integrity (PI1) categories — the "tail" of the TSC distribution. This is a pattern, not noise.
In compliance, 92% recall sounds acceptable until you realise the missing 8% is where gaps hide.
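The clustering is easy to check mechanically: bucket each missed control by its TSC category prefix. A sketch with hypothetical IDs mirroring the pattern we observed:

```python
import re
from collections import Counter

def category_of(control_id: str) -> str:
    """Map a control ID like 'A1.2.3' or 'CC6.1.4' to its TSC category ('A1', 'CC6')."""
    m = re.match(r"([A-Z]+\d+)", control_id)
    return m.group(1) if m else "?"

def missed_by_category(ground_truth: set[str], extracted: set[str]) -> Counter:
    """Tally ground-truth controls the model failed to extract, per category."""
    return Counter(category_of(c) for c in ground_truth - extracted)

# Hypothetical miss pattern mirroring what we observed: the tail categories.
truth = {"CC1.1.1", "CC2.1.1", "A1.1.1", "A1.2.1", "C1.1.1", "PI1.1.1"}
found = {"CC1.1.1", "CC2.1.1", "A1.1.1"}
# missed_by_category(truth, found) -> Counter({'A1': 1, 'C1': 1, 'PI1': 1})
```

If the misses were random noise, the counter would be spread across all 12 categories; ours concentrated in A1, C1, and PI1 on every run.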
3. Text mode helped one model and hurt another
Claude Haiku 4.5 performed equally well in both modes (112 controls either way, just faster/cheaper with text). Gemini 2.0 Flash performed *worse* with text (100 controls, missing the audit firm) than with raw PDF (103 controls, full metadata). The PDF visual layout apparently carries semantic cues that Gemini relies on.
This means there is no universal "best input mode" — it depends on the model. A robust pipeline should test both and fall back intelligently.
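One way to encode that fallback is to run the cheap text mode first and retry with the raw PDF when the result fails basic sanity checks (too few controls, missing audit firm). A sketch; the extractor callables are hypothetical stand-ins, not real APIs:

```python
def extract_with_fallback(pdf_bytes: bytes,
                          extract_from_text,  # stand-in: markdown -> result dict
                          extract_from_pdf,   # stand-in: PDF bytes -> result dict
                          to_markdown,        # stand-in: PDF bytes -> markdown
                          min_controls: int = 1):
    """Try the cheaper text mode first; fall back to raw PDF if the
    result fails basic sanity checks."""
    result = extract_from_text(to_markdown(pdf_bytes))
    sane = (len(result.get("controls", [])) >= min_controls
            and bool(result.get("audit_firm")))
    if sane:
        return result, "text"
    # Text mode produced a suspicious result: retry with the raw document.
    return extract_from_pdf(pdf_bytes), "raw_pdf"
```

The sanity checks are the interesting part: they encode what "obviously wrong" looks like for your document class, which for SOC 2 means a control count far below the dozens a real report contains, or missing core metadata.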
4. We tried parallel chunking. It failed.
To speed up parsing, we split the pre-extracted text into 3 chunks and ran them concurrently. Wall-clock time dropped from 59 seconds to 38 seconds (1.55x speedup). But the chunks produced 82 false-positive controls — the model misidentified TSC criteria headings (CC5.1) as actual controls (CC5.1.1) when it could not see the full document context.
We rejected the approach. Accuracy is not negotiable.
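The false positives were at least easy to characterize after the fact: TSC criteria IDs have two dot-separated levels (CC5.1), actual controls have three (CC5.1.1). A post-filter like this catches the shallow IDs, though it cannot recover controls a chunk never saw, which is why we dropped chunking entirely rather than patching it:

```python
import re

CRITERION = re.compile(r"^[A-Z]{1,2}\d+\.\d+$")       # e.g. CC5.1  (a TSC criterion)
CONTROL   = re.compile(r"^[A-Z]{1,2}\d+\.\d+\.\d+$")  # e.g. CC5.1.1 (an actual control)

def classify(identifier: str) -> str:
    """Distinguish real control IDs from criteria headings the chunked
    model misreported as controls."""
    if CONTROL.match(identifier):
        return "control"
    if CRITERION.match(identifier):
        return "criterion"
    return "unknown"

def filter_controls(extracted_ids: list[str]) -> list[str]:
    """Drop criteria headings that leaked into the control list."""
    return [i for i in extracted_ids if classify(i) == "control"]
```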
Validating Beyond SOC 2: ExtractBench Results
A single SOC 2 report is not enough to prove a pipeline works. So we ran our exact pipeline — OpenDataLoader text extraction + Claude Haiku 4.5 — against ExtractBench, a published academic benchmark by Contextual AI.
ExtractBench tests structured JSON extraction from 35 real PDFs across 5 schemas: SEC 10-K/10-Q filings (369 fields), credit agreements, academic papers, resumes, and swimming championship results. It is considered one of the hardest document extraction benchmarks — the best published model achieved only 6.9% pass rate.
The Results
| Model | Pass Rate | Documents |
|---|---|---|
| ArvexLab (ODL + Claude Haiku 4.5) | 32.2% | 35/35 |
| Gemini 3 Flash | 6.9% | 35/35 |
| Gemini 3 Pro | 5.5% | 35/35 |
| GPT-5.2 | 5.2% | 35/35 |
| GPT-5 | 4.0% | 35/35 |
| Claude Sonnet 4.5 | 3.3% | 35/35 |
| Claude Opus 4.5 | 2.5% | 35/35 |
ArvexLab scored 32.2% across all 35 documents — 4.7x the best published baseline (Gemini 3 Flash at 6.9%).
Per-Domain Breakdown
| Domain | Pass Rate | Documents | Notes |
|---|---|---|---|
| Swimming (tables) | 88.8% | 5 | Structured table extraction — ODL preserves table layout |
| Credit agreements | 62.0% | 10 | Legal document parsing — complex nested party/term structures |
| Academic papers | 55.9% | 4 | Research paper metadata, citations, author affiliations |
| Resumes | 0% | 7 | Evaluator schema error (`anyOf` type unsupported) — extraction succeeded but scoring failed |
| SEC 10-Q filings | 10.4% | 7 | 369-field schema — all published models scored near 0% on these |
Why the Gap Is So Large
The published baselines send raw PDFs directly to frontier models. Our pipeline extracts clean, structured markdown first, then sends text to a mid-tier model. The result: a $0.02/document pipeline with a budget model outperforms $1+/document premium models by 4.7x.
This confirms what our SOC 2 test showed: pre-processing is the multiplier. The ODL text extraction gives the AI model clean input with preserved table structure and reading order, eliminating the layout parsing burden that causes frontier models to hallucinate or miss fields.
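Structurally, the whole pipeline is two stages. In this sketch `run_opendataloader` and `call_model` are placeholders for the real tool invocation and model client, not their actual APIs:

```python
def parse_compliance_pdf(pdf_path: str,
                         run_opendataloader,  # placeholder: PDF path -> markdown
                         call_model):         # placeholder: prompt -> JSON dict
    """Two-stage extraction: deterministic layout pass, then one model call."""
    # Stage 1 (deterministic, ~2s): layout parsing with tables and reading
    # order preserved, so the model never has to reconstruct structure.
    markdown = run_opendataloader(pdf_path)
    # Stage 2: a single comprehension-only model call over clean text.
    return call_model(
        "Extract every control, all metadata fields, and any exceptions "
        "as JSON.\n\n" + markdown
    )
```

Keeping stage 1 deterministic also makes failures debuggable: if a control is missing, you can check whether it ever appeared in the markdown before blaming the model.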
Caveats
- 2 of 35 documents failed extraction due to network timeouts on very large academic papers (126K+ chars). These received a score of 0, included in the 32.2% average
- 7 resume documents could not be scored due to a schema evaluation bug (`anyOf` type unsupported by ExtractBench evaluator). Extraction succeeded but scoring failed — these are counted as 0 in the average. Without this evaluator limitation, our true score would be higher
- LLM-based semantic metrics were run using Claude Haiku 4.5 via LiteLLM, matching the methodology of the published baselines (which used Gemini 2.5 Flash). Some array comparisons hit rate limits and were scored as errors, so a small number of fields may be underscored
- 23 of 35 documents were fully scored with the official ExtractBench evaluation suite including LLM metrics
What We Chose for ArvexLab
Based on both the SOC 2 deep-dive and the ExtractBench validation, ArvexLab uses Claude Haiku 4.5 with OpenDataLoader text pre-processing.
On our compliance-specific task (SOC 2), it found all 112 controls with zero false positives. On the general-purpose ExtractBench (35 documents, 5 schemas), it scored 32.2% — 4.7x the best published model — at 2 cents per document.
We could have chosen a premium model. But the data shows that pre-processing beats model size. A mid-tier model with clean input outperforms frontier models with raw input — at a fraction of the cost.
Invest in your input pipeline before upgrading your model. That is the engineering lesson from both benchmarks.
Sources and References
- OpenDataLoader PDF — Benchmark Results — #1 ranked open-source PDF extraction tool, 0.907 overall accuracy across 200 real-world documents
- Anthropic — Claude Haiku 4.5 Model Card — Model specifications and capabilities
- Google — Gemini 2.0 Flash — Model specifications and pricing
- ExtractBench — Evaluating LLMs for Structured Document Extraction — Contextual AI benchmark for PDF-to-JSON extraction, finding only 51% valid JSON rate across frontier models
- OmniDocBench — Multi-Dimensional Benchmark for Document Parsing — CVPR 2025, 1,651 PDF pages across 10 document types
- LLMStructBench — Benchmarking LLMs on Structured Output Extraction — 995 tests across 22 models showing prompting strategy matters more than model size
- AICPA — SOC 2 Trust Services Criteria — Official Trust Services Criteria framework
- SCORE-Bench — Structural and Content Robust Evaluation for Document Parsing — Unstructured.io benchmark separating representational diversity from extraction errors
- ParseBench — Enterprise Document Parsing Benchmark — LlamaIndex benchmark with 167,000+ test rules across 2,000 enterprise document pages