Research Note · November 2025 · TaxOS Research Team

Fixing LLM Benchmarking for Tax, Finance & Accounting

Why current LLM evaluation methods fail for professional services — and how we're building a better framework.

Work in Progress

The Problem

Standard LLM benchmarks like MMLU, HumanEval, and HellaSwag measure general knowledge, coding ability, and commonsense reasoning, but they miss what matters for tax, finance, and accounting professionals.

These fields require:

  • Precise regulatory knowledge — IRC sections, GAAP standards, SOX compliance
  • Multi-step reasoning — calculating tax liability across multiple jurisdictions
  • Document understanding — parsing 1040s, balance sheets, K-1s
  • Temporal accuracy — distinguishing 2023 vs 2024 tax code changes
  • Professional judgment — when to escalate vs automate

Current benchmarks don't test any of this. A model can score 90% on MMLU and still fail to calculate depreciation correctly or to identify a Section 179 deduction.

Our Approach

We're building TaxBench, FinBench, and AcctBench: domain-specific evaluation frameworks that test what actually matters. We benchmark frontier models (GPT-4, Claude 3.5, Gemini 1.5) and leading open-source models against rigorous professional standards:

1. Real-World Task Simulation

Instead of multiple-choice questions, we evaluate models on actual workflows:

  • "Calculate AMT for a taxpayer with ISO stock options and rental income"
  • "Identify missing Schedule C deductions from bank statements"
  • "Draft a response to IRS Notice CP2000 with supporting documentation"
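Internally, each workflow task can be represented as a structured record rather than a multiple-choice item. A minimal sketch in Python; every field name here is an illustrative assumption, not the actual TaxBench schema:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkTask:
    """One real-world workflow task (illustrative schema, not the real format)."""
    task_id: str
    prompt: str                  # the workflow instruction given to the model
    tax_year: int                # rules in effect for this scenario
    required_citations: list = field(default_factory=list)  # authorities the answer must cite
    grading_rubric: dict = field(default_factory=dict)      # scored step -> points

task = BenchmarkTask(
    task_id="tax-amt-001",
    prompt="Calculate AMT for a taxpayer with ISO stock options and rental income.",
    tax_year=2025,
    required_citations=["IRC §55", "IRC §56(b)(3)"],
    grading_rubric={"amt_income": 3, "exemption_phaseout": 2, "tentative_minimum_tax": 3},
)
```

A rubric attached to each task lets graders award partial credit per step instead of marking the whole workflow pass/fail.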

2. Citation & Source Verification

Accuracy isn't enough; models must prove their work. We test for:

  • Correct citation of specific IRC sections and subsections
  • Reference to relevant case law and revenue rulings
  • Ability to distinguish between primary authority and guidance
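One way to automate the first of these checks is to extract section references from a model's answer and compare them against a gold set. A rough sketch, assuming a simplified citation format; a real verifier would also need to handle case law and revenue ruling citations:

```python
import re

# Matches citations like "IRC §179(b)(1)", "Section 168(k)", "26 U.S.C. §1031"
IRC_PATTERN = re.compile(
    r"(?:IRC\s*§|Section|26\s*U\.S\.C\.\s*§)\s*(\d+[A-Za-z]?(?:\([0-9a-zA-Z]+\))*)",
    re.IGNORECASE,
)

def extract_irc_citations(answer: str) -> set:
    """Pull normalized IRC section references out of a model's answer."""
    return {m.group(1) for m in IRC_PATTERN.finditer(answer)}

def citation_recall(answer: str, required: set) -> float:
    """Fraction of required citations the model actually produced."""
    found = extract_irc_citations(answer)
    return len(found & required) / len(required) if required else 1.0

answer = "Under IRC §179(b)(1) the deduction is capped; see also Section 168(k)."
print(citation_recall(answer, {"179(b)(1)", "168(k)"}))  # 1.0
```

Recall alone is not sufficient: a grader should also penalize fabricated citations, i.e. extracted references that point at sections which do not support the claim.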

3. Temporal Versioning

The tax code changes every year. We test models on:

  • Current year rules (2025 tax code)
  • Historical scenarios (2020-2024 variations)
  • Cross-year comparisons (TCJA vs pre-TCJA)
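Concretely, year-conditioned grading means the answer key itself is keyed by tax year, so the same question has different correct answers depending on the year the scenario fixes. A toy illustration using published single-filer standard deduction amounts:

```python
# Answer key keyed by tax year: a figure that is correct for 2024
# is a stale, wrong answer in a 2025 scenario.
STANDARD_DEDUCTION_SINGLE = {
    2023: 13_850,
    2024: 14_600,
    2025: 15_000,
}

def grade_numeric(model_answer: int, tax_year: int) -> bool:
    """Year-conditioned check against the parameter table."""
    return model_answer == STANDARD_DEDUCTION_SINGLE[tax_year]

print(grade_numeric(14_600, 2024))  # True
print(grade_numeric(14_600, 2025))  # False: prior-year figure
```

This pattern also makes the failure mode diagnosable: when a wrong answer matches an adjacent year's entry, the error can be labeled "stale training data" rather than a calculation mistake.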

4. Multi-Modal Document Understanding

Models must process:

  • Scanned PDFs with handwritten notes
  • Excel spreadsheets with complex formulas
  • Email threads with client questions
  • IRS notices with embedded calculations
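In an evaluation harness, these heterogeneous inputs would first be routed to type-specific parsers before the model ever sees them. A minimal sketch; the handler names are placeholders, not real components:

```python
from pathlib import Path

# Illustrative dispatch table: each input type needs its own ingestion path.
HANDLERS = {
    ".pdf":  "ocr_pipeline",       # scanned PDFs, incl. handwritten notes
    ".xlsx": "formula_extractor",  # spreadsheets where formulas, not values, matter
    ".eml":  "thread_parser",      # client email threads
}

def route(path: str) -> str:
    """Pick an ingestion handler by file extension; unknowns go to a human."""
    ext = Path(path).suffix.lower()
    return HANDLERS.get(ext, "manual_review")

print(route("client_return.pdf"))  # ocr_pipeline
print(route("notes.txt"))          # manual_review
```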

5. Error Cost Weighting

Not all mistakes are equal. We penalize:

  • Critical errors — wrong tax owed, missed deadlines (10x penalty)
  • Compliance errors — incorrect form selection, audit triggers (5x penalty)
  • Minor errors — formatting issues, typos (1x penalty)
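The weighting scheme above reduces to a severity-weighted sum over a model's mistakes, so that one critical error outweighs many cosmetic ones. A sketch using the multipliers from the list:

```python
# Severity multipliers from the rubric above
PENALTY = {"critical": 10, "compliance": 5, "minor": 1}

def weighted_error_score(errors) -> int:
    """Sum severity-weighted penalties; lower is better.
    `errors` is a list of (severity, count) pairs."""
    return sum(PENALTY[sev] * n for sev, n in errors)

# One wrong tax-owed figure costs as much as ten typos
print(weighted_error_score([("critical", 1)]))                  # 10
print(weighted_error_score([("minor", 10)]))                    # 10
print(weighted_error_score([("compliance", 2), ("minor", 3)]))  # 13
```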

6. Professional Judgment Testing

When should AI escalate to a human? We test:

  • Ambiguous scenarios requiring interpretation
  • High-stakes decisions (audit risk, penalty exposure)
  • Novel fact patterns not covered by training data
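Grading this dimension requires a reference escalation policy to compare model behavior against. A deliberately simplified sketch; the signals and thresholds are illustrative assumptions, not calibrated values:

```python
def should_escalate(confidence: float, audit_risk: float, novel_pattern: bool,
                    conf_threshold: float = 0.9, risk_threshold: float = 0.3) -> bool:
    """Reference policy: escalate to a human when the model is unsure,
    the stakes are high, or the fact pattern is outside known scenarios.
    Thresholds here are illustrative placeholders."""
    return confidence < conf_threshold or audit_risk > risk_threshold or novel_pattern

print(should_escalate(0.95, 0.1, False))  # False: confident, low stakes
print(should_escalate(0.95, 0.5, False))  # True: high audit risk
```

The benchmark can then score both failure directions: answering when it should have escalated (dangerous) and escalating when it could have answered (wasteful).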

Why This Matters

Without domain-specific benchmarks, we're flying blind. Companies buy "AI tax software" based on generic metrics that don't predict real-world performance.

This creates:

  • Wasted investment — firms buy tools that fail in production
  • Compliance risk — AI makes mistakes that trigger audits
  • Lost trust — professionals abandon AI after bad experiences

We're building TaxBench/FinBench/AcctBench to change this — giving practitioners a way to evaluate models on what actually matters before deploying them.

Current Status

We are compiling a comprehensive dataset for LLM benchmarking in tax, finance, and accounting. Unlike academic exam questions, it covers real-world fact patterns, edge cases, and regulatory nuances.

We're actively developing:

  • 📊 TaxBench v1 — Comprehensive real-world tax scenarios across multiple jurisdictions
  • 💰 FinBench v1 — Advanced financial statement analysis and fraud detection patterns
  • 📈 AcctBench v1 — Deep GAAP compliance and audit support workflows

Expected Release: Q1 2026

We will publish results for the major frontier models and leading open-source alternatives. In parallel, we are developing specialized infrastructure and domain-adapted models designed to set a new standard for accuracy and reliability in the industry.

Want to Contribute?

We're looking for CPAs, tax attorneys, and accounting professionals to help validate our benchmark scenarios.

Last updated: November 2025