Claude Opus 4.5 scored 67%. Sourcetable scored 100%. On the same benchmark. Here's what the test covers, why domain-specific beats general-purpose, and what it means for your analysis work.
Andrew Grosser
June 1, 2026 • 9 min read
Benchmark scores are easy to misuse. '100%' sounds impressive until you realize you don't know what was being tested. This article breaks down exactly what the Vals.ai finance agent benchmark tests, why general-purpose LLMs like Claude score lower than purpose-built platforms, and what the 33-point gap actually means for financial analysis work.
| Benchmark | Sourcetable | Claude Opus 4.5 | What It Tests |
|---|---|---|---|
| Vals.ai Finance | 100% | 67% | Financial analysis agent tasks |
| Rows.com Spreadsheet | 100% | Not tested | Spreadsheet AI tasks |
The Vals.ai finance agent benchmark evaluates AI systems on real financial analysis tasks: interpreting financial statements, performing ratio analysis, answering questions about market data, making investment-relevant calculations, and reasoning about financial scenarios. These are the tasks financial analysts do every day — not abstract reasoning puzzles or general knowledge questions.
Claude is a general-purpose language model. It reasons well about many topics including finance, but it lacks financial data access, institutional analysis frameworks, and purpose-built financial reasoning. Sourcetable combines Claude-level language understanding with financial-domain infrastructure: 500+ data APIs, built-in Monte Carlo and factor model implementations, and years of financial domain-specific training and optimization. Domain-specific beats general-purpose on domain-specific tasks.
Rows.com published a benchmark evaluating AI systems on standard spreadsheet tasks: formula generation, data manipulation, chart creation, and analysis. Sourcetable scored 100% — making it the first AI spreadsheet to achieve perfect scores on both this benchmark and Vals.ai. The fact that we beat Rows.com on their own benchmark reflects the depth of our spreadsheet-specific capabilities.
A 33-point gap on a financial benchmark isn't abstract: it means 33 more tasks out of every 100 executed correctly, or roughly 49% more correct results than the general-purpose model produced (100 vs. 67). For a financial analyst running complex analysis workflows, that translates to fewer errors, less manual verification, and more confidence in AI-assisted conclusions. General-purpose LLMs are excellent tools. For financial analysis specifically, purpose-built wins.
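To make the gap concrete, the two scores from the table above can be compared in absolute terms (percentage points) and in relative terms (percent improvement over the baseline). The snippet below is a minimal sketch of that arithmetic. The function name `compare_scores` and the "per 100 tasks" framing are illustrative assumptions, not part of either benchmark's methodology.

```python
def compare_scores(baseline_pct: float, candidate_pct: float) -> dict:
    """Compare two benchmark accuracies, both given in percent (0-100).

    Hypothetical helper for illustration only.
    """
    # Absolute difference, measured in percentage points.
    point_gap = candidate_pct - baseline_pct
    # Relative improvement: how many more correct tasks the candidate
    # completes, as a percentage of the baseline's correct tasks.
    relative_gain_pct = point_gap / baseline_pct * 100
    return {
        "point_gap": point_gap,
        "relative_gain_pct": relative_gain_pct,
        # On a batch of 100 tasks, the point gap equals the number of
        # additional tasks completed correctly.
        "extra_correct_per_100_tasks": point_gap,
    }


# Scores taken from the table: Claude Opus 4.5 at 67%, Sourcetable at 100%.
result = compare_scores(baseline_pct=67, candidate_pct=100)
print(f"Gap: {result['point_gap']} percentage points")
print(f"Relative gain: {result['relative_gain_pct']:.1f}%")
```

With the scores reported here, the gap is 33 percentage points, and the relative gain works out to about 49.3%. The two figures answer different questions, so it helps to state which one is being quoted.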
What 100% means for you:
The world's most powerful analytical platform — free to try
100% benchmark scores. 500+ financial APIs. Spreadsheet interface. No coding required.