Spider2 Under the Microscope: Data Drift, Engine Gaps, and the Case for Spider2-E [v3]
Summary: Our work on Spider2-Snow uncovered several benchmark issues that materially affect evaluation reliability. Most failures were not caused by query reasoning errors, but by instability in the benchmark itself: live data drift, cross-engine incompatibilities, revoked datasets, and ambiguous natural-language questions. We analyze these failure modes, discuss why result-only evaluation is increasingly insufficient for talk-to-data systems, and introduce Spider2-E, a more reproducible evaluation framework.
Introduction
Spider2 has quickly become one of the most important benchmarks for evaluating Text-to-SQL systems. Unlike earlier datasets with simplified schemas, it tests agents on real-world enterprise databases. Strong performance on Spider2 is widely interpreted as evidence of real capability.
But benchmarks are only as reliable as the environments they run in. Recent work by Jin et al. (CIDR 2026) has shown that annotation errors—incorrect SQL, schema misunderstandings, and ambiguous questions—are widespread across BIRD and Spider2-Snow, and that correcting them can change model rankings. Our work is complementary but addresses a distinct problem: even when annotations are correct, the evaluation environment itself can be non-reproducible.
While working with the Snowflake variant, Spider2-Snow, we manually analyzed every non-exact match produced during evaluation. The results were unexpected.
Of the 32 mismatches we investigated, 28 were caused by issues in the benchmark itself rather than query reasoning failures.
This suggests that a significant portion of what Spider2-Snow measures today is not model capability—but properties of the evaluation environment.
| Cause | Count | System's fault? |
| --- | --- | --- |
| Data drift | 20 | No |
| Engine differences (BQ vs SF) | 5 | No |
| Database unavailable | 3 | No |
| Query logic error | 4 | Yes |
Our goal is not to criticize Spider2, but to help strengthen it.
We present a structured investigation to surface the patterns, failure modes, and design requirements that would inform the architecture of a production-grade evaluation framework.
Our investigation also revealed something deeper. Even when the benchmark environment is stable, natural language queries are inherently ambiguous. A single question can contain 10–12 distinct points of ambiguity—any of which can change the result. Resolving these ambiguities is the central challenge for any production talk-to-data system.
The Core Issue: Evaluating Against a Moving Target
1. Data Drift
The largest source of mismatch is data drift.
Many Spider2 datasets are live and continuously updated, while the "correct" answers are based on a fixed snapshot from an unknown point in time. When the data changes, correct queries can appear wrong.
We observed several recurring patterns:
Data disappears — Queries targeting historical data return no results because the dataset is a rolling window. The original data simply no longer exists.
Data never existed in this environment — Some queries reference datasets that were never mirrored into Snowflake.
Rankings shift over time — A correct "top 10" query returns a slightly different ordering because the underlying data has changed.
Schemas evolve — Columns change meaning or format (e.g., postal codes replaced by city names).
In some cases, the benchmark penalizes a system for returning the correct answer against the current data, or even for finding data that exists now but did not when the benchmark was created.
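In our triage, drifted instances shared a signature: the predicted result was structurally right but differed in a few rows or values. A minimal sketch of that triage heuristic (the function name, labels, and 0.5 threshold are our own illustrative choices, not part of the benchmark):

```python
def classify_mismatch(gold_rows, live_rows):
    """Heuristically separate data drift from query-logic errors.

    Hypothetical helper: each argument is a list of row tuples with the
    same column order; thresholds are illustrative, not tuned.
    """
    if not live_rows:
        # Rolling-window datasets: the historical rows may simply be gone.
        return "possible_drift_empty"
    if not gold_rows:
        # Data that exists now but did not at snapshot time.
        return "possible_drift_new_data"
    if len(gold_rows[0]) != len(live_rows[0]):
        # A different shape usually means a wrong query, not refreshed data.
        return "likely_logic_error"
    overlap = len(set(gold_rows) & set(live_rows))
    ratio = overlap / max(len(gold_rows), len(live_rows))
    # A mostly-overlapping result on a live table points at drift
    # (e.g., a "top 10" whose ordering shifted), not at bad SQL.
    return "possible_drift" if ratio >= 0.5 else "likely_logic_error"
```

A check like this cannot prove drift, but it separates "same query, moved data" from "wrong query" well enough to prioritize manual review.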
2. Engine Differences
Spider2-Snow inherits queries originally designed for BigQuery. However, SQL is not perfectly portable across engines.
We found several unavoidable mismatches:
Missing functionality — BigQuery provides functions that Snowflake does not (e.g., statistical UDFs).
Non-equivalent constructs — Some BigQuery behaviors (like struct serialization) cannot be replicated exactly.
Performance differences — Queries that run efficiently in BigQuery may time out in Snowflake.
Numerical and spatial differences — Small differences in floating-point or geospatial computations can change results.
These are not model failures—they are system incompatibilities.
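These incompatibilities can at least be detected before scoring. A sketch of a pre-flight check over the gold SQL (the denylist is illustrative and deliberately non-exhaustive; to the best of our knowledge, TO_JSON_STRING and APPROX_QUANTILES are BigQuery functions with no one-to-one Snowflake counterpart):

```python
import re

# Illustrative, non-exhaustive set of BigQuery-only constructs; a real
# check would maintain a curated mapping per engine pair.
BQ_ONLY = {"TO_JSON_STRING", "APPROX_QUANTILES", "ST_CLUSTERDBSCAN"}

def flag_bq_only(sql: str) -> set:
    """Return BigQuery-only function names referenced by `sql`."""
    # Crude lexical scan: uppercase identifiers followed by '('.
    called = set(re.findall(r"\b([A-Z_]+)\s*\(", sql.upper()))
    return called & BQ_ONLY
```

Instances that trip such a check need an engine-specific gold query (or exclusion), not a lower score for the system under test.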
3. Database Unavailability
In several cases, the underlying data was simply unavailable due to revoked data shares.
These queries are unevaluable by any system.
A Structural Weakness: Result-Only Evaluation
Spider2 evaluates outputs, not methods.
This creates a vulnerability: an agent can produce the correct answer without actually querying the database (e.g., by hardcoding results).
The benchmark includes some detection, but it is limited. We identified one such case during our analysis.
This raises a broader concern: high scores do not always imply real capability.
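One inexpensive defense is to validate the predicted query's structure rather than only its output. A deliberately simple sketch of such a check (heuristic only; a production version would parse the SQL properly and also catch constants smuggled through subqueries):

```python
import re

def looks_hardcoded(sql: str) -> bool:
    """Flag queries that emit literals without reading any table.

    Heuristic sketch: a query with no FROM/JOIN clause cannot have
    derived its answer from the database.
    """
    stripped = re.sub(r"--[^\n]*", "", sql).upper()  # drop line comments
    return not re.search(r"\b(FROM|JOIN)\b", stripped)
```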
Implications for Reported Scores
The True Ceiling Is Lower Than 100%
Because of environment-related issues, even a perfect system cannot reliably reach 100%. In practice, the achievable ceiling is meaningfully lower under current evaluation conditions.
Scores Change Over Time
Since some datasets are live, the benchmark itself changes. An identical system evaluated at different times may receive different scores.
Very High Scores Require Scrutiny
Very high scores may reflect:
Favorable timing (less data drift)
Benchmark artifacts
Unintended exploitation
Four Distinct Sources of Benchmark Failure
The issues we observed fall into four fundamentally different categories:
| Problem | Nature | Remedy |
| --- | --- | --- |
| Annotation errors | Incorrect gold labels or SQL | Better verification and curation |
| Data drift | Non-stationary datasets | Snapshotting and versioning |
| Engine mismatch | Semantic differences across SQL engines | Engine-aware evaluation |
| Query ambiguity | Multiple valid interpretations | Multi-answer or ambiguity-aware evaluation |
These problems are often conflated, but they require fundamentally different remedies.
Recommendations for Benchmark Design
Reliable benchmarks require more than correct answers—they require stable evaluation conditions.
Production systems must handle evolving data and ambiguous questions. But benchmarks serve a different purpose: they are supposed to isolate and measure system capability under controlled conditions. When environmental variance dominates measured error, benchmark scores become difficult to interpret scientifically.
We suggest focusing on two dimensions:
1. Annotation Quality (What is measured)
Prior work, notably Jin et al., has demonstrated the impact of annotation errors on benchmark reliability. Their remedies—improved verification pipelines, corrected SQL, and clearer questions—address this dimension directly.
Ensure gold SQL is correct
Remove ambiguity in questions or accept several valid SQL translations
2. Evaluation Stability (How it is measured)
Use snapshot data, not live datasets
Document assumptions (data version, engine, timing)
Support engine-specific behavior where needed
Version the benchmark over time
Allow tolerance for minor numerical differences
Evaluate how answers are produced, not just results
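For the numerical-tolerance point, the comparator itself can absorb benign differences. A minimal sketch, assuming results are lists of row tuples and that row order is not part of the answer (the `rel_tol` value and string-based sort key are illustrative choices):

```python
import math

def results_match(gold, pred, rel_tol=1e-6):
    """Order-insensitive result comparison with numeric tolerance (sketch)."""
    def cell_eq(a, b):
        # Compare floats approximately, everything else exactly.
        if isinstance(a, float) or isinstance(b, float):
            return math.isclose(float(a), float(b), rel_tol=rel_tol)
        return a == b

    if len(gold) != len(pred):
        return False
    # Canonicalize row order before pairwise comparison.
    key = lambda row: tuple(str(c) for c in row)
    for g, p in zip(sorted(gold, key=key), sorted(pred, key=key)):
        if len(g) != len(p) or not all(cell_eq(a, b) for a, b in zip(g, p)):
            return False
    return True
```

When a question explicitly specifies a sort order, the canonicalization step should be skipped and rows compared in place.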
Spider2-E: A More Stable Evaluation Framework
To explore these ideas, we built Spider2-E, a modified version of Spider2-Snow.
Key changes:
Removed unstable instances — data drift cases and unavailable databases
Handled engine differences explicitly
Improved ground-truth SQL — removed outdated queries, added corrected versions
Strengthened evaluation — hardcoding detection, query-structure validation, consistent empty-result handling
| Artifact | Upstream | Spider2-E | Delta |
| --- | --- | --- | --- |
| Instances | 547 | 524 | −23 |
| Gold SQL files | 120 | 118 | −2 net |
| Gold result CSVs | 1,544 | 1,469 | −75 net |
| Eval script | upstream | patched | hardcoding rejection + empty-result fix |
Dataset reduced from 547 → 524 instances
Evaluation made more reproducible and stable
Beyond Benchmarks: The Ambiguity Problem in Real Queries
Our work on Spider2 revealed something that goes beyond benchmark reliability: natural language queries are deeply ambiguous, and resolving those ambiguities is the central challenge for any production talk-to-data system.
We catalogued nine recurring categories of ambiguity. On average, each query contained 10–12 distinct points of ambiguity—none of them obvious from reading the question casually, and all of them capable of changing the result.
| Category | Description | Frequency |
| --- | --- | --- |
| Schema ambiguity | Multiple tables or columns map to the same concept | Most common |
| Temporal ambiguity | Unclear date fields, boundaries, timezone handling | Very common |
| Filtering ambiguity | Vague criteria like "recent," "large," "active" | Very common |
| Aggregation ambiguity | COUNT vs SUM, grouping granularity, pre/post filtering | Common |
| Join ambiguity | Multiple valid join paths between tables | Common |
| Null/edge-case handling | Include or exclude NULLs, zeros, special values | Common |
| Output format ambiguity | Column names, aliases, sort order | Common |
| Semantic ambiguity | Domain-specific term interpretation | Moderate |
| External knowledge dependency | Information required from outside the instruction | Moderate |
Consider a seemingly straightforward Google Analytics question:
“For each visitor who made at least one transaction in February 2017, how many days elapsed between their first visit and their first transaction?”
This single question contains 11 distinct points of ambiguity, including:
Which date field should be used?
Should timestamps be interpreted in GA timezone or UTC?
What counts as a transaction?
What counts as a visit?
Each choice can change the final result.
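These points compound multiplicatively: with even two defensible readings each, the four choices above already yield 2⁴ candidate interpretations before touching the remaining points. A toy sketch of that arithmetic (the field names and choice counts are our illustrative assumptions, not measured quantities):

```python
from math import prod

# Hypothetical per-point choice counts for the Google Analytics question
# above; names and counts are illustrative only.
ambiguity_choices = {
    "date_field": 2,       # e.g., visit timestamp vs date column
    "timezone": 2,         # property timezone vs UTC
    "transaction_def": 2,  # session-level totals vs hit-level events
    "visit_def": 2,        # any session vs sessions with interactions
}

# Each ambiguity point multiplies the space of valid interpretations.
interpretations = prod(ambiguity_choices.values())
print(interpretations)  # 16 distinct readings from four binary choices
```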
These ambiguities are not artifacts of the benchmark. They are inherent properties of natural language questions asked against complex schemas. A production-grade talk-to-data system must be able to resolve these ambiguities.
Conclusion
Spider2 is a major step forward for evaluating Text-to-SQL systems. It brings real-world complexity into benchmarking in a way earlier datasets did not.
But our analysis shows that a large portion of measured error is not due to model failure, but to evaluation instability. If benchmarks are used to guide research and compare systems, they must measure what they intend to measure—consistently and reproducibly.
Our analysis also exposed a more fundamental problem: natural language queries are deeply ambiguous. Each question carries multiple implicit assumptions about schema mapping, temporal boundaries, filtering criteria, and aggregation logic. Current benchmarks treat each question as having a single correct answer, but in practice, several reasonable interpretations—and therefore several valid SQL translations—often exist. Any production talk-to-data system must confront this problem directly.
Annotation errors and evaluation instability are not mutually exclusive—both degrade benchmark reliability, but they require fundamentally different remedies. Annotation errors require improved verification pipelines; evaluation instability requires data stabilization, engine-aware semantics, and versioned artifacts. Benchmark correctness is not just a property of annotations; it is a property of the entire evaluation system.
Spider2-E is one attempt to move in that direction. We hope this work contributes to a broader conversation about how to build benchmarks that are not just challenging, but also reliable.
Appendix: All 32 Non-Exact Instances
| Instance | Root Cause | Notes |
| --- | --- | --- |
| sf006 | data_drift | Branch counts differ due to data refresh |
| sf008 | data_drift | Home price index updated (6.55 → 6.8) |
| sf012 | data_drift | 2012 FEMA claim amounts differ; NFIP dataset updated multiple times |
| sf037 | data_drift | POI IDs changed from hex to UUID format |
| sf040 | data_drift | Address parsing data changed |
| sf_bq009 | data_drift | Revenue difference value differs; GA360 data drifted |
| sf_bq024 | data_drift | EVALUATION_TYPE differs (EXPCURR/EXPCHNG vs EXPMORT) |
| sf_bq058 | data_drift | Optimism blockchain data not available in Snowflake |
| sf_bq063 | data_drift | npm registry URLs changed |
| sf_bq102 | data_drift | Genomic start positions changed |
| sf_bq130 | data_drift | County vs. state rankings changed in different data window |
| sf_bq165 | data_drift | MITELMAN cohort total changed (3373 → 3377) |
| sf_bq190 | data_drift | User count changed (465 → 467) |
| sf_bq249 | data_drift | Whitespace category counts updated |
| sf_bq256 | data_drift | Ethereum balance sign differs due to data drift |
| sf_bq275 | data_drift | GA360 fullVisitorIds differ (rolling session data) |
| sf_bq366 | data_drift | Same rows, different ordering |
| sf_ga014 | data_drift | GA4 session counts slightly differ |
| sf_ga018 | data_drift | PLP-to-PDP conversion rate differs |
| sf_ga021 | data_drift | Different cohort event type returned |
| sf_bq111 | engine_diff | Missing corr_pvalue UDF in Snowflake |
| sf_bq276 | engine_diff | Spatial function unavailable in Snowflake |
| sf_bq430 | engine_diff | TO_JSON_STRING(STRUCT(...)) untranslatable |
| sf_bq458 | engine_diff | UNNEST vs FLATTEN performance gap |
| sf_local344 | engine_diff | Float boundary causes different counts |
| sf009 | no_db | Spatial share revoked |
| sf013 | no_db | Road-network share revoked |
| sf029 | no_db | Vendor data share revoked |
| sf_bq152 | query_logic | Wrong threshold conversion |
| sf_bq416 | query_logic | Hardcoded values |
| sf_bq450 | query_logic | Incorrect balance threshold |
| sf_local335 | query_logic | Missing tie-breaking sort key |