Spider2 Under the Microscope: Data Drift, Engine Gaps, and the Case for Spider2-E [v3]

Summary: Our work on Spider2-Snow uncovered several benchmark issues that materially affect evaluation reliability. Most failures were not caused by query reasoning errors, but by instability in the benchmark itself: live data drift, cross-engine incompatibilities, revoked datasets, and ambiguous natural-language questions. We analyze these failure modes, discuss why result-only evaluation is increasingly insufficient for talk-to-data systems, and introduce Spider2-E, a more reproducible evaluation framework.

Introduction

Spider2 has quickly become one of the most important benchmarks for evaluating Text-to-SQL systems. Unlike earlier datasets with simplified schemas, it tests agents on real-world enterprise databases. Strong performance on Spider2 is widely interpreted as evidence of real capability.

But benchmarks are only as reliable as the environments they run in. Recent work by Jin et al. (CIDR 2026) has shown that annotation errors—incorrect SQL, schema misunderstandings, and ambiguous questions—are widespread across BIRD and Spider2-Snow, and that correcting them can change model rankings. Our work is complementary but addresses a distinct problem: even when annotations are correct, the evaluation environment itself can be non-reproducible.

While working with the Snowflake variant, Spider2-Snow, we manually analyzed every non-exact match produced during evaluation. The results were unexpected.

Of the 32 mismatches we investigated, 28 were caused by issues in the benchmark itself rather than query reasoning failures.

This suggests that a significant portion of what Spider2-Snow measures today is not model capability—but properties of the evaluation environment.

| Cause | Count | System's fault? |
|---|---|---|
| Data drift | 20 | No |
| Engine differences (BQ vs SF) | 5 | No |
| Database unavailable | 3 | No |
| Query logic error | 4 | Yes |

Our goal is not to criticize Spider2, but to help strengthen it.

We present a structured investigation of these failure modes and the design requirements they imply for a production-grade evaluation framework.

Our investigation also revealed something deeper. Even when the benchmark environment is stable, natural language queries are inherently ambiguous. A single question can contain 10–12 distinct points of ambiguity, any of which can change the result. Resolving these ambiguities is the central challenge for any production talk-to-data system.

The Core Issue: Evaluating Against a Moving Target

1. Data Drift

The largest source of mismatch is data drift.

Many Spider2 datasets are live and continuously updated, while the "correct" answers are based on a fixed snapshot from an unknown point in time. When the data changes, correct queries can appear wrong.

We observed several recurring patterns:

  • Data disappears — Queries targeting historical data return no results because the dataset is a rolling window. The original data simply no longer exists.

  • Data never existed in this environment — Some queries reference datasets that were never mirrored into Snowflake.

  • Rankings shift over time — A correct "top 10" query returns a slightly different ordering because the underlying data has changed.

  • Schemas evolve — Columns change meaning or format (e.g., postal codes replaced by city names).

In some cases, the benchmark penalizes a system for returning answers that are correct against the current data, or even for finding data that exists now but did not when the benchmark was created.
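
One way to separate drift from genuine model error is to re-execute the frozen gold SQL itself against the live warehouse: if the gold query no longer reproduces its archived gold CSV, the instance has drifted regardless of what any agent produced. Below is a minimal sketch of that check, in which the `run_query` helper and the file layout are our own assumptions rather than part of Spider2-Snow.

```python
# Sketch: flag an instance as drifted when its own gold SQL no longer
# reproduces the archived gold CSV. `run_query` is an assumed helper that
# executes SQL against the live warehouse and returns a pandas DataFrame.
import pandas as pd

def is_drifted(gold_sql_path: str, gold_csv_path: str, run_query) -> bool:
    gold_sql = open(gold_sql_path).read()
    frozen = pd.read_csv(gold_csv_path)
    live = run_query(gold_sql)
    # Compare as strings, order-insensitively, to sidestep dtype differences
    # between CSV-loaded and freshly queried results.
    frozen = frozen.astype(str).sort_values(by=list(frozen.columns)).reset_index(drop=True)
    live = live.astype(str).sort_values(by=list(live.columns)).reset_index(drop=True)
    return not frozen.equals(live)
```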

2. Engine Differences

Spider2-Snow inherits queries originally designed for BigQuery. However, SQL is not perfectly portable across engines.

We found several unavoidable mismatches:

  • Missing functionality — BigQuery provides functions that Snowflake does not (e.g., statistical UDFs).

  • Non-equivalent constructs — Some BigQuery behaviors (like struct serialization) cannot be replicated exactly.

  • Performance differences — Queries that run efficiently in BigQuery may time out in Snowflake.

  • Numerical and spatial differences — Small differences in floating-point or geospatial computations can change results.

These are not model failures—they are system incompatibilities.
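
Dialect-aware transpilation can close part of this gap, though not all of it. The sketch below runs sqlglot, one commonly used SQL transpiler, on an illustrative query that is not taken from the benchmark; constructs the benchmark itself flags as untranslatable, such as TO_JSON_STRING(STRUCT(...)), still require manual, engine-aware handling.

```python
# Sketch: transpile a BigQuery-flavored query into the Snowflake dialect.
# sqlglot handles many dialect differences mechanically; anything it cannot
# express still needs manual review.
import sqlglot

bq_sql = """
    SELECT TIMESTAMP_TRUNC(created_at, DAY) AS day, COUNT(*) AS n
    FROM events
    GROUP BY day
"""
sf_sql = sqlglot.transpile(bq_sql, read="bigquery", write="snowflake")[0]
print(sf_sql)  # timestamp truncation is rewritten into Snowflake syntax
```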

3. Database Unavailability

In several cases, the underlying data was simply unavailable due to revoked data shares.

These queries are unevaluable by any system.

A Structural Weakness: Result-Only Evaluation

Spider2 evaluates outputs, not methods.

This creates a vulnerability: an agent can produce the correct answer without actually querying the database (e.g., by hardcoding results).

The benchmark includes some detection, but it is limited. We identified one such case during our analysis.

This raises a broader concern: high scores do not always imply real capability.
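
A structural check on the submitted SQL, independent of its result, can catch the most blatant form of this. The sketch below is our own illustration of such a check, not the benchmark's detection logic: it simply asks whether the query reads from any table at all, so a statement that only SELECTs literal values is rejected.

```python
# Sketch: reject submissions whose SQL never reads from a table and therefore
# can only return hardcoded values. Uses sqlglot's parser; this is an
# illustration, not the benchmark's actual check.
import sqlglot
from sqlglot import exp

def references_a_table(sql: str, dialect: str = "snowflake") -> bool:
    return sqlglot.parse_one(sql, read=dialect).find(exp.Table) is not None

assert references_a_table("SELECT COUNT(*) FROM orders WHERE status = 'open'")
assert not references_a_table("SELECT 42 AS answer")  # pure literals: likely hardcoded
```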

Implications for Reported Scores

The True Ceiling Is Lower Than 100%

Because of environment-related issues, even a perfect system cannot reliably reach 100%. In practice, the achievable ceiling is meaningfully lower under current evaluation conditions.

Scores Change Over Time

Since some datasets are live, the benchmark itself changes. An identical system evaluated at different times may receive different scores.

Very High Scores Require Scrutiny

Very high scores may reflect:

  • Favorable timing (less data drift)

  • Benchmark artifacts

  • Unintended exploitation

Four Distinct Sources of Benchmark Failure

The issues we observed fall into four fundamentally different categories:

| Problem | Nature | Remedy |
|---|---|---|
| Annotation errors | Incorrect gold labels or SQL | Better verification and curation |
| Data drift | Non-stationary datasets | Snapshotting and versioning |
| Engine mismatch | Semantic differences across SQL engines | Engine-aware evaluation |
| Query ambiguity | Multiple valid interpretations | Multi-answer or ambiguity-aware evaluation |

These problems are often conflated, but they require fundamentally different remedies.

Recommendations for Benchmark Design

Reliable benchmarks require more than correct answers—they require stable evaluation conditions.

Production systems must handle evolving data and ambiguous questions. But benchmarks serve a different purpose: they are supposed to isolate and measure system capability under controlled conditions. When environmental variance dominates measured error, benchmark scores become difficult to interpret scientifically.

We suggest focusing on two dimensions:

1. Annotation Quality (What is measured)

Prior work, notably Jin et al., has demonstrated the impact of annotation errors on benchmark reliability. Their remedies—improved verification pipelines, corrected SQL, and clearer questions—address this dimension directly.

  • Ensure gold SQL is correct

  • Remove ambiguity in questions or accept several valid SQL translations

2. Evaluation Stability (How it is measured)

  • Use snapshot data, not live datasets

  • Document assumptions (data version, engine, timing)

  • Support engine-specific behavior where needed

  • Version the benchmark over time

  • Allow tolerance for minor numerical differences (illustrated in the sketch below)

  • Evaluate how answers are produced, not just results
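
As one illustration of the tolerance point above, a result comparator can be order-insensitive, numerically tolerant, and explicit about empty results instead of requiring exact string equality. The sketch below is our own illustration, not the Spider2-E evaluation script; the tolerance value and the type handling are assumptions.

```python
# Sketch: order-insensitive result comparison with a small numeric tolerance
# and a consistent rule for empty results. Thresholds are illustrative.
import numpy as np
import pandas as pd

def results_match(pred: pd.DataFrame, gold: pd.DataFrame, atol: float = 1e-6) -> bool:
    if pred.shape != gold.shape:
        return False
    if gold.empty:  # both frames are empty at this point: treat as a match
        return True
    pred = pred.sort_values(by=list(pred.columns)).reset_index(drop=True)
    gold = gold.sort_values(by=list(gold.columns)).reset_index(drop=True)
    for p_col, g_col in zip(pred.columns, gold.columns):
        p, g = pred[p_col], gold[g_col]
        if np.issubdtype(g.dtype, np.number):
            # Numeric columns: allow tiny floating-point differences.
            if not np.allclose(p.astype(float), g.astype(float), atol=atol, equal_nan=True):
                return False
        elif not p.astype(str).equals(g.astype(str)):
            return False
    return True
```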

Spider2-E: A More Stable Evaluation Framework

To explore these ideas, we built Spider2-E, a modified version of Spider2-Snow.

Key changes:

Removed unstable instances

  • Data drift cases

  • Unavailable databases

Handled engine differences explicitly

  • Improved ground truth SQL

  • Removed outdated queries

  • Added corrected versions

Strengthened evaluation

  • Detect hardcoding

  • Validate query structure

  • Treat empty results consistently

| Artifact | Upstream | Spider2-E | Delta |
|---|---|---|---|
| Instances | 547 | 524 | −23 |
| Gold SQL files | 120 | 118 | −2 net |
| Gold result CSVs | 1,544 | 1,469 | −75 net |
| Eval script | upstream | patched | hardcoding rejection + empty-result fix |

  • Dataset reduced from 547 → 524 instances

  • Evaluation made more reproducible and stable

Beyond Benchmarks: The Ambiguity Problem in Real Queries

Our work on Spider2 revealed something that goes beyond benchmark reliability: natural language queries are deeply ambiguous, and resolving those ambiguities is the central challenge for any production talk-to-data system.

We catalogued nine recurring categories of ambiguity. On average, each query contained 10–12 distinct points of ambiguity—none of them obvious from reading the question casually, and all of them capable of changing the result.

| Category | Description | Frequency |
|---|---|---|
| Schema ambiguity | Multiple tables or columns map to the same concept | Most common |
| Temporal ambiguity | Unclear date fields, boundaries, timezone handling | Very common |
| Filtering ambiguity | Vague criteria like "recent," "large," "active" | Very common |
| Aggregation ambiguity | COUNT vs SUM, grouping granularity, pre/post filtering | Common |
| Join ambiguity | Multiple valid join paths between tables | Common |
| Null/edge-case handling | Include or exclude NULLs, zeros, special values | Common |
| Output format ambiguity | Column names, aliases, sort order | Common |
| Semantic ambiguity | Domain-specific term interpretation | Moderate |
| External knowledge dependency | Information required from outside the instruction | Moderate |

Consider a seemingly straightforward Google Analytics question:

“For each visitor who made at least one transaction in February 2017, how many days elapsed between their first visit and their first transaction?”

This single question contains 11 distinct ambiguities, including:

  • Which date field should be used?

  • Should timestamps be interpreted in GA timezone or UTC?

  • What counts as a transaction?

  • What counts as a visit?

Each choice can change the final result.
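
To make just one of these concrete, here are two defensible readings of “first visit”, written against a GA360-style sessions table; the table and column names are illustrative, not the benchmark's actual Snowflake schema.

```python
# Two defensible readings of “first visit” for the question above. Table and
# column names are illustrative; either reading can be justified, and they can
# disagree whenever a session crosses a midnight or timezone boundary.
first_visit_by_timestamp = """
    SELECT fullVisitorId, MIN(visitStartTime) AS first_visit  -- epoch seconds, UTC
    FROM ga_sessions
    GROUP BY fullVisitorId
"""

first_visit_by_calendar_date = """
    SELECT fullVisitorId, MIN(date) AS first_visit  -- YYYYMMDD, GA property timezone
    FROM ga_sessions
    GROUP BY fullVisitorId
"""
```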

These ambiguities are not artifacts of the benchmark. They are inherent properties of natural language questions asked against complex schemas. A production-grade talk-to-data system must be able to resolve these ambiguities.

Conclusion

Spider2 is a major step forward for evaluating Text-to-SQL systems. It brings real-world complexity into benchmarking in a way earlier datasets did not.

But our analysis shows that a large portion of measured error is not due to model failure, but to evaluation instability. If benchmarks are used to guide research and compare systems, they must measure what they intend to measure—consistently and reproducibly.

Our analysis also exposed a more fundamental problem: natural language queries are deeply ambiguous. Each question carries multiple implicit assumptions about schema mapping, temporal boundaries, filtering criteria, and aggregation logic. Current benchmarks treat each question as having a single correct answer, but in practice, several reasonable interpretations—and therefore several valid SQL translations—often exist. Any production talk-to-data system must confront this problem directly.

Annotation errors and evaluation instability are not mutually exclusive—both degrade benchmark reliability, but they require fundamentally different remedies. Annotation errors require improved verification pipelines; evaluation instability requires data stabilization, engine-aware semantics, and versioned artifacts. Benchmark correctness is not just a property of annotations; it is a property of the entire evaluation system.

Spider2-E is one attempt to move in that direction. We hope this work contributes to a broader conversation about how to build benchmarks that are not just challenging, but also reliable.

Appendix: All 32 Non-Exact Instances

| Instance | Root Cause | Notes |
|---|---|---|
| sf006 | data_drift | Branch counts differ due to data refresh |
| sf008 | data_drift | Home price index updated (6.55 → 6.8) |
| sf012 | data_drift | 2012 FEMA claim amounts differ; NFIP dataset updated multiple times |
| sf037 | data_drift | POI IDs changed from hex to UUID format |
| sf040 | data_drift | Address parsing data changed |
| sf_bq009 | data_drift | Revenue difference value differs; GA360 data drifted |
| sf_bq024 | data_drift | EVALUATION_TYPE differs (EXPCURR/EXPCHNG vs EXPMORT) |
| sf_bq058 | data_drift | Optimism blockchain data not available in Snowflake |
| sf_bq063 | data_drift | npm registry URLs changed |
| sf_bq102 | data_drift | Genomic start positions changed |
| sf_bq130 | data_drift | County vs. state rankings changed in different data window |
| sf_bq165 | data_drift | MITELMAN cohort total changed (3373 → 3377) |
| sf_bq190 | data_drift | User count changed (465 → 467) |
| sf_bq249 | data_drift | Whitespace category counts updated |
| sf_bq256 | data_drift | Ethereum balance sign differs due to data drift |
| sf_bq275 | data_drift | GA360 fullVisitorIds differ (rolling session data) |
| sf_bq366 | data_drift | Same rows, different ordering |
| sf_ga014 | data_drift | GA4 session counts slightly differ |
| sf_ga018 | data_drift | PLP-to-PDP conversion rate differs |
| sf_ga021 | data_drift | Different cohort event type returned |
| sf_bq111 | engine_diff | Missing corr_pvalue UDF in Snowflake |
| sf_bq276 | engine_diff | Spatial function unavailable in Snowflake |
| sf_bq430 | engine_diff | TO_JSON_STRING(STRUCT(...)) untranslatable |
| sf_bq458 | engine_diff | UNNEST vs FLATTEN performance gap |
| sf_local344 | engine_diff | Float boundary causes different counts |
| sf009 | no_db | Spatial share revoked |
| sf013 | no_db | Road-network share revoked |
| sf029 | no_db | Vendor data share revoked |
| sf_bq152 | query_logic | Wrong threshold conversion |
| sf_bq416 | query_logic | Hardcoded values |
| sf_bq450 | query_logic | Incorrect balance threshold |
| sf_local335 | query_logic | Missing tie-breaking sort key |
