Spider2 Under the Microscope: Data Drift, Engine Gaps, and the Case for Spider2-E

Summary: Our work on Spider2-Snow uncovered several benchmark issues that materially affect evaluation reliability. Most failures were not caused by query reasoning errors, but by instability in the benchmark itself: live data drift, cross-engine incompatibilities, revoked datasets, and ambiguous natural-language questions. We analyze these failure modes, discuss why result-only evaluation is increasingly insufficient for talk-to-data systems, and introduce Spider2-E, a more reproducible evaluation framework.

Introduction

Spider2 has quickly become one of the most important benchmarks for evaluating Text-to-SQL systems. Unlike earlier datasets with simplified schemas, it tests agents on real-world enterprise databases. Strong performance on Spider2 is widely interpreted as evidence of real capability.

But benchmarks are only as reliable as the environments they run in. Recent work by Jin et al. (CIDR 2026) has shown that annotation errors—incorrect SQL, schema misunderstandings, and ambiguous questions—are widespread across BIRD and Spider2-Snow, and that correcting them can change model rankings. Our work is complementary but addresses a distinct problem: even when annotations are correct, the evaluation environment itself can be non-reproducible.

While working with the Snowflake variant, Spider2-Snow, we manually analyzed every non-exact match produced during evaluation. The results were unexpected.

Of the 32 mismatches we investigated, 28 were caused by issues in the benchmark itself rather than query reasoning failures.

This suggests that a significant portion of what Spider2-Snow measures today is not model capability—but properties of the evaluation environment.

Cause Count System's fault?
Data drift 20 No
Engine differences (BQ vs SF) 5 No
Database unavailable 3 No
Query logic error 4 Yes

Our goal is not to criticize Spider2, but to help strengthen it.

We present a structured investigation to surface the patterns, failure modes, and design requirements that would inform the architecture of a production-grade evaluation framework.

Also, our investigation revealed something deeper. Even when the benchmark environment is stable, natural language queries are inherently ambiguous. A single question can contain 10–12 distinct points of ambiguity—any of which can change the result. Resolving these ambiguities is the central challenge for any production talk-to-data system.

The Core Issue: Evaluating Against a Moving Target

1. Data Drift

The largest source of mismatch is data drift.

Many Spider2 datasets are live and continuously updated, while the "correct" answers are based on a fixed snapshot from an unknown point in time. When the data changes, correct queries can appear wrong.

We observed several recurring patterns:

  • Data disappears — Queries targeting historical data return no results because the dataset is a rolling window. The original data simply no longer exists.

  • Data never existed in this environment — Some queries reference datasets that were never mirrored into Snowflake.

  • Rankings shift over time — A correct "top 10" query returns a slightly different ordering because the underlying data has changed.

  • Schemas evolve — Columns change meaning or format (e.g., postal codes replaced by city names).

In some cases, the benchmark penalizes for returning correct answers to current data, or even for finding data that now exists but didn't when the benchmark was created.

2. Engine Differences

Spider2-Snow inherits queries originally designed for BigQuery. However, SQL is not perfectly portable across engines.

We found several unavoidable mismatches:

  • Missing functionality — BigQuery provides functions that Snowflake does not (e.g., statistical UDFs).

  • Non-equivalent constructs — Some BigQuery behaviors (like struct serialization) cannot be replicated exactly.

  • Performance differences — Queries that run efficiently in BigQuery may time out in Snowflake.

  • Numerical and spatial differences — Small differences in floating-point or geospatial computations can change results.

These are not model failures—they are system incompatibilities.

3. Database Unavailability

In several cases, the underlying data was simply unavailable due to revoked data shares.

These queries are unevaluable by any system.

A Structural Weakness: Result-Only Evaluation

Spider2 evaluates outputs, not methods.

This creates a vulnerability: an agent can produce the correct answer without actually querying the database (e.g., by hardcoding results).

The benchmark includes some detection, but it is limited. We identified one such case during our analysis.

This raises a broader concern: high scores do not always imply real capability.

Implications for Reported Scores

The True Ceiling Is Lower Than 100%

Because of environment-related issues, even a perfect system cannot reliably reach 100%. In practice, the achievable ceiling is meaningfully lower under current evaluation conditions.

Scores Change Over Time

Since some datasets are live, the benchmark itself changes. An identical system evaluated at different times may receive different scores.

Very High Scores Require Scrutiny

Very high scores may reflect:

  • Favorable timing (less data drift)

  • Benchmark artifacts

  • Or unintended exploitation

Four Distinct Sources of Benchmark Failure

The issues we observed fall into four fundamentally different categories:

Problem Nature Remedy
Annotation errors Incorrect gold labels or SQL Better verification and curation
Data drift Non-stationary datasets Snapshotting and versioning
Engine mismatch Semantic differences across SQL engines Engine-aware evaluation
Query ambiguity Multiple valid interpretations Multi-answer or ambiguity-aware evaluation

These problems are often conflated, but they require fundamentally different remedies.

Recommendations for Benchmark Design

Reliable benchmarks require more than correct answers—they require stable evaluation conditions.

Production systems must handle evolving data and ambiguous questions. But benchmarks serve a different purpose: they are supposed to isolate and measure system capability under controlled conditions. When environmental variance dominates measured error, benchmark scores become difficult to interpret scientifically.

We suggest focusing on two dimensions:

1. Annotation Quality (What is measured)

Prior work, notably Jin et al., has demonstrated the impact of annotation errors on benchmark reliability. Their remedies—improved verification pipelines, corrected SQL, and clearer questions—address this dimension directly.

  • Ensure gold SQL is correct

  • Remove ambiguity in questions or accept several valid SQL translations

2. Evaluation Stability (How it is measured)

  • Use snapshot data, not live datasets

  • Document assumptions (data version, engine, timing)

  • Support engine-specific behavior where needed

  • Version the benchmark over time

  • Allow tolerance for minor numerical differences

  • Evaluate how answers are produced, not just results

Spider2-E: A More Stable Evaluation Framework

To explore these ideas, we built Spider2-E, a modified version of Spider2-Snow.

Key changes:

Removed unstable instances

  • Data drift cases

  • Unavailable databases

Handled engine differences explicitly

  • Improved ground truth SQL

  • Removed outdated queries

  • Added corrected versions

Strengthened evaluation

  • Detect hardcoding

  • Validate query structure

  • Treat empty results consistently

Artifact Upstream Spider2-E Delta
Instances 547 524 −23
Gold SQL files 120 118 −2 net
Gold result CSVs 1,544 1,469 −75 net
Eval script upstream patched hardcoding rejection + empty-result fix
  • Dataset reduced from 547 → 524 instances

  • Evaluation made more reproducible and stable

Beyond Benchmarks: The Ambiguity Problem in Real Queries

Our work on Spider2 revealed something that goes beyond benchmark reliability: natural language queries are deeply ambiguous, and resolving those ambiguities is the central challenge for any production talk-to-data system.

We catalogued nine recurring categories of ambiguity. On average, each query contained 10–12 distinct points of ambiguity—none of them obvious from reading the question casually, and all of them capable of changing the result.

Category Description Frequency
Schema ambiguity Multiple tables or columns map to the same concept Most common
Temporal ambiguity Unclear date fields, boundaries, timezone handling Very common
Filtering ambiguity Vague criteria like "recent," "large," "active" Very common
Aggregation ambiguity COUNT vs SUM, grouping granularity, pre/post filtering Common
Join ambiguity Multiple valid join paths between tables Common
Null/edge-case handling Include or exclude NULLs, zeros, special values Common
Output format ambiguity Column names, aliases, sort order Common
Semantic ambiguity Domain-specific term interpretation Moderate
External knowledge dependency Information required from outside the instruction Moderate

Consider a seemingly straightforward Google Analytics question:

“For each visitor who made at least one transaction in February 2017, how many days elapsed between their first visit and their first transaction?”

This single question contains 11 distinct ambiguities:

  • Which date field should be used?

  • Should timestamps be interpreted in GA timezone or UTC?

  • What counts as a transaction?

  • What counts as a visit?

Each choice can change the final result.

These ambiguities are not artifacts of the benchmark. They are inherent properties of natural language questions asked against complex schemas. A production-grade talk-to-data system must be able to resolve these ambiguities.

Conclusion

Spider2 is a major step forward for evaluating Text-to-SQL systems. It brings real-world complexity into benchmarking in a way earlier datasets did not.

But our analysis shows that a large portion of measured error is not due to model failure, but to evaluation instability. If benchmarks are used to guide research and compare systems, they must measure what they intend to measure—consistently and reproducibly.

Our analysis also exposed a more fundamental problem: natural language queries are deeply ambiguous. Each question carries multiple implicit assumptions about schema mapping, temporal boundaries, filtering criteria, and aggregation logic. Current benchmarks treat each question as having a single correct answer, but in practice, several reasonable interpretations—and therefore several valid SQL translations—often exist. Any production talk-to-data system must confront this problem directly.

Annotation errors and evaluation instability are not mutually exclusive—both degrade benchmark reliability, but they require fundamentally different remedies. Annotation errors require improved verification pipelines; evaluation instability requires data stabilization, engine-aware semantics, and versioned artifacts. Benchmark correctness is not just a property of annotations; it is a property of the entire evaluation system.

Spider2-E is one attempt to move in that direction. We hope this work contributes to a broader conversation about how to build benchmarks that are not just challenging, but also reliable.

Appendix: All 32 Non-Exact Instances

Instance Root cause Notes
sf006 data_drift Branch counts differ due to data refresh
sf008 data_drift Home price index updated (6.55 → 6.8)
sf012 data_drift 2012 FEMA claim amounts differ; NFIP dataset updated multiple times
sf037 data_drift POI IDs changed from hex to UUID format
sf040 data_drift Address parsing data changed
sf_bq009 data_drift Revenue difference value differs; GA360 data drifted
sf_bq024 data_drift EVALUATION_TYPE differs (EXPCURR/EXPCHNG vs EXPMORT)
sf_bq058 data_drift Optimism blockchain data not available in Snowflake
sf_bq063 data_drift npm registry URLs changed
sf_bq102 data_drift Genomic start positions changed
sf_bq130 data_drift County vs. state rankings changed in different data window
sf_bq165 data_drift MITELMAN cohort total changed (3373 → 3377)
sf_bq190 data_drift User count changed (465 → 467)
sf_bq249 data_drift Whitespace category counts updated
sf_bq256 data_drift Ethereum balance sign differs due to data drift
sf_bq275 data_drift GA360 fullVisitorIds differ (rolling session data)
sf_bq366 data_drift Same rows, different ordering
sf_ga014 data_drift GA4 session counts slightly differ
sf_ga018 data_drift PLP-to-PDP conversion rate differs
sf_ga021 data_drift Different cohort event type returned
sf_bq111 engine_diff Missing corr_pvalue UDF in Snowflake
sf_bq276 engine_diff Spatial function unavailable in Snowflake
sf_bq430 engine_diff TO_JSON_STRING(STRUCT(...)) untranslatable
sf_bq458 engine_diff UNNEST vs FLATTEN performance gap
sf_local344 engine_diff Float boundary causes different counts
sf009 no_db Spatial share revoked
sf013 no_db Road-network share revoked
sf029 no_db Vendor data share revoked
sf_bq152 query_logic Wrong threshold conversion
sf_bq416 query_logic Hardcoded values
sf_bq450 query_logic Incorrect balance threshold
sf_local335 query_logic Missing tie-breaking sort key
Next
Next

EMERGENCE WORLD: A Laboratory for Evaluating Long-horizon Agent Autonomy