Spider2 Under the Microscope: Data Drift, Engine Gaps, and the Case for Spider2-E [v3]

Summary: Our work on Spider2-Snow uncovered several benchmark issues that materially affect evaluation reliability. Most failures were not caused by query reasoning errors, but by instability in the benchmark itself: live data drift, cross-engine incompatibilities, revoked datasets, and ambiguous natural-language questions. We analyze these failure modes, discuss why result-only evaluation is increasingly insufficient for talk-to-data systems, and introduce Spider2-E, a more reproducible evaluation framework.

Introduction

Spider2 has quickly become one of the most important benchmarks for evaluating Text-to-SQL systems. Unlike earlier datasets with simplified schemas, it tests agents on real-world enterprise databases. Strong performance on Spider2 is widely interpreted as evidence of real capability.

But benchmarks are only as reliable as the environments they run in. Recent work by Jin et al. (CIDR 2026) has shown that annotation errors—incorrect SQL, schema misunderstandings, and ambiguous questions—are widespread across BIRD and Spider2-Snow, and that correcting them can change model rankings. Our work is complementary but addresses a distinct problem: even when annotations are correct, the evaluation environment itself can be non-reproducible.

While working with the Snowflake variant, Spider2-Snow, we manually analyzed every non-exact match produced during evaluation. The results were unexpected.

Of the 32 mismatches we investigated, 28 were caused by issues in the benchmark itself rather than query reasoning failures.

This suggests that a significant portion of what Spider2-Snow measures today is not model capability—but properties of the evaluation environment.

| Cause | Count | System's fault? |
|---|---|---|
| Data drift | 20 | No |
| Engine differences (BQ vs SF) | 5 | No |
| Database unavailable | 3 | No |
| Query logic error | 4 | Yes |

Our goal is not to criticize Spider2, but to help strengthen it.

We present a structured investigation of these failure modes and the design requirements they imply for a production-grade evaluation framework.

Our investigation also revealed something deeper. Even when the benchmark environment is stable, natural language queries are inherently ambiguous. A single question can contain 10–12 distinct points of ambiguity, any of which can change the result. Resolving these ambiguities is the central challenge for any production talk-to-data system.

The Core Issue: Evaluating Against a Moving Target

1. Data Drift

The largest source of mismatch is data drift.

Many Spider2 datasets are live and continuously updated, while the "correct" answers are based on a fixed snapshot from an unknown point in time. When the data changes, correct queries can appear wrong.

We observed several recurring patterns:

  • Data disappears — Queries targeting historical data return no results because the dataset is a rolling window. The original data simply no longer exists.

  • Data never existed in this environment — Some queries reference datasets that were never mirrored into Snowflake.

  • Rankings shift over time — A correct "top 10" query returns a slightly different ordering because the underlying data has changed.

  • Schemas evolve — Columns change meaning or format (e.g., postal codes replaced by city names).

In some cases, the benchmark penalizes a system for returning answers that are correct against the current data, or even for finding data that exists now but did not when the benchmark was created.
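
One way to separate drift from genuine model error is to re-execute the frozen gold SQL itself against the live warehouse: if the gold query no longer reproduces its archived gold CSV, the instance has drifted regardless of what any agent produced. Below is a minimal sketch of that check, in which the `run_query` helper and the file layout are our own assumptions rather than part of Spider2-Snow.

```python
# Sketch: flag an instance as drifted when its own gold SQL no longer
# reproduces the archived gold CSV. `run_query` is an assumed helper that
# executes SQL against the live warehouse and returns a pandas DataFrame.
import pandas as pd

def is_drifted(gold_sql_path: str, gold_csv_path: str, run_query) -> bool:
    gold_sql = open(gold_sql_path).read()
    frozen = pd.read_csv(gold_csv_path)
    live = run_query(gold_sql)
    # Compare as strings, order-insensitively, to sidestep dtype differences
    # between CSV-loaded and freshly queried results.
    frozen = frozen.astype(str).sort_values(by=list(frozen.columns)).reset_index(drop=True)
    live = live.astype(str).sort_values(by=list(live.columns)).reset_index(drop=True)
    return not frozen.equals(live)
```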

2. Engine Differences

Spider2-Snow inherits queries originally designed for BigQuery. However, SQL is not perfectly portable across engines.

We found several unavoidable mismatches:

  • Missing functionality — BigQuery provides functions that Snowflake does not (e.g., statistical UDFs).

  • Non-equivalent constructs — Some BigQuery behaviors (like struct serialization) cannot be replicated exactly.

  • Performance differences — Queries that run efficiently in BigQuery may time out in Snowflake.

  • Numerical and spatial differences — Small differences in floating-point or geospatial computations can change results.

These are not model failures—they are system incompatibilities.
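
Dialect-aware transpilation can close part of this gap, though not all of it. The sketch below runs sqlglot, one commonly used SQL transpiler, on an illustrative query that is not taken from the benchmark; constructs the benchmark itself flags as untranslatable, such as TO_JSON_STRING(STRUCT(...)), still require manual, engine-aware handling.

```python
# Sketch: transpile a BigQuery-flavored query into the Snowflake dialect.
# sqlglot handles many dialect differences mechanically; anything it cannot
# express still needs manual review.
import sqlglot

bq_sql = """
    SELECT TIMESTAMP_TRUNC(created_at, DAY) AS day, COUNT(*) AS n
    FROM events
    GROUP BY day
"""
sf_sql = sqlglot.transpile(bq_sql, read="bigquery", write="snowflake")[0]
print(sf_sql)  # timestamp truncation is rewritten into Snowflake syntax
```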

3. Database Unavailability

In several cases, the underlying data was simply unavailable due to revoked data shares.

These queries are unevaluable by any system.

A Structural Weakness: Result-Only Evaluation

Spider2 evaluates outputs, not methods.

This creates a vulnerability: an agent can produce the correct answer without actually querying the database (e.g., by hardcoding results).

The benchmark includes some detection, but it is limited. We identified one such case during our analysis.

This raises a broader concern: high scores do not always imply real capability.
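
A structural check on the submitted SQL, independent of its result, can catch the most blatant form of this. The sketch below is our own illustration of such a check, not the benchmark's detection logic: it simply asks whether the query reads from any table at all, so a statement that only SELECTs literal values is rejected.

```python
# Sketch: reject submissions whose SQL never reads from a table and therefore
# can only return hardcoded values. Uses sqlglot's parser; this is an
# illustration, not the benchmark's actual check.
import sqlglot
from sqlglot import exp

def references_a_table(sql: str, dialect: str = "snowflake") -> bool:
    return sqlglot.parse_one(sql, read=dialect).find(exp.Table) is not None

assert references_a_table("SELECT COUNT(*) FROM orders WHERE status = 'open'")
assert not references_a_table("SELECT 42 AS answer")  # pure literals: likely hardcoded
```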

Implications for Reported Scores

The True Ceiling Is Lower Than 100%

Because of environment-related issues, even a perfect system cannot reliably reach 100%. In practice, the achievable ceiling is meaningfully lower under current evaluation conditions.

Scores Change Over Time

Since some datasets are live, the benchmark itself changes. An identical system evaluated at different times may receive different scores.

Very High Scores Require Scrutiny

Very high scores may reflect:

  • Favorable timing (less data drift)

  • Benchmark artifacts

  • Unintended exploitation

Four Distinct Sources of Benchmark Failure

The issues we observed fall into four fundamentally different categories:

| Problem | Nature | Remedy |
|---|---|---|
| Annotation errors | Incorrect gold labels or SQL | Better verification and curation |
| Data drift | Non-stationary datasets | Snapshotting and versioning |
| Engine mismatch | Semantic differences across SQL engines | Engine-aware evaluation |
| Query ambiguity | Multiple valid interpretations | Multi-answer or ambiguity-aware evaluation |

These problems are often conflated, but they require fundamentally different remedies.

Recommendations for Benchmark Design

Reliable benchmarks require more than correct answers—they require stable evaluation conditions.

Production systems must handle evolving data and ambiguous questions. But benchmarks serve a different purpose: they are supposed to isolate and measure system capability under controlled conditions. When environmental variance dominates measured error, benchmark scores become difficult to interpret scientifically.

We suggest focusing on two dimensions:

1. Annotation Quality (What is measured)

Prior work, notably Jin et al., has demonstrated the impact of annotation errors on benchmark reliability. Their remedies—improved verification pipelines, corrected SQL, and clearer questions—address this dimension directly.

  • Ensure gold SQL is correct

  • Remove ambiguity in questions or accept several valid SQL translations

2. Evaluation Stability (How it is measured)

  • Use snapshot data, not live datasets

  • Document assumptions (data version, engine, timing)

  • Support engine-specific behavior where needed

  • Version the benchmark over time

  • Allow tolerance for minor numerical differences (illustrated in the sketch below)

  • Evaluate how answers are produced, not just results
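
As one illustration of the tolerance point above, a result comparator can be order-insensitive, numerically tolerant, and explicit about empty results instead of requiring exact string equality. The sketch below is our own illustration, not the Spider2-E evaluation script; the tolerance value and the type handling are assumptions.

```python
# Sketch: order-insensitive result comparison with a small numeric tolerance
# and a consistent rule for empty results. Thresholds are illustrative.
import numpy as np
import pandas as pd

def results_match(pred: pd.DataFrame, gold: pd.DataFrame, atol: float = 1e-6) -> bool:
    if pred.shape != gold.shape:
        return False
    if gold.empty:  # both frames are empty at this point: treat as a match
        return True
    pred = pred.sort_values(by=list(pred.columns)).reset_index(drop=True)
    gold = gold.sort_values(by=list(gold.columns)).reset_index(drop=True)
    for p_col, g_col in zip(pred.columns, gold.columns):
        p, g = pred[p_col], gold[g_col]
        if np.issubdtype(g.dtype, np.number):
            # Numeric columns: allow tiny floating-point differences.
            if not np.allclose(p.astype(float), g.astype(float), atol=atol, equal_nan=True):
                return False
        elif not p.astype(str).equals(g.astype(str)):
            return False
    return True
```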

Spider2-E: A More Stable Evaluation Framework

To explore these ideas, we built Spider2-E, a modified version of Spider2-Snow.

Key changes:

Removed unstable instances

  • Data drift cases

  • Unavailable databases

Handled engine differences explicitly

  • Improved ground truth SQL

  • Removed outdated queries

  • Added corrected versions

Strengthened evaluation

  • Detect hardcoding

  • Validate query structure

  • Treat empty results consistently

| Artifact | Upstream | Spider2-E | Delta |
|---|---|---|---|
| Instances | 547 | 524 | −23 |
| Gold SQL files | 120 | 118 | −2 net |
| Gold result CSVs | 1,544 | 1,469 | −75 net |
| Eval script | upstream | patched | hardcoding rejection + empty-result fix |

  • Dataset reduced from 547 → 524 instances

  • Evaluation made more reproducible and stable

Beyond Benchmarks: The Ambiguity Problem in Real Queries

Our work on Spider2 revealed something that goes beyond benchmark reliability: natural language queries are deeply ambiguous, and resolving those ambiguities is the central challenge for any production talk-to-data system.

We catalogued nine recurring categories of ambiguity. On average, each query contained 10–12 distinct points of ambiguity—none of them obvious from reading the question casually, and all of them capable of changing the result.

| Category | Description | Frequency |
|---|---|---|
| Schema ambiguity | Multiple tables or columns map to the same concept | Most common |
| Temporal ambiguity | Unclear date fields, boundaries, timezone handling | Very common |
| Filtering ambiguity | Vague criteria like "recent," "large," "active" | Very common |
| Aggregation ambiguity | COUNT vs SUM, grouping granularity, pre/post filtering | Common |
| Join ambiguity | Multiple valid join paths between tables | Common |
| Null/edge-case handling | Include or exclude NULLs, zeros, special values | Common |
| Output format ambiguity | Column names, aliases, sort order | Common |
| Semantic ambiguity | Domain-specific term interpretation | Moderate |
| External knowledge dependency | Information required from outside the instruction | Moderate |

Consider a seemingly straightforward Google Analytics question:

“For each visitor who made at least one transaction in February 2017, how many days elapsed between their first visit and their first transaction?”

This single question contains 11 distinct ambiguities, including:

  • Which date field should be used?

  • Should timestamps be interpreted in GA timezone or UTC?

  • What counts as a transaction?

  • What counts as a visit?

Each choice can change the final result.
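
To make just one of these concrete, here are two defensible readings of “first visit”, written against a GA360-style sessions table; the table and column names are illustrative, not the benchmark's actual Snowflake schema.

```python
# Two defensible readings of “first visit” for the question above. Table and
# column names are illustrative; either reading can be justified, and they can
# disagree whenever a session crosses a midnight or timezone boundary.
first_visit_by_timestamp = """
    SELECT fullVisitorId, MIN(visitStartTime) AS first_visit  -- epoch seconds, UTC
    FROM ga_sessions
    GROUP BY fullVisitorId
"""

first_visit_by_calendar_date = """
    SELECT fullVisitorId, MIN(date) AS first_visit  -- YYYYMMDD, GA property timezone
    FROM ga_sessions
    GROUP BY fullVisitorId
"""
```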

These ambiguities are not artifacts of the benchmark. They are inherent properties of natural language questions asked against complex schemas. A production-grade talk-to-data system must be able to resolve these ambiguities.

Conclusion

Spider2 is a major step forward for evaluating Text-to-SQL systems. It brings real-world complexity into benchmarking in a way earlier datasets did not.

But our analysis shows that a large portion of measured error is not due to model failure, but to evaluation instability. If benchmarks are used to guide research and compare systems, they must measure what they intend to measure—consistently and reproducibly.

Our analysis also exposed a more fundamental problem: natural language queries are deeply ambiguous. Each question carries multiple implicit assumptions about schema mapping, temporal boundaries, filtering criteria, and aggregation logic. Current benchmarks treat each question as having a single correct answer, but in practice, several reasonable interpretations—and therefore several valid SQL translations—often exist. Any production talk-to-data system must confront this problem directly.

Annotation errors and evaluation instability are not mutually exclusive—both degrade benchmark reliability, but they require fundamentally different remedies. Annotation errors require improved verification pipelines; evaluation instability requires data stabilization, engine-aware semantics, and versioned artifacts. Benchmark correctness is not just a property of annotations; it is a property of the entire evaluation system.

Spider2-E is one attempt to move in that direction. We hope this work contributes to a broader conversation about how to build benchmarks that are not just challenging, but also reliable.

Appendix: All 32 Non-Exact Instances

| Instance | Root Cause | Notes |
|---|---|---|
| sf006 | data_drift | Branch counts differ due to data refresh |
| sf008 | data_drift | Home price index updated (6.55 → 6.8) |
| sf012 | data_drift | 2012 FEMA claim amounts differ; NFIP dataset updated multiple times |
| sf037 | data_drift | POI IDs changed from hex to UUID format |
| sf040 | data_drift | Address parsing data changed |
| sf_bq009 | data_drift | Revenue difference value differs; GA360 data drifted |
| sf_bq024 | data_drift | EVALUATION_TYPE differs (EXPCURR/EXPCHNG vs EXPMORT) |
| sf_bq058 | data_drift | Optimism blockchain data not available in Snowflake |
| sf_bq063 | data_drift | npm registry URLs changed |
| sf_bq102 | data_drift | Genomic start positions changed |
| sf_bq130 | data_drift | County vs. state rankings changed in different data window |
| sf_bq165 | data_drift | MITELMAN cohort total changed (3373 → 3377) |
| sf_bq190 | data_drift | User count changed (465 → 467) |
| sf_bq249 | data_drift | Whitespace category counts updated |
| sf_bq256 | data_drift | Ethereum balance sign differs due to data drift |
| sf_bq275 | data_drift | GA360 fullVisitorIds differ (rolling session data) |
| sf_bq366 | data_drift | Same rows, different ordering |
| sf_ga014 | data_drift | GA4 session counts slightly differ |
| sf_ga018 | data_drift | PLP-to-PDP conversion rate differs |
| sf_ga021 | data_drift | Different cohort event type returned |
| sf_bq111 | engine_diff | Missing corr_pvalue UDF in Snowflake |
| sf_bq276 | engine_diff | Spatial function unavailable in Snowflake |
| sf_bq430 | engine_diff | TO_JSON_STRING(STRUCT(...)) untranslatable |
| sf_bq458 | engine_diff | UNNEST vs FLATTEN performance gap |
| sf_local344 | engine_diff | Float boundary causes different counts |
| sf009 | no_db | Spatial share revoked |
| sf013 | no_db | Road-network share revoked |
| sf029 | no_db | Vendor data share revoked |
| sf_bq152 | query_logic | Wrong threshold conversion |
| sf_bq416 | query_logic | Hardcoded values |
| sf_bq450 | query_logic | Incorrect balance threshold |
| sf_local335 | query_logic | Missing tie-breaking sort key |
