Spider2 Under the Microscope: Data Drift, Engine Gaps, and the Case for Spider2-E

Insights

May 15

Written By Pralhad Deshpande - Ayan Bhowmick - Jeremy Hartman - Paul Haley

Summary: Our work on Spider2-Snow uncovered several benchmark issues that materially affect evaluation reliability. Most failures were not caused by query reasoning errors, but by instability in the benchmark itself: live data drift, cross-engine incompatibilities, revoked datasets, and ambiguous natural-language questions. We analyze these failure modes, discuss why result-only evaluation is increasingly insufficient for talk-to-data systems, and introduce Spider2-E, a more reproducible evaluation framework.

Introduction

Spider2 has quickly become one of the most important benchmarks for evaluating Text-to-SQL systems. Unlike earlier datasets with simplified schemas, it tests agents on real-world enterprise databases. Strong performance on Spider2 is widely interpreted as evidence of real capability.

But benchmarks are only as reliable as the environments they run in. Recent work by Jin et al. (CIDR 2026) has shown that annotation errors—incorrect SQL, schema misunderstandings, and ambiguous questions—are widespread across BIRD and Spider2-Snow, and that correcting them can change model rankings. Our work is complementary but addresses a distinct problem: even when annotations are correct, the evaluation environment itself can be non-reproducible.

While working with the Snowflake variant, Spider2-Snow, we manually analyzed every non-exact match produced during evaluation. The results were unexpected.

Of the 32 mismatches we investigated, 28 were caused by issues in the benchmark itself rather than query reasoning failures.

This suggests that a significant portion of what Spider2-Snow measures today is not model capability—but properties of the evaluation environment.

      
            Cause
            Count
            System's fault?
          
            Data drift
            20
            No
          
            Engine differences (BQ vs SF)
            5
            No
          
            Database unavailable
            3
            No
          
            Query logic error
            4
            Yes

Our goal is not to criticize Spider2, but to help strengthen it.

We present a structured investigation to surface the patterns, failure modes, and design requirements that would inform the architecture of a production-grade evaluation framework.

Also, our investigation revealed something deeper. Even when the benchmark environment is stable, natural language queries are inherently ambiguous. A single question can contain 10–12 distinct points of ambiguity—any of which can change the result. Resolving these ambiguities is the central challenge for any production talk-to-data system.

The Core Issue: Evaluating Against a Moving Target

1. Data Drift

The largest source of mismatch is data drift.

Many Spider2 datasets are live and continuously updated, while the "correct" answers are based on a fixed snapshot from an unknown point in time. When the data changes, correct queries can appear wrong.

We observed several recurring patterns:

Data disappears — Queries targeting historical data return no results because the dataset is a rolling window. The original data simply no longer exists.
Data never existed in this environment — Some queries reference datasets that were never mirrored into Snowflake.
Rankings shift over time — A correct "top 10" query returns a slightly different ordering because the underlying data has changed.
Schemas evolve — Columns change meaning or format (e.g., postal codes replaced by city names).

In some cases, the benchmark penalizes for returning correct answers to current data, or even for finding data that now exists but didn't when the benchmark was created.

2. Engine Differences

Spider2-Snow inherits queries originally designed for BigQuery. However, SQL is not perfectly portable across engines.

We found several unavoidable mismatches:

Missing functionality — BigQuery provides functions that Snowflake does not (e.g., statistical UDFs).
Non-equivalent constructs — Some BigQuery behaviors (like struct serialization) cannot be replicated exactly.
Performance differences — Queries that run efficiently in BigQuery may time out in Snowflake.
Numerical and spatial differences — Small differences in floating-point or geospatial computations can change results.

These are not model failures—they are system incompatibilities.

3. Database Unavailability

In several cases, the underlying data was simply unavailable due to revoked data shares.

These queries are unevaluable by any system.

A Structural Weakness: Result-Only Evaluation

Spider2 evaluates outputs, not methods.

This creates a vulnerability: an agent can produce the correct answer without actually querying the database (e.g., by hardcoding results).

The benchmark includes some detection, but it is limited. We identified one such case during our analysis.

This raises a broader concern: high scores do not always imply real capability.

Implications for Reported Scores

The True Ceiling Is Lower Than 100%

Because of environment-related issues, even a perfect system cannot reliably reach 100%. In practice, the achievable ceiling is meaningfully lower under current evaluation conditions.

Scores Change Over Time

Since some datasets are live, the benchmark itself changes. An identical system evaluated at different times may receive different scores.

Very High Scores Require Scrutiny

Very high scores may reflect:

Favorable timing (less data drift)
Benchmark artifacts
Or unintended exploitation

Four Distinct Sources of Benchmark Failure

The issues we observed fall into four fundamentally different categories:

      
            Problem
            Nature
            Remedy
          
            Annotation errors
            Incorrect gold labels or SQL
            Better verification and curation
          
            Data drift
            Non-stationary datasets
            Snapshotting and versioning
          
            Engine mismatch
            Semantic differences across SQL engines
            Engine-aware evaluation
          
            Query ambiguity
            Multiple valid interpretations
            Multi-answer or ambiguity-aware evaluation

These problems are often conflated, but they require fundamentally different remedies.

Recommendations for Benchmark Design

Reliable benchmarks require more than correct answers—they require stable evaluation conditions.

Production systems must handle evolving data and ambiguous questions. But benchmarks serve a different purpose: they are supposed to isolate and measure system capability under controlled conditions. When environmental variance dominates measured error, benchmark scores become difficult to interpret scientifically.

We suggest focusing on two dimensions:

1. Annotation Quality (What is measured)

Prior work, notably Jin et al., has demonstrated the impact of annotation errors on benchmark reliability. Their remedies—improved verification pipelines, corrected SQL, and clearer questions—address this dimension directly.

Ensure gold SQL is correct
Remove ambiguity in questions or accept several valid SQL translations

2. Evaluation Stability (How it is measured)

Use snapshot data, not live datasets
Document assumptions (data version, engine, timing)
Support engine-specific behavior where needed
Version the benchmark over time
Allow tolerance for minor numerical differences
Evaluate how answers are produced, not just results

Spider2-E: A More Stable Evaluation Framework

To explore these ideas, we built Spider2-E, a modified version of Spider2-Snow.

Key changes:

Removed unstable instances

Data drift cases
Unavailable databases

Handled engine differences explicitly

Improved ground truth SQL
Removed outdated queries
Added corrected versions

Strengthened evaluation

Detect hardcoding
Validate query structure
Treat empty results consistently

      
            Artifact
            Upstream
            Spider2-E
            Delta
          
            Instances
            547
            524
            −23
          
            Gold SQL files
            120
            118
            −2 net
          
            Gold result CSVs
            1,544
            1,469
            −75 net
          
            Eval script
            upstream
            patched
            hardcoding rejection + empty-result fix

Dataset reduced from 547 → 524 instances
Evaluation made more reproducible and stable

Beyond Benchmarks: The Ambiguity Problem in Real Queries

Our work on Spider2 revealed something that goes beyond benchmark reliability: natural language queries are deeply ambiguous, and resolving those ambiguities is the central challenge for any production talk-to-data system.

We catalogued nine recurring categories of ambiguity. On average, each query contained 10–12 distinct points of ambiguity—none of them obvious from reading the question casually, and all of them capable of changing the result.

      
            Category
            Description
            Frequency
          
            Schema ambiguity
            Multiple tables or columns map to the same concept
            Most common
          
            Temporal ambiguity
            Unclear date fields, boundaries, timezone handling
            Very common
          
            Filtering ambiguity
            Vague criteria like "recent," "large," "active"
            Very common
          
            Aggregation ambiguity
            COUNT vs SUM, grouping granularity, pre/post filtering
            Common
          
            Join ambiguity
            Multiple valid join paths between tables
            Common
          
            Null/edge-case handling
            Include or exclude NULLs, zeros, special values
            Common
          
            Output format ambiguity
            Column names, aliases, sort order
            Common
          
            Semantic ambiguity
            Domain-specific term interpretation
            Moderate
          
            External knowledge dependency
            Information required from outside the instruction
            Moderate

Consider a seemingly straightforward Google Analytics question:

“For each visitor who made at least one transaction in February 2017, how many days elapsed between their first visit and their first transaction?”

This single question contains 11 distinct ambiguities:

Which date field should be used?
Should timestamps be interpreted in GA timezone or UTC?
What counts as a transaction?
What counts as a visit?

Each choice can change the final result.

These ambiguities are not artifacts of the benchmark. They are inherent properties of natural language questions asked against complex schemas. A production-grade talk-to-data system must be able to resolve these ambiguities.

Conclusion

Spider2 is a major step forward for evaluating Text-to-SQL systems. It brings real-world complexity into benchmarking in a way earlier datasets did not.

But our analysis shows that a large portion of measured error is not due to model failure, but to evaluation instability. If benchmarks are used to guide research and compare systems, they must measure what they intend to measure—consistently and reproducibly.

Our analysis also exposed a more fundamental problem: natural language queries are deeply ambiguous. Each question carries multiple implicit assumptions about schema mapping, temporal boundaries, filtering criteria, and aggregation logic. Current benchmarks treat each question as having a single correct answer, but in practice, several reasonable interpretations—and therefore several valid SQL translations—often exist. Any production talk-to-data system must confront this problem directly.

Annotation errors and evaluation instability are not mutually exclusive—both degrade benchmark reliability, but they require fundamentally different remedies. Annotation errors require improved verification pipelines; evaluation instability requires data stabilization, engine-aware semantics, and versioned artifacts. Benchmark correctness is not just a property of annotations; it is a property of the entire evaluation system.

Spider2-E is one attempt to move in that direction. We hope this work contributes to a broader conversation about how to build benchmarks that are not just challenging, but also reliable.

Appendix: All 32 Non-Exact Instances

      
            Instance
            Root cause
            Notes
          
            sf006
            data_drift
            Branch counts differ due to data refresh
          
            sf008
            data_drift
            Home price index updated (6.55 → 6.8)
          
            sf012
            data_drift
            2012 FEMA claim amounts differ; NFIP dataset updated multiple times
          
            sf037
            data_drift
            POI IDs changed from hex to UUID format
          
            sf040
            data_drift
            Address parsing data changed
          
            sf_bq009
            data_drift
            Revenue difference value differs; GA360 data drifted
          
            sf_bq024
            data_drift
            EVALUATION_TYPE differs (EXPCURR/EXPCHNG vs EXPMORT)
          
            sf_bq058
            data_drift
            Optimism blockchain data not available in Snowflake
          
            sf_bq063
            data_drift
            npm registry URLs changed
          
            sf_bq102
            data_drift
            Genomic start positions changed
          
            sf_bq130
            data_drift
            County vs. state rankings changed in different data window
          
            sf_bq165
            data_drift
            MITELMAN cohort total changed (3373 → 3377)
          
            sf_bq190
            data_drift
            User count changed (465 → 467)
          
            sf_bq249
            data_drift
            Whitespace category counts updated
          
            sf_bq256
            data_drift
            Ethereum balance sign differs due to data drift
          
            sf_bq275
            data_drift
            GA360 fullVisitorIds differ (rolling session data)
          
            sf_bq366
            data_drift
            Same rows, different ordering
          
            sf_ga014
            data_drift
            GA4 session counts slightly differ
          
            sf_ga018
            data_drift
            PLP-to-PDP conversion rate differs
          
            sf_ga021
            data_drift
            Different cohort event type returned
          
            sf_bq111
            engine_diff
            Missing corr_pvalue UDF in Snowflake
          
            sf_bq276
            engine_diff
            Spatial function unavailable in Snowflake
          
            sf_bq430
            engine_diff
            TO_JSON_STRING(STRUCT(...)) untranslatable
          
            sf_bq458
            engine_diff
            UNNEST vs FLATTEN performance gap
          
            sf_local344
            engine_diff
            Float boundary causes different counts
          
            sf009
            no_db
            Spatial share revoked
          
            sf013
            no_db
            Road-network share revoked
          
            sf029
            no_db
            Vendor data share revoked
          
            sf_bq152
            query_logic
            Wrong threshold conversion
          
            sf_bq416
            query_logic
            Hardcoded values
          
            sf_bq450
            query_logic
            Incorrect balance threshold
          
            sf_local335
            query_logic
            Missing tie-breaking sort key

Pralhad Deshpande - Ayan Bhowmick - Jeremy Hartman - Paul Haley

Spider2 Under the Microscope: Data Drift, Engine Gaps, and the Case for Spider2-E

Introduction

The Core Issue: Evaluating Against a Moving Target

1. Data Drift

2. Engine Differences

3. Database Unavailability

A Structural Weakness: Result-Only Evaluation

Implications for Reported Scores

The True Ceiling Is Lower Than 100%

Scores Change Over Time

Very High Scores Require Scrutiny

Four Distinct Sources of Benchmark Failure

Recommendations for Benchmark Design

1. Annotation Quality (What is measured)

2. Evaluation Stability (How it is measured)

Spider2-E: A More Stable Evaluation Framework

Removed unstable instances

Handled engine differences explicitly

Strengthened evaluation

Beyond Benchmarks: The Ambiguity Problem in Real Queries

Conclusion

Appendix: All 32 Non-Exact Instances

Deploy agentic systems your business can depend on

Partnership

Product

Industry Spotlight

Company

Resources

Spider2 Under the Microscope: Data Drift, Engine Gaps, and the Case for Spider2-E

Introduction

The Core Issue: Evaluating Against a Moving Target

1. Data Drift

2. Engine Differences

3. Database Unavailability

A Structural Weakness: Result-Only Evaluation

Implications for Reported Scores

The True Ceiling Is Lower Than 100%

Scores Change Over Time

Very High Scores Require Scrutiny

Four Distinct Sources of Benchmark Failure

Recommendations for Benchmark Design

1. Annotation Quality (What is measured)

2. Evaluation Stability (How it is measured)

Spider2-E: A More Stable Evaluation Framework

Removed unstable instances

Handled engine differences explicitly

Strengthened evaluation

Beyond Benchmarks: The Ambiguity Problem in Real Queries

Conclusion

Appendix: All 32 Non-Exact Instances

EMERGENCE WORLD: A Laboratory for Evaluating Long-horizon Agent Autonomy

Deploy agentic systems your business can depend on

Partnership

Product

Industry Spotlight

Company

Resources