Deriving High Quality Domain Insights via Data Intelligence Agents

January 26, 2026

Ashish Jagmohan

Rapid advances in foundation models, code-generation agents, and tool-using multi-agent frameworks have caused a surge of interest in agentic data intelligence systems. A variety of systems, algorithmic approaches, and benchmarks have been proposed over the last couple of years. These range from agents that aim to automate data-science model building, often on Kaggle-style competitions [1, 2], to systems that seek to automate data pipelines and workflows by orchestrating commonly used data tools [3]. Other benchmarks target specific parts of the data pipeline. For example, the “talk to your data” setting focuses on agents that translate natural language into SQL/database queries [4], with “realistic” settings representative of enterprise workloads with large schemas, multiple dialects, and multi-step queries [5]. Similarly, CoDA [6] focuses on collaborative multi-agent automation for visualization workflows. Still other approaches train end-to-end agentic models [7] or use agentic scaffolds with verification and iterative plan refinement to improve analysis planning [8]. At the same time, foundation model vendors have started to productize along these directions. Coding agents [9, 10, 11] armed with tool-use, file-manipulation, and code-execution capabilities enable interactive data analysis and artifact-generation workflows.

Despite these advances in benchmarks, agent designs, and product experiences, there still exist important deployment gaps, from performance and reliability to issues around data security, privacy, and governance models. In this post, we’ll focus on reliable and useful task completion. The literature shows significant bottlenecks. In [2], even the best-performing agent solves only 34% of data-analysis tasks. In [3], results are even more sobering for real operational settings: state-of-the-art agents complete only 14.0% of tasks end-to-end, struggling with brittle GUI actions, multi-tool coordination, and cloud-hosted pipeline steps. Other benchmarks like [5] paint a similar picture, with top-of-leaderboard agents demonstrating performance that would be unacceptable for real-world automation. This largely matches our experience in real-world settings: general AI agents are not yet reliable enough to deploy out-of-the-box in specialized domains. Together, these results sharpen the key question for practitioners: not whether an agent can generate SQL or Python, but whether it can do so reliably, consistently, and to completion in complex and specialized settings.

The Agentic Data Intelligence Stack

Figure 1: The agentic data intelligence stack

A useful way to think about an agentic data intelligence system is as a layered stack that turns raw data into useful and actionable insights. The foundational layer consists of the data and knowledge stores that ground the system. This includes the raw and curated data itself, but also specialized domain knowledge (data semantics, business logic, constraints, objectives), and procedural and personalized user memory that helps the system both improve over time and adapt to user preferences.
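As a rough illustration, the grounding context that the upper layers read from this foundational layer could be bundled as follows. This is a minimal sketch; the class and field names are illustrative assumptions, not from any specific framework.

```python
from dataclasses import dataclass, field

@dataclass
class DomainKnowledge:
    """Illustrative container for the specialized knowledge that grounds the system."""
    column_semantics: dict = field(default_factory=dict)   # e.g. "churn_flag" -> "1 if customer cancelled within 30 days"
    business_rules: list = field(default_factory=list)     # e.g. "exclude trial accounts from revenue metrics"
    objectives: list = field(default_factory=list)         # e.g. "reduce monthly churn below 2%"

@dataclass
class UserMemory:
    """Procedural and personalized memory that accumulates across sessions."""
    past_successful_workflows: list = field(default_factory=list)
    preferences: dict = field(default_factory=dict)        # e.g. {"report_style": "executive summary"}

@dataclass
class GroundingContext:
    """Everything the agent layers read from the foundational layer."""
    data_sources: list                                     # connection strings, table names, or file paths
    knowledge: DomainKnowledge
    memory: UserMemory
```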

Above this sits the tool layer. This includes standard data processing and pipeline tools. Beyond these, it also includes domain-specific analysis and data science tools and algorithms, and finally standard and custom visualization and reporting tools, including dashboard and report generation. A key design choice here is dynamic vs. static tooling. Some capabilities can be created on the fly by code-generation agents, e.g., a one-off analysis script. But in many cases, static product-grade tools are preferable: (i) mature systems already exist for core pipeline tasks, and reimplementation is a reliability tax; (ii) for specialized domain algorithms, correctness, interpretability, and auditability matter, and teams may want human-validated implementations. In practice, high-quality systems may mix both: dynamic code for fast user iteration, and stable tools for the critical path.
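One way this mix might be wired up is sketched below: human-validated tools registered on the critical path, plus a restricted execution path for agent-generated one-off code. The registry, decorator, and the tiny builtin allow-list are illustrative placeholders, not a real framework API; a production system would run generated code in a proper sandbox with resource limits.

```python
# Sketch: mixing static, product-grade tools with dynamically generated code.
from typing import Callable

STATIC_TOOLS: dict = {}

def register_tool(name: str):
    """Register a human-validated, static tool for the critical path."""
    def decorator(fn: Callable) -> Callable:
        STATIC_TOOLS[name] = fn
        return fn
    return decorator

@register_tool("profile_table")
def profile_table(rows: list) -> dict:
    """Static tool: basic profiling with a vetted, auditable implementation."""
    columns = rows[0].keys() if rows else []
    return {col: {"non_null": sum(r[col] is not None for r in rows)} for col in columns}

def run_dynamic_snippet(code: str, context: dict) -> dict:
    """Dynamic path: execute agent-generated, one-off analysis code.
    The tiny builtin allow-list is for illustration only; use real sandboxing in practice."""
    scope = dict(context)
    exec(code, {"__builtins__": {"sum": sum}}, scope)
    return {k: v for k, v in scope.items() if k not in context}

# Critical-path work goes through the static tool; ad hoc iteration through generated code.
rows = [{"amount": 10}, {"amount": None}]
print(STATIC_TOOLS["profile_table"](rows))
print(run_dynamic_snippet("total = sum(r['amount'] or 0 for r in rows)", {"rows": rows}))
```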

The agents layer leverages these specialized tools to generate artifacts and insights from the raw data. Metadata & ETL agents may perform a host of upstream tasks including metadata enhancement, quality and lineage checks, and data transformations. Data science & analysis agents use the data and metadata produced by these upstream agents to do the investigative work: performing exploratory data analysis, generating and testing hypotheses, training and evaluating models, and leveraging specialized data science and analysis tools when needed. The output of the data science agent is often a set of artifacts ranging from structured extracts (tables, features, derived datasets) to unstructured multimodal artifacts (plots, text, notebooks). Finally, insight generation agents process these artifacts to generate reports and insights that a human can act upon, much like a human data scientist would.
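Stripped of the iteration and validation discussed below, the hand-off between these roles might look like the following toy sketch, where each agent is a placeholder function and artifacts are plain dictionaries; real agents would, of course, be LLM-driven and tool-using.

```python
# Toy sketch of the agent layer as a sequence of roles passing artifacts downstream.
def metadata_etl_agent(raw_tables: dict) -> dict:
    """Upstream: enrich metadata, run quality checks, apply transformations."""
    return {
        "tables": raw_tables,
        "metadata": {name: {"row_count": len(rows)} for name, rows in raw_tables.items()},
        "quality_flags": [],
    }

def data_science_agent(prepared: dict) -> list:
    """Middle: exploratory analysis and modeling; emits heterogeneous artifacts."""
    artifacts = []
    for name in prepared["tables"]:
        artifacts.append({"kind": "table_summary", "source": name,
                          "payload": prepared["metadata"][name]})
    return artifacts

def insight_agent(artifacts: list) -> list:
    """Downstream: distill artifacts into human-readable insights."""
    return [f"{a['source']}: {a['payload']['row_count']} rows analyzed" for a in artifacts]

raw = {"orders": [{"id": 1}, {"id": 2}]}
print(insight_agent(data_science_agent(metadata_etl_agent(raw))))
```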

In practice, the agent layer is often an iterative loop, with validation, multi-agent consensus, and refinement implicitly woven throughout. In the remainder of this post, we’ll focus primarily on the insight generation part of the stack, discussing the problem of producing accurate, consistent, actionable insights. The other components, especially the upstream metadata and ETL agents, will be topics for future posts.

Characterizing and Improving Insight Quality

Once upstream data science and analysis agents have created artifacts, the system enters the insight generation phase. In this phase, a set of agents takes the produced artifacts and distills them into conclusions a human can use. As we briefly touched upon in the previous section, these artifacts are multimodal and heterogeneous. They may include tables and data extracts, notebooks, textual notes, charts and figures, dashboards, even screenshots of tool outputs. Thus, insight generation is more than just creating a summary or report; the problem is interpreting evidence across modalities. A key question that arises is how to measure and how to improve the quality of these generated insights. In this section, we discuss key metrics and approaches to improve them.  

Accuracy

Accuracy is the first and most fundamental measure of insight quality, defined as the correctness of factual claims. In practice, an unreliable data intelligence system is a non-starter. Untrustworthy claims and recommendations will not be operationalized, and “mostly right” is generally not good enough, especially when errors are hard to detect.

In our experience, model choice is critical. Insight agents need foundation models with enough multimodal and analytical strength to faithfully interpret plots, parse notebook outputs, and compose insights across heterogeneous data. But there are several other considerations. Insight accuracy often depends on the source used to compute an insight. For example, some facts are best derived from structured sources (recomputing statistics from a data extract) rather than “reading” them off an image. Further, insight generation requires a workflow that can reason over multiple artifacts jointly, since many real insights only emerge when you connect signals across outputs (e.g., a distribution shift visible in a plot, and a metric change confirmed in a table). This gives rise to a practical tradeoff between breadth and accuracy. Many insights require synthesizing multiple artifacts, but as the number and complexity of artifacts grows, error rates rise. Both these considerations motivate co-optimization between analysis agents and insight agents. Upstream analysis agents should emit artifacts that are optimized for utility and reliability in insight generation downstream.
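To make the source-selection point concrete, here is a minimal sketch of checking a claimed statistic against the structured extract rather than trusting a value read off a chart. The claim format, helper names, and tolerance are illustrative assumptions, not part of any system described above.

```python
# Sketch: prefer recomputing a claimed statistic from a structured extract over a
# value "read" off a chart image, and flag disagreements rather than guessing.
import statistics

def recompute_metric(extract: list, column: str, metric: str) -> float:
    """Recompute a statistic directly from the structured data extract."""
    values = [row[column] for row in extract if row[column] is not None]
    if metric == "mean":
        return statistics.mean(values)
    if metric == "max":
        return max(values)
    raise ValueError(f"unsupported metric: {metric}")

def resolve_claim(claim: dict, extract: list, rel_tol: float = 0.01) -> dict:
    """Use the structured source as ground truth; flag chart/extract mismatches."""
    recomputed = recompute_metric(extract, claim["column"], claim["metric"])
    chart_value = claim["value_read_from_chart"]
    agrees = abs(recomputed - chart_value) <= rel_tol * max(abs(recomputed), 1e-9)
    return {"claim": claim, "value": recomputed, "source": "extract",
            "flag": None if agrees else "chart/extract mismatch"}

extract = [{"revenue": 100.0}, {"revenue": 120.0}, {"revenue": 95.0}]
claim = {"column": "revenue", "metric": "mean", "value_read_from_chart": 110.0}
print(resolve_claim(claim, extract))   # recomputed mean is 105.0, so the claim is flagged
```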

Finally, accuracy improves significantly with the use of auto-validation techniques within self-refining iterative agentic loops. Thus, candidate insights should be cross-checked against primary artifacts (via re-running or otherwise verifying computations), uncertainties and errors should be flagged, and the process should be iterated until accuracy is confirmed.
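In control-flow terms, such a loop can be as simple as the skeleton below, assuming hypothetical generate, verify, and refine steps backed by an LLM and the primary artifacts; the stubs in the usage example exist only to exercise the loop.

```python
# Skeleton of a self-refining validation loop for candidate insights.
def validated_insights(artifacts, generate, verify, refine, max_rounds: int = 3):
    """Iterate generate -> verify -> refine until every claim checks out or rounds
    are exhausted; unresolved issues are surfaced rather than dropped silently."""
    insights = generate(artifacts)
    for _ in range(max_rounds):
        issues = verify(insights, artifacts)      # cross-check claims against primary artifacts
        if not issues:
            return insights, []
        insights = refine(insights, issues)       # revise only what failed verification
    return insights, verify(insights, artifacts)  # flag any remaining uncertainty

# Toy usage with stub generate/verify/refine callables:
insights, unresolved = validated_insights(
    artifacts=["table_extract"],
    generate=lambda a: ["revenue grew 12% QoQ"],
    verify=lambda ins, a: [] if all("%" in i for i in ins) else ["missing quantification"],
    refine=lambda ins, issues: [i + " (12%)" for i in ins],
)
print(insights, unresolved)
```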

Consistency

Consistency, at a high level, implies that similar data leads to similar insights. The key complexity here derives from the fact that real-world data analysis is inherently combinatorial. From any non-trivial set of artifacts, there are many plausible observations you could make, of which only a small fraction is worth surfacing. This space of potential insights grows with the number and diversity of artifacts. So the core question becomes: how does an agent reliably search this enormous (often effectively unbounded) insight space and converge on the same high-value conclusions across multiple runs? In our experience, vanilla foundation models and naive agents are often wildly inconsistent from run to run. This is due to their inherent stochasticity in sampling from this large space, with small sampling differences yielding different interpretations.

Note that determinism in the narrow sense isn’t the right target here. Approaches that emphasize token-level reproducibility (e.g., [12]) are poorly suited to this problem. Instead of token-level determinism, we need semantic consistency, where meaningfully similar artifacts lead to meaningfully similar insights, even if the exact wording varies. In practice, we’ve found that multiple approaches can help with this. One family of approaches is the use of multi-agent ensembles. Here, multiple agents may come up with different proposals, followed by critique, debate, and consensus, using a variety of topologies [13]. Another family of approaches uses grounding and scoping, injecting domain knowledge and curated examples of desirable insight types to constrain the agent’s search towards useful insights.
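The ensemble idea can be sketched with a toy consensus step: run the proposer several times, group near-duplicate insights, and keep only the conclusions a majority of runs converged on. The token-overlap similarity here is a crude stand-in for an embedding-based or LLM-judge comparison, and the clustering is deliberately simplistic.

```python
# Sketch: ensemble-style consensus for semantic (not token-level) consistency.
def _tokens(text: str) -> set:
    return set(text.lower().split())

def _similar(a: str, b: str, threshold: float = 0.5) -> bool:
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / max(len(ta | tb), 1) >= threshold   # Jaccard overlap as a placeholder

def consensus_insights(proposals_per_run: list, min_votes: int = 2) -> list:
    """Cluster semantically similar proposals across runs and keep majority conclusions."""
    clusters = []   # each: {"representative": str, "votes": int}
    for run in proposals_per_run:
        for insight in run:
            for cluster in clusters:
                if _similar(insight, cluster["representative"]):
                    cluster["votes"] += 1
                    break
            else:
                clusters.append({"representative": insight, "votes": 1})
    return [c["representative"] for c in clusters if c["votes"] >= min_votes]

runs = [
    ["churn rose sharply in the EU segment", "logo colors updated in dashboard"],
    ["EU segment churn rose sharply last quarter"],
    ["churn in the EU segment rose sharply"],
]
print(consensus_insights(runs))   # only the EU-churn conclusion survives the vote
```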

Actionability

Actionability is the idea that, of the large number of potential insights that can be gleaned from data analysis, only a small number provide value in terms of driving decisions or follow-up actions. In most enterprise settings, actionability is crucial; an insight is only useful if it helps prioritize actions, identify root causes, or trigger follow-up work. It is instructive to contrast this with the consistency problem. Rather than explicitly trying to force an agent to say the same thing every time, you may get better stability by optimizing the system to recognize what is actionable and to preferentially generate those insights on each run. If the system reliably knows which insights matter, then semantic consistency can emerge naturally as a byproduct of repeatedly selecting the most decision-relevant insights.

Again, there are multiple approaches that can be leveraged here, many of which are related to the discussion around consistency, since they operate similarly by constraining the insight space. Domain knowledge about which insights matter and which thresholds trigger action, along with curated examples of strong vs. weak insights, can help significantly. Most important, perhaps, is the use of human-in-the-loop (HITL) feedback, where human experts evaluate the generated insights and provide feedback. Such HITL interaction, paired with a system that can reflect on and learn from that feedback, steadily improves actionability over time.
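One simple way such a feedback loop might be wired is sketched below: expert judgements are stored and reused to rank new candidate insights, preferring those that resemble previously actionable examples. The token-overlap similarity and the in-memory feedback list are illustrative placeholders for an embedding model or LLM judge and a persistent feedback store.

```python
# Sketch: ranking candidate insights by learned actionability from HITL feedback.
from collections import defaultdict

FEEDBACK = []   # accumulated {"insight": str, "actionable": bool} records

def record_feedback(insight: str, actionable: bool) -> None:
    """Persist an expert's judgement of a generated insight."""
    FEEDBACK.append({"insight": insight, "actionable": actionable})

def _overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def actionability_score(candidate: str) -> float:
    """Score a candidate by how much more it resembles actionable vs. rejected examples."""
    scores = defaultdict(float)
    for record in FEEDBACK:
        key = "pos" if record["actionable"] else "neg"
        scores[key] = max(scores[key], _overlap(candidate, record["insight"]))
    return scores["pos"] - scores["neg"]

def rank_candidates(candidates: list, top_k: int = 3) -> list:
    """Surface the candidates most likely to drive action, per accumulated feedback."""
    return sorted(candidates, key=actionability_score, reverse=True)[:top_k]

record_feedback("EU churn exceeded the 2% threshold; escalate retention campaign", True)
record_feedback("the dashboard now uses the new color palette", False)
print(rank_candidates(["churn exceeded threshold in EU; trigger retention workflow",
                       "color palette was refreshed in the latest dashboard"], top_k=1))
```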

Accuracy, consistency, and actionability are perhaps the most foundational metrics for high-quality insight generation. Beyond these, real deployments also care about latency and cost, privacy, security and governance, auditability, and robustness to data-quality issues, on which we will have more to say in subsequent posts. On the one hand, optimizing data intelligence agents remains inherently domain-specific, depending on specialized knowledge of business considerations, data semantics, and operational workflows. On the other hand, there are general techniques that hold the promise of transfer across domains. As we’ve discussed, these include structured multi-agent systems with validation, consensus, and refinement; explicit representation and use of domain knowledge; and iterative learning from human feedback. Together, these techniques will be crucial in turning agentic data intelligence systems from impressive prototypes into real-world enterprise deployments.

References

[1] Guo et al., “DS-Agent: Automated Data Science via Case-Based Reasoning” (2024), arXiv:2402.17453

[2] Jing et al., “DSBench: Data Science Agents vs. Experts” (2025), arXiv:2409.07703

[3] Cao et al., “Spider2-V: Automating Data Science & Engineering Workflows?” (2024), arXiv:2407.10956

[4] BIRD-Bench, https://bird-bench.github.io/

[5] Spider 2.0, https://spider2-sql.github.io/

[6] Xu et al., “CoDA: Collaborative Data-Visualization Agents” (2025), arXiv:2510.03194

[7] Zhang et al., “DeepAnalyze-8B: Agentic LLM for Autonomous Data Science” (2025), arXiv:2510.16872

[8] Nam et al., “DS-STAR: Data Science Agent via Iterative Planning and Verification” (2025), arXiv:2509.21825

[9] OpenAI Codex, https://openai.com/codex/

[10] Claude Code, https://claude.com/product/claude-code

[11] Gemini CLI, https://geminicli.com/

[12] He et al., “Defeating Nondeterminism in LLM Inference”,  https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

[13] Kim et al., “Towards a Science of Scaling Agent Systems” (2025), arXiv:2512.08296
