Agentic AI for Enterprise Needs Much More than LLMs
Enterprise AI has already crossed its first major threshold. Large language models can read, write, reason, translate intent into code, and synthesize complex artifacts with remarkable fluency. In many organizations, the question “can the model generate this?” has effectively been answered.
What remains unresolved, and increasingly dominant, is not whether systems can produce seemingly impressive outputs, but whether they can reliably determine when a task has been completed correctly. That distinction marks the boundary between demonstrations and production-grade autonomy. Enterprises do not care about text, SQL, or plans in isolation; they pay for completed tasks that are correct, compliant, auditable, and repeatable. When correctness cannot be established cheaply and mechanically, the dream of the autonomous enterprise remains out of reach, even with LLMs: human review expands to fill the gap, producing AI slop, the risk of silent failures compounds, and agentic systems regress to little more than drafting assistants.

This is why verification matters today, not as a narrow checkpoint between development and deployment, but as a new control layer positioned between machine autonomy and human–machine collaboration. When verification moves to this boundary, it ceases to be a binary pass/fail judgment about whether software is "good enough to ship" and becomes a routing mechanism for action: verified paths authorize autonomous execution, failed proofs with counterexamples trigger regeneration or repair, and unverifiable cases are escalated to humans or blocked entirely. In this role, verification acts as the dispatcher that determines whether a task flows through a machine-only path, a collaborative path, or a prohibited one, turning it from a passive safety net into an active control surface that governs how and when intelligence is allowed to act in real-world settings.
Agentic AI verification, sometimes also called LLM-as-a-Judge, has been used and studied extensively where deterministic verification is not possible [1]. But it is a paradoxical arrangement: one probabilistic AI model judging the outputs of another. This post argues that AI combined with formal verification is not optional for enterprise agentic systems. It is the only way to turn probabilistic generation into dependable task completion. The argument draws on two complementary ideas: the notion of verification before generation in enterprise automation [2], and the "bridge-building" approach to AI-assisted formal reasoning articulated in [3]. Together, they point toward a new architecture for agentic AI, one where autonomy is earned through rigorous proof and formal guarantees rather than granted on best-effort expectations.
Why Now
Formal verification receded from prominence over the past decade not because it failed, but because it was misaligned with how software was built and deployed. The dominant economic loop favored speed over guarantees: ship quickly, observe failures, patch fast. Cloud infrastructure, feature flags, and rollbacks made post-deployment correction cheap, while formal methods optimized for worst-case correctness before execution. At the same time, software complexity shifted toward distributed systems, evolving schemas, and socio-technical semantics that resisted clean formalization, and the tooling costs of verification remained front-loaded and specialist-heavy. Machine learning further widened the gap, as probabilistic, opaque models did not fit the deterministic assumptions of classical verification. Formal methods were quarantined to domains where failure was existential, not because they were obsolete, but because their cost-benefit profile no longer pencilled out.
That cost structure has now flipped. Agentic AI systems act continuously, trigger irreversible downstream effects, and compound errors silently, making the “patch later” model untenable. Verification has re-entered precisely because correctness must be established before action, not after observation. Crucially, the hardest historical bottleneck—specification—has become dramatically cheaper. Modern AI systems can extract properties from natural language, propose contracts (requirements for interactions between components of a system using assumptions and guarantees/obligations), and generate proof sketches, turning formalization into a scalable process rather than a bespoke craft. At the same time, enterprises increasingly demand semantic correctness under all allowed interpretations of their rules, not just plausibility on sampled inputs, and human review collapses under the volume of autonomous actions. Formal verification is the only mechanism whose cost scales with rules rather than actions, allowing autonomy to grow without linear human oversight. When software merely suggested, mistakes were tolerable; now that software acts, verification becomes unavoidable.
Autonomy is a systems property, not a model property
Most discussions of agentic AI implicitly assume that autonomy increases as models become more capable. In practice, the opposite is often observed. As agents tackle more complex tasks—multi-table queries, cross-system analyses, multi-step change plans—the cost of validating their outputs grows faster than the benefits of automation.
The reason is structural. Autonomy does not live inside the model; it emerges from the surrounding system, often referred to as the scaffolding. That system must decide when an output is acceptable, when it must be regenerated, when it should be escalated to a human, and when execution must be blocked entirely. Without strong correctness signals, these decisions default to human judgment.
In enterprise workflows, correctness is rarely subjective. Reports must obey canonical definitions. Queries must respect schema semantics and policy constraints, and pipelines must preserve stated properties. Change plans must be reversible and bounded in risk. The central bottleneck, then, is not generation quality but verifiability. An agent can only be autonomous to the extent that its outputs can be checked automatically. Where verification is cheap and reliable, autonomy can expand and humans step in only in rare, critical scenarios. Where verification is weak or ambiguous, autonomy collapses back into extensive human review.
Why testing and heuristics are insufficient
Enterprises already test extensively: results from the current week are compared against last week's numbers for consistency, dashboards are eyeballed for anomalies, and so on. These checks are useful, but they answer a limited question: does this appear correct on the data we tried?
Many of the most damaging errors in enterprise analytics pass these tests effortlessly. A query may silently double-count due to a join mistake. A report may include data outside an allowed time window. A metric may violate its own definition while still producing plausible numbers. These failures are semantic, not syntactic, and they often require deep domain knowledge even to recognize.
Formal verification addresses a different question altogether: does this artifact satisfy the specified property for all valid states consistent with the model? Even when applied narrowly—only to critical constraints—it transforms correctness from a matter of judgment into a matter of proof.
This distinction is why mature engineering disciplines adopted formal verification so extensively: testing could not provide the guarantees they needed.
Formal verification is already a production technology
It is important to emphasize that formal methods are not speculative or confined to academia. They are already indispensable in domains where failure is unacceptable.
Hardware
In semiconductors, specifically modern hardware design, verification is no longer a secondary activity but the dominant engineering cost, and the data makes this starkly clear. The 2024 Wilson Research Group Functional Verification Study [4] reports that first-silicon success has fallen to just 14%, the lowest figure recorded in over two decades of tracking, while 75% of projects now finish behind schedule. Verification routinely consumes 60–70% of total chip development effort, yet simulation-only strategies cannot keep pace with SoC complexity. The fundamental barrier is mathematical: exhaustively testing even a modest double-precision floating-point division unit would require 2¹²⁸ test vectors, a space so vast that running one test per nanosecond would exceed the age of the universe. As Kern and Greenstreet [5] observe in their survey, industrial hardware verification matured precisely because simulation cannot scale to these astronomical state spaces. This is the gap formal verification targets directly. Rather than sampling behaviors, formal tools use SAT and SMT solvers to reason symbolically over the entire input space, either proving correctness for all possible states or producing a minimal counterexample that maps back to a concrete waveform showing exactly where and how the design fails.
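As a toy illustration of that symbolic style, the sketch below uses the Z3 SMT solver's Python bindings to settle an equivalence question over every possible input of a tiny datapath fragment; the "designs" here are deliberately trivial stand-ins for real RTL, not an actual sign-off flow.

from z3 import BitVec, Solver, sat

a, b = BitVec('a', 8), BitVec('b', 8)

# Reference behavior and a proposed rewrite of the same function.
reference = (a + b) - b            # modular bit-vector arithmetic: always equals a
optimized = a

s = Solver()
s.add(reference != optimized)      # does ANY input pair distinguish the two?
print(s.check())                   # unsat: equivalent across all 2^16 input pairs

# A faulty rewrite is refuted with a concrete counterexample rather than a test suite.
buggy = a | b
s = Solver()
s.add(reference != buggy)
if s.check() == sat:
    print(s.model())               # specific values of a and b where the designs diverge

The same pattern, scaled up by industrial equivalence checkers, is what turns "we ran many simulations" into "no distinguishing input exists."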
In EDA, the key distinction between functional and formal verification is the question each answers. Functional verification (simulation, emulation, prototyping) asks whether a design behaves correctly on the scenarios that were exercised; formal verification asks whether there exists any legal execution, over all time and all inputs, that violates the specification. Bryant and Kukula [6], in Formal Methods for Functional Verification, describe how hardware formal verification spans everything from equivalence checking to temporal-property proofs over sequential systems, and how the field's practicality accelerated when verification problems were reduced to decision procedures over logic. Hardware has proven uniquely fertile ground for formal methods for three reasons: the cost of failure is extreme, post-deployment fixes are often impossible, and clocked sequential logic maps naturally to state-machine reasoning. As a result, equivalence checking (proving that RTL matches synthesized netlists, that optimizations preserve behavior, and that ECOs introduce no regressions) has become a mandatory sign-off step in ASIC flows, delivering binary guarantees that would otherwise require massive regression suites. Property-based formal verification now routinely handles the control logic where simulation systematically fails: cache coherence protocols, bus arbiters, CDC/reset logic, power-management state machines, and security monitors enforcing isolation and privilege boundaries. Sandia National Laboratories' survey [7] explicitly notes that formal methods are already widely used in safety- and security-critical digital systems because testing provides weak assurance for these requirements. In practice, this translates into measurable ROI: vendors and adopters report that formal apps routinely eliminate entire bug classes earlier in the cycle, saving weeks per IP block and reducing re-spin risk.
Seen in historical context, the role of formal verification in hardware has steadily expanded from theory to sign-off infrastructure: from foundational work on concurrency and state machines in the 1970s and 1980s, through symbolic methods such as BDDs (Binary Decision Diagrams) in the 1990s, which provided a canonical representation enabling symbolic model checking at industrial scale, to today's SAT/SMT-driven verification, which can outperform BDD-based approaches for equivalence and property checking, producing counterexamples faster and avoiding BDD memory blow-up (the landmark paper being Symbolic Model Checking without BDDs [8]). It has not replaced simulation; it has taken ownership of the correctness properties that simulation cannot scale to, becoming the mechanism by which hardware correctness is operationalized at production scale.
Software
While hardware verification was driven by the high cost of fabrication, software verification has been driven by the need for trust. Formal verification emerged and succeeded in the software layers where correctness defines the trust boundary. This precedent matches the current need of agentic AI: a move from "testing happy paths" to "proving safety properties."
At the root of the software stack, where errors compromise every upstream application, empirical testing is mathematically insufficient over a combinatorial space of execution paths. The seL4 microkernel [10] set the standard by providing a machine-checked proof that its C implementation refines its abstract specification, guaranteeing that properties like isolation and authority confinement hold for every possible path. This was a necessity for systems where a single isolation failure collapses the security model. Similarly, CompCert [9] introduced the concept of a formally verified optimizing compiler. By proving semantic preservation—that the generated machine code behaves exactly as the source code specifies—it eliminated the need for defensive coding against compiler bugs. These examples established that mathematical rigor is compatible with performant, low-level systems programming.
Compelling evidence for formal verification in the enterprise comes from the hyperscalers. As cloud configurations exploded in complexity, Amazon Web Services (AWS) and Microsoft Azure realized that no amount of penetration testing could guarantee security. A customer’s environment—comprising thousands of IAM roles, security groups, and VPC peering connections—creates a massive configuration state space. IAM policy graphs explode in size with rule composition.
To solve this, AWS deployed Zelkova [11], an automated reasoning engine powered by SMT solvers. When a user asks, "Can this S3 bucket be accessed from the public internet?", Zelkova does not send packets to check. It translates the policies into logic and mathematically proves whether a satisfying assignment (a path) exists. This transforms security from a probabilistic assessment ("we didn't find a hole") to a deterministic guarantee ("no hole exists").
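Zelkova's internals are proprietary, but the style of reasoning can be sketched with the same solver machinery: encode a simplified policy as constraints over a symbolic request and ask whether any satisfying request crosses the boundary the policy is supposed to enforce. Everything below (field names, the policy shape, the address range) is invented for illustration and is not the real AWS policy language.

from z3 import BitVec, Bool, Solver, And, Or, Not, UGE, ULE

# Symbolic request against one toy bucket policy.
src_ip       = BitVec('src_ip', 32)    # source address of the request
is_anonymous = Bool('is_anonymous')    # request carries no credentials

corp_lo, corp_hi = 0x0A000000, 0x0AFFFFFF     # 10.0.0.0 - 10.255.255.255, the "corporate network"
from_corp = And(UGE(src_ip, corp_lo), ULE(src_ip, corp_hi))

# Policy: allow reads only for authenticated callers on the corporate network.
allow = And(Not(is_anonymous), from_corp)

# Question: can any allowed request be anonymous or come from outside that network?
s = Solver()
s.add(allow, Or(is_anonymous, Not(from_corp)))
print(s.check())   # unsat: no such request exists, over all 2^32 source addresses

A sat result would instead come with a concrete offending request, exactly the counterexample shape a security team can act on.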
Enterprise agentic systems similarly require formalized domains, policies expressed in constrained DSLs, and explicit invariants. With these foundations in place, they can check intended actions against formally defined safety properties before admitting state transitions.
From natural language to proofs: building the bridges
Formal verification, however, only works when systems are represented precisely. This is where the integration of AI and formal methods becomes essential.
The work on integrating AI with the Lean proof assistant [3] emphasizes that verification is not merely about proving theorems. It is about constructing representations that capture domain meaning. Lean can only verify what has been made explicit. The challenge, therefore, is not just proof automation but formalization.
The pattern that emerges is a sequence of bridges. First, the domain must be modeled formally: schemas, properties, policies, and semantic definitions. Second, these models must be exposed through a domain-specific language that both humans and machines can manipulate. Third, reusable lemmas—facts about the domain—must be established once and reused everywhere. Finally, automation and AI assistance make it feasible to apply this machinery repeatedly.
Enterprise agentic systems require exactly this pipeline. Task intent must be translated into explicit contracts. Contracts must generate proof obligations. Agents may propose candidate solutions, but acceptance depends on discharging those obligations. AI is invaluable here—not as an arbiter of correctness, but as a generator of structured candidates and proof sketches that formal tools can check.
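A minimal sketch of how such a contract-driven gate might be wired is shown below; the class and function names are hypothetical placeholders, not a specific product API.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Obligation:
    name: str
    # Returns None when the obligation is discharged, otherwise a counterexample description.
    check: Callable[[str], Optional[str]]

@dataclass
class Contract:
    intent: str                       # task intent, made explicit
    obligations: list[Obligation]     # proof obligations derived from the contract

def accept(candidate: str, contract: Contract) -> tuple[bool, list[str]]:
    """Accept a candidate artifact only if every proof obligation is discharged."""
    failures = [f"{ob.name}: {cex}"
                for ob in contract.obligations
                if (cex := ob.check(candidate)) is not None]
    return (not failures, failures)

The important property is that acceptance is decided by the obligations, not by the generator's confidence in its own output.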
A semantic verification example: start_date and end_date correctness
To see how this works in practice, consider a deceptively simple analytics task.
A user asks: “Show me all active customer subscriptions for Q4.”
The enterprise domain has a clear semantic definition of "active subscription" (formalized in the short sketch after this list):
- each subscription has a start_date and an end_date
- a subscription is active at time t if start_date ≤ t < end_date
- end_date may be NULL to represent “still active”
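These three facts translate directly into a machine-checkable predicate. A minimal Python sketch, assuming dates arrive as datetime.date values and an open-ended subscription carries None:

from datetime import date
from typing import Optional

def is_active(start_date: date, end_date: Optional[date], t: date) -> bool:
    """Active at time t iff start_date <= t < end_date; a NULL (None) end_date means still open."""
    return start_date <= t and (end_date is None or t < end_date)

# A subscription that ended in mid-2022 is not active on any day of Q4 2024.
assert not is_active(date(2020, 1, 1), date(2022, 6, 1), date(2024, 10, 1))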
Generation without verification
An agent generates the following SQL:
SELECT *
FROM subscriptions
WHERE end_date <= '2024-12-31';
The query runs successfully. It returns rows. On cursory inspection, it even looks plausible. But it is semantically incorrect: it includes subscriptions that ended before Q4, it excludes subscriptions that start mid-quarter and run past year end, and it drops every still-active subscription outright, because a NULL end_date never satisfies the comparison.
Traditional testing might not catch this unless carefully designed. A human reviewer might or might not notice, depending on context.
Formalizing the semantics
Instead, suppose we ask the system to state the properties that any subscription active during Q4 must satisfy. It is likely to generate the following:
- s.start_date < Q4_end
- s.end_date > Q4_start ∨ s.end_date IS NULL
These properties are part of the task contract.
Translating the query to semantics
The SQL is translated into a semantic representation—relational algebra or a typed logical form—that makes predicate structure explicit. From this representation, the verifier attempts to prove that the properties hold for all rows selected by the query.
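One way to mechanize this step is to encode the query's row-selection predicate and the contract properties over symbolic column values and ask an SMT solver whether any selected row can violate the contract. The sketch below uses the Z3 Python bindings, with dates reduced to integer day offsets and boundary conventions simplified; it illustrates the idea rather than a full SQL-to-logic translation.

from z3 import Int, Bool, Solver, And, Or, Not, sat

# One symbolic row of the subscriptions table; a NULL end_date is modeled explicitly.
start_date, end_date = Int('start_date'), Int('end_date')
end_is_null = Bool('end_is_null')

# Day offsets standing in for 2024-10-01 and 2024-12-31 (values are illustrative).
Q4_FIRST, Q4_LAST = 1001, 1092

# Predicate of the generated query: WHERE end_date <= '2024-12-31'
# (a NULL end_date never satisfies a SQL comparison, hence Not(end_is_null)).
query_selects = And(Not(end_is_null), end_date <= Q4_LAST)

# Contract: every selected row must be active during Q4, i.e. it started by the
# end of Q4 and did not end before Q4 began (open-ended rows count as active).
active_in_q4 = And(start_date <= Q4_LAST,
                   Or(end_is_null, end_date >= Q4_FIRST))

# Ask for a witness: a row the query selects that violates the contract.
solver = Solver()
solver.add(query_selects, Not(active_in_q4))
if solver.check() == sat:
    print("counterexample:", solver.model())   # e.g. a row whose end_date falls before Q4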
Proof failure and counterexample
The proof fails. The verifier constructs a counterexample:
- start_date = '2020-01-01'
- end_date = '2022-06-01'
This row satisfies the query predicate (end_date <= '2024-12-31') but violates one of the properties that active subscriptions must satisfy, since the subscription ended long before Q4.
The failure is precise and explanatory. It does not say “this looks wrong.” It says: there exists a subscription that your query includes that violates the definition of active. Furthermore, it flags which specific property failed, which helps the system correct the query.
Regeneration with semantic feedback
Given this counterexample, the agent regenerates:
SELECT *
FROM subscriptions
WHERE start_date <= '2024-12-31'
AND (end_date IS NULL OR end_date >= '2024-10-01');
The verifier rechecks the properties. This time, the proof completes and the query is accepted.
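Continuing the toy encoding above, the regenerated predicate leaves the solver with no witness: asking for a selected row that violates the contract now comes back unsatisfiable, which is the machine-checkable form of "the proof completes."

# Predicate of the regenerated query:
# WHERE start_date <= '2024-12-31' AND (end_date IS NULL OR end_date >= '2024-10-01')
fixed_selects = And(start_date <= Q4_LAST,
                    Or(end_is_null, end_date >= Q4_FIRST))

solver = Solver()
solver.add(fixed_selects, Not(active_in_q4))
print(solver.check())   # unsat: no selected row can violate the contract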
No human review was required. The error was caught before execution, using domain semantics rather than heuristics.
From examples to architecture
This small example illustrates a general pattern. Agentic systems should not ask whether an output "seems correct." They should ask whether it satisfies an explicit semantic contract; that is when verification becomes the gatekeeper of autonomy.
At scale, this leads to a different system architecture. Agents generate candidates, not final answers; verifiers translate those candidates into semantic forms and check obligations derived from contracts. Failures produce structured counterexamples that guide regeneration. Only artifacts that pass verification advance to execution or human review.
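A minimal sketch of that control loop is below; generate, verify, and escalate are hypothetical hooks standing in for the model call, the formal verifier, and the human hand-off.

def complete_task(intent, contract, generate, verify, escalate, max_attempts=3):
    """Generate-verify-repair loop: only artifacts that discharge the contract's
    obligations advance; everything else is retried with counterexample feedback
    or escalated to a human."""
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(intent, feedback)        # the agent proposes; it never decides
        verified, counterexamples = verify(candidate, contract)
        if verified:
            return candidate                          # verified path: autonomous execution
        feedback = counterexamples                    # structured failure guides regeneration
    return escalate(intent, feedback)                 # unverifiable: route to human oversight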
Crucially, autonomy is no longer a binary decision. As more constraints become hard-verifiable, more of the workflow can be automated safely. Where constraints are only proxy-verifiable, or can be guaranteed only within an envelope of risk, autonomy is correspondingly limited. And where verification is weak or impossible, the system degrades gracefully to human oversight.
Why this is unavoidable for enterprise AI
There is a simple scaling law at work. Human review costs grow linearly with output volume. Verification costs, once formalized, grow sublinearly. Enterprises that wish to deploy agentic AI beyond toy use cases must shift correctness from people to machines.
Formal verification is how other safety-critical and correctness-critical systems solved this problem. AI changes the economics by making formalization cheaper and more accessible. Together, they enable a new class of proof-carrying tasks, where outputs are accompanied by machine-checkable evidence of correctness.
The future of enterprise agentic AI is not defined by larger models alone. It is defined by systems that know when to stop, when to retry, and when an answer is truly done.
References
[1] You, R., Cai, H., Zhang, C., Xu, Q., Liu, M., Yu, T., Li, Y., & Li, W. (2026). Agent-as-a-Judge. https://arxiv.org/pdf/2601.05111
[2] Agentic AI Automation of Enterprise: Verification before Generation. Emergence AI blog. https://www.emergence.ai/blog/agentic-ai-automation-of-enterprise-verification-before-generation
[3] Gadgil, S. Integrating AI with Lean. Automating Mathematics (blog). https://siddhartha-gadgil.github.io/automating-mathematics/posts/integrating-ai-lean/
[4] 2024 Wilson Research Group Functional Verification Study: https://resources.sw.siemens.com/en-US/white-paper-2024-wilson-research-group-ic-asic-functional-verification-trend-report/
[5] Kern, C., & Greenstreet, M. R. (1999). Formal verification in hardware design: a survey. ACM Transactions on Design Automation of Electronic Systems. https://dl.acm.org/doi/10.1145/307988.307989
[6] Bryant, R.E., Kukula, J.H. (2003). Formal Methods for Functional Verification. In: Kuehlmann, A. (eds) The Best of ICCAD. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0292-0_1
[7] Armstrong, R. C., et al. (2014). Survey of Existing Tools for Formal Verification. Sandia National Laboratories. https://doi.org/10.2172/1166644
[8] Biere, A., Cimatti, A., Clarke, E., Zhu, Y. (1999). Symbolic Model Checking without BDDs. In: Cleaveland, W.R. (eds) Tools and Algorithms for the Construction and Analysis of Systems. TACAS 1999. Lecture Notes in Computer Science, vol 1579. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49059-0_14
[9] Leroy, X. (2009). Formal Verification of a Realistic Compiler. Communications of the ACM, 52(7), 107-115.
[10] Klein, G., Elphinstone, K., Heiser, G., Andronick, J., Cock, D., Derrin, P., Elkaduwe, D., et al. (2009). seL4: Formal Verification of an OS Kernel. Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP), pp. 207-220.