30 May 2026 Research Deep-Dive

When Your AI Agent's Parts Are Fine but the Whole Is Broken

A new paper formalises how multi-component LLM agents produce logically incoherent outputs even when each component works correctly — and gives you a runtime metric to detect it.

Most production AI systems are not single models. They are pipelines — multiple components chained together, each handling a piece of the workflow. A classifier feeds a summariser, which feeds a decision-maker, which feeds a reporter. Each component is tested individually. Each passes its benchmarks. And yet the composed system can produce outputs that are logically impossible.

A paper from Anany Kotawala, appearing at three ICML 2026 workshops, formalises this failure mode and gives practitioners tools to detect and fix it. If you are building multi-component agent systems — and if you are running enterprise AI, you almost certainly are — this is worth understanding.

The Problem: Composition Breaks Coherence

The paper's central observation is deceptively simple. Each LLM component in a multi-agent system produces probabilistic claims about its part of the problem. When those claims are assembled, the composition can violate basic probability axioms — even though every individual component is internally consistent.

The author calls this "locally coherent, globally incoherent." Think of it as the AI equivalent of three witnesses to an accident. Each witness is honest and internally consistent about what they saw. But if you combine their statements without accounting for overlaps and dependencies, the composite story can be logically impossible.

This is not only a theoretical concern. Consider a clinical trial agent that chains together a medical coder, a safety assessor, and a reporting module: each component might be individually accurate, yet the composed output could assign contradictory severity levels to the same adverse event. In a financial compliance pipeline, a risk scorer and a policy checker might each produce valid outputs that, when combined, violate regulatory thresholds.
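To make "locally coherent, globally incoherent" concrete, here is a toy sketch in Python. The numbers, the component names (`coder`, `assessor`), and the domain rule "a severe event is always serious" are all assumptions made up for illustration; none of them come from the paper.

```python
# Hypothetical outputs from two components describing the same adverse event.
# Each distribution sums to 1, so each component is locally coherent.
coder = {"mild": 0.1, "moderate": 0.2, "severe": 0.7}   # P(severity)
assessor = {"serious": 0.3, "not_serious": 0.7}         # P(seriousness)


def is_distribution(p, tol=1e-9):
    """True if p is non-negative and sums to 1 within a tolerance."""
    return all(v >= 0 for v in p.values()) and abs(sum(p.values()) - 1.0) <= tol


def implication_gap(p_antecedent, p_consequent):
    """If A implies B, coherence requires P(A) <= P(B).

    Returns the amount by which that requirement is violated (0 if satisfied).
    """
    return max(0.0, p_antecedent - p_consequent)


locally_coherent = is_distribution(coder) and is_distribution(assessor)

# Assumed domain rule: "severe" implies "serious".
gap = implication_gap(coder["severe"], assessor["serious"])

print("each component is a valid distribution:", locally_coherent)
print("joint violation of P(severe) <= P(serious):", round(gap, 3))  # about 0.4
```

Neither component is wrong on its own terms. The violation only appears once the logical link between their claims is taken into account, which is exactly the part that component-level tests never see.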

What the Paper Provides

A runtime metric (eps*) — the compositional residual. This measures how far the composed output is from being jointly coherent. Crucially, it is computable from the system's own outputs at runtime — you do not need ground truth labels or human evaluation.
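The paper's exact definition of eps* is not reproduced in this article, so the following is only a stand-in to show the general shape of a label-free residual. It assumes the simplest possible setting: several components each report the probability of one outcome from a mutually exclusive, exhaustive set, and the "residual" is the Euclidean distance from those reported numbers to the nearest valid probability vector. The function names `project_to_simplex` and `toy_residual` are hypothetical.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) = 1}.

    Uses the standard sort-and-threshold method.
    """
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def toy_residual(claims):
    """Distance from the composed claims to the nearest coherent vector.

    Illustrative stand-in only; NOT the paper's definition of eps*.
    """
    nearest = project_to_simplex(claims)
    return math.dist(claims, nearest)


# Three hypothetical components, each scoring one of three exclusive outcomes.
incoherent_claims = [0.5, 0.4, 0.3]   # sums to 1.2
coherent_claims = [0.2, 0.3, 0.5]     # sums to 1.0

print(toy_residual(incoherent_claims))
print(toy_residual(coherent_claims))
```

The point of the sketch is the property highlighted above: the check uses only the system's own outputs and the known relationships between them, with no ground truth labels.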

A product-structure dichotomy — a theoretical result that tells you when local coherence is sufficient (and when it is not). If the components have a certain independence structure, composition preserves coherence. If they do not, you need the full framework.

A deterministic repair method — the hierarchical Boyle-Dykstra projection takes an incoherent composition and projects it onto the nearest coherent point. This is not a heuristic; it is a deterministic algorithm with convergence guarantees.
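The article does not describe the hierarchical variant, so the sketch below shows only the plain Dykstra alternating-projection scheme that the method's name refers to, under simplifying assumptions. It treats coherence as the intersection of two convex sets (entries sum to 1, and every entry lies in [0, 1]) and finds the nearest point in that intersection. The function names and the choice of constraint sets are illustrative assumptions, not the paper's formulation.

```python
def project_sum_to_one(v):
    """Projection onto the hyperplane {x : sum(x) = 1}."""
    shift = (sum(v) - 1.0) / len(v)
    return [x - shift for x in v]


def project_box(v):
    """Projection onto the box [0, 1]^n."""
    return [min(1.0, max(0.0, x)) for x in v]


def dykstra(point, projections, iterations=200):
    """Dykstra's algorithm: nearest point in an intersection of convex sets.

    Unlike naive alternating projection, it carries one correction vector per
    set, which is what makes the limit the true nearest point.
    """
    x = list(point)
    corrections = [[0.0] * len(x) for _ in projections]
    for _ in range(iterations):
        for i, project in enumerate(projections):
            z = [a + c for a, c in zip(x, corrections[i])]
            x = project(z)
            corrections[i] = [a - b for a, b in zip(z, x)]
    return x


# Hypothetical incoherent composition: sums to 1.4 and has a negative entry.
raw = [0.9, 0.6, -0.1]
repaired = dykstra(raw, [project_sum_to_one, project_box])
print([round(x, 4) for x in repaired])  # approaches [0.65, 0.35, 0.0]
```

Because every step is a fixed projection, the same input always yields the same repaired output, which is the practical meaning of "deterministic" here.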

A sequential monitoring framework — an anytime-valid e-process that detects coherence drift in production. This gives you early warning when your multi-component system starts producing incoherent outputs, without waiting for downstream failures.
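As a rough illustration of what anytime-valid monitoring looks like, the sketch below uses a generic betting-style e-process, which may differ from the construction in the paper. It assumes each residual has been scaled to [0, 1] and tests the null hypothesis that the expected residual stays at or below a tolerated `baseline`. The parameter names and default values (`baseline`, `bet`, `alpha`) are assumptions chosen for the example.

```python
import math


def first_alarm(residuals, baseline=0.05, bet=2.0, alpha=0.01):
    """Return the 1-based step at which the e-process reaches 1/alpha, or None.

    The e-value after t steps is the product of (1 + bet * (r_i - baseline)).
    If the conditional mean residual never exceeds `baseline`, this product is
    a non-negative supermartingale, so by Ville's inequality the chance that
    it ever reaches 1/alpha is at most alpha, however long you monitor.
    """
    if not 0.0 < baseline < 1.0:
        raise ValueError("baseline must lie strictly between 0 and 1")
    if not 0.0 <= bet < 1.0 / baseline:
        raise ValueError("bet must satisfy 0 <= bet < 1 / baseline")

    log_e = 0.0                          # track log e-value to avoid underflow
    threshold = math.log(1.0 / alpha)
    for step, r in enumerate(residuals, start=1):
        if not 0.0 <= r <= 1.0:
            raise ValueError("residuals must be scaled to [0, 1]")
        log_e += math.log1p(bet * (r - baseline))
        if log_e >= threshold:
            return step
    return None


# Hypothetical stream: 50 healthy readings, then the composition drifts.
healthy = [0.02] * 50
drifted = [0.60] * 50

print(first_alarm(healthy))            # no alarm while residuals stay low
print(first_alarm(healthy + drifted))  # alarm shortly after the drift begins
```

The design choice that matters is the stopping rule: because the guarantee holds at every step simultaneously, you can check the monitor after each request without inflating the false-alarm rate.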

The Empirical Picture

The paper evaluates 1,876 ensemble cliques on a four-LLM panel (frontier and mid-tier models). The results are sobering:

Finding	Detail
Incoherent cliques	33–94% of compositions have non-zero residual
Regret from incoherence	+0.115 nats per bet under proportional allocation
Predictability	Rayleigh-quotient prediction within 7% on 3/4 relation classes

The paper also tests three intuitive mitigations that practitioners commonly reach for:

Retrieval — giving components access to each other's context. Result: fails or regresses.
Partition-aware prompting — explicitly telling components about the partition structure. Result: fails or regresses.
Aggregator-LLM — adding a final LLM to reconcile component outputs. Result: fails or regresses.

None of these work. The failure is structural, not informational. You need the mathematical framework to detect and repair it.

Why This Matters for Enterprise AI

Three implications for enterprise leaders

Component-level testing is not enough. Runtime coherence monitoring is essential. And common "more context" fixes may be amplifying the problem.

First, testing components in isolation is necessary but not sufficient. Every enterprise AI pipeline tests its individual components. Almost none test the composition for coherence. This paper shows that the gap between component-level quality and system-level coherence can be large — and predictable.

Second, runtime monitoring matters more than pre-deployment testing. The anytime-valid e-process gives you a way to detect coherence drift in production without ground truth labels. For regulated industries where you cannot wait for a downstream failure to discover a problem, this is a significant practical tool.

Third, common architectural patterns may be making things worse. The finding that retrieval, partition-aware prompting, and aggregator-LLMs all fail or regress is a useful caution. If your current approach to multi-component coordination is "give each component more context" or "add a reconciler," this paper suggests you may be amplifying the problem rather than fixing it.

What This Means Practically

Measure compositional coherence. The eps* metric is computable from your system's own outputs. Implement it as a runtime health check on your pipelines.
Monitor for drift. The anytime-valid e-process gives you early warning. Do not wait for downstream failures to discover that your composition has become incoherent.
Evaluate your component boundaries. If your components share implicit dependencies (common in clinical, financial, and compliance workflows), you are in the regime where composition can fail.
Do not trust "more context" fixes. The paper shows that retrieval, partition-aware prompting, and aggregator-LLM mitigations all fail or regress for this failure class. You need the structural approach.

Key Takeaways

The failure is structural, not informational. Multi-component LLM agents produce incoherent outputs because of how components interact — not because any individual component is wrong. More context does not fix it.
The failure is predictable and measurable. The eps* metric is computable at runtime from your system's own outputs. You do not need ground truth to know when your composition is breaking.
Component-level testing is not enough. If you are only testing individual components, you are missing a failure mode that affects 33–94% of compositions in the study.

Paper Details

Title: Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents
Author: Anany Kotawala
Status: Preprint (2026), appearing at ICML 2026 Workshops: CTB, AgenticUQ, FAGEN
Categories: cs.AI, cs.CL
ArXiv: 2605.30335
