Closing the Loop: Evaluation, Debate, and Discovery

Q: Why is evaluation the bottleneck for autonomous knowledge graphs?

Every edge inserted, relationship inferred, and hypothesis proposed can be wrong, and the only way to know is to verify — but a single LLM judge has inconsistent calibration across domains. The 2026 literature shows verification is itself becoming agentic: the evaluator must be as sophisticated as the generator.

Q: How does multi-agent debate help, and how can it backfire?

For contested edges, two agents argue opposing positions and a moderator decides, which surfaces evidence a single judge misses. But naive debate can amplify error when agents share a base model and bias — so debate needs a strict grounding constraint (cite exact node IDs), a bounded number of rounds, and a moderator confidence threshold.

Q: What is autonomous discovery over a knowledge graph?

Discovery samples concept pairs connected at two hops with no direct edge and proposes the plausible missing relationship — for example whether one curriculum concept is a prerequisite of another — constrained to the graph structure to suppress hallucinated links. Each candidate then runs through the same evaluate-debate-abstain pipeline.

July 3, 2026 · 14 min read

Vadim Nicolai

Senior Software Engineer

The most stubborn bottleneck in autonomous knowledge graphs is not retrieval accuracy or latency — it is evaluation. Every edge inserted, every relationship inferred, every hypothesis proposed can be wrong, and the only way to know is to verify. But verification is itself becoming an agentic problem, and the 2026 literature is blunt about it: the evaluator must be as sophisticated as the generator. The question is no longer whether to close the loop but how — and the answer is a layered design that combines a deterministic rule engine, an agent-as-judge, multi-agent debate for contested edges, and autonomous discovery, all gated by a hard abstain-under-uncertainty rule.

This is article #5, the final guardrail in the Autonomous Knowledge Graphs series. It closes the loop over the graph that #1 builds, #2 reasons over, #3 repairs, and #4 remembers. Every design in the series obeys the same engineering constraints. The control plane is built on LlamaIndex, with DeepSeek as the LLM client and PropertyGraphIndex for retrieval, while the autonomous loop itself is written in plain Python rather than run by a workflow or graph-orchestration engine. The data plane is a Cloudflare D1 concept graph (concepts, concept_edges, lesson_concepts), and a thin TypeScript layer applies every write. Model egress is DeepSeek-only, through one Cloudflare AI Gateway. Every write carries a grounding-first record, {confidence, reason, source, evidence}, with bi-temporal valid_at/recorded_at stamps, and every irreversible step invalidates rather than deletes. The worked example throughout is the AI-engineer curriculum concept graph, whose concepts are linked by prerequisite, builds_on, contrasts_with, part_of, related, and applies_to edges. Here the loop runs with a ≥ 0.80 commit bar on every edge and grounding-first provenance throughout.
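As a minimal sketch of the grounding-first record and the invalidate-not-delete rule, the write payload could be modelled like this. The class name `EdgeRecord`, the fields `from_concept`, `to_concept`, and `invalidated_at`, and the helper `invalidate` are illustrative assumptions, not the series' actual schema; only {confidence, reason, source, evidence}, valid_at/recorded_at, and the edge-type vocabulary come from the text.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional

# Closed edge-type vocabulary named in the article.
EDGE_TYPES = frozenset(
    {"prerequisite", "builds_on", "contrasts_with", "part_of", "related", "applies_to"}
)


@dataclass(frozen=True)
class EdgeRecord:
    """Hypothetical in-memory shape of one grounded edge write."""

    from_concept: str            # assumed field name
    to_concept: str              # assumed field name
    edge_type: str
    confidence: float
    reason: str
    source: str                  # where the claim came from, e.g. a lesson ID
    evidence: tuple[str, ...]    # concept IDs or passages backing the edge
    valid_at: datetime           # when the fact holds in the domain
    recorded_at: datetime        # when the system wrote it down
    invalidated_at: Optional[datetime] = None  # assumed soft-delete marker

    def __post_init__(self) -> None:
        if self.edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.edge_type}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if not self.evidence:
            raise ValueError("grounding-first: evidence is required")


def invalidate(record: EdgeRecord, when: Optional[datetime] = None) -> EdgeRecord:
    """Return a copy marked invalid; the original record is never deleted."""
    return replace(record, invalidated_at=when or datetime.now(timezone.utc))
```

Keeping the record immutable means an invalidation always produces a new value, so the earlier state stays available for audit.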


Why Evaluation Is the Bottleneck — and Going Agentic

Traditional pipelines rely on rule-based consistency checks or a single LLM call for triple validation, and both break under ambiguity. S-Path-RAG retrieves an answer path with an iterative LLM loop (Fu et al., 2026, arXiv:2603.23512), but something still has to decide whether the retrieved path is correct — and a single LLM judge, even grounded with graph context, calibrates inconsistently across domains. SAGE formalizes the response: Judge Agents work alongside a rule engine to verify logical compliance against dynamically-built graphs (Shi et al., 2026, arXiv:2604.09285). The rule engine handles deterministic checks (subject-object consistency, cardinality), and the Judge Agent assigns a compliance score. This design makes evaluation an explicit, separable component and gates it at the 0.80 commit bar — high enough to keep hallucinated edges out of downstream queries, at a deliberate cost to recall. (SAGE's paper evaluates on a service-agent dialogue dataset; the threshold here is a design choice, not a reported number.)

Reference Architecture: SAGE as a Python Judge + Debate Loop

SAGE supplies the core evaluator: a Judge Agent scoring candidate edges alongside a rule engine. This design implements it as a plain Python judge-and-debate pipeline, with hand-written control flow over LlamaIndex model calls rather than a workflow or graph framework. Each edge proposal (from construction, from multi-hop reasoning, or from discovery) is passed to the judge. The deterministic rule engine deliberately runs first, before any LLM call, so the cheapest checks reject the easy failures: every node must have a type and every relation must be in the closed edge-type vocabulary (prerequisite, related, part_of, builds_on, contrasts_with, applies_to). An edge that passes gets a judge score in [0, 1]. At or above the commit bar of 0.80 it is committed; below the reject bar of 0.50 it is rejected outright (the hard abstain path, so debate is never wasted on clearly wrong data); and anywhere in between, from 0.50 up to but not including 0.80, it is escalated to debate.
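The routing described above can be sketched in a few lines. This is an illustration under stated assumptions, not the article's implementation: the edge is a plain dict with hypothetical keys `source`, `target`, and `relation`, `node_types` maps concept IDs to their type, and `judge` is any callable returning a score in [0, 1] (in practice an LLM call).

```python
from enum import Enum
from typing import Callable

COMMIT_BAR = 0.80
REJECT_BAR = 0.50
EDGE_TYPES = frozenset(
    {"prerequisite", "related", "part_of", "builds_on", "contrasts_with", "applies_to"}
)


class Verdict(Enum):
    COMMIT = "commit"
    DEBATE = "debate"
    REJECT = "reject"


def passes_rules(edge: dict, node_types: dict[str, str]) -> bool:
    """Deterministic checks: typed endpoints and a relation from the closed vocabulary."""
    return (
        edge["relation"] in EDGE_TYPES
        and bool(node_types.get(edge["source"]))
        and bool(node_types.get(edge["target"]))
    )


def route(edge: dict, node_types: dict[str, str], judge: Callable[[dict], float]) -> Verdict:
    """Rule engine first, then the judge score decides commit, debate, or reject."""
    if not passes_rules(edge, node_types):
        return Verdict.REJECT  # rejected before any model call is made
    score = judge(edge)
    if score >= COMMIT_BAR:
        return Verdict.COMMIT
    if score < REJECT_BAR:
        return Verdict.REJECT
    return Verdict.DEBATE  # contested band: 0.50 <= score < 0.80
```

Because the rule check returns early, a malformed edge never costs a judge call, which is the point of running the cheapest check first.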

The Loop, Concretely: Judge, Debate, Abstain

For contested edges, the loop triggers a 2-agent-plus-1-moderator debate over a fixed 3 rounds, inspired by Contestable Multi-Agent Debate's arena-based argumentative computation with uncertainty-aware escalation (Nguyen et al., 2026, arXiv:2605.14495) and the role-switching courtroom structure of PROClaim (Chowdhury et al., 2026, arXiv:2603.28488). One agent argues for the edge (marshalling supporting evidence), one argues against (searching the graph for contradictions), and the moderator aggregates and scores. Because a well-documented failure mode of multi-agent debate is that it can amplify error when the agents share a base model and bias, the debate is bounded to those 3 rounds and the moderator must reach ≥ 0.80 confidence before accepting; otherwise the edge is quarantined for human review. Contestable Multi-Agent Debate's uncertainty-aware escalation is built precisely to contain that failure mode.

The single most important constraint is grounding-first: the moderator must cite exact concept IDs from the graph before scoring. Without it, two debating agents that share a base model will mutually reinforce a hallucination and the moderator cannot escape the shared bias — the error-cascade failure mode the debate literature warns about.
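A bounded debate with a grounding check might look like the sketch below. The names `Ruling` and `run_debate`, and the callable signatures for the two debaters and the moderator, are assumptions made for illustration; real agents would be model calls, and here they are just functions so the control flow is visible.

```python
from dataclasses import dataclass
from typing import Callable

MAX_ROUNDS = 3
ACCEPT_BAR = 0.80


@dataclass
class Ruling:
    """Hypothetical moderator output: a confidence plus the concept IDs it cites."""

    confidence: float
    cited_ids: list[str]


Debater = Callable[[dict, list[str]], str]
Moderator = Callable[[dict, list[str]], Ruling]


def run_debate(
    edge: dict,
    known_ids: set[str],
    pro: Debater,
    con: Debater,
    moderator: Moderator,
) -> str:
    """Run a fixed number of rounds, then accept only a grounded, confident ruling."""
    transcript: list[str] = []
    for _ in range(MAX_ROUNDS):  # bounded: never more than 3 rounds
        transcript.append("PRO: " + pro(edge, transcript))
        transcript.append("CON: " + con(edge, transcript))

    ruling = moderator(edge, transcript)

    # Grounding-first: at least one citation, and every cited ID must exist in the graph.
    grounded = bool(ruling.cited_ids) and set(ruling.cited_ids) <= known_ids
    if grounded and ruling.confidence >= ACCEPT_BAR:
        return "commit"
    return "quarantine"  # held for human review
```

Checking citations against the live set of concept IDs is a deterministic step, so a fabricated ID fails regardless of how confident the shared base model is.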

The loop also corrects itself after the fact: a periodic re-check re-judges up to 25 recently-recorded edges and invalidates any that now score below the 0.50 reject bar, while a coverage report surfaces the live graph's concept count, active-edge count, edge-type distribution, and orphan concepts so drift is visible between ticks.
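The re-check and the coverage report can be sketched over plain dicts as follows. The dict keys (`source`, `target`, `relation`, `recorded_at`, `invalidated_at`) and the function names are hypothetical; the limit of 25 edges and the 0.50 reject bar are the figures given above.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Callable, Optional

RECHECK_LIMIT = 25
REJECT_BAR = 0.50


def is_active(edge: dict) -> bool:
    return edge.get("invalidated_at") is None


def recheck(
    edges: list[dict],
    judge: Callable[[dict], float],
    now: Optional[datetime] = None,
) -> list[dict]:
    """Re-judge the most recently recorded active edges; invalidate, never delete."""
    now = now or datetime.now(timezone.utc)
    recent = sorted(
        (e for e in edges if is_active(e)),
        key=lambda e: e["recorded_at"],
        reverse=True,
    )[:RECHECK_LIMIT]
    dropped = []
    for edge in recent:
        if judge(edge) < REJECT_BAR:
            edge["invalidated_at"] = now  # the row stays; only its status changes
            dropped.append(edge)
    return dropped


def coverage_report(concept_ids: set[str], edges: list[dict]) -> dict:
    """Summarise the live graph so drift is visible between ticks."""
    active = [e for e in edges if is_active(e)]
    linked = {e["source"] for e in active} | {e["target"] for e in active}
    return {
        "concepts": len(concept_ids),
        "active_edges": len(active),
        "edge_types": dict(Counter(e["relation"] for e in active)),
        "orphans": sorted(concept_ids - linked),
    }
```

Running the report right after a re-check shows the effect of any invalidation immediately, for example a concept that just became an orphan.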

Autonomous Discovery: Two-Hop Sampling for New Hypotheses

Closing the loop is not only about evaluating proposed edges; it is also about discovering new ones worth evaluating. discover() samples concept pairs connected at two hops with no direct active edge — preferring high-degree, foundational concepts and dense topic clusters where a missing link is most consequential — and proposes the plausible unobserved relationship between them, constrained to the graph structure to suppress hallucinated links. The traversal borrows from GraphSearch's graph-aware recursive planner (Liu et al., 2026, arXiv:2601.08621), and the bias toward high-degree concepts is analogous to HyperGraphPro's progress-aware reward shaping, which steers exploration toward relevant graph regions (Park et al., 2026, arXiv:2601.17755). Each discovered candidate runs through the same evaluate-debate-abstain pipeline. The most ambitious version of this is a persistent research loop: AI-Supervisor runs a multi-agent gap-discovery → method → evaluation cycle over a knowledge-graph "research world model" (Long, 2026, arXiv:2603.24402) — and the honest caveat from the 2026 corpus is that such systems generate plausible links but rarely close the loop with real confirmation.
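The candidate-sampling step can be illustrated with a small, self-contained function. This is a sketch under simplifying assumptions, not the article's discover(): edges are treated as undirected pairs of concept IDs, "preferring high-degree concepts" is approximated by ranking pairs on the sum of their degrees, and `two_hop_candidates` and `limit` are hypothetical names. Proposing the relationship type for each pair is left to the model call that follows.

```python
from collections import defaultdict
from itertools import combinations


def two_hop_candidates(
    edges: list[tuple[str, str]], limit: int = 10
) -> list[tuple[str, str]]:
    """Return concept pairs that share a neighbour but have no direct edge."""
    adjacency: dict[str, set[str]] = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    neighbors = dict(adjacency)  # freeze: no new keys appear while iterating

    candidates: set[tuple[str, str]] = set()
    for shared in neighbors.values():
        # Any two neighbours of the same concept are exactly two hops apart.
        for a, b in combinations(sorted(shared), 2):
            if b not in neighbors[a]:  # skip pairs that already have an edge
                candidates.add((a, b))

    # Rank by combined degree (highest first), then alphabetically for stability.
    ranked = sorted(
        candidates,
        key=lambda p: (-(len(neighbors[p[0]]) + len(neighbors[p[1]])), p),
    )
    return ranked[:limit]  # per-batch cap keeps the volume bounded


if __name__ == "__main__":
    graph = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
    print(two_hop_candidates(graph))  # [('a', 'd'), ('b', 'd')]
```

Restricting candidates to pairs with a shared neighbour is what keeps proposals tied to the graph structure: the model is only ever asked about links the topology already makes plausible.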

Failure Modes: Error Cascades in Debate

Multi-agent debate sounds robust until it becomes an echo chamber. When both debating agents share a base model and prompt bias, they can converge on a wrong edge faster than a single judge would — citing fabricated graph context that the moderator, on the same model, cannot escape. The grounding-first constraint (cite node IDs or do not score) is the primary defense. A second failure mode is debate fatigue: running three rounds on every contested edge does not scale to real-time insertion, so discovery edges are batched for asynchronous processing while only the construction agent's per-lesson proposals are evaluated synchronously.
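The split between synchronous and batched evaluation could be sketched as a small dispatcher. The class `EvaluationDispatcher`, the `origin` labels, and the `batch_size` parameter are assumptions for illustration; `evaluate` stands in for the whole judge-and-debate pipeline, and a real deployment would drain the queue from a background job.

```python
from collections import deque
from typing import Any, Callable, Optional


class EvaluationDispatcher:
    """Evaluate construction proposals inline; queue discovery proposals for later."""

    def __init__(self, evaluate: Callable[[Any], Any], batch_size: int = 20) -> None:
        self._evaluate = evaluate
        self._batch_size = batch_size
        self._pending: deque = deque()

    def submit(self, edge: Any, origin: str) -> Optional[Any]:
        if origin == "construction":
            return self._evaluate(edge)  # synchronous, on the per-lesson path
        self._pending.append(edge)       # discovery: deferred off the real-time path
        return None

    def drain_batch(self) -> list:
        """Evaluate up to batch_size queued edges, oldest first."""
        results = []
        while self._pending and len(results) < self._batch_size:
            results.append(self._evaluate(self._pending.popleft()))
        return results
```

A caller would invoke `submit` as proposals arrive and call `drain_batch` on a schedule, so three-round debates on discovery edges never block lesson ingestion.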

Numbered Limitations

1. Judge agents are not production-stable. Judge scores on the same edge shift with the prompt template; the 0.80 bar buys precision at the cost of recall, so some valid edges are rejected. The 2026 corpus is explicit that agent judges are not yet reliable enough for unattended production.
2. Naive debate can amplify error. Without strict moderator grounding and a confidence threshold, debate can converge on wrong conclusions faster than a single judge. This is a well-documented multi-agent-debate failure mode that Contestable Multi-Agent Debate's uncertainty-aware escalation and PROClaim's role-switching are designed to mitigate.
3. Autonomous discovery is unbounded without caps. Two-hop candidate sampling over a large graph is combinatorial; the two-hop horizon and per-batch caps are deliberate limits that will miss some real hypotheses (anything more than two hops apart is never proposed).
4. Evaluation latency is non-deterministic. Agent judging plus a debate step is too slow for synchronous graph operations at scale, which is why discovery is asynchronous.
5. Abstain creates orphan data. Rejected and abstained edges are logged but not automatically reassessed; reclaiming that cold store of potential insights needs a human or a future re-evaluation pass.
6. No closed self-improving eval loop yet. The corpus shows multi-agent judges and self-generated benchmarks emerging, but none is a fused, self-improving evaluation loop. This design layers complementary checks rather than learning the evaluator.

Decision Table: Single Judge vs Debate vs Human Review

Edge type	Recommended evaluator	Why
Low-stakes common relations (`related`, `applies_to`)	Single agent-as-judge at ≥ 0.80	cheapest path; debate adds no precision on easy cases
Contested edges between core concepts (is it `prerequisite` or merely `related`?)	2-agent + moderator debate	surfaces contradictions a single judge misses; moderator must clear 0.80
Structural claims that reshape the curriculum (a new `prerequisite` chain)	Human review after agent pre-screen	the agent narrows; a person decides the irreversible call
Autonomous-discovery candidates	Judge then debate, asynchronous	unbounded volume; batch off the real-time path

Run a single judge by default, reserve debate for genuinely contested or high-stakes edges, and always honour the abstain rule.

Conclusion

Closing the loop in 2026 is not a single technique but a layering of complementary verification: deterministic rules, agentic judges, adversarial debate, and human override — each with its own cost-latency-precision trade-off, and no single paper offering a turnkey solution. The practical implication for any team building autonomous knowledge graphs is that evaluation is a first-class loop with its own design constraints and decision tables, not an afterthought solved by one LLM call. Start simple — a single judge with a hard threshold — add debate only when the false-positive cost justifies it, and always honour the abstain rule. The graph must be trustworthy first, complete second.

Frequently Asked Questions

Why is evaluation the bottleneck for autonomous knowledge graphs? Every edge, inference, and hypothesis can be wrong, and a single LLM judge calibrates inconsistently across domains. The 2026 literature shows verification is becoming agentic: the evaluator must be as sophisticated as the generator.

What is agent-as-judge evaluation? A Judge Agent scores a candidate edge against the supporting sub-graph, usually alongside a rule engine that checks schema and cardinality first. Edges clearing 0.80 are committed; the rest go to debate or are rejected. SAGE formalizes this judge-plus-rule-engine pattern.

How does multi-agent debate help, and how can it backfire? Two agents argue opposing positions and a moderator decides, surfacing evidence a single judge misses. But naive debate can amplify error when agents share a base model and bias — so it needs a strict grounding constraint (cite node IDs), bounded rounds, and a moderator confidence threshold.

What is autonomous discovery over a knowledge graph? Discovery samples concept pairs that sit two hops apart with no direct edge and proposes the plausible missing relationship — for example whether one curriculum concept is a prerequisite of another — constrained to the graph structure to suppress hallucinated links. Each candidate then runs through the same evaluate-debate-abstain pipeline.

Why is abstain-under-uncertainty the default? It is better to miss a valid edge than to insert a hallucinated one that pollutes every downstream query. When the confidence threshold is not reached, the edge is logged for human review rather than committed.

Autonomous Knowledge Graphs — the series

1. Autonomous Knowledge Graph Construction: Graphs That Build Themselves (autonomy: high)
2. Reasoning Over the Graph: From GraphRAG to Planning Agents (autonomy: high)
3. Self-Healing Knowledge Graphs: Graphs That Fix Themselves (guardrail)
4. The Graph as Agent Memory (autonomy: medium)
5. Closing the Loop: Evaluation, Debate, and Discovery (this article, guardrail)

The worked example throughout is the AI-engineer curriculum concept graph — built, reasoned over, repaired, remembered, and now evaluated. A sibling series, The Autonomous Sales Fleet, climbs the same autonomy ladder in a different domain.

References

Ling Shi et al. SAGE: A Service Agent Graph-guided Evaluation Benchmark. 2026. arXiv:2604.09285. https://arxiv.org/abs/2604.09285
Truong Thanh Hung Nguyen et al. Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification. 2026. arXiv:2605.14495. https://arxiv.org/abs/2605.14495
Masnun Nuha Chowdhury et al. Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification (PROClaim). 2026. arXiv:2603.28488. https://arxiv.org/abs/2603.28488
Rong Fu et al. S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering. 2026. arXiv:2603.23512. https://arxiv.org/abs/2603.23512
Jiajin Liu et al. GraphSearch: Agentic Search-Augmented Reasoning for Zero-Shot Graph Learning. 2026. arXiv:2601.08621. https://arxiv.org/abs/2601.08621
Jinyoung Park et al. HyperGraphPro: Progress-Aware Reinforcement Learning for Structure-Guided Hypergraph RAG. 2026. arXiv:2601.17755. https://arxiv.org/abs/2601.17755
Yunbo Long. AI-Supervisor: Autonomous AI Research Supervision via a Persistent Research World Model. 2026. arXiv:2603.24402. https://arxiv.org/abs/2603.24402
