Reasoning Over the Graph: From GraphRAG to Planning Agents
Agentic GraphRAG treats the knowledge graph not as a static index to retrieve from once, but as a state space to reason over one node at a time. GraphRAG proved that structured knowledge could be retrieved at generation time — but a one-shot subgraph either drowns the LLM in irrelevant triples or misses the one critical edge. A question like "what must a learner master before agent orchestration, and which of those concepts does RAG build on?" is a sequence of decisions: which edge to follow, which concept to expand, when to backtrack. That is a planning problem, and the 2026 research corpus has converged on agentic traversal to solve it.
This is article #2 in the Autonomous Knowledge Graphs series. It reasons over the curriculum concept graph that article #1 builds, and obeys the same engineering constraints:
- A control plane built on LlamaIndex (DeepSeek as the LLM client, its PropertyGraphIndex for retrieval), with the autonomous loop itself written in plain Python rather than run by a workflow or graph-orchestration engine.
- A Cloudflare D1 concept-graph data plane (concepts, concept_edges, lesson_concepts), with a thin TypeScript layer applying every write.
- DeepSeek-only model egress through one Cloudflare AI Gateway.
- A grounding-first record on every write: {confidence, reason, source, evidence} with bi-temporal valid_at/recorded_at stamps.
- Invalidate-not-delete at every irreversible step.
The worked example is an explainable answer over the curriculum graph: the agent returns not just an answer but the supporting concept sub-graph as evidence.
The One-Shot Ceiling
A vanilla GraphRAG pipeline retrieves a subgraph in a single pass, often a community summary of a few dozen nodes. For "what foundational concepts must a learner master before agent orchestration?" that subgraph may contain the right prerequisite chain, but tangled with irrelevant edges of type related, sibling part_of groupings, and contrasts_with links. The LLM must extract the answer from a noisy context. GraphRAG has no decision loop: it retrieves once and hopes the model resolves the chain. Multi-hop reasoning needs sequential decisions, and that is exactly what the 2026 traversal literature supplies.
Agentic Traversal: The 2026 Foundation
The corpus provides the blueprint. GraphSearch proposes a graph-aware query planner that recursively expands subgraphs, treating each expansion as an action conditioned on the current query state (Liu et al., 2026, arXiv:2601.08621). S-Path-RAG combines semantic shortest-path traversal with an iterative LLM loop to reduce the search space to the most promising paths (Fu et al., 2026, arXiv:2603.23512). Both argue traversal should be adaptive: the next hop depends on what the previous hop found.
The reinforcement-learning thread pushes further. GraphDancer uses two-stage curriculum post-training — single-hop navigation first, then multi-hop reasoning in a Think-Act-Observe loop (Bai et al., 2026, arXiv:2602.02518). GraphWalker introduces a synthetic-trajectory curriculum that teaches the agent to reflect on mistakes and recover from invalid paths (Xu et al., 2026, arXiv:2603.28533). AgentGL applies graph-conditioned curriculum RL that gradually widens the exploration horizon (Sun et al., 2026, arXiv:2604.05846); GraphScout gives the LLM an intrinsic exploration bonus so it avoids dead ends (Ying et al., 2026, arXiv:2603.01410); and TKG-Thinker adds a temporal dimension, learning to prefer recent edges over stale ones (Jiang et al., 2026, arXiv:2602.05818). These are method papers; none reports a benchmark in the grounding corpus used here, so the design parameters below are stated as design choices, not borrowed numbers.
Reference Architecture: A Planning Loop over the Concept Graph
HyperGraphPro contributes the load-bearing idea: progress-aware reward reshaping that credits the agent for stepping toward the answer at each hop, not only at the end (Park et al., 2026, arXiv:2601.17755). This design maps that progress logic into a plain Python traversal loop without RL training: the reward becomes an explicit per-round confidence check. Traversal is deterministic over a read-only snapshot of the D1 concept graph; only the per-round answer synthesis calls DeepSeek (through LlamaIndex), so with that one call stubbed the whole loop can run offline under unit tests.
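To make the read-only snapshot concrete, here is a minimal sketch of loading active edges into an in-memory adjacency map. It assumes the snapshot has been exported to a local SQLite database (D1 is SQLite-based) and that concept_edges has source_id, target_id, relation, and invalid_at columns. Those column names, and the rule that a NULL invalid_at marks an active edge, are illustrative assumptions; the article names the table but not its schema.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema: only the table name concept_edges comes from the article.
# The columns source_id, target_id, relation and invalid_at are assumptions.
SCHEMA = """
CREATE TABLE concept_edges (
    source_id  TEXT NOT NULL,
    target_id  TEXT NOT NULL,
    relation   TEXT NOT NULL,
    invalid_at TEXT            -- assumed: NULL means the edge is still active
)
"""


def load_active_edges(conn: sqlite3.Connection) -> dict[str, list[tuple[str, str, str]]]:
    """Map each concept id to (neighbour_id, relation, direction) tuples.

    Invalidated edges are skipped rather than deleted, and the ORDER BY keeps
    the result deterministic across runs.
    """
    adjacency: dict[str, list[tuple[str, str, str]]] = defaultdict(list)
    rows = conn.execute(
        "SELECT source_id, target_id, relation FROM concept_edges "
        "WHERE invalid_at IS NULL "
        "ORDER BY source_id, target_id, relation"
    )
    for source_id, target_id, relation in rows:
        adjacency[source_id].append((target_id, relation, "out"))
        adjacency[target_id].append((source_id, relation, "in"))
    return dict(adjacency)


def demo_snapshot() -> sqlite3.Connection:
    """Build a tiny in-memory snapshot with one invalidated edge (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO concept_edges VALUES (?, ?, ?, ?)",
        [
            ("function_calling", "agent_orchestration", "prerequisite", None),
            ("prompting_fundamentals", "function_calling", "builds_on", None),
            ("agent_memory", "agent_orchestration", "prerequisite", "2026-01-01"),
        ],
    )
    return conn
```

Because the map is built once from a fixed query order, every later traversal step over it is repeatable, which is what makes the loop testable without a live database.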
The state tracks the current question, the visited concept IDs, the frontier set (capped at 12 concepts), the accumulated evidence sub-graph, and a scalar confidence in [0, 1]. The agent seeds from the concepts named in the question — up to 3, most-specific name first — then each round:
- Expand: collect the active concept_edges neighbours of the frontier (both directions), recording the edges used, until the 12-concept cap is hit. Already-visited concepts are not re-added, so the frontier cannot loop.
- Synthesize: a single DeepSeek call answers the question using only the retrieved sub-graph and returns an answer plus a [0, 1] confidence, the explicit per-round progress signal.
- Verify: at ≥ 0.80 it emits the answer plus the supporting sub-graph; otherwise the query is focused on the new evidence and control loops, up to 4 rounds, then abstains.
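The three steps above can be sketched as a plain Python loop. This is an illustration under stated assumptions, not the series' actual implementation: the names TraversalState, expand, and run_query are hypothetical, edges are modelled as (source, relation, target) tuples, and the synthesis and query-focus steps are injected as callables so the loop runs without a model. The 12-concept cap, 4-round budget, and 0.80 bar come from the text; stopping early when the frontier is empty is an added assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

FRONTIER_CAP = 12      # from the text: frontier capped at 12 concepts
MAX_ROUNDS = 4         # from the text: up to 4 rounds, then abstain
CONFIDENCE_BAR = 0.80  # from the text: emit the answer at >= 0.80

Edge = tuple[str, str, str]  # (source_id, relation, target_id), an assumed shape


@dataclass
class TraversalState:
    question: str
    visited: set[str] = field(default_factory=set)
    frontier: list[str] = field(default_factory=list)
    evidence: list[Edge] = field(default_factory=list)
    confidence: float = 0.0


def expand(state: TraversalState, edges: list[Edge]) -> None:
    """Follow edges touching the frontier in either direction, up to the cap."""
    current = set(state.frontier)
    next_frontier: list[str] = []
    for source, relation, target in edges:
        if len(next_frontier) >= FRONTIER_CAP:
            break
        if source in current:
            neighbour = target
        elif target in current:
            neighbour = source
        else:
            continue
        edge = (source, relation, target)
        if edge not in state.evidence:
            state.evidence.append(edge)       # record the edge that was used
        if neighbour not in state.visited:    # visited concepts are never re-added
            state.visited.add(neighbour)
            next_frontier.append(neighbour)
    state.frontier = next_frontier


def run_query(
    question: str,
    seeds: list[str],
    edges: list[Edge],
    synthesize: Callable[[str, list[Edge]], tuple[str, float]],
    focus: Callable[[str, list[Edge]], str] = lambda q, ev: q,
) -> tuple[Optional[str], list[Edge]]:
    """Return (answer, evidence), or (None, evidence) when the agent abstains."""
    state = TraversalState(question, visited=set(seeds), frontier=list(seeds))
    for _ in range(MAX_ROUNDS):
        expand(state, edges)
        answer, state.confidence = synthesize(state.question, state.evidence)
        if state.confidence >= CONFIDENCE_BAR:
            return answer, state.evidence
        if not state.frontier:
            break  # assumption: nothing new to explore, so stop spending calls
        state.question = focus(state.question, state.evidence)
    return None, state.evidence
```

Passing the synthesizer in as a callable is a deliberate choice in this sketch: a test can supply a stub that returns a fixed confidence, while production code would pass a function that wraps the DeepSeek call.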
Total LLM invocations per query run roughly 1–4 — one synthesis call per round, a deliberate multiplier over a single-shot GraphRAG call, traded for reliability on multi-hop queries. The exact token cost depends on the graph and query mix and is not asserted here.
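Each of those invocations is one synthesis call: a prompt built from the retrieved edges only, and a reply carrying an answer plus a confidence. The sketch below shows one way that contract might look. The prompt wording, the JSON reply shape, and the names build_prompt, parse_reply, and synthesize are assumptions for illustration; the model call is passed in as a plain complete(prompt) callable, which in the stack described here would wrap the DeepSeek client.

```python
import json
from typing import Callable

Edge = tuple[str, str, str]  # (source_id, relation, target_id), an assumed shape


def build_prompt(question: str, evidence: list[Edge]) -> str:
    """Render the sub-graph as triples and ask for a JSON reply (assumed format)."""
    triples = "\n".join(f"- {s} --{r}--> {t}" for s, r, t in evidence)
    return (
        "Answer the question using ONLY the edges listed below.\n"
        'Reply with JSON: {"answer": "<text>", "confidence": <number from 0 to 1>}.\n\n'
        f"Edges:\n{triples}\n\n"
        f"Question: {question}\n"
    )


def parse_reply(raw: str) -> tuple[str, float]:
    """Parse the reply; a malformed reply counts as zero confidence."""
    try:
        payload = json.loads(raw)
        answer = str(payload["answer"])
        confidence = float(payload["confidence"])
    except (ValueError, KeyError, TypeError):
        return "", 0.0
    # Clamp so a misbehaving model cannot report confidence outside [0, 1].
    return answer, min(1.0, max(0.0, confidence))


def synthesize(
    question: str,
    evidence: list[Edge],
    complete: Callable[[str], str],
) -> tuple[str, float]:
    """One model call per round: build the prompt, call the model, parse the reply."""
    return parse_reply(complete(build_prompt(question, evidence)))
```

Treating an unparseable reply as zero confidence keeps the failure mode on the safe side: the loop continues or abstains instead of emitting an answer it cannot score.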
The Loop in Detail
Round 1 seeds from the concepts named in the question (up to 3, most-specific first), expands their neighbourhood, and focuses the query for the next hop. Round 2 expands the new frontier, and if the evidence sub-graph already contains a clear path the agent can exit early at ≥ 0.80. Rounds 3–4 are reached only for ambiguous or many-hop questions, each focus narrowing the relation types followed. The output is an answer — "before agent orchestration a learner needs function calling and agent memory, both of which build on prompting fundamentals" — together with the prerequisite/builds_on sub-graph that justifies it.
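The seeding step in round 1 can be sketched as follows. The article does not define "most-specific", so this sketch uses name length as a stand-in (longer names rank first), which is an assumption, as are the function name pick_seeds and the whole-word, case-insensitive matching rule.

```python
import re

MAX_SEEDS = 3  # from the text: seed from up to 3 concepts


def pick_seeds(question: str, concepts: dict[str, str]) -> list[str]:
    """Return up to MAX_SEEDS concept ids whose names appear in the question.

    `concepts` maps concept id -> display name. Specificity is approximated by
    name length (an assumption); ties are broken alphabetically so the result
    is deterministic.
    """
    lowered = question.lower()
    matches = []
    for concept_id, name in concepts.items():
        # Whole-word match, so "RAG" does not match inside "storage".
        pattern = r"\b" + re.escape(name.lower()) + r"\b"
        if re.search(pattern, lowered):
            matches.append((concept_id, name))
    matches.sort(key=lambda item: (-len(item[1]), item[1].lower()))
    return [concept_id for concept_id, _ in matches[:MAX_SEEDS]]
```

For the question from the introduction and a concept table containing "agent orchestration" and "RAG", this returns the two matching ids with the longer name first; concepts not named in the question never enter the frontier.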
Where RL-Trained Graph Agents Take This
This prompt-based loop approximates some RL benefits through the fixed 12-concept frontier cap and the 0.80 abstain bar. A trained agent would instead learn to adjust the frontier and threshold dynamically by graph density (AgentGL), recover from invalid paths (GraphWalker), and decay confidence by edge age (TKG-Thinker). The next logical step is to fine-tune on traversal trajectories with HyperGraphPro-style reward reshaping — which needs a training pipeline most teams do not have, so the prompt-based loop is the practical default until then.
Failure Modes and Mitigations
- Over-traversal. A generic seed can flood the frontier with noise; the mitigation is seeding only from concepts whose name appears in the question, passing the top 3 most-specific matches to traversal.
- Loops. The agent could otherwise reach the same concept again via a different path; the state tracks visited IDs, and each expansion skips concepts that have already been explored.
- Cost growth. Running all 4 rounds means up to 4 LLM calls per query (one synthesis each); the early exit at ≥ 0.80, typically by round 2, is the primary cost control. Concrete dollar figures depend on token usage and are not claimed here.
- Abstention bias. The 0.80 bar is conservative; on genuinely ambiguous questions confidence may never reach it, and the agent abstains. That is the right trade for explanation-critical answers over the curriculum graph, less so for casual lookups.
Numbered Limitations
- Prompt dependency. Query-focusing and confidence-scoring quality are determined by the prompt, which must be tested on a held-out evaluation set before it is trusted.
- Graph quality. The agent optimises over the given graph and cannot infer missing edges; S-Path-RAG-style semantic shortest paths (Fu et al., 2026) are one way to bridge gaps, not implemented here.
- Scalability ceiling. At most 48 concepts are examined per query (4 rounds × 12), a tiny fraction of a large graph; very long prerequisite chains exhaust the round budget.
- No persistent memory. Unlike the bi-temporal graph memory of Engram (Wang, 2026, arXiv:2606.09900), this design forgets between queries and cannot reuse prior exploration; persistent graph memory is the subject of article #4.
- Scalar evaluation. The 0.80 bar is one number; HyperGraphPro's step-level signal is richer but cannot be applied retroactively to a frozen model.
Decision Table: GraphRAG vs Agentic Traversal vs Hybrid
| Scenario | Recommended approach | Why |
|---|---|---|
| High-throughput single-hop fact lookup | GraphRAG community summary | one retrieval is enough; the multi-round multiplier is wasted |
| Multi-hop but predictable curriculum questions | Planning loop (this design) | planning without training; abstention prevents hallucination |
| Frequently-changing graph, accuracy ceiling matters | RL-trained traversal (AgentGL / GraphDancer) | a learned policy adapts the frontier dynamically |
| Explanation-critical learner guidance | Planning loop with the 0.80 abstain gate | abstain-under-uncertainty is safer than a confident guess |
For the curriculum graph's multi-hop questions — predictable, explanation-critical — the planning loop is the right default.
Closing
GraphRAG was the necessary foundation: it proved structured knowledge could be retrieved at generation time. The 2026 corpus — GraphSearch, S-Path-RAG, GraphDancer, GraphWalker, AgentGL, GraphScout, TKG-Thinker, HyperGraphPro — pushes past it by treating traversal as a sequential decision process. The design here is the deployable compromise: it brings planning to GraphRAG without RL training, and it abstains rather than hallucinate when the path is unclear. The graph stops being an index and becomes something the agent reasons over, one node at a time.
Frequently Asked Questions
What is agentic GraphRAG? It replaces one-shot subgraph retrieval with a planning agent that traverses the graph step by step — deciding which concept to expand next, focusing the query each round, and synthesizing an answer from the accumulated sub-graph — until it can answer or must abstain.
Why is multi-hop GraphRAG a planning problem, not a retrieval problem? A multi-hop question requires sequential choices: which edge to follow, which concept to expand, when to backtrack. A one-shot pass cannot make those choices; a planning agent makes each hop conditioned on what the previous hop found.
How does the agent avoid retrieving noise? Each round expands a frontier of active concept_edges neighbours capped at 12, the query is focused on what the previous hop found, and explored concepts cannot be revisited. The agent abstains after 4 rounds if confidence never clears 0.80.
When does the agent abstain? An answer is emitted only if confidence reaches the 0.80 bar within 4 rounds; otherwise it abstains rather than guess — the right default for explanation-critical answers over the curriculum graph.
Where do RL-trained graph agents fit? The 2026 RL papers learn traversal policies that adjust the frontier dynamically and recover from dead ends. This design approximates that with fixed parameters and no training; fine-tuning on traversal trajectories is the deferred next step.
Autonomous Knowledge Graphs — the series
- Autonomous Knowledge Graph Construction: Graphs That Build Themselves (autonomy: high)
- Reasoning Over the Graph: From GraphRAG to Planning Agents (this article — autonomy: high)
- Self-Healing Knowledge Graphs: Graphs That Fix Themselves (guardrail)
- The Graph as Agent Memory (autonomy: medium)
- Closing the Loop: Evaluation, Debate, and Discovery (guardrail)
A companion thread to The Autonomous Sales Fleet. Next: #3 Self-Healing Knowledge Graphs.
References
- Jiajin Liu et al. GraphSearch: Agentic Search-Augmented Reasoning for Zero-Shot Graph Learning. 2026. arXiv:2601.08621. https://arxiv.org/abs/2601.08621
- Jinyoung Park et al. HyperGraphPro: Progress-Aware Reinforcement Learning for Structure-Guided Hypergraph RAG. 2026. arXiv:2601.17755. https://arxiv.org/abs/2601.17755
- Yuyang Bai et al. GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training. 2026. arXiv:2602.02518. https://arxiv.org/abs/2602.02518
- Zihao Jiang et al. TKG-Thinker: Towards Dynamic Reasoning over Temporal Knowledge Graphs via Agentic Reinforcement Learning. 2026. arXiv:2602.05818. https://arxiv.org/abs/2602.05818
- Yuchen Ying et al. GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning. 2026. arXiv:2603.01410. https://arxiv.org/abs/2603.01410
- Rong Fu et al. S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering. 2026. arXiv:2603.23512. https://arxiv.org/abs/2603.23512
- Shuwen Xu et al. GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum. 2026. arXiv:2603.28533. https://arxiv.org/abs/2603.28533
- Yuanfu Sun et al. AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning. 2026. arXiv:2604.05846. https://arxiv.org/abs/2604.05846
- Liuyin Wang. Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents (Engram). 2026. arXiv:2606.09900. https://arxiv.org/abs/2606.09900
