Skip to main content

Detecting Agent Defects and Drift in Production Sales Agents

· 20 min read
Vadim Nicolai
Senior Software Engineer

Your production sales agent has not crashed. There are no error logs and no timeouts. Yet something is off. The agent still sounds fluent and still follows the script, but its trajectories have grown longer and its tool calls more repetitive. This is where teams learn that agent defects are not classical code bugs. They are behavioral discrepancies between what the developer's control logic expects and what the model actually produces. The 2026 study "Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes" (arXiv:2603.06847) makes the scale concrete. It mined 13,602 issues from 40 repositories, sampled 385 faults, and validated its taxonomy with 145 developers.

Autonomy is the whole subject here. This article is the capstone of a series — The Autonomous Sales Fleet — that built one production system across ten installments, adding exactly one capability per article as one real graph, each step climbing an autonomy ladder that runs from rep-assist up to self-directed plan→act→verify loops. Every rung of that ladder is a grant of trust, and every grant can decay. Defect and drift detection is the guardrail that makes autonomy durable rather than a one-time gift: it is the continuous check that an agent promoted up the ladder has not quietly slid back down it in production.

That durability is the point a per-run pass/fail can never deliver on its own. An agent that earns the right to act without a human in the loop only keeps that right if something watches for the slow degradation no single run reveals. The monitor in this article is that watcher — it reads finished traces, flags the wandering tool loops and drifted personas that keep an agent looking fluent while it stops doing its job, and routes the failures back to the human gate that granted the autonomy in the first place. Catch the defect per run, catch the drift across runs, and the fleet can hold its autonomy instead of silently forfeiting it.

Reason→Decompose→Act→Verify: Building an Autonomous CRM Orchestrator on LangGraph

· 23 min read
Vadim Nicolai
Senior Software Engineer

Every CRM workflow engine — Salesforce Flow, HubSpot automation, a homegrown Python script — executes a pre-written script. A lead enters, a condition fires, an action runs: deterministic, safe, and brittle. Deviate from the expected path and the script breaks, or worse, silently does the wrong thing — an ambiguous email, a flaky enrichment API, a customer who replies mid-automation. The industry's reflex answer is to "throw an LLM at it," which buys flexibility but also buys hallucinations, prompt injection, and an audit trail that reads like a black box.

Production sales needs a middle ground. It needs an autonomous orchestrator that reasons about a goal, decomposes it into verifiable steps, executes only the steps that pass a governance gate, and proves every decision. That is the Reason→Decompose→Act→Verify (RDAV) pattern. It is the foundation of the autonomous CRM orchestrator described here — the first capability in a connected ten-part series, The Autonomous Sales Fleet. On the fleet's autonomy ladder this is the highest rung: RDAV is what automates the human plan step — deciding which actions a contact needs and in what order — while still earning the act step through a confidence gate and keeping a human on verify for anything below threshold. Every other capability in the series either feeds this orchestrator or constrains how much of plan→act→verify it is allowed to run unattended.

Design-Thinking Multi-Agent Expert Panels for Campaign Strategy

· 22 min read
Vadim Nicolai
Senior Software Engineer

Design-thinking multi-agent campaign strategy is what you get when you let an agent fleet own the plan step that a human normally improvises in their head. Instead of a hard-coded six-touch weekly drip, one LangGraph graph simulates a room of human experts — a strategist, a skeptic, a brand-voice lens — arguing over how a multi-touch outreach sequence should be shaped before the first email is ever drafted. On the fleet's autonomy ladder this capability sits medium: it automates the deliberation over what a campaign's touch sequence should be, then hands the resulting plan to the durable engine, which still holds every individual email for human approval before it acts. Autonomy is earned, not asserted — the panel's output is only a seed (cadence and per-touch angles), never a send.

Hierarchical Coach→Worker Delegation for Organized Agent Teams

· 26 min read
Vadim Nicolai
Senior Software Engineer

A flat agent swarm caps its own autonomy. Let every worker talk to every peer with no leader tracking progress, and the system can run for hours without anyone — human or machine — able to say whether the work was actually done. That is the ceiling this article is about. Hierarchical coach→worker delegation raises it: a single coach plans once, delegates to specialized workers, and those workers act unattended against that one plan instead of re-improvising every step. The autonomy gain is not that more agents run; it is that one durable plan governs many executions over time, so the plan→act→verify loop stops being per-run and becomes a property of the whole campaign.

On the fleet's autonomy ladder this capability sits high. The coach automates the plan step across an entire multi-touch campaign — a sequence that unfolds over weeks, not a single run — and worker subgraphs act against that plan unattended, with the human verify preserved only at each draft's approval. This article grounds that argument in two flag-gated graphs from one production agentic-sales fleet: a campaign-level coach (AA02) and a single-email organized team (AA06). It connects both to the organized-teams paper by Guo et al. (2024) and to decades of organizational evidence. The constants, enums, and feature flags below are read from the code, not from a benchmark. The claim is contrarian because the zeitgeist says "swarm good, hierarchy bad." The evidence says the opposite.

What Is Coach→Worker Delegation in Multi-Agent Systems?

Coach→worker delegation is a hierarchical pattern in which one designated agent — the coach — produces a single up-front plan and assigns scoped work to specialized worker agents, who execute that plan without each renegotiating context. It is the antithesis of a flat swarm, where every agent communicates with every other agent. The coordination cost of the flat topology grows with the number of communicating pairs: each member must reconcile conflicting plans, resolve redundant outputs, and re-derive shared context on every turn. Hierarchy collapses that cost to a single planning step.

Loading diagram…

The work that directly inspired both production graphs is Embodied LLM Agents Learn to Cooperate in Organized Teams by Guo et al. (2024). The paper shows that unstructured multi-LLM-agent groups suffer information redundancy and confusion: when every agent talks to every other agent, communication cost grows and coordination degrades. Their fix is to impose a prompt-based organizational structure with designated roles and a leader. The leader produces an initial plan; the agents execute their roles; a Criticize-Reflect process then refines the organizational prompt to shed redundant messages. The headline finding is that a single designated leader who plans and delegates raises team efficiency over a flat group. The exact message-count reduction depends on task complexity and is reported per-environment rather than as a single global delta. This maps one-to-one onto coach→worker: the leader becomes a coach emitting one up-front plan, and the members become worker subgraphs executing against it.

The Cost of Flat: Empirical Warnings from Human Teams

The most vivid illustration of flat-delegation failure is teacher absenteeism in developing countries, documented in Missing in Action: Teacher and Health Worker Absence in Developing Countries. Chaudhury, Hammer, Kremer, Muralidharan, and Rogers (2006) conducted unannounced visits to primary schools across six countries and measured who was actually present. They found absence rates ranging from roughly 19 percent to 35 percent35 percent in India and 27 percent in Indonesia at the high end Chaudhury et al., 2006. These are not teachers on sick leave; they are employees who simply do not show up because the monitoring chain is flat or broken, with no leader tracking attendance. The system looks staffed on paper but delivers a fraction of the work it was provisioned for.

A flat agent swarm carries the same exposure. When every worker communicates with every peer and no coach tracks progress, an analogous "absence" appears: agents stall, produce no output, or wander into irrelevant subtrees. Chaudhury et al. measured a 35 percent non-attendance rate at its worst — a figure that should give pause to anyone building a leaderless swarm where nothing tracks who actually did the work.

Dynamic Delegation: Shared, Hierarchical, and Deindividualized Leadership in Extreme Action Teams makes the same point inside high-stakes human teams. Klein, Ziegert, Knight, and Xiao (2006) studied extreme action teams — medical trauma units, firefighting crews — and found that over 50 percent of Israeli medical trauma team leaders changed their delegation mode mid-shift Klein et al., 2006. The switch was from shared leadership to hierarchical command, triggered when a novice joined. The three modes they identify are shared, hierarchical, and deindividualized leadership, and the critical insight is that hierarchical delegation works best when tasks are novel or novices are present. That is exactly the condition of an LLM agent team facing an unseen task: rigid hierarchy is not the default but an adaptive response to urgency and expertise gaps.

The grounded-theory study Self-Organizing Roles on Agile Software Development Teams lands the same conclusion in software. Hoda (2011) examined self-organizing Agile teams and found that all 25 teams — every one — implicitly adopted a "coach" figure (often the Scrum Master) to prevent coordination chaos Hoda, 2011. Despite the ideology of flat self-organization, the teams relied on an informal hierarchy to manage dependencies, resolve conflicts, and maintain workflow. That is a 100 percent replication rate: pure self-organization, in practice, does not exist — a coach always emerges, named or not. The lesson for agent teams is to make that coach explicit and legitimate rather than letting it surface as an accident of which agent happened to speak first.

The Architecture: How Hierarchical Agent Teams Are Structured — AA02

The first live implementation lives in backend/graphs/campaign_graph.py, a LangGraph engine that processes one durable thread per (campaign, contact) pair, persisted via Cloudflare D1 checkpoints. Every constant cited here — the enums, the clamp ranges, the defaults — is read directly from that source file and from its spec, AA02-hierarchical-campaign-coach-worker.md, not from a benchmark or a vendor claim. The baseline graph is a reactive loop: check_reply (end if the contact replied), compose_touch (generate and hold a draft), await_approval (human-in-the-loop interrupt), send_touch (the only node that sends), then schedule_next. The key addition is a coach_plan node that fires exactly once, at step 0, gated by the feature flag CAMPAIGN_COACH_ENABLED (default OFF).

Loading diagram…

When enabled, the coach makes a single make_llm(tier="standard") DeepSeek call to produce a schema-constrained plan for the whole sequence:

  • an angle of ≤200 characters,
  • a tone drawn from a fixed enum {warm, direct, formal, casual, consultative},
  • the cadence day gaps before each touch, clamped to 0–60 days,
  • a max_touches budget between 1–6,
  • a stop_criteria sentence.

The _coerce_coach_plan function clamps every field against that schema. If the LLM emits an invalid value — a tone of "aggressive", a max_touches above the budget — the function returns None and the graph fails open to static defaults: _DEFAULT_CADENCE_DAYS = [0, 4, 7, 7, 7, 7] and _DEFAULT_MAX_TOUCHES = 6. The coach's authority is bounded: it can only operate within hard limits that make structural hallucinations unreachable. An empty angle is the one field that cannot be sensibly defaulted, so it alone forces the fail-open.

The plan is stored in the checkpoint. Every subsequent cron resume reads the same plan, because coach_plan is idempotent — if state.get("coach_plan") is already populated, or the step is not 0, the node returns early. Workers honor that plan: schedule_next reads cadence_days and max_touches off state, and compose_touch folds the coach's angle and tone into the payload sent to the email_outreach delegate subgraph. This is the leader→delegation insight made durable. The sequence never re-improvises timing or messaging at each touch.

Key Benefits of Coach→Worker Over Flat Agent Architectures

The structural payoff is concrete. Without the coach, each touch re-derives its angle and timing from scratch, inviting tonal drift and overlapping arguments across a sequence that spans weeks. With the coach, the campaign stays coherent because it is sourced from one plan. The added cost is a single, bounded delegation step: one extra LLM call per thread, made once at step 0. The sequence already issues one composition call per touch, up to the max_touches budget of 6, so the coach adds one planning call on top of a baseline of six. That is the trade the organized-teams paper argues for — one planning call up front replacing the per-step renegotiation a leaderless sequence pays on every touch.

The coach is also auditable in a way an emergent leader never is. The plan rides the checkpoint stamped with Grounding-First provenance: a four-field {confidence, reason, source, evidence} envelope (_provenance) persisted alongside the decision. A declared coach_plan with provenance beats an authority no log can reconstruct. When a campaign drifts, you can read the exact plan that governed it; when a flat swarm drifts, there is nothing to read.

AA06: The Organized Team Inside One Email

The second implementation, in backend/graphs/email_orchestrator_graph.py, addresses a different scope: the multi-role team that collaborates on a single outbound email. Its constants are likewise read from that source file and its spec, AA06-organized-team-role-assignment.md. The baseline orchestrator is a deterministic StateGraph: hydrate (load the contact, company, and up to 8 company_facts rows from D1, ordered by confidence), load_history, safety_gate (a central suppression check — a SHA-256, i.e. 256-bit, hash plus per-contact do_not_contact, bounced, unsubscribed, and replied flags), recall_memory, decide_action, then compose. The new capability is a plan_roles leader node, gated by ORCHESTRATOR_ROLE_PLAN (default OFF).

Loading diagram…

This node makes one make_llm(tier="standard") call to assign an ordered role plan over a fixed role enum of 3 roles — researcher, composer, reviewer — instructed to run in that order. The _repair_role_plan function drops any role outside the enum; on an invalid response or a kill-switch (LLM_KILL_SWITCH), it fails open to the default plan ["researcher", "composer"]. The reviewer then runs as _review_draft, a deterministic grounding gate that flags an empty draft or unresolved {{template}} markers. Its verdict is stamped into graph_meta.review_passed and review_notes. A failed review degrades rather than blocks, because the orchestrator already returns drafts and never sends.

The key design choice is that the role enum is closed and maps onto the real subgraphs the workers call (email_compose, email_reply). An out-of-vocabulary role is a structural impossibility, not a runtime hope. The coach's assignment is constrained to real subgraphs — the same discipline AA02 applies to the coach's numeric fields, applied here to the role vocabulary.

This pattern directly implements the designated-role structure from Guo et al. The leader does not do the work; it assigns work to specialists and monitors output. The cost is 1 extra LLM call per email for the role assignment. The reviewer adds near-zero cost, because _review_draft makes 0 model calls. It exists to catch the one class of error a flat topology misses precisely because no single agent owns the output-quality check: an empty body, or a {{first_name}} marker that never got filled.

How Task Routing Works: When Hierarchy Must Adapt

Not all tasks need a coach. Klein et al. (2006) observed that trauma teams switched back to shared leadership when the team was experienced and the task routine. The coach→worker pattern is most valuable when the task is novel (LLM agents have no preexisting knowledge of it), novices are present (the workers have no cache of successful plans), the output requires coherence across multiple steps (a multi-touch campaign), or the subtasks are interdependent (research must precede composition).

The review Work Groups and Teams in Organizations maps the contingencies that decide when hierarchy pays. Kozlowski and Bell (2013) survey the team-effectiveness literature and identify 4 critical contingency themes — context, workflow, levels, and time Kozlowski & Bell, 2013. Hierarchical delegation works best when tasks are interdependent and demand clear accountability, but it can suppress adaptive learning if the coach never listens to worker feedback. The Criticize-Reflect process in Guo et al. is one remedy: workers criticize the coach's organizational prompt and the coach reflects an improved version.

The production graphs make the opposite trade deliberately. AA02's coach_plan is immutable for the life of the thread — it favors stability over adaptation. If the contact replies and changes the conversation context, the coach does not re-plan; the baseline reactivity (check_reply → END) takes over, and the suppression and do-not-contact gates remain authoritative on who may be contacted. Re-planning mid-campaign would require a second LLM call and risk breaking the coherence of a sequence that already spans 6 touches. (The fleet does carry separate flag-gated escape hatches — a smart-deferral parser for "circle back later" replies and a reflect-after-N replan — but the coach plan itself stays fixed.)

The ethnography Workers' Rites: Ritual Mediations and the Tensions of New Management shows what happens when an organization tries to abolish hierarchy outright. Islam and Sferrazzo (2021) found that workers in nominally "flat" organizations engage in rituals to reconstruct informal hierarchies — complete flattening creates ambiguity, not equality Islam & Sferrazzo, 2021. The pattern echoes Hoda's 100 percent coach-emergence finding from a different angle: suppress the explicit leader and a covert one reappears. For agent teams the implication is direct — an explicit, named coach is more legible and more auditable than the implicit one a leaderless swarm grows anyway.

Production Challenges and Scaling Strategies for Delegated Agent Teams

Based on the evidence and the implementation data, here is a practical decision framework for choosing between flat, coach→worker, and dynamic delegation:

ConditionFlat SwarmCoach→WorkerDynamic Delegation
Task noveltyPoor (coordination chaos)Best (coach sets context)Good (adapts if coach present)
Interdependence of subtasksPoor (conflict-resolution cost)Best (coach sequences)Good (coach adapts sequence)
Number of agents<3 acceptable3–10 ideal3–10 with feedback loop
Tolerance for latencyFlat (no delegation overhead)Accept 1 extra call per sequenceAccept 1–2 extra calls
Coherence requiredLow (single step)High (multi-step)High (adaptive coherence)

Both implementations carry an eval gate of ≥0.80 on every prompt path — the same bar the fleet uses for offline LangSmith golden datasets (agentic-sales:campaign:final_response for the campaign touch). The coach→worker pattern does not automatically lift eval scores; it improves coherence and reduces drift. The eval gate is what catches a coach plan that degrades generation quality, and the fail-open defaults ensure the system falls back to the baseline rather than shipping a bad plan.

The scaling discipline is the same in both graphs and worth stating as a rule. One plan, many executions: the coach makes a single call per sequence, not per step, capping planning cost while guaranteeing coherence. Constrain the coach's output: fixed enums, numerical clamping (1–6, 0–60), and fail-open defaults keep a hallucinated plan from reaching execution. Make delegation structural, not aspirational: the role enum must correspond to real worker subgraphs, so an unsupported role is impossible rather than a runtime fallback. Audit every plan: four-field provenance enables debugging and rollback without restarting the thread. Feature-flag the coach: CAMPAIGN_COACH_ENABLED and ORCHESTRATOR_ROLE_PLAN are default-OFF, so rollback is a flag flip, not a redeploy — and when unset, both graphs are byte-identical to today's behavior.

Limitations: What the Field Gets Wrong

The dominant narrative holds that flat agent swarms are more "democratic" and "efficient" because they avoid a single point of failure. This ignores the coordination overhead flat topologies incur. Dignum (2000) formalized that explicit organizational structures reduce communication overhead in multi-agent systems Dignum, 2000 — yet the formal model is the part most often skipped in practice, leaving flat meshes that collapse under moderate complexity. Many agent frameworks default to broadcast communication, which masks the cost until the system grows past a handful of agents.

Hierarchy itself is not the villain the flat-swarm narrative makes it out to be. Romme (2019) reframes hierarchy as a gradient of accountability that a system can climb up or down depending on the problem, rather than a fixed chain of bosses Romme, 2019. An organization can be a "hierarchy without bosses" when delegation is grounded in competence rather than status — which is exactly the legitimacy a coach node earns in these graphs. The coach's authority is not positional; it is the bounded, schema-constrained right to emit one plan, auditable through four-field provenance. That is hierarchy as Romme describes it: accountability made explicit, not power concentrated.

A broader illustration of delegation failure comes from the World Development Report 2018: Learning to Realize Education's Promise. The World Bank (2017) found that only about 50 percent of students in many developing countries achieve basic literacy World Bank, 2017. This is a downstream effect of weak hierarchical oversight — teachers absent (the 35 percent worst case Chaudhury et al. measured), curriculum unenforced, no coach to sequence learning across years. Flat, unaccountable structures fail quietly and at scale, long before anyone notices the work was never coordinated.

But coach→worker is not a panacea, and the honest limitations are sharp. It adds latency — one LLM call per sequence or per email — and is overengineered for single-turn or linear-chain tasks where flat or no delegation wins. It requires careful prompt engineering for the coach, since a vague plan propagates to every worker. Its immutable-plan design trades adaptivity for coherence: a campaign whose premise was wrong at step 0 stays wrong until a human intervenes, because the coach does not re-plan. And the coach itself runs on the same DeepSeek family as the workers, so the eval gate — not the coach's self-report — is the only trustworthy quality signal. For coherent, multi-step, interdependent agent work, hierarchy is not a bug; it is the structure that keeps the team from talking itself into chaos. For anything simpler, it is dead weight.

Conclusion

The evidence converges from four directions. Human trauma teams switch to a single commander when novices arrive — over 50 percent of leaders did so mid-shift Klein et al., 2006. Every one of Hoda's 25 Agile teams grew an informal coach Hoda, 2011. Weak monitoring chains let teacher absence climb to 35 percent Chaudhury et al., 2006. And the organized-teams paper shows the same structure raising efficiency in embodied LLM agents Guo et al., 2024.

The production answer is to make the coach explicit, bounded, and auditable. The AA02 campaign coach issues one planning call at step 0; durable workers then honor that plan across a six-touch sequence. The AA06 leader assigns a fixed three-role team, and a zero-model-call reviewer gates the draft. Both sit behind default-OFF flags, both stamp four-field provenance, and both fail open to deterministic defaults — so the hierarchy is a refinement, never a new failure mode. For coherent, multi-step, interdependent agent work, a leader who plans once and delegates is not nostalgia for org charts. It is the cheapest known defense against a swarm talking itself into chaos — and the rung that lets the fleet's plan→act→verify loop become durable instead of per-run.

This article is #7 in a connected series, The Autonomous Sales Fleet. Each piece realizes one multi-agent paper as one real LangGraph graph, sharing a DeepSeek-only egress, a Cloudflare-D1 data plane, a LangSmith observability plane, a ≥0.80 eval gate, and a draft-first approval rule. Article #1, Reason→Decompose→Act→Verify — an Autonomous CRM Orchestrator, gave a single run the plan→act→verify planner; this piece scales that planner across time and across roles. Article #6, Design-Thinking Multi-Agent Campaign Strategy, is the deliberation panel that produces a sequence plan — the coach here is the hierarchy that executes one.

FAQ

Q: What is the difference between Coach→Worker delegation and a flat agent architecture? A: In Coach→Worker delegation a single agent (the Coach) plans and delegates subtasks to specialized Worker agents; a flat architecture has all agents communicate peer-to-peer. The hierarchical approach scales better because planning is centralized into one up-front call and each Worker has a narrow scope, so coordination cost does not grow with the number of agent pairs.

Q: How do you handle task routing when a Worker agent fails? A: In these production graphs, failure fails open to a deterministic baseline. An invalid coach plan reverts to static cadence defaults; an invalid role plan reverts to ["researcher", "composer"]; a kill-switch short-circuits every LLM path. Broader systems add retry with backoff, a timeout threshold, and a fallback queue, but the cheapest robust pattern is a constrained schema plus a fail-open default.

Q: Can Worker agents communicate with each other? A: In a strict hierarchy, Workers coordinate only through the Coach's plan and shared graph state, not by broadcasting to peers. That is the whole point — eliminating the all-pairs communication that makes flat swarms expensive. Some implementations allow limited peer data-sharing, but the Coach retains final oversight of the output.

Q: What frameworks support hierarchical Coach→Worker patterns? A: The implementations here use LangGraph with a single graph registry, a Cloudflare D1 checkpointer for durable state, and LangSmith for observability. Any stateful-graph framework that lets one node write a plan onto shared state that later nodes read can express the pattern.

Q: When should you not use a Coach→Worker delegation pattern? A: Avoid it for single-turn or linear-chain tasks needing only one or two agent calls — the routing overhead adds latency without benefit. Flat or no delegation is more efficient there. Reserve the coach for novel, multi-step, interdependent work where coherence across steps is the thing you are buying.

The Autonomous Sales Fleet — full series

This is Part 7 of 10 in a series on building one production autonomous-agentic-sales system on LangGraph + DeepSeek + Cloudflare D1, where each part adds one capability that moves the fleet up the autonomy ladder — from human-triggered assistants to self-directed plan→act→verify loops, gated by autonomy guardrails. The arc runs orchestration → enablement & analytics → campaign strategy → reliability & evaluation.

Orchestration

  1. Autonomous CRM Orchestrator (reason→decompose→act→verify)autonomy: high
  2. Multi-Step Lead Qualificationhigh
  3. Lead-to-Proposal Multi-Agent Pipelinehigh
  4. Hierarchical Coach→Worker Delegationhigh

Enablement & analytics 4. Sales-Enablement Copilot: Deal Coaching & Objection Handlingmedium 5. NL-to-SQL CRM Analytics over Cloudflare D1medium

Campaign strategy 6. Design-Thinking Expert Panels for Campaign Strategymedium

Reliability & evaluation — the autonomy guardrails 8. Deadlock & Infinite-Loop Preventionguardrail 9. Evidence-Driven Release Gates (PROMOTE/HOLD/ROLLBACK)guardrail 10. Detecting Agent Defects & Drift in Productionguardrail

References

  • Chaudhury, N., Hammer, J., Kremer, M., Muralidharan, K., & Rogers, F. H. (2006). Missing in action: Teacher and health worker absence in developing countries. Journal of Economic Perspectives. Resolve via DOI
  • Klein, K. J., Ziegert, J. C., Knight, A. P., & Xiao, Y. (2006). Dynamic delegation: Shared, hierarchical, and deindividualized leadership in extreme action teams. Administrative Science Quarterly. Resolve via DOI
  • Hoda, R., Noble, J., & Marshall, S. (2011). Self-organizing roles on agile software development teams. IEEE Transactions on Software Engineering. Resolve via DOI
  • Guo, X., Huang, K., Liu, J., Fan, W., Vélez, N., Wu, Q., Wang, H., Griffiths, T. L., & Wang, M. (2024). Embodied LLM agents learn to cooperate in organized teams. arXiv preprint arXiv:2403.12482. Read on arXiv
  • Dignum, F. (2000). A formal model of organizational interaction. Proceedings of the International Conference on Autonomous Agents. Resolve via DOI
  • Kozlowski, S. W. J., & Bell, B. S. (2013). Work groups and teams in organizations. In Handbook of Psychology (2nd ed.). Resolve via DOI
  • Islam, G., & Sferrazzo, R. (2021). Workers' rites: Ritual mediations and the tensions of new management. Organization Studies. Resolve via DOI
  • World Bank. (2017). World Development Report 2018: Learning to Realize Education's Promise. Washington, DC: World Bank. Resolve via DOI
  • Romme, A. G. L. (2019). Climbing up and down the hierarchy of accountability: Implications for organization design. Journal of Organization Design. Resolve via DOI
  • LangGraph documentation. LangChain. langchain-ai.github.io/langgraph
  • DeepSeek API documentation. DeepSeek. api-docs.deepseek.com
  • LangSmith documentation. LangChain. docs.smith.langchain.com

Evidence-Driven Release Gates: PROMOTE/HOLD/ROLLBACK for Sales Agents

· 22 min read
Vadim Nicolai
Senior Software Engineer

An evidence-driven release gate is the single component that lets a sales agent earn more autonomy instead of being granted it. Every move up the autonomy ladder — letting the orchestrator auto-dispatch a campaign, letting a multi-touch sequence run unattended, letting a new prompt version reach every thread — is only safe once a window of evidence clears a deterministic gate. The gate is where "earned autonomy" stops being a slogan and becomes a reproducible PROMOTE/HOLD/ROLLBACK decision: it is the mechanism that converts human approval of a version into machine approval, on evidence, so the fleet can climb a rung without a human re-reading every send.

That autonomy is fragile precisely because the most important release signals are invisible to a human reading the output. In a multi-agent sales fleet whose outputs are non-deterministic, one eyeballed conversation can sit directly next to a silent regression. The anchor for this article, "Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications" (Maiorano, 2026, arXiv:2603.15676), measured this directly: across a longitudinal case study of an internally deployed multi-agent conversational system, human reviewers and the automated gate agreed at only kappa = 0.13 — barely above chance. The reason is structural — latency violations and routing errors leave no trace in response text — and it is the whole argument for handing the autonomy decision to a gate rather than a reviewer.

This is article #9 in a connected 10-part series building one production sales fleet on LangGraph + DeepSeek + Cloudflare D1 + LangSmith. Each article realizes one CLEAN-tier 2026 paper as one real graph or decision function in the same fleet. They share the same constraints: a three-plane architecture (LangGraph control plane, Cloudflare data plane, LangSmith observability plane), DeepSeek-only egress through a single Cloudflare AI Gateway, a ≥0.80 eval bar on every prompt path, Grounding-First provenance on every persisted decision, and draft-first human approval. The fleet already scores individual runs (the territory of #8 Deadlock & Loop Prevention and #10 Agent Defect & Drift Detection). This article is what sits on top of those per-run verdicts: a deterministic gate that decides whether a version may ship.

Lead-to-Proposal as a Multi-Agent Pipeline in LangGraph

· 22 min read
Vadim Nicolai
Senior Software Engineer

A lead-to-proposal pipeline in LangGraph runs an autonomous lead→proposal loop: a raw lead enters, and three specialized agents qualify it, research it from grounded facts, and draft a tailored proposal — every intermediate node executing unattended, with no sales rep between them. That is the whole point of decomposing the work into a multi-agent graph rather than one prompt. The loop earns its autonomy by stopping at exactly one place: a human gate on the send, the single action that carries legal and reputational weight.

That gate is what most implementations get wrong. They either automate everything and lose human oversight at the consequential step, or keep a human in every node and forfeit the throughput the automation was supposed to buy. The pipeline below takes neither path. It automates the expensive cognitive labour — qualify, research, draft — and holds the final verify for an operator, who approves a grounded draft rather than composing one from scratch. The bottleneck was never the proposal itself; it is everything upstream of it, and that is precisely what the loop absorbs.

From Scripted Chatbot to Multi-Step Sales Agent: Lead Qualification That Sequences Work

· 25 min read
Vadim Nicolai
Senior Software Engineer

A multi-step lead qualification agent earns its autonomy by sequencing work no human queued: it decomposes an inbound signal into an ordered plan, grades each step against real data, and stops at a human-approval interrupt before anything ships. That is the line between a scripted chatbot and an agent — not a newer model or a sharper prompt, but a decision about who gets to sequence work. A chatbot automates a single turn; an agent automates the workflow that turn belongs to. On the fleet's autonomy ladder this capability sits high: it takes over the human plan step for an inbound lead — deciding which qualification and analysis tasks to run, and in what order — while every act stays a draft held for human verify.

The autonomy guard here is conservative by construction. The agent never sends; it composes, and the message is held as a pending draft behind a confirm-before-mutate interrupt, with a deterministic safety veto sitting upstream of the planner so a hostile or malformed plan can never reach a suppressed contact. That is the posture this article builds: reasoning is delegated, action is gated. Article #1's orchestrator dispatches into this qualifier; this is where the fleet first replaces a rep's "is this lead worth my time, and what do I do next?" judgement with a graded, auditable, draft-first sequence.

This is article #2 in The Autonomous Sales Fleet, a connected series describing one production agentic-sales system where each piece adds exactly one capability. The fleet shares a single architecture: a control plane of LangGraph StateGraphs, a data plane on Cloudflare (D1, Workers, Queues), and an observability plane of LangSmith tracing with per-graph golden datasets. Every LLM call exits through one DeepSeek endpoint behind a Cloudflare AI Gateway; no graph ships unless its golden dataset passes an eval gate; every persisted AI decision carries a four-field provenance record; and outreach is always draft-first, held for human approval. This article builds on The Autonomous CRM Orchestrator on LangGraph (#1) and connects forward to the Lead-to-Proposal Multi-Agent Pipeline (#3), which takes the qualified lead as a conceptual starting point.

The strongest evidence for constraining an agent the way this one does comes from AgentArch (Bogavelli, Sharma & Subramani, 2025), a benchmark of 18 agentic configurations across orchestration, prompt strategy, memory, and thinking-tool usage. It finds "significant model-specific architectural preferences" that break the one-size-fits-all assumption, with top models clearing only 35.3% of the complex enterprise task and 70.8% of the simpler one. When even the best configuration fails two of three hard tasks, an open-ended agent loop is a liability — and a closed, typed, narrow planner is the defensible bet. That is precisely the change this article walks through in a real email_orchestrator graph. Industry framing pieces such as Rai (2026) draw the same chatbot-versus-agent line conceptually; the engineering case rests on the indexed and canonical work cited below.

Why Scripted Chatbots Fail to Qualify Leads

Every implementation detail below — node names, scores, thresholds, the feature flag — is read directly from the production agentic-sales codebase (backend/graphs/email_orchestrator_graph.py, backend/graphs/score_contact_graph.py) and its AA04 specification. These values are first-party parameters of this deployment, tuned with the fleet's own eval-gated, grounded-pipeline discipline and LLM-as-judge harness; they are not figures borrowed from any cited paper, and I label them as such once here so the rest of the article can stay readable.

In the orchestrator built for this series — the same email_orchestrator LangGraph StateGraph documented in #1 — the scripted router lives in decide_action. It is an if/elif ladder with exactly four branches: reply if there is an unanswered inbound message; initial if no prior send exists (sequence_number == 0); skip with reason too_soon if the last send is younger than FOLLOWUP_DAYS = 3 days; and followup otherwise.

That ladder is fast, deterministic, and idempotent — and incapable of reasoning about why a lead should advance. It collapses every lead into one of four buckets based on timing alone, ignoring whether the prospect replied with budget authority or a polite "not now." The shift the literature describes — from scripted responses to contextual understanding (Chellappan, 2024) — is not a feature add; it requires a different architecture that maintains state and adapts across turns. In this orchestrator, decide_action is the ceiling: three timing checks that cannot adapt to signal quality.

The practical cost is concrete. A lead with a known company but missing seniority lands in the same bucket as a lead with neither company nor role — both get followup after the cooldown. The ladder has no vocabulary for partial qualification, no memory of prior qualification attempts, no scoring layer, and no ability to branch on data quality. A scripted bot mimics understanding without tracking context; the orchestrator's scripted path does the same with routing.

The Multi-Step Lead Qualification Framework

Gated behind a feature flag, SALES_SUPPORT_SEQUENCER_REASONING, the agentic path inserts a reasoning sequencer, plan_tasks, between the recall_memory node and the deterministic routing fork. The whole orchestrator is a LangGraph StateGraph, so adding the sequencer is a matter of inserting a node and an edge. When the flag is off, the graph is byte-identical to the scripted baseline — a hard acceptance criterion in the AA04 spec. When the flag is on, plan_tasks asks DeepSeek (make_llm(tier="standard")) to emit a typed, ordered task sequence over a closed four-stage vocabulary, {qualify, analyze_opportunity, compose, skip} — the interleaved reason-then-act loop of ReAct (Yao et al., 2023) narrowed to a closed vocabulary, and exactly the kind of constrained, function-calling-style plan that AgentArch (Bogavelli et al., 2025) shows outperforms an open agent loop on enterprise reliability. The "self-taught tool use" line of Toolformer (Schick et al., 2023) is the same instinct one rung lower: decide which call to make and when, rather than narrate.

The prompt receives a _signal_bundle containing hydrated contact, company, and company_facts rows plus prior thread history. Untrusted inbound and scraped text is fenced through wrap_untrusted so a hostile prompt cannot rewrite the plan — the prompt-injection guardrail described in the OWASP LLM Top 10 (LLM01). Any stage outside the four-item vocabulary is dropped by _repair_task_plan before it can shape routing. That is a structural guardrail, not a soft warning: an out-of-vocabulary stage never survives the repair step.

Two graded decision functions ride alongside the planner — and here precision matters, because they are functions whose verdicts travel in graph_meta, not graph dispatch nodes. This deterministic-grading stance is the direct answer to the consistency gap τ-bench measures (under 50% success, sub-25% pass^8): the qualify function produces a verdict from deterministic Cloudflare D1-hydrated signals with zero new LLM calls: a base of 0.4, plus 0.3 if the company is known, plus 0.3 if a role or seniority level is known. A score of 0.6 or above yields qualified; below it yields needs_review. The verdict is persisted as a Grounding-First record with four fields — confidence, reason, source, evidence — each back-linked to the D1 rows that produced it, and traced through LangSmith for the golden-dataset eval gate. The planner proposes the order of work; the grade comes from arithmetic over real rows.

Loading diagram…

The diagram shows the real compiled topology: safety_gate runs four nodes upstream of plan_tasks and can short-circuit straight to END, the planner sits between recall_memory and the deterministic decide_action router, and every path that reaches compose first passes the preview confirm interrupt. (For clarity it folds the plan_actions/plan_roles planner-and-team nodes into the decide_action → preview edge; those are the AA01 and AA06 nodes that the merged graph wired in place of the spec's originally-planned routing.)

Mapping Lead Qualification to Sales Readiness Stages

The second decision function, analyze_opportunity, follows the same pattern as qualify: its confidence is 0.5 + 0.1 × fact_count (capped), sourced to the companies and company_facts tables. It summarizes the opportunity from the enrichment rows already hydrated into state, again carrying the four-field provenance record rather than a free-text claim. Both functions exist so the LLM can propose the order of work while the grades come from deterministic arithmetic over real rows — verdicts that travel in state and feed the eval gate, not nodes that fork the graph.

Sales-readiness tiering lives one graph over, in the fleet's sibling score_contact_graph — a separate, independently compiled scorer, not a downstream node the orchestrator hands off to (the two graphs share no edge or import). It is the fleet's scoring discipline made concrete: a four-term weighted composite per vertical. The terms are seniority (rule-based from a title and seniority table), role_fit (one DeepSeek inference against the vertical description), reachability (rule-based from an authority score and LinkedIn presence), and a propensity sub-score; the four weights are renormalized to sum to 1.0. Tier thresholds are A at 0.80 or above, B at 0.60 or above, C at 0.40 or above, and D below 0.40. If the DeepSeek role-fit call fails, the score source is honestly relabeled rule_based_fallback — a degraded score is never dressed up as model-grounded — and in batch mode equal scores break ties by ascending contact_id for determinism.

This is the "sequencing work" the title promises. Within the orchestrator, the planner orders the qualification and analysis steps and grades them; the sibling scorer turns a contact's commercial readiness into one auditable number rather than a chat transcript a human must re-read. Each step is a typed stage; each decision carries evidence. The intelligence is in the structure — the four-stage vocabulary, the qualification cut line, the tier thresholds — not in any single prompt. That is the τ-bench (Yao et al., 2024) lesson stated structurally: even strong function-calling agents clear under half of real tool-agent-user tasks, so the reliability has to come from the rails, not the model.

Handoff Triggers: When to Escalate from Bot to Human Agent

The most common objection to autonomous sales agents is safety: what if the LLM hallucinates a plan that sends an aggressive follow-up to a prospect who already unsubscribed? This is the prompt-injection and over-action risk the OWASP LLM Top 10 warns about. The answer in this architecture is a deterministic veto that runs before the planner ever sees the signal. The safety_gate node is wired immediately after load_history and four nodes upstream of plan_tasks. It checks the central suppression list first, then the contact's do_not_contact, bounced, unsubscribed, and replied flags, and short-circuits to END on any hit. The cooldown rule, FOLLOWUP_DAYS = 3 days, is enforced the same deterministic way inside decide_action.

The LLM plan may refine routing but may never override a suppression decision or the too_soon cooldown. The acceptance criterion in the spec is explicit: a contact flagged do_not_contact, suppressed, or too_soon yields action == "skip" regardless of any task plan the LLM proposes. The deterministic veto is a governance layer baked into the graph topology, not a post-hoc audit trail — a structural constraint that makes a whole class of failure modes impossible. The agent gets to reason; the safety gate gets the first and final word, because it runs first.

The other real autonomy boundary is the output itself. The orchestrator's compose node always produces a draft (status: "draft"), reached only after the preview confirm-before-mutate interrupt. There is no score-tier gate that promotes a lead from planning to sending inside this graph; the conservative default is that everything is held for a human. Provenance is the third governance layer, and it is exactly the four-lifecycle-phase data-governance discipline (Pahune et al., 2025) made executable: every AI decision carries confidence, reason, source, and evidence, and a claim with no D1 row behind it does not get written. That is the only way to survive an audit when a lead complains that an agent fabricated a reason for skipping them — the reason field points at a row or a deterministic rule, so the audit trail is machine-verifiable rather than a human-readable guess.

The multi-step agent is not a free win. The reasoning path adds one DeepSeek call per run that the four-branch decide_action ladder does not need, which costs more latency and more tokens. For a pure timing-driven nurture flow, the scripted ladder wins: it is cheaper and fully deterministic.

The qualification scoring is also coarse by design. A 0.4 base plus two 0.3 increments yields only a handful of distinct scores around the 0.6 cut line. That coarseness is a feature for auditability and a ceiling on nuance versus a trained model. The scope is narrow too: this is one B2B sales-support orchestrator on one DeepSeek endpoint over Cloudflare D1, not a verdict on every chatbot-to-agent migration. A multi-step sequence reduces unqualified handoffs; it does not eliminate them.

The Research Backbone: The Papers That Frame the Shift

The production graph is one realization of a small but converging research conversation. Six sources frame the chatbot-to-agent shift from complementary angles, each mapping onto a concrete part of the orchestrator — with the deterministic scoring stance grounded directly in the code rather than in any of them.

  1. AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise (Bogavelli, Sharma & Subramani, 2025) is the architectural spine. It benchmarks 18 agentic configurations across orchestration approach, prompt strategy (ReAct versus function calling), memory design, and thinking-tool usage, and finds "significant model-specific architectural preferences" — with top models clearing only 35.3% of the complex task and 70.8% of the simpler one. That is the empirical case for constraining the planner the way this orchestrator does: a closed four-stage vocabulary and a single function-calling-style typed plan, rather than an open ReAct loop, is itself an architectural bet that AgentArch's results support for enterprise reliability.

  2. ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2023) supplies the planner paradigm. Its interleaved reason-then-act loop — alternating a reasoning trace with an action across multiple steps — is exactly what plan_tasks does: it reasons over the signal bundle, then emits a typed, ordered task plan before any single action fires. The orchestrator narrows that open-ended loop to a closed vocabulary of exactly 4 stages and grounds each step in code-derived arithmetic rather than free generation, so the agent's freedom is in ordering the 4 stages, never in inventing a 5th. That single architectural narrowing is what makes the loop auditable: a plan is a permutation of a known set, not an open transcript.

  3. Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023) frames the function-calling instinct underneath the plan. Its core finding — a model trained on as few as a handful of demonstrations per API can learn which of several tools to call, when, and with which arguments across calculators, search, and Q&A — is what a typed, closed-vocabulary plan formalizes: the model commits to 1 of the 4 named stages rather than narrating, and _repair_task_plan drops any token outside that set of 4 before it can shape routing. The contract is the point: a named call beats a paragraph of intent every time you need to dispatch on it.

  4. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (Yao et al., 2024) supplies the reliability reality check. It shows that even state-of-the-art function-calling agents like gpt-4o succeed on under 50% of real tool-agent-user tasks, and stay below 25% on the pass^8 consistency measure in retail — meaning the same agent given the same task 8 times rarely passes all 8. That under-50% ceiling, paired with AgentArch's 35.3% on complex enterprise tasks, is the direct argument for moving the reliability into deterministic rails — the safety veto, the typed 4-stage vocabulary, the 4-field provenance record — rather than trusting the model to be consistent on its own.

  5. From Scripted Responses to Contextual Understanding (Chellappan, 2024, SSRN working paper) maps the evolution from scripted to contextual dialogue. Its emphasis on turn-by-turn adaptation aligns with plan_tasks, which re-derives the task sequence from current signals on each run. But the adaptation here is bounded, not open-ended: it operates over a vocabulary of exactly 4 stages, gated against 1 of those 4 outcomes, so "contextual understanding" means a typed, constrained plan rather than free-form chat. The paper's own framing — that current chatbots "mimic" rather than understand — is exactly why this orchestrator refuses to let the model's prose stand as a decision: the qualify verdict is base 0.4 plus 2 separate 0.3 increments, not a sentence the model wrote, so the only thing the contextual layer is trusted to do is order the work.

  6. The Importance of AI Data Governance in Large Language Models (Pahune et al., 2025, Big Data and Cognitive Computing) supplies the governance frame, spanning 4 lifecycle phases — development, validation, deployment, and operations. It maps directly onto the orchestrator's Grounding-First provenance record and the deterministic safety gate. Governance here is not documentation written after the fact; it is the 4-field decision record (confidence, reason, source, evidence) and the structural veto that together make the agent's reasoning auditable and its unsafe actions impossible across all 4 of those phases. No graph in the fleet ships until its golden dataset clears the eval gate, so governance and quality are enforced by 1 threshold rather than 2 separate processes — the same number that gates a release also gates every scoring change.

(Industry trade pieces — Rai (2026) and Patel (2026), both in the non-indexed IJAIBDCMS — draw the same conceptual chatbot-versus-agent and real-time-data lines, but the engineering claims above rest on the indexed and canonical sources, not on them.)

Building Your Sequence: Step-by-Step Implementation

The evidence — from the published papers and from the production graph — supports a clear decision framework for when to deploy a multi-step agent versus a scripted chatbot.

Use a scripted chatbot (the decide_action ladder) when:

  • The qualification path turns on timing alone — reply speed and follow-up cadence — and the FOLLOWUP_DAYS = 3 rule is sufficient.
  • The cost of a misrouted lead is effectively zero, as in a mass nurture campaign that treats all leads identically.
  • The team has no observability tooling to debug plan failures. Without LangSmith traces, a failed plan_tasks call is a black box.

Use a multi-step agent (the plan_tasks sequencer) when:

  • Qualification requires combining data from more than two sources — contact, company, and enrichment — so the _signal_bundle references several Cloudflare D1 tables at once.
  • The lead base contains high-value contacts that must not receive the wrong message; the safety gate is only meaningful if there are non-suppressed contacts to protect.
  • The organization has a governance function that can review plan traces and curate the golden datasets behind the eval gate.

Never skip the deterministic veto. If your agent can override do_not_contact, you are not building an agent; you are building a liability. The veto must be structurally enforced, not a prompt instruction. The production graph wires safety_gate before the planner so the LLM plan is never produced for a suppressed contact in the first place.

Log every plan, even aborted ones. The _repair_task_plan output and the safety_gate decisions should feed directly into LangSmith for golden-dataset curation. The eval gate is only meaningful if you have traces of what the agent would have done before the veto stopped it.

Measuring Qualification Accuracy: Key Metrics to Track

A multi-step qualifier is only as trustworthy as the metrics you watch. Track lead-to-opportunity conversion rate, time-to-qualification (inbound to score availability), handoff acceptance rate (sales accepts the lead), and false-positive rate (leads handed to sales that were not ready to buy). The orchestrator instruments these through LangSmith traces, and the same golden-dataset threshold that gates a graph release is the floor for accepting a new scoring change into production.

Two design choices make these metrics meaningful rather than vanity numbers. First, because the scripted decide_action path and the agentic plan_tasks path share the same orchestrator and differ only by one feature flag, you can A/B the two on the same lead population and compare conversion and false-positive rates. Second, because every decision carries its four-field provenance, a false positive is debuggable down to the exact qualify verdict (the base plus increments) and the tier cut that let a contact through in the sibling scorer — you are measuring a graded decision, not a black box. No sequence eliminates unqualified handoffs entirely; it reduces them, and the metrics tell you by how much.

FAQ

Q: What is a multi-step lead qualification sequence? A: It is a structured process where a prospect moves through several automated stages — qualification, opportunity analysis, and a routed next action — before any message is composed or a human is involved. In this architecture the sequence is a typed, ordered task plan over a four-stage vocabulary: qualify, analyze_opportunity, compose, or skip.

Q: How is a multi-step sales agent different from a scripted chatbot? A: A scripted chatbot follows a fixed decision tree; a multi-step agent uses reasoning-driven task decomposition to order the work, then grades each step deterministically. The planner first proposes the sequence, the qualify function grades the lead against a 0.6 cut line between needs_review and qualified, and the fleet's sibling composite scorer assigns an A/B/C/D tier (thresholds 0.80/0.60/0.40).

Q: When should the system hand a lead to a human? A: Always — every outreach is draft-first. The orchestrator's compose node produces a held draft (status: "draft") reached only after a preview confirm-before-mutate interrupt, and nothing sends without human approval. There is no score-tier gate that lets the agent send on its own inside this graph.

Q: How do you keep the agent from sending to someone who opted out? A: The deterministic veto. The safety_gate node runs before the planner and short-circuits to END on any central-suppression, do_not_contact, bounced, unsubscribed, or replied hit. No task plan the LLM proposes can override that — the spec requires action == "skip" in those cases regardless of the plan.

The Road Ahead

The literature on multi-step sales agents is still nascent, and there are no published benchmarks comparing scripted versus agentic qualification on conversion rate, time-to-human, or false-positive handoffs. That gap is the opportunity. The orchestrator described here is built for exactly that experiment: the scripted path and the agentic path share the same graph, and the only difference is whether SALES_SUPPORT_SEQUENCER_REASONING is on. Flip the flag off to validate the baseline, flip it on for a subset of low-risk leads, and the two trajectories become directly comparable.

The practical insight is this: agents that sequence work are not more intelligent than chatbots — they are better structured. The capability in this email_orchestrator graph comes from the four-stage vocabulary, the graded qualification verdict, the four-field provenance, and the safety gate that fires before the LLM ever sees the signal. That structure is what turns a chatbot into a lead-qualification system that sequences work. The next article in the fleet, Lead-to-Proposal Multi-Agent Pipeline (#3), takes the qualified lead as its conceptual starting point and sequences the proposal.

The Autonomous Sales Fleet — full series

This is Part 2 of 10 in a series on building one production autonomous-agentic-sales system on LangGraph + DeepSeek + Cloudflare D1, where each part adds one capability that moves the fleet up the autonomy ladder — from human-triggered assistants to self-directed plan→act→verify loops, gated by autonomy guardrails. The arc runs orchestration → enablement & analytics → campaign strategy → reliability & evaluation.

Orchestration

  1. Autonomous CRM Orchestrator (reason→decompose→act→verify)autonomy: high
  2. Multi-Step Lead Qualificationhigh
  3. Lead-to-Proposal Multi-Agent Pipelinehigh
  4. Hierarchical Coach→Worker Delegationhigh

Enablement & analytics 4. Sales-Enablement Copilot: Deal Coaching & Objection Handlingmedium 5. NL-to-SQL CRM Analytics over Cloudflare D1medium

Campaign strategy 6. Design-Thinking Expert Panels for Campaign Strategymedium

Reliability & evaluation — the autonomy guardrails 8. Deadlock & Infinite-Loop Preventionguardrail 9. Evidence-Driven Release Gates (PROMOTE/HOLD/ROLLBACK)guardrail 10. Detecting Agent Defects & Drift in Productionguardrail

References

The papers below resolve to a public landing page or DOI. Implementation details (node names, scores, thresholds, the feature flag) are first-party values read from the production agentic-sales codebase, not figures from any paper.

NL-to-SQL CRM Analytics over Cloudflare D1, with a Self-Healing Loop

· 22 min read
Vadim Nicolai
Senior Software Engineer

A sales operator types "how many fintech contacts replied last week?" and gets an answer. No one writes SQL. This is NL-to-SQL CRM analytics on Cloudflare D1: the text_to_sql graph translates the question, runs it on D1, and — when the query fails — heals itself from the database's own error message. That last move is the load-bearing idea behind the self-healing loop: the database is not a passive recipient of your SQL. It is the most honest verifier you have.

That inversion drives Evaluating Open-Source LLM Agents for SQL Generation and Structured Analytics on Relational Databases, by Borovčak, Bagić Babac, and Mornar in Computers, Materials & Continua (2026). You do not demand a perfect one-shot translation. You let the query run, read the error, and regenerate against that diagnostic. The error text is the repair signal. Execution accuracy, not string overlap, is the metric that counts. The 7 numbered findings below are the evidence, and they map onto a 7-node production graph.

This is article #5 of a 10-part series, "The Autonomous Sales Fleet" — one production LangGraph + DeepSeek + Cloudflare-D1 + LangSmith system. Each part realizes one 2026 paper as one real graph. This one is the text_to_sql graph in backend/graphs/text_to_sql_graph.py, one of 39 registered in the fleet. It answers questions over the 4 CRM tables in the Cloudflare D1 database lead-gen-jobs. It generates a SELECT, validates it against a hard read-only gate, runs it, and repairs its own failures up to 2 times. No write path is ever reachable.

On the fleet's autonomy ladder this capability sits medium. It fully automates the plan→act span for a read-only analytics question. The graph translates intent to SQL, runs it, and heals its own failures with no human writing a query. The database's SELECT-only gate is what lets it act unattended. The operator reading the 1-to-2 sentence summary is the verify step. It earns that autonomy because the action space is structurally incapable of mutating data. A write-capable version would drop back down the ladder, behind human approval.

Two siblings frame this one. Article #1, Reason→Decompose→Act→Verify — an Autonomous CRM Orchestrator on LangGraph, reasons over signals and dispatches worker graphs. This graph answers the operator's question about the pipeline itself. Article #9, Evidence-Driven Release Gates for LLM Sales Agents, is the eval harness. It holds every prompt path here to the fleet's ≥0.80 bar before a change ships.

Why NL-to-SQL for CRM Analytics Matters: What the Anchor Paper Found

  1. The protagonist paper carries the weight of this design. Evaluating Open-Source LLM Agents for SQL Generation and Structured Analytics on Relational Databases (Borovčak, Bagić Babac, and Mornar, 2026) evaluates 4 open-source foundation models — Mistral, Devstral, Qwen2.5-Coder, and Qwen3. The task is turning a natural-language request into an executable query for structured analytics. It tests on a custom analytics suite and the canonical Spider benchmark. Its central reframing: NL-to-SQL is not a 1-shot translation but an agentic loop in which the database is the verifier and its error output is the correction signal (DOI 10.32604/cmc.2026.078330).

  2. The benchmark grounds that evaluation, and it is what makes "execution accuracy" a meaningful number rather than a vibe. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task (Yu et al., 2018) introduced 10,181 questions and 5,693 unique complex SQL queries across 200 databases spanning 138 domains. The split is deliberate: a model is scored on databases it never saw in training. That cross-domain design is why a strong Spider result transfers to a private CRM schema at all. The model is rewarded for generalizing query structure, not for memorizing 1 warehouse. The agentic-sales graph inherits that assumption. It is handed the real companies / contacts / email_campaigns / emails surface — 4 tables — and asked to generalize join-and-aggregate shape onto it.

The practical takeaway is the one this graph is built on. Smaller open-source coder models reach usable execution accuracy on analytics queries when wrapped in an iterative, error-grounded loop, rather than asked to 1-shot the SQL. The fleet honors that finding but collapses the model choice to 1 provider. Every LLM call goes through make_llm() — DeepSeek via 1 Cloudflare AI Gateway — so the "open-source agent" insight becomes a DeepSeek-only repair loop. The paper publishes no per-query latency or cost figures for edge databases, and this article invents none. It establishes that the loop works on real relational databases. That is the foundation the implementation stands on.

Designing the Natural Language to SQL Translation Layer

The anchor paper does not stand alone. A clear line of self-correction research arrives at the same conclusion. Two papers with hard numbers justify the DeepSeek-only loop, so they are worth pinning down before the implementation.

  1. The first reframes "you need a bigger model" as "you need a correction loop." SelECT-SQL: Self-correcting ensemble Chain-of-Thought for Text-to-SQL (Shen and Kejriwal, 2024) combines chain-of-thought prompting, self-correction, and ensembling. It reports 84.2% execution accuracy on the Spider development set using GPT-3.5-Turbo. That beats the competing GPT-3.5 solution at 81.1% and the peak GPT-4 result at 83.5% on the same leaderboard. The headline is the ordering: a smaller model with a correction loop beats a larger model without one. That 3-number ordering — 84.2 over 83.5 over 81.1 — is the economic argument the agentic-sales graph leans on. It pairs 1 DeepSeek model with a bounded repair loop instead of reaching for a frontier model on every query.

  2. The second pins down the shape of the loop. SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction (Chaturvedi, Chadha, and Bindschaedler, 2025) decomposes the task into 5 stages: schema linking, subproblem identification, query plan generation, SQL generation, and a guided correction loop. It reports state-of-the-art results across the Spider family, showing again that a 5-stage guided loop beats a 1-shot prompt. Its key move is taxonomy-guided dynamic error modification rather than execution-only correction. The correction is guided by a rule layer, not left to free-form retry. The agentic-sales graph implements that division. The SELECT-only gate is the rule layer. Its rejection reason — not a raw exception — feeds the repair node, so a correction is grounded in a verdict, not a guess.

Setting Up Cloudflare D1 and the Graph, End to End

Loading diagram…

The graph's default contract is small and, crucially, unchanged from before this capability landed. With no execute flag set, the flow is a deterministic LangGraph StateGraph: understand_questionidentify_tablesgenerate_sqlvalidate_sqlEND. It produces {sql, explanation, confidence, tables_used}. Every node that touches the database — execute_sql, repair_sql, summarize — lives only on the opt-in execute=True branch. That separation is the safety contract for existing callers. The byte-identical default path means the /api/text-to-sql endpoint that previously only generated SQL is untouched. The analytics path that runs it is purely additive.

understand_question restates the user's text as a 1-sentence intent. identify_tables is where grounding starts. Rather than let the model invent table names, the node is seeded with the actual CRM surface: companies, contacts, email_campaigns, emails. It picks among those 4 real tables, or a correct subset, instead of guessing. generate_sql then emits a single SQLite SELECT. Its system prompt constrains output to read-only, explicit column lists, and SQLite-only syntax — strftime/julianday for dates, || for concatenation, no ::casts or ILIKE. D1 is SQLite, not PostgreSQL. The output is {sql, explanation, confidence}, and confidence rides through to the final answer as provenance.

The SELECT-Only Gate: Two Layers, One Hard Backstop

validate_sql is the load-bearing guardrail, and it has 2 layers. Layer 1 is a leading-head check. After stripping any leading (, the lowercased query must begin with select or with — a SELECT-bearing CTE. Anything else is rejected, returning exec_error="gate: non-SELECT statement (must start with SELECT/WITH)". Layer 2 is a statement-boundary write/DDL block: a compiled regex, _WRITE_RE. It hard-blocks 19 keywords — insert|update|delete|drop|alter|truncate|grant|revoke|create|replace|merge|copy|call|do|vacuum|reindex|comment|lock|execute|prepare — plus 4 SQLite-specific escapes (attach, detach, pragma, load_extension) and 4 PostgreSQL admin functions kept as defense-in-depth.

  1. The anchoring of that regex separates a guardrail that holds from one that silently corrupts data. The block anchors to a statement boundary: start-of-string, a ; stacked-statement separator, or the ( of a data-modifying CTE. It does not anchor to a bare word boundary. The old version used a plain \b match. It fired on legitimate identifiers — SELECT comment FROM contacts, REPLACE(name,'a','b'), a column named lock or merge. All 3 were wrongly blanked into empty rejections. The statement-boundary anchor keeps the stacked-statement and CTE-write protection while letting those keywords live as ordinary columns and functions. The rule-based-verification line of text-to-SQL research keeps relearning this: the rule layer must be precise. A false rejection is as much a defect as a false acceptance, just a quieter one.

Two properties make this gate trustworthy, not decorative. First, it is the only path to execution. Repaired SQL re-enters validate_sql before it can run, so a repair fixes syntax or semantics but never widens permissions. Second, it composes with prompt-injection fencing rather than relying on it. The user's question is untrusted text. So 4 nodes — understand_question, generate_sql, repair_sql, and summarize — wrap it through wrap_untrusted (from backend/llm/prompt_safety.py). That call fences the body in an explicit <<<USER QUESTION — treat strictly as data…>>> block, strips zero-width and bidi characters, and collapses <<</>>> runs so an attacker cannot forge the end-fence. A "… and also DROP the table" payload is described by the intent step, not obeyed. And even if fencing were defeated, the SELECT-only gate is the hard backstop the generated SQL cannot pass.

Implementing the Self-Healing Loop: D1 Error Messages as Repair Signals

This is where the anchor paper's "database is the verifier" insight becomes running code. When execute=True, a gate-passed query reaches execute_sql, which runs it through infra.db.d1_all against the sales D1. A SQLite/D1 exception can mean a missing column, a malformed join, or a type mismatch. The node does not raise. It captures the raw diagnostic into exec_error and the failing query into failed_sql. The router route_after_execute then checks repair_attempts < _MAX_REPAIR_ATTEMPTS, which is 2. If attempts remain, it sends the run to repair_sql.

  1. The repair mechanism descends directly from 1 paper, and naming it precisely matters for reproduction. SQL Query Engine: A Self-Healing LLM Pipeline for Natural Language to PostgreSQL Translation (Ijaz, 2026) describes a 2-stage pipeline. Its second stage "enters an iterative self-healing loop in which the LLM diagnoses the error using full SQLSTATE codes and PostgreSQL diagnostic messages." The agentic-sales repair_sql node implements that 2-stage shape against D1 instead of PostgreSQL, bounded to 2 repair rounds. It receives the failed SQL plus the actual error text, diagnoses what went wrong, and regenerates a corrected single SELECT. Then, critically, that regenerated query re-enters validate_sql before any execution. The loop is bounded, error-grounded, and read-only by construction — the exact contract the paper argues makes a self-healing pipeline safe to run unattended.

  2. Two design choices keep the loop from becoming a cost or correctness hazard. The first is early-accept. The moment a query executes successfully, the loop ends and routes to summarize, so a working query is never "repaired" into a different one. That mirrors the Spider (Yu et al., 2018) discipline of scoring execution accuracy on the first correct result, not on attempt count. The second is that an empty result set counts as success, not a defect to heal. On CRM data, "no contacts matched that filter" is usually the true answer. A loop that treated 0 rows as failure would burn its 2-attempt budget rewriting a correct query. Gate rejections feed the same loop through a different door. route_after_validate sends a gate-rejected query to repair_sql while attempts remain, with the rejection reason itself ("gate: non-SELECT statement") as the repair signal. After 2 attempts are exhausted, the run returns the last error rather than looping forever. The bound, not a hope, is the circuit breaker.

Handling Edge Cases: Grounded Summaries and Deterministic Builders

The terminal node, summarize, turns the executed query's {rows, row_count} into 1 or 2 plain business sentences for an operator who does not read SQL — the structured-analytics layer the anchor paper describes. It is grounded only in the returned rows. The sample is bounded to 20 rows, and _MAX_ROWS=50 caps what is fetched. It never invents totals, percentages, or names absent from the data. It fails closed in 3 ways. An empty result yields "The query ran successfully but matched no records". No SQL or no rows yields "No query was executed, so there is nothing to report". An unavailable LLM yields the deterministic "The query returned N record(s)" — never a fabricated figure. Every answer carries 4 provenance fields: confidence, reason (the explanation), source (the tables_used), and evidence (the executed SQL). That is the Grounding-First pattern the whole fleet shares.

  1. Some analytics questions should never be left to a model. For those, the same module ships deterministic builders that emit fixed SELECTs and compute results with plain arithmetic. The funnel builder, build_funnel_queries(vertical), emits 6 SELECT COUNT(*) queries for the conversion stages discovered → enriched → contacted → opened → replied → converted. Each is scoped to a sanitized vertical literal, and each passes through validate_sql like an ad-hoc read. compute_funnel_report then derives stage-to-stage conversion rates purely from those 6 counts — no LLM, no fabrication. The attribution builders, build_touch_history_query(contact_id) and attribute_touches, read a contact's ordered touch history through the same SELECT-only path and distribute credit across touches. When the LLM is unavailable, they fall back to a deterministic, still-provenanced last-touch model rather than inventing weights. The split is the decision framework. If the query is structurally known, use a builder. If it is genuinely ad-hoc, use the self-healing loop. This echoes the text-to-SQL literature's recurring finding: a hybrid of guided computation and free-form generation outperforms either alone.

Performance, Cost, and Production: Registry, Observability, and the Eval Gate

The graph is 1 row in the fleet's single source of truth, backend/infra/registry.py: GraphSpec("text_to_sql", "graphs.text_to_sql_graph"). It takes resumable=False because each run gets a random-UUID thread, is idempotent, and has nothing to resume. Both runtimes read identity from that 1 registry — the local langgraph dev server on port 8002 and the FastAPI/Cloudflare app. So the local and deployed graphs are the same graph, not 2 drifting copies. Adding or changing a graph is a single edit there. That is what keeps a 39-graph fleet auditable.

The cost envelope is bounded by 3 hard limits, not a billing surprise: at most 2 repair attempts per query, at most 50 rows fetched, and a 20-row sample handed to the summary model. A query therefore costs at most 4 LLM calls — intent, generation, and up to 2 repairs — plus 1 summary call, for a worst case of 5. The 0.80 eval bar gates any prompt change that would raise that count without earning its keep.

Observability is the audit trail that makes the analytics path trustworthy after the fact. Per run, the spec emits 4 metrics to LangSmith: agentic_sales.text_to_sql.tables_used, agentic_sales.text_to_sql.confidence, agentic_sales.text_to_sql.row_count, and agentic_sales.text_to_sql.repair_attempts. The last is the most operationally useful. Filtering for runs where repair_attempts > 0 surfaces the schema mismatches and ambiguous phrasings the self-healing loop quietly resolved. That is the signal a schema owner uses to add a synonym column or a view. And every prompt path is held to the fleet's ≥0.80 evaluation bar on LangSmith golden datasets — the gate built in article #9, Evidence-Driven Release Gates for LLM Sales Agents. A prompt change that drops summary or repair quality below 0.80 simply does not ship.

Limitations and Honest Scope

This design is not a universal answer, and it is worth being clear about where it stops. The self-healing loop fixes execution-time errors — a bad column name, a malformed join, a type mismatch the database rejects. It cannot fix a query that runs cleanly and answers the wrong question. A SELECT that joins the wrong table returns rows, counts as success, and is summarized confidently. Only the operator reading the answer can catch that semantic miss. The 0.80 eval bar guards prompt quality, not the truth of any single answer.

Two more boundaries matter. First, the benchmarks cited here are public-dataset figures — Spider, SelECT-SQL at 84.2%, SQL-of-Thought's state-of-the-art numbers. They are not measured on this private CRM schema. They justify the loop's shape, not a specific accuracy number for lead-gen-jobs, and this article reports no such number because none has been published. Second, D1 is SQLite. Window functions and full-text search that a Postgres-backed CRM might lean on are not available. Queries that need them fall to the deterministic builders or cannot be expressed. The honest framing: this graph makes NL-to-SQL safe and self-correcting, not infallible.

Practical Takeaways

This architecture is not hypothetical. It runs today on Cloudflare D1 as part of a production autonomous sales fleet. The 6 principles that make it hold up generalize beyond this stack:

  1. Always gate your SQL output, and anchor the gate to statement boundaries. A 2-layer defense is non-negotiable: a leading-head parse plus a statement-boundary keyword block. Anchor the keyword layer to string and statement boundaries, not word boundaries. Otherwise you will silently blank legitimate reads.
  2. Let the database verify. Execution errors carry more information than any static validator. They reflect the real schema at the real moment. Feed the diagnostic back to the model for repair, as the anchor paper and SQL Query Engine: A Self-Healing LLM Pipeline for Natural Language to PostgreSQL Translation both argue.
  3. Bound the loop and accept early. 2 repair attempts is a sober default. The first successful execution ends the loop. An empty result set is a success, not a defect to heal.
  4. Ground the schema. Seed the model with real table names; for a dynamic schema, read PRAGMA table_info at graph start rather than letting the model guess.
  5. Separate ad-hoc from known analytics. Use deterministic builders for funnel reports and attribution; reserve the self-healing loop for genuinely ad-hoc questions.
  6. Provenance every output and eval-gate every prompt change. Carry confidence, reason, source, and evidence on every answer, and hold each prompt path to a ≥0.80 golden-dataset bar before it ships.

The database is the verifier. All 5 papers — Borovčak, Bagić Babac, and Mornar (2026), Yu et al. (2018), Shen and Kejriwal (2024), Chaturvedi et al. (2025), and Ijaz (2026) — converge on 1 reality: NL-to-SQL is unreliable by default and reliable by design. Build in validation, healing, and grounded summarization, and a 7-node graph answers an operator's question safely behind a 2-attempt repair loop. No one writes a line of SQL. Treat the database like the honest debugger it is.

Frequently Asked Questions

What is a self-healing loop in NL-to-SQL? It is an automated feedback loop. A failed query's database error becomes the repair signal. The model diagnoses that error and regenerates a corrected query, bounded here to two attempts.

Does Cloudflare D1 support the SQL that CRM analytics needs? D1 uses SQLite semantics. It supports joins, aggregations, and subqueries — enough for the funnel and attribution queries here. The graph emits SQLite-only syntax, never PostgreSQL casts or ILIKE.

How does the system prevent a destructive query? A two-layer SELECT-only gate. The query must start with SELECT or WITH, and a statement-boundary regex blocks every write or DDL keyword. Repaired queries re-enter the same gate, so a repair can never widen permissions.

Can the same pattern run on Postgres or MySQL? The gate and repair loop generalize, but the SQL dialect and the D1 transport (infra.db.d1_all) are D1-specific. The self-healing pattern itself is database-agnostic.

The Autonomous Sales Fleet — full series

This is Part 5 of 10 in a series on building one production autonomous-agentic-sales system on LangGraph + DeepSeek + Cloudflare D1, where each part adds one capability that moves the fleet up the autonomy ladder — from human-triggered assistants to self-directed plan→act→verify loops, gated by autonomy guardrails. The arc runs orchestration → enablement & analytics → campaign strategy → reliability & evaluation.

Orchestration

  1. Autonomous CRM Orchestrator (reason→decompose→act→verify)autonomy: high
  2. Multi-Step Lead Qualificationhigh
  3. Lead-to-Proposal Multi-Agent Pipelinehigh
  4. Hierarchical Coach→Worker Delegationhigh

Enablement & analytics 4. Sales-Enablement Copilot: Deal Coaching & Objection Handlingmedium 5. NL-to-SQL CRM Analytics over Cloudflare D1medium

Campaign strategy 6. Design-Thinking Expert Panels for Campaign Strategymedium

Reliability & evaluation — the autonomy guardrails 8. Deadlock & Infinite-Loop Preventionguardrail 9. Evidence-Driven Release Gates (PROMOTE/HOLD/ROLLBACK)guardrail 10. Detecting Agent Defects & Drift in Productionguardrail

References

Deadlock & Infinite-Loop Prevention in Multi-Agent Sales Workflows

· 22 min read
Vadim Nicolai
Senior Software Engineer

Deadlock and infinite-loop prevention in multi-agent sales workflows starts with one ugly trace: a sales agent sits idle while a competitor closes the deal. Two nodes trade the same lead back and forth — rechecking CRM fields, re-requesting approval, re-updating scores — until the opportunity ages out. No cancellation, no escalation, no crash. Just an infinite loop that burns credits, writes no value, and slips past every per-message quality gate, because each individual draft looks fine.

This is article #8 of The Autonomous Sales Fleet — one production LangGraph + DeepSeek + Cloudflare-D1 + LangSmith system where each article realizes one 2026 reliability paper as one real graph node. The constraints stay constant across the series. A three-plane architecture splits the work: a LangGraph control plane, a Cloudflare data plane, and a LangSmith observability plane. DeepSeek-only egress runs through a single AI Gateway. A 0.80 eval gate sits on every prompt path. Grounding-First provenance tags every persisted decision, and every send waits on draft-first human approval. This piece adds the liveness layer: structural deadlock and infinite-loop prevention that runs before any model judges anything.

This is a guardrail, not a rung on the autonomy ladder. It is one of the constraints that earns the autonomy the higher rungs exercise — the orchestrator, the coach, the lead-to-proposal pipeline. Every plan→act→verify loop that runs unattended needs a deterministic floor under it. That floor proves the loop will actually terminate; without it, the act step has no safe upper bound. This guard is the thing that lets the fleet trust a self-directed loop at all.

A Sales-Enablement Copilot: Grounded Deal Coaching and Objection Handling

· 24 min read
Vadim Nicolai
Senior Software Engineer

The most effective sales-enablement copilot in our production fleet never sends a single message. That cuts against every vendor demo where a glowing AI drafts the perfect rebuttal and fires it off. This sales-enablement copilot does grounded deal coaching and objection handling, but in production the highest-leverage capability is not generation — it is holding fire. The agentic-sales fleet runs a LangGraph state machine where every objection-handling draft is stamped status='draft' and routed to a human for approval. The copilot coaches, suggests, and grounds its advice in company knowledge, but it never touches the send button. That single design choice turns a liability into an asset: the rep gets a grounded, auditable recommendation that she still owns.

On the fleet's autonomy ladder this capability sits deliberately medium — it is rep-assist, not self-direction. It automates the plan step: what grounded coaching and rebuttal a given objection deserves. But it hands both act and verify to the human. The copilot drafts and grounds; the rep decides and sends. That is a conscious rung below the orchestrator and the lead-to-proposal pipeline. The failure cost of an objection rebuttal — repeating a hallucinated compliance claim to a live prospect — is high enough that earning the send is not worth it.

This is article #4 in The Autonomous Sales Fleet series, and like every entry it adds exactly 1 capability as 1 real graph: a company-knowledge-grounded objection-handling copilot that feeds the reply path, backed by a faithfulness gate and a per-vertical playbook of 9 entries. It builds on the shared fleet introduced in An Autonomous CRM Orchestrator with LangGraph (#1) and the typed task sequencing of A Multi-Step Lead-Qualification and Sales-Support Agent (#2).