Lead-to-Proposal Multi-Agent Pipeline in LangGraph
From Lead to Proposal: Building a Multi-Agent Pipeline with LangGraph
A lead-to-proposal pipeline in LangGraph runs an autonomous lead→proposal loop: a raw lead enters, and three specialized agents qualify it, research it from grounded facts, and draft a tailored proposal — every intermediate node executing unattended, with no sales rep between them. That is the whole point of decomposing the work into a multi-agent graph rather than one prompt. The loop earns its autonomy by stopping at exactly one place: a human gate on the send, the single action that carries legal and reputational weight.
That gate is what most implementations get wrong. They either automate everything and lose human oversight at the consequential step, or keep a human in every node and forfeit the throughput the automation was supposed to buy. The pipeline below takes neither path. It automates the expensive cognitive labour — qualify, research, draft — and holds the final verify for an operator, who approves a grounded draft rather than composing one from scratch. The bottleneck was never the proposal itself; it is everything upstream of it, and that is precisely what the loop absorbs.
A 2026 Aalto University master’s thesis by Ilari Metsälä, Suuriin kielimalleihin perustuvien moniagenttisten työnkulkujen kehitys myyntitehtävissä — Developing LLM-based multi-agent workflows for sales tasks, puts a sharp point on why. Single‑model performance gains are flattening, and the field is turning to specialized multi‑agent architectures. The thesis’s central finding is a trade-off. Multi‑agent workflows improve efficiency by automating complex tasks with reduced user involvement, yet trade away granular control.
On the fleet’s autonomy ladder this capability sits high. It automates the human plan→act span for an entire lead and runs every intermediate node unattended, yet earns that standing only by holding the final verify — the send — for a human. This is where the fleet first chains three cognitive steps end-to-end without a rep between them, and where it stops at the one decision that carries legal weight.
The production system below makes that concrete. It is a single LangGraph codebase, referred to throughout as “the fleet,” running on one DeepSeek egress behind a Cloudflare AI Gateway, a Cloudflare D1 data plane, and LangSmith tracing. This is not a demo: the fleet processes leads through qualification, research, and proposal composition, then presents a held draft to an operator for approval.
The thesis pattern — specialized agents as modular tools via the Model Context Protocol (MCP) — maps directly onto a subgraph-invocation pattern, unpacked in the sections below.
Why a Multi-Agent Pipeline Beats a Single Prompt
Defended on 2026-03-26, Metsälä’s thesis grounds multi‑agent sales automation in 2 complementary studies. The first, a requirements‑engineering study, mapped the real B2B software‑consultancy sales process. A prototype evaluation followed, with think‑aloud user tests and interviews. The headline finding is a trade-off.
Multi‑agent workflows improve efficiency by automating complex tasks with reduced user involvement, at the cost of fine‑grained control. It emerged from users interacting with a proof‑of‑concept. That prototype decomposed the lead‑to‑proposal process into modular tools exposed through the Model Context Protocol.
The thesis argues that single‑model LLM gains are plateauing. The path forward is specialized agents, each owning a narrow slice of the task. That choice is what lets a system be evaluated per node rather than as a black box. Consider the alternative.
A single monolithic prompt — “qualify this lead, research it, and write a proposal” — cannot be debugged, gated, or iterated at the node level. Its performance collapses into one opaque blob.
Architecture: Designing the Multi-Agent LangGraph Pipeline
LangGraph excels at stateful graph orchestration. In AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges — a heavily-cited 2025 survey — Sapkota et al. (2025) taxonomize agentic toolchains across 2 distinct paradigms, contrasting LangGraph’s support for cyclic, stateful graphs with LangChain’s linear chain-based composition. The survey’s central claim is that a single Agentic-AI controller scales worse than a network of narrow AI Agents, each owning a deterministic slice of the task. That distinction is exactly what justifies splitting the fleet’s proposal stage into 3 separately-gated nodes rather than 1 monolithic prompt. Cyclic state is the prerequisite for a node to revisit prior enrichment after a QA flag — something a linear chain cannot express.
Their paper notes that LangGraph excels at stateful graph orchestration. That is the critical difference when agents revisit prior steps. A proposal writer re-fetches lead enrichment data after a qualification agent completes. The fleet relies on LangGraph’s StateGraph topology with one shared schema that grows additively, so adding the AA03 nodes touches no existing fields.
The graph starts from a discover node. It fans out via LangGraph's Send API into parallel enrich_one and contact_one tasks, funnels them through qa_counts and the qa_critic, then ranking. With PIPELINE_PROPOSAL_STAGE_ENABLED on, ranking routes into the three AA03 nodes — qualify_lead, research_lead, compose_proposal — before reaching the outreach_queue interrupt; with the flag off, ranking edges straight to outreach_queue and the topology is byte-identical to the legacy pipeline. The cyclic capability matters here.
Enrichment tasks may need a re-run when the QA critic flags a low-confidence sample. Without cycles, you would re-invoke the entire graph. LangGraph’s state persistence preserves enrichment outputs and allows conditional re-entry without losing context. This is the infrastructural enabler for Metsälä’s modular-tools vision.
Building the Lead-to-Proposal Pipeline: Specialized Agents as Modular Tools
The pattern from Metsälä’s thesis—specialized workflows integrated as modular tools through MCP—maps onto the fleet’s subgraph‑as‑node pattern. The fleet’s internal contract for this stage lives in its AA03 backlog item. It defines three new nodes between ranking and the outreach gate: qualify_lead, research_lead, and compose_proposal. Each is a narrow, independently evaluable subgraph.
Each sits behind the PIPELINE_PROPOSAL_STAGE_ENABLED flag, so the topology is byte-identical to today when the flag is off.
The decomposition principle is general. Sapkota et al. (2025) frame it as a move from a single Agentic-AI controller to a network of narrow AI Agents. Each agent owns a deterministic slice of the task. That maps onto the fleet’s three new nodes.
Each role becomes one LangGraph subgraph node. Each carries its own prompt, its own golden dataset, and its own place in the shared state schema. A regression in one role never silently degrades another. This is the payoff of decomposition.
The modular-tools vision from Metsälä’s thesis is what makes per-node evaluation, and therefore the ≥0.80 gate, possible at all.
A reviewer role is the other half of the pattern. The fleet’s qa_critic node plays it, running over a default sample of 10 fresh enrichments before any proposal work begins. The design choice is explicit: qualify does not research, research does not draft, and compose does not decide eligibility. A node ships only when its prompt clears the ≥0.80 accuracy gate.
So a regression in one role cannot pass unnoticed into production. A flawed proposal draft is caught at the held-draft interrupt, not after a send.
The numbers back the decomposition. In Multi-Agent RAG Framework for Entity Resolution, Althaf et al. (2025) split household entity resolution into specialized, coordinated agents rather than one model. They reached 94.3% accuracy on name-variation matching while cutting API calls by 61% against a single-LLM baseline. Both numbers matter for a lead-to-proposal pipeline. The 94.3% accuracy is what keeps two similarly-named companies from being conflated upstream of a proposal. The 61% call reduction is what keeps a 3-node decomposition affordable on a single DeepSeek egress. A monolithic prompt buys neither — it neither isolates the resolution step nor lets you measure it.
The fleet’s structure echoes this. It runs three new proposal-stage nodes, four additive state fields, and one human gate. All of it sits behind a single feature flag that defaults to off, on one DeepSeek egress. Every node is held to the same 0.80 accuracy bar.
Decomposition is not free — it adds state-management overhead — but the 94.3% accuracy and 61% call-reduction figures show the wins outrun that cost.
Zhang & Arawjo (2024), in ChainBuddy, tackle the “blank page” problem in LLM-pipeline construction, where users stall on where to even begin. In their user study, 78% of participants preferred an agent-generated pipeline scaffold over a blank editor — a strong signal that a structured starting point, not a from-scratch generation, is what people actually adopt. The lesson transfers cleanly to proposal generation: a lead-to-proposal agent should hand the reviewer a grounded first draft rather than a blinking cursor. That 78% preference reframes scaffolding as a baseline requirement for adoption, not an optional convenience layered on top.
In the fleet, the compose_proposal node delegates copy to email_outreach_graph. That subgraph reuses predefined VERTICAL_SEQUENCE_DEFS and SUB_NICHE_SEQUENCE_DEFS step-0 directives — structured scaffolding by another name. The 78% figure suggests scaffolding is a necessity, not a nicety.
State Management Between Agents: MCP Context and Grounding-First Provenance
In A Survey on Model Context Protocol: Architecture, State-of-the-art, Challenges and Future Directions, Ray (2025) describes MCP as a session-oriented JSON-RPC framework. It lets an LLM negotiate capabilities, invoke external tools, and retrieve contextual resources under fine-grained, OAuth 2.1-compliant access control. That is the contract a lead-to-proposal stage needs. The survey names 2 practical costs of standardizing context this way: transport latency and credential exchange. Each of the 3 new nodes is a tool that consumes structured context — CRM facts, product catalogs, pricing rules — with no ad-hoc integration code. The 0.80 gate on each node’s output is what keeps a corrupted context surface from silently propagating into a proposal.
The fleet uses a Cloudflare D1 SQLite data plane in place of a live MCP server, but the spirit is identical. The research_lead node reads one structured, auditable context surface rather than re-scraping the open web on every run.
The research_lead node builds a compact grounded brief. It draws only on enrichment columns and the QA verdict already resident in D1. No new scraping occurs. Each persisted qualify decision carries {confidence, reason, source, evidence}.
This provenance lets the human at the held draft see why the lead qualified and what facts grounded the proposal. Untrusted enriched content is wrapped via llm/prompt_safety.py (wrap_untrusted) to address OWASP LLM01 prompt injection. Scraped text from third-party sources cannot hijack the qualify or compose prompts.
Step‑Wise Evaluation and the ≥0.80 Gate
In Agent-as-a-Judge: Evaluate Agents with Agents, Zhuge et al. (2024) propose evaluating agents at each intermediate step, not only on the final outcome. A single end-of-run score hides where a multi-step trajectory went wrong. This is critical for a lead-to-proposal pipeline. A flawed early step — misinterpreting lead requirements during qualify — cascades silently into a poor proposal 3 nodes later. The fleet implements that step-wise principle directly. Each of its 3 AA03 nodes carries its own LangSmith golden dataset and its own 0.80 accuracy gate. So the qualify, research, and compose stages are scored independently rather than as one blob. With 3 gates instead of 1, a 0.80 regression in the qualifier is caught and attributed to it, long before it reaches a human.
Each AA03 node has its own golden set, and a prompt is promoted only when it clears the 0.80 accuracy gate. There are three independent gates instead of one end-to-end check. A regression is localized to the node that caused it, not diffused across the whole run. The same PROMOTE/HOLD/ROLLBACK discipline drives the fleet's evidence-driven release gates, which decide when a node's new prompt is allowed into production.
The fleet also includes a reflection‑style LLM critic (run_qa_critic) over a sample of fresh enrichments. That critic is the orchestrator’s only direct DeepSeek call (via make_deepseek_flash(temperature=0.0)), and it runs before the proposal stage. The critic’s output (qa_verdict) feeds into the qualify_lead node’s deterministic‑first logic: only borderline leads escalate to an LLM qualify call. This gives the qualification gate two layers.
The first is a deterministic rule-based filter over ranking scores and the QA verdict. The second is an LLM fallback for ambiguous cases. The deterministic layer guarantees zero cost for clearly qualified or clearly unqualified leads. The LLM layer adds cost only for the borderline middle, where lead signals are rarely clean enough for a rule alone to decide. The qualify_lead node reuses the same deterministic-first scoring logic developed for the fleet's multi-step lead-qualification agent, so qualification stays consistent across stages.
Data Quality First: Multi‑Agent RAG for Entity Resolution
No pipeline survives dirty lead data. In Multi-Agent RAG Framework for Entity Resolution: Advancing Beyond Single-LLM Approaches with Specialized Agent Coordination, Althaf et al. (2025) decompose household entity resolution into coordinated, task-specialized agents. They report 94.3% accuracy on name-variation matching while cutting API calls by 61% versus a single-LLM baseline. In the fleet, entity resolution happens during the enrich_one and contact_one nodes, which fan out via LangGraph’s Send API. Each enrichment task hydrates a single company from public sources and the D1 database.
The qa_critic then samples these results and flags low‑confidence enrichments for re‑run or human review.
The 94.3% accuracy figure from Althaf et al. is directly applicable. Without multi‑agent resolution, a single LLM conflates similar company names at a higher rate. That corrupts downstream proposal generation. The reported 61% reduction in API calls also maps onto the fleet’s cost budget on a single DeepSeek egress.
The fleet implements the multi‑agent RAG pattern with separate enrichment, contact discovery, and contact enrichment subgraphs. Each one has its own prompting and writes to the D1 data plane. The data cleaner role is implicitly the qa_critic. It detects inconsistencies like missing website URLs or mismatched industry tags.
Human-in-the-Loop Checkpoints in the Lead Pipeline
Metsälä’s central trade‑off is the spine of the whole pipeline. The fleet resolves it deliberately. The three new proposal-stage nodes run autonomously and unattended — that is the machine throughput. But the interrupt() at outreach_queue is preserved as the sole human gate.
That is human control over the consequential action. Provenance returns control at exactly the point the thesis says it gets lost. Every held draft carries {confidence, reason, source, evidence} from the qualify and research nodes. The operator sees why the lead qualified and what facts grounded the proposal.
A two-layer consensus makes that stance concrete. In the fleet, the qualify_lead node is a deterministic-plus-LLM hybrid. The deterministic filter and the QA critic must agree that a lead is at least borderline before the DeepSeek qualify call fires. This two-layer agreement ensures the throughput gain from automation does not come at the cost of degraded qualification accuracy.
Step-wise evaluation is the safeguard underneath it. Zhuge et al. (2024) argue that a flawed early step cascades into a poor final result. So each intermediate node — not just the final draft — must be judged on its own. The fleet applies exactly that.
Every node clears its own golden dataset before it can influence a proposal.
Production Considerations: Security, Rollback, and Observability
The fleet is designed for zero‑downtime rollback. The PIPELINE_PROPOSAL_STAGE_ENABLED flag defaults to 0. The graph topology is then byte‑identical to the pre‑proposal state, ranking straight to outreach_queue. Setting it to 1 inserts three nodes and four additive state fields: qualified_ids, qualified_decisions, research_briefs, and proposals.
Setting it back to 0 removes them immediately. There is no migration to unwind and no schema change to revert. This additive pattern, combined with LangSmith’s per‑node golden datasets, means that each new node ships only when its prompt clears ≥0.80 accuracy.
It is worth stating the cost honestly. Decomposition is not a free win. Metsälä’s own users reported a loss of fine-grained control as more steps became autonomous. A multi-agent graph also carries real state-management and context-passing overhead that a single prompt avoids.
The case for it rests on the measured upside. Althaf reports 94.3% entity-resolution accuracy and a 61% call reduction. ChainBuddy reports a 78% scaffolding preference. The fleet runs 3 independent gates instead of 1.
A caveat worth naming plainly: those figures come from adjacent domains — household entity resolution and LLM-pipeline authoring — not from a controlled sales A/B test of this exact pipeline. They argue that decomposition and scaffolding tend to pay off; they do not promise a specific lift on any one team's leads. The honest move is to treat them as design priors and let each node's own golden dataset, gated at 0.80, supply the domain-specific evidence. Those numbers justify the overhead — and only because a human still signs the send.
In Language Models and Cognitive Automation for Economic Research, Korinek (2023) catalogs 25 distinct use cases across 6 domains: ideation, writing, background research, data analysis, coding, and mathematical derivation. The unifying thread across all 25 is tool integration — web search, database queries, code execution. That is exactly the capability surface the fleet wires into its data plane. It is why the proposal stage is 3 tool-calling subgraphs rather than 1 text-only prompt. Each of those 25 use cases is a cognitive task that becomes reliable only when the model can reach a real tool. The fleet operationalizes that for its 3 sales-specific tasks — qualify, research, draft — routing every one through the same make_llm() factory onto the same 4 Cloudflare bindings.
The fleet’s tools are D1, R2, Queues, and Workers AI. Each proposal-stage node reaches them through the same make_llm() factory. Observability comes from LangSmith trace tags (agentic_sales.stage=qualify|research|proposal) and OTel counters (agentic_sales.pipeline.qualified, agentic_sales.pipeline.proposals_drafted). The per-node StageReport gives the operator the qualify-to-proposal yield without reading individual traces.
The A09 logging rule keeps spans to agentic_sales.* attributes only — never email addresses or proposal body text.
Practical Takeaways
From this pipeline, four lessons generalize to any multi‑step business automation:
-
Gate per node, not per pipeline. The ≥0.80 gate applies to each subgraph on its own. You can iterate individual nodes without re‑validating the whole graph. Ship a better qualifier independently of the proposal drafter.
-
Deterministic first, LLM last. The
qualify_leadnode uses a rule‑based filter over ranking scores and QA verdict before calling DeepSeek. This minimizes LLM cost and latency for most leads. Clearly qualified and clearly unqualified leads resolve on the deterministic rule alone. Only the ambiguous middle escalates to a DeepSeekmake_llm()call. Benchmark your own data to set the thresholds — they are business-specific. -
Provenance is how you give back control. Metsälä’s thesis found that users felt loss of control when automation took over intermediate steps. By attaching
{confidence, reason, source, evidence}to every decision, the human reviewer at the interrupt point can reconstruct the reasoning chain in seconds rather than reading logs. -
Rollback must be a feature flag, not a git revert. The additive state fields and conditional node insertion sit behind
PIPELINE_PROPOSAL_STAGE_ENABLED(default 0). The entire proposal stage can be disabled instantly, with no migration to unwind. This is the only safe way to ship experimental capabilities in a pipeline that processes real leads.
The Broader Implication
The lead‑to‑proposal pipeline is a specific instance of a general pattern. Any business process that chains cognitive tasks — assess, gather, compose, approve — can be modelled as a LangGraph of specialised agents, much like the fleet's autonomous CRM orchestrator that reasons, decomposes, acts, and verifies across the same shared state. The thesis by Metsälä (2026) provides the academic justification. LLM performance plateaus, and multi‑agent decomposition is the next lever.
The practical contribution of this fleet is the deliberate resolution of the control/throughput trade‑off. The human is kept in the loop at exactly the point where a mistake is most consequential: the send. Everything before that runs unattended, auditable, and gated. The pattern transfers to contract drafting, compliance reviews, and personalised marketing.
It fits any domain where throughput matters but the last decision carries legal weight. The fleet’s three-plane architecture and its grounding-first provenance are the production guardrails. Together they turn multi‑agent theory into a system an operator trusts to hold a draft without holding the pen.
Frequently Asked Questions
What is a lead-to-proposal pipeline in LangGraph?
It is a LangGraph multi-agent graph that takes a raw lead and runs three specialized nodes — qualify_lead, research_lead, and compose_proposal — to produce a tailored B2B proposal. Every intermediate node executes unattended; the loop stops only at one human-in-the-loop gate on the send.
Why decompose proposal generation into multiple agents instead of one prompt?
A single monolithic prompt cannot be gated, debugged, or iterated per step — its performance collapses into one opaque blob. Decomposing into qualify, research, and compose nodes lets each be scored against its own golden dataset and promoted only when it clears the ≥0.80 accuracy gate, so a regression is localized to the node that caused it rather than diffused across the whole run.
Where does the human-in-the-loop gate sit in this B2B proposal automation?
The fleet automates the expensive cognitive work — qualify, research, draft — but holds a LangGraph interrupt() at the outreach_queue node. An operator approves a grounded held draft rather than composing one, keeping human control over the single action that carries legal and reputational weight: the send.
How is the lead-to-proposal pipeline rolled back safely?
The whole proposal stage sits behind the PIPELINE_PROPOSAL_STAGE_ENABLED feature flag, default 0. With it off, the graph topology is byte-identical to the legacy pipeline. Setting it to 1 inserts three nodes and four additive state fields; setting it back to 0 removes them with no migration to unwind and no schema change to revert.
How does the pipeline keep proposals grounded in real facts?
The research_lead node reads a structured Cloudflare D1 data plane rather than re-scraping the web, and every qualify decision carries {confidence, reason, source, evidence}. Untrusted enriched content is wrapped via llm/prompt_safety.py (wrap_untrusted) to address OWASP LLM01 prompt injection before it reaches the qualify or compose prompts.
The Autonomous Sales Fleet — full series
This is Part 3 of 10 in a series on building one production autonomous-agentic-sales system on LangGraph + DeepSeek + Cloudflare D1, where each part adds one capability that moves the fleet up the autonomy ladder — from human-triggered assistants to self-directed plan→act→verify loops, gated by autonomy guardrails. The arc runs orchestration → enablement & analytics → campaign strategy → reliability & evaluation.
Orchestration
- Autonomous CRM Orchestrator (reason→decompose→act→verify) — autonomy: high
- Multi-Step Lead Qualification — high
- Lead-to-Proposal Multi-Agent Pipeline — high
- Hierarchical Coach→Worker Delegation — high
Enablement & analytics 4. Sales-Enablement Copilot: Deal Coaching & Objection Handling — medium 5. NL-to-SQL CRM Analytics over Cloudflare D1 — medium
Campaign strategy 6. Design-Thinking Expert Panels for Campaign Strategy — medium
Reliability & evaluation — the autonomy guardrails 8. Deadlock & Infinite-Loop Prevention — guardrail 9. Evidence-Driven Release Gates (PROMOTE/HOLD/ROLLBACK) — guardrail 10. Detecting Agent Defects & Drift in Production — guardrail
References
- Metsälä, I. (2026). Suuriin kielimalleihin perustuvien moniagenttisten työnkulkujen kehitys myyntitehtävissä (Developing LLM-based multi-agent workflows for sales tasks). Master's thesis, Aalto University. OpenAlex W7161925826.
- Sapkota, R., Roumeliotis, K. I., & Karkee, M. (2025). AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges.
- Zhuge, M., et al. (2024). Agent-as-a-Judge: Evaluate Agents with Agents.
- Zhang, J., & Arawjo, I. (2024). ChainBuddy: An AI-assisted Agent System for Helping Users Set up LLM Pipelines.
- Althaf, et al. (2025). Multi-Agent RAG Framework for Entity Resolution: Advancing Beyond Single-LLM Approaches with Specialized Agent Coordination.
- Korinek, A. (2023). Language Models and Cognitive Automation for Economic Research.
- Ray, P. P. (2025). A Survey on Model Context Protocol: Architecture, State-of-the-art, Challenges and Future Directions.
- LangGraph documentation · DeepSeek API · Cloudflare D1 · LangSmith.
This is article #3 in the ongoing series The Autonomous Sales Fleet. Start with #1: The Autonomous CRM Orchestrator in LangGraph, continue with #2: A Multi-Step Lead-Qualification & Sales-Support Agent, and follow with #4: A Sales-Enablement Copilot for Deal Coaching & Objection Handling.
