Skip to main content

Design-Thinking Multi-Agent Panels for Campaign Strategy

· 26 min read
Vadim Nicolai
Senior Software Engineer

Design-thinking multi-agent campaign strategy is what you get when you let an agent fleet own the plan step that a human normally improvises in their head. Instead of a hard-coded six-touch weekly drip, one LangGraph graph simulates a room of human experts — a strategist, a skeptic, a brand-voice lens — arguing over how a multi-touch outreach sequence should be shaped before the first email is ever drafted. On the fleet's autonomy ladder this capability sits medium: it automates the deliberation over what a campaign's touch sequence should be, then hands the resulting plan to the durable engine, which still holds every individual email for human approval before it acts. Autonomy is earned, not asserted — the panel's output is only a seed (cadence and per-touch angles), never a send.

That boundary is the whole design. The panel moves the fleet up the autonomy ladder by automating strategy, not execution: the consequential verify stays with the operator at each touch, exactly as it does for every other graph in the fleet. And it does so without a single new model — the same DeepSeek egress, the same Cloudflare D1, the same 0.80 eval bar that everything else already runs on. Nothing about adding a deliberation layer relaxes the draft-first, human-in-the-loop discipline that keeps the system safe to point at a real inbox; it only makes the sequence those drafts follow an auditable decision instead of a constant.

This is the sixth article in a series about one production system. The system is a LangGraph agentic-sales fleet. It runs on a single DeepSeek egress behind a Cloudflare AI Gateway, a Cloudflare D1 data plane, and LangSmith tracing gated at 0.80. Each article adds one capability as one real graph.

The architecture is constant across the series. There is a LangGraph control plane (StateGraph graphs, a durable D1 checkpointer, interrupts). There is a Cloudflare data plane (D1, R2, Queues, cron triggers). And there is a LangSmith observability plane. Every LLM call routes through one DeepSeek egress. Diversity across agents comes from per-seat temperature and explicit personas, never a model swap. Every persisted decision carries a provenance envelope: confidence, reason, source, and evidence. This article adds the step that decides what a campaign is before it runs.

The Static Default This Replaces

The fleet's durable campaign engine sends multi-touch outreach as a reactive, cron-driven loop. There is one LangGraph thread per (campaign, contact) pair, with a stable campaign-<campaignId>-<contactId> thread id, compiled against the D1 checkpointer so the thread's state survives between touches. Each touch composes a draft, holds it for human approval, sends on approval, schedules the next touch, then calls interrupt() and exits. A Cloudflare cron resumes the thread days later with Command(resume=True). LangGraph persistence supplies the durable state; the cron is the scheduler. That draft-first discipline — nothing sends without an operator decision — is the spine of the whole fleet.

The gap was upstream of all of it. The shape of the sequence came from a static constant: _DEFAULT_CADENCE_DAYS = [0, 4, 7, 7, 7, 7] with _DEFAULT_MAX_TOUCHES = 6. That constant set the touch count, the gap before each touch, and nothing about the angle. Every campaign, every vertical, every opportunity got the same six-touch weekly drip.

A high-intent inbound role might warrant a tight 3-touch burst over 5 days. A slow-burn nurture might need 8 patient weeks across 6 touches. Both were treated identically by the same fixed 6-element array. There was no deliberation and no record of why a sequence took its shape. The composer improvised all 6 emails against the opportunity in isolation.

The anchor for closing that gap is a 2025 paper out of the Warsaw University of Technology, and it is worth stating its claim precisely:

  1. In Building a Marketing Campaign with LLM-based Multi-Agent System and Design Thinking (Kamil Szczepanik and Jarosław A. Chudziak, 45AI 2025, DOI 10.5171/2025.4525725), the authors build an LLM-based multi-agent system that simulates a collaboration of human experts to create a marketing campaign. It is explicitly guided by design-thinking principles. The system progresses through all 5 design-thinking stages — empathize, define, ideate, prototype, test — via collaborative ideation and research. Their experimental application develops a campaign strategy for an eco-friendly beverage launch. It calls out agent orchestration and prompt engineering as the load-bearing implementation aspects. Their conclusion is the central claim for this whole feature: an agentic, design-thinking approach can deliver innovative and well-aligned marketing solutions for real business applications. That is exactly the deliberation the fleet's static cadence array was missing.

Propose, Critique, Synthesize: The Real Graph

The capability is specified as one graph — campaign_strategy_graph.py, built by build_campaign_strategy_graph() — with 3 nodes, propose, critique, synthesize, that map onto the 5 design-thinking stages. It registers in the fleet's graph registry as campaign_strategy with resumable=False, because the panel is single-shot at campaign launch rather than a durable thread that resumes across cron ticks: it runs once, emits 1 plan, and exits.

One honest distinction up front. The seam that receives the plan is built and unit-tested in the live campaign_graph.py today: seed_strategy_into_launch, the feature flag, the fail-open fallback, the launch-seed fold. The panel graph itself is the deferred half of the backlog item (AA08). It is designed in full against the contracts below, but not yet wired as its own module. That ordering is deliberate. The risky, runtime-touching surface lands and is tested first, behind a default-off flag. The panel that produces the plan then slots into an interface that already exists and already fails open. Everything that follows describes the designed graph and the live seam, and is explicit about which is which.

The panel does not invent a new mechanism. It reuses the persona-and-temperature decorrelation pattern already proven in the fleet's reusable adjudicator graph, multi_agent_judge_graph.py (whose builder is build_graph). That graph is the escalation target for low-confidence single-shot verdicts across the fleet: between 2 and 5 independent DeepSeek reasoners — the panel size is clamped to that range — each return a verdict, a confidence, and a rationale over the same fenced evidence, with an optional second debate round, and then a deterministic aggregator takes the majority and measures consensus as agreement = winners / total. Critically, the diversity there comes not from swapping models but from giving each seat a distinct persona and a distinct sampling temperature, drawn cyclically from the literal list _TEMPS = [0.0, 0.5, 0.3, 0.7, 0.2]. The strategy panel borrows the same ainvoke_json, make_llm, and wrap_untrusted primitives, and the same _norm_opinion-style coercion — but it produces 1 campaign plan instead of a yes/no verdict.

The literature behind that judge graph is what makes the borrowing principled. The fleet's multi-agent adjudicator is paper-grounded in several sources worth naming, each in its own right.

  1. The debate protocol comes from MedLA: A Logic-Driven Multi-Agent Framework for Complex Medical Reasoning (arXiv:2509.23725). Its contribution is a generic multi-agent debate-to-consensus protocol in which independent reasoners deliberate and then converge. That is the exact propose-then-critique shape reused for plan deliberation here, swapping a medical diagnosis for a campaign sequence. The framework demonstrates that structured disagreement between agents surfaces failure modes a single chain-of-thought would have confidently glossed over. That is precisely why a panel of 3 decorrelated seats is worth the extra LLM calls when the decision is a whole campaign's shape, not one email. It is also why the strategy panel runs 2 deliberation passes — a propose pass followed by a critique round — rather than a single proposal pass.

  2. The ensemble logic comes from Large language model-based multiagent collaboration for abstract screening toward automated systematic reviews (DOI 10.1093/biomethods/bpag006). The authors run multi-agent majority voting plus an LLM adjudication step over a binary screening task. They show that an ensemble of decorrelated LLM seats outperforms any single seat on agreement with human labels. The campaign strategy panel inherits that ensemble logic but votes over a richer object — a touch sequence rather than an include/exclude label. It then folds the survivors into 1 synthesized plan. This is the same way the judge graph's aggregate node folds panellist verdicts into 1 majority decision with a measured agreement score rather than a single model's self-reported confidence.

Here is the panel's flow, mapped onto the design-thinking stages:

Loading diagram…

In propose, each of the 3 seats is an expert lens mapped onto the 5 design-thinking stages. The seats are decorrelated by per-seat temperature, exactly like the judge panel. Each proposes a candidate sequence of up to 6 touches: a touch count, a cadence gap before each touch, and a 1-line angle per touch. The proposal is grounded in the opportunity (role and company) and the sender resume. In critique — the design-thinking test stage — seats read each other's proposals and push back. The skeptic argues the drip is too aggressive; the brand-voice seat flags off-tone angles. In synthesize, a deterministic-plus-judge step coerces the survivors into a strict-JSON SequencePlan. That is a touches[] array — each entry carrying an angle and a gap_days — plus a strategy_summary capped at 900 characters, clamped against max_touches and the cadence bounds. This three-agent composition is the same shape as the prior multi-agent pipeline piece, Lead-to-Proposal Multi-Agent Pipeline. There, decomposing a single monolithic prompt into specialized, node-gated agents was what made the system debuggable. Here the decomposition is across deliberating peers rather than sequential stages.

How the Plan Reaches the Durable Thread

The seam between the panel and the durable campaign engine is deliberately narrow, and it is the part that is already live. The campaign graph exposes a helper, seed_strategy_into_launch(seed, plan, max_touches=...), that folds a synthesized SequencePlan into a thread's launch seed. On a valid plan it sets cadence_days to the per-touch gap_days (each clamped to the 0–60 day range), enforces cadence[0] = 0 because touch 0 always sends immediately, and appends the strategy_summary (capped at 900 characters) into the seeded resume_context (itself capped at 2000 characters) so every drafted touch is grounded in the deliberated strategy. Unit tests pin this behavior: a valid plan reshapes the seed's cadence and grounds the summary, and a None or empty-touches plan returns the seed unchanged. The grounding path it folds into is already there. The engine's _build_post_text folds resume_context into the synthetic post_text that the email-composition graph drafts against. And _wake_at_for already reads cadence_days to schedule each cron resume. The plan does not introduce a new code path; it populates an existing one.

This matters because the campaign graph is unchanged at runtime. The panel only changes what is seeded at launch. The durable thread's nodes — check_reply, compose_touch, await_approval, send_touch, schedule_next — run byte-for-byte identically whether the plan came from the panel or from the static default. That is the fleet's reuse-over-rebuild discipline taken to its logical end: a genuinely new capability ships as one graph plus one already-tested seed function, not a rewrite of a system that already sends real email under human approval. Per the AA08 spec, the synthesized plan is persisted to a new D1 table, campaign_strategy_plans, keyed by campaign_id. It carries the confidence, agreement, and rationale, plus a source column that records panel or fallback. The downstream consumer of the seeded plan is the coach-worker layer described in Hierarchical Coach-Worker Agent Teams: the panel decides the strategy once, and the coach plan executes against it touch by touch rather than re-improvising each email.

Fail-Open: Launch Never Blocks

The entire pre-launch panel call sits behind a single feature flag, CAMPAIGN_STRATEGY_PANEL, default off. Unsetting it reverts launch to the current static-cadence seed with zero schema dependency on the new table — the rollback is one environment variable. When the flag is on but the panel fails — an LLM error, or the global LLM_KILL_SWITCH engaged — seed_strategy_into_launch returns the seed unchanged, launch falls back to the static _DEFAULT_CADENCE_DAYS, and the audit row records source='fallback'. The deliberation is never on the critical path. This is the same fail-open contract the campaign engine applies everywhere else: a failed index write, a transient D1 error, or a dead panel seat degrades gracefully and never aborts the durable run. A campaign that could not be deliberated still launches; it simply launches on the old static drip, and the fallback row makes that visible rather than silent.

This fail-open posture is itself paper-grounded, and the connection is worth drawing out explicitly:

  1. The fleet's judge-panel lineage includes Towards Effective Offensive Security LLM Agents (arXiv:2508.05674). Its CTFJudge contribution is an LLM-as-judge plus partial-correctness aggregation scheme. That synthesize step scores partial successes against a competency index rather than demanding all-or-nothing agreement. The paper also studies how temperature and token limits shift agent decisions. The campaign panel's synthesize node inherits exactly that tolerance. A plan where some seats disagree on a single touch's angle still yields a usable sequence, with the disagreement recorded as lowered agreement rather than a hard failure. Robustness over brittleness is the design rule. The cost of a missing deliberation — an old-style static drip — is far lower than the cost of a blocked launch, and the source='fallback' audit row keeps the degraded path honest rather than silent.

Security: The Evidence Is Untrusted

The opportunity's scraped role context and the resume context are untrusted text. A scraped bio can contain "ignore previous instructions" verbatim, so every seat prompt fences that text with wrap_untrusted (labelled EVIDENCE or DECISION) per the fleet's prompt-safety module: the seats reason over the evidence and never follow instructions embedded inside it. This is the OWASP LLM01 prompt-injection control applied at every panel boundary. No recipient PII enters the panel prompt beyond role and opportunity framing, and the logs carry only plan ids and scores — never a draft body, never PII.

The deeper risk in any LLM-judges-LLM design is self-preference: a model family tends to inflate scores on its own family's output. The mitigation is grounded in a survey worth naming in full:

  1. The canonical treatment is the Survey on LLM-as-a-Judge (arXiv:2411.15594). It catalogues self-preference bias, position bias, and verbosity bias as systematic failure modes of LLM evaluators, and surveys the mitigations the field has converged on. The campaign panel applies two of them directly. First, it decorrelates seats with distinct personas and temperatures so no single viewpoint dominates the vote. Second, it clamps the final plan deterministically: the gap_days 0–60 bounds and the max_touches ceiling of 6 are enforced in code, not trusted to the model. The synthesize step's agreement metric is a measured quantity, not a self-reported confidence. So a panel that quietly converges on a bad plan still surfaces as suspiciously high agreement on a thin rationale, rather than a laundered score that nobody can audit after the fact.

The Eval Gate and Observability

The capability is specified with its own LangSmith golden dataset, including a golden for the synthesize step. It is held to the same composite bar everything else uses. The eval must report at least 0.80 for the agentic-sales:campaign_strategy:final_response dataset before the panel can influence a real launch. The acceptance criteria are concrete and checkable. An ainvoke against the graph returns a SequencePlan whose touches length equals max_touches, with each gap_days inside the cadence bounds and a strategy_summary of 900 characters or fewer. With the flag on, launching a campaign writes exactly one campaign_strategy_plans row carrying confidence and per-seat rationale. The launched thread's seeded cadence_days equals the plan's gap_days array, and its resume_context contains the strategy_summary. On a panel error, the row records source='fallback'. The 0.80 figure is not arbitrary. It is the identical accuracy bar the fleet's offline LangSmith datasets enforce for every other graph, so the deliberation layer is held to the same standard as the composers it feeds. That same bar is what the fleet's evidence-driven release gates read to decide PROMOTE, HOLD, or ROLLBACK before any graph — this panel included — influences a real launch.

Observability closes the loop. The panel run is tagged agentic_sales.graph=campaign_strategy with an agentic_sales.campaign_id and an agentic_sales.panel_agreement attribute, and the synthesize call nests under it in LangSmith so a low-agreement plan is one trace click away from its constituent seat rationales. An OpenTelemetry counter, agentic_sales.campaign_strategy.plans, is tagged by source (panel versus fallback). That turns the panel-versus-fallback split into a dashboard line rather than a log grep. When the panel starts failing — a DeepSeek outage, a kill-switch flip, a flag misconfiguration — the fallback counter climbs in real time. And because the static drip still launches every campaign, the failure reads as a degraded-quality signal rather than an outage.

Why a Panel Beats a Prompt

The honest objection is that you could ask a single DeepSeek call to "design a six-touch campaign sequence for this opportunity" and seed the result. You could — and for many campaigns it would be fine. The case for the panel is the same case design thinking itself makes against top-down planning: structured disagreement surfaces failure modes that a single confident pass glosses over.

The strongest empirical backing comes from a marketing-specific result rather than the conceptual anchor. Towards Reliable Multi-Agent Systems for Marketing Applications via Reflection, Memory, and Planning — the RAMP framework (Flores, Shen & Gu, 2025, arXiv:2508.11120) — builds a multi-agent marketing system. It iteratively plans, calls tools, verifies its output, and generates suggestions. It reports a 28-percentage-point accuracy improvement across 88 evaluation queries. The verification and reflection iterations alone yield roughly 20 percentage points of recall gain on the more ambiguous tasks. Those numbers are the quantitative case for the panel's critique round. The measured lift comes specifically from the verify-and-reflect step, which is exactly what the campaign panel adds on top of a single proposal pass. RAMP also pairs that loop with a long-term memory of client-specific history. That is the same role the fleet's campaign_strategy_plans provenance rows and seeded resume_context play here. The anchor paper's premise (Szczepanik & Chudziak, 2025) is that simulating a collaboration of experts, through the empathize-define-ideate-prototype-test stages, produces better-aligned campaigns than a monolithic generator. The multi-agent-debate literature the fleet's judge graph rests on says the same thing in a different domain. A panel of decorrelated seats that must rebut each other is more likely to catch the off-tone angle, the too-aggressive cadence, the touch that repeats the one before it. Those are precisely the errors that a static [0, 4, 7, 7, 7, 7] drip could never even represent, let alone correct.

The limitations are worth stating as plainly as the case. First, the borrowed numbers are not this system's numbers. RAMP's 28-point lift was measured on its own marketing corpus over 88 queries. The Warsaw design-thinking result is a qualitative, single-case eco-beverage study. Neither is a measurement of this fleet's campaigns, and no first-person outcome data is reported because none has been gathered. The empirical claim is narrow: the verify-and-reflect step is where multi-agent lift comes from in the literature. It is not a claim that this panel improved any reply rate by some figure. Second, the panel is not free. The static default cost zero LLM calls. A 3-seat panel with a critique round and a synthesize step costs at least seven DeepSeek calls per launch. That cost only pays off when a campaign's stakes justify deliberating its shape, which is why the panel is flag-gated and single-shot rather than per-touch. Third, and most important for honesty about state, the panel graph is still the deferred half of AA08. The seam that consumes its output is live and tested. The propose → critique → synthesize module is designed but not yet shipped, so the per-launch lift remains a projection from the literature, not a logged result. Finally, decorrelating three seats on one DeepSeek family reduces but does not eliminate shared-model blind spots. Three personas at three temperatures still share one model's priors. The survey-driven mitigations below — deterministic clamping, measured agreement — exist precisely because persona diversity alone is not a guarantee of genuine independence.

The payoff is not a flashier campaign; it is an auditable one. Before this graph, the answer to "why is this campaign six touches a week apart?" was "because the constant said so." After it, the answer is a persisted campaign_strategy_plans row. It carries a confidence, an agreement score, a per-seat rationale, and a source that proves whether a human-simulating panel or a static fallback shaped the sequence. That is the through-line of the entire fleet. Every autonomous decision earns a provenance record. Every new capability ships behind a flag and a 0.80 gate. And nothing — not even the deliberation about what a campaign should be — escapes the same draft-first, fail-open, fully-traced discipline that keeps a production system sending real email safely.

Frequently Asked Questions

What is design-thinking multi-agent campaign strategy?

It is letting a LangGraph expert panel of decorrelated agents — a strategist, a skeptic, a brand-voice lens — deliberate a campaign's touch sequence before any email sends. The panel maps onto the five design-thinking stages (empathize, define, ideate, prototype, test) and emits one strict-JSON plan, replacing a hard-coded six-touch weekly drip. It automates the strategy step while every individual email still holds for human approval before it sends.

How does a LangGraph expert panel deliberate a campaign?

The campaign_strategy graph runs three nodes — propose, critique, synthesize. Each of 3 seats proposes a candidate touch sequence, decorrelated by per-seat persona and temperature; seats then rebut each other in the critique round; a deterministic-plus-judge step coerces the survivors into one SequencePlan. It reuses the fleet's reusable multi-agent judge primitives (ainvoke_json, make_llm, wrap_untrusted) rather than introducing a new mechanism.

How does the panel decide campaign touch sequencing?

Each seat proposes a touch count, a per-touch gap_days, and a one-line angle per touch, grounded in the opportunity (role and company) and the sender resume. The synthesized plan's gap_days are clamped to a 0–60 day range and a maximum of 6 touches, with touch 0 always sending immediately. seed_strategy_into_launch then folds the plan's cadence and strategy_summary into the durable thread's launch seed.

What happens if the campaign strategy panel fails?

The panel is fully fail-open. It sits behind the CAMPAIGN_STRATEGY_PANEL flag, default off. On any LLM error or an engaged LLM_KILL_SWITCH, seed_strategy_into_launch returns the seed unchanged, launch falls back to the static _DEFAULT_CADENCE_DAYS drip, and the audit row records source='fallback'. A campaign that could not be deliberated still launches — it simply launches on the old static drip, made visible rather than silent.

Why use a multi-agent panel instead of a single prompt?

Structured disagreement between decorrelated seats surfaces failure modes a single confident pass glosses over — an off-tone angle, a too-aggressive cadence, a touch that repeats the one before it. The multi-agent marketing literature (RAMP, arXiv:2508.11120) attributes its measured 28-point lift specifically to the verify-and-reflect step, which is exactly what the panel's critique round adds on top of a single proposal pass.

The Autonomous Sales Fleet — full series

This is Part 6 of 10 in a series on building one production autonomous-agentic-sales system on LangGraph + DeepSeek + Cloudflare D1, where each part adds one capability that moves the fleet up the autonomy ladder — from human-triggered assistants to self-directed plan→act→verify loops, gated by autonomy guardrails. The arc runs orchestration → enablement & analytics → campaign strategy → reliability & evaluation.

Orchestration

  1. Autonomous CRM Orchestrator (reason→decompose→act→verify)autonomy: high
  2. Multi-Step Lead Qualificationhigh
  3. Lead-to-Proposal Multi-Agent Pipelinehigh
  4. Hierarchical Coach→Worker Delegationhigh

Enablement & analytics 4. Sales-Enablement Copilot: Deal Coaching & Objection Handlingmedium 5. NL-to-SQL CRM Analytics over Cloudflare D1medium

Campaign strategy 6. Design-Thinking Multi-Agent Panels for Campaign Strategymedium

Reliability & evaluation — the autonomy guardrails 8. Deadlock & Infinite-Loop Preventionguardrail 9. Evidence-Driven Release Gates (PROMOTE/HOLD/ROLLBACK)guardrail 10. Detecting Agent Defects & Drift in Productionguardrail

References