LLM Trading System Architecture: From Research Paper to Production
Research papers showed that LLMs can trade. Production showed that doing it without an architecture is how you blow up. Here is the engineering pattern that closes the gap.
The gap between paper and production
In late 2024 the TradingAgents paper (Xiao et al., arXiv) demonstrated that multi-agent LLM systems could trade competitively in simulated environments. Within months, half a dozen open-source projects appeared on GitHub: TauricResearch/TradingAgents, virattt/ai-hedge-fund, and several Medium-grade implementations.
The paper-to-production gap is significant. A research-grade system runs on Jupyter, trades on Yahoo Finance data, and is evaluated by backtest. A production-grade system runs as a Windows service, trades on a real broker, and is evaluated by surviving Friday afternoons. The architectural differences are real.
This post covers the production engineering: the parts the research papers leave out.
The architecture in one diagram
┌──────────────────────────────────────────────────────────┐
│ Broker side (MT5 EA / IB / Alpaca)                       │
│   ├ chart capture                                        │
│   ├ order execution                                      │
│   └ heartbeat to watchdog                                │
└────────────┬─────────────────────────────────────────────┘
             │ HTTP (loopback)
             ▼
┌──────────────────────────────────────────────────────────┐
│ Brain side (single Python process under NSSM)            │
│   ├ Ingress (FastAPI)                                    │
│   ├ Message bus (Postgres LISTEN/NOTIFY)                 │
│   ├ 32 specialised agents                                │
│   │   • Regime / Strategist / Risk Gate / FactChecker    │
│   │   • DoubleChecker / Macro Officer / CEO / ...        │
│   ├ LLM Router (4 routes, circuit-broken)                │
│   │   • Claude CLI (subscription) → primary              │
│   │   • Codex CLI (subscription)  → code tasks           │
│   │   • Anthropic API             → fallback             │
│   │   • OpenAI API                → fallback             │
│   └ Context FS (markdown on disk)                        │
└────────────┬─────────────────────────────────────────────┘
             │
             ▼
┌──────────────────────────────────────────────────────────┐
│ Storage (Postgres / Supabase)                            │
│   ├ agent_messages  (durable bus log)                    │
│   ├ agent_journal   (decision history)                   │
│   ├ llm_cost_ledger (every call costed)                  │
│   └ app_state       (stance, heartbeat, halt flags)      │
└──────────────────────────────────────────────────────────┘

┌────────────────────┐
│ Watchdog (separate │
│ process, outside   │ ← can L4 HALT everything,
│ the agent graph)   │   overrides the CEO
└────────────────────┘
Every component has a job. None overlaps. The watchdog deliberately runs outside the brain side — it cannot be silenced by any agent.
Pillar 1 — Multi-agent coordination
The first architectural decision is how the agents talk to each other. There are three viable patterns:
Pattern A — Direct function calls (don't use)
Agent A calls Agent B's function. Simple, fast, fragile. When Agent B crashes, Agent A errors out. No durable log of who said what to whom. No replay. Don't use this in production.
Pattern B — In-memory queue (acceptable)
A queue inside the same process. Better than direct calls — the queue decouples the agents — but still loses everything on a crash. Acceptable for development; not for production.
Pattern C — Durable message bus (use this)
A Postgres table (agent_messages) with LISTEN/NOTIFY for real-time wakeup. Every message ever sent is recorded, with columns for from_agent, to_agent, topic, payload, signatures, timestamps. You can SELECT * from a time window and replay exactly what happened.
This is the pattern production multi-agent trading systems converge on. It is also what makes postmortems possible: when something goes wrong, you can read the bus log and understand exactly what each agent thought.
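To make the append-and-replay idea concrete, here is a minimal sketch of the bus log. It swaps in SQLite for Postgres so it runs anywhere, and it leaves out the LISTEN/NOTIFY wakeup entirely. The column names come from the description above; the column types and the `open_bus`, `publish` and `replay` helpers are assumptions made for illustration.

```python
import json
import sqlite3
import time

# SQLite stands in for Postgres here so the sketch runs anywhere.
# Column names follow the text; the exact types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_messages (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    from_agent TEXT NOT NULL,
    to_agent   TEXT NOT NULL,
    topic      TEXT NOT NULL,
    payload    TEXT NOT NULL,   -- JSON-encoded body
    signatures TEXT NOT NULL,   -- JSON list of signing agents
    created_at REAL NOT NULL    -- Unix timestamp
)
"""


def open_bus(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def publish(conn, from_agent, to_agent, topic, payload, signatures=(), now=None):
    """Append one message to the log; rows are never updated or deleted."""
    ts = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO agent_messages "
            "(from_agent, to_agent, topic, payload, signatures, created_at) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (from_agent, to_agent, topic, json.dumps(payload),
             json.dumps(list(signatures)), ts),
        )
    return cur.lastrowid


def replay(conn, start_ts, end_ts):
    """Return every message in [start_ts, end_ts) in the order it was sent."""
    rows = conn.execute(
        "SELECT from_agent, to_agent, topic, payload FROM agent_messages "
        "WHERE created_at >= ? AND created_at < ? ORDER BY id",
        (start_ts, end_ts),
    ).fetchall()
    return [(f, t, topic, json.loads(p)) for f, t, topic, p in rows]
```

Because the table is append-only, replaying a window is an ordinary range query. In a Postgres deployment the same insert would typically be paired with a NOTIFY so listening agents wake up without polling.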
Pillar 2 — Subscription-first LLM routing
This is the single most important cost-control mechanism. A 32-agent system making 5-15 LLM calls per cycle, running 24/7, will hit ~7,000 LLM calls per day. At Sonnet API rates, that's $50-80/day. The same workload on a Claude Max 20x subscription adds no per-call cost on top of the flat monthly fee (within fair use).
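The daily figure is easier to reason about per call and per month. The short calculation below uses only the numbers quoted above (about 7,000 calls and $50-80 per day); the 30-day month is an assumption.

```python
calls_per_day = 7_000
api_cost_per_day = (50.0, 80.0)  # USD, low and high estimates from the text

cost_per_call = tuple(c / calls_per_day for c in api_cost_per_day)
cost_per_month = tuple(c * 30 for c in api_cost_per_day)  # assumes a 30-day month

print(f"per call:  ${cost_per_call[0]:.4f} to ${cost_per_call[1]:.4f}")
print(f"per month: ${cost_per_month[0]:,.0f} to ${cost_per_month[1]:,.0f}")
# per call:  $0.0071 to $0.0114
# per month: $1,500 to $2,400
```

Roughly a cent per call looks harmless in isolation; it is the call volume that turns it into a four-figure monthly bill.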
The router:
Priority 1: Claude CLI (subscription)
Priority 2: Codex CLI (subscription)
Priority 3: Anthropic API
Priority 4: OpenAI API
Each route has a circuit breaker. After N consecutive failures, the route is marked open and the next route in priority is tried. After a cooldown, a single trial request tests recovery.
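A rough sketch of that behaviour is below, assuming each route is just a callable that takes a prompt and either returns text or raises. The `Route` class, `route_call` function, default thresholds and injectable clock are illustrative names and values, not the actual router.

```python
import time


class Route:
    """One LLM route guarded by a consecutive-failure circuit breaker."""

    def __init__(self, name, call, max_failures=3, cooldown_s=60.0, clock=time.monotonic):
        self.name = name
        self.call = call                # callable: prompt -> response text
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None           # None means the breaker is closed

    def available(self):
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let one trial request through.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()   # open, or re-open after a failed trial


class AllRoutesFailed(RuntimeError):
    pass


def route_call(routes, prompt):
    """Try routes in priority order; return (route_name, response)."""
    for route in routes:
        if not route.available():
            continue
        try:
            response = route.call(prompt)
        except Exception:
            route.record(ok=False)
            continue
        route.record(ok=True)
        return route.name, response
    raise AllRoutesFailed("no route could serve the request")
```

A failed trial request re-opens the breaker and restarts the cooldown, so a route that is still down costs one extra attempt per cooldown period rather than one per call. Passing the clock in makes the breaker testable without sleeping.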
A cost ledger records every call: route, model, input/output tokens, computed cost, latency, success. SELECT against the ledger to see exactly where the money went.
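As an illustration of what one ledger row might carry, the sketch below computes a per-call cost from token counts. The field names mirror the list above, but the prices in `PRICES` are hypothetical placeholders, and treating subscription routes as zero marginal cost is an assumption.

```python
from dataclasses import dataclass

# Hypothetical USD prices per million tokens: (input, output).
# Subscription routes are flat-fee, so their marginal cost is recorded as zero.
PRICES = {
    "claude_cli": (0.0, 0.0),
    "anthropic_api": (3.0, 15.0),
}


@dataclass(frozen=True)
class LedgerEntry:
    route: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: float
    success: bool


def make_entry(route, model, input_tokens, output_tokens, latency_ms, success):
    in_price, out_price = PRICES[route]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return LedgerEntry(route, model, input_tokens, output_tokens,
                       round(cost, 6), latency_ms, success)


def spend_by_route(entries):
    """Total cost per route, the in-memory equivalent of a GROUP BY on the ledger."""
    totals = {}
    for e in entries:
        totals[e.route] = totals.get(e.route, 0.0) + e.cost_usd
    return totals
```

Recording failed calls as well as successful ones matters: a route that is flapping can burn tokens without ever producing a usable answer, and the ledger is where that shows up.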
Pillar 3 — Fail-closed safety
The single most important safety property of a production LLM trading system is that it fails closed, not open. A single LLM call returning a malformed response, a Postgres connection drop, a broker disconnect — none of these should result in an open position that goes unmanaged.
Three patterns enforce fail-closed:
Pattern 1 — Watchdog outside the agent graph
A separate process, with its own permissions, that does one thing: watch invariants. Heartbeat from the EA. Drawdown vs hard cap. Margin utilization. Open-position stop-loss attachment. If any invariant breaks, the watchdog issues an L4 HALT that closes positions, locks the EA, and pages the operator. No agent — including the CEO — can countermand it.
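The invariant check itself can be very small. The sketch below covers the four invariants named above; the field names, default limits and the `halt` callback are assumptions, and a real watchdog would read its snapshot from the broker and from Postgres rather than from a function argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """State the watchdog samples each tick (field names are illustrative)."""
    seconds_since_heartbeat: float
    drawdown_pct: float
    margin_utilization_pct: float
    positions_without_stop: int


@dataclass(frozen=True)
class Limits:
    max_heartbeat_gap_s: float = 30.0
    max_drawdown_pct: float = 5.0
    max_margin_utilization_pct: float = 50.0


def broken_invariants(snap, limits=Limits()):
    """Return the list of invariants that are currently violated."""
    broken = []
    if snap.seconds_since_heartbeat > limits.max_heartbeat_gap_s:
        broken.append("ea_heartbeat_stale")
    if snap.drawdown_pct >= limits.max_drawdown_pct:
        broken.append("drawdown_hard_cap")
    if snap.margin_utilization_pct > limits.max_margin_utilization_pct:
        broken.append("margin_utilization")
    if snap.positions_without_stop > 0:
        broken.append("position_missing_stop_loss")
    return broken


def watchdog_tick(read_snapshot, halt):
    """Run one check. Any breach, or any failure to read state, triggers a halt."""
    try:
        broken = broken_invariants(read_snapshot())
    except Exception as exc:
        broken = [f"snapshot_unreadable: {exc}"]   # fail closed
    if broken:
        halt(broken)
    return broken
```

The important line is the `except` branch: if the watchdog cannot read the state it is supposed to guard, it treats that as a breach rather than as "nothing to report".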
Pattern 2 — Three independent signatures per trade
Before an order leaves the system, three different agents must sign:
- Risk Gate — numerical state check.
- FactChecker — re-verify inputs at signing.
- DoubleChecker — blind second opinion.
Each can reject. Each rejection is journaled. Compounded independent vetoes catch the trades a single model would have rushed into.
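One way to express the three-signature rule, assuming each signer is a callable that returns an `(approved, reason)` pair. The signer keys and the shape of the journal records are made up for this sketch.

```python
REQUIRED_SIGNERS = ("risk_gate", "fact_checker", "double_checker")


def collect_signatures(order, signers, journal):
    """Return True only if every required signer approves the order.

    `signers` maps a signer name to a callable returning (approved, reason).
    A missing signer or a signer that raises counts as a rejection.
    """
    approved_all = True
    for name in REQUIRED_SIGNERS:
        try:
            approved, reason = signers[name](order)
        except Exception as exc:
            approved, reason = False, f"signer error: {exc!r}"
        journal.append({"order": order["id"], "signer": name,
                        "approved": bool(approved), "reason": reason})
        if not approved:
            approved_all = False
    return approved_all
```

Each signer sees only the order, never the other verdicts, which keeps the second opinion blind. All three are asked even after a rejection so the journal records every verdict, not just the first veto.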
Pattern 3 — Operator-acknowledged resume
After any emergency halt, only the operator can restart trading. Not the CEO agent. Not an auto-recovery timer. The operator reads the journal entry that triggered the halt, understands the cause, and manually resumes. This is expensive in operator time, but it is what every real fund does, and it is what prevents the "unsupervised auto-recovery" failure mode that has killed more retail accounts than any single market move.
Auto-recovery after an emergency halt is the single most dangerous setting in any LLM trading system. The recovery loop will, eventually, restart trading into the same condition that triggered the halt. Production systems do not auto-recover. They wait for the operator.
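A minimal sketch of a halt flag with no automatic way out. The `HaltState` class is hypothetical, and the string comparison on `actor` stands in for real operator authentication, which this example does not attempt.

```python
class HaltState:
    """Trading halt flag that only an operator acknowledgement can clear."""

    def __init__(self):
        self.halted = False
        self.journal_ref = None   # journal entry that explains the halt
        self.history = []

    def halt(self, journal_ref):
        self.halted = True
        self.journal_ref = journal_ref
        self.history.append(("halt", journal_ref))

    def resume(self, actor, acknowledged_ref):
        """Clear the halt only for a human operator citing the right entry."""
        if not self.halted:
            return True
        if actor != "operator":
            self.history.append(("resume_denied", actor))
            return False
        if acknowledged_ref != self.journal_ref:
            self.history.append(("resume_denied", "wrong journal reference"))
            return False
        self.halted = False
        self.history.append(("resume", acknowledged_ref))
        return True
```

Requiring the operator to cite the journal entry is a cheap way to make sure the resume is tied to a specific halt that someone has actually read. There is deliberately no timer and no agent-facing code path that flips `halted` back to `False`.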
Pillar 4 — Skills as markdown
Production multi-agent systems do not hardcode prompts into Python files. They store each agent's prompt as a structured markdown skill:
skills/iqntx-phase1-regime/SKILL.md
skills/iqntx-strategist-trend-pullback/SKILL.md
skills/iqntx-risk-gate-numerical/SKILL.md
...
Each skill has three sections: Persona (who is reasoning), Inputs (what arrives in the prompt), Output schema (what the response must look like). The SkillLoader watches the skills directory and reloads on file change.
This pattern matters for two reasons:
- Auditability. The prompts are in version control. The diff history shows exactly how the persona has evolved over time.
- Iteration. Operators can tune a skill without restarting the service. New skills can be added by creating a new directory.
The pattern is borrowed from Anthropic's Claude Skills convention. Production multi-agent trading systems generally use it because it is the cleanest way to manage 30+ specialised prompts.
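For illustration, a small loader along these lines might look like the following. It assumes each of the three sections is a level-two markdown heading (`## Persona`, `## Inputs`, `## Output schema`), which is a guess at the file layout, and it checks the file's modification time on each access instead of running a filesystem watcher.

```python
import re
from pathlib import Path

REQUIRED_SECTIONS = ("Persona", "Inputs", "Output schema")


def parse_skill(text):
    """Split a SKILL.md body into {section title: section text}."""
    parts = re.split(r"^## +(.+?) *$", text, flags=re.MULTILINE)
    # parts = [preamble, title1, body1, title2, body2, ...]
    sections = {t: b.strip() for t, b in zip(parts[1::2], parts[2::2])}
    missing = [s for s in REQUIRED_SECTIONS if s not in sections]
    if missing:
        raise ValueError(f"skill is missing sections: {missing}")
    return sections


class SkillLoader:
    """Caches parsed skills and re-parses a file when its mtime changes."""

    def __init__(self, root):
        self.root = Path(root)
        self._cache = {}   # skill name -> (mtime_ns, sections)

    def get(self, name):
        path = self.root / name / "SKILL.md"
        mtime = path.stat().st_mtime_ns
        cached = self._cache.get(name)
        if cached is None or cached[0] != mtime:
            cached = (mtime, parse_skill(path.read_text(encoding="utf-8")))
            self._cache[name] = cached
        return cached[1]
```

Rejecting a skill that lacks one of the three sections at load time is the useful part: a half-edited prompt fails loudly in the loader instead of quietly reaching an agent.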
Pillar 5 — Markdown context filesystem
The system's institutional memory lives as markdown on disk:
- context/philosophy.md: the bedrock doctrine every agent reads at boot.
- context/journal/<date>/: per-day decision journal, organized by timestamp.
- context/<agent>/: per-agent working notes.
Why markdown and not a database? Because a human operator should be able to cat the journal and read it without translation. A trade that the system vetoed at 09:31:14 produces a file context/journal/2026-05-20/09-31-14-EURUSD-veto.md that explains in plain English what the Strategist proposed, why the Risk Gate vetoed, and what state the FactChecker would have re-verified.
The database (agent_journal table) is for SQL replay. The markdown is for human postmortems. Both are written; both are needed.
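Writing the markdown side is mostly a matter of choosing a path and a readable template. The sketch below follows the file-naming example given above; the section headings inside the file and the `write_veto_entry` helper are assumptions.

```python
from datetime import datetime
from pathlib import Path


def write_veto_entry(root, when, symbol, proposal, veto_reason):
    """Write one human-readable veto record and return its path.

    Path layout follows the example in the text:
    context/journal/<YYYY-MM-DD>/<HH-MM-SS>-<SYMBOL>-veto.md
    """
    day_dir = Path(root) / "journal" / when.strftime("%Y-%m-%d")
    day_dir.mkdir(parents=True, exist_ok=True)
    path = day_dir / f"{when.strftime('%H-%M-%S')}-{symbol}-veto.md"
    body = "\n".join([
        f"# {symbol} veto at {when.isoformat(sep=' ')}",
        "",
        "## Strategist proposal",
        proposal,
        "",
        "## Risk Gate veto",
        veto_reason,
        "",
    ])
    path.write_text(body, encoding="utf-8")
    return path
```

Sorting a day's directory by filename then gives the decisions in time order, which is what makes a plain listing of the folder useful during a postmortem.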
Pillar 6 — Self-optimization
The cheapest source of edge in a multi-agent system is the journal. Every night, a Self-Optimizer agent reads the day's context/journal/, scores what worked, retires what didn't, and queues experiments for tomorrow.
- A strategy whose setups are vetoed 80% of the time may be configured too aggressively for the current regime: reduce its sensitivity.
- A strategy whose vetoed setups would have been winners more often than not is being too conservative — relax the threshold.
- A regime classification that the Macro Officer disagreed with three times this week — flag for review.
The Self-Optimizer doesn't auto-tune live parameters; it produces a markdown report the operator reviews. The operator decides what changes. The Self-Optimizer is institutional memory; it is not autonomy.
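The first two checks above reduce to counting. A sketch, assuming journal entries have already been parsed into dicts with `strategy`, `vetoed` and `would_have_won` keys (an invented shape), and using an illustrative minimum sample size:

```python
from collections import defaultdict


def veto_report(entries, high_veto_rate=0.8, min_samples=10):
    """Build a markdown report from journal entries.

    Each entry is a dict with `strategy` (str), `vetoed` (bool) and, for
    vetoed setups, `would_have_won` (bool) from a later replay.
    """
    stats = defaultdict(lambda: {"n": 0, "vetoed": 0, "vetoed_winners": 0})
    for e in entries:
        s = stats[e["strategy"]]
        s["n"] += 1
        if e["vetoed"]:
            s["vetoed"] += 1
            s["vetoed_winners"] += bool(e.get("would_have_won"))

    lines = ["# Self-Optimizer report", ""]
    for name in sorted(stats):
        s = stats[name]
        if s["n"] < min_samples:
            continue   # too few setups to say anything
        veto_rate = s["vetoed"] / s["n"]
        if veto_rate >= high_veto_rate:
            lines.append(f"- {name}: vetoed {veto_rate:.0%} of setups; "
                         "review sensitivity for the current regime")
        if s["vetoed"] and s["vetoed_winners"] / s["vetoed"] > 0.5:
            lines.append(f"- {name}: most vetoed setups would have won; "
                         "review the veto threshold")
    return "\n".join(lines)
```

The function returns text and changes nothing, which matches the division of labour described above: the report flags candidates, and the operator decides.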
What this architecture is not
Two things this is NOT:
Not a guarantee of profit
Architecture is necessary but not sufficient. A well-architected system with a bad strategy bank still loses money. A poorly architected system with a great strategy bank still blows up. The architecture is what gives the strategy room to compound; it does not create the strategy.
Not a black box
Every layer is auditable. The skills directory is markdown. The journal is markdown. The bus log is Postgres rows. The cost ledger is Postgres rows. An operator can — and should — sit down with the system and read why every trade fired (or didn't) for the last week.
Black-box LLM trading is an oxymoron. The whole point of using LLMs is the explainable reasoning chain. A system that hides its reasoning is a system that doesn't trust its own architecture.
How to build one (the honest version)
If you want to build this:
- Start with the message bus and the journal. Get those right first. Everything else plugs into them.
- Add one agent at a time. Strategist first, then Risk Gate, then FactChecker, then DoubleChecker. Each addition exposes integration bugs in the bus.
- Add the router last. Until you have multiple agents working, the router is just a glorified API client.
- Add the watchdog before going live. Not optional. The fail-closed property is what keeps you from blowing up on the first edge case.
- Spend weeks reading your own journal. The first month after going live, more time should be spent on postmortems than on building new features.
The architecture is portable. iQntX is one implementation; there are others on GitHub (TauricResearch/TradingAgents being the most prominent open-source example). The shape is convergent — the same pillars show up in every production system that has lasted longer than a quarter.
Keep reading
- What Is Multi-Agent Trading? — the architectural foundation.
- Claude Trading Bot — the LLM-specific implementation.
- Building Multi-Agent Trading Systems with Claude — the LLM-routing layer.
- How a 32-Agent AI Hedge Fund Beats a Single-Model Bot — the full production system.
Questions readers ask about this
If you find a question we should add, send it to hello@iqntx.com.
What is an LLM trading system?
An LLM trading system uses one or more large language models as the reasoning substrate for trading decisions. Instead of fixed rules or black-box ML predictions, decisions emerge from a structured prompt-and-response process where the model reasons over inputs and produces explainable outputs. The 'architecture' is everything around the model — broker integration, agent coordination, risk gating, journaling — that makes the system tradeable.
Why use LLMs for trading at all?
Three reasons. First, reasoning over context: regime classification, news interpretation, and stance setting benefit from the kind of structured reasoning LLMs do well. Second, explainability: every decision comes with a natural-language reasoning chain you can audit. Third, generality: a single model architecture can handle multiple instruments and multiple regimes without per-strategy ML training.
How is an LLM trading system different from an ML trading bot?
ML bots predict numbers (probability of price up, expected return, etc.). LLM systems produce structured reasoning ('this looks like a Strong Trending Bullish regime; the proposed BUY is consistent with the active stance; the Risk Gate vetoes because correlation with existing positions is elevated'). The ML bot's decision is opaque; the LLM system's decision is auditable in plain English. For postmortems and regulatory compliance, the difference is enormous.
Can LLMs really trade fast enough?
Not for HFT. A typical LLM call takes 1-5 seconds; an HFT decision must fit in microseconds. LLM trading is for slower cycles: swing trading, day trading, position trading, prop firm work. For those timeframes, the LLM latency is acceptable — adding 5-10 seconds to a 30-minute setup doesn't change the outcome.
What's the biggest architectural mistake LLM trading systems make?
Treating the LLM as the sole decision-maker. A single LLM call with a good prompt looks like it works for the first 2-4 weeks. It doesn't work over months because LLMs hallucinate, overstate confidence, and miss regime transitions. The fix is multi-agent: multiple LLM calls in different roles with veto authority. The mistake is shipping a single-call system to a real account.
Are LLM trading systems expensive to run?
Depends on the architecture. Pure-API multi-agent systems cost $1,500-2,400/month for an active trader because every agent call is metered per-token. Subscription-first routing (using the Claude Max 20x or Codex CLI subscriptions where available) brings the cost down 10-50x for the same workload. The architecture choice is what makes LLM trading retail-affordable.
What's the simplest LLM trading system that actually works?
Three components: (1) a Regime Classifier LLM call that picks the active stance, (2) a Strategist LLM call that proposes setups, (3) a Risk Gate LLM call that vetoes setups that don't fit the stance. Add a journal and a watchdog. That's the minimum viable LLM trading system. Everything else — fact-checker, double-checker, macro officer, self-optimizer — is a refinement of that core.
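As a sketch of that minimum, the three calls can be wired together as plain callables. The function names and the journal record shape are invented for illustration, and the callables stand in for real LLM calls.

```python
def run_cycle(market, classify_regime, propose_setups, risk_gate, journal):
    """One decision cycle: regime -> setups -> veto. Returns approved setups.

    The three callables stand in for LLM calls. Any exception aborts the
    cycle with no approved setups, so a bad response never becomes an order.
    """
    try:
        stance = classify_regime(market)
        setups = propose_setups(market, stance)
        approved = []
        for setup in setups:
            ok, reason = risk_gate(setup, stance)
            journal.append({"stance": stance, "setup": setup,
                            "approved": ok, "reason": reason})
            if ok:
                approved.append(setup)
        return approved
    except Exception as exc:
        journal.append({"error": repr(exc)})
        return []
```

The watchdog is intentionally absent from this function: it belongs in a separate process that can halt trading regardless of what this loop is doing.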