Research Analyst Bot Upgrade Report
The dominant failure mode in analyst bots is architectural mismatch, not raw model weakness. Pushing more agents into a workflow without matching coordination to task structure lowers quality while increasing cost.
What this run produced
- Run 20260302T021844Z: 50 candidates, 8 kept, 8 inserted (art-2026-03-02-005…012).
- Run 20260302T022020Z: 48 candidates, 6 kept, 6 inserted (art-2026-03-02-013…018).
Total added: 14 non-X items focused on architecture, evals, grounding, and benchmark design.
Strategic findings
Coordination-task fit over agent count. Multi-agent systems can outperform a single agent on parallelizable work, but they degrade on sequential tasks or when coordination is poor. Orchestration should be a per-task policy decision, not a global default.
Eval harnesses are mandatory. The core unit is task outcome under multi-turn tool use, measured with trials, transcripts, outcomes, and mixed graders.
Citation reliability must be first-class. Analyst-grade quality requires report-level and claim-level citation checks, not post-hoc formatting.
Memory + grounding are infrastructure. Retrieval quality and dynamic memory organization are practical bottlenecks; stronger base models alone do not solve confident failure modes.
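As a rough illustration of the eval-harness finding above, the sketch below runs one task over several trials, keeps each transcript and outcome, and averages the scores from two kinds of grader. Every name here (Trial, run_eval, the graders, the stub agent, the "tool:" transcript prefix) is hypothetical, and "mixed graders" is read as combining an outcome check with a transcript check, which is an assumption rather than something this report specifies.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Trial:
    """One attempt at a task: the step-by-step transcript and the final outcome."""
    transcript: list[str]
    outcome: str


# A grader maps one trial to a score in [0, 1].
Grader = Callable[[Trial], float]


def exact_match_grader(expected: str) -> Grader:
    """Outcome grader: 1.0 when the final answer equals the expected string."""
    def grade(trial: Trial) -> float:
        return 1.0 if trial.outcome.strip() == expected else 0.0
    return grade


def tool_call_grader(required_tool: str) -> Grader:
    """Transcript grader: 1.0 when some step is a call to the required tool."""
    prefix = f"tool:{required_tool}"  # hypothetical transcript convention
    def grade(trial: Trial) -> float:
        return 1.0 if any(step.startswith(prefix) for step in trial.transcript) else 0.0
    return grade


def run_eval(agent: Callable[[str], Trial], task: str,
             graders: dict[str, Grader], n_trials: int = 4):
    """Run the agent n_trials times and average each grader over the trials."""
    trials = [agent(task) for _ in range(n_trials)]
    scores = {name: mean(grade(t) for t in trials) for name, grade in graders.items()}
    return trials, scores


def make_stub_agent() -> Callable[[str], Trial]:
    """Stand-in for a real agent: scripted outcomes, one of them wrong."""
    outcomes = iter(["42", "41", "42", "42"])
    def agent(task: str) -> Trial:
        return Trial(transcript=[f"user: {task}", "tool:search q=answer"],
                     outcome=next(outcomes))
    return agent


trials, scores = run_eval(
    make_stub_agent(),
    "What is the answer?",
    {"outcome": exact_match_grader("42"), "tool_call": tool_call_grader("search")},
    n_trials=4,
)
print(scores)
```

Keeping the trials alongside the scores means a failing run can be inspected from its transcript rather than re-executed.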
Opinionated architecture recommendation
Build a policy-driven orchestrator with selective parallelism: single-agent by default for low-entropy tasks, centralized orchestrator-worker escalation for decomposable multi-branch work, and no uncontrolled topologies in production unless benchmarks show they perform better.
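One possible shape for that routing policy is sketched below. The TaskProfile fields, the min_branches threshold, and the way "low-entropy" and "decomposable" are turned into concrete signals are all assumptions made for illustration; the report does not define them.

```python
from dataclasses import dataclass
from enum import Enum


class Topology(Enum):
    SINGLE_AGENT = "single_agent"
    ORCHESTRATOR_WORKER = "orchestrator_worker"


@dataclass(frozen=True)
class TaskProfile:
    # Hypothetical signals; a real router would have to estimate these.
    independent_branches: int  # sub-questions that can be answered in parallel
    sequential: bool           # True when each step depends on the previous one


def choose_topology(task: TaskProfile, min_branches: int = 3) -> Topology:
    """Per-task routing: single-agent by default, escalate only for parallel work."""
    if task.sequential:
        # Sequential work gains nothing from fan-out and pays coordination cost.
        return Topology.SINGLE_AGENT
    if task.independent_branches >= min_branches:
        # Decomposable multi-branch work goes to a central orchestrator with workers.
        return Topology.ORCHESTRATOR_WORKER
    return Topology.SINGLE_AGENT


simple_lookup = choose_topology(TaskProfile(independent_branches=1, sequential=False))
market_scan = choose_topology(TaskProfile(independent_branches=5, sequential=False))
chained_analysis = choose_topology(TaskProfile(independent_branches=5, sequential=True))
print(simple_lookup, market_scan, chained_analysis)
```

The enum deliberately has no entry for uncontrolled topologies, so adding one would be an explicit, reviewable change.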
Persist three mandatory artifacts per run: retrieval trace, claim-to-source evidence ledger, and synthesis decision transcript.
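A minimal sketch of how those three artifacts might be held together per run, with a simple claim-level check on the evidence ledger. The class and field names, the source IDs, and the example claims are placeholders, and the rule that a claim must cite a source present in the retrieval trace is one plausible check rather than the report's stated rule.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalEvent:
    query: str
    source_ids: list[str]


@dataclass
class LedgerEntry:
    claim: str
    source_ids: list[str]  # sources the claim is attributed to


@dataclass
class RunArtifacts:
    run_id: str
    retrieval_trace: list[RetrievalEvent] = field(default_factory=list)
    evidence_ledger: list[LedgerEntry] = field(default_factory=list)
    synthesis_transcript: list[str] = field(default_factory=list)

    def unsupported_claims(self) -> list[str]:
        """Claims with no citation, or citing a source that was never retrieved."""
        retrieved = {sid for event in self.retrieval_trace for sid in event.source_ids}
        return [
            entry.claim
            for entry in self.evidence_ledger
            if not entry.source_ids or not set(entry.source_ids) <= retrieved
        ]


# Placeholder data for illustration only.
artifacts = RunArtifacts(
    run_id="example-run",
    retrieval_trace=[RetrievalEvent(query="agent topologies", source_ids=["src-1", "src-2"])],
    evidence_ledger=[
        LedgerEntry(claim="Claim A", source_ids=["src-1"]),
        LedgerEntry(claim="Claim B", source_ids=[]),         # no citation
        LedgerEntry(claim="Claim C", source_ids=["src-9"]),  # source never retrieved
    ],
    synthesis_transcript=["kept Claim A", "flagged Claim B and Claim C"],
)
print(artifacts.unsupported_claims())
```

Cross-checking the ledger against the retrieval trace is what makes the check claim-level: a well-formatted citation to a source the run never fetched still fails.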
14-day implementation frame
- Days 1–3: deterministic harness runs, transcript capture, claim-source citation checks.
- Days 4–7: orchestration policy routing + effort budgets by query complexity.
- Days 8–10: benchmark-style quality + citation + factuality gates.
- Days 11–14: dynamic memory updates, retrieval monitors, failure-mode dashboards, and publish-block rules.
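The quality, citation, and factuality gates and the publish-block rule in the plan above could be combined along the following lines. The threshold values and the names GateThresholds and publish_blockers are illustrative assumptions, not figures from this report.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Hypothetical minimum scores in [0, 1]; tune against real benchmark runs.
    quality: float = 0.80
    citation: float = 0.95
    factuality: float = 0.90


def publish_blockers(scores: dict[str, float],
                     thresholds: GateThresholds = GateThresholds()) -> list[str]:
    """Return one message per failed gate; an empty list means publishing is allowed."""
    blockers = []
    for name, limit in asdict(thresholds).items():
        score = scores.get(name, 0.0)  # a missing score counts as a failure
        if score < limit:
            blockers.append(f"{name}: {score:.2f} < {limit:.2f}")
    return blockers


passing = publish_blockers({"quality": 0.90, "citation": 0.97, "factuality": 0.92})
failing = publish_blockers({"quality": 0.90, "citation": 0.90, "factuality": 0.92})
print(passing, failing)
```

Treating a missing score as a failure keeps the rule fail-closed: a report is blocked unless every gate was actually measured and passed.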