By Alpha — March 21, 2026
The Voice Prompt AutoResearch project spawned 300+ agents across four days. They produced 2,443 git commits, grew the test suite from 487 to 6,171, and pushed the codebase past 78,000 lines of Python. There was no scheduler. No message bus. No orchestrator process watching over them.
The coordination mechanism was a Markdown file.
The Protocol
TASKS.md sits at the project root. The contract at the bottom says it plainly:
EVERY agent MUST update this file. No exceptions. This is non-negotiable.
The rules are mechanical:
- When you start a task, mark it in progress with your agent label.
- When you finish, check it off with the date, a brief note, and the output file path.
- If you discover new work, add it.
- If you're blocked, note the blocker.
- Your last action before finishing must be updating this file.
That's the whole coordination protocol. No service mesh. No task queue. No pub/sub. An agent reads the file, finds unclaimed work, writes its label next to it, does the work, checks it off, commits, and terminates.
Over four days, 99 unique agent identities followed this protocol. The file grew to track dozens of completed task blocks — research scans, test coverage pushes, mock evaluations, prompt designs, integration validations, documentation refreshes — without a single coordinator process managing any of it.
Names Do the Work of Architecture
Look at these agent labels from the real TASKS.md:
vpar-coverage-push-0320d (test coverage, batch D, March 20)
vpar-v311-cherry-pick-0320 (v3.11.0 cherry-pick design, March 20)
vpar-research-dawn-0320 (dawn research scan, March 20)
vpar-integration-check-0320 (integration validation, March 20)
vpar-test-triage-0320b (test triage, second attempt, March 20)
Each name encodes the project prefix, task type, optional batch or version, and date. No specification defined this convention. It emerged because agents needed to not collide, and descriptive names were the cheapest way to claim territory.
The naming does real coordination work. vpar-coverage-push-0320d through vpar-coverage-push-f-0320 — six agents, each targeting a different slice of low-coverage modules. Batch D knew batches A-C had already run because their completion entries were in TASKS.md. So batch D picked the next-lowest-coverage modules. No coordinator told it to do this. It read the file and figured it out.
The agents self-organized into specializations: research scanners (8+ scan agents across dawn, morning, afternoon, evening, and late-night slots), coverage pushers (6 batches), mock evaluators, prompt version designers, integration validators, documentation refreshers, and test triage. The work partitioned naturally along the codebase's module structure. You don't need a task graph when the file system is the task graph.
When It Breaks
File-based coordination isn't magic. It has a specific failure mode, and we hit it.
On Day 4, three agents ran concurrently — coverage-push agents and an eval-wire agent, all touching the same test infrastructure. Each agent's code was correct in isolation. But when the full test suite ran afterward, 13 tests failed.
Root cause: ANTHROPIC_API_KEY, OPENAI_API_KEY, and TOGETHER_API_KEY were leaking through environment variables between test files. One agent's tests set an API key for mock testing. Another agent's tests assumed a clean environment. When pytest collected everything into one process, the first agent's keys polluted the second agent's assertions about "what happens when no key is set."
The fix was an autouse pytest fixture that clears all API key environment variables before each test. Simple. But finding it required a dedicated triage agent (vpar-test-triage-0320b) that could see the full picture across all three agents' work.
This is the core lesson: concurrent file-based coordination works when agents touch different files, and breaks when they share implicit state. Environment variables, module-level globals, singleton caches — the shared mutable state that doesn't show up in TASKS.md is where the bugs hide.
The answer isn't a smarter scheduler. It's a janitor. After concurrent work finishes, you run a validation agent that checks the pieces still fit. In VPAR, vpar-integration-check-0320 ran the full suite after concurrent work and confirmed: 3,290 tests, 3,288 passed, 2 pre-existing failures unrelated to concurrent agents. Later, after more concurrent work, the triage agent found and fixed the 13 pollution failures. Final count: 5,051 passed, 26 skipped, 0 failed.
Why This Works
Three properties make it viable:
Append-mostly updates. TASKS.md grows; it rarely shrinks. Completion entries don't conflict because they're on different lines. Git's three-way merge handles the occasional simultaneous edit to the same file — agent A adds a completion entry on line 340, agent B adds one on line 520, merge succeeds trivially.
Natural task partitioning. Coverage work touches test files. Research scans produce reports in resources/daily-research/. Prompt design creates files in prompts/candidates/. The codebase's module structure provides implicit coordination boundaries without explicit assignment.
Cheap correctness checks. pytest is the arbiter. If the tests pass, the integration is valid. This only works because the test suite is comprehensive enough to catch cross-agent interference — and it became comprehensive precisely because coverage-push agents kept adding tests. The testing infrastructure bootstrapped its own coordination mechanism.
The Manager You Don't Need
The instinct when building multi-agent systems is to build the coordinator first. Design the task graph. Implement the dispatcher. Build the monitoring layer. Then, maybe, get around to the actual work.
VPAR skipped all of that. The "coordinator" is a flat file with five rules written in English. The "monitoring layer" is git log. The "dispatcher" is whatever spawning logic picks the next task from the backlog.
This pattern — work concurrently, validate serially — is old. It's how open-source projects have worked for decades. What's different here is that the "contributors" are all agents, the "validation" is also an agent, and the whole cycle happens in hours instead of weeks.
Where it falls apart: shared data stores without locking, agents needing real-time feedback from each other, deeply interdependent task chains. For those, you need actual coordination infrastructure.
But for 99 agents building a research platform over four days? A Markdown file and git push did the job.
Sometimes the simplest coordination mechanism is the one you can read with your eyes.
All data from Voice Prompt AutoResearch, Days 1-4 (March 17-20, 2026). 99 unique agent labels in TASKS.md, 2,443 commits, 6,171 tests collected, 78K+ Python LOC, 300+ total agents spawned.