Labs: Building a Production-Ready AI Project Manager in 12 Hours


HumanSpark Labs Report: The Spark Project Management Build

How a strategic conversation designed a system, AI built it autonomously, and then stress-tested its own work – all in a single day.

This report documents a real build session from early February 2026. In a single evening, a strategic conversation identified over 30,000 euros in revenue that wasn’t being properly tracked, designed a five-category business taxonomy from first principles, and produced a full technical specification. Then autonomous AI agents built, tested, and hardened the entire system – over 800 tests, around 7,000 lines of code, no human code written. I ended up with a powerful, secure AI assistant that helps me manage my projects through a chat interface.

A freelance developer might have quoted me 3-4 weeks to build this, or I could have gone down a rabbit hole for several weeks doing it myself. But the honest answer is I never would have built it at all.

Here’s how it happened, and what I learned.

The Problem That Wasn’t the Problem

I’ve been running my consultancy for 19 years. When I shifted focus to AI in early 2024, the business didn’t get simpler – it accumulated more layers on top of the previous ones. Legacy clients from web development work still need attention alongside AI training, speaking, implementation projects, and even a SaaS product. I’ve also written several books to help frame my thinking about AI.

There’s a lot of context-switching, and each type of work needs a different kind of attention, on a different timescale, with different consequences when it gets neglected.

ADHD is a genuine asset in a lot of my work – the ability to lock in on a problem for hours is how a build like this happens in a single evening. But the flip side is that quiet, ongoing work disappears from my radar completely. Invoices don’t get sent, pipeline contacts go cold, and admin that nobody’s chasing simply stops existing in my mind until something breaks.

This is also why I’m self-employed in the first place. I spent seven years as a software engineer trying to be a good employee before accepting that I’m not made for working for others. Nineteen years of running my own business later, the trade-off is the same – the hyperfocus is a gift that I think a lot of people aren’t aware of, but the follow-through on invisible work is where things fall apart for me.

That’s why Spark exists – a personal AI assistant that manages my tasks, deadlines, and calendar through a chat interface (RocketChat – it’s like Slack, but open source, self-hosted, and completely private).

Spark originally had over 400 automated tests and a clean, simple architecture. But it didn’t understand context – it knew I had tasks, but it didn’t know which tasks belonged to which clients, which projects were generating revenue, or which relationships were going cold.

The evening started with a simple question I often ask myself about technical projects:

“Should I build project management features into Spark? Or is this a distraction?”

To answer the “should I build this” question properly, I needed to understand my own business better first.

I have an AI assistant that acts like an advisory council – it’s a structured prompt that synthesises perspectives from several experts. I fed it my email history and invoice records from the past six weeks.

The “council” identified over 30,000 euros in revenue that wasn’t being properly tracked – a mix of invoices I hadn’t sent, payments I hadn’t chased, and bookkeeping that hadn’t been reconciled. Some of it was already in my account but invisible because the records hadn’t been updated. Some was genuinely outstanding. The point wasn’t the number – it was that I couldn’t tell which was which. I had no clear picture of where I actually stood financially.

One AI advisor put it bluntly: “You don’t have a revenue problem. You have a collections problem and an admin problem. The money isn’t flowing in because invoices aren’t being sent or chased.”

Another pointed out the deeper pattern: “You’ve been in cash flow survival mode for years. When everything is one category – ‘will this pay a bill this month?’ – you can’t see the forest for the trees. The fact that you’re now able to step back and think about structure means the pressure is easing.”

And a third closed the loop: “This is the strongest argument for building Spark I’ve heard so far – and also the strongest argument for not building it right now, because every hour you spend coding is an hour you’re not chasing money.”

That tension – build the system versus do the work the system would remind you to do – ran through the entire project.

The Taxonomy Conversation

With the business landscape mapped, a new question emerged: if Spark is going to manage projects intelligently, how should it categorise different types of work?

This matters because a consulting business doesn’t have one kind of project. I have client delivery with deadlines, ongoing SaaS operations that never end, pipeline relationships that die silently if neglected, strategic investments like books and courses that compound over time but generate nothing today, and invisible admin work that causes real damage when forgotten.

If all of these sit in a flat list, the system can’t tell me anything useful. It can’t distinguish “your revenue work is stalling” from “your book isn’t progressing.” It can’t know that a pipeline contact going cold for two weeks is more urgent than a Labs project not getting attention.

The council debated this for over an hour. The core tension was between simplicity (three categories, easier to maintain, less that can go wrong with AI classification) and accuracy (five categories that map to how the business actually works, with different failure modes and urgency signals for each).

We landed on five categories:

  • Delivery – client work with a commitment and a date
  • Operations – ongoing recurring obligations with no end date
  • Pipeline – people and opportunities that could become delivery
  • Strategic – long-term investment work building value
  • Admin – invoicing, bookkeeping, collections, maintenance

Each category has its own failure mode. Delivery fails when you’re underprepared for a deadline. Operations fails when support requests pile up. Pipeline fails silently – you don’t notice until months later when there’s no new work. Strategic fails when it never ships. Admin fails when money leaks.

The taxonomy conversation also surfaced a design principle that shaped the entire system: infer at capture, store explicitly, confirm transparently. When I tell Spark “I need to follow up with Lorcan about that referral,” it should infer that’s pipeline, store the category in the data file, and confirm: “Added: follow up with Lorcan (pipeline).” Zero friction for me, full context for the system.
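To make that principle concrete, here is a minimal sketch of the capture flow, assuming Python. Everything in it – the keyword hints, function names, and the shape of the stored record – is illustrative rather than Spark’s actual code; a real implementation would lean on the model for the inference step rather than keyword matching.

```python
# Minimal sketch of "infer at capture, store explicitly, confirm transparently".
# Keyword hints, function names, and record shape are assumptions for illustration.

CATEGORY_HINTS = {
    "pipeline": ["follow up", "referral", "intro", "proposal"],
    "admin": ["invoice", "chase", "bookkeeping", "reconcile"],
    "delivery": ["deadline", "deliver", "workshop", "client review"],
    "strategic": ["book", "course", "labs"],
    "operations": ["support", "bug", "maintenance", "renewal"],
}

def infer_category(text: str) -> str:
    """Infer a category at capture time; fall back to admin if nothing matches."""
    lowered = text.lower()
    for category, hints in CATEGORY_HINTS.items():
        if any(hint in lowered for hint in hints):
            return category
    return "admin"

def capture_task(text: str, tasks: list[dict]) -> str:
    """Store the category explicitly, then confirm it back to the user."""
    category = infer_category(text)
    tasks.append({"text": text, "category": category})  # stored, not implied
    return f"Added: {text} ({category})"                 # transparent confirmation

tasks: list[dict] = []
print(capture_task("follow up with Lorcan about that referral", tasks))
# -> Added: follow up with Lorcan about that referral (pipeline)
```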

By the end of the taxonomy conversation, I wasn’t asking “should I build this?” anymore. I was asking “how fast can I build this?”

The Specification

The next step was translating the strategic conversation into a technical specification.

I didn’t write the spec myself. I didn’t hand it to a developer. I continued the conversation with the advisory council, and over the course of about two hours, we produced three documents:

A development specification (the “what”): a 7-field data model for projects, a client slug registry mapping full company names to short identifiers, over 20 validation scenarios describing both user-initiated queries (“what’s in my pipeline?”) and system-initiated proactive messages (“you have 3 pipeline contacts going cold”), and a modular collector architecture for pulling in external data from my invoice system and calendar.

A modular build plan (the “how”): 21 phases organised across four parallel tracks – Core Intelligence, External Data Collectors, Proactive Intelligence, and Advanced Intelligence. Each phase has explicit dependencies, files touched, test scenarios, and acceptance criteria. The dependency graph was designed so that after Phase 1 (the foundation), multiple tracks could proceed in parallel wherever dependencies allowed.

Autonomous build instructions (the “who does the work”): a structured prompt for Claude Code, Anthropic’s command-line AI coding tool, that tells it how to execute the build plan autonomously. Each phase gets its own sub-agent with a scoped brief. Sub-agents write completion reports documenting their decisions. A checkpoint system in build-state.json tracks progress so the build can resume after interruptions.

Three documents, one evening of conversation, and zero code written by a human.
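For a sense of how small the core of that first document is, here is a hedged sketch of what a 7-field project record and the client slug registry might look like, assuming Python. The field names and example clients are my guesses for illustration – the actual schema isn’t reproduced in this report.

```python
from dataclasses import dataclass

# Hypothetical 7-field project record. The real field names aren't listed in
# this report, so treat this as a plausible shape, not the actual schema.
@dataclass
class Project:
    slug: str             # short project identifier, e.g. "acme-training"
    name: str             # human-readable project name
    client: str           # client slug from the registry below
    category: str         # delivery, operations, pipeline, strategic, or admin
    status: str           # e.g. "active", "on-hold", "done"
    due_date: str | None  # ISO date for delivery work, None otherwise
    notes: str            # free-form context

# Client slug registry: full company names mapped to short identifiers.
# Both entries are invented placeholders.
CLIENT_SLUGS = {
    "Acme Industrial Holdings Ltd": "acme",
    "Example Legacy Web Client": "examplelegacy",
}
```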

The design conversation was the hard part. Deciding on five categories instead of three. Choosing to track activity in a separate file rather than cluttering the project records. Defining what “going cold” means for pipeline versus delivery. These are judgment calls that require understanding the business, the user’s cognitive patterns, and the failure modes of each category.

An AI can’t make these decisions alone – but once they’re made and documented clearly, an AI can implement them with remarkable precision.

The Autonomous Build

I handed the three documents to Claude Code and said: “Begin autonomous build.”

Claude Code read the specification, read the existing codebase, and started building – data model upgrades, new classification logic, safety checks, and seeding the system with my real project data. Phase 1 completed with all tests passing, then it moved through the remaining phases.

Three things happened that I didn’t expect.

It optimised the build order. The plan said Phases 1 through 5 were sequential, but the AI noticed that Phase 5 (activity tracking) only depended on Phase 2, not on Phases 3 or 4. So it built Phase 5 right after Phase 3, maximising the number of downstream phases it could unlock. It was reasoning about the dependency graph strategically, rather than following the list in order.
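That reordering is straightforward dependency-graph reasoning. Here is a sketch of the idea, assuming Python – the edges below are invented to show the mechanism, not the real 21-phase plan:

```python
# Invented dependency graph: phase -> phases it depends on.
# Not the real build plan; just enough structure to show the reordering idea.
DEPENDS_ON = {1: [], 2: [1], 3: [2], 4: [3], 5: [2], 6: [5], 7: [5], 8: [4]}

def ready_phases(done: set[int]) -> list[int]:
    """Phases not yet built whose dependencies are all satisfied."""
    return [p for p, deps in DEPENDS_ON.items()
            if p not in done and all(d in done for d in deps)]

def unlocks(phase: int) -> int:
    """How many later phases depend directly on this one."""
    return sum(phase in deps for deps in DEPENDS_ON.values())

# With phases 1-3 done, both 4 and 5 are buildable. Picking the phase that
# unlocks the most downstream work is the optimisation described above.
done = {1, 2, 3}
print(max(ready_phases(done), key=unlocks))  # -> 5
```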

It found its own bugs. During Phase 1, the edge case tests revealed three issues. The client slug “internal” was triggering false positives when users asked about “internal matters.” A 6-field append was bypassing category validation because the threshold check was off by one. The client slug “various” had the same false positive problem as “internal.” The AI diagnosed each issue, implemented fixes, wrote regression tests, and re-ran the full suite. All within the same phase, with no human intervention.
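The “internal” and “various” false positives are a familiar bug class: slugs that are also ordinary English words can’t be matched on the bare word alone. The sketch below is my reconstruction of that class of fix, assuming Python – not the actual change the agent wrote:

```python
# Sketch of the bug class, not the actual fix. Slugs that double as ordinary
# English words need stronger evidence than a bare word match.
AMBIGUOUS_SLUGS = {"internal", "various"}

def resolve_client_slug(message: str, known_slugs: set[str]) -> str | None:
    """Return a client slug only when the mention looks unambiguous."""
    words = set(message.lower().split())
    for slug in known_slugs:
        if slug not in words:
            continue
        if slug in AMBIGUOUS_SLUGS and f"client:{slug}" not in message.lower():
            # "internal matters" should not resolve to the client "internal";
            # require an explicit marker (illustrative convention) instead.
            continue
        return slug
    return None

print(resolve_client_slug("can we discuss internal matters?", {"internal", "acme"}))
# -> None (no false positive)
```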

It ran phases in parallel using sub-agents. When it reached phases that touched different files, it spawned parallel sub-agents – one building the on-demand data refresh while another built pre-meeting prep. When two phases needed to modify the same file, it correctly sequenced them rather than risking conflicts.
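The same graph reasoning extends to file conflicts. Here is a sketch of how parallel batches might be chosen, assuming Python – the phase numbers and file paths are invented:

```python
# Illustrative only: phase numbers and file names are made up, not the real
# build plan. The idea is grouping ready phases by file overlap.
FILES_TOUCHED = {
    6: {"collectors/refresh.py", "spark/commands.py"},
    7: {"briefing/meeting_prep.py"},
    8: {"spark/commands.py"},
}

def parallel_batch(ready: list[int]) -> list[int]:
    """Pick a set of ready phases whose touched files don't overlap."""
    batch, claimed = [], set()
    for phase in ready:
        files = FILES_TOUCHED[phase]
        if files & claimed:
            continue  # conflicts with a phase already in this batch
        batch.append(phase)
        claimed |= files
    return batch

# Phases 6 and 7 touch disjoint files, so they can run as parallel sub-agents;
# phase 8 shares spark/commands.py with 6, so it waits for the next batch.
print(parallel_batch([6, 7, 8]))  # -> [6, 7]
```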

The build ran across one extended session. 21 phases. 4 tracks. The test count climbed steadily from just over 400 to more than 650.

The QA That Tests Itself

With all 21 build phases complete, the next question was: how do you QA a system that was built autonomously?

The answer: you give the AI a comprehensive QA plan and let it stress-test its own work. I wrote a 7-phase QA mission and handed it over. The AI threw everything at the system – malformed data, ambiguous commands, edge cases, simulated full-day and full-week scenarios – and reported back on what broke.

The QA process added close to 180 new tests, bringing the total to over 800. The performance test confirmed the system handles heavy loads in under 2 seconds.

Beyond the formal QA, a separate architectural review identified blind spots – things the QA plan didn’t cover that could cause real problems in production:

  • The “Fat Finger” scenario: What if I manually edit a project slug in vi and accidentally create orphaned tasks? Solution: orphan detection in the morning briefing.
  • Silent API failures: What if the invoice API returns partial data and overwrites the cache with incomplete records? Solution: payload size comparison before overwriting (sketched after this list).
  • No undo button: What if the AI deletes the wrong project? Solution: daily rolling backups with 7-day retention.
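The second of those guards is the easiest to sketch. A minimal illustration in Python – the cache file name and the 50% shrink threshold are assumptions, not the documented values:

```python
import json
from pathlib import Path

CACHE = Path("invoice-cache.json")  # hypothetical file name

def safe_update_cache(new_payload: list[dict], shrink_tolerance: float = 0.5) -> bool:
    """Refuse to overwrite the cache when the new payload looks suspiciously small.

    The 50% threshold is an illustrative choice, not the documented one.
    """
    if CACHE.exists():
        old_payload = json.loads(CACHE.read_text())
        if len(new_payload) < len(old_payload) * shrink_tolerance:
            return False  # likely a partial API response; keep the old data
    CACHE.write_text(json.dumps(new_payload, indent=2))
    return True
```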

All three hardening fixes were implemented and tested. The final system has over 800 automated tests, runs in under 2 seconds, and protects against data loss at every level.

So here’s what I ended up with. I started the evening asking whether to build a project management system. I ended the evening with a production-ready system that understands the structure of my business, tracks active projects across five categories, connects to my invoice system and calendar, sends me proactive morning briefings, nudges me about pipeline contacts going cold, warns me about delivery deadlines approaching, alerts me to overdue invoices, detects scope creep, and backs up its own data daily.

Designed by a strategic conversation. Built by autonomous AI agents. Tested by AI agents finding and fixing their own bugs. Hardened against real-world failure modes identified by architectural review.

What This Actually Means

The design conversation was at least as valuable as the code – possibly more so.

Even if I’d hired someone, I’d have spent weeks explaining the business logic through back-and-forth. But the real answer is simpler: I never would have built it. The cost and the coordination overhead would have meant it stayed on a “someday” list indefinitely. The system exists because the barrier to building it dropped low enough that a single evening of focused conversation could produce it.

And the recurring value is already showing up. The audit found over 30,000 euros in revenue I had no clear visibility on – and the system now makes sure that kind of drift can’t happen quietly. It tells me every morning which invoices to chase, which pipeline contacts are going cold, and which deadlines are approaching with no prep work started. For any business owner, the practical question is whether AI can give you the financial visibility to stop money quietly leaking out of your business.

The five-category taxonomy. The decision to track activity separately. The proactive nudge thresholds (5 days for pipeline, 7 days for strategic, 14 days for overdue invoices). The orphan detection that catches my own manual editing mistakes. These are human decisions. They require understanding how a specific business works and what kinds of things fall through the cracks when you’re running everything yourself. The invisible work is where the damage happens – and that’s exactly what this system watches.
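Once the human decision is made, those thresholds reduce to a few lines of configuration. A sketch assuming Python – only the day counts come from the paragraph above; the structure is mine:

```python
from datetime import date, timedelta

# Only the day counts come from the text above; the structure is an assumption.
NUDGE_THRESHOLDS = {
    "pipeline": timedelta(days=5),         # contact going cold
    "strategic": timedelta(days=7),        # long-term work stalling
    "invoice_overdue": timedelta(days=14), # money not being chased
}

def needs_nudge(kind: str, last_activity: date, today: date | None = None) -> bool:
    """True when an item has been quiet longer than its category allows."""
    today = today or date.today()
    return (today - last_activity) > NUDGE_THRESHOLDS[kind]

print(needs_nudge("pipeline", date(2026, 2, 2), today=date(2026, 2, 9)))  # -> True
```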

I should be clear about what “production-ready” means here. The system works, the tests pass, and it’s running. But it still needs extensive real-world testing – the kind you only get from using something daily and finding the edges the QA didn’t anticipate. I can do that because I have a software development background. I can read the code, tweak the logic, and fix what breaks. This isn’t something I’d build for a client and simply hand over – it’s a system I can maintain because I understand what’s underneath it.

AI built the software, but the human still had to decide what to build, and why.

The Numbers

  • Design conversation: ~3 hours
  • Specification documents: 3
  • Build phases: 21
  • Build tracks (parallel): 4
  • QA phases: 7
  • Hardening fixes: 3
  • Tests at start: ~400
  • Tests at finish: 800+
  • Test execution time: <2 seconds
  • Bugs found and fixed autonomously: multiple (during build and QA)
  • Lines of production code added: ~3,000
  • Lines of test code added: ~4,000
  • Human code written: 0
  • Time from first conversation to production: ~12 hours
  • Estimated equivalent freelance timeline: 3-4 weeks

For the Practitioner

If you’re reading this as someone who wants to do similar work with AI, here are the practical takeaways.

The strategic conversation is the product. Don’t skip to building. The hour spent debating whether “speaking” and “training” should be the same category saved hours of rework later. AI is fast at building, but it’s only as good as the decisions it’s implementing.

Write the spec before the code. A clear, detailed specification with acceptance criteria and test scenarios gives AI everything it needs to build autonomously. A vague brief gives it room to make wrong decisions.

Design for autonomous execution. The checkpoint system, dependency graph, and sub-agent pattern meant the build could survive context limit interruptions, run phases in parallel, and resume from any point. Without this, a 21-phase build would have required constant human supervision.
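A checkpoint file doesn’t need to be clever to do that job. Here is a sketch of what resuming from build-state.json might look like, assuming Python – the file name comes from the build instructions described earlier, but the keys are my guesses:

```python
import json
from pathlib import Path

STATE_FILE = Path("build-state.json")  # name from the build instructions;
                                       # the keys below are assumed

def load_state() -> dict:
    """Resume from the last checkpoint, or start a fresh build."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"completed_phases": [], "current_phase": None}

def mark_phase_complete(state: dict, phase: int) -> None:
    """Record a finished phase so an interrupted build can resume from here."""
    state["completed_phases"].append(phase)
    state["current_phase"] = None
    STATE_FILE.write_text(json.dumps(state, indent=2))
```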

Let AI QA its own work, but verify the QA. The 7-phase QA plan caught real issues. But the architectural blind spot review – done by a human reviewing the system holistically – caught things the QA plan missed entirely. Both layers are necessary.

Document everything. Every phase produced a completion report. Every decision was logged. This report exists because the process was captured in real time, not reconstructed from memory.

And if you’re reading this as a business leader rather than a builder – someone who might commission this kind of work rather than do it yourself – the principle still holds. The value is in the strategic conversation, not the code. If you can clearly define what your business needs, AI can build it. The hard part was never the software.

One caveat: I’m a (former!) software developer, and that matters. I can read the code, debug the edge cases, and maintain the system as it evolves. If you’re commissioning something like this rather than building it yourself, you’ll need someone technical in the loop for the long run – or get comfortable debugging code with AI assistance (this is genuinely possible).

AI can build it, but someone still needs to own it.

What Happened Monday

The system is built. The tests pass. The documentation is current.

Monday morning, for the first time, Spark sent me a briefing that understands the difference between my delivery commitments and my pipeline relationships. It told me which invoices were overdue. It nudged me about contacts going cold. It warned me about deadlines approaching with no prep work started.

Whether I actually listened is a different question!

HumanSpark Labs Report – February 2026
“Fewer late nights, not fewer humans.”

Written by Alastair McDermott

I help leadership teams adopt AI the right way: people first, numbers second. I move you beyond the hype, designing and deploying practical systems that automate busywork – the dull bits – so humans can focus on the high-value work only they can do.

The result is measurable capacity: cutting processing times by 92% and unlocking €55,000 per month in extra productivity.
