The Chief of Staff: Building a Local Voice Agent as a Personal Operating System
A physician's guide to offloading the back-of-mind juggling act: six domain agents, a 5:30 a.m. dependency chain, and a voice assistant that reads exactly what it is told to.
Corporate work. Clinical work. Family budget. Business management. Household needs. Health commitments. Personal life. For years I tracked all of it the same way most professionals do: in the back of my mind, simultaneously, all the time. Nothing was breaking. That was almost the problem. The cognitive cost was the quiet, constant load of holding every domain at once, even on days when none of them needed me.
So I built a chief of staff. A system of six AI agents, each owning one domain of my life, coordinated by an executive layer that briefs me every morning and a voice assistant in my kitchen that answers to "Sinabot." The point was being present with my kids instead of tracking moving pieces in the back of my mind. Every architecture decision below should be read against that goal, because what this system actually produces is mental space.
The Inventory Is the Architecture
The system started as a list of everything I was holding in my head. That list became the agent fleet, six in all, one per domain. The architecture diagram and the cognitive load inventory are the same diagram. That is the design insight I would keep if I had to throw away everything else: you transcribe a personal operating system from the one already running on expensive wetware.
What each agent owns, and what it took off the back of my mind:
Family Budget. Owns the money. The wiring is two hops: my bank accounts and credit cards report transactions to YNAB via direct connection, and the Family Budget agent reads YNAB state through its API. So Claude sees what was spent, against what budget, in near real time.
The wallet structure: roughly forty virtual categories across the usual buckets (fixed obligations, variable spending, sinking funds, savings) plus a set of holdback wallets for things that need to be reserved before they get spent. Anticipated quarterly tax payments, retirement contributions, annual insurance true-ups, upcoming big-ticket expenses like vacation, car maintenance, and the holiday gift budget. Each wallet has a target fill level and sits in a priority queue.
The waterfall is dynamic: fixed obligations come out first, then every remaining wallet gets ranked by current fill status against its target, and the agent tops up the most underfilled ones first regardless of which tier they nominally belong to. A retirement wallet at 95% of target waits while a grocery wallet at 40% gets filled, because the underfilled one has more urgent unmet need. Once a wallet reaches target it sits out the next round, and surplus flows to the next priority tier (extra savings, investments, principal paydown, discretionary).
Paycheck inflow
        |
        v
+-------------------------------+
| Fixed obligations             |  must-pay this month;
| (mortgage, utilities, bills)  |  no flex; comes out first
+-------------------------------+
        |
        v
+-------------------------------+
| Priority refill queue         |  ranked by current fill
|   tax holdback      [ 70% ]   |  status; most underfilled
|   groceries         [ 40% ]   |  wallet pulls money first
|   insurance prepay  [ 55% ]   |
|   retirement        [ 95% ]   |
|   vacation fund     [ 90% ]   |
+-------------------------------+
        |
        v
+-------------------------------+
| Excess to savings,            |
| investments, principal,       |
| discretionary                 |
+-------------------------------+
What it took off my mind: every "are we OK this month" mental calculation, and the recurring "did I hold back enough for the next quarterly" worry. The running totals live in YNAB; the agent watches them; the question I ask the kitchen is "anything underfilled."
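The waterfall above can be sketched in a few lines of Python. This is a minimal illustration, not the agent's actual code: the `Wallet` class, the `allocate` function, and the greedy "fill each wallet to target before moving on" rule are assumptions made for the example, and the dollar amounts are invented.

```python
from dataclasses import dataclass


@dataclass
class Wallet:
    name: str
    target: float   # target fill level, in dollars (hypothetical unit)
    balance: float  # current balance, in dollars

    @property
    def fill(self) -> float:
        """Fraction of target currently funded."""
        return self.balance / self.target if self.target else 1.0


def allocate(inflow, fixed_obligations, wallets):
    """Return (per-wallet top-ups, surplus) for one paycheck."""
    remaining = inflow - sum(fixed_obligations.values())
    if remaining < 0:
        raise ValueError("inflow does not cover fixed obligations")
    topups = {}
    # Most underfilled wallet first, regardless of its nominal tier.
    for wallet in sorted(wallets, key=lambda w: w.fill):
        need = max(wallet.target - wallet.balance, 0.0)
        give = min(need, remaining)
        if give > 0:
            topups[wallet.name] = give
            remaining -= give
    return topups, remaining  # whatever is left flows to the surplus tier


wallets = [
    Wallet("retirement", target=2000, balance=1900),    # 95% full
    Wallet("groceries", target=1000, balance=400),      # 40% full
    Wallet("tax holdback", target=3000, balance=2100),  # 70% full
]
topups, surplus = allocate(5000, {"mortgage": 3000, "utilities": 500}, wallets)
```

With these invented numbers, groceries and the tax holdback absorb the whole remainder and the retirement wallet waits, which is the behavior the paragraph describes.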
Family Assistant. Owns the household and the personal side of life: the kids' activities, the school calendar, appointments, gift planning, vacation logistics, the household to-do list, plus the slower-burn commitments that don't have a hard deadline but lose value if they slip (exercise consistency, sleep, recovery items, the writing and reading projects that get crowded out by everything more urgent). It threads decisions across weeks so the swim-lesson schedule actually talks to the dentist appointment, and the birthday RSVP gets folded into the grocery order. What it took off my mind: the background hum of "what's coming this week and what depends on what," and the friction of remembering what I cared about doing when I had time.
Clinical Business Manager. Owns the practice and consulting back-office: hours, invoices, contract status, billing cycle, client touch points, follow-throughs on past conversations. What it took off my mind: the constant low-grade tracking of "who owes what and when" across a long-tail of clients and counterparties.
Work Assistant. The iMerit-side companion. Captures meeting action items, decision threads, interview prep, deliverables, and the cross-thread items that fall between meetings ("I said I'd send X to Y, did that happen?"). What it took off my mind: the post-meeting reconstruction work and the rolling inventory of open commitments to colleagues.
Reputation Engine. The one I've written about previously. Runs the four-site publishing pipeline, watches branded search results, surfaces the weekly SEO brief, suggests publication topics that fit each site's role. What it took off my mind: how I show up online to people who Google me before we meet, and the question of whether anything is drifting.
Voice Hub. The voice frontend is its own project. The agent tracks what needs attention in the voice stack: a wake-word model retrain that's overdue, a Voice PE satellite that lost its config after a firmware update, a latency regression that wants measuring, an orchestrator change still sitting in a feature branch. (The voice path itself is described later, under "Sinabot.") What it took off my mind: the rolling list of "voice things to fix when I have time" that used to live in a scratch file, full of items that rarely resolve on their own.
Each agent lives in its own Claude project with its own context and files, and shares one machine-parseable contract: a task file where every line is a date and a commitment in a fixed format. That shared contract is what makes the executive layer possible.
The Notification Agent: My UI Is What I Already Use
Six domain agents thinking in parallel only matters if their output reaches me cleanly. The Notification Agent is the executive layer that does that. It reads all six project task files, consolidates the items that need my attention, and writes them into the two surfaces I already check anyway: Google Calendar and Google Tasks.
That decision is load-bearing. The obvious product would have been a dashboard. I did not want to add a screen to my morning. I wanted what I already check to be smarter. So the Notification Agent treats Google Calendar and Google Tasks as the UI layer and itself as silent middleware. My phone, my laptop, my watch, the kitchen voice agent: all of them already point at Google's surfaces, which now reflect the consolidated state of the agent fleet. No new app to install. No new place to look.
Concretely:
Google Tasks. Every actionable item across the six projects with a date attached becomes a Google Task on the right day, tagged with its origin project so I can see at a glance which domain it came from. The sync is bidirectional. If I check something off in Tasks from my phone, the Notification Agent reflects the completion back to the project's task file on its next cycle, so the next morning briefing does not re-surface it. Adds, edits, defers, and reopens all flow both directions, with the project file as the source of truth on conflict.
Google Calendar. Items the Notification Agent judges high-cost-to-context-switch (anything tagged "review," "draft," "research") get protected focus blocks placed on my work calendar around them, so the day actually has room for the work the projects say is due. Calendar-relevant items from the domain agents (an upcoming insurance renewal, a parent-teacher conference, a tax-payment date, a publishing date for an article) appear as native calendar events with the originating project tagged in the description. From my point of view, the calendar simply knows things it would not have known if I were the one entering them.
The morning briefing. Once a day before I wake up, the Notification Agent does a consolidated read across the six projects and writes a single markdown briefing: the day's commitments, which projects have items going stale, what got finished yesterday. It lands in two places: as a top-of-Tasks item I will see the moment I open Tasks, and as a TTS-ready string in the daily.json package that Sinabot reads from when I ask "what's my top priority" in the kitchen.
The weekly sweep. Once a week the Notification Agent does a slower scan: items that have been deferred more than three times, projects whose task files have not been touched in two weeks, calendar invites that need a response, anything that has fallen between the cracks. The result lives at the bottom of the Sunday briefing.
The discipline behind the routing is what matters. The Notification Agent surfaces. The domain agents process. It never decides what a clinical follow-up should look like, never rewrites a budget category, never composes an email. It is the routing layer, full stop.
The Morning Chain
The system's heartbeat is a dependency-ordered chain of scheduled runs that completes before I wake up:
05:30 daily backup
05:45 reputation engine dashboard check
06:00 morning briefing
06:20 daily task sync
06:35 meeting prep
The order is load-bearing. Writers of the task files must finish before the sync reads them; the sync must place focus blocks before meeting prep schedules around them. An 11:00 catchup runner walks the same chain in dependency order for mornings when the laptop stayed closed. None of this is exotic. It is cron jobs and file formats. The discipline is in the ordering and in what each stage is forbidden to touch.
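One way to express "walk the same chain in dependency order" is a topological sort over the stages. The sketch below uses Python's standard `graphlib`; the stage names and the strictly linear dependency edges are assumptions taken from the schedule above, and `catchup_order` is a hypothetical helper, not the real runner.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages that must finish before it starts.
# The edges are assumed to follow the clock order of the schedule.
CHAIN = {
    "daily_backup": set(),
    "reputation_dashboard_check": {"daily_backup"},
    "morning_briefing": {"reputation_dashboard_check"},
    "daily_task_sync": {"morning_briefing"},
    "meeting_prep": {"daily_task_sync"},
}


def catchup_order(already_done):
    """Return the stages still to run, in an order that respects CHAIN."""
    order = TopologicalSorter(CHAIN).static_order()
    return [stage for stage in order if stage not in already_done]


# Example: the laptop was closed after the first two stages ran.
pending = catchup_order({"daily_backup", "reputation_dashboard_check"})
```

Declaring the dependencies as data rather than as clock times means the catchup run cannot accidentally start the sync before its writers have finished.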
The Accidental Brain
Here is the part I did not plan. Months into running this, I noticed the fleet had organized itself into a shape I recognized from medical school.
| System component | What it resembles | Why the resemblance is structural |
|---|---|---|
| Notification Agent | Frontal lobe, executive function | Plans, sequences the morning chain, and inhibits: a "no cross-project mutation" rule plus four loop-prevention primitives are inhibition implemented as network protocol |
| Domain agents | Specialized cortical regions | Each owns one domain. The visual cortex does not do language; the budget agent does not do voice hardware |
| Voice agent (an 8B local model) | Motor and auditory cortex | Speech planning and comprehension with no reasoning. Its capability ceiling shapes the whole interface |
| Workflow relay (n8n) | Thalamus | Relays signals to the right region and decides nothing |
| Shared mailbox | Corpus callosum | Structured lateral connections between regions, mediated by the executive layer |
I want to be careful with this claim, because it is stronger when it is honest: I did not design any of this to mimic a brain. The resemblances are post hoc. The interesting conclusion is not that I am clever but that the same coordination pressures produce the same patterns. Put one planner above many specialists with limited bandwidth between them, and you converge on executive inhibition, regional specialization, and a relay layer, whether the substrate is neurons or cron jobs.
The convergence isn't limited to my house. A week before this piece, OpenAI shipped a feature it calls Dreaming: an asynchronous background process that synthesizes memory from many conversations at once, captures context that arose naturally, and updates older memories as circumstances change. Factual recall on their internal benchmark jumped from 41.5 percent to 82.8 percent across two iterations of it. Different team, different scale, same pressure. Consolidating state across many sessions without blocking the live path lands you on a background loop. They reached for the word "dream" for the same reason I reached for "brain". Convergent vocabulary follows convergent architecture.
The analogy also breaks where it should. Brains have no single state authority; memory is distributed, and the hippocampus indexes rather than stores. My system keeps canonical state in one durable database, which is nothing like anatomy and exactly why it is the right engineering call. The analogy's real value is as a design test: if a proposed flow has the mouth making judgment calls, the brain doing domain work, or one arm reaching into another, the design is wrong.
Sinabot: Voice Is a Terminal, Not a Reasoner
The voice layer runs entirely on one consumer GPU box in my house, driven by a custom Python orchestrator, about 2,900 lines, written because nothing off the shelf could drive my hardware end to end. There is no Home Assistant in the voice path, and that was a measurement, not an aesthetic: HA's assist pipeline assembles the entire LLM response before handing it to speech synthesis, which added roughly 650 milliseconds, so the orchestrator exists to stream sentence by sentence instead. It speaks two protocols: the Wyoming protocol to a Raspberry Pi satellite and the ESPHome native API to three Voice PE pucks, for a total of four satellites covering the rooms where life happens. Everything answers to "Sinabot," via two custom-trained wake word detectors: an on-device microWakeWord model on the pucks' ESP32-S3 chips (60 kilobytes on flash) and a server-side equivalent on the Pi, trained on the same phrase. Speech-to-text is WhisperLive running a TensorRT-compiled Whisper small.en on GPU. The language model is Llama 3.1 8B quantized to INT4, served by vLLM at roughly 89 tokens per second in about 7 GiB of VRAM. Text-to-speech is Kokoro, an 82 million parameter model with a British male voice that sounds appropriately like a chief of staff. The gap between the moment I stop talking and the first audible word of the reply is about 300 milliseconds, and the rest of the answer streams sentence by sentence while the language model is still generating it.
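The sentence-by-sentence streaming idea can be illustrated with a small generator that releases each sentence to speech synthesis as soon as it is complete. This is a sketch under assumptions, not the orchestrator's code: `stream_sentences` is a hypothetical name, and the naive punctuation rule would mis-split abbreviations such as "Dr." in a real pipeline.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace (deliberately naive).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(token_stream):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when generation stops


tokens = ["Good", " morning", ".", " You", " have", " two", " items", ".",
          " One is overdue"]
spoken = list(stream_sentences(tokens))
```

The first sentence is handed over while later tokens are still arriving, which is what lets synthesis start before the model has finished generating.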
The VRAM budget deserves to be a character in this story: 11.7 of 12.3 GiB committed with all three engines warm, about 155 MiB free. That kind of headroom is a threat, and it made most of the decisions. A fancier speech model ran the card out of memory and was sent away. FP8 Llama benched both slower and larger than INT4, which is not how that comparison is supposed to go, and lost. A multimodal model could not fit at any quantization because its encoders refuse to shrink. The single biggest latency win of the project, moving text-to-speech from CPU to GPU, cut time-to-first-audio from 370 to 108 milliseconds, and was only possible because everything else had already been starved small enough to leave it room. Constraint-driven design sounds like deprivation, but it produced the system's best rule:
Voice is a terminal, not a reasoner. The 8B model classifies intent, extracts arguments, and speaks short acknowledgments. It never composes content. The supporting pattern: a plain Python job walks every project's task file and compiles a voice context package (daily.json) of pre-baked, TTS-ready strings, zero LLM tokens spent, so that for any query the model receives only the one string it should speak, thirty to sixty tokens, and reads it verbatim. It never sees the underlying data. This eliminates an entire failure class, hallucination by paraphrase, by moving formatting upstream to the single state authority. A small model that only reads is more trustworthy than a large model that improvises. (Full disclosure for fellow builders: as I write this, the refresh job behind that package is paused mid-rewiring, so the pattern is shipped but dormant. The lesson it taught survives below.)
{
  "generated_at": "2026-05-17T04:30:41Z",
  "projects": {
    "family-budget": {
      "open_count": 43,
      "overdue_count": 1,
      "top_priority": "Reconcile June statements against the cashflow tracker",
      "top_priority_category": "overdue",
      "top_priority_meta": { "days_late": 2 }
    }
  },
  "tts": {
    "by_project": {
      "family-budget": "Top priority for Family Budget: Reconcile June statements, 2 days late. 43 items open, 1 overdue."
    },
    "summary": "186 items open across 8 projects, 6 with overdue."
  }
}
The schema itself records a lesson. The first version answered "what's on my list" with the list: a wall-of-text monologue, nearly six hundred characters, URLs read aloud and all. Nobody wants that in their kitchen at 7 a.m. The redesign primes exactly one task per project (overdue beats today beats this week, most days late first), caps the cross-project answer at the worst three, and waits to be asked for more. "What's my top priority" was never really a request for an enumeration. It was a request for a verdict.
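The priming rule (overdue beats today beats this week, most days late first, capped at the worst three) reduces to a sort key. The sketch below is illustrative: the task dictionaries, the `"today"` and `"this_week"` category labels, and the function names are assumptions; only `"overdue"` and `days_late` appear in the package shown above.

```python
# Lower rank is more urgent. Only "overdue" is taken from the package above.
CATEGORY_RANK = {"overdue": 0, "today": 1, "this_week": 2}


def sort_key(task):
    """Category rank first, then most days late first."""
    return (CATEGORY_RANK[task["category"]], -task.get("days_late", 0))


def top_priority(tasks):
    """Pick the single task to prime for one project, or None."""
    candidates = [t for t in tasks if t["category"] in CATEGORY_RANK]
    return min(candidates, key=sort_key) if candidates else None


def worst_three(tasks_by_project, limit=3):
    """Cap the cross-project answer at the `limit` worst primed tasks."""
    primed = []
    for project, tasks in tasks_by_project.items():
        best = top_priority(tasks)
        if best is not None:
            primed.append((sort_key(best), project, best["title"]))
    primed.sort()
    return [(project, title) for _, project, title in primed[:limit]]


tasks_by_project = {
    "family-budget": [
        {"title": "Reconcile statements", "category": "overdue", "days_late": 2},
        {"title": "Review categories", "category": "today"},
    ],
    "work": [{"title": "Send deck", "category": "today"}],
    "voice-hub": [
        {"title": "Retrain wake word", "category": "overdue", "days_late": 5},
    ],
    "family": [{"title": "RSVP to birthday", "category": "this_week"}],
}
verdict = worst_three(tasks_by_project)
```

Because the ranking happens here, in plain Python, the voice model never has to decide what matters; it only reads the winner.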
How Agents Talk Without Stepping on Each Other
Multi-agent systems fail at the seams, so the seams got the most design attention. Agents exchange messages through a unified mailbox on the GPU box's network share: one markdown file per message, named YYYY-MM-DDTHH-MM-SS_, written to a .tmp path and renamed into place so a message either exists completely or not at all. The frontmatter exists almost entirely to prevent loops:
---
message_id: <uuid> # also returned to voice as job_id
parent_message_id: <uuid or null> # null only if a human started this
from: <agent_slug>
to: <agent_slug>
intent: <optional verb> # informational only, never used for routing
hop: <int> # n8n stamps 1 on voice escalations; +1 per forward
created_at: <iso8601>
idempotency_key: <optional>
---
# Body
Free-form markdown. For voice-bound messages: TTS-ready plain
language, one to four sentences, no markup.
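The write-to-`.tmp`-then-rename delivery described above can be sketched with the standard library. The `deliver` function and its parameters are hypothetical, and the filename is left to the caller. One caveat worth stating: `os.replace` is atomic on a local POSIX filesystem, but the guarantee on a network share depends on the share's protocol and server, so treat that part as an assumption to verify.

```python
import os
from pathlib import Path


def deliver(mailbox: Path, filename: str, text: str) -> Path:
    """Write a message so readers see either the whole file or nothing."""
    final_path = mailbox / filename
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(text)
        handle.flush()
        os.fsync(handle.fileno())  # push the bytes to disk before renaming
    # Rename into place; a reader never observes a half-written message.
    os.replace(tmp_path, final_path)
    return final_path
```

A reader that ignores `*.tmp` names and only ever opens final filenames gets the all-or-nothing behavior without any locking.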
Four primitives keep the fleet from talking itself into a storm, and they will look familiar if you have ever read about how internet routers avoid the same fate (BGP solves loop prevention with an AS_PATH list that does roughly what my causality chain does):
- Causality chain. Every message carries its ancestry (parent_message_id, with every state mutation journaled against the message that caused it). A recipient that finds itself in a candidate's ancestry rejects it. You cannot argue with your own echo.
- Hop counter. Messages attenuate. Anything past three hops is dropped, no matter how interesting it thinks it is.
- Idempotency keys. Mutations carry a key enforced by database unique constraints; processing the same message twice is a no-op, which makes retries safe and duplicate delivery boring.
- Agent of record. Every task has exactly one owner. If a message's sender matches the task's creator, the recipient rejects it rather than route the loop back home. Self versus non-self recognition: the difference between an immune system and an autoimmune disease.
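The four checks can be combined into one inbox filter. The following is a sketch under stated assumptions rather than the published implementation: the `Message` class mirrors the frontmatter fields, `journal` is a hypothetical in-memory map of known messages, a plain set stands in for the database unique constraint, "finding itself in the ancestry" is read as "the recipient sent one of the ancestor messages", and the agent-of-record check is a literal reading of the rule as worded.

```python
from dataclasses import dataclass
from typing import Optional

MAX_HOPS = 3  # anything past three hops is dropped


@dataclass
class Message:
    message_id: str
    parent_message_id: Optional[str]
    sender: str      # the "from" field (renamed: `from` is a Python keyword)
    recipient: str   # the "to" field
    hop: int
    idempotency_key: Optional[str] = None


def ancestry(msg, journal):
    """Walk parent links toward the root, yielding each known ancestor."""
    seen = set()
    parent_id = msg.parent_message_id
    while parent_id is not None and parent_id not in seen:
        seen.add(parent_id)
        parent = journal.get(parent_id)
        if parent is None:
            break
        yield parent
        parent_id = parent.parent_message_id


def reject_reason(msg, me, journal, seen_keys, task_creator=None):
    """Return why agent `me` should drop `msg`, or None to accept it."""
    if msg.hop > MAX_HOPS:
        return "hop limit exceeded"
    if any(ancestor.sender == me for ancestor in ancestry(msg, journal)):
        return "recipient already in ancestry"
    if msg.idempotency_key is not None and msg.idempotency_key in seen_keys:
        return "duplicate idempotency key"
    if task_creator is not None and msg.sender == task_creator:
        return "sender is the task's agent of record"
    return None


journal = {
    "m1": Message("m1", None, "budget", "notify", hop=1),
    "m2": Message("m2", "m1", "notify", "family", hop=2),
}
echo = Message("m3", "m2", "family", "budget", hop=3)
```

Here `echo` is a reply that would land back on the agent that started the thread, so the budget agent rejects it on the ancestry check before it can restart the conversation.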
Enforcement is deliberately distributed: each agent runs these checks on its own inbox at session start, and the executive layer runs a weekly sweep as the safety net for anything that slipped through. A centralized gatekeeper would have been one more component and one more timing dependency; the hop cap bounds the blast radius regardless of where it is checked.
On top of the plumbing sit two social rules. Surface but don't process: the Notification Agent sees all six domains and flags stuck items, but never does another agent's domain work. Actions as proposals: no agent ever mutates another's state. It writes a proposal to the recipient's mailbox, and the recipient's own agent decides. Cross-project visibility without cross-project authority. Autonomy is what makes the loop prevention meaningful rather than decorative.
The Observability Layer
Here is the angle I have not seen in other personal-AI writeups. Most of these systems monitor the human. A real chief of staff also monitors the staff: my other automations, the n8n workflows, the cron schedules, the webhooks, are themselves fallible, and the Notification Agent treats them as peers to be checked, not infrastructure to be trusted.
One day made the case. On May 2nd, the daily check flagged a publishing schedule slip on one of my websites. False alarm: the tracker did not know that site was intentionally gated by a ramp-up rule. We taught the system to check the gating flags before alarming. The same morning, the same check caught a real one: a publish event had bypassed the logging webhook, and the scheduling state had silently drifted. Without the check, the next month of publish timing would have been quietly wrong. One false alarm and one real catch, same day. The false alarm taught the system; the real catch earned my trust. That pair closed the gap that matters most in personal automation, the gap between "I built a system" and "I trust the system."
The same posture runs all the way down the stack. The voice box refuses to start its orchestrator until a fifteen-check pre-flight passes: is the microphone actually producing audio, are all three models actually answering, is the VRAM headroom actually there. It fixes what it can fix (a stuck USB mic, a muted mixer, a dead wake-word service) and declines to launch on what it cannot. Nothing in this system is trusted because it was configured. It is trusted because it was checked.
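The probe, repair, re-probe shape of a pre-flight like this is simple to sketch. Everything named here (`Check`, `preflight`, the two example checks) is hypothetical; the real fifteen checks talk to audio devices and model servers, which the example replaces with a small dictionary.

```python
from typing import Callable, NamedTuple, Optional


class Check(NamedTuple):
    name: str
    probe: Callable[[], bool]                    # True when healthy
    repair: Optional[Callable[[], None]] = None  # best-effort fix, if any


def preflight(checks):
    """Run every check; return the names that still fail after repair."""
    failures = []
    for check in checks:
        if check.probe():
            continue
        if check.repair is not None:
            check.repair()
            if check.probe():  # trust the re-check, not the repair
                continue
        failures.append(check.name)
    return failures


# Stand-in state instead of real hardware.
state = {"mixer_muted": True, "vram_free_mib": 40}
checks = [
    Check("mixer", lambda: not state["mixer_muted"],
          lambda: state.update(mixer_muted=False)),
    Check("vram_headroom", lambda: state["vram_free_mib"] >= 100),
]
failures = preflight(checks)  # launch only when this list is empty
```

The repaired check passes only because it was probed again afterward, which is the "checked, not configured" posture in miniature.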
This is also where the capability ceiling stops being about hardware. Once agents can watch agents, propose to agents, and teach agents, the system is limited less by compute than by what you can specify. The Green Lantern rule: the ring is only as good as the imagination holding it.
What I've Learned (Mostly by Breaking It)
The brochure version ends above. The engineering version is below, with dates, because the reliability lessons are the actual product.
Idempotency is the whole game
In April, the task sync started multiplying my tasks. A cosmetic change to title formatting broke the matcher, and a dictionary keyed by title collapsed duplicates so the code could not even see what it had done. The fix is unglamorous and load-bearing: every comparison goes through a normalizer that strips every prefix the system has ever used, and a cleanup pass every cycle asserts the world looks the way the ledger says it should. The matcher has to remember every naming scheme the system has ever used, not just the current one, because the external mirror remembers your old mistakes even after you have reformed. If an agent writes to a system twice, the second write must cost nothing. Everything else is built on that.
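A normalizer of this kind might look like the sketch below. The two prefix patterns are invented examples of "every naming scheme the system has ever used"; the real list, and the names `normalize` and `plan_sync`, are assumptions for illustration.

```python
import re

# Hypothetical legacy prefixes; the real system would list every scheme
# it has ever written to the external mirror.
LEGACY_PREFIXES = [
    r"\[[^\]]+\]\s*",    # e.g. "[family-budget] "
    r"[A-Za-z-]+:\s+",   # e.g. "budget: "
]
_PREFIX_RE = re.compile(r"^(?:" + "|".join(LEGACY_PREFIXES) + r")+")


def normalize(title):
    """Reduce a task title to the form used for matching."""
    stripped = _PREFIX_RE.sub("", title.strip())
    return " ".join(stripped.split()).casefold()


def plan_sync(ledger_titles, mirror_titles):
    """Return (titles to create, mirror entries that are duplicates)."""
    seen = {}
    duplicates = []
    for title in mirror_titles:
        key = normalize(title)
        if key in seen:
            duplicates.append(title)  # keep the first copy, flag the rest
        else:
            seen[key] = title
    to_create = [t for t in ledger_titles if normalize(t) not in seen]
    return to_create, duplicates


to_create, duplicates = plan_sync(
    ["Reconcile June statements", "Book dentist"],
    ["[family-budget] Reconcile  June statements",
     "budget: Reconcile June statements"],
)
```

Unlike a dictionary keyed by raw title, this version reports the duplicates it finds, so the cleanup pass can see what went wrong rather than silently collapsing it.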
Guards must fail loud
A one-line guard silently fell through past the success path, so candidate events "looked placed" and both counters in the morning report were confidently wrong. Two weeks later I found five task entries had been invisibly dropped for weeks because they used a date token the parser did not recognize and discarded without comment. Same lesson twice: in a machine-parsed format that humans edit, anything unparseable must be an error, never a skip. When a system that expects six of something finds zero, the right interpretation is "I am broken," and the destructive phase that follows has to gate on that.
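Both halves of that lesson fit in one strict parser. The `YYYY-MM-DD | commitment` line format below is a hypothetical stand-in, since the real task-file format is not shown here; `TaskFileError` and `parse_task_file` are likewise illustrative names.

```python
from datetime import date


class TaskFileError(ValueError):
    """Raised when a task file cannot be trusted."""


def parse_task_file(text, expected_min=1):
    """Parse 'YYYY-MM-DD | commitment' lines; never skip a bad line."""
    tasks = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are the only allowed skips
        token, sep, commitment = line.partition("|")
        if not sep or not commitment.strip():
            raise TaskFileError(f"line {lineno}: missing commitment: {raw!r}")
        try:
            due = date.fromisoformat(token.strip())
        except ValueError as exc:
            raise TaskFileError(
                f"line {lineno}: unrecognized date {token.strip()!r}"
            ) from exc
        tasks.append((due, commitment.strip()))
    if len(tasks) < expected_min:
        # Finding nothing means "I am broken", not "there is nothing to do".
        raise TaskFileError(
            f"expected at least {expected_min} tasks, found {len(tasks)}"
        )
    return tasks
```

Any destructive phase downstream should run only after this function returns, so an unreadable file stops the cycle instead of looking like an empty to-do list.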
The day the assistant read its own JSON aloud
One evening in May I asked the voice assistant to handle something, and it replied, in its dignified British baritone, with the literal text of a tool call: brace, quote, utterance, colon. The 8B had emitted escalate({...}) as plain text instead of a structured call, and the pipeline did exactly what it is built to do with text: it spoke it. The fix came in layers, but the durable one was upstream of the model. Tools are now keyword-gated before the model ever sees them: if your utterance contains nothing that could plausibly want a tool, the tool is not in the model's tool list for that turn. You cannot hallucinate a tool you were never offered. The general lesson is upstream of prompting: give small models fewer opportunities to go wrong.
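Keyword gating can be as small as a set intersection performed before the model is called. In this sketch only the `escalate` tool name comes from the incident above; the `read_priority` tool, the keyword sets, and `tools_for` are hypothetical.

```python
# Tool name -> words that could plausibly signal a need for that tool.
TOOL_KEYWORDS = {
    "escalate": {"escalate", "ask", "tell", "send", "remind"},
    "read_priority": {"priority", "priorities", "overdue", "list"},
}


def tools_for(utterance):
    """Return only the tools whose keywords appear in the utterance."""
    words = {w.strip(".,!?\"'").lower() for w in utterance.split()}
    return sorted(
        name for name, keys in TOOL_KEYWORDS.items() if words & keys
    )


offered = tools_for("Good morning, Sinabot.")  # no keyword, so no tools
```

The list returned here is what would be passed as the model's tool list for that turn; for small talk it is empty, so there is nothing to emit as a malformed call.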
Agents reviewing each other's specs is the multi-agent payoff in miniature
In May, the Notification Agent's spec assumed a transport the voice system does not speak and an n8n instance that did not exist. The Voice Hub agent, reviewing the spec, knew both facts cold. Catching assumption rot at spec stage cost a paragraph; catching it after integration would have cost a weekend. The cheapest QA I have is one agent reading another agent's plan.
And one honest note on autonomy. This system runs unattended every morning, but destructive operations have exactly one owner: the supervised canonical morning run. The catchup runner keys off observed state rather than declared flags, which makes it robust to missed runs but means it would happily ignore an intentional pause. I know that. The strength and the footgun are the same design choice, and I chose it with eyes open.
What It Costs
In April I estimated the daily task sync at about 67,000 tokens per run. In June I measured it properly: 64,000. I am still framing that one. The monthly estimate, though, was wrong in an instructive way, because "tokens per month" turns out to be three different questions.
Footprint, the final context plus outputs of each run, totals about 8 million tokens a month across the whole fleet of eleven scheduled tasks. That is the number that matches my old guess of 6.2 million, and it is the least true. Cumulative processed is the agentic reality: every one of a run's API turns re-reads the entire conversation so far, so the daily sync's 66 turns process about 2.6 million tokens per run, and the fleet processes roughly 190 million a month. Billed-equivalent is what prompt caching makes of that: cached context re-reads cost about a tenth of fresh tokens, which tames 190 million processed to roughly 26 million. Processed is about 24x the footprint, and caching cuts the billed figure to roughly a seventh of processed. That 7x gap between processed and billed is the entire economic case for prompt caching, measured on my own kitchen-table workload.
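The three questions can be computed from one list of per-turn context sizes. This is a simplified model, not the published methodology: `run_cost` is a hypothetical helper, output tokens are ignored, every token seen on the previous turn is assumed to be a cache hit, and the one-tenth discount is the approximate figure quoted above.

```python
def run_cost(turn_contexts, cache_discount=0.1):
    """Return (footprint, processed, billed_equivalent) for one run.

    turn_contexts[i] is the context size, in tokens, sent on turn i.
    """
    footprint = turn_contexts[-1]   # final context only
    processed = sum(turn_contexts)  # every turn re-reads everything so far
    billed = 0.0
    previous = 0
    for size in turn_contexts:
        fresh = max(size - previous, 0)  # tokens not seen on the prior turn
        cached = size - fresh            # tokens re-read from the cache
        billed += fresh + cached * cache_discount
        previous = size
    return footprint, processed, billed


# Toy run: three turns, context growing by 10 tokens per turn.
footprint, processed, billed = run_cost([10, 20, 30])
```

Even in the toy run the ordering is visible: footprint is the smallest number, processed the largest, and billed sits between them once re-reads are discounted.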
Three findings from the measurement surprised me. First, cost concentrates brutally: two tasks (the daily task sync and the publishing-system health check) are 80 percent of total spend. Optimizing anything else is theater. Second, instructions are a per-turn tax: one project's 32KB context file rides along on all 48 turns of every health check, roughly 12 million tokens a month of pure repeated instructions. Trimming that file is the cheapest optimization in the system. Third, turn count beats prompt size: cumulative cost scales with the square of the number of turns, so halving a 66-turn run saves more than any prompt diet ever will.
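The "square of the number of turns" claim follows from summing a context that grows every turn. A minimal sketch, assuming a constant growth of 1,000 tokens per turn (an invented figure) and a hypothetical `cumulative_tokens` helper:

```python
def cumulative_tokens(turns, growth_per_turn, base=0):
    """Tokens processed when every turn re-reads the whole conversation.

    Turn t carries base + growth_per_turn * t tokens, so the total is
    base * turns + growth_per_turn * turns * (turns + 1) / 2.
    """
    return sum(base + growth_per_turn * t for t in range(1, turns + 1))


full = cumulative_tokens(66, 1000)  # a 66-turn run
half = cumulative_tokens(33, 1000)  # the same run in half the turns
ratio = full / half
```

Under this model, halving the turn count cuts processed tokens by nearly a factor of four rather than two, because the sum grows with `turns * (turns + 1) / 2`.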
The levers I actually pulled, in order of effect: rewrote the voice-context generation as pure Python (zero tokens, every hour, forever); added a skip-when-equal comparison that turned roughly 88 no-op task updates per cycle into actual no-ops; cut an oversight cron from 24 firings a day to 15, then paused it entirely when its value did not justify its spend; and adopted a standing rule that anything 90 percent deterministic and 10 percent phrasing gets rewritten as Python with a thin prompt, or dropped. The voice path itself costs nothing by design: the local 8B never calls a cloud API at all.
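The skip-when-equal comparison is a plain equality check made before any write. In this sketch the task dictionaries and `plan_updates` are hypothetical; the point is only that unchanged tasks never reach the API.

```python
def plan_updates(desired, current):
    """Return only the tasks whose mirrored fields actually differ."""
    updates = {}
    for task_id, want in desired.items():
        have = current.get(task_id)
        if have != want:  # equal state means no write at all
            updates[task_id] = want
    return updates


desired = {
    "t1": {"title": "Reconcile statements", "due": "2026-05-17"},
    "t2": {"title": "Book dentist", "due": "2026-05-20"},
}
current = {
    "t1": {"title": "Reconcile statements", "due": "2026-05-17"},
    "t2": {"title": "Book dentist", "due": "2026-05-19"},
}
updates = plan_updates(desired, current)
```

Only the task whose due date changed is returned, so a cycle in which nothing moved issues no writes.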
What's Next
Three things, none of which exist yet, all of which the architecture has a slot for. A web dashboard, once the mechanics are boring enough to deserve pixels (the current friction is the spec). Push notifications, for the rare item that should not wait for morning. And the one I keep circling: a therapist agent as the system's missing amygdala. Today the system detects logical urgency only, deadlines and overdue counts. It has no analog for salience: "this task got deferred four times this week," or "your stated thesis is mental space, but your calendar shows eighteen focus blocks." A region that does not act but biases attention, flagging the patterns the rational planner fails to weight. The brain analogy keeps earning its keep by telling me what is missing.
Open Source
The coordination layer is published at github.com/sinabarimd/chief-of-staff: the full voice orchestrator (about 2,900 lines, including the ESPHome-to-Whisper bridge, the sentence-streaming pipeline, and the boot-time pre-flight), the four loop-prevention primitives, the mailbox schema, the morning-chain layout, the voice-context package generator, the custom wake-word models, and sanitized excerpts of the sync engine, plus the token cost methodology so you can measure your own fleet instead of guessing. Same spirit as the Reputation Engine I published earlier this year: a reference implementation. Names, tokens, and anything resembling client or patient information are stripped.
Who Is This For?
For people who recognize the feeling in the first paragraph: nothing is on fire, and you are still spending a meaningful fraction of your mind keeping the inventory. The technology here is unremarkable on purpose: cron, files, one mid-sized GPU, small models doing small jobs. What changed my life was custody. Every domain has an owner that is not me. The briefing happens whether I open the laptop or not. The kitchen answers questions so I do not have to go check.
The juggling act still exists. It is just no longer running on me. That is the entire return on investment, and I would not trade it for any benchmark score: mornings where the only thing I am tracking is breakfast.
Dr. Sina Bari, MD is a Stanford-trained plastic and reconstructive surgeon and VP of Medical AI at iMerit. He writes about medicine, technology, and building things at sinabarimd.com.