<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">
  <channel>
    <title>jonnonz.com</title>
    <link>https://jonnonz.com/</link>
    <atom:link href="https://jonnonz.com/feed.xml" rel="self" type="application/rss+xml"/>
    <description>Mirror of jonno.nz — John Gregoriadis</description>
    <lastBuildDate>Mon, 04 May 2026 10:11:07 GMT</lastBuildDate>
    <language>en</language>
    <generator>Lume v2.4.2</generator>
    <author>
      <name>John Gregoriadis</name>
      <uri>https://jonnonz.com</uri>
    </author>
    <item>
      <title>Product market fit isn't a stage, it's a gauntlet</title>
      <link>https://jonnonz.com/posts/product-market-fit-is-a-gauntlet/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/product-market-fit-is-a-gauntlet/</guid>
      <description>
        PMF gets sold as a milestone. It's actually a gauntlet that bends founders, breaks teams, and quietly poisons technical decisions. Most of the damage isn't from missing PMF — it's from how you behave while looking for it.
      </description>
      <content:encoded>
        <![CDATA[<p>Product market fit gets sold as a milestone. Find it and you're off to the
races. That's the bit nobody who's been through it actually believes.</p>
<p>PMF is a gauntlet. It eats teams, it bends founders, and it quietly poisons the
technical decisions you're proud of at the time. Most of the damage I've watched
done to good companies wasn't from missing PMF. It was from how they behaved
while they were looking for it.</p>
<p><img src="https://jonnonz.com/img/posts/pmf-gauntlet/gauntlet-loop.svg" alt="The PMF gauntlet loop — Vision feeding Hypothesis, Ship, Market signal, and Adapt, with three drag forces (rigid roadmap, comprehension debt, over-scaled architecture) pulling on the loop."></p>
<h2>It's not for everyone, and that's fine</h2>
<p>There's a particular kind of person who does well in the pre-PMF phase. High
tolerance for ambiguity, low need for closure. The deck never feels finished,
the metric you're chasing changes every six weeks, and the answer to &quot;what are
we doing in three months&quot; is &quot;depends.&quot;</p>
<p>Plenty of really good operators just cannot function in that environment. That's
not a character flaw — it's a stage mismatch. Some people thrive at zero to one.
Some thrive at one to ten. Almost no one thrives at both, and the industry
pretending otherwise has cost a lot of careers and a lot of sanity.</p>
<p>Naming this honestly so people can self-select is one of the kindest things a
founder can do. You're not letting someone down by saying &quot;this stage probably
isn't for you, but the next one will be.&quot; You're saving them eighteen months of
feeling broken.</p>
<h2>The variables you don't control will eat you alive</h2>
<p>Timing, market, economy, what your one big regulator decides on a Tuesday — none
of that is yours. You can have a sharp thesis and a great team and ship
something nobody buys, because something three layers above you shifted while
you were heads-down.</p>
<p>The only protection I've found against this is a vision the team is genuinely
bought into. Not the slide. Not the wall poster. The actual reason you all got
out of bed this morning. When the macro turns and the metric you were proud of
last quarter goes sideways, that vision is what stops the org from devouring
itself.</p>
<p>I've watched startups where the thesis was right but the timing was a year early
lose half their team in three months because nobody could explain why they were
still doing what they were doing. It wasn't a strategy problem. It was an
alignment problem dressed up as a strategy problem.</p>
<p>This is also where founders take the most damage personally. You can do
everything well and still get hit by something nobody could have predicted. If
your sense of self is tied to PMF being a verdict on you, that breaks people.
The ones I've seen come through it healthy treated PMF like weather they were
navigating, not a test they were passing.</p>
<h2>Agility is the actual moat at this stage</h2>
<p>Your moat isn't the product. It isn't the tech. It definitely isn't the brand.
Your moat is how fast the org can spot a shift in TAM or target market and
translate it into a product move.</p>
<p>Days, ideally. Weeks if you have to. Not quarters.</p>
<p>This is where the technical decisions made in the name of &quot;scaling&quot; quietly
cripple you. The microservices you split out before you needed to. The custom
infrastructure someone stood up because their last job had it. The platform
abstractions that mean a small UI change touches four repos. Each of those felt
disciplined at the time. Each is now a tax on the only thing you actually have —
speed.</p>
<p>Andreessen wrote the
<a href="https://pmarchive.com/guide_to_startups_part4.html">original PMF essay</a> almost
twenty years ago, and the line that's aged best is the bit about doing whatever
it takes — changing people, rewriting the product, moving markets. That's not a
license to be chaotic. It's a reminder that the org needs to be physically
capable of those moves. If your architecture, process, or contracts make
rewriting the product a six-month project, you've already lost the gauntlet
whether you know it yet or not.</p>
<p>I've got a strong opinion on this one: when in doubt, build it boring. Boring is
fast to change.</p>
<h2>The first cohort is a dance, and you have to lead</h2>
<p>The customers who signed up first kept you alive. They also signed up for a
slightly different company than the one you're now trying to become. That gap is
where a lot of startups quietly die.</p>
<p>Keep them too happy and you slow your evolution. Push too hard toward the new
vision and you churn the cohort that's funding your runway. The actual job is to
do both at once, which is why I sometimes call it internal schizophrenia. You're
a different company to them than you are to yourselves, and that's not a bug —
that's the mode you're operating in.</p>
<p>The skill is being honest with the early cohort about where you're going without
selling them something they didn't buy. The art is using their feedback to
sharpen the bigger vision rather than letting yourself be pulled back into being
their bespoke vendor. The dance is doing both of those without your team
thinking you've gone off-piste, because the gap between &quot;what we're shipping
today&quot; and &quot;where we're going&quot; looks weird from the inside.</p>
<h2>Where PMF teams quietly self-sabotage</h2>
<p>Three patterns I keep seeing.</p>
<p><img src="https://jonnonz.com/img/posts/pmf-gauntlet/discipline-vs-fragility.svg" alt="Discipline vs fragility — three patterns where what looks like discipline (microservices on day one, rigid quarterly planning, founder comprehension debt) becomes fragility (can't pivot, defending old assumptions, context never reaches the team)."></p>
<p>Engineering over-scales the architecture. The team builds for the company they
want to be in two years instead of the company they need to be this quarter. By
the time PMF actually shows up, the org can't move. Worse, the engineers feel
busy and capable the whole time it's happening, which is why it's so hard to
stop. Nobody is asking to slow down — everyone is shipping.</p>
<p>Product holds the roadmap too tightly. The roadmap <em>is</em> the experiment at this
stage. Treating it like a commitment is a category error. The product teams I've
seen do this well treat the roadmap like a hypothesis with version numbers —
last month's was wrong, this month's is less wrong, and that's how it's supposed
to feel. The ones who don't end up defending decisions they made when they knew
less.</p>
<p>Founder comprehension debt builds up faster than anyone notices. The founder is
heads-down on signal — every customer call, every dropped deal, every weird
pattern in the data lands in their head and gets metabolised on the spot. The
team is two beats behind, working from last week's mental model. Each individual
delay feels minor. The cumulative gap is the thing that kills decisions.</p>
<p>Each of these looks like discipline from the inside. Each of these is fragility
wearing discipline's clothes.</p>
<h2>AI changes the moat conversation, not the gauntlet</h2>
<p>Moats in the AI space are shifting quarter by quarter right now. Feature moats
have basically collapsed — anything you can describe in a screenshot can be
cloned in a weekend with the current generation of tools. What's
<a href="https://www.latitudemedia.com/news/in-the-age-of-ai-can-startups-still-build-a-moat/">actually defensible has moved</a>
toward proprietary data, deeply embedded workflows, distribution, trust, and
regulatory positioning.</p>
<p>For a founder in the PMF gauntlet that means the playbook is unreliable in a way
it wasn't five years ago. You can't just lift what worked for the last cohort of
SaaS winners and run it. You have to reason from first principles about where
your actual edge is going to come from over the next eighteen months, and place
chips accordingly.</p>
<p>The gauntlet itself hasn't changed. The chips you're placing have. That's harder
than it sounds, because most of us were trained in an era when the moat
conversation was settled.</p>
<h2>The unglamorous work that decides whether you survive</h2>
<p>The thing nobody tells you is that the founder or leader's most important job
during the PMF stretch isn't strategy or product or sales. It's getting the
context that's in your head out into the org while everyone is running at five
thousand miles an hour.</p>
<p>You will not feel like you have time for this. You won't. You have to carve it
out anyway. The teams I've seen come through PMF intact are the ones whose
leaders forced themselves to stop, write things down, repeat themselves more
than felt necessary, and trust that the slowdown was the work.</p>
<p>The teams that don't make it tend to look back and realise everyone was busy and
nobody knew why.</p>
]]>
      </content:encoded>
      <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Change management</title>
      <link>https://jonnonz.com/posts/change-management/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/change-management/</guid>
      <description>On the personal version of change management — the long, weird middle bit between who you were and who you're becoming.</description>
      <content:encoded>
        <![CDATA[<p>There's a whole business discipline called change management. Frameworks,
certifications, consultancies, the lot. Every big company has someone running it
during a restructure or a tech migration. Nobody runs it for you when your life
turns over.</p>
<p>Which is strange, because the personal version is the harder problem — and right
now, more people are facing it than at any point in recent memory.</p>
<p>More than
<a href="https://www.cnbc.com/2026/04/24/20k-job-cuts-at-meta-microsoft-raise-concern-of-ai-labor-crisis-.html">92,000 tech workers have been laid off in 2026 alone</a>,
bringing the total close to 900,000 since 2020. Meta cut 8,000 jobs last week.
Microsoft offered buyouts to 7% of its US workforce — the first time in its
51-year history. Oracle has started cuts that could reach 30,000 by year end.
Closer to home, Xero, Sharesies, Spark, One NZ and Eroad have all run their own
rounds. AI is the headline reason, but the impact lands the same regardless of
the cause: hundreds of thousands of people closing a laptop and discovering
their working identity has just been deleted.</p>
<p>That's a lot of people being handed a forced version of personal change
management without ever signing up for the course.</p>
<p>The business framing has the right insight buried in it.
<a href="https://wmbridges.com/about/what-is-transition/">William Bridges</a> made a
distinction in the 90s that most people miss: change is external, transition is
internal. Change is the new org chart, the redundancy email, the merger.
Transition is what happens inside people's heads while all that is going on.
Change can happen overnight. Transition takes as long as it takes.</p>
<p>Personal change management is just transition without the org chart.</p>
<p>I've been through a few years of it now. Not one big event — more like a slow
stack of endings, some chosen, some not. Companies, relationships, versions of
myself I'd been building for a decade. The kind of stretch where you don't
really notice you're changing until you look up one day and the old you is gone.</p>
<p>That's the part nobody warns you about. Real change isn't transformation. It's a
controlled demolition followed by a slow rebuild, with a long, weird middle bit
where neither the old you nor the new you is really there.</p>
<h2>Something has to die</h2>
<p>The thing that goes is usually the organising self. Whatever the old you was
arranged around — a fear, a need for approval, a story about who you had to be,
an ambition that was really a wound. When that goes, the structure it was
holding up collapses. That's the death. It's real.</p>
<p>What survives is everything that wasn't load-bearing on the old arrangement.
Your humour, your curiosity, the way you actually see people, the things you
genuinely care about. Those don't die because they weren't propping anything up.
They were just you, underneath.</p>
<p>The disorienting part is feeling like a stranger to yourself and entirely
continuous, at the same time. Both are true. The continuous parts are
continuous. The organising self is gone. You're in between.</p>
<h2>The middle is the work</h2>
<p>Bridges calls this the neutral zone. The old reality has gone, the new one isn't
there yet. He says it's the hardest phase to manage, and most organisations rush
through it because it looks unproductive. People do the same thing to
themselves.</p>
<p>The temptation is to build a new identity fast, because the empty space is
uncomfortable. Don't. Whatever you grab in a hurry will be made of whatever was
lying around — which usually means the old patterns sneak back in wearing new
clothes. Workaholism becomes &quot;building my legacy&quot;. Approval-seeking becomes
&quot;being of service&quot;. Avoidance becomes &quot;protecting my peace&quot;. Same machine, new
paint.</p>
<p>The test is always: is this coming from fear or from truth? You'll know. The
body knows before the mind does. Pay attention to the part of you that goes
quiet around certain people, certain projects, certain decisions. That's the
signal.</p>
<h2>Fearlessness is a side effect</h2>
<p>You don't get to fearless by trying. You get there by going through enough
endings that the bluff stops working.</p>
<p>Fear runs on a specific con: <em>if this thing happens, you won't survive it</em>. Not
literally die — but the you that exists now won't continue. You'll be broken,
finished, unrecognisable. The con works as long as it's untested. Then the thing
happens, and you go through it, and on the other side you notice you're still
here. Different, scarred, but continuous. The fear was lying about its hand.</p>
<p>After that, fear can still show up — it doesn't leave — but it can't run the
same con. You've seen the card it was holding. Next time it says <em>you won't
survive this</em>, some quiet part of you knows: I already did.</p>
<p>That's not the absence of fear. It's knowing you can act from what's true even
with the fear in the room.</p>
<h2>What's on the other side is ordinary</h2>
<p>Here's the bit that surprised me. Once the demolition is done and the rebuild
starts, what comes back isn't impressive. It's just real. Less reactive. Less
noise. Less performance. You stop needing to be seen a particular way, partly
because you've watched a few of those selves die and you don't trust the next
one enough to stake everything on it.</p>
<p>The goal of change management — the personal kind — isn't to become someone
admirable. It's to become someone who's the same alone as in public. Someone who
does the next true thing without announcing it. Most of the depth of this stuff
lives in the texture of regular days. How you handle a boring Tuesday. Whether
you rest when you're tired or push through to prove something to nobody.</p>
<p>Business change management has all this in it, and most people read it as a
project manager's manual. It's also a personal one. Endings, neutral zone, new
beginnings. Same shape, different blast radius.</p>
<p>The seeds that grow through the demolition are the ones worth tending. The rest
sorts itself out.</p>
]]>
      </content:encoded>
      <pubDate>Sat, 25 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Three Ways to Look at Time</title>
      <link>https://jonnonz.com/posts/three-ways-to-look-at-time/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/three-ways-to-look-at-time/</guid>
      <description>
        ST-ResNet decomposes crime patterns into three temporal scales and models each one separately. Clever architecture, but does it actually help with only four years of NZ data?
      </description>
      <content:encoded>
        <![CDATA[<p>ST-ResNet's core insight is that not all history is created equal.</p>
<p>When you're predicting crime in Auckland next month, three different kinds of
past information matter. What happened in the last couple of months: the recent
trend. What happened at the same time last year: the seasonal pattern. And
what's been happening over the longer term: whether crime is generally rising or
falling in an area.</p>
<p>ConvLSTM treats all of this as one continuous sequence and hopes the network
figures out which parts matter. <a href="https://arxiv.org/abs/1610.00081">ST-ResNet</a>
takes a more opinionated approach. It separates these three temporal scales
explicitly and gives each one its own dedicated neural network branch.</p>
<p>The original paper by Zhang et al. was about predicting crowd flows in Beijing.
People move through cities in patterns that look a lot like crime patterns:
daily rhythms, weekly cycles, long-term trends. The architecture
<a href="https://www.nature.com/articles/s41598-025-24559-7">translates well to crime data</a>,
with some modifications.</p>
<h2>Closeness, period, trend</h2>
<p>The three branches each look at different slices of history:</p>
<p><strong>Closeness</strong> captures what's been happening recently. For our monthly data,
this means the last 3 months. If South Auckland has been trending upward over
the last quarter, the closeness branch sees that momentum.</p>
<p><strong>Period</strong> captures seasonal patterns. It looks at the same month in previous
years. So to predict January 2026, it pulls in January 2025 and January 2024.
The assumption is that crime has an annual rhythm, and the same month tends to
look similar year to year.</p>
<p><strong>Trend</strong> captures longer-term shifts. It uses quarterly averages from further
back: broad strokes of whether an area is seeing more or less crime over time.
This is the slowest-moving signal.</p>
<p>Each branch independently processes its temporal slice through a stack of
residual convolutional blocks, then a learned fusion layer combines the three
outputs:</p>
<pre><code>prediction = W_c ∘ closeness + W_p ∘ period + W_t ∘ trend + bias
</code></pre>
<p>Where <code>∘</code> is element-wise multiplication and <code>W_c</code>, <code>W_p</code>, and <code>W_t</code> are
learned weight maps that vary by grid cell. This is a nice touch. It means the
model can decide that the CBD's crime is mostly driven by recent trends
(closeness), while a residential suburb might be more seasonal (period).
Different areas get different temporal recipes.</p>
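<p>In PyTorch terms, the fusion layer is tiny. A minimal sketch of the idea
(the grid shape and 32-channel width match the implementation spec further
down; the exact parameterisation is my assumption):</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class CellwiseFusion(nn.Module):
    """Per-cell weighted sum of the three branch outputs."""
    def __init__(self, channels=32, height=77, width=59):
        super().__init__()
        # One learned weight map per branch, broadcast across channels,
        # so every grid cell gets its own temporal recipe
        self.w_c = nn.Parameter(torch.ones(1, 1, height, width))
        self.w_p = nn.Parameter(torch.ones(1, 1, height, width))
        self.w_t = nn.Parameter(torch.ones(1, 1, height, width))
        self.bias = nn.Parameter(torch.zeros(1, channels, height, width))

    def forward(self, closeness, period, trend):
        # Element-wise (Hadamard) weighting, as in the formula above
        return (self.w_c * closeness + self.w_p * period
                + self.w_t * trend + self.bias)
</code></pre>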
<h2>Residual blocks</h2>
<p>Each branch uses residual convolutional units, the building blocks that made
<a href="https://arxiv.org/abs/1512.03385">ResNet</a> so successful in image recognition.</p>
<p>The key idea: instead of learning the full output at each layer, the network
learns the <em>residual</em>, the difference between input and output. The identity
shortcut connection means gradients flow cleanly through the network during
training, which lets you stack more layers without the signal degrading.</p>
<pre><code>ResUnit(X) = ReLU(Conv(ReLU(Conv(X))) + X)
</code></pre>
<p>That <code>+ X</code> at the end is the skip connection. If the layer has nothing useful to
add, it can learn weights near zero and just pass the input through. This makes
deeper networks stable, which matters when you're trying to learn spatial
features at multiple scales.</p>
<p>For our grid, I use 4 residual units per branch. Each unit has two 3×3
convolutional layers with 32 filters. That's deep enough to capture spatial
relationships across several kilometres without being so deep that the model
overfits on 36 months of training data.</p>
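<p>Each unit is a few lines of PyTorch. A sketch, assuming the two-conv,
32-filter, 3x3 setup just described:</p>
<pre><code class="language-python">import torch.nn as nn
import torch.nn.functional as F

class ResUnit(nn.Module):
    """One residual unit: two 3x3 convs plus the identity shortcut."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv2(F.relu(self.conv1(x)))
        return F.relu(out + x)  # "+ x" is the skip connection
</code></pre>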
<h2>The NZ-specific problem</h2>
<p>Here's where theory meets reality, and it gets a bit awkward.</p>
<p>ST-ResNet was designed for dense, high-frequency data. The Beijing crowd flow
paper used 30-minute intervals over months of data: thousands of timesteps. The
crime papers that report strong results typically use daily data over several
years.</p>
<p>We have 48 monthly timesteps. Total. The period branch (which looks at the
same month in previous years) has at most 3 examples of any given calendar
month (the same month in 2022, 2023 and 2024 when predicting into 2025/2026).
The trend branch is working with quarterly averages from a four-year window.
It's not a lot of temporal data for an architecture that's specifically
designed to decompose temporal patterns.</p>
<p>I had a feeling this would be the bottleneck, and it was.</p>
<h2>Implementation</h2>
<pre><code>Closeness branch:
  Input: last 3 months (3 × 6 channels = 18 input channels)
  → 4 ResUnits (32 filters, 3×3 kernels)
  → Output: 32 channels

Period branch:
  Input: same month from 2 prior years (2 × 6 = 12 input channels)
  → 4 ResUnits (32 filters, 3×3 kernels)
  → Output: 32 channels

Trend branch:
  Input: 2 quarterly averages (2 × 6 = 12 input channels)
  → 4 ResUnits (32 filters, 3×3 kernels)
  → Output: 32 channels

Fusion:
  → Learned weighted sum across branches
  → Conv2d(32, 6, 1×1) → 6 crime type predictions
</code></pre>
<p>Total parameters: roughly 180k. Slightly smaller than the ConvLSTM, which is
fine. ST-ResNet's power is supposed to come from the temporal decomposition, not
from model size.</p>
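<p>Putting the branches together, reusing the <code>ResUnit</code> and <code>CellwiseFusion</code>
sketches from above. The single-conv stem on each branch is my assumption;
everything else follows the spec:</p>
<pre><code class="language-python">class STResNet(nn.Module):
    """Three branches of ResUnits, fused per grid cell, projected
    to 6 crime types. A sketch, not the project's actual code."""
    def __init__(self, n_units=4, hidden=32, height=77, width=59):
        super().__init__()
        def branch(in_channels):
            stem = [nn.Conv2d(in_channels, hidden, 3, padding=1)]
            units = [ResUnit(hidden) for _ in range(n_units)]
            return nn.Sequential(*stem, *units)
        self.closeness = branch(3 * 6)  # last 3 months x 6 crime types
        self.period = branch(2 * 6)     # same month from 2 prior years
        self.trend = branch(2 * 6)      # 2 quarterly averages
        self.fusion = CellwiseFusion(hidden, height, width)
        self.head = nn.Conv2d(hidden, 6, 1)  # 1x1 channel projection

    def forward(self, xc, xp, xt):
        fused = self.fusion(self.closeness(xc),
                            self.period(xp),
                            self.trend(xt))
        return self.head(fused)
</code></pre>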
<p>Training uses the same setup as ConvLSTM: Adam optimiser, learning rate 1e-4,
MSE loss on <code>log1p</code>-transformed values, early stopping with patience of 15
epochs. On CPU, each run takes about 35 minutes, a bit faster than ConvLSTM
since there's no sequential recurrence to deal with.</p>
<h2>Results</h2>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>Hist. Avg MAE</th>
<th>ConvLSTM MAE</th>
<th>ST-ResNet MAE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.28</td>
<td>1.14</td>
<td>1.18</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.35</td>
<td>0.32</td>
<td>0.33</td>
</tr>
<tr>
<td>Assault</td>
<td>0.20</td>
<td>0.19</td>
<td>0.19</td>
</tr>
<tr>
<td>Robbery</td>
<td>0.04</td>
<td>0.04</td>
<td>0.04</td>
</tr>
<tr>
<td>Sexual</td>
<td>0.03</td>
<td>0.03</td>
<td>0.03</td>
</tr>
<tr>
<td>Harm</td>
<td>0.01</td>
<td>0.01</td>
<td>0.01</td>
</tr>
<tr>
<td><strong>All types</strong></td>
<td><strong>0.39</strong></td>
<td><strong>0.35</strong></td>
<td><strong>0.36</strong></td>
</tr>
</tbody>
</table>
<p>ST-ResNet beats the historical average but doesn't quite match ConvLSTM. The
aggregate MAE of 0.36 is a 7.7% improvement over the baseline, compared to
ConvLSTM's 10.3%.</p>
<p>That's not a terrible result, but it's not what I was hoping for.</p>
<h2>Why ConvLSTM wins here</h2>
<p>When I dug into the learned fusion weights, the story became clear. The
closeness branch dominates. It gets 60–70% of the weight across most grid cells.
The period branch gets 20–25%, and the trend branch barely contributes at
10–15%.</p>
<p>The model is basically saying: &quot;Recent months matter most, seasonal patterns
help a bit, and long-term trends are mostly noise.&quot; That's not a failure of the
architecture. It's a fair assessment of what's in the data.</p>
<p>With only 2–3 examples of each calendar month, the period branch can't reliably
learn seasonal patterns. It's overfitting to individual years rather than
extracting a stable seasonal signal. ConvLSTM handles this better because it
processes the full sequence and implicitly learns seasonality from the
continuous flow of months, without needing to explicitly align calendar periods.</p>
<p>The trend branch suffers even more. Quarterly averages over a four-year window
don't give it much to work with. In the original crowd flow papers with years of
half-hourly data, the trend branch captures genuine long-term shifts in
population movement. Here, it's essentially learning a constant.</p>
<h2>Where ST-ResNet does shine</h2>
<p>Despite losing on aggregate, ST-ResNet has one clear advantage: it's better at
predicting seasonal transitions.</p>
<p>ST-ResNet handles the months where crime shifts gears (the spring uptick in
September/October and the February dip) more gracefully than ConvLSTM does.
The period branch, sparse as its data is, does capture enough of the annual
rhythm to anticipate these transitions a bit earlier.</p>
<p>ConvLSTM tends to lag these transitions by about a month. It needs to &quot;see&quot; the
uptick starting before it predicts continuation. ST-ResNet, by explicitly
looking at last year's same month, can anticipate the shift before it fully
materialises in the recent sequence.</p>
<p>For an operational forecasting tool, that one-month lead time on seasonal
transitions could be valuable. But in our test set metrics, it's a small
advantage that doesn't overcome ST-ResNet's overall weaker performance on
month-to-month dynamics.</p>
<h2>Head to head</h2>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Historical Avg</th>
<th>ConvLSTM</th>
<th>ST-ResNet</th>
</tr>
</thead>
<tbody>
<tr>
<td>Overall MAE</td>
<td>0.39</td>
<td>0.35</td>
<td>0.36</td>
</tr>
<tr>
<td>Theft MAE</td>
<td>1.28</td>
<td>1.14</td>
<td>1.18</td>
</tr>
<tr>
<td>Training time (CPU)</td>
<td>N/A</td>
<td>~40 min</td>
<td>~35 min</td>
</tr>
<tr>
<td>Parameters</td>
<td>0</td>
<td>~200k</td>
<td>~180k</td>
</tr>
<tr>
<td>Seasonal transitions</td>
<td>Poor</td>
<td>Lagging</td>
<td>Better</td>
</tr>
<tr>
<td>Spatial dynamics</td>
<td>None</td>
<td>Good</td>
<td>Good</td>
</tr>
</tbody>
</table>
<p>ConvLSTM is the better model for this specific dataset. Not by a lot. We're
talking about small differences on already-small error values. But consistently
better on the main crime types that have enough signal to matter.</p>
<p>Neither model is a revelation. A 7–10% improvement over &quot;just use the historical
average&quot; is real but modest. Deep learning's strengths (learning complex
nonlinear dynamics from huge datasets) are somewhat wasted on 48 monthly
timesteps over a relatively low-crime city.</p>
<p>If I had daily data instead of monthly, or ten years instead of four, I'd expect
ST-ResNet to close the gap or pull ahead. Its architecture is fundamentally
sound. The temporal decomposition is a genuinely good idea. It's just starved of
the data it needs to shine.</p>
<p>Both models meaningfully beat the baselines. Both learn spatial patterns that
simple averages can't capture. And both are honest about the sparse crime types:
they predict near-zero and move on, which is the right call.</p>
<p>Next up: we'll take these predictions and build something you can actually look
at. A 3D interactive dashboard where you can watch crime patterns evolve across
Auckland over time. The modelling was the hard bit. Making it visual is the fun
bit.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 23 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What an hour of your attention is worth</title>
      <link>https://jonnonz.com/posts/what-an-hour-of-your-attention-is-worth/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/what-an-hour-of-your-attention-is-worth/</guid>
      <description>
        You pay Big Tech about $1,000 a year in attention. Here's how to read the meter — and why building your own is suddenly cheaper than opting out.
      </description>
      <content:encoded>
        <![CDATA[<p>I stood up a working social network for eight mates last weekend. Profile pages,
a shared feed, a photo wall, a jukebox bolted onto a spare domain. It took me a
Saturday, about forty bucks in Claude credits, and exactly zero
product-market-fit meetings.</p>
<p>The same weekend, Meta earned about six bucks off me. Google made ten. LinkedIn,
YouTube, TikTok, X — all quietly billing in the background, none of them sending
a receipt. If you add them all up for the average American, the annual total is
north of $1,000. You just never see it, because no money changes hands and no
invoice arrives.</p>
<p>The clever thing about &quot;free&quot; on the internet isn't that the trade doesn't
exist. It's that it's been designed so you can't see it. No money moves. No
invoice lands. No app shows you the meter ticking as you scroll. The exchange is
real — your attention and your data in, Instagram and Google and LinkedIn out —
but by the time the numbers get tallied, they live in a quarterly earnings
report you'll never read. So the trade feels weightless.</p>
<p>It isn't. You just can't see the price tag.</p>
<p>The strange thing is the price tag has been public the whole time. Every
platform listed on a stock exchange tells you, four times a year, exactly what
you're worth to them. You've just never been shown how to read it — and until
recently, the only practical alternative to reading it was &quot;live in a cabin.&quot;
That part has changed, and it's the part almost nobody is talking about.</p>
<p>The invisibility isn't an accident either. If Meta had to send you a cheque
every month for the money they made off you, you'd treat the relationship very
differently. You'd notice when the amount went up. You'd notice that the
teenager version of the payment looks nothing like the adult version. You'd
wonder why the Auckland cheque was ten times the Jakarta one for the exact same
hour of scrolling. The whole edifice of &quot;free&quot; rests on keeping the accounting
one-sided — they measure you in basis points to three decimal places, you
experience the trade as a vague sense of having lost your afternoon.</p>
<h2>The price tag they're legally required to print</h2>
<p>The number you want is called ARPU — average revenue per user. Every public
platform reports it, because investors demand it. The maths is blunt: take the
company's annual revenue, divide by monthly active users. What comes out is what
the platform earns off the average human who shows up, per year.</p>
<p>For Meta last year the global figure was about $52 per user. For YouTube's
ad-supported side, around $24. For
<a href="https://www.linkedin.com/posts/dshapero_earnings-update-to-close-out-our-2025-fiscal-activity-7361399679256858624-vVg7">LinkedIn it's $15 averaged across all 1.2B members</a>,
but much higher once you strip out the dormant accounts.</p>
<p>These aren't guesses from a watchdog group. They're from the companies
themselves, in the part of the earnings release where the whole purpose is to
convince shareholders each user is worth more than last quarter. The incentive
is to talk the number up, not down.</p>
<p>Whatever ARPU says, the reality on the ground probably isn't lower. If anything,
it's a floor.</p>
<h2>Your annual bill, itemised</h2>
<p>Rough figures, from the companies' own filings:</p>
<ul>
<li><strong>Meta</strong>: ~$52/yr global, ~$320 in the US</li>
<li><strong>Google (all products)</strong>: ~$100/yr globally, ~$500 US —
<a href="https://abc.xyz/investor/">$400B in revenue</a> across ~4B users spanning
Search, Android, YouTube, Cloud and Workspace combined</li>
<li><strong>YouTube ads alone</strong>: ~$24/yr global, ~$80 US</li>
<li><strong>LinkedIn</strong>: $15/yr averaged across all 1.2B members, but ~$57/yr across the
<a href="https://www.linkedin.com/posts/dshapero_earnings-update-to-close-out-our-2025-fiscal-activity-7361399679256858624-vVg7">310M monthly active ones</a></li>
<li><strong>TikTok</strong>: ~$16 global, ~$70 US — doubled in two years</li>
<li><strong>Snapchat, Reddit, Pinterest, X</strong>: all in the $10–30/user/yr range</li>
</ul>
<p>The geographic skew is the part most people miss. Meta's figure in the US is
roughly ten times what it is in Asia-Pacific. Europe sits in the middle at about
$92. Same product, same features, same algorithm — different rate card, because
ad buyers pay more to reach wealthier audiences. You are literally worth more in
Auckland than you are in Jakarta, and your feed is tuned accordingly.</p>
<p><img src="https://jonnonz.com/img/posts/arpu-meta-by-region.svg" alt="Meta ARPU by region — US $320, Europe $92, global $52, Asia-Pacific $32"></p>
<p>The same skew shows up across every ad-funded platform. The US rate card is the
one the rest of the world gets compared to:</p>
<p><img src="https://jonnonz.com/img/posts/arpu-us-vs-global.svg" alt="US vs global ARPU comparison — Google $500/$100, Meta $320/$52, YouTube $80/$24, TikTok $70/$16, LinkedIn $57/$15"></p>
<p>Marketplaces don't fit ARPU cleanly, but the extraction is still there if you
look for it. Uber and Lyft take around 20% of each fare. Airbnb combines host
and guest fees for about 14–16%. DoorDash and Uber Eats take closer to 25%.
Shopify's card take is 2.9% plus 30 cents per transaction. Different mechanism,
same game — a percentage of every transaction, quietly skimmed, never itemised.</p>
<h2>The meter, in dollars per hour</h2>
<p>ARPU is annual. Attention isn't spent in years though — it's spent in hours, in
the little windows between other things. So the honest conversion is to divide.</p>
<p>The average US Meta user burns about 200 hours a year across Facebook and
Instagram. $320 ÷ 200 = roughly $1.60 per hour of your attention. YouTube works
out to about $0.27/hour. TikTok $0.22. Snapchat cheaper still. Do the same sum
on global averages and Meta drops to around 26 cents an hour, YouTube to 8.</p>
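<p>If you want to sanity-check the sums, the conversion is one line of
arithmetic. The figures are the rounded US numbers quoted above, not official
rates:</p>
<pre><code class="language-python"># Back-of-the-envelope: annual ARPU divided by annual hours of attention.
meta_arpu_us = 320      # dollars per user per year, from Meta's filings
hours_per_year = 200    # rough US average across FB + Instagram

print(meta_arpu_us / hours_per_year)  # 1.6 dollars per hour of attention
</code></pre>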
<p>Those rates are only what the platform <em>earns</em> this year, mind. They aren't what
your data is ultimately <em>worth</em>. Everything you click and hover and pause on
feeds ad targeting across the wider web, plus — now — AI training corpora. ARPU
is the rent. The equity is bigger, and the equity compounds.</p>
<p>The AI-training bit is genuinely new and worth pausing on. For fifteen years the
data you generated on these platforms powered one thing: better ad targeting on
those same platforms. It was a closed loop. You scrolled, they learned, they
sold the targeting back to advertisers, the advertisers bought your attention
again. Bounded. Weird, but bounded.</p>
<p>That loop isn't bounded anymore. Your posts and comments and DMs are now
training data for models that will be sold, resold, and embedded into every
piece of software you touch for the next decade. The $320 Meta earned off you in
the US last year is a rounding error next to what the underlying corpus is worth
to the next generation of AI products. ARPU doesn't capture any of that. It's
literally last quarter's ad rent, with none of the capital gains on the asset.</p>
<p>Even the rent, laid out per hour, makes one thing obvious: you can see exactly
why every platform is obsessed with &quot;time spent&quot; as a north-star metric. If one
extra hour a week on Facebook is worth ~$83 a year per US user, multiplied
across three billion users, the maths for why the feed never stops scrolling is
not mysterious. The feed is a meter. Keeping it running is the business. Every
&quot;new feature&quot; that shows up in your settings — reels, shorts, a nudge to open
the app on your commute — is a hand on that meter.</p>
<p>Once you see it that way, a lot of product decisions stop looking like product
decisions.</p>
<h2>Run your own numbers</h2>
<p>The point of making the numbers this concrete is that you can plug in your own
usage and see what you personally throw into the machine each year. Drag the
sliders for how much time goes into each platform and watch the ledger tally up.
Rates are global averages.</p>
<section class="ledger" id="ledger" aria-label="The Ledger calculator"><style>
.ledger{--lb:#141f2e;--lb2:#1a2637;--li:#e4e9ee;--ld:#bfc8d2;--lm:#6a7d92;--la:#d4a853;--lbd:rgba(255,255,255,.08);--lbs:rgba(255,255,255,.16);max-width:38rem;background:var(--lb);border:1px solid var(--lbs);border-radius:.35rem;padding:1.15rem 1.25rem 1rem;margin:2rem auto;color:var(--ld);font-family:text,'Roboto',-apple-system,sans-serif;font-size:.88rem;line-height:1.5;text-align:left}
.ledger *,.ledger *::before,.ledger *::after{box-sizing:border-box}
.ledger h3,.ledger h4{font-family:inherit;margin:0;padding:0;color:inherit;font-weight:inherit;font-size:inherit;letter-spacing:0;line-height:1.2}
.ledger h3::before,.ledger h4::before{content:none}
.ledger p{margin:0;padding:0}
.ledger input{font:inherit;color:inherit;background:transparent;border:none;outline:none}
.l-head{display:flex;justify-content:space-between;align-items:baseline;gap:.75rem;padding-bottom:.75rem;margin-bottom:.9rem;border-bottom:1px solid var(--lbs)}
.l-title{font-family:serif,'Fraunces',Georgia,serif;font-weight:400;font-size:1.15rem;letter-spacing:-.015em;color:var(--li)}
.l-tag{font-family:code,'JetBrains Mono',monospace;font-size:.58rem;letter-spacing:.16em;text-transform:uppercase;color:var(--lm)}
.l-total{display:flex;align-items:baseline;justify-content:space-between;gap:1rem;padding:.85rem 1rem;background:#0e1623;border:1px solid var(--lbd);border-radius:.25rem;margin-bottom:1.1rem}
.l-total-amt{font-family:serif,'Fraunces',Georgia,serif;font-weight:400;font-size:1.9rem;line-height:1;color:var(--la);letter-spacing:-.02em;font-variant-numeric:tabular-nums}
.l-total-lab{font-family:code,'JetBrains Mono',monospace;font-size:.58rem;letter-spacing:.18em;text-transform:uppercase;color:var(--lm);text-align:right}
.l-sec{margin-top:1.15rem}
.l-sec:first-of-type{margin-top:0}
.l-row{padding:.9rem 0;border-bottom:1px dashed var(--lbd)}
.l-row:last-child{border-bottom:none}
.l-row-top{display:flex;align-items:baseline;justify-content:space-between;gap:.75rem;margin-bottom:.55rem}
.l-row-n{color:var(--li);font-size:.94rem;line-height:1.25}
.l-row-meta{font-family:code,'JetBrains Mono',monospace;font-size:.6rem;letter-spacing:.05em;color:var(--lm);margin-top:.15rem;display:block}
.l-row-a{font-family:serif,'Fraunces',Georgia,serif;font-size:1.05rem;color:var(--la);text-align:right;font-variant-numeric:tabular-nums;white-space:nowrap;flex-shrink:0}
.l-row-a.z{color:var(--lm)}
.l-row-c{display:flex;align-items:center;gap:.9rem}
.l-row-v{font-family:code,'JetBrains Mono',monospace;font-size:.66rem;letter-spacing:.05em;color:var(--ld);white-space:nowrap;min-width:6rem;text-align:right}
.l-sl{-webkit-appearance:none;appearance:none;flex:1;min-width:0;height:32px;background:transparent;cursor:pointer;padding:0;margin:0;touch-action:manipulation}
.l-sl::-webkit-slider-runnable-track{height:3px;background:var(--lbd);border-radius:2px}
.l-sl::-webkit-slider-thumb{-webkit-appearance:none;appearance:none;width:22px;height:22px;border-radius:50%;background:var(--la);border:3px solid var(--lb);margin-top:-10px;box-shadow:0 0 0 1px var(--la),0 2px 6px rgba(0,0,0,.3);cursor:grab}
.l-sl:active::-webkit-slider-thumb{cursor:grabbing;box-shadow:0 0 0 1px var(--la),0 0 0 6px rgba(212,168,83,.22)}
.l-sl::-moz-range-track{height:3px;background:var(--lbd);border-radius:2px}
.l-sl::-moz-range-thumb{width:22px;height:22px;border-radius:50%;background:var(--la);border:3px solid var(--lb);box-shadow:0 0 0 1px var(--la)}
.l-sl:focus::-webkit-slider-thumb{box-shadow:0 0 0 1px var(--la),0 0 0 6px rgba(212,168,83,.28)}
.l-notes{margin-top:1.25rem;padding-top:.9rem;border-top:1px solid var(--lbs)}
.l-notes h5{font-family:code,'JetBrains Mono',monospace;font-size:.56rem;letter-spacing:.18em;text-transform:uppercase;color:var(--lm);margin:0 0 .55rem;font-weight:400}
.l-notes p{font-family:serif,'Fraunces',Georgia,serif;font-size:.85rem;line-height:1.55;color:var(--ld);margin:0 0 .4rem;text-wrap:pretty}
.l-notes p:last-child{margin-bottom:0}
.l-notes strong{color:var(--li);font-weight:500}
@media (max-width:560px){
.ledger{padding:1rem .9rem;margin:1.5rem auto;font-size:.92rem}
.l-total{flex-direction:column;align-items:flex-start;gap:.25rem;padding:.75rem .9rem}
.l-total-lab{text-align:left}
.l-row{padding:1rem 0}
.l-row-c{gap:.75rem}
.l-row-v{min-width:5rem;font-size:.7rem}
.l-sl::-webkit-slider-thumb{width:26px;height:26px;margin-top:-12px}
.l-sl::-moz-range-thumb{width:26px;height:26px}
}
</style><div class="l-head"><h3 class="l-title">The Ledger</h3><span class="l-tag">global averages · per year</span></div><div class="l-total"><div class="l-total-amt" id="l-total">$0</div><div class="l-total-lab">extracted per year</div></div><div class="l-sec"><div id="l-attn-rows"></div></div><div class="l-notes"><h5>Notes on the method</h5><p><strong>ARPU is rent, not equity.</strong> What a platform earns this year isn't what the underlying data is worth across the wider web and AI training corpora.</p><p><strong>Averages hide heavy users.</strong> Freemium smears free and paying users into one figure. If you're all-in, you're worth more than average.</p><p><strong>Multi-product companies cheat the top line.</strong> Google's per-user number isn't all Search — it's Search plus Android plus YouTube plus Cloud.</p></div><script>(function(){var R={meta:.26,youtube:.08,tiktok:.05,x:.07,reddit:.12,snap:.07,pin:.15,li:.15,gq:.04};
var ATTN=[{id:'meta',name:'Meta (FB / IG / WhatsApp)',rate:'meta',unit:'hrs/day',mult:365,max:6,step:.25},{id:'youtube',name:'YouTube (ad-supported)',rate:'youtube',unit:'hrs/day',mult:365,max:6,step:.25},{id:'tiktok',name:'TikTok',rate:'tiktok',unit:'hrs/day',mult:365,max:6,step:.25},{id:'x',name:'X (Twitter)',rate:'x',unit:'hrs/day',mult:365,max:4,step:.25},{id:'reddit',name:'Reddit',rate:'reddit',unit:'hrs/day',mult:365,max:4,step:.25},{id:'snap',name:'Snapchat',rate:'snap',unit:'hrs/day',mult:365,max:4,step:.25},{id:'pin',name:'Pinterest',rate:'pin',unit:'hrs/day',mult:365,max:4,step:.25},{id:'li',name:'LinkedIn (free)',rate:'li',unit:'hrs/day',mult:365,max:2,step:.1},{id:'gq',name:'Google Search',rate:'gq',unit:'searches/day',mult:365,max:100,step:1}];
var state={attn:{}};
function fmt(n){n=Math.round(n);if(n===0)return'$0';if(n>=1000)return'$'+n.toLocaleString();return'$'+n;}
function recalc(){var tot=0;
ATTN.forEach(function(s){var v=state.attn[s.id]||0;var amt=v*R[s.rate]*s.mult;tot+=amt;var el=document.getElementById('l-a-'+s.id);if(el){el.textContent=fmt(amt);el.classList.toggle('z',amt<1);}var vl=document.getElementById('l-v-'+s.id);if(vl)vl.textContent=v+' '+s.unit;});
document.getElementById('l-total').textContent=fmt(tot);}
function renderAttn(){document.getElementById('l-attn-rows').innerHTML=ATTN.map(function(s){var meta='$'+R[s.rate].toFixed(2)+(s.unit==='searches/day'?' / search':' / hr');return '<div class="l-row"><div class="l-row-top"><div><span class="l-row-n">'+s.name+'</span><span class="l-row-meta">'+meta+'</span></div><span class="l-row-a z" id="l-a-'+s.id+'">$0</span></div><div class="l-row-c"><input type="range" class="l-sl" data-cat="attn" data-svc="'+s.id+'" min="0" max="'+s.max+'" step="'+s.step+'" value="0" aria-label="'+s.name+' '+s.unit+'"><span class="l-row-v" id="l-v-'+s.id+'">0 '+s.unit+'</span></div></div>';}).join('');}
function renderAll(){renderAttn();recalc();}
var root=document.getElementById('ledger');
root.addEventListener('input',function(e){var t=e.target;if(t.classList.contains('l-sl')){state[t.dataset.cat][t.dataset.svc]=parseFloat(t.value)||0;recalc();}});
renderAll();})();</script></section>
<p>The rates come from the earnings-report maths above — global ARPU divided by
average annual hours on the platform.</p>
<h2>The weekend social network</h2>
<p>Once the number has somewhere to sit, it's much harder to ignore.</p>
<p>Most people look at a total over $1,000/yr and go quiet for a second. Not
because any one platform is egregious — on a per-hour basis they really aren't —
but because the aggregate is real, and it's been invisible until now. That's the
first useful thing the exercise does. It makes a choice possible.</p>
<p>The obvious next move is to look at alternatives. Signal instead of WhatsApp.
Kagi or Brave Search instead of Google. Paid Spotify instead of ad-supported
Spotify. Bluesky or Mastodon instead of X. Fastmail instead of Gmail. None are
perfect, and some cost actual money — but once you can price what you're
currently &quot;not paying&quot;, the paid alternative often looks less expensive than it
did five minutes ago. Fastmail at
$5/month stops being a luxury when the honest comparison is &quot;$60/yr vs being the
product for an ad network that paid $500 for me last year.&quot;</p>
<p>That's the defensive move. It's the one everyone talks about, every time one of
these pieces gets written. You switch to the more honest vendor, you feel
slightly better, and the fundamental shape of the market doesn't move.</p>
<p>The more interesting move is what's happened on the <em>build</em> side, and it's the
part almost nobody has internalised yet.</p>
<p>Standing up a social app used to take a small team months. You needed a backend
engineer, a frontend engineer, a designer, probably a DevOps person, and a spare
three months. That was the real moat — not the network effects, not the
algorithm, but the sheer human-hours required to put a working thing on the
internet. That's why the only viable answer for twenty years was to build
something big enough to run ads against. Small social didn't exist because small
social couldn't pay the salaries.</p>
<p>With Claude Code, Cursor, v0, and Lovable, that equation has quietly inverted. A
profile page, a shared feed, a wall for photos, maybe a jukebox, a chat wall — a
MySpace-sized thing for you and a dozen friends, on a domain you own, with none
of it feeding anyone's ad platform — is a weekend. I know because I just did it.
Not as some Silicon Valley startup trying to replace Facebook. As a Saturday
project for eight mates, on a domain that cost twelve bucks, running on a box
that costs ten a month.</p>
<p>The bill of materials is embarrassingly short. A boring Postgres. A boring
Next.js app. Auth via magic link. Storage for photos. An LLM for the fiddly bits
nobody wants to write from scratch. All of it plumbed together in an afternoon
of prompting, an evening of cleanup, and a Sunday of adding the jukebox because
my mate Hamish wouldn't stop asking.</p>
<p>It is not good software. It is good <em>enough</em> software for eight humans who know
each other.</p>
<p>That qualifier is the whole thing. Facebook has to be good software at planet
scale because Facebook is selling ad impressions at planet scale. A group of
eight doesn't need p99 latency and a content moderation policy. A group of eight
needs a place to put photos from the weekend where the photos don't end up
training someone's image model in twelve months' time. Those are very different
engineering problems, and the second one is much, much easier than the first.</p>
<p>A lot of things genuinely don't work on the weekend version. There's no
recommendation algorithm. There's no real search. The feed is
reverse-chronological and that's it. When someone posts something at 3am nobody
sees it until the morning. There's no cleverness about which photos get surfaced
or which memories get resurrected. If you go on holiday for two weeks, you come
back to a feed that's exactly what your eight mates posted, in the order they
posted it.</p>
<p>That sounds like a limitation until you notice the thing it is not doing is
optimising for your engagement. Reverse-chronological across eight friends is
not a meter. It's a wall. You check it, you see what's there, you leave. There's
no reason for the software to try to keep you around because there's nobody
paying the software to keep you around. That inversion — from meter to wall — is
the entire point.</p>
<p>The thing that would have been a VC round in 2015 is now a side quest you finish
before the roast is in the oven. The tools genuinely got that much better in the
last eighteen months. We just haven't updated our intuitions yet about what that
means.</p>
<p>What it means, specifically, is that the ad-supported social network is no
longer the only technically viable answer. For twenty years it was. That was the
constraint the whole &quot;free web&quot; was built around. The constraint is gone, and
nobody has sent the memo.</p>
<p>The cheapest social network in 2026 is the one you and seven mates build on a
Saturday afternoon. It doesn't scale. It doesn't need to. It costs less than a
month of Netflix, produces no ad revenue for anyone, and feeds no one's training
set. You own the domain. You own the data. You own the product decisions — which
in practice means there are no product decisions, because nobody is trying to
squeeze another hour out of anyone's week.</p>
<p>None of this replaces the platforms, to be clear. You still need Gmail for the
recruiter, LinkedIn for the job hunt, YouTube for the tutorial, WhatsApp for the
group chat your family refuses to leave. The ad-supported internet isn't going
anywhere and I'm not pretending it is. What's changed is that it's no longer the
only game in town. For the circle of people you actually care about — the eight
mates, the cousins, the old uni flat — you don't have to hand them over to the
ad machine anymore. You can build them a room of their own, and the tools to
build that room have become trivial in a way we haven't fully absorbed yet.</p>
<p>The meter's been running your whole life. You just got the tools to turn it off.</p>
]]>
      </content:encoded>
      <pubDate>Tue, 21 Apr 2026 12:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Teaching a Neural Network to Watch Crime Like Video</title>
      <link>https://jonnonz.com/posts/teaching-a-neural-network-to-watch-crime-like-video/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/teaching-a-neural-network-to-watch-crime-like-video/</guid>
      <description>
        ConvLSTM was built for weather radar. Turns out predicting crime on a grid is basically the same problem. Here's how it works and what it learned.
      </description>
      <content:encoded>
        <![CDATA[<p>ConvLSTM was invented to predict rainstorms.</p>
<p>Specifically,
<a href="https://arxiv.org/abs/1506.04214">Shi et al. at the Hong Kong Observatory</a>
needed to forecast radar echo maps: 2D grids of rainfall intensity that evolve
over time. They had sequences of spatial images and wanted to predict the next
frames. Sound familiar?</p>
<p>That's exactly what we built in Part 3. Crime on a 500m grid, one frame per
month, six channels for crime types. The Auckland crime tensor is structurally
identical to a weather radar sequence. Same dimensionality, same prediction
task, just a very different domain.</p>
<h2>Why not regular LSTM?</h2>
<p>Standard LSTM networks are fantastic at learning sequences. They're the backbone
of a lot of time-series forecasting. But they have a fundamental problem with
spatial data: they need flat vectors as input.</p>
<p>To feed our 77×59 grid into a regular LSTM, we'd have to flatten it into a
vector of 4,543 values per crime type. That's 27,258 values per timestep across
all six channels. The network would process this as a sequence of big flat
vectors, with no concept that cell (10, 5) is <em>next to</em> cell (10, 6).</p>
<p>All the spatial structure (the fact that crime clusters, that hotspots have
neighbourhoods, that the CBD is a contiguous area) gets thrown away. The model
would have to rediscover spatial relationships from scratch, purely from
correlations in the flattened vector. With only 36 training months, that's not
happening.</p>
<h2>The convolutional trick</h2>
<p>ConvLSTM's insight is elegant. Take the standard LSTM equations (the input gate,
forget gate, output gate, cell state update) and replace every matrix
multiplication with a convolution operation.</p>
<p>In a regular LSTM:</p>
<pre><code>input_gate = sigmoid(W_xi * x_t + W_hi * h_{t-1} + b_i)
</code></pre>
<p>In ConvLSTM:</p>
<pre><code>input_gate = sigmoid(W_xi ∗ X_t + W_hi ∗ H_{t-1} + b_i)
</code></pre>
<p>That <code>∗</code> is a convolution instead of a matrix multiply. <code>X_t</code> is the full 2D
grid at time <code>t</code>, and <code>H_{t-1}</code> is the previous hidden state, also a 2D grid.
The convolution kernel slides across the spatial dimensions, so each cell's gate
values depend on its local neighbourhood.</p>
<p>This means the network naturally learns that a spike in cell (10, 5) might
affect predictions for cell (10, 6). Spatial proximity is baked into the
architecture. It doesn't need to learn it from data.</p>
<p>The kernel size controls how much spatial context each cell sees. A 3×3 kernel
means each cell looks at its immediate 8 neighbours. Stack multiple ConvLSTM
layers and the effective receptive field grows. Deeper layers can capture
relationships between cells that are several kilometres apart.</p>
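<p>The whole trick fits in about a dozen lines of PyTorch. A minimal sketch of
a ConvLSTM cell in the standard formulation (not necessarily line-for-line
what I ran):</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM gates computed with convolutions instead of matrix multiplies."""
    def __init__(self, in_channels, hidden_channels, kernel=3):
        super().__init__()
        # One conv computes all four gates (input, forget, output, candidate)
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state  # hidden and cell state, both (batch, hidden, H, W)
        i, f, o, g = torch.chunk(
            self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
</code></pre>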
<h2>Architecture choices</h2>
<p>Here's what I settled on after a fair bit of experimentation (which on CPU means
&quot;a lot of patient waiting&quot;):</p>
<pre><code>Input: (batch, 6, 6, 77, 59), 6 months, 6 crime types, 77×59 grid
  ↓
ConvLSTM2d(in=6, hidden=32, kernel=3×3, padding=1)
  ↓
BatchNorm2d
  ↓
ConvLSTM2d(in=32, hidden=32, kernel=3×3, padding=1)
  ↓
BatchNorm2d
  ↓
Conv2d(in=32, out=6, kernel=1×1), project to 6 crime type channels
  ↓
Output: (batch, 6, 77, 59), next month prediction
</code></pre>
<p>Two ConvLSTM layers with 32 hidden channels each. The 3×3 kernel gives each cell
a neighbourhood view, and stacking two layers means the effective receptive
field covers about 1–1.5 km. Enough to capture the spatial extent of most crime
hotspots.</p>
<p>Why only 32 hidden channels? This is where the CPU constraint actually helps. A
bigger model would be tempting with a GPU, but on a Ryzen 5 we need to keep it
tight. 32 channels gives us about 200k trainable parameters: small enough to
train in under an hour, large enough to learn meaningful spatial-temporal
patterns.</p>
<p>The 1×1 convolution at the end is a channel projection. It maps the 32 learned
features back to 6 crime type predictions.</p>
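<p>Stacked together, reusing the <code>ConvLSTMCell</code> sketch from above, the network
looks something like this. The unrolling loop over the input months is the
part a block diagram hides:</p>
<pre><code class="language-python">class ConvLSTMNet(nn.Module):
    """Two stacked ConvLSTM layers with BatchNorm, then a 1x1 projection."""
    def __init__(self, in_channels=6, hidden=32, out_channels=6):
        super().__init__()
        self.hidden = hidden
        self.cell1 = ConvLSTMCell(in_channels, hidden)
        self.cell2 = ConvLSTMCell(hidden, hidden)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.bn2 = nn.BatchNorm2d(hidden)
        self.head = nn.Conv2d(hidden, out_channels, 1)

    def forward(self, seq):  # seq: (batch, 6 months, 6 types, 77, 59)
        b, t, _, height, width = seq.shape
        h1 = c1 = seq.new_zeros(b, self.hidden, height, width)
        h2 = c2 = seq.new_zeros(b, self.hidden, height, width)
        for step in range(t):  # unroll over the 6-month window
            h1, c1 = self.cell1(seq[:, step], (h1, c1))
            h2, c2 = self.cell2(self.bn1(h1), (h2, c2))
        return self.head(self.bn2(h2))  # next-month prediction
</code></pre>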
<h2>Sequence length: six months</h2>
<p>The lookback window is six months. The model sees January through June and
predicts July. Then February through July to predict August. And so on.</p>
<p>Six months captures one half of the seasonal cycle, which turned out to be the
sweet spot. Shorter sequences (3 months) missed seasonal context. Longer
sequences (12 months) didn't improve results, likely because the model doesn't
have enough data to learn year-long dependencies with only 36 training months
total.</p>
<p>The training set gives us 30 sequences (months 1–6 predict 7, months 2–7 predict
8, all the way to months 30–35 predict 36). That's not a lot. Every sequence
counts.</p>
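<p>Building those sequences is a plain sliding window. A hypothetical helper
(the names are mine, not the project's):</p>
<pre><code class="language-python">import numpy as np

def make_sequences(tensor, lookback=6):
    # tensor: (months, 6 crime types, 77, 59)
    xs, ys = [], []
    for t in range(len(tensor) - lookback):
        xs.append(tensor[t:t + lookback])  # months t .. t+5 as input
        ys.append(tensor[t + lookback])    # month t+6 as the target
    return np.stack(xs), np.stack(ys)

# 36 training months gives 30 (input, target) pairs
</code></pre>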
<h2>Training details</h2>
<pre><code class="language-python">optimiser = Adam(lr=1e-4)
loss = MSE        # on log1p-transformed values
batch_size = 4    # small because each sequence is large
epochs = 150      # early stopping, patience=15
</code></pre>
<p>The <code>log1p</code> transformation from Part 3 is critical here. Raw crime counts range
from 0 to 50+. After <code>log1p</code>, the range compresses to 0–4. Without this, the
loss function would be dominated by the handful of high-count CBD cells, and the
model would essentially ignore the rest of the grid.</p>
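<p>A quick numeric check of what <code>log1p</code> does to the range:</p>
<pre><code class="language-python">import numpy as np

counts = np.array([0, 1, 5, 50])
print(np.log1p(counts))  # [0.  0.693  1.792  3.932] (rounded)
# the 0-50 raw range compresses to roughly 0-4
</code></pre>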
<p>Training on CPU takes about 40 minutes per run. Not fast, but manageable. I
could typically fit in 3–4 experimental runs per evening, which meant progress
was slow but steady. Each run I'd tweak one thing (kernel size, hidden channels,
learning rate) and compare validation MAE.</p>
<p>Early stopping triggers around epoch 80–100 in most runs. The model converges
relatively quickly, which makes sense given the small dataset and architecture.</p>
<h2>Results</h2>
<p>So how does ConvLSTM stack up against the baselines from Part 5?</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>Hist. Avg MAE</th>
<th>ConvLSTM MAE</th>
<th>Improvement</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.28</td>
<td>1.14</td>
<td>10.9%</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.35</td>
<td>0.32</td>
<td>8.6%</td>
</tr>
<tr>
<td>Assault</td>
<td>0.20</td>
<td>0.19</td>
<td>5.0%</td>
</tr>
<tr>
<td>Robbery</td>
<td>0.04</td>
<td>0.04</td>
<td>2.5%</td>
</tr>
<tr>
<td>Sexual</td>
<td>0.03</td>
<td>0.03</td>
<td>~0%</td>
</tr>
<tr>
<td>Harm</td>
<td>0.01</td>
<td>0.01</td>
<td>~0%</td>
</tr>
<tr>
<td><strong>All types</strong></td>
<td><strong>0.39</strong></td>
<td><strong>0.35</strong></td>
<td><strong>10.3%</strong></td>
</tr>
</tbody>
</table>
<p>A 10% improvement on the aggregate MAE. Not earth-shattering, but real.</p>
<p>Theft gets the biggest lift because there's the most signal to work with. The
model genuinely learns spatial dynamics that the historical average can't
capture. When a cluster of cells in South Auckland trends upward over several
months, ConvLSTM picks up on that momentum and adjusts its predictions
accordingly.</p>
<p>Burglary sees a decent improvement too, likely driven by the spatial correlation
with theft that we spotted in the EDA.</p>
<p>For the sparse crime types (robbery, sexual offences, harm) ConvLSTM basically
learns to predict near-zero, same as the baseline. There simply isn't enough
signal at 500m monthly resolution for these types. The model is honest about
what it doesn't know, which I actually respect.</p>
<h2>Where it shines and where it doesn't</h2>
<p>The improvement isn't uniform across the grid. ConvLSTM does best in the
transition zones: cells on the edges of established hotspots where crime counts
fluctuate month to month. It learns that these boundary cells tend to follow the
trend of their neighbours, which is exactly the kind of spatial-temporal pattern
it was designed to capture.</p>
<p>In the stable hotspot cores (the CBD, Manukau) the model performs about the same
as the baseline. Those cells are consistently high, and the historical average
already captures that well.</p>
<p>Where it properly struggles is with sudden spikes in normally quiet areas. A
cell that's been near-zero for months and then gets 5 thefts in one month: the
model doesn't see that coming. Neither does any other model, to be fair. Those
events are closer to random noise than learnable signal.</p>
<h2>Putting it in perspective</h2>
<p>A 10% MAE improvement is meaningful but modest.
<a href="https://arxiv.org/pdf/2502.07465v1">Recent ConvLSTM crime prediction papers</a>
report larger gains, but they typically work with much more data: years of daily
records across cities with higher crime density. Our setup is tougher. Monthly
resolution limits temporal signal, Auckland is relatively low-crime by global
standards, and we only have four years.</p>
<p>The model is also running on CPU with a deliberately small architecture. A
bigger model on a GPU might squeeze out more performance. But the point of this
project was always to see how far you can push it with modest resources, and a
10% beat over simple baselines feels like a real result.</p>
<p>The question now is whether ST-ResNet's different approach to temporal modelling
can do better. ConvLSTM processes time as one continuous sequence. ST-ResNet
breaks it into three separate temporal scales: closeness, period, and trend.
With a seasonal dataset like crime, that decomposition might be exactly what's
needed.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Open-Source Agent That Teaches Claude Code Your Architecture</title>
      <link>https://jonnonz.com/posts/open-source-agent-that-teaches-claude-code-your-architecture/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/open-source-agent-that-teaches-claude-code-your-architecture/</guid>
      <description>
        AI made software cheap to build but not cheap to scale. Evolutionary architecture with domain-aware AI agents is the missing piece.
      </description>
      <content:encoded>
        <![CDATA[<p>AI has made building software cheap. A solo founder with Claude Code or Cursor
can ship an MVP in a weekend that would've taken a small team a month two years
ago. I've watched this happen across the NZ startup scene. Ideas that used to
die in the &quot;can we afford to build it&quot; phase now get built over a long weekend.</p>
<p>This is mostly great. Velocity is what startups need. The cost of testing an
idea is now close to zero, and businesses prioritise speed accordingly.</p>
<p>The catch shows up when the idea works.</p>
<p>AI builds for <em>right now</em>. It optimises for the current prompt, the current
file, the current feature. It doesn't think about what happens when your billing
service needs to handle 10x the volume, or when your email notifications need to
move from inline calls to a queue. It doesn't plan for the evolutionary pressure
your system will face once it has users.</p>
<p>That's the gap I've been thinking about, and it's what led me to build
<a href="https://github.com/jonnonz1/domain-agents">domain-agents</a>.</p>
<h2>Give the tools their credit</h2>
<p>I want to be fair to the current generation of AI coding assistants. They're not
stupid about finding code.</p>
<p>Claude Code runs an agentic search loop (grep, glob, file reads) iterating
through your codebase to find what's relevant. Boris Cherny (who created Claude
Code) <a href="https://x.com/bcherny/status/2017824286489383315">has said</a> they tried
RAG with a local vector database early on and dropped it because agentic search
outperformed it. Cursor takes a different approach: it
<a href="https://read.engineerscodex.com/p/how-cursor-indexes-codebases-fast">chunks your codebase, generates embeddings</a>,
and stores them for semantic search so you can find code by concept rather than
keyword. Copilot combines semantic indexing with LSP-powered reference tracing
from VS Code.</p>
<p>The search works. If you ask Claude Code to find your billing service, it'll
find it. Ask Cursor for authentication logic and the embeddings will surface it
even if the code never uses the word &quot;authentication.&quot;</p>
<p>None of them understand the architecture those files live in.</p>
<p>All the information needed to understand domain relationships sits in the code:
import graphs, interface signatures, dependency patterns. These tools don't
extract or structure it that way. They find files one at a time. They don't map
out that your billing service depends on the email service, that
<code>BillingService</code> is consumed by two other domains, or that changing its
interface is a cross-domain event. The information is in the codebase. Nobody's
pulling it together.</p>
<p>And every session starts from zero. The AI learned your architecture yesterday
and forgot it today.</p>
<h2>Evolutionary architecture for the AI era</h2>
<p>My thesis: cheap AI-built MVPs plus expensive scaling problems point toward
evolutionary architecture with domain-based boundaries.</p>
<p>The idea isn't new. The reason it matters now is.</p>
<p>In an evolutionary architecture, you focus on clean interfaces between business
domains. Your email service exposes a contract like
<code>sendEmail(to, subject, body)</code>, and the rest of the system calls that interface.
Behind the interface, the implementation evolves through stages as your scaling
needs change:</p>
<pre><code class="language-mermaid">graph LR
    A[&quot;Inline\n(direct call)&quot;] --&gt; B[&quot;Async\n(fire &amp; forget)&quot;]
    B --&gt; C[&quot;Queued\n(BullMQ/SQS)&quot;]
    C --&gt; D[&quot;Separate Service&quot;]
    D --&gt; E[&quot;Distributed&quot;]
</code></pre>
<p>Day one, <code>sendEmail</code> is a function that calls Resend directly. Inline,
synchronous, dead simple. When traffic picks up, you drop the <code>await</code> and let it
run in the background. Later, you introduce BullMQ or SQS. Eventually it becomes
its own service. The interface stays put. Only the implementation behind it
changes.</p>
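<p>A minimal sketch of that contract stability, in Python for brevity (the
client and queue objects are illustrative stand-ins):</p>
<pre><code class="language-python"># Day one: the contract is sendEmail(to, subject, body), implemented inline.
class InlineEmailService:
    def __init__(self, client):
        self.client = client      # e.g. a Resend SDK client (stand-in)

    def send_email(self, to, subject, body):
        self.client.send(to=to, subject=subject, body=body)

# A later stage: identical contract, but the work lands on a queue and a
# worker does the sending. Nothing that calls send_email has to change.
class QueuedEmailService:
    def __init__(self, queue):
        self.queue = queue        # e.g. a BullMQ/SQS-style queue (stand-in)

    def send_email(self, to, subject, body):
        self.queue.enqueue({'to': to, 'subject': subject, 'body': body})
</code></pre>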
<p>This is the kind of evolution AI coding assistants are terrible at planning for.
They'll inline that email call because it works <em>right now</em>. They have no
concept of where this domain sits on its scaling trajectory.</p>
<h2>Where domain-agents fits in</h2>
<p><a href="https://github.com/jonnonz1/domain-agents">domain-agents</a> is a CLI tool that
runs static analysis on TypeScript codebases, discovers business domains, and
generates AI agent context files for Claude Code and Cursor.</p>
<pre><code class="language-bash">domain-agents discover .    # Analyse codebase → proposal.json
domain-agents init .        # Generate agents/*.md + AGENTS.md
domain-agents hooks claude  # Wire into Claude Code (rules + MCP server)
domain-agents hooks cursor  # Wire into Cursor (.mdc rules)
</code></pre>
<p>After setup, opening <code>src/billing/invoice.ts</code> in Claude Code loads the billing
domain agent into context. The AI now knows: billing depends on email (coupling
score 0.23), exposes <code>BillingService</code> consumed by 2 other domains, sits at the
&quot;inline&quot; scaling stage with a path toward async queuing, and has 3 tracked tech
debt items.</p>
<p>It plans work accordingly. The context was loaded before the first prompt, no
search required.</p>
<h2>Five signals, not one</h2>
<p>The discovery engine runs 5 analysis passes because no single signal identifies
business domains on its own.</p>
<p>Directory structure works for greenfield projects (<code>src/auth/</code>, <code>src/billing/</code>)
but fails for legacy MVC apps. Import graphs capture coupling but not business
intent. Package dependencies hint at external integrations but miss internal
domains.</p>
<pre><code class="language-mermaid">graph TD
    S[&quot;Structure Analysis&quot;] --&gt; O[&quot;Signal Orchestrator&quot;]
    I[&quot;Import Graph\n(TS Compiler API)&quot;] --&gt; O
    N[&quot;Naming Patterns&quot;] --&gt; O
    D[&quot;Dependency Mapping\n(npm → domain hints)&quot;] --&gt; O
    IF[&quot;Interface Detection&quot;] --&gt; O
    O --&gt; M[&quot;Merge Pipeline&quot;]
    M --&gt; R[&quot;Domain Proposal&quot;]
</code></pre>
<p><strong>Structure</strong> detects whether the codebase is feature-organised,
layer-organised, mixed, or flat. <strong>Import graph</strong> uses the TypeScript Compiler
API to parse each <code>.ts</code> file, resolve imports, and build a directed edge graph.
Type-only imports get weighted at 0.3 because they're a weaker coupling signal
than value imports. <strong>Naming patterns</strong> extract domain prefixes:
<code>auth.controller.ts</code> → &quot;auth&quot;. <strong>Dependency mapping</strong> maps npm packages to
domain hints (<code>stripe</code> → billing, <code>@sendgrid/mail</code> → email). <strong>Interface
detection</strong> identifies files imported across domain boundaries and calculates
coupling scores between domain pairs.</p>
<p>Each pass produces weighted signals. The orchestrator combines them with
confidence scoring: average signal strength plus a bonus for signal count,
capped at 0.99. Layer-organised codebases get a 0.85 multiplier because they're
harder to discover.</p>
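<p>As a sketch, that scoring reduces to a few lines. The 0.99 cap and the 0.85
multiplier are the real values from above; the size of the per-signal bonus is
illustrative:</p>
<pre><code class="language-python">def domain_confidence(signal_strengths, layer_organised):
    # Average signal strength, a bonus per corroborating signal, capped
    # at 0.99, penalised for layer-organised codebases.
    base = sum(signal_strengths) / len(signal_strengths)
    bonus = 0.05 * (len(signal_strengths) - 1)    # assumed bonus size
    confidence = min(base + bonus, 0.99)
    if layer_organised:
        confidence *= 0.85
    return confidence

print(domain_confidence([0.8, 0.7, 0.9], layer_organised=True))  # ~0.76
</code></pre>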
<h2>Most real codebases aren't clean</h2>
<p>Feature-organised codebases are easy. The directory structure <em>is</em> the domain.
But most real codebases look like this:</p>
<pre><code>src/
  controllers/
    auth.controller.ts
    billing.controller.ts
  services/
    auth.service.ts
    billing.service.ts
  models/
    invoice.model.ts
    user.model.ts
</code></pre>
<p>Here <code>auth.controller.ts</code>, <code>auth.service.ts</code>, and <code>auth.routes.ts</code> all belong to
the &quot;auth&quot; domain despite living in three different directories. domain-agents
uses naming pattern extraction cross-referenced with import graph cohesion to
cluster these. The <code>auth.*</code> files form a tight import cluster, which confirms
the naming signal.</p>
<h2>Merging is the hard bit</h2>
<p>Raw signals produce too many small, overlapping clusters. The orchestrator runs
a multi-phase normalisation pipeline.</p>
<p>Plurals merge: <code>journals</code> + <code>journal</code> → whichever has more files. Compound names
consolidate: <code>bank-balance</code> + <code>bank-statement</code> + <code>bank-transaction</code> →
<code>bank-accounts</code> (the largest cluster). Small clusters merge into their strongest
import target, but only if they have a dominant dependency: more than 40% of
imports from one target, and that target is at least 2x larger. This prevents
cascading, where A merges into B, B gets bigger and attracts C, C pulls in D.</p>
<p>Files that import from 3+ domains get moved to &quot;unassigned.&quot; These are coupling
hotspots: middleware, orchestrators, shared handlers. Assigning them to one
domain would mislead the AI, so the tool surfaces them for a human decision.
That's the right call for architectural boundaries.</p>
<p>The E2E test suite validates the complete pipeline against 3 fixture codebases
(feature-organised, layer-organised, mixed). Current benchmark: 100% activation
accuracy across all 3 patterns and all 3 activation levels (domain assignment,
glob matching, MCP lookup).</p>
<h2>Auto-activation, not search</h2>
<p>The integration into Claude Code and Cursor uses glob-based rule activation, the
native mechanism both tools already support.</p>
<p>Each domain gets a rule file with glob patterns in the frontmatter:</p>
<pre><code class="language-yaml">---
description: billing domain
globs:
  - src/billing/**
  - '**/billing.*'
  - '**/billing-*'
---
</code></pre>
<p>When Claude Code opens a file matching those globs, the domain context loads. No
MCP call, no background process, zero runtime overhead.</p>
<p>An <a href="https://modelcontextprotocol.io/">MCP server</a> complements the rules with 4
on-demand tools: <code>domain_lookup(file)</code>, <code>domain_context(name)</code>,
<code>domain_files(name)</code>, and <code>list_domains()</code>. A SessionStart hook prints a domain
summary at the start of every Claude Code session, so the AI has system-level
awareness from the first prompt.</p>
<h2>Agents as a team model</h2>
<p>This is the bit I'm most keen on long-term.</p>
<p>At Vend and Xero, teams owned domains. The billing team owned billing, the
integrations team owned integrations. Ownership meant knowing the interfaces,
the coupling points, the tech debt, and where things were headed. That knowledge
lived in people's heads and got passed on through code reviews, architecture
chats, and tribal memory.</p>
<p>Domain-specific AI agents formalise that same ownership model. An email agent
loads the email domain's interface contract, its coupling to other domains, its
current scaling stage, and its tracked tech debt. A billing agent carries the
same for billing. They work within their boundaries and flag when a change
crosses a domain line.</p>
<p>You don't need this from day one. Early on, one agent covers multiple areas. As
the product grows, agents split along the same lines engineering teams split: by
business domain. The operator (that's you) resolves conflicts where agents
disagree, the same way an engineering manager resolves cross-team dependencies.</p>
<p>The analogy is rough, but it captures how AI-assisted development scales past a
single person staring at a single context window.</p>
]]>
      </content:encoded>
      <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Claude Code Can Now Spawn Copies of Itself in Isolated VMs</title>
      <link>https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/</guid>
      <description>
        Wiring up an MCP server so Claude Code can spawn isolated VMs, the SSE-to-Streamable-HTTP migration that broke everything, streaming architecture, and what productionising this would actually look like.
      </description>
      <content:encoded>
        <![CDATA[<p>The moment this project went from &quot;fun weekend hack&quot; to something I actually use
every day was when I got the MCP server working. Claude Code on my laptop sends
a prompt to the orchestrator sitting under my desk, which boots a VM, runs
Claude Code inside it with full permissions, and streams the results back.
Claude delegating work to Claude.</p>
<p>It's a weird feeling watching it happen. You're in a conversation with Claude,
it decides a task needs isolation, calls the MCP tool, and a few seconds later
you can see a fresh VM spinning up in the dashboard. Like having an intern who
can clone themselves.</p>
<p><a href="https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/">Part 1</a>
covered why I built this.
<a href="https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/">Part 2</a> was the
guts of it — rootfs, networking, the guest agent. This last post is about the
interfaces, the streaming pipeline, and what I'd change if this needed to work
for more than just me.</p>
<h2>The MCP server</h2>
<p>The orchestrator exposes an <a href="https://modelcontextprotocol.io/">MCP</a> server with
eight tools. The main one is <code>run_task</code> — give it a prompt, optional config
(RAM, vCPUs, timeout, max turns), and it blocks until the task completes.
Returns the task ID, status, exit code, result files, cost, and the output
truncated to 4000 characters.</p>
<p>Two transport modes. Stdio for when Claude Code runs on the same machine:</p>
<pre><code class="language-json">{
  &quot;mcpServers&quot;: {
    &quot;orchestrator&quot;: {
      &quot;command&quot;: &quot;sudo&quot;,
      &quot;args&quot;: [&quot;/opt/firecracker/bin/orchestrator&quot;, &quot;mcp&quot;]
    }
  }
}
</code></pre>
<p>And Streamable HTTP for network access — Claude Code on any machine on the LAN
can use it:</p>
<pre><code class="language-json">{
  &quot;mcpServers&quot;: {
    &quot;orchestrator&quot;: {
      &quot;type&quot;: &quot;http&quot;,
      &quot;url&quot;: &quot;http://192.168.50.44:8081/mcp&quot;
    }
  }
}
</code></pre>
<p>The other tools are for poking around: <code>get_task_status</code>, <code>list_vms</code>,
<code>exec_in_vm</code> (run a command in a still-running VM), <code>read_vm_file</code>,
<code>destroy_vm</code>, <code>list_task_files</code>, and <code>get_task_file</code>. That last one is smart
about content types — text files come back as plain text, images come back as
base64 MCP image content so Claude can actually see screenshots the VM took.</p>
<pre><code class="language-go">if isImageMime(mimeType) {
    encoded := base64.StdEncoding.EncodeToString(data)
    return mcplib.NewToolResultImage(&quot;Screenshot from task &quot;+taskID, encoded, mimeType), nil
}
</code></pre>
<h2>The migration that broke everything</h2>
<p>This bit is worth telling because it'll save someone else the debugging time.</p>
<p>I originally built the MCP server with
<a href="https://github.com/mark3labs/mcp-go">mcp-go</a> v0.45.0 using SSE (Server-Sent
Events) transport. Worked great. Then Claude Code updated to expect the newer
Streamable HTTP transport, and everything fell over.</p>
<p>The failure mode was confusing. Claude Code would try to connect, attempt OAuth
discovery against the <code>/sse</code> endpoint, get a 404 (my server doesn't do OAuth),
and fail with:</p>
<pre><code>Error: HTTP 404: Invalid OAuth error response: SyntaxError: JSON Parse error: Unable to parse JSON string
</code></pre>
<p>Nothing in my code changed. The client just started speaking a different
protocol.</p>
<p>The fix was small once I understood it:</p>
<pre><code class="language-go">// Before — SSE transport
func (s *Server) ServeSSE(addr string) error {
    sseServer := server.NewSSEServer(s.mcpServer,
        server.WithBaseURL(&quot;http://&quot;+addr),
    )
    return sseServer.Start(addr)
}

// After — Streamable HTTP transport
func (s *Server) ServeHTTP(addr string) error {
    httpServer := server.NewStreamableHTTPServer(s.mcpServer,
        server.WithEndpointPath(&quot;/mcp&quot;),
        server.WithStateLess(true),
    )
    return httpServer.Start(addr)
}
</code></pre>
<p>Bumped mcp-go from v0.45.0 to v0.46.0, swapped the server constructor, changed
the endpoint from <code>/sse</code> to <code>/mcp</code>, updated the client config. Done. But
diagnosing &quot;OAuth error on a server that doesn't do OAuth&quot; — that bit took a
while.</p>
<h2>Output streaming</h2>
<p>When Claude Code runs inside a VM, its output needs to get from stdout inside
the guest all the way to a browser tab on my laptop. The path:</p>
<pre><code class="language-mermaid">flowchart LR
    A[&quot;Claude Code stdout&quot;] --&gt; B[&quot;Guest agent\nvsock frame&quot;]
    B --&gt; C[&quot;Host vsock client\nExecStream&quot;]
    C --&gt; D[&quot;Task runner\nOnEvent callback&quot;]
    D --&gt; E[&quot;Stream Hub\nring buffer + fan-out&quot;]
    E --&gt; F[&quot;WebSocket\nto browser&quot;]
</code></pre>
<p>The stream hub (<code>internal/stream/hub.go</code>) is a per-task pub/sub system. Each
task gets a stream with a 1000-event ring buffer. When a WebSocket client
connects, it gets all the buffered history first, then live events as they
arrive.</p>
<p>Fan-out is non-blocking:</p>
<pre><code class="language-go">for ch := range s.subscribers {
    select {
    case ch &lt;- event:
    default:
        // Subscriber is slow, drop the event
    }
}
</code></pre>
<p>A slow WebSocket client can't block the task runner. If the browser can't keep
up, it misses events. In practice this never happens because the bottleneck is
always Claude thinking, not the network.</p>
<h2>The web dashboard</h2>
<p>The React frontend is compiled to static files and embedded into the Go binary:</p>
<pre><code class="language-go">//go:embed all:web-dist
var webDistEmbed embed.FS
</code></pre>
<p>Single binary deployment. No nginx, no separate frontend server, no CORS
headaches in production. The API server falls through to <code>index.html</code> for
unknown paths, which gives you SPA client-side routing.</p>
<p>The most interesting page is the task detail view. Claude Code's
<code>--output-format stream-json</code> spits out one JSON object per line — thinking
blocks, text responses, tool calls, tool results, cost summaries. The dashboard
parses these into coloured blocks:</p>
<ul>
<li>Purple for thinking (Claude's internal reasoning)</li>
<li>Blue for text responses</li>
<li>Orange for tool calls (shows the tool name and input)</li>
<li>Grey for tool results (truncated to 2000 chars — some of these are enormous)</li>
<li>Green for the final result with cost</li>
</ul>
<p>A <code>useWebSocket</code> hook connects when the task is running and disconnects when
it's done. Green pulsing dot for live streaming. Auto-scroll to the bottom as
events arrive. Image files in the results get inline previews pointing at the
API's file download endpoint — so when Claude takes a screenshot inside the VM,
you see it immediately.</p>
<p>Dark theme. Orange accents. Obviously.</p>
<h2>What productionising looks like</h2>
<p>This runs on one box with no auth. It's a home lab project. But the gap between
&quot;works for me&quot; and &quot;works for a small team&quot; isn't as big as it looks.</p>
<p><strong>Persistence</strong> is the most obvious one. The task store is an in-memory Go map.
Orchestrator restarts? All task history gone. VM metadata already persists to
disk and gets recovered on startup — tasks should too. SQLite or bbolt, a few
hours of work. I just haven't needed it because I don't restart the process very
often.</p>
<p><strong>Task queue with backpressure.</strong> Right now tasks fire as goroutines with no
concurrency limit. Submit 20 tasks on a 30GB machine where each VM wants 2GB and
the last few fail because there's no memory left. A buffered channel or
semaphore would fix this. You could get fancier with priority queues — quick
code generation tasks ahead of long research tasks — but even a simple
concurrency cap would be enough.</p>
<p><strong>Authentication.</strong> The REST API and MCP server accept requests from anyone who
can reach the port. For a team: API keys at minimum, mTLS if you're serious
about it. The MCP spec supports auth flows now — that'd be the right way to do
it for the MCP endpoint.</p>
<p><strong>The OnEvent callback race.</strong> This one's a latent bug. The task runner's
<code>OnEvent</code> callback is stored on the runner struct, not passed per-task:</p>
<pre><code class="language-go">s.taskRunner.OnEvent = func(id string, event agent.StreamEvent) {
    taskStream.Publish(event)
}
s.taskRunner.Run(context.Background(), t)
</code></pre>
<p>Two simultaneous tasks overwrite each other's callbacks. It works today because
MCP tasks block (one at a time) and the API handler sets up the stream before
the goroutine runs. But it's the kind of thing that works until it doesn't. Fix
is trivial — pass the callback into <code>Run()</code> as a parameter.</p>
<p><strong>Graceful shutdown.</strong> There's no signal handler. Ctrl-C kills the process,
running VMs become orphans. They keep running as Firecracker processes — the
<code>recoverState()</code> function on next startup finds them and starts tracking them
again — but their tasks are lost. A proper signal handler would stop accepting
new tasks, wait for running ones to finish with a timeout, then tear everything
down cleanly.</p>
<p><strong>For real multi-user</strong> you'd want result storage on S3 or R2 instead of local
disk. A web auth layer. Per-user credential vaults so different people's Claude
tokens don't mix. Usage tracking and cost attribution.</p>
<p><strong>What I wouldn't change:</strong> the single-binary deployment, vsock for host-guest
communication, ephemeral VMs as the isolation model, the embedded frontend.
Those are the right calls regardless of scale. The architecture is sound — it's
the operational bits around it that need work.</p>
<p>Most of these are a weekend each. The project is about 3,200 lines of Go and 860
of TypeScript. It's not a big codebase. Adding persistence, auth, and a task
queue would maybe take it to 4,500 lines. Still fits in your head.</p>
<p>For now, it sits under my desk and boots VMs when I ask it to. Claude delegating
to Claude, in complete isolation, on hardware I own. That's enough.</p>
]]>
      </content:encoded>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>OpenHealth – Chat with Apple Health Data, Anywhere</title>
      <link>https://jonnonz.com/posts/openhealth-chat-with-apple-health-data/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/openhealth-chat-with-apple-health-data/</guid>
      <description>
        I built OpenHealth because Claude's and ChatGPT's Apple Health connectors are US-only. Drop your export.zip, get seven markdown files any LLM can read — browser, CLI, or phone-to-desktop over WebRTC.
      </description>
      <content:encoded>
        <![CDATA[<p>For years I've worn an Apple Watch and let my iPhone quietly hoover up my
resting heart rate, HRV, sleep stages, every workout, every nutrition log.
Millions of data points. And for most of that time, when I wanted to actually
<em>ask</em> something about my training — &quot;am I cooked this week?&quot;, &quot;has my recovery
gotten worse since Christmas?&quot; — I'd open ChatGPT and get an answer that was
basically vibes, because it couldn't see any of the data.</p>
<p>So I built <a href="https://github.com/jonnonz1/openhealth">openhealth</a>. It turns your
Apple Health export into seven short markdown files any LLM can read. Drop the
zip in your browser at
<a href="https://openhealth-axd.pages.dev/">openhealth-axd.pages.dev</a>, run the CLI, or
beam the zip straight from your iPhone over WebRTC. Paste the output into Claude
or ChatGPT and start asking the questions you actually wanted to ask.</p>
<p><img src="https://jonnonz.com/img/posts/openhealth/hero.png" alt="openhealth's web app — drop the zip, get seven markdown files, nothing uploaded"></p>
<h2>What's US-only and why that's annoying</h2>
<p>In January, Anthropic
<a href="https://www.macrumors.com/2026/01/22/claude-ai-adds-apple-health-connectivity/">shipped an Apple Health connector</a>
for Claude. OpenAI has one in ChatGPT. Both are US-only — if you're in New
Zealand like me, or the UK, EU, or Switzerland,
<a href="https://context-link.ai/blog/chatgpt-connectors">they're not available</a>. That's
a lot of people locked out of the most natural way to use this data.</p>
<p>And even if you are in the US, you're letting Anthropic or OpenAI decide what
the model reads, how it's framed, and what tier unlocks it. I wanted control
over the whole pipeline — including which LLM I feed it into.</p>
<h2>What I built</h2>
<p>openhealth ships three ways.</p>
<p><strong>A static web app.</strong> Drop <code>export.zip</code>, wait five seconds, download seven
files. The browser does the parse. There's no upload endpoint because there's no
server — the Cloudflare Pages site is static HTML plus a tiny Web Worker. Open
DevTools, watch the Network panel, nothing goes out.</p>
<p><strong>A Bun-compiled CLI.</strong> <code>openhealth ~/export.zip -o ./output</code> gets you seven
markdown files. <code>--bundle</code> concatenates them into one. <code>--clipboard</code> pushes that
bundle straight to your system clipboard so you can paste it into any chat
window. Zero deps beyond <code>saxes</code> for XML and <code>fflate</code> for unzip — even the
argument parsing is <code>node:util parseArgs</code>, not Commander. One binary, put it
wherever.</p>
<p><strong>A phone-to-desktop handoff over WebRTC.</strong> The desktop site renders a QR code.
Point your iPhone camera at it, Safari opens a tiny receiver page, pick the zip,
and it streams directly to your desktop browser over a DataChannel. The only
backend in the whole stack is a ~100-line Cloudflare Worker that relays the
WebRTC handshake — it never sees a byte of your health data.</p>
<p><img src="https://jonnonz.com/img/posts/openhealth/walkthrough.png" alt="Getting the export off your iPhone — six taps, or scan the desktop QR"></p>
<h2>How the parse actually works</h2>
<p>Apple's <code>export.xml</code> is
<a href="https://www.tdda.info/in-defence-of-xml-exporting-and-analysing-apple-health-data">properly huge</a>.
A long-term Watch user can easily have a 500MB–4GB file with millions of rows.
Most XML parsers build a tree in memory, which OOMs before they finish.</p>
<p>openhealth uses <a href="https://github.com/lddubeau/saxes">saxes</a> — a streaming SAX
parser in pure TypeScript. It's isomorphic, so the same parser runs in Bun,
Node, and the browser. I tested it against a synthetic 169MB / 1 million-record
export and it finished in about 5 seconds in Chrome, with the main-thread heap
staying around 5MB because the parse runs in a Web Worker.</p>
<p>The rest of the core is a small pipeline: stream XML, accumulate
per-record-type, roll up into weekly and monthly summaries, run each through a
writer that produces one markdown file. Every writer is snapshot-tested against
byte-for-byte expected output. 85 tests, TDD throughout.</p>
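<p>The streaming pattern itself fits in a few lines with Python's stdlib SAX
parser (openhealth does the equivalent in TypeScript with saxes). Counting
records per type stands in for the real accumulators:</p>
<pre><code class="language-python">import xml.sax

class RecordCounter(xml.sax.ContentHandler):
    # React to each &lt;Record&gt; element as it streams past and keep only
    # running aggregates, so memory stays flat regardless of file size.
    def __init__(self):
        self.totals = {}

    def startElement(self, name, attrs):
        if name == 'Record':
            rtype = attrs.get('type', 'unknown')
            self.totals[rtype] = self.totals.get(rtype, 0) + 1

handler = RecordCounter()
xml.sax.parse('export.xml', handler)   # streams; never builds a tree
print(sorted(handler.totals.items(), key=lambda kv: -kv[1])[:10])
</code></pre>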
<h2>What the seven files are</h2>
<p>Each one is deliberately small and shaped to be LLM-readable:</p>
<ul>
<li><code>health_profile.md</code> — baselines, data sources, long-term averages</li>
<li><code>weekly_summary.md</code> — current week plus a 4-week rolling comparison with
week-over-week deltas</li>
<li><code>workouts.md</code> — detailed log for the last 4 weeks: HR, duration, distance,
energy</li>
<li><code>body_composition.md</code> — weight trend, recent readings, nutrition averages</li>
<li><code>sleep_recovery.md</code> — nightly stages, 8-week averages, HRV, resting HR, SpO2
trends</li>
<li><code>cardio_fitness.md</code> — running log, HR-zone distribution, walking-speed trends</li>
<li><code>prompt.md</code> — a ready-to-paste system prompt that frames the other six as
coaching input</li>
</ul>
<p>Drop one file or all seven, depending on which chat model you're using.</p>
<h2>What it's actually good at</h2>
<p>Feeding real data to an LLM is a different experience from answering its
questions. When Claude can see that my resting HR has crept up 4bpm over the
last fortnight while my HRV has dropped and my training load stayed the same, it
gives a real answer — &quot;you're likely undercooked on recovery this week, here's
what I'd change&quot; — rather than a generic reminder to drink water.</p>
<p>It's especially good if you've got multiple devices in the mix. I've got data
from Apple Watch, the iPhone step counter, a Withings scale, and MyFitnessPal.
The parser picks the highest-trust source per metric — Apple Watch wins over
iPhone for steps, Watch sleep beats AutoSleep which beats Withings,
duplicate-weight entries on the same day get deduped. You feed in one zip and
get one coherent picture.</p>
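<p>The selection logic reduces to a per-metric trust ranking. A sketch with
illustrative rankings (the real table covers more metrics and sources):</p>
<pre><code class="language-python">TRUST = {
    'steps': ['Apple Watch', 'iPhone'],
    'sleep': ['Apple Watch', 'AutoSleep', 'Withings'],
}

def pick_reading(metric, readings):
    # From same-day readings of one metric, keep the highest-trust source.
    rank = {src: i for i, src in enumerate(TRUST.get(metric, []))}
    ranked = [r for r in readings if r['source'] in rank]
    if not ranked:
        return readings[0] if readings else None
    return min(ranked, key=lambda r: rank[r['source']])
</code></pre>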
<p>Ask it about your recovery, your training load, what you might be doing wrong,
how your sleep correlates with your long runs. It'll tell you — and it'll be
right more often than not.</p>
<h2>If privacy matters, go all the way</h2>
<p>openhealth itself never uploads your data. The web app parses in your browser
tab. The CLI runs locally. The WebRTC handoff stays peer-to-peer — the
Cloudflare Worker that relays the handshake never sees a byte of the file. Clone
the repo, diff the build output, and confirm it yourself.</p>
<p>When you paste the seven files into ChatGPT or Claude, <em>they</em> see the data.
That's the trade most people will take for convenience, and it's fine. But if
you don't want to make that trade, you don't have to — run the CLI and pipe the
bundle into a local model:</p>
<pre><code class="language-bash">openhealth ~/export.zip --bundle -o ./out
ollama run llama3 &lt; ./out/openhealth.md
</code></pre>
<p>Ollama, llama.cpp, LM Studio, whatever you run. Your health data never leaves
your laptop. The output is just markdown — it doesn't care what reads it.</p>
<p>That's why the shape is seven files and not an API. You pick what sees them.</p>
<p>I'm not a doctor. Neither is the model. Use this for thinking out loud about
your own training, not diagnosing anything.</p>
<p>MIT, source at
<a href="https://github.com/jonnonz1/openhealth">github.com/jonnonz1/openhealth</a>. Web
app at <a href="https://openhealth-axd.pages.dev/">openhealth-axd.pages.dev</a>. If you've
been sitting on a 200MB <code>export.zip</code> with nothing that'll open it, have a go.</p>
]]>
      </content:encoded>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The Future of Security Is an Open-Source Model That Detects and Acts on Threats</title>
      <link>https://jonnonz.com/posts/llm-kills-compromised-services-at-3am/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/llm-kills-compromised-services-at-3am/</guid>
      <description>
        Anthropic's Glasswing is for enterprises with six-figure security budgets. The rest of us need open-source security agents that learn our systems and act autonomously.
      </description>
      <content:encoded>
        <![CDATA[<p>Anthropic just dropped <a href="https://www.anthropic.com/glasswing">Project Glasswing</a>
— a big collaborative cybersecurity initiative with a shiny new model called
Claude Mythos Preview that can find zero-day vulnerabilities at scale. Twelve
major tech companies involved. $100M in credits. Found a 27-year-old flaw in
OpenBSD. Impressive stuff.</p>
<p>But let's be real about what's happening here. Anthropic trained a model so
capable at breaking into systems that they decided it was too dangerous to
release publicly. So they wrapped the release in a collaborative security
initiative. The security work is genuinely valuable. But it's also a smart way
to keep control of something they know is too powerful to let loose.</p>
<p>The part that actually matters, though, is who benefits. Glasswing is for the
big players. The companies with security teams, budgets, and the kind of
infrastructure that gets invited to sit at the table with AWS, Microsoft, and
Palo Alto Networks. What about the rest of us? The startups, the small SaaS
shops, the indie developers running production systems on a shoestring?</p>
<p>The internet is a
<a href="https://bigthink.com/books/how-the-dark-forest-theory-helps-us-understand-the-internet/">dark forest</a>.
That's not a metaphor anymore — it's becoming the literal reality. Bots,
scrapers, automated exploit chains, credential stuffing, AI-generated phishing.
A server goes up and within hours it's being scanned, fingerprinted, and probed
by systems that don't sleep. Visibility equals vulnerability. And AI is making
the attackers faster, cheaper, and more autonomous every month.</p>
<p>The
<a href="https://www.isc2.org/insights/2026/04/ai-driven-defense-and-autonomous-attacks">ISC2 put it plainly</a>
— both offence and defence now operate at speeds beyond human intervention. The
threats aren't people sitting at keyboards anymore. They're autonomous systems
running campaigns end-to-end.</p>
<p>So what do we do about it?</p>
<h2>Offensive security — but not the kind you're thinking</h2>
<p>When I say offensive security, I don't mean red-teaming or penetration testing.
I mean giving your systems the ability to fight back.</p>
<p>Picture an LLM that sits across your centralised logs — network traffic,
database queries, user interactions, access patterns — and builds an
understanding of what normal looks like for your system over weeks and months.
Not just pattern matching against known signatures. Actually understanding the
shape of healthy behaviour.</p>
<p>When something breaks the pattern, it doesn't just alert. It acts.</p>
<p>Disable a compromised account. Kill a service that's behaving strangely. Block a
database connection that shouldn't exist. Create an incident with full context
for a human to review. The response is proportional and immediate — not waiting
for someone to check their phone at 3am.</p>
<p>The architecture is pretty straightforward:</p>
<pre><code class="language-mermaid">graph TD
    A[Application Logs] --&gt; D[Secure Isolated Log Store]
    B[Network Traffic] --&gt; D
    C[Database Queries] --&gt; D
    D --&gt; F[Baseline Health Model]
    E[User Activity] --&gt; D
    F --&gt;|Anomaly Detected| G[LLM Analysis]
    G --&gt;|Analyse &amp; Plan| H{Threat Assessment}
    H --&gt;|Low| I[Alert &amp; Log]
    H --&gt;|Medium| J[Restrict &amp; Escalate]
    H --&gt;|High| K[Disable &amp; Isolate]
    I --&gt; L[Human Review]
    J --&gt; L
    K --&gt; L
</code></pre>
<p>The key is that the logging and analysis layer has to be isolated and secured
separately from the systems it's watching. If an attacker can compromise the
thing that's watching them, the whole model falls apart.</p>
<p>In practice that means separate infrastructure with its own auth boundary.
Ingestion is write-only — your application services push logs in but can never
read or modify what's already there. Append-only, immutable. The analysis layer
gets scoped service accounts that can read logs, fire alerts, and pull specific
emergency levers through a narrow API. Nothing else. If a compromised service
tries to reach the log store directly, it hits a wall.</p>
<p>None of this is exotic. Centralised logging, immutable storage, scoped IAM — the
building blocks exist. The hard part is wiring an LLM into that loop with the
right constraints. Enough access to act, not enough to make things worse.</p>
<h2>Adaptive, not rule-based</h2>
<p>Traditional security tooling runs on signatures and static rules. Known bad
patterns, blocklists, threshold alerts. That worked when threats were mostly
human-paced. It doesn't work when you're up against autonomous systems that
adapt faster than you can write rules.</p>
<p>The alternative is a system that learns what normal looks like for <em>your</em>
environment — not a generic baseline, but the actual shape of healthy behaviour
in your specific infrastructure. Traffic patterns, query frequencies, access
timing, user behaviour. Weeks of observation before it starts making decisions.</p>
<p>When something breaks the pattern, the response is proportional. A sudden spike
in unusual API calls might trigger deeper correlation — the system widens its
search, pulls in more signals, lowers its threshold for flagging related
activity. Repeated failed auth attempts from new IPs tighten access controls
automatically. A database connection that shouldn't exist gets killed.</p>
<p>This isn't a static ruleset you configure once and hope covers everything. It's
a system that develops behavioural intuition from running in your environment,
responding to your traffic. The difference matters — static rules are brittle
against novel attacks, while adaptive systems can catch anomalies they've never
seen before.</p>
<p>The baseline isn't magic. It's watching five things:</p>
<ul>
<li><strong>Rate</strong> — how many events per time window. A user who averages 50 API calls
per hour suddenly making 500 is a signal.</li>
<li><strong>Composition</strong> — what's in those events. The same user always hitting
/api/users and /api/orders suddenly hammering /api/admin/export.</li>
<li><strong>Cardinality</strong> — how many unique values. One IP hitting 3 endpoints is
normal. One IP cycling through 200 endpoints in an hour isn't.</li>
<li><strong>Latency</strong> — how fast things happen. Legitimate users pause, think, navigate.
Bots don't.</li>
<li><strong>Novelty</strong> — things the system has never seen. A new endpoint, a new
parameter, a user agent string that doesn't match anything in the training
window.</li>
</ul>
<p>Three layers of detection stack on top of each other. Layer one is simple
thresholds — hard caps that trigger immediately. Layer two is statistical
deviation — standard deviations from the learned baseline. Layer three is
correlation — looking across multiple signals simultaneously. A spike in rate
alone might be fine. A spike in rate plus unusual composition plus new source
IP? That's a pattern.</p>
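<p>Sketched in Python, the three layers stack roughly like this. Every
threshold here is illustrative, not a tuned value:</p>
<pre><code class="language-python">from statistics import mean, stdev

def assess(window, baseline):
    # Layer 1: hard caps that trigger immediately.
    if window['rate'] &gt; baseline['rate_hard_cap']:
        return 'high'

    # Layer 2: statistical deviation from the learned baseline.
    history = baseline['rate_history']
    z = (window['rate'] - mean(history)) / (stdev(history) or 1.0)

    # Layer 3: correlation across signals. One anomaly alone is weak
    # evidence; several together form a pattern.
    corroborating = sum([
        z &gt; 3.0,                                                   # rate
        window['new_endpoints'] &gt; 0,                               # novelty
        window['unique_endpoints'] &gt; baseline['cardinality_p99'],  # cardinality
    ])
    if corroborating &gt;= 2:
        return 'medium'
    return 'low' if corroborating == 1 else 'ok'
</code></pre>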
<h2>Learning to recognise yourself</h2>
<p>A pure anomaly detector would go nuts during deploys. New code paths, changed
response times, config reloads — all of it looks unusual. Same with cron jobs.
Your 3am batch job that hits the database hard every night would trigger alerts
every night.</p>
<p>Tolerance patterns solve this. The system learns to recognise you.</p>
<p>Mark a deploy event, and the system creates a tolerance window — elevated
thresholds for the next 30 minutes. Register a recurring cron job, and the
system expects that exact spike at that exact time. These aren't exceptions you
configure manually. They're patterns the system learns from watching.</p>
<p>After a few weeks, it knows when your weekly cache warm-up runs, when your daily
reports generate, when deploys happen. It stops bothering you about the things
you do on purpose.</p>
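<p>Mechanically, a tolerance window can be as simple as this sketch (the
30-minute deploy window is from above; the multiplier is illustrative):</p>
<pre><code class="language-python">import time

tolerance_windows = []   # (start, end, threshold multiplier)

def mark_deploy(duration_s=1800, multiplier=3.0):
    # A deploy marker opens a 30-minute window of elevated thresholds.
    now = time.time()
    tolerance_windows.append((now, now + duration_s, multiplier))

def effective_threshold(base):
    # Inside any live window the threshold scales up; outside, base applies.
    now = time.time()
    live = [m for (s, e, m) in tolerance_windows if s &lt;= now &lt;= e]
    return base * max(live, default=1.0)
</code></pre>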
<h2>The system gets cheaper over time</h2>
<p>Calling an LLM for every anomaly would be expensive. The trick is building
immune memory.</p>
<p>When the LLM analyses an anomaly and decides it's benign — say, a deploy spike
or a legitimate traffic surge — that verdict gets stored. Next time the same
pattern appears, the system recognises it. No LLM call needed.</p>
<p>This is how your security bill drops over the first few weeks. Early on,
everything is novel. The LLM gets called constantly. A month in, most anomalies
match patterns it's already seen. The LLM only gets called for genuinely new
situations.</p>
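<p>The cache itself is simple; the design decision is what goes into the
pattern signature. A sketch with guessed fields and a stubbed model call:</p>
<pre><code class="language-python">import hashlib, json

verdict_cache = {}   # pattern signature -&gt; cached verdict

def signature(anomaly):
    # Reduce an anomaly to a stable fingerprint so repeats can be matched.
    key = {k: anomaly[k] for k in ('kind', 'service', 'direction')}
    return hashlib.sha256(json.dumps(key, sort_keys=True).encode()).hexdigest()

def call_llm(anomaly):
    return 'benign'          # placeholder for the expensive model call

def analyse(anomaly):
    sig = signature(anomaly)
    if sig in verdict_cache:
        return verdict_cache[sig]   # seen before: no LLM call needed
    verdict = call_llm(anomaly)     # only genuinely novel patterns pay
    verdict_cache[sig] = verdict
    return verdict
</code></pre>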
<p>The more your system runs, the smarter it gets and the less it costs.</p>
<h2>Setup without a PhD</h2>
<p>The hardest part of any security tool is configuration. Getting thresholds
right. Understanding your traffic patterns before you can tell the tool what's
normal.</p>
<p><code>darkforest init</code> flips this. Point it at a log sample — a day's worth of
traffic, a week if you've got it — and Claude reads it. Not just parsing,
actually understanding the shape of your system. It figures out what your
endpoints are, what normal request rates look like, what user agents show up,
where your traffic comes from geographically.</p>
<p>Then it writes your config file for you.</p>
<p>You review it, tweak anything that looks wrong, and you're running. No
spreadsheets. No guesswork about what &quot;normal&quot; means for your specific stack.
The LLM that's going to watch your logs already understands them.</p>
<h2>This has to be open</h2>
<p>Glasswing is cool.
<a href="https://github.com/aliasrobotics/CAI">Open-source frameworks like CAI</a> are
making progress — but mostly on the offensive side, using LLMs for penetration
testing and vulnerability research. On the defensive side, the tooling barely
exists. There's no open-source equivalent for the kind of adaptive monitoring
and response I'm describing here.</p>
<p>The building blocks are around. Centralised logging is a solved problem. Open
standards for security event formats are maturing. Smaller open models are more
than capable of pattern analysis on local infrastructure. What's missing is the
glue — a framework that takes logs in, builds a baseline, detects anomalies, and
can actually respond. Something a small team can deploy without a six-figure
security budget.</p>
<p>The threats don't discriminate by company size. The defences shouldn't either.
This can't be proprietary or locked behind enterprise contracts.</p>
<p>The dark forest doesn't care how big your company is. The bots scanning your
infrastructure don't check your headcount before they attack. If the threats are
going to be this accessible, the defences need to be too.</p>
<p>I'm building this. An open-source security agent — adaptive, autonomous, acts
when something breaks the pattern. Small enough for a startup to run on their
own infrastructure. Centralised logging, open LLMs, scoped response actions. The
pieces are all there. I'm wiring them together now.</p>
<p>For v0.1, one real action working end-to-end: detect anomalous authentication
patterns, call the LLM for analysis, and disable the compromised account via
your identity provider's API. Not just alerting — actually responding while
you're asleep. That's the proof of concept that matches the headline.</p>
<p>I'm actively working on this and looking for early testers. If you want alpha
access when it's ready, or just want to follow along,
<a href="https://jonnonz.com/posts/llm-kills-compromised-services-at-3am/#newsletter">drop your email below</a>. I'll reach out when there's something to
try.</p>
]]>
      </content:encoded>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>I Spent 29 Hours Debugging iptables to Boot VMs in 4 Seconds</title>
      <link>https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/</guid>
      <description>
        Building a Firecracker MicroVM rootfs, wiring up TAP networking, debugging UFW's FORWARD chain, and writing a guest agent with raw vsock syscalls in Go.
      </description>
      <content:encoded>
        <![CDATA[<p>The first time I got a Firecracker VM to boot and respond to a vsock ping from
the host, I sat there grinning like an idiot. Typed a command on my machine, it
reached through a kernel-level socket into a completely separate Linux system
with its own kernel, and got a reply. Under a second.</p>
<p>That was about 30 hours into the project. The previous 29 were mostly fighting
with rootfs images and iptables rules.</p>
<p><a href="https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/">Part 1</a>
covered why I built this — Firecracker MicroVMs for running Claude Code in
full-permission isolation. This post is the actual build. Rootfs, networking,
the guest agent, and the streaming pipeline.</p>
<h2>Building the rootfs</h2>
<p>A Firecracker VM needs two things: an uncompressed Linux kernel (<code>vmlinux</code>, not
<code>bzImage</code> — there's no bootloader) and an ext4 filesystem image to use as the
root disk.</p>
<p>The kernel is straightforward — grab a prebuilt 6.1 LTS vmlinux. The rootfs took
more work.</p>
<p>It's a standard ext4 image with Debian Bookworm, and it needs everything Claude
Code might want: Node.js 24, Python 3.11, Chromium for browser automation, git,
curl, jq, and the full Claude Code CLI installed globally via npm. The image
ends up at about 4GB.</p>
<p>The guest agent — the Go binary that listens for commands from the host — lives
inside the rootfs as a systemd service:</p>
<pre><code class="language-bash">sudo mount /opt/firecracker/rootfs/base-rootfs.ext4 /mnt
sudo cp bin/agent /mnt/usr/local/bin/agent
sudo chmod +x /mnt/usr/local/bin/agent

sudo tee /mnt/etc/systemd/system/agent.service &lt;&lt;'EOF'
[Unit]
Description=Orchestrator Guest Agent
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/agent
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
EOF

sudo chroot /mnt systemctl enable agent.service
sudo umount /mnt
</code></pre>
<p>That <code>RestartSec=1</code> matters. If the agent crashes for any reason, systemd has it
back up in a second. The orchestrator polls vsock every 500ms waiting for the
agent, so even a crash during boot is barely noticeable.</p>
<p>You build this rootfs once, by hand. Every new VM gets a sparse copy of it.</p>
<h2>VM lifecycle</h2>
<p><code>internal/vm/manager.go</code> handles the whole lifecycle. It's sequential with
cleanup at each step — if anything fails, it tears down what it already set up
and returns the error.</p>
<pre><code class="language-mermaid">flowchart TD
    A[&quot;Copy rootfs (sparse)&quot;] --&gt; B[&quot;Mount &amp; inject network config&quot;]
    B --&gt; C[&quot;Create TAP device&quot;]
    C --&gt; D[&quot;Add iptables rules&quot;]
    D --&gt; E[&quot;Setup jailer chroot&quot;]
    E --&gt; F[&quot;Write Firecracker config JSON&quot;]
    F --&gt; G[&quot;Launch via jailer --daemonize&quot;]
    G --&gt; H[&quot;Find PID, save metadata&quot;]
    H --&gt; I[&quot;VM ready — poll vsock&quot;]
</code></pre>
<p>The sparse copy is the first thing that happens:</p>
<pre><code class="language-go">cmd := exec.Command(&quot;cp&quot;, &quot;--sparse=always&quot;, BaseRootfs, vm.RootfsPath)
</code></pre>
<p><code>--sparse=always</code> means zero blocks aren't allocated on disk. A 4GB image might
only use 2GB of actual disk space. Takes under a second on NVMe.</p>
<p>After copying, the rootfs gets mounted and three files are injected: a
systemd-networkd config with a static IP, <code>/etc/resolv.conf</code> for DNS, and
<code>/etc/hostname</code>. Then it's unmounted and copied again into the jailer chroot.</p>
<p>Yeah, that's two copies of the rootfs per VM. The first for network injection,
the second because the jailer expects everything inside its chroot. I could
collapse this into one copy by injecting the network config directly into the
chroot copy, but it's never been a bottleneck — sparse copy of 4GB takes less
time than Firecracker takes to boot. So I left it.</p>
<h2>The jailer</h2>
<p>Firecracker's jailer is a separate binary that creates a chroot, sets up minimal
<code>/dev</code> entries (kvm, net/tun, urandom), and runs the Firecracker process inside
it. The VM config is a JSON file:</p>
<pre><code class="language-go">vmConfig := map[string]interface{}{
    &quot;boot-source&quot;: map[string]interface{}{
        &quot;kernel_image_path&quot;: &quot;/vmlinux&quot;,
        &quot;boot_args&quot;:         &quot;console=ttyS0 reboot=k panic=1 pci=off init=/sbin/init&quot;,
    },
    &quot;drives&quot;: []map[string]interface{}{{
        &quot;drive_id&quot;:       &quot;rootfs&quot;,
        &quot;path_on_host&quot;:   &quot;/rootfs.ext4&quot;,
        &quot;is_root_device&quot;: true,
        &quot;is_read_only&quot;:   false,
    }},
    &quot;machine-config&quot;: map[string]interface{}{
        &quot;vcpu_count&quot;:  vm.VCPUs,
        &quot;mem_size_mib&quot;: vm.RamMB,
    },
    &quot;network-interfaces&quot;: []map[string]interface{}{{
        &quot;iface_id&quot;:      &quot;eth0&quot;,
        &quot;guest_mac&quot;:     &quot;06:00:AC:10:00:02&quot;,
        &quot;host_dev_name&quot;: netCfg.TapDev,
    }},
    &quot;vsock&quot;: map[string]interface{}{
        &quot;guest_cid&quot;: vm.VsockCID,
        &quot;uds_path&quot;:  &quot;/vsock.sock&quot;,
    },
}
</code></pre>
<p><code>pci=off</code> because Firecracker doesn't emulate PCI. Paths are relative to the
jailer chroot. The vsock entry creates a Unix domain socket at <code>/vsock.sock</code>
inside the chroot — that's how the host talks to the guest.</p>
<p>Launch looks like this:</p>
<pre><code class="language-go">cmd := exec.Command(JailerBin,
    &quot;--id&quot;, vm.JailID,
    &quot;--exec-file&quot;, FCBin,
    &quot;--uid&quot;, &quot;0&quot;, &quot;--gid&quot;, &quot;0&quot;,
    &quot;--cgroup-version&quot;, &quot;2&quot;,
    &quot;--daemonize&quot;,
    &quot;--&quot;,
    &quot;--config-file&quot;, &quot;/vm-config.json&quot;,
)
cmd.Run()
</code></pre>
<p>After launch there's a 2-second sleep — Firecracker needs a moment to start —
then the PID is found via <code>pgrep</code> and saved to a metadata file. If the
orchestrator restarts, it reads these metadata files and picks up where it left
off. VMs survive orchestrator crashes.</p>
<h2>Networking</h2>
<p>This is where I burned the most time. Not because the concepts are hard, but
because of one specific bug that had me questioning reality.</p>
<p>Each VM needs internet access for Claude Code to fetch packages, clone repos,
and hit the Anthropic API. The approach: each VM gets a Linux TAP device on the
host, a dedicated <code>/24</code> subnet, and iptables rules for NAT.</p>
<h3>IP allocation</h3>
<p>Subnets are deterministic, derived from the VM name using FNV-1a hashing:</p>
<pre><code class="language-go">func NetSlot(name string) int {
    h := fnv.New32a()
    h.Write([]byte(name))
    return int(h.Sum32()%253) + 1
}
</code></pre>
<p>VM named <code>task-a3bfca80</code> might hash to slot 61, giving it subnet
<code>172.16.61.0/24</code>, guest IP <code>172.16.61.2</code>, TAP IP <code>172.16.61.1</code>. No coordination
needed, no DHCP server, no IP pool to manage. The collision space is 253 slots —
more than enough for 12–13 concurrent VMs.
<h3>TAP devices</h3>
<p>A TAP device is a virtual ethernet interface. Firecracker attaches the guest's
<code>eth0</code> to it.</p>
<pre><code class="language-go">tap := &amp;netlink.Tuntap{
    LinkAttrs: netlink.LinkAttrs{Name: cfg.TapDev},
    Mode:      netlink.TUNTAP_MODE_TAP,
}
netlink.LinkAdd(tap)
addr, _ := netlink.ParseAddr(cfg.TapIP + &quot;/24&quot;)
link, _ := netlink.LinkByName(cfg.TapDev)
netlink.AddrAdd(link, addr)
netlink.LinkSetUp(link)
</code></pre>
<p>TAP names are <code>fc-&lt;vm-name&gt;</code>, truncated to 15 characters because Linux interface
names can't be longer. A fun constraint to discover at runtime.</p>
<h3>The iptables rules</h3>
<p>Three rules per VM:</p>
<pre><code class="language-go">// NAT — rewrite source IP when traffic exits the host
ipt.AppendUnique(&quot;nat&quot;, &quot;POSTROUTING&quot;,
    &quot;-s&quot;, cfg.Subnet, &quot;-o&quot;, cfg.HostIface, &quot;-j&quot;, &quot;MASQUERADE&quot;)

// FORWARD — allow outbound from TAP
ipt.Insert(&quot;filter&quot;, &quot;FORWARD&quot;, 1,
    &quot;-i&quot;, cfg.TapDev, &quot;-o&quot;, cfg.HostIface, &quot;-j&quot;, &quot;ACCEPT&quot;)

// FORWARD — allow established/related inbound
ipt.Insert(&quot;filter&quot;, &quot;FORWARD&quot;, 1,
    &quot;-i&quot;, cfg.HostIface, &quot;-o&quot;, cfg.TapDev,
    &quot;-m&quot;, &quot;state&quot;, &quot;--state&quot;, &quot;RELATED,ESTABLISHED&quot;, &quot;-j&quot;, &quot;ACCEPT&quot;)
</code></pre>
<p>See those <code>Insert</code> calls with position 1? That's the bug fix.</p>
<h3>The UFW bug</h3>
<p>I originally used <code>Append</code> for the FORWARD rules. Traffic from the VM would
leave the host fine (NAT worked), but return traffic got dropped. The VM could
resolve DNS but couldn't complete TCP handshakes. I spent an embarrassing amount
of time staring at <code>tcpdump</code> output before I figured it out.</p>
<p>Ubuntu's UFW adds a blanket <code>DROP</code> rule to the FORWARD chain. If you append your
ACCEPT rules, they land <em>after</em> UFW's DROP. They never match. The packets hit
the DROP rule first and get silently killed.</p>
<p><code>Insert</code> at position 1 puts the rules before UFW's. Return traffic flows, VMs
get internet access, everything works.</p>
<p>The traffic path through a working VM:</p>
<pre><code>Guest (172.16.61.2) → eth0 → TAP (fc-task-xxx) → FORWARD ACCEPT
→ NAT MASQUERADE (rewrite src to host IP) → host interface → internet
→ response → RELATED,ESTABLISHED → TAP → guest eth0
</code></pre>
<p>VMs can't reach each other. Each TAP device is point-to-point on its own <code>/24</code>.
There's no route between subnets.</p>
<h2>The guest agent</h2>
<p><code>cmd/agent/main.go</code> — 420 lines of Go. It's a static binary that starts on boot,
listens on vsock port 9001, and handles five request types: ping, exec,
write_files, read_file, and signal.</p>
<p>The interesting one is streaming exec.</p>
<p>When the orchestrator wants to run Claude Code, it sends an exec request with
<code>stream: true</code>. The agent spawns the command, reads stdout and stderr line by
line, and sends each line back as a framed event over the vsock connection. When
the process exits, it sends an exit event with the exit code.</p>
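<p>The events are easy to picture as types. The real definitions live in
<code>internal/agent/protocol.go</code>, so treat these shapes as assumptions:</p>
<pre><code class="language-go">import &quot;net&quot;

// Assumed event shape: one frame per line of output, one final exit frame.
type ExecEvent struct {
    Type string `json:&quot;type&quot;` // &quot;stdout&quot;, &quot;stderr&quot;, or &quot;exit&quot;
    Line string `json:&quot;line,omitempty&quot;`
    Code int    `json:&quot;code,omitempty&quot;`
}

// Each line read from the child becomes one framed event on the vsock
// connection, using the shared WriteFrame helper.
func streamLine(conn net.Conn, stream, line string) error {
    return WriteFrame(conn, ExecEvent{Type: stream, Line: line})
}

// When the process exits, a final frame carries the exit code.
func streamExit(conn net.Conn, code int) error {
    return WriteFrame(conn, ExecEvent{Type: &quot;exit&quot;, Code: code})
}
</code></pre>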
<p>Sounds straightforward. The tricky part is background processes.</p>
<p>Claude Code can start things that outlive the main command — dev servers, file
watchers, whatever it decides it needs. These child processes inherit the
stdout/stderr pipes. If the agent waits for the pipes to close (the normal
approach), it hangs forever because the children are still holding them open.</p>
<p>The fix has three parts:</p>
<pre><code class="language-go">// 1. Process group isolation
cmd.SysProcAttr = &amp;syscall.SysProcAttr{Setpgid: true}

// 2. Wait for the main process, not the pipes
&lt;-waitDone

// 3. Kill the entire process group
pgid, _ := syscall.Getpgid(cmd.Process.Pid)
syscall.Kill(-pgid, syscall.SIGTERM)
time.Sleep(500 * time.Millisecond)
syscall.Kill(-pgid, syscall.SIGKILL)
</code></pre>
<p><code>Setpgid: true</code> puts the command in its own process group. When the main process
exits, kill the group (<code>-pgid</code> means &quot;everything in this group&quot;). SIGTERM first,
wait half a second, then SIGKILL for anything that didn't listen.</p>
<p>Even after killing the group, there's a 3-second timeout waiting for the
pipe-reading goroutines to drain. If they're still stuck after that, move on and
send the exit event anyway. Can't let a hung pipe block the entire task.</p>
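<p>That drain is a plain select-with-deadline. A minimal sketch, assuming a
<code>readersDone</code> channel that closes when both pipe readers finish:</p>
<pre><code class="language-go">// readersDone, sendExit, and exitCode are assumed names for this sketch.
select {
case &lt;-readersDone:
    // stdout/stderr readers drained cleanly
case &lt;-time.After(3 * time.Second):
    // a grandchild still holds the pipes; give up and move on
}
sendExit(exitCode)
</code></pre>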
<p>The line-by-line reader uses a 256KB buffer because Claude Code's
<code>--output-format stream-json</code> can produce enormous single lines — tool results
that include the full contents of files it read.</p>
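<p>If the reader is a <code>bufio.Scanner</code> (an assumption; any line reader
hits the same limit), the default 64KB token cap would reject those lines with
<code>bufio.ErrTooLong</code>, so the buffer has to be raised explicitly:</p>
<pre><code class="language-go">scanner := bufio.NewScanner(stdout)
// Allow single lines up to 256KB; stream-json tool results can be enormous.
scanner.Buffer(make([]byte, 0, 64*1024), 256*1024)
for scanner.Scan() {
    emit(scanner.Text()) // emit is a stand-in for the framed-event send
}
</code></pre>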
<h2>Credential injection</h2>
<p>Before Claude Code runs, the orchestrator writes five things into the VM via
vsock:</p>
<p>OAuth credentials from the host's <code>~/.claude/.credentials.json</code> (mode 0600). A
settings file that allows all tools. An environment script that sets
<code>CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS=true</code>. Task metadata. And a marker file to
create the output directory.</p>
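<p>All five go through the agent's <code>write_files</code> handler. A sketch of
one such request, with assumed field names:</p>
<pre><code class="language-go">import (
    &quot;net&quot;
    &quot;os&quot;
)

// Assumed request shape for the write_files operation.
type FileSpec struct {
    Path string `json:&quot;path&quot;`
    Data []byte `json:&quot;data&quot;` // base64 on the wire, courtesy of encoding/json
    Mode uint32 `json:&quot;mode&quot;`
}

// injectCredentials copies the host's OAuth credentials into the guest.
func injectCredentials(conn net.Conn, credsPath string) error {
    creds, err := os.ReadFile(credsPath) // the host's ~/.claude/.credentials.json
    if err != nil {
        return err
    }
    return WriteFrame(conn, map[string]interface{}{
        &quot;type&quot;: &quot;write_files&quot;,
        &quot;files&quot;: []FileSpec{
            {Path: &quot;/root/.claude/.credentials.json&quot;, Data: creds, Mode: 0600},
        },
    })
}
</code></pre>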
<p>The prompt itself gets written to a temp file inside the VM to avoid shell
escaping nightmares, then referenced in the command:</p>
<pre><code class="language-go">claudeArgs := fmt.Sprintf(
    &quot;claude -p \&quot;$(cat %s)\&quot; --output-format stream-json --verbose&quot;,
    promptFile,
)
cmd := []string{&quot;bash&quot;, &quot;-c&quot;,
    &quot;source /etc/profile.d/claude.sh &amp;&amp; &quot; + claudeArgs}
</code></pre>
<p>When the VM is destroyed, the rootfs — containing the credentials — is deleted.
Credentials only exist for the lifetime of the task.</p>
<h2>Collecting results</h2>
<p>After Claude Code finishes, the orchestrator searches for files it created:</p>
<pre><code class="language-go">// Anything in the output directory
vsock.Exec(jailID, []string{&quot;find&quot;, outputDir, &quot;-type&quot;, &quot;f&quot;, &quot;-not&quot;, &quot;-name&quot;, &quot;.keep&quot;}, nil, &quot;/root&quot;)

// Any new files under /root, created after the prompt was written
vsock.Exec(jailID, []string{&quot;find&quot;, &quot;/root&quot;, &quot;-maxdepth&quot;, &quot;2&quot;, &quot;-type&quot;, &quot;f&quot;,
    &quot;-newer&quot;, &quot;/tmp/claude-prompt.txt&quot;}, nil, &quot;/root&quot;)
</code></pre>
<p>Each file gets downloaded via <code>vsock.ReadFile</code> and saved to
<code>/opt/firecracker/results/&lt;task-id&gt;/</code>. The runner also scans the accumulated
output for Claude's <code>total_cost_usd</code> field to record what the task cost in API
credits.</p>
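<p>The cost scan is just JSON sniffing over those accumulated lines. Roughly,
and assuming the cost arrives in the final result event:</p>
<pre><code class="language-go">import &quot;encoding/json&quot;

// extractCost walks the stream-json lines backwards, since the cost
// normally sits in the last result event.
func extractCost(lines []string) (float64, bool) {
    for i := len(lines) - 1; i &gt;= 0; i-- {
        var ev struct {
            TotalCostUSD *float64 `json:&quot;total_cost_usd&quot;`
        }
        if json.Unmarshal([]byte(lines[i]), &amp;ev) == nil &amp;&amp; ev.TotalCostUSD != nil {
            return *ev.TotalCostUSD, true
        }
    }
    return 0, false
}
</code></pre>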
<p>Then the VM is destroyed. Firecracker process killed, TAP device removed,
iptables rules deleted, jailer chroot deleted, VM state directory deleted. Clean
slate.</p>
<p>The whole cycle — boot, inject, run, collect, destroy — typically takes 30-120
seconds depending on how complex the prompt is. The 4-second boot and ~1-second
teardown are rounding errors compared to the time Claude actually spends
thinking.</p>
<p><a href="https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/">Part 3</a>
gets into the fun stuff — the MCP server that lets Claude delegate tasks to
itself, the streaming architecture, the web dashboard, and what productionising
this would actually look like.</p>
]]>
      </content:encoded>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Can You Beat Last Month?</title>
      <link>https://jonnonz.com/posts/can-you-beat-last-month/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/can-you-beat-last-month/</guid>
      <description>
        Before deep learning gets a chance, we need to know how well stupid-simple models perform. Turns out, they put up a real fight.
      </description>
      <content:encoded>
        <![CDATA[<p>Every machine learning project needs a reality check.</p>
<p>It's tempting to jump straight to the neural network. That's the exciting bit,
right? But if you don't establish what a dead-simple model can do first, you've
got no idea whether your fancy architecture is actually learning anything useful
or just being expensive.</p>
<p>So before ConvLSTM gets anywhere near this data, we're going to throw three
gloriously simple baselines at it and see how they do.</p>
<h2>Persistence: next month equals this month</h2>
<p>The dumbest possible model. To predict April, just use March's values. Every
cell, every crime type. Carbon copy.</p>
<p>It sounds ridiculous, but it works surprisingly well when patterns are stable.
And as we saw in the EDA, Auckland's crime hotspots are remarkably persistent.
The CBD doesn't suddenly go quiet. South Auckland doesn't randomly calm down.</p>
<p>On the six-month test set (August 2025 – January 2026):</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>MAE</th>
<th>RMSE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.42</td>
<td>3.18</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.38</td>
<td>0.91</td>
</tr>
<tr>
<td>Assault</td>
<td>0.22</td>
<td>0.64</td>
</tr>
<tr>
<td>Robbery</td>
<td>0.04</td>
<td>0.15</td>
</tr>
<tr>
<td>Sexual</td>
<td>0.03</td>
<td>0.12</td>
</tr>
<tr>
<td>Harm</td>
<td>0.01</td>
<td>0.04</td>
</tr>
</tbody>
</table>
<p>Those MAE numbers for theft and burglary look small until you remember that most
cells are zero. For the active cells (the ones we actually care about) the error
is larger. A busy CBD cell might have 35 thefts in one month and 28 the next.
Persistence would be off by 7 there, which is a 20% miss on an important
prediction.</p>
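<p>For concreteness, scoring persistence is nothing more than comparing each
month's grid to the one before it. A sketch, with an assumed tensor layout:</p>
<pre><code class="language-go">import &quot;math&quot;

// months[t][c][cell] holds the count for month t, crime type c, grid cell.
// Persistence predicts months[t] as a carbon copy of months[t-1].
func persistenceMAE(months [][][]float64, c int) float64 {
    var sum float64
    var n int
    for t := 1; t &lt; len(months); t++ {
        for cell := range months[t][c] {
            sum += math.Abs(months[t][c][cell] - months[t-1][c][cell])
            n++
        }
    }
    return sum / float64(n)
}
</code></pre>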
<h2>Seasonal naive: same month last year</h2>
<p>Instead of copying last month, copy the same month from the previous year.
January 2026 gets predicted from January 2025. This should capture seasonal
patterns: the summer spike, the February dip.</p>
<p>The catch? We only have four years of data. The test set months (August–January)
each have at most three prior examples of the same month. That's not a lot of
seasonal training data.</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>MAE</th>
<th>RMSE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.51</td>
<td>3.42</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.41</td>
<td>0.97</td>
</tr>
<tr>
<td>Assault</td>
<td>0.24</td>
<td>0.68</td>
</tr>
<tr>
<td>Robbery</td>
<td>0.05</td>
<td>0.17</td>
</tr>
<tr>
<td>Sexual</td>
<td>0.04</td>
<td>0.13</td>
</tr>
<tr>
<td>Harm</td>
<td>0.01</td>
<td>0.04</td>
</tr>
</tbody>
</table>
<p>Slightly worse than persistence across the board. That surprised me initially.
Shouldn't capturing seasonality help?</p>
<p>The issue is that the 2023-to-2025 decline we spotted in the EDA bites hard
here. If you predict January 2026 from January 2025, you're using data from a
period when crime was higher. The seasonal pattern is real, but the
year-over-year trend works against it. With more years of data, seasonal naive
would likely pull ahead.</p>
<h2>Historical average: the mean of all training months</h2>
<p>For each cell and crime type, take the average across all 36 training months.
This smooths out month-to-month noise and gives you a &quot;typical&quot; value for each
location.</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>MAE</th>
<th>RMSE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.28</td>
<td>2.95</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.35</td>
<td>0.84</td>
</tr>
<tr>
<td>Assault</td>
<td>0.20</td>
<td>0.58</td>
</tr>
<tr>
<td>Robbery</td>
<td>0.04</td>
<td>0.14</td>
</tr>
<tr>
<td>Sexual</td>
<td>0.03</td>
<td>0.11</td>
</tr>
<tr>
<td>Harm</td>
<td>0.01</td>
<td>0.04</td>
</tr>
</tbody>
</table>
<p>The best baseline. By averaging over three years, it smooths out the
month-to-month noise and the year-over-year trend simultaneously. It won't
capture seasonal peaks or sudden changes, but for the &quot;typical month&quot; prediction
it's solid.</p>
<h2>Why MAPE breaks down</h2>
<p>You might wonder why I'm not reporting MAPE (Mean Absolute Percentage Error).
It's the standard metric in a lot of forecasting work. The reason: sparse data.</p>
<p>MAPE divides the error by the actual value. When the actual value is zero (which
it is for 91.7% of our tensor) you get division by zero. And for cells with small
counts the percentages whipsaw: against an actual of 1, a prediction of 0 is a
100% error and a prediction of 2 is <em>also</em> a 100% error, while the same
one-count miss on a busy cell barely registers. The metric becomes wildly
unstable.</p>
<p>MAE and RMSE are more honest here. They tell you the absolute magnitude of your
errors in actual crime counts, which is what we care about. A miss of 3
victimisations means the same thing whether the cell usually has 5 or 50.</p>
<h2>The bar to clear</h2>
<p>Here's the scoreboard going forward. Any deep learning model needs to beat the
historical average baseline to justify its existence:</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>Historical Avg MAE</th>
<th>Historical Avg RMSE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.28</td>
<td>2.95</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.35</td>
<td>0.84</td>
</tr>
<tr>
<td>Assault</td>
<td>0.20</td>
<td>0.58</td>
</tr>
<tr>
<td>All types</td>
<td>0.39</td>
<td>0.95</td>
</tr>
</tbody>
</table>
<p>Theft is the easiest to beat because there's the most signal: high counts, clear
spatial patterns, strong seasonality. Robbery, sexual offences, and harm are
essentially noise at this resolution. The models will probably predict near-zero
for those types and be mostly correct.</p>
<p>The real test will be the middle ground. Can ConvLSTM or ST-ResNet predict the
<em>changes</em> in theft and burglary better than a static average? Can they catch the
months where a cell spikes or dips? That's where simple baselines fall flat,
because they don't model dynamics at all.</p>
<p>If the deep learning can't meaningfully beat &quot;just use the average,&quot; then it's
not worth the CPU cycles. Or in my case, the many hours of a Ryzen 5 grinding
away.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Claude Code Running Claude Code in 4-Second Disposable VMs</title>
      <link>https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/</guid>
      <description>
        Anthropic has an internal platform for running Claude Code in full-permission isolation. I built my own version with Firecracker MicroVMs, Go, and a $400 Ryzen box.
      </description>
      <content:encoded>
        <![CDATA[<p>Running Claude Code with full permissions inside a Docker container is a
terrible idea. I did it anyway for about a week, then built something better.</p>
<p>Anthropic has an internal platform — people have been calling it
<a href="https://ai.gopubby.com/anthropics-antspace-the-secret-paas-nobody-was-supposed-to-find-a79ce1e02151">Antspace</a>
since it got reverse-engineered from the Claude Code source — that runs AI
coding tasks in isolated environments. It's part of a vertical stack they're
building internally: intent goes in, code comes out, and the agent never touches
the host machine.</p>
<p>I wanted that. Not the whole platform-as-a-service thing, just the core idea:
give Claude Code a prompt, let it run with zero permission restrictions, stream
the output back, grab any files it created, and destroy everything when it's
done. On a single Linux box sitting in my office.</p>
<p>The result is about 3,200 lines of Go and 860 lines of TypeScript. It boots a
fresh Linux VM in ~4 seconds, runs Claude Code inside it, and tears it down when
the task finishes. Three ways to use it: a CLI, a REST API with a web dashboard,
and an MCP server so Claude Code on other machines can delegate tasks to it.</p>
<p>This first post is about why I built it this way.
<a href="https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/">Part 2</a> and
<a href="https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/">Part 3</a> get
into the actual implementation.</p>
<h2>The container problem</h2>
<p><code>CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS=true</code> — that's the environment variable
that tells Claude Code to stop asking before it runs shell commands or writes
files. It just does whatever it thinks it needs to. For autonomous tasks, you
need this. Claude can't ask for confirmation when there's nobody watching.</p>
<p>The question is where you let it run.</p>
<p>Docker is the obvious first thought. Fast startup, everyone knows it, easy to
orchestrate. But containers share the host kernel. Every container on the
machine issues syscalls to the same Linux kernel, and a kernel vulnerability is
a vulnerability in every container on the host.
<a href="https://huggingface.co/blog/agentbox-master/firecracker-vs-docker-tech-boundary">The isolation boundary is the container runtime</a>,
not hardware — and that surface area is big.</p>
<p>For most workloads this is fine. Running a web server in Docker? No worries. But
running an AI agent that can execute arbitrary shell commands with root-level
permissions? That's a different threat model. A container escape gives you the
host. And you've just given the thing inside the container permission to try
anything.</p>
<p>Anthropic's own approach to
<a href="https://www.anthropic.com/engineering/claude-code-sandboxing">sandboxing Claude Code</a>
uses OS-level primitives — bubblewrap on Linux, Seatbelt on macOS — for
filesystem and network isolation. They report an 84% reduction in permission
prompts internally. That's smart for the normal use case where Claude is helping
you write code in your own project. But I wanted something more aggressive: full
isolation where even a kernel exploit can't reach the host.</p>
<h2>Why Firecracker</h2>
<p><a href="https://firecracker-microvm.github.io/">Firecracker</a> is what AWS built for
Lambda and Fargate. Each MicroVM is a real KVM-backed virtual machine with its
own guest kernel, its own memory space, and hardware-enforced isolation via
Intel VT-x or AMD-V. The attack surface is the KVM hypervisor — which the kernel
team at AWS has spent years minimising.</p>
<p>The trade-off is boot time. Containers start in under a second. Firecracker VMs
take about 4 seconds on my hardware once you account for the guest kernel boot,
systemd init, and the agent process starting up. For tasks that typically run
20-120 seconds, 4 seconds of overhead is nothing.</p>
<p>Each VM also copies a 4GB rootfs image. Sparse copies make this fast (&lt;1
second), but it does use disk. On a machine with a 1TB NVMe, I'm not losing
sleep over it.</p>
<p>The hardware is an AMD Ryzen 5 5600GT with 30GB of RAM. Nothing exotic. About
$400 worth of parts sitting under my desk. Each VM gets 2GB of RAM by default,
so I can run roughly 12-13 VMs concurrently before the host runs out of memory.</p>
<h2>Talking to a VM without a network</h2>
<p>This was my favourite bit to figure out.</p>
<p>The obvious way to communicate with a process inside a VM is SSH. Set up keys,
open a port, connect over the network. But SSH means key management, an open
network port inside the VM, and another service to configure. If the guest's
network breaks during a task, you've lost your control channel.</p>
<p><a href="https://github.com/firecracker-microvm/firecracker/blob/main/docs/vsock.md">vsock</a>
(AF_VSOCK, address family 40) is a kernel-level host-guest communication
channel. It doesn't touch the network stack. No IP addresses, no ports, no keys.
Firecracker exposes the guest's vsock as a Unix domain socket on the host side —
you connect to the socket, send <code>CONNECT &lt;port&gt;\n</code>, and you're talking directly
to a process inside the VM.</p>
<pre><code class="language-go">func Connect(jailID string, port int) (net.Conn, error) {
    socketPath := fmt.Sprintf(&quot;/srv/jailer/firecracker/%s/root/vsock.sock&quot;, jailID)
    conn, _ := net.Dial(&quot;unix&quot;, socketPath)
    conn.Write([]byte(fmt.Sprintf(&quot;CONNECT %d\n&quot;, port)))
    // Read &quot;OK &lt;port&gt;&quot; response
    return conn, nil
}
</code></pre>
<p>On the guest side, Go's standard library doesn't support AF_VSOCK — address
family 40 doesn't exist in the <code>net</code> package. So the guest agent uses raw
syscalls:</p>
<pre><code class="language-go">fd, _ := syscall.Socket(40, syscall.SOCK_STREAM, 0)  // AF_VSOCK = 40
// Manually construct struct sockaddr_vm (16 bytes)
sa := [16]byte{}
*(*uint16)(unsafe.Pointer(&amp;sa[0])) = 40          // family
*(*uint32)(unsafe.Pointer(&amp;sa[4])) = uint32(port) // port (9001)
*(*uint32)(unsafe.Pointer(&amp;sa[8])) = 0xFFFFFFFF   // VMADDR_CID_ANY
syscall.RawSyscall(syscall.SYS_BIND, uintptr(fd), uintptr(unsafe.Pointer(&amp;sa[0])), 16)
syscall.RawSyscall(syscall.SYS_LISTEN, uintptr(fd), 5, 0)
</code></pre>
<p>Yeah, that's <code>unsafe.Pointer</code> and manual struct layout. Not the prettiest Go
you'll ever write. But it works, it's fast, and the whole vsock layer is about
160 lines shared between both binaries.</p>
<p>The wire protocol is dead simple — length-prefixed JSON frames:</p>
<pre><code class="language-go">func WriteFrame(w io.Writer, v interface{}) error {
    data, _ := json.Marshal(v)
    binary.Write(w, binary.BigEndian, uint32(len(data)))
    w.Write(data)
    return nil
}
</code></pre>
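<p>The read side is the mirror image: pull the length prefix, then exactly that
many bytes. Something like:</p>
<pre><code class="language-go">func ReadFrame(r io.Reader, v interface{}) error {
    var n uint32
    if err := binary.Read(r, binary.BigEndian, &amp;n); err != nil {
        return err
    }
    data := make([]byte, n)
    if _, err := io.ReadFull(r, data); err != nil {
        return err
    }
    return json.Unmarshal(data, v)
}
</code></pre>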
<p>Each operation (ping, exec, write files, read file) opens a new connection,
sends one request, reads the response, and closes. Connection-per-request. Not
fancy, but vsock connections are local and effectively instant, so there's no
reason to complicate things with multiplexing.</p>
<h2>The shape of the thing</h2>
<p>The whole system is two Go binaries — the orchestrator (runs on the host) and
the agent (runs inside each VM).</p>
<pre><code class="language-mermaid">graph TD
    subgraph &quot;Host — orchestrator binary&quot;
        API[&quot;REST API + WebSocket :8080&quot;]
        MCP[&quot;MCP Server :8081&quot;]
        VM[&quot;VM Manager&quot;]
        NET[&quot;TAP + iptables&quot;]
        TASK[&quot;Task Runner&quot;]
        STREAM[&quot;Pub/Sub Hub&quot;]
        VSOCK[&quot;vsock Client&quot;]
    end

    subgraph &quot;Guest — agent binary&quot;
        AGENT[&quot;Guest Agent vsock:9001&quot;]
        CLAUDE[&quot;Claude Code&quot;]
    end

    API --&gt; TASK
    MCP --&gt; TASK
    TASK --&gt; VM
    VM --&gt; NET
    TASK --&gt; VSOCK
    VSOCK --&gt; AGENT
    AGENT --&gt; CLAUDE
    TASK --&gt; STREAM
    STREAM --&gt; API
</code></pre>
<p>The orchestrator is a single 14MB binary with the React dashboard embedded via
<code>//go:embed</code>. Copy it to a server, run it with sudo, done. Seven Go dependencies
total — chi for routing, netlink for TAP devices, go-iptables for firewall
rules, <a href="https://github.com/mark3labs/mcp-go">mcp-go</a> for the MCP protocol, and a
few others.</p>
<p>The agent is a 2.5MB static binary compiled with <code>CGO_ENABLED=0</code>. It ships
inside the VM's rootfs and starts via systemd on boot. Within about a second of
the VM coming up, the agent is listening on vsock port 9001 and ready to accept
commands.</p>
<p>They share exactly one file — <code>internal/agent/protocol.go</code> — which defines the
wire protocol types and framing functions. Everything else is independent.</p>
<h2>What a task looks like</h2>
<p>You give it a prompt. It does the rest.</p>
<ol>
<li>Generate a task ID and VM name</li>
<li>Copy the base rootfs image (sparse, &lt;1 second)</li>
<li>Inject network config into the rootfs</li>
<li>Create a TAP device and iptables rules for internet access</li>
<li>Launch Firecracker via the jailer</li>
<li>Poll vsock until the agent responds (~1 second)</li>
<li>Inject credentials and files via vsock</li>
<li>Run Claude Code with streaming output</li>
<li>Collect any files Claude created</li>
<li>Destroy the VM</li>
</ol>
<p>From the CLI it looks like this:</p>
<pre><code class="language-bash">sudo ./bin/orchestrator task run \
    --prompt &quot;Write a Python script that generates Fibonacci numbers&quot; \
    --ram 2048 \
    --vcpus 2 \
    --timeout 120
</code></pre>
<p>Output streams to your terminal in real time. When it's done:</p>
<pre><code>=== Task Complete ===
ID:     a3bfca80
Status: completed
Exit:   0
Cost:   $0.0582
Files:  [fibonacci.py]
</code></pre>
<p>The VM is gone. The rootfs is deleted. The TAP device and iptables rules are
cleaned up. All that's left is the result files in
<code>/opt/firecracker/results/a3bfca80/</code>.</p>
<p>Or you use the MCP server, and Claude Code on your laptop delegates the task to
a VM on the box under your desk. Claude spawning Claude. That bit is properly
cool, and I'll get into it in Part 3.</p>
<h2>Why Go</h2>
<p>Quick aside on this because people always ask.</p>
<p>Go produces static binaries. The agent needs to be a single file with zero
dependencies that runs inside a minimal Debian guest — <code>CGO_ENABLED=0</code> makes
this trivial. The orchestrator needs to manage concurrent VMs, and goroutines
are a natural fit for that. Syscall support is first-class, which matters when
you're doing raw vsock operations. And it compiles in about 2 seconds, which is
nice when you're iterating.</p>
<pre><code class="language-makefile">build-agent:
	CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o bin/agent -ldflags=&quot;-s -w&quot; ./cmd/agent
</code></pre>
<p>That <code>-ldflags=&quot;-s -w&quot;</code> strips debug info and DWARF tables, dropping the agent
binary from ~3.5MB to ~2.5MB. Every byte counts when you're baking it into a
rootfs that gets copied for every VM.</p>
<p><a href="https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/">Part 2</a> gets into
the actual build — the rootfs, the networking (including a fun bug with Ubuntu's
UFW that had me staring at iptables rules for an embarrassing amount of time),
the guest agent, and the streaming pipeline that gets Claude's output from
inside a VM to your browser.</p>
]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What if your browser built the UI for you?</title>
      <link>https://jonnonz.com/posts/what-if-your-browser-built-the-ui-for-you/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/what-if-your-browser-built-the-ui-for-you/</guid>
      <description>
        We're still shipping hand-crafted frontends while AI can generate entire interfaces. What if the browser itself generated the UI from an API manifest and your preferences?
      </description>
      <content:encoded>
        <![CDATA[<p>We're at a genuinely weird inflection point in frontend development. AI can
generate entire interfaces now. LLMs can reason about data and layout. And yet —
most SaaS products still ship hand-crafted React apps, each building its own UI,
its own accessibility layer, its own theme system, its own responsive
breakpoints. Not every service, but the vast majority.</p>
<p>That's a lot of duplicated effort for what's essentially the same job — showing
a human some data and letting them do stuff with it.</p>
<p>I've been thinking about this a lot lately, and I built a proof of concept to
test an idea: what if the browser itself generated the UI?</p>
<h2>Where we are right now</h2>
<p>The industry is circling this idea from multiple angles, but nobody's quite
landed on it yet.</p>
<p><a href="https://www.apollographql.com/docs/graphos/schema-design/guides/sdui/basics">Server-driven UI</a>
has been around for a while — Airbnb and others pioneered it for mobile, where
app store review cycles make shipping UI changes painful. The server sends down
a JSON tree describing what to render, and the client just follows instructions.
It's clever, but the server is still calling the shots.</p>
<p>Google recently shipped
<a href="https://developers.google.com/natively-adaptive-interfaces">Natively Adaptive Interfaces</a>
— a framework that uses AI agents to make accessibility a default rather than an
afterthought. Really cool idea, and the right instinct. But it's still operating
within a single app's boundaries. Your accessibility preferences don't carry
between Google's products and, say, your project management tool.</p>
<p>Then there's the
<a href="https://www.copilotkit.ai/blog/the-developer-s-guide-to-generative-ui-in-2026">generative UI</a>
wave — CopilotKit, Vercel's AI SDK, and others building frameworks where LLMs
generate components on the fly. These are powerful developer tools, but they're
still developer tools. The generation happens at build time or on the server.
The service is still in control.</p>
<p>See the pattern? Every approach keeps the power on the service side.</p>
<h2>Flip it</h2>
<p>Here's the idea behind the
<a href="https://github.com/jonnonz1/adaptive-browser">adaptive browser</a>: what if the
generation happened on <em>your</em> side?</p>
<p>Instead of a service shipping you a finished frontend, it publishes a manifest —
a structured description of what it can do. Its capabilities, endpoints, data
shapes, what actions are available. Think of it like an API spec, but semantic.
Not just &quot;here's a GET endpoint&quot; but &quot;here's a list of repositories, they're
sortable by stars and language, you can create, delete, star, or fork them.&quot;</p>
<p>Your browser takes that manifest, calls the actual APIs, gets real data back,
and then generates the UI based on your preferences. Your font size. Your colour
scheme. Your preferred layout (tables vs cards vs kanban). Your accessibility
needs. All applied universally, across every service.</p>
<p>The manifest for something like GitHub looks roughly like this — a service
describes its capabilities and the browser figures out the rest:</p>
<pre><code class="language-yaml">service:
  name: &quot;GitHub&quot;
  domain: &quot;api.github.com&quot;

capabilities:
  - id: &quot;repositories&quot;
    endpoints:
      - path: &quot;/user/repos&quot;
        semantic: &quot;list&quot;
        entity: &quot;repository&quot;
        sortable_fields: [name, updated_at, stargazers_count]
        actions: [create, delete, star, fork]
</code></pre>
<p>The browser takes that, fetches the data, and generates a bespoke interface —
using an LLM to reason about the best way to present it given who you are and
what you're trying to do.</p>
<h2>Why this matters more than it sounds</h2>
<p>When I was building the app store and integrations platforms at Xero, one of the
constant headaches was that every third-party integration had its own UI
patterns. Users had to learn a new interface for every app they connected. If
the browser was generating the UI from a shared set of preferences, that problem
just… goes away.</p>
<p>Accessibility is the big one though. Right now, accessibility is a feature that
gets bolted on — and often badly. When the browser generates the UI,
accessibility isn't a feature. It's the default. Your preferences — high
contrast, keyboard-first navigation, screen reader optimisation, larger text —
apply everywhere. Not because every developer remembered to implement them, but
because they're baked into how the UI gets generated in the first place.</p>
<p>Customisation becomes genuinely personal too. Not &quot;pick from three themes the
developer made&quot; but &quot;this is how I interact with software, full stop.&quot;</p>
<h2>The trade-off is real though</h2>
<p>Frontend complexity drops dramatically, but the complexity doesn't disappear —
it moves behind the API. And honestly, it probably increases.</p>
<p>API design becomes way more important. You can't just throw together some REST
endpoints and call it a day. Your manifest needs to be semantic — describing
what the data means, not just what shape it is. Data contracts between services
matter more. Versioning matters more.</p>
<pre><code class="language-mermaid">graph LR
    A[Service] --&gt;|Publishes manifest + APIs| B[Browser Agent]
    C[User Preferences] --&gt; B
    D[Org Guardrails] --&gt; B
    B --&gt;|Generates| E[Bespoke UI]
</code></pre>
<p>But here's the thing — this trade-off pushes us somewhere genuinely interesting.
If every service needs to describe itself semantically through APIs and
manifests, those APIs become the actual product surface. Not the frontend. The
APIs.</p>
<p>And once APIs are the product surface, sharing context between platforms becomes
the interesting problem. Your project management tool knows what you're working
on. Your email client knows who you're talking to. Your code editor knows what
you're building. Right now, none of these talk to each other in any meaningful
way because they're all locked behind their own UIs. In a manifest-driven world,
that context flows through the APIs — and your browser can stitch it all
together into something coherent.</p>
<h2>Where this is headed (IMHO)</h2>
<p>I reckon we're about 3-5 years from this being mainstream. The pieces are all
there — LLMs that can reason about UI,
<a href="https://www.builder.io/blog/ui-over-apis">standardisation efforts</a> around
sending UI intent over APIs, and a growing expectation from users that software
should adapt to them, not the other way around.</p>
<p>The services that win in this world won't be the ones with the prettiest
hand-crafted UI. They'll be the ones with the best APIs, the richest manifests,
and the most useful data. The frontend becomes a generated output, not a
hand-crafted input.</p>
<p>Organisations will set preference guardrails — &quot;our people can use dark or light
mode, must have destructive action confirmations, these fields are always
visible&quot; — while individuals customise within those bounds. Your browser becomes
your agent, not just a renderer.</p>
<p>I built the <a href="https://github.com/jonnonz1/adaptive-browser">adaptive browser</a> as
a proof of concept to test this thinking — it uses Claude to generate UIs from a
GitHub manifest and user preferences defined in YAML. It's rough, but the
direction feels right.</p>
<p>The frontend isn't dying. But what we think of as &quot;frontend development&quot; is
about to change. The interesting work moves to API design, semantic data
contracts, and building browsers smart enough to be genuine user agents.</p>
]]>
      </content:encoded>
      <pubDate>Sun, 05 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Stealing NanoClaw Patterns for Web Apps and SaaS</title>
      <link>https://jonnonz.com/posts/stealing-nanoclaw-patterns-for-webapps-and-saas/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/stealing-nanoclaw-patterns-for-webapps-and-saas/</guid>
      <description>
        Four architectural patterns from NanoClaw's tiny codebase that translate directly to production web apps and SaaS — credential sidecars, infrastructure-level isolation, polling loops, and Postgres as your message queue.
      </description>
      <content:encoded>
        <![CDATA[<p>In <a href="https://jonnonz.com/posts/nanoclaw-architecture-masterclass-in-doing-less/">Part 1</a> I pulled
apart NanoClaw's codebase and found six patterns that make an 8,000-line AI
assistant surprisingly robust. But NanoClaw is a single-user tool running on
your laptop. Surely these patterns fall apart once you've got real tenants, real
money, and real scale?</p>
<p>Nah. Four of them translate almost directly — and the ones that don't still
teach you something useful.</p>
<h2>The credential sidecar</h2>
<p>NanoClaw's credential proxy — where containers get a placeholder API key and a
localhost proxy injects the real one — sounds like a neat trick for a personal
tool. But this exact pattern is showing up in production Kubernetes deployments
right now.</p>
<p>The broader version is a
<a href="https://www.apistronghold.com/blog/phantom-token-pattern-production-ai-agents">sidecar proxy that handles credential injection</a>
for any service that needs API keys or tokens. Your application code never
touches the real secret. A sidecar container intercepts outbound requests, swaps
in credentials, and forwards them upstream.</p>
<p>At Vend we managed a bunch of third-party integrations — payment gateways,
shipping providers, accounting platforms. Each one had API keys that needed to
live somewhere. We went through the typical evolution: environment variables,
then a secrets manager, then a service that distributed keys at startup. Every
step was an improvement, but the keys still ended up <em>in the application's
memory</em>.</p>
<p>The sidecar approach skips that entirely. Your app sends requests with a
placeholder. The proxy — which is a separate process with its own security
boundary — does the credential swap. Even if your application gets compromised,
the real keys aren't there to steal.</p>
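<p>The mechanics fit in a few lines. A toy sketch of the swap, where the
placeholder convention and the names are mine rather than any particular
product's API:</p>
<pre><code class="language-go">import (
    &quot;net/http&quot;
    &quot;net/http/httputil&quot;
    &quot;net/url&quot;
)

// newCredentialSidecar returns a reverse proxy that swaps a placeholder
// bearer token for the real key before forwarding upstream. The app
// process never holds realKey in its own memory.
func newCredentialSidecar(upstream *url.URL, realKey string) http.Handler {
    proxy := httputil.NewSingleHostReverseProxy(upstream)
    orig := proxy.Director
    proxy.Director = func(r *http.Request) {
        orig(r)
        if r.Header.Get(&quot;Authorization&quot;) == &quot;Bearer PLACEHOLDER&quot; {
            r.Header.Set(&quot;Authorization&quot;, &quot;Bearer &quot;+realKey)
        }
    }
    return proxy
}
</code></pre>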
<p>If you're running any kind of multi-service architecture where services call
external APIs, this pattern is worth adopting. Your API gateway might already be
doing a version of it — the insight is making it explicit and consistent across
all outbound credential flows.</p>
<h2>Isolation as the security model</h2>
<p>This is the one I keep thinking about.</p>
<p>NanoClaw uses filesystem mounts to control what each container can see. No
application-level permission checks — the security model <em>is</em> the infrastructure
topology. If a container can't see a file, it can't access it. No bugs, no
missed checks, no escalation vulnerabilities.</p>
<p>In SaaS, we spend enormous amounts of time writing authorisation logic. Role
checks, permission middleware, tenant-scoping queries. And it works — until
someone forgets a WHERE clause.</p>
<p>AWS's own
<a href="https://docs.aws.amazon.com/whitepapers/latest/saas-architecture-fundamentals/tenant-isolation.html">SaaS tenant isolation guidance</a>
makes this point explicitly: authentication and authorisation are not the same
as isolation. The fact that a user logged in doesn't mean your system has
achieved tenant isolation. A
<a href="https://workos.com/blog/tenant-isolation-in-multi-tenant-systems">single missed tenant filter</a>
on a database query and you've got a cross-tenant data leak.</p>
<p>The NanoClaw-inspired approach is to push isolation down the stack. Separate
database schemas per tenant. Separate containers. Separate cloud accounts for
your highest-value customers. Not instead of application-level checks — but as a
backstop that catches the bugs your application-level checks inevitably have.</p>
<p>At Xero, working across the integrations and app store teams, I saw first-hand
how multi-tenant data isolation gets complicated fast. The teams that had the
fewest incidents were the ones where the infrastructure itself enforced
boundaries, not just the application code.</p>
<p>You don't need to go full NanoClaw and give every tenant their own container.
But you should be asking: if my application-level authorisation has a bug,
what's my second line of defence? If the answer is &quot;nothing&quot; — that's the
pattern to steal.</p>
<h2>Polling when it's the right call</h2>
<p>NanoClaw polls SQLite every 2 seconds. No WebSockets, no event bus, no pub/sub.
Just a loop that checks for new stuff.</p>
<p>The instinct for most teams is to treat polling as a temporary hack you'll
replace with &quot;proper&quot; event-driven architecture later. Yan Cui wrote a
<a href="https://theburningmonk.com/2025/05/understanding-push-vs-poll-in-event-driven-architectures/">solid breakdown of push vs poll in event-driven systems</a>
and the takeaway isn't that one is always better — it's that the right choice
depends on your throughput, ordering, and failure-handling requirements.</p>
<p>For a lot of internal systems, polling is the correct permanent answer.</p>
<p>Admin dashboards. Background job status. Internal reporting. Webhook retry
queues. Deployment pipelines. These systems don't need sub-second latency. They
need reliability and simplicity. A polling loop against your database gives you
both, with zero infrastructure overhead.</p>
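<p>The whole &quot;event system&quot; for those can be a ticker and a query. A sketch,
with an illustrative schema:</p>
<pre><code class="language-go">import (
    &quot;database/sql&quot;
    &quot;log&quot;
    &quot;time&quot;
)

// pollJobs checks for pending work every 2 seconds. The jobs table and
// handle func are stand-ins; handle should mark the job done so it
// isn't picked up again.
func pollJobs(db *sql.DB, handle func(id int64, payload string)) {
    ticker := time.NewTicker(2 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        var id int64
        var payload string
        err := db.QueryRow(
            `SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1`,
        ).Scan(&amp;id, &amp;payload)
        if err == sql.ErrNoRows {
            continue // nothing to do; check again next tick
        }
        if err != nil {
            log.Printf(&quot;poll failed: %v&quot;, err)
            continue
        }
        handle(id, payload)
    }
}
</code></pre>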
<p>At Xero we shipped multiple times per day, and some of the internal tooling that
supported continuous deployment was surprisingly simple under the hood. Cron
jobs. Polling loops. SQL queries on a timer. Not because anyone was cutting
corners — because the requirements genuinely didn't need anything more
sophisticated.</p>
<p>The trap is reaching for Kafka or RabbitMQ because you think you'll need it
eventually.
<a href="https://synmek.com/saas-architecture-for-startups-2025-guide">70% of startups fail due to premature scaling</a>.
The infrastructure you don't deploy is the infrastructure that never breaks.</p>
<h2>Your database is your message queue</h2>
<p>NanoClaw uses JSON files on the filesystem for inter-process communication.
Atomic rename, directory-based identity, simple polling to pick up new messages.
No Redis. No message broker.</p>
<p>That specific approach won't scale to a multi-tenant SaaS — but the <em>instinct</em>
behind it absolutely does. The instinct is: use the infrastructure you already
have.</p>
<p>For most web apps, that means Postgres. The
<a href="https://dagster.io/blog/skip-kafka-use-postgres-message-queue">Postgres-as-queue movement</a>
has been gaining serious traction, and tools like
<a href="https://github.com/pgmq/pgmq">PGMQ</a> make it practical. You get ACID guarantees,
you don't need to manage another service, and your queue is backed by the same
database you're already monitoring and backing up.</p>
<p>NanoClaw's
<a href="https://dev.to/constanta/crash-safe-json-at-scale-atomic-writes-recovery-without-a-db-3aic">atomic write pattern</a>
— write to a temp file, rename into place — maps to <code>INSERT INTO queue_table</code>
followed by a <code>SELECT ... FOR UPDATE SKIP LOCKED</code> consumer. Same principle: the
message either exists completely or doesn't exist at all. No partial state.</p>
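<p>The consumer side of that Postgres queue is a handful of lines. A sketch,
with an assumed table shape:</p>
<pre><code class="language-go">import &quot;database/sql&quot;

// claimMessage atomically claims one unprocessed row. SKIP LOCKED means
// concurrent consumers never block on, or double-claim, the same row.
func claimMessage(tx *sql.Tx) (id int64, body string, err error) {
    err = tx.QueryRow(`
        SELECT id, body FROM queue_table
        WHERE processed_at IS NULL
        ORDER BY id
        FOR UPDATE SKIP LOCKED
        LIMIT 1`).Scan(&amp;id, &amp;body)
    return
}
</code></pre>
<p>Mark the row processed and commit in the same transaction, and you keep the
same property as the atomic rename: a message is either fully handled or
untouched.</p>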
<p>The &quot;just add Redis&quot; reflex is strong in our industry. Sometimes it's the right
call. But I've seen plenty of teams introduce a message broker for a workload
that Postgres could've handled without breaking a sweat — and then spend the
next six months debugging consumer lag and dead letter queues.</p>
<h2>The real pattern</h2>
<p>The specific techniques matter less than the discipline behind them.</p>
<p>NanoClaw's developer looked at a 500,000-line framework and asked: what are my
<em>actual</em> constraints? Single user. Local machine. One AI provider. And then
built exactly the architecture those constraints required — nothing more.</p>
<p>Most teams don't do this. They build for imaginary scale, imaginary
multi-tenancy requirements, imaginary traffic spikes. They reach for Kubernetes
before they've outgrown a single server. They deploy event buses before they've
outgrown a polling loop. They write complex authorisation middleware before
they've considered whether infrastructure isolation would eliminate the problem
entirely.</p>
<p>The pattern worth stealing isn't the credential proxy or the polling loop or
Postgres-as-queue. It's the habit of understanding your constraints first and
letting them delete complexity from your architecture.</p>
<p>Hardest pattern to adopt, though. Because it means admitting you're smaller than
you think.</p>
]]>
      </content:encoded>
      <pubDate>Sun, 05 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What the Data Actually Shows</title>
      <link>https://jonnonz.com/posts/what-the-data-actually-shows/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/what-the-data-actually-shows/</guid>
      <description>
        Before throwing deep learning at Auckland crime data, you need to actually look at it. Seasonal patterns, spatial hotspots, and the sparsity problem.
      </description>
      <content:encoded>
        <![CDATA[<p>You can't just shove a tensor into a neural network and hope for the best.</p>
<p>I mean, you <em>can</em>. People do it all the time. But you'll have no idea whether
your model is learning something real or just memorising noise. Before we get
anywhere near ConvLSTM or ST-ResNet, we need to properly understand what
patterns actually exist in this data, and whether they're strong enough for a
model to learn.</p>
<p>This is the part that most ML blog posts skip. It's also the part that saves you
weeks of debugging later.</p>
<h2>When does crime happen?</h2>
<p>The monthly pattern across Auckland is surprisingly consistent year to year.
Crime peaks in late spring and early summer (October through January) and dips
in late summer through winter. February is reliably the quietest month at around
7,000–8,000 victimisations, while November and December regularly push past
9,000.</p>
<p>This tracks with what
<a href="https://link.springer.com/article/10.1007/s43762-023-00094-x">criminology research has found globally</a>:
warmer months mean more people out and about, more opportunities for property
crime, and more interpersonal conflict. It's a well-documented pattern called
seasonal variation in crime, and it shows up clearly in the NZ data.</p>
<p>The seasonal signal isn't uniform across crime types though. Theft drives most
of the swing. It surges in summer and drops in winter, accounting for nearly all
the monthly variance. Assault has its own rhythm. It peaks around the holiday
period (December–January) and shows a secondary bump in winter weekends,
probably pub-related. Burglary is flatter, with a slight winter uptick when
houses are dark earlier.</p>
<p>2023 was the peak year across the board, with a noticeable decline through 2024
and into early 2025. Whether that's a real trend or a reporting artefact, I
genuinely don't know. But it means the model's training data includes both an
upswing and a downswing, which is useful. It can't just learn &quot;crime always goes
up.&quot;</p>
<h2>Where does crime cluster?</h2>
<p>Crime in Auckland is not randomly distributed. That's obvious to anyone who
lives here, but it's worth quantifying.</p>
<p>Running a
<a href="https://www.publichealth.columbia.edu/research/population-health-methods/hot-spot-spatial-analysis">Moran's I test</a>
on our 500m grid confirms strong positive spatial autocorrelation. Cells with
high crime counts are surrounded by other high-crime cells. The Moran's I
statistic comes out at 0.43 (p &lt; 0.001), which means the clustering is highly
significant. Crime begets more crime in adjacent cells.</p>
<p>The hotspots are exactly where you'd expect. The CBD dominates: Queen Street,
Karangahape Road, and the surrounding blocks consistently light up across all
crime types. South Auckland corridors (Manukau, Ōtāhuhu, Papatoetoe) form a
second cluster, particularly for assault and robbery. Henderson in the west
shows up for burglary.</p>
<p>What's less obvious is how stable these hotspots are over time. The top 5% of
cells (about 227 cells) account for over 60% of all recorded crime across the
entire four-year period. These aren't random spikes. They're persistent. A cell
that's hot in 2022 is almost certainly still hot in 2025. That temporal
persistence is exactly what makes this data amenable to prediction. If hotspots
moved randomly month to month, no model could learn them.</p>
<h2>Crime type correlations</h2>
<p>The six channels in our tensor don't behave independently. Theft and burglary
show moderate positive correlation (r ≈ 0.52). Cells with lots of theft tend to
have more burglary too, which makes sense given similar opportunity structures
(commercial areas, transport hubs).</p>
<p>Assault correlates weakly with everything else (r ≈ 0.15–0.25). It has its own
spatial logic (nightlife areas, specific residential pockets) that doesn't align
neatly with property crime.</p>
<p>Robbery, sexual offences, and harm are so sparse at the 500m monthly resolution
that correlation analysis is basically meaningless. Most cells have zero counts
for these types in any given month. That sparsity is going to be a real headache
for the models.</p>
<h2>The sparsity problem, again</h2>
<p>We flagged this in Part 3: 91.7% of the tensor is zeros. But the EDA makes the
problem even clearer.</p>
<p>The distribution of non-zero cell values is heavily right-skewed. The median
non-zero value is 1. One crime, in one cell, in one month. The mean is about
2.3. A handful of cells (the CBD, Manukau) hit 30–50+ in peak months for theft.
The model needs to learn the difference between &quot;always zero&quot; cells,
&quot;occasionally one&quot; cells, and &quot;consistently busy&quot; cells.</p>
<p>If you plot the crime count distribution across non-zero cells, it follows
something close to a power law. A tiny number of cells carry an outsized share
of the signal. This is textbook
<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC7319308/">spatial concentration of crime</a>,
documented in basically every city ever studied.</p>
<p>For modelling, this means two things. First, aggregate metrics like RMSE will be
dominated by how well the model predicts the high-count cells. Second,
predicting &quot;zero&quot; for a sparse cell is almost always correct but completely
uninformative. We'll need to think carefully about what &quot;accuracy&quot; actually
means when we get to evaluation.</p>
<h2>What this means for the models</h2>
<p>The EDA tells us a few things that should directly shape how we build and
evaluate the models:</p>
<p>The seasonal signal is strong and consistent. A model that can't capture monthly
seasonality is worse than useless. It's worse than a calendar.</p>
<p>Spatial structure is real and persistent. Hotspots don't move much. A model that
learns static spatial patterns will get a lot of the way there, even without
understanding temporal dynamics.</p>
<p>We already know the CBD will have lots of theft next month. That's not what
we're trying to predict. The real value is in the margins: the cells that go
from quiet to active, or the months where a normally stable area spikes. That's
where deep learning might actually add something over simple baselines.</p>
<p>Speaking of which, we need baselines. Otherwise we won't know if ConvLSTM is
actually clever or just expensive. That's next.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>How I Built an SMS Gateway with a $20 Android Phone</title>
      <link>https://jonnonz.com/posts/built-an-sms-gateway-with-a-20-dollar-android-phone/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/built-an-sms-gateway-with-a-20-dollar-android-phone/</guid>
      <description>
        Turn any Android phone into a programmable SMS gateway for your SaaS — no per-message fees, no carrier contracts, no vendor lock-in.
      </description>
      <content:encoded>
        <![CDATA[<p>Twilio charges around $0.05–0.06 per SMS round-trip. Doesn't sound like much
until you're building an MVP that sends reminders, confirmations, and
notifications — suddenly you're looking at $50/month for a thousand messages.
For an app that's not making money yet, that's a dumb tax.</p>
<p>Here's what I did instead: grabbed a cheap Android phone, installed an
open-source app called
<a href="https://github.com/capcom6/android-sms-gateway">SMS Gateway for Android</a>, and
turned it into a full SMS gateway with a REST API. My SMS costs dropped to
whatever my mobile plan charges — which on plenty of prepaid plans is zero.
Unlimited texts.</p>
<p>This post walks through exactly how to wire it into a Next.js app, from first
install to receiving webhooks. The whole thing took an afternoon.</p>
<hr>
<h2>What You're Building</h2>
<p>By the end of this you'll have:</p>
<ul>
<li>An Android phone acting as your SMS gateway</li>
<li>A webhook endpoint receiving inbound SMS in real-time</li>
<li>Outbound SMS sent via a simple REST API call</li>
<li>A provider abstraction so you can swap between SMS Gateway, Twilio, or console
logging</li>
</ul>
<h2>Prerequisites</h2>
<ul>
<li>An Android phone (5.0+) with a SIM card</li>
<li>A Next.js app (I'm using 15 with App Router, but any backend works)</li>
<li>Node.js 18+</li>
<li>ngrok for testing with cloud mode</li>
</ul>
<hr>
<h2>Install SMS Gateway on Android</h2>
<ol>
<li>
<p>Install <strong>SMS Gateway for Android</strong> from the
<a href="https://play.google.com/store/apps/details?id=me.capcom.smsgateway">Google Play Store</a>
or grab the APK from
<a href="https://github.com/capcom6/android-sms-gateway/releases">GitHub Releases</a></p>
</li>
<li>
<p>Open the app and <strong>grant SMS permissions</strong> when prompted</p>
</li>
<li>
<p>You'll see the main screen with toggles for Local Server and Cloud Server:</p>
</li>
</ol>
<p><img src="https://jonnonz.com/img/posts/sms-gateway/screenshot.png" alt="SMS Gateway main screen"></p>
<p>The app supports two modes — local and cloud. Both work well, and I'll cover
each.</p>
<hr>
<h2>Local Server Mode</h2>
<p>Local mode runs an HTTP server directly on the phone. Your backend talks to it
over your local network. No cloud dependency, no third-party servers — the
simplest setup.</p>
<h3>Configure It</h3>
<p><img src="https://jonnonz.com/img/posts/sms-gateway/local-server.png" alt="Local server settings"> <em>Local server
configuration</em></p>
<ol>
<li>Toggle <strong>&quot;Local Server&quot;</strong> on</li>
<li>Go to <strong>Settings &gt; Local Server</strong> to configure:
<ul>
<li><strong>Port:</strong> 1024–65535 (default <code>8080</code>)</li>
<li><strong>Username:</strong> minimum 3 characters</li>
<li><strong>Password:</strong> minimum 8 characters</li>
</ul>
</li>
<li>Tap <strong>&quot;Offline&quot;</strong> — it changes to <strong>&quot;Online&quot;</strong></li>
<li>Note the <strong>local IP address</strong> displayed (e.g. <code>192.168.1.50</code>)</li>
</ol>
<p>Your phone is now running an HTTP server. Verify it:</p>
<pre><code class="language-bash"># Health check
curl http://192.168.1.50:8080/health

# Swagger docs
open http://192.168.1.50:8080/docs
</code></pre>
<h3>Send Your First SMS</h3>
<pre><code class="language-bash">curl -X POST http://192.168.1.50:8080/message \
  -u &quot;admin:yourpassword&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d '{
    &quot;textMessage&quot;: { &quot;text&quot;: &quot;Hello from my SMS gateway!&quot; },
    &quot;phoneNumbers&quot;: [&quot;+15551234567&quot;]
  }'
</code></pre>
<p>That's it. The phone sends the SMS from its own number, using your mobile plan's
rates.</p>
<h3>Register a Webhook for Inbound SMS</h3>
<p>To receive SMS messages as webhooks:</p>
<pre><code class="language-bash">curl -X POST http://192.168.1.50:8080/webhooks \
  -u &quot;admin:yourpassword&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d '{
    &quot;id&quot;: &quot;my-webhook&quot;,
    &quot;url&quot;: &quot;http://192.168.1.100:4000/api/sms/webhook&quot;,
    &quot;event&quot;: &quot;sms:received&quot;
  }'
</code></pre>
<p>Replace <code>192.168.1.100</code> with your dev machine's local IP. Both devices need to
be on the same WiFi network.</p>
<h3>Local Mode Gotchas</h3>
<ul>
<li><strong>AP isolation:</strong> Many routers — especially mesh networks and office WiFi —
block device-to-device traffic. If you can't reach the phone, check your
router settings for &quot;AP isolation&quot; or &quot;client isolation&quot; and disable it. This
one caught me out for a good 20 minutes.</li>
<li><strong>Battery optimisation:</strong> Android will kill the background server to save
battery. Disable battery optimisation for SMS Gateway in your phone settings.
<a href="https://dontkillmyapp.com/">dontkillmyapp.com</a> has device-specific
instructions — genuinely useful site.</li>
<li><strong>Keep it plugged in:</strong> During development and in production, the phone lives
on a charger. It's not going anywhere.</li>
</ul>
<hr>
<h2>Cloud Server Mode</h2>
<p>Cloud mode is easier to set up and works from anywhere — no local network
required. The phone connects to SMS Gateway's cloud relay (<code>api.sms-gate.app</code>),
and your backend talks to the same cloud API.</p>
<p><img src="https://jonnonz.com/img/posts/sms-gateway/cloud-server.png" alt="Cloud server settings"> <em>Cloud server
configuration</em></p>
<h3>Enable It</h3>
<ol>
<li>Toggle <strong>&quot;Cloud Server&quot;</strong> on in the app</li>
<li>Tap <strong>&quot;Offline&quot;</strong> — it connects and registers automatically</li>
<li>A <strong>username</strong> and <strong>password</strong> are auto-generated (visible in the Cloud
Server section)</li>
<li>Note these credentials — you'll need them for API calls</li>
</ol>
<p>The cloud uses a hybrid push architecture: Firebase Cloud Messaging as the
primary channel, Server-Sent Events as fallback, and 15-minute polling as a last
resort. It's well thought through.</p>
<h3>Send an SMS via Cloud API</h3>
<pre><code class="language-bash">curl -X POST https://api.sms-gate.app/3rdparty/v1/messages \
  -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d '{
    &quot;textMessage&quot;: { &quot;text&quot;: &quot;Hello from the cloud!&quot; },
    &quot;phoneNumbers&quot;: [&quot;+15551234567&quot;]
  }'
</code></pre>
<h3>Register a Webhook (Cloud Mode)</h3>
<p>Your webhook URL <strong>must be HTTPS</strong> in cloud mode. For local development, use
ngrok:</p>
<pre><code class="language-bash"># Start ngrok tunnel to your dev server
ngrok http 4000
# Output: https://abc123.ngrok.app

# Register the webhook
curl -X POST https://api.sms-gate.app/3rdparty/v1/webhooks \
  -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d '{
    &quot;url&quot;: &quot;https://abc123.ngrok.app/api/sms/webhook&quot;,
    &quot;event&quot;: &quot;sms:received&quot;
  }'
</code></pre>
<h3>Manage Webhooks</h3>
<pre><code class="language-bash"># List webhooks
curl -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \
  https://api.sms-gate.app/3rdparty/v1/webhooks

# Delete a webhook
curl -X DELETE -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \
  https://api.sms-gate.app/3rdparty/v1/webhooks/WEBHOOK_ID
</code></pre>
<hr>
<h2>The Code — Next.js Integration</h2>
<p>Here's how I integrated SMS Gateway into a Next.js app with a clean provider
abstraction. The idea is simple — swap providers without touching business
logic.</p>
<h3>Provider Interface</h3>
<pre><code class="language-typescript">// src/lib/sms/provider.ts

export interface InboundSms {
  from: string;
  body: string;
  receivedAt?: Date;
}

export interface SmsProvider {
  send(to: string, body: string): Promise&lt;string&gt;;
  parseWebhook(req: Request): Promise&lt;InboundSms | null&gt;;
  webhookResponse(replyText?: string): Response;
}

export async function getSmsProvider(): Promise&lt;SmsProvider&gt; {
  const provider = process.env.SMS_PROVIDER || &quot;sms-gate&quot;;

  switch (provider) {
    case &quot;sms-gate&quot;: {
      const { SmsGateProvider } = await import(&quot;./sms-gate&quot;);
      return new SmsGateProvider();
    }
    case &quot;console&quot;: {
      const { ConsoleProvider } = await import(&quot;./console&quot;);
      return new ConsoleProvider();
    }
    default:
      throw new Error(`Unknown SMS provider: ${provider}`);
  }
}
</code></pre>
<h3>SMS Gate Provider</h3>
<p>The provider handles both local and cloud API differences:</p>
<pre><code class="language-typescript">// src/lib/sms/sms-gate.ts

import type { InboundSms, SmsProvider } from &quot;./provider&quot;;

const SMSGATE_URL = process.env.SMSGATE_URL || &quot;http://localhost:8080&quot;;
const SMSGATE_USER = process.env.SMSGATE_USER || &quot;&quot;;
const SMSGATE_PASSWORD = process.env.SMSGATE_PASSWORD || &quot;&quot;;

export class SmsGateProvider implements SmsProvider {
  private headers(): Record&lt;string, string&gt; {
    const auth = Buffer.from(
      `${SMSGATE_USER}:${SMSGATE_PASSWORD}`,
    ).toString(&quot;base64&quot;);
    return {
      &quot;Content-Type&quot;: &quot;application/json&quot;,
      Authorization: `Basic ${auth}`,
    };
  }

  async send(to: string, body: string): Promise&lt;string&gt; {
    const isCloud = SMSGATE_URL.includes(&quot;api.sms-gate.app&quot;);
    const endpoint = isCloud
      ? `${SMSGATE_URL}/3rdparty/v1/messages`
      : `${SMSGATE_URL}/api/3rdparty/v1/message`;
    const payload = isCloud
      ? { textMessage: { text: body }, phoneNumbers: [to] }
      : { phoneNumbers: [to], message: body };

    const res = await fetch(endpoint, {
      method: &quot;POST&quot;,
      headers: this.headers(),
      body: JSON.stringify(payload),
    });

    if (!res.ok) {
      const err = await res.text();
      throw new Error(`SMS Gate send failed: ${res.status} ${err}`);
    }

    const data = await res.json();
    return data.id || &quot;sent&quot;;
  }

  async parseWebhook(req: Request): Promise&lt;InboundSms | null&gt; {
    try {
      const body = await req.json();

      if (body.event !== &quot;sms:received&quot; || !body.payload) {
        return null;
      }

      // The webhook payload documents `sender`; fall back to `phoneNumber`
      // in case an older app version sends that instead
      const { sender, phoneNumber, message, receivedAt } = body.payload;
      const from = sender || phoneNumber;
      if (!from || !message) return null;

      return {
        from,
        body: message,
        receivedAt: receivedAt ? new Date(receivedAt) : new Date(),
      };
    } catch {
      return null;
    }
  }

  webhookResponse(): Response {
    return new Response(JSON.stringify({ ok: true }), {
      headers: { &quot;Content-Type&quot;: &quot;application/json&quot; },
    });
  }
}
</code></pre>
<h3>Webhook Route</h3>
<p>A basic webhook handler that receives inbound SMS and replies:</p>
<pre><code class="language-typescript">// src/app/api/sms/webhook/route.ts

import { NextRequest } from &quot;next/server&quot;;
import { getSmsProvider } from &quot;@/lib/sms/provider&quot;;
// Bring your own lookup, e.g.:
// import { findUserByPhone } from &quot;@/lib/users&quot;;

export async function POST(req: NextRequest) {
  const provider = await getSmsProvider();
  const sms = await provider.parseWebhook(req);

  if (!sms) {
    return new Response(&quot;Bad request&quot;, { status: 400 });
  }

  const { from, body } = sms;

  // Look up the sender — replace with your own user lookup
  const user = await findUserByPhone(from);

  if (!user) {
    await provider.send(from, &quot;Hey! Text us back once you've signed up.&quot;);
    return provider.webhookResponse();
  }

  // Known user — do whatever your app needs
  console.log(`[SMS from ${from}]: ${body}`);
  await provider.send(from, &quot;Got it — we're on it!&quot;);
  return provider.webhookResponse();
}
</code></pre>
<h3>Console Provider (for Testing)</h3>
<p>For local development without a phone:</p>
<pre><code class="language-typescript">// src/lib/sms/console.ts

import type { InboundSms, SmsProvider } from &quot;./provider&quot;;

export class ConsoleProvider implements SmsProvider {
  async send(to: string, body: string): Promise&lt;string&gt; {
    console.log(`[SMS -&gt; ${to}] ${body}`);
    return `console-${Date.now()}`;
  }

  async parseWebhook(req: Request): Promise&lt;InboundSms | null&gt; {
    const data = await req.json();
    return {
      from: data.from || &quot;+15550000000&quot;,
      body: data.body || &quot;&quot;,
      receivedAt: new Date(),
    };
  }

  webhookResponse(): Response {
    return new Response(JSON.stringify({ ok: true }), {
      headers: { &quot;Content-Type&quot;: &quot;application/json&quot; },
    });
  }
}
</code></pre>
<h3>Environment Variables</h3>
<pre><code class="language-bash"># .env

# Provider: &quot;sms-gate&quot; | &quot;console&quot;
SMS_PROVIDER=sms-gate

# Local mode
SMSGATE_URL=http://192.168.1.50:8080
SMSGATE_USER=admin
SMSGATE_PASSWORD=yourpassword

# Cloud mode
# SMSGATE_URL=https://api.sms-gate.app
# SMSGATE_USER=auto-generated-username
# SMSGATE_PASSWORD=auto-generated-password
</code></pre>
<hr>
<h2>Webhook Payload Reference</h2>
<p>When someone texts your Android phone, SMS Gateway sends a POST to your webhook
URL:</p>
<pre><code class="language-json">{
  &quot;id&quot;: &quot;Ey6ECgOkVVFjz3CL48B8C&quot;,
  &quot;webhookId&quot;: &quot;LreFUt-Z3sSq0JufY9uWB&quot;,
  &quot;deviceId&quot;: &quot;your-device-id&quot;,
  &quot;event&quot;: &quot;sms:received&quot;,
  &quot;payload&quot;: {
    &quot;messageId&quot;: &quot;abc123&quot;,
    &quot;message&quot;: &quot;Hello!&quot;,
    &quot;sender&quot;: &quot;+15551234567&quot;,
    &quot;recipient&quot;: &quot;+15559876543&quot;,
    &quot;simNumber&quot;: 1,
    &quot;receivedAt&quot;: &quot;2026-04-01T12:41:59.000+00:00&quot;
  }
}
</code></pre>
<h3>Available Events</h3>
<table>
<thead>
<tr>
<th>Event</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>sms:received</code></td>
<td>Inbound SMS received</td>
</tr>
<tr>
<td><code>sms:sent</code></td>
<td>Outbound SMS sent</td>
</tr>
<tr>
<td><code>sms:delivered</code></td>
<td>Outbound SMS confirmed delivered</td>
</tr>
<tr>
<td><code>sms:failed</code></td>
<td>Outbound SMS failed</td>
</tr>
<tr>
<td><code>system:ping</code></td>
<td>Heartbeat — device still alive</td>
</tr>
</tbody>
</table>
<h3>Webhook Security</h3>
<p>SMS Gateway signs webhook payloads with HMAC-SHA256. Two headers are included:</p>
<ul>
<li><code>X-Signature</code> — hex-encoded HMAC-SHA256 signature</li>
<li><code>X-Timestamp</code> — Unix timestamp used in signing</li>
</ul>
<pre><code class="language-typescript">import crypto from &quot;crypto&quot;;

function verifyWebhook(
  signingKey: string,
  payload: string,
  timestamp: string,
  signature: string,
): boolean {
  const expected = crypto
    .createHmac(&quot;sha256&quot;, signingKey)
    .update(payload + timestamp)
    .digest(&quot;hex&quot;);
  const expectedBuf = Buffer.from(expected, &quot;hex&quot;);
  const signatureBuf = Buffer.from(signature, &quot;hex&quot;);
  // timingSafeEqual throws if the lengths differ, so guard first
  if (expectedBuf.length !== signatureBuf.length) return false;
  return crypto.timingSafeEqual(expectedBuf, signatureBuf);
}
</code></pre>
<h3>Retry Behaviour</h3>
<p>If your server doesn't respond with a 2xx within 30 seconds, SMS Gateway
retries with exponential backoff — starting at 10 seconds, doubling each time,
up to 14 attempts (~2 days). Solid default behaviour; you don't need to
configure anything.</p>
<hr>
<h2>Testing the Full Flow</h2>
<h3>1. Start Your Dev Server</h3>
<pre><code class="language-bash">npm run dev
# Next.js running at http://localhost:4000
</code></pre>
<h3>2. Expose It (Cloud Mode)</h3>
<pre><code class="language-bash">ngrok http 4000
# https://abc123.ngrok.app -&gt; http://localhost:4000
</code></pre>
<h3>3. Register the Webhook</h3>
<pre><code class="language-bash">curl -X POST https://api.sms-gate.app/3rdparty/v1/webhooks \
  -u &quot;USERNAME:PASSWORD&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d '{
    &quot;url&quot;: &quot;https://abc123.ngrok.app/api/sms/webhook&quot;,
    &quot;event&quot;: &quot;sms:received&quot;
  }'
</code></pre>
<h3>4. Send a Text</h3>
<p>Text your Android phone from another phone. You should see:</p>
<ol>
<li>SMS Gateway receives the text</li>
<li>Webhook fires to your ngrok URL</li>
<li>Your Next.js server processes it</li>
<li>A reply SMS is sent back via the API</li>
<li>The sender's phone receives the reply</li>
</ol>
<p>That moment when the reply lands on your phone — genuinely satisfying.</p>
<h3>Test Without a Phone</h3>
<pre><code class="language-bash"># Simulate an inbound SMS with the console provider
SMS_PROVIDER=console npm run dev

curl -X POST http://localhost:4000/api/sms/webhook \
  -H &quot;Content-Type: application/json&quot; \
  -d '{&quot;from&quot;: &quot;+15551234567&quot;, &quot;body&quot;: &quot;Hello&quot;}'
</code></pre>
<hr>
<h2>Production Considerations</h2>
<h3>The Phone Setup</h3>
<ul>
<li><strong>Dedicated device:</strong> Use a cheap Android phone ($20) with a prepaid SIM. It
stays on a charger, connected to WiFi. That's its whole life now.</li>
<li><strong>Battery optimisation off:</strong> Disable battery optimisation for SMS Gateway or
Android will kill it. <a href="https://dontkillmyapp.com/">dontkillmyapp.com</a> for your
specific device.</li>
<li><strong>Auto-start:</strong> Enable &quot;start on boot&quot; in the SMS Gateway app settings.</li>
<li><strong>Monitoring:</strong> Register a <code>system:ping</code> webhook to alert if the device goes
offline.</li>
</ul>
<h3>Local vs Cloud</h3>
<table>
<thead>
<tr>
<th></th>
<th>Local</th>
<th>Cloud</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Latency</strong></td>
<td>Lower (direct)</td>
<td>Slightly higher (relay)</td>
</tr>
<tr>
<td><strong>Network</strong></td>
<td>Same network required</td>
<td>Works from anywhere</td>
</tr>
<tr>
<td><strong>Privacy</strong></td>
<td>Messages never leave your network</td>
<td>Messages transit through SMS Gateway's servers</td>
</tr>
<tr>
<td><strong>Reliability</strong></td>
<td>Depends on your network</td>
<td>Adds FCM/SSE redundancy</td>
</tr>
<tr>
<td><strong>Cost</strong></td>
<td>Free</td>
<td>Free (community tier)</td>
</tr>
</tbody>
</table>
<p>I use <strong>cloud mode in production</strong> because my server's hosted on Railway and
can't reach the phone's local network. For development on the same WiFi, local
mode is simpler and faster.</p>
<h3>Cost Comparison</h3>
<table>
<thead>
<tr>
<th>Provider</th>
<th>SMS Cost</th>
<th>Monthly (1,000 msgs)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Twilio</td>
<td>~$0.05/msg</td>
<td>~$50</td>
</tr>
<tr>
<td>SMS Gateway + Prepaid SIM</td>
<td>$0/msg (unlimited plan)</td>
<td>~$8 (plan cost)</td>
</tr>
</tbody>
</table>
<p>That's an <strong>80%+ saving</strong>, and the gap only widens with volume — 10,000
messages a month is still just your plan cost.</p>
<hr>
<p>It's worth knowing this is a whole category now. <a href="https://httpsms.com/">httpSMS</a>
and <a href="https://textbee.dev/">textbee</a> do similar things. I went with
<a href="https://github.com/capcom6/android-sms-gateway">SMS Gateway for Android</a>
because the local mode is properly useful for development, the
<a href="https://docs.sms-gate.app/">documentation</a> is solid, and it's actively
maintained — v1.56.0 dropped in March 2026.</p>
<p>For an MVP, the maths is obvious. A $20 phone and an $8/month plan gets you a
programmable SMS gateway that you fully control. No per-message fees, no carrier
contracts, no vendor lock-in. If you outgrow it, swap the provider interface to
Twilio and you're done — that's why the abstraction exists.</p>
<p><strong>Links:</strong></p>
<ul>
<li><a href="https://github.com/capcom6/android-sms-gateway">SMS Gateway for Android on GitHub</a></li>
<li><a href="https://docs.sms-gate.app/">SMS Gateway Documentation</a></li>
<li><a href="https://play.google.com/store/apps/details?id=me.capcom.smsgateway">Google Play Store listing</a></li>
</ul>
]]>
      </content:encoded>
      <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Wrangling a Million Crime Records</title>
      <link>https://jonnonz.com/posts/wrangling-a-million-crime-records/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/wrangling-a-million-crime-records/</guid>
      <description>
        NZ Police's crime dataset is publicly available, but it's UTF-16 encoded, full of trailing periods, and 32% of records don't know what time the crime happened. Here's how we cleaned it up.
      </description>
      <content:encoded>
        <![CDATA[<p>The very first thing NZ Police's crime dataset teaches you is that government
data is never straightforward.</p>
<p>You download the CSV from
<a href="https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz/victimisation-time-and-place">policedata.nz</a>,
expecting to do a quick <code>pd.read_csv()</code> and start exploring. Instead you get a
503MB file encoded in UTF-16 Little Endian with tab delimiters. Not a regular
CSV. Not even close. This is a legacy format from old Excel exports and most
tools just silently corrupt it if you try to read it as UTF-8.</p>
<pre><code class="language-python">df = pd.read_csv(&quot;data.csv&quot;, encoding=&quot;utf-16-le&quot;, sep=&quot;\t&quot;)
</code></pre>
<p>That one line took longer to figure out than I'd like to admit.</p>
<h2>What's actually in here</h2>
<p>Once you get past the encoding, there's a lot to work with. 1,154,102 rows
covering every reported victimisation in New Zealand from February 2022 through
January 2026. Each row tells you the crime type (ANZSOC Division), where it
happened (down to meshblock level), when it happened (month, day of week, hour
of day), and sometimes what weapon was involved.</p>
<p>There are 20 columns, but five of them are useless: three are duplicates of
&quot;Year Month&quot; and two are constants that add zero information. Every area name
and territorial authority has a trailing period stuck on the end: &quot;Auckland.&quot;,
&quot;Woodglen.&quot;, &quot;Christchurch City.&quot;. A quirk of the export that'll break any
geographic join if you don't strip them.</p>
<p>And meshblock IDs? Some are 6 digits, some are 7. Stats NZ boundary files use
7-digit codes consistently, so shorter ones need zero-padding. The kind of thing
that's invisible until your join silently drops 19% of your records and you
spend an afternoon figuring out why.</p>
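<p>Both fixes are one-liners once you know they're needed. A minimal sketch, with
guessed column names rather than the raw export's actual headers:</p>
<pre><code class="language-python"># Strip the trailing periods the export sticks on every name
for col in [&quot;area_unit&quot;, &quot;territorial_authority&quot;]:
    df[col] = df[col].str.rstrip(&quot;.&quot;)

# Zero-pad meshblock IDs to 7 digits so boundary joins don't silently miss
df[&quot;meshblock&quot;] = df[&quot;meshblock&quot;].astype(str).str.zfill(7)
</code></pre>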
<h2>What the missing data tells you</h2>
<p>That bit actually made me stop and think. 32.2% of records have the hour of day
recorded as 99 (unknown). Another 23.2% have the day of week as &quot;UNKNOWN&quot;.</p>
<p>At first this looks like a data quality problem. But it's not. It's telling you
something about the nature of the crime. If someone breaks into your house while
you're at work, you come home to find your stuff gone. Was it 9am or 2pm? You've
got no idea, and neither do the police.</p>
<p>Property crimes (theft, burglary) make up the bulk of these unknowns. Assault,
by contrast, almost always has a precise time because there's a victim present
when it happens. The absence of data is itself a signal about what kind of crime
you're looking at.</p>
<p>78.6% of location type values are &quot;.&quot; (effectively missing). That column is
sparsely populated but still useful for the roughly one in five records that
have it.</p>
<h2>Cleaning it up</h2>
<p>We built a modular pipeline where each cleaning step is its own function.
Nothing fancy, just practical:</p>
<pre><code class="language-python">def ingest() -&gt; pd.DataFrame:
    df = load_raw_csv(RAW_CSV)            # UTF-16 LE, tab-delimited
    df = drop_redundant_columns(df)        # Remove 5 useless columns
    df = rename_columns(df)                # snake_case everything
    df = parse_dates(df)                   # &quot;July 2022&quot; → datetime
    df = clean_strings(df)                 # Strip trailing periods
    df = clean_meshblocks(df)              # Zero-pad to 7 digits
    df = encode_unknowns(df)              # 99 → NaN, &quot;UNKNOWN&quot; → NaN
    df = map_crime_types(df)               # ANZSOC Division → short enum
    return df
</code></pre>
<p>Each function does one thing. If something breaks, you know exactly where. If
someone wants to understand the pipeline, they can read it top to bottom in
about thirty seconds. I've been bitten enough times by monolithic data scripts
that I'm allergic to them now.</p>
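<p>For a flavour of what lives inside those steps, here's a sketch of two of
them. The column names are assumptions about what things look like
post-rename:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

def parse_dates(df: pd.DataFrame) -&gt; pd.DataFrame:
    # &quot;July 2022&quot; → a proper month-granularity datetime
    df[&quot;date&quot;] = pd.to_datetime(df[&quot;year_month&quot;], format=&quot;%B %Y&quot;)
    return df

def encode_unknowns(df: pd.DataFrame) -&gt; pd.DataFrame:
    # 99 is the &quot;unknown hour&quot; sentinel; make it an honest NaN
    df[&quot;hour&quot;] = df[&quot;hour&quot;].replace(99, np.nan)
    df[&quot;day_of_week&quot;] = df[&quot;day_of_week&quot;].replace(&quot;UNKNOWN&quot;, np.nan)
    return df
</code></pre>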
<p>The crime type mapping turns six ANZSOC Division values into short enums:</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>Count</th>
<th>Share</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>761,977</td>
<td>66.0%</td>
</tr>
<tr>
<td>Burglary</td>
<td>247,034</td>
<td>21.4%</td>
</tr>
<tr>
<td>Assault</td>
<td>115,383</td>
<td>10.0%</td>
</tr>
<tr>
<td>Robbery</td>
<td>14,860</td>
<td>1.3%</td>
</tr>
<tr>
<td>Sexual</td>
<td>13,943</td>
<td>1.2%</td>
</tr>
<tr>
<td>Harm</td>
<td>905</td>
<td>0.1%</td>
</tr>
</tbody>
</table>
<p>That 66% theft number is going to haunt us when we get to model training. Any
loss function you throw at this data will overwhelmingly optimise for predicting
theft, because that's two-thirds of everything. The class imbalance is real and
it matters.</p>
<h2>503MB to 6.3MB</h2>
<p>The cleaned output goes to
<a href="https://www.datacamp.com/tutorial/apache-parquet">Apache Parquet</a> with snappy
compression. The result?</p>
<ul>
<li><strong>Input</strong>: 503MB CSV (UTF-16, 20 columns)</li>
<li><strong>Output</strong>: 6.3MB Parquet (21 columns including derived fields)</li>
<li><strong>Compression</strong>: ~80x</li>
</ul>
<p>That's not a typo. Parquet's columnar storage is dramatically more efficient
than row-oriented CSV, especially when you've got columns full of repeated
values like crime types and territorial authorities. The file loads in under a
second compared to 3+ seconds for the CSV. When you're iterating on analysis and
loading this data hundreds of times, that adds up fast.</p>
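<p>The conversion itself is barely any code (pandas with <code>pyarrow</code>
installed; the filename is illustrative):</p>
<pre><code class="language-python"># Snappy is the default codec; being explicit documents the choice
df.to_parquet(&quot;crimes_clean.parquet&quot;, compression=&quot;snappy&quot;)

# Reloading takes well under a second
df = pd.read_parquet(&quot;crimes_clean.parquet&quot;)
</code></pre>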
<p>The 21 output columns include the original 16 we kept plus five derived ones: a
proper datetime, year, month, day-of-week as an integer, and the short crime
type enum.</p>
<h2>Sanity checks</h2>
<p>Before calling the data clean, we verify everything that matters:</p>
<ul>
<li>Row count: 1,154,102 (all rows preserved, nothing dropped)</li>
<li>No nulls in key columns: crime_type, date, area_unit, territorial_authority,
meshblock</li>
<li>Date range: Feb 2022 to Jan 2026 (all 48 months present)</li>
<li>Auckland: 412,669 records, 36% of total (exactly where it should be)</li>
<li>Theft: 761,977 records, 66% (as expected)</li>
<li>No trailing periods anywhere in area names</li>
<li>All meshblock IDs are 7 digits</li>
<li>Max hour value is 23 (no more 99s leaking through)</li>
</ul>
<p>You want these checks automated and running every time you regenerate the data.
Future you will thank past you when something upstream changes and a check
catches it.</p>
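<p>In practice that's a function full of asserts at the end of the pipeline. A
sketch, again with assumed column names:</p>
<pre><code class="language-python">def sanity_check(df: pd.DataFrame) -&gt; None:
    # Hard-fail the pipeline if an upstream change breaks an invariant
    assert len(df) == 1_154_102, &quot;row count changed&quot;
    assert df[&quot;crime_type&quot;].notna().all(), &quot;null crime types&quot;
    assert df[&quot;meshblock&quot;].str.len().eq(7).all(), &quot;bad meshblock width&quot;
    assert df[&quot;hour&quot;].max() &lt;= 23, &quot;a 99 leaked through&quot;
</code></pre>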
<h2>What's next</h2>
<p>We've got clean, compressed crime data, but the records only have meshblock IDs
and area unit names. No coordinates. No shapes on a map. In the next post, we'll
download Stats NZ geographic boundary files and join them to our crime records,
giving every victimisation a place in physical space.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Crime as Video</title>
      <link>https://jonnonz.com/posts/crime-as-video/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/crime-as-video/</guid>
      <description>
        Turn a million geo-tagged crime records into a 4D tensor by overlaying a 500m grid on Auckland. Crime prediction becomes video prediction.
      </description>
      <content:encoded>
        <![CDATA[<p>This is where the project gets properly fun.</p>
<p>We've got 1.15 million clean crime records. Every one of them has coordinates:
either precise meshblock centroids or area unit fallbacks from Part 2. But a bag
of lat/lon points isn't what a neural network wants. ConvLSTM and ST-ResNet are
fundamentally image-processing architectures. They expect regular 2D grids, rows
and columns, like pixels in a photograph.</p>
<p>So our job now is to convert the messy reality of crime locations into clean,
regular &quot;crime images&quot; that a convolutional network can actually consume. And
once you see it framed that way, crime prediction becomes video prediction. Each
month is a frame. Each grid cell is a pixel. The brightness is the crime count.</p>
<h2>Choosing 500m</h2>
<p>This is the single most consequential decision in the entire data pipeline. Get
the grid resolution wrong and everything downstream suffers.</p>
<p>Too fine (say 100m cells) and the vast majority of cells are empty in any given
month. The model sees an ocean of zeros with occasional spikes, which is
incredibly hard to learn from. Too coarse (say 2km) and you've blurred away the
spatial patterns you're trying to detect. &quot;Auckland CBD&quot; and &quot;Ponsonby&quot; become
the same cell, which is useless.</p>
<p>We computed Auckland's urban crime extent from the meshblock centroids (5th to
95th percentile to exclude outliers like Great Barrier Island):</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Urban extent</td>
<td>27.7 km × 36.9 km</td>
</tr>
<tr>
<td>Grid resolution</td>
<td>500m × 500m</td>
</tr>
<tr>
<td>Grid dimensions</td>
<td>77 rows × 59 columns</td>
</tr>
<tr>
<td>Total cells</td>
<td>4,543</td>
</tr>
</tbody>
</table>
<p>At 500m, each cell covers roughly a few city blocks. That's fine enough to
distinguish a commercial strip from a residential street, but coarse enough that
most cells accumulate at least some crime over the 48-month period. It's a sweet
spot, and it's consistent with what
<a href="https://arxiv.org/abs/2502.07465">recent crime forecasting research</a> uses for
similar models in US cities.</p>
<h2>Simple maths, no spatial joins</h2>
<p>Working in NZTM2000 (the coordinate system we set up in Part 2, where units are
metres) makes the next bit easy. Assigning a crime to a grid cell is just floor
division:</p>
<pre><code class="language-python">grid_j = floor((x - xmin) / 500)  # column index
grid_i = floor((y - ymin) / 500)  # row index
</code></pre>
<p>No spatial joins, no polygon intersection, no geopandas overhead. Just
arithmetic. It processes all 400k Auckland records in under a second.</p>
<p>For the ~22% of Auckland records that didn't get meshblock coordinates in Part
2, we fall back to area unit centroids converted to NZTM2000. Those records land
at the centre of their suburb rather than their exact location. Less precise,
but dropping them entirely would be worse.</p>
<p>The result: 354,387 of 412,669 Auckland records (86.2%) fall within the grid.
The remaining 14% are in Auckland's outer fringes (Great Barrier Island, rural
Rodney, the edges of the Waitakere Ranges) beyond our urban bounding box. That's
fine. We're modelling urban crime patterns, not rural ones.</p>
<h2>The 4D tensor</h2>
<p>With every crime assigned to a cell, we aggregate by grid position, month, and
crime type:</p>
<pre><code>(grid_i, grid_j, month, crime_type) → sum(victimisations)
</code></pre>
<p>This gives us a 4D tensor:</p>
<table>
<thead>
<tr>
<th>Dimension</th>
<th>Size</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>T (time)</td>
<td>48</td>
<td>Months: Feb 2022 – Jan 2026</td>
</tr>
<tr>
<td>H (height)</td>
<td>77</td>
<td>Grid rows (south → north)</td>
</tr>
<tr>
<td>W (width)</td>
<td>59</td>
<td>Grid columns (west → east)</td>
</tr>
<tr>
<td>C (channels)</td>
<td>6</td>
<td>Crime types: theft, burglary, assault, robbery, sexual, harm</td>
</tr>
</tbody>
</table>
<p>Think of it as a 48-frame video with 6 colour channels. A regular video has 3
channels: red, green, blue. Ours has 6: theft, burglary, assault, robbery,
sexual offences, harm. Each pixel's brightness in a given channel tells you how
many of that crime type happened in that 500m cell during that month.</p>
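<p>Building it is essentially one <code>np.add.at</code> call once the aggregation is
done. The <code>counts</code> frame and its column names here are assumptions, not
the actual pipeline's names:</p>
<pre><code class="language-python">import numpy as np

T, H, W, C = 48, 77, 59, 6
tensor = np.zeros((T, H, W, C), dtype=np.float32)

# counts: one row per (month_idx, grid_i, grid_j, crime_idx) with a count n
np.add.at(
    tensor,
    (counts.month_idx, counts.grid_i, counts.grid_j, counts.crime_idx),
    counts.n,
)
</code></pre>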
<p>I genuinely love this framing. It takes a complicated spatial-temporal
prediction problem and maps it onto something that decades of computer vision
research already knows how to handle.</p>
<h2>91.7% zeros</h2>
<p>The tensor is overwhelmingly empty. 91.7% of all cells are zero.</p>
<p>This makes complete sense if you think about it. Most 500m squares in Auckland
don't have a single reported crime in any given month. Crime clusters:
commercial corridors, transport hubs, specific residential pockets. The non-zero
8.3% is where all the signal lives.</p>
<p>The sparsity does create a training challenge though. If the model just
predicted zero everywhere, it'd be right 91.7% of the time. Useless, but
technically accurate. That's why we'll use <code>log1p</code> normalisation during
training. It compresses the range from [0, 50+] to [0, ~4], giving the model a
more balanced gradient to learn from. And it's why the loss function needs to
care more about the non-zero cells than the empty ones.</p>
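<p>The normalisation is one call each way; <code>pred</code> here is a stand-in for
whatever the model outputs:</p>
<pre><code class="language-python">X = np.log1p(tensor)          # counts in [0, 50+] → roughly [0, 4]
counts_pred = np.expm1(pred)  # invert model output back to counts
</code></pre>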
<p>The upside of all those zeros is storage. The
<a href="https://numpy.org/doc/stable/reference/generated/numpy.savez_compressed.html">compressed numpy format</a>
handles sparse data beautifully. The full 4D tensor saves to just 0.2 MB.
Compare that to the 21.9 MB Parquet from Part 2.</p>
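<p>Saving and loading is equally small (filename illustrative):</p>
<pre><code class="language-python">np.savez_compressed(&quot;crime_tensor.npz&quot;, tensor=tensor)
tensor = np.load(&quot;crime_tensor.npz&quot;)[&quot;tensor&quot;]
</code></pre>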
<h2>Train, validate, test</h2>
<p>We split the 48 months temporally. No shuffling, no random sampling:</p>
<table>
<thead>
<tr>
<th>Set</th>
<th>Months</th>
<th>Range</th>
</tr>
</thead>
<tbody>
<tr>
<td>Train</td>
<td>36</td>
<td>Feb 2022 – Jan 2025</td>
</tr>
<tr>
<td>Validation</td>
<td>6</td>
<td>Feb 2025 – Jul 2025</td>
</tr>
<tr>
<td>Test</td>
<td>6</td>
<td>Aug 2025 – Jan 2026</td>
</tr>
</tbody>
</table>
<p>The model trains on three years, tunes on six months, and gets evaluated on the
most recent six months it's never seen. There's no spatial leakage either. We
don't hold out specific grid cells. The model has to predict all locations for
future months simultaneously.</p>
<p>This is the only honest way to evaluate a time-series model. If you randomly
shuffle months into train and test, the model can memorise seasonal patterns and
look brilliant without actually learning anything useful about temporal
dynamics.</p>
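<p>In code, the split is nothing more than slicing the time axis. No indices to
shuffle, nothing to randomise:</p>
<pre><code class="language-python">train = tensor[:36]    # Feb 2022 – Jan 2025
val = tensor[36:42]    # Feb 2025 – Jul 2025
test = tensor[42:]     # Aug 2025 – Jan 2026
</code></pre>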
<h2>What the tensor reveals</h2>
<p>Even at this aggregate level, clear patterns jump out.</p>
<p>February tends to be the quietest month (~7–8k victimisations across Auckland),
while October through January (spring and early summer) consistently peaks at
8.5–9.5k. 2023 was the peak year across the board, with a gradual decline
through 2024 and into 2025.</p>
<p>Theft accounts for 72% of the tensor values (283k victimisations), burglary 17%
(68k), and assault 9% (34k). That theft dominance from Part 1, the 66% figure,
gets even more pronounced when you focus on Auckland, because theft clusters
harder in urban areas than other crime types do.</p>
<h2>What's next</h2>
<p>The tensor is built. The model input is ready. But before throwing deep learning
at anything, we need to properly understand what patterns actually exist in this
data: when does crime peak, where does it cluster, and how do different crime
types behave differently. Next post: exploratory data analysis.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Giving Crime a Place on the Map</title>
      <link>https://jonnonz.com/posts/giving-crime-a-place-on-the-map/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/giving-crime-a-place-on-the-map/</guid>
      <description>
        Crime records come with names and codes but no coordinates. Here's how we joined 1.15 million records to Stats NZ boundary files and gave every crime a place in physical space.
      </description>
      <content:encoded>
        <![CDATA[<p>A crime record that says &quot;Woodglen, meshblock 0284305&quot; is useless for spatial
modelling. It's a name and a number. You can't plot it, you can't measure
distances from it, and you definitely can't feed it to a neural network that
thinks in grid cells.</p>
<p>To do anything spatial, every record needs actual coordinates: latitude,
longitude, or ideally metres on a proper projection. That means downloading
Stats NZ's geographic boundary files and joining them to our crime data.</p>
<h2>NZ's geographic hierarchy</h2>
<p>New Zealand has a neat nested system of geographic units maintained by
<a href="https://datafinder.stats.govt.nz/layer/92197-meshblock-2018-generalised/">Stats NZ</a>:</p>
<pre><code class="language-mermaid">graph TD
    A[&quot;Region (16)&quot;] --&gt; B[&quot;Territorial Authority (67)&quot;]
    B --&gt; C[&quot;Area Unit / SA2 (~2,000)&quot;]
    C --&gt; D[&quot;Meshblock (~53,000)&quot;]
</code></pre>
<p>Regions are the big ones: Auckland, Canterbury, Wellington. Territorial
authorities are your cities and districts. Area units are roughly suburb-sized.
And meshblocks are the smallest unit, about 100 people each, roughly a city
block. Our crime data uses area units and meshblocks, so those are the layers we
need.</p>
<p>There's a gotcha here. Stats NZ replaced &quot;Area Units&quot; with &quot;Statistical Area 2&quot;
(SA2) in 2018 as part of a geographic classification overhaul. But the NZ Police
crime data still uses the old area unit names. So we need the <strong>2017 vintage</strong>
boundary files, not the current ones. Use the wrong vintage and your join
silently fails on hundreds of area units. Ask me how I know.</p>
<h2>Three boundary files</h2>
<p>We downloaded three layers from
<a href="https://datafinder.stats.govt.nz/">Stats NZ DataFinder</a> via their WFS API:</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Features</th>
<th>Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>Area Unit 2017 (generalised)</td>
<td>2,004</td>
<td>88 MB</td>
</tr>
<tr>
<td>Meshblock 2018 (generalised)</td>
<td>53,589</td>
<td>213 MB</td>
</tr>
<tr>
<td>Territorial Authority 2023</td>
<td>68</td>
<td>34 MB</td>
</tr>
</tbody>
</table>
<p>All three come in <a href="https://epsg.io/2193">EPSG:2193</a>, which is
<a href="https://www.linz.govt.nz/guidance/geodetic-system/coordinate-systems-used-new-zealand/projections/new-zealand-transverse-mercator-2000-nztm2000">NZTM2000</a>,
New Zealand's official projected coordinate system. The units are metres, not
degrees. This matters a lot later when we need to build a &quot;500m grid&quot;. You want
that to be 500 actual metres, not some approximation based on latitude.</p>
<p>We use generalised (simplified) versions rather than high-definition. The
full-resolution meshblock layer is over a gigabyte. For centroid calculations
and spatial joins, the generalised versions are more than accurate enough.</p>
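<p>Loading a layer and computing centroids in geopandas takes a few lines. The
filename is illustrative; the CRS assertion is the bit worth keeping:</p>
<pre><code class="language-python">import geopandas as gpd

mb = gpd.read_file(&quot;meshblock_2018_generalised.gpkg&quot;)
assert mb.crs.to_epsg() == 2193  # NZTM2000: units are metres

# Centroids of the generalised polygons are accurate enough for our joins
mb[&quot;cx&quot;] = mb.geometry.centroid.x
mb[&quot;cy&quot;] = mb.geometry.centroid.y
</code></pre>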
<h2>The area unit join: 99.4%</h2>
<p>Joining crime records to area unit boundaries by name was almost perfect.
1,146,721 of 1,154,102 records matched, 99.4%.</p>
<p>Only two area unit codes failed:</p>
<ul>
<li><code>999999</code>: the official &quot;unspecified&quot; catch-all (7,331 records)</li>
<li><code>-29</code>: a straight-up data entry error (50 records)</li>
</ul>
<p>That's a genuinely excellent result. The unmatched records aren't a bug in our
pipeline. They're unlocatable crimes that the police couldn't assign to a
specific area. Nothing we can do about those, and nothing we should try to do.</p>
<h2>The meshblock join: 81.2%</h2>
<p>The meshblock join came in lower at 81.2%, with 937,604 records matched out of
1,154,102.</p>
<p>This is expected and it's fine. Here's why: NZ meshblock boundaries get revised
with every census. We're using 2018 boundaries, but our crime data runs through
January 2026. Any crime from 2023 onwards might reference a 2023-vintage
meshblock code that simply doesn't exist in the 2018 file. Some meshblocks get
split, some get merged, some get renumbered entirely.</p>
<p>81.2% still gives us fine-grained coordinates for the vast majority of records.
For the ~19% that miss, we fall back to the area unit centroid. It's less
precise (suburb-level instead of block-level) but better than dropping the
records entirely.</p>
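<p>The fallback itself is a pair of <code>fillna</code> calls, assuming meshblock and
area unit centroid columns named roughly like this:</p>
<pre><code class="language-python"># Meshblock centroid where we have one, suburb centroid otherwise
df[&quot;x&quot;] = df[&quot;mb_x&quot;].fillna(df[&quot;au_x&quot;])
df[&quot;y&quot;] = df[&quot;mb_y&quot;].fillna(df[&quot;au_y&quot;])
</code></pre>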
<h2>Two coordinate systems</h2>
<p>This is one of those things that seems like a minor detail but will bite you
hard if you get it wrong. We use two coordinate reference systems throughout the
project:</p>
<p><strong>NZTM2000 (EPSG:2193)</strong> for all spatial analysis. The units are metres, which
makes grid construction trivial: a 500m cell is literally 500 units on each
axis. Distance calculations are straightforward. No need to worry about the fact
that a degree of longitude means different things at different latitudes.</p>
<p><strong>WGS84 (EPSG:4326)</strong> for the frontend dashboard only. deck.gl and MapLibre
expect coordinates in degrees (latitude/longitude), which is the standard for
web mapping.</p>
<p>The rule is simple: do everything in NZTM2000, convert to WGS84 at the very end
when exporting for the dashboard. Mixing coordinate systems mid-pipeline is a
recipe for bugs that are incredibly annoying to track down.</p>
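<p>In geopandas the rule is easy to follow because the conversion is a single
call at export time, with <code>gdf</code> standing in for the enriched
GeoDataFrame:</p>
<pre><code class="language-python"># Everything upstream stays in EPSG:2193; convert once for the dashboard
gdf_wgs84 = gdf.to_crs(epsg=4326)
gdf_wgs84.to_file(&quot;dashboard_export.geojson&quot;, driver=&quot;GeoJSON&quot;)
</code></pre>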
<h2>The output</h2>
<p>Each crime record now has up to 8 new geographic columns: area unit centroids,
meshblock centroids, and areas in both coordinate systems. The enriched dataset
saves as <code>crimes_with_geo.parquet</code> at 21.9 MB with 29 columns.</p>
<p>Quick sanity check: Auckland's mean crime centroid lands at lat -36.90, lon
174.78. Right in the middle of the urban area. If that number had come back as
somewhere in the Waikato, we'd know something went wrong.</p>
<h2>What's next</h2>
<p>Every crime record now has a place in physical space. But individual points
aren't what the neural network needs. It needs a regular grid. In the next post,
we'll overlay a 500m × 500m grid on Auckland, count crimes per cell per month,
and build the 4D tensor that turns crime prediction into a video prediction
problem.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Predicting Crime in Aotearoa</title>
      <link>https://jonnonz.com/posts/predicting-crime-in-aotearoa/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/predicting-crime-in-aotearoa/</guid>
      <description>NZ Police publish over a million crime records openly. What happens when you point deep learning at them?</description>
      <content:encoded>
        <![CDATA[<p>NZ Police publish every recorded victimisation in the country, over a million
records, and most people have no idea.</p>
<p>I stumbled across
<a href="https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz">policedata.nz</a>
a while back and was surprised by how much is there. Every reported theft,
assault, burglary, robbery, broken down by location, time of day, day of week,
and month. All the way down to meshblock level, which is roughly a city block.
Updated monthly.
<a href="https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz">Creative Commons licensed</a>.
You can just... use it.</p>
<p>So naturally I started wondering: what happens if you point deep learning at
this?</p>
<h2>A million rows of crime</h2>
<p>The dataset I pulled covers February 2022 through January 2026. Four years,
1,154,102 records across the whole country. The breakdown is roughly what you'd
expect: theft dominates at 66%, followed by burglary at 21% and assault at 10%.
The remaining sliver covers robbery, sexual offences, and harm/endangerment.</p>
<p>What makes it interesting for modelling is the spatial granularity. Each record
maps to one of 42,778 meshblocks (tiny geographic units defined by Stats NZ).
That's detailed enough to see patterns at a neighbourhood level, not just
&quot;Auckland has more crime than Tauranga&quot; (which, yeah, obviously).</p>
<p>Auckland alone accounts for about 36% of all recorded crime. Then there's a long
tail: Wellington, Christchurch, Hamilton, and then it drops off fast. NZ's urban
geography is weird like that. One mega-city and a bunch of mid-size towns.</p>
<h2>The idea</h2>
<p>The core question is pretty simple. Given the crime patterns of the last few
months, can we predict what the next month looks like?</p>
<p>This isn't Minority Report. Nobody's getting arrested for crimes they haven't
committed. It's pattern recognition on publicly available statistics, the same
kind of modelling people do with weather data or traffic flows.</p>
<p>The neat trick is how you frame it. If you overlay a grid on a city (say 500m by
500m cells) and count crimes per cell per month, you get something that looks a
lot like a video. Each month is a frame. Each cell is a pixel. The brightness is
the crime count.</p>
<p>Predicting next month's crime becomes a video prediction problem. And there are
some really cool deep learning architectures built exactly for that.</p>
<h2>ConvLSTM and ST-ResNet</h2>
<p>The two models I'm building are ConvLSTM and ST-ResNet. Don't worry if those
sound like gibberish. The short version: they're neural networks designed to
learn patterns that are both spatial (where things cluster) and temporal (how
those clusters change over time).</p>
<p><strong>ConvLSTM</strong> is the primary model. A standard LSTM network is great at learning
sequences. It's the architecture behind a lot of language and time-series
models. ConvLSTM swaps out the matrix multiplications for convolutions, which
means it can process grid-structured data. Feed it the last six months of crime
grids and it learns both the shape of hotspots and how they evolve.
<a href="https://arxiv.org/abs/2502.07465">Recent research</a> has shown these work well
for crime forecasting across multiple US cities.</p>
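<p>Part 6 builds this properly, but the core idea fits in a few lines of
PyTorch. This is an illustrative cell, not the model I'm training:</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    &quot;&quot;&quot;An LSTM cell whose gate computations are convolutions.&quot;&quot;&quot;

    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        # One conv produces all four gates at once
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state  # hidden and cell state: (batch, hid_ch, H, W)
        gates = self.gates(torch.cat([x, h], dim=1))
        i, f, g, o = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)  # updated cell state
        h = o * torch.tanh(c)          # new hidden state is also the output
        return h, c
</code></pre>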
<p><strong>ST-ResNet</strong> takes a different angle. Instead of one sequential view, it
captures three temporal perspectives: what happened recently, what happened at
the same time last year, and what's the long-term trend. Each gets its own
branch of residual convolutional networks, and a learned fusion layer combines
them. The
<a href="https://ojs.aaai.org/index.php/AAAI/article/view/10735">original paper</a> was for
crowd flow prediction in Beijing, but the architecture
<a href="https://www.nature.com/articles/s41598-025-24559-7">translates well to crime data</a>.</p>
<h2>Why NZ?</h2>
<p>Almost all published crime prediction research uses US data. Chicago, Los
Angeles, New York. A
<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC7319308/">systematic review of spatial crime forecasting</a>
makes this pretty clear. The models are well-studied, but they're trained on
American cities with American urban patterns.</p>
<p>New Zealand doesn't look like that. Our cities are smaller, more spread out, and
the distribution is completely different. Auckland dominates in a way that no
single US city does relative to the rest of the country. The spatial patterns
here are their own thing, and I couldn't find anyone who'd applied these deep
learning approaches to NZ data.</p>
<p>That's what got me keen. Not because I think I'll beat the published benchmarks.
Those researchers have GPUs and PhD students; I have a Ryzen 5 desktop with no
graphics card. But applying known techniques to new geography is useful work,
and nobody else seems to have done it.</p>
<h2>No GPU, no problem (mostly)</h2>
<p>All of this runs on my desktop, an AMD Ryzen 5 5600GT with 12 threads and 30GB
of RAM. No GPU at all. That sounds limiting, but the Auckland 500m grid works
out to about 60 by 80 cells. The ConvLSTM model ends up around 5 million
parameters, which trains in under an hour on CPU. You don't always need a beefy
rig.</p>
<p>It does mean being smart about model sizing and not going crazy with
hyperparameter searches. But for a hobby project, it's more than enough.</p>
<h2>What's coming</h2>
<p>This is the first post in a ten-part series covering the whole project end to
end.</p>
<p><strong>Part 1: Data Acquisition and Exploration.</strong> We start with a 503MB CSV file
from NZ Police that's UTF-16 encoded (because of course it is), has trailing
periods on area names, and 32% of records with unknown hour-of-day. We'll
wrangle it into a clean, typed Parquet file and get our first look at what's
actually in there.</p>
<p><strong>Part 2: Geographic Data Pipeline.</strong> Crime records come with meshblock IDs, but
no coordinates. We'll join them to Stats NZ geographic boundary files using
geopandas, giving every record a place on the map.</p>
<p><strong>Part 3: Spatiotemporal Grid Construction.</strong> This is where it gets fun. We
overlay a 500m by 500m grid on Auckland, count crimes per cell per month, and
build the 4D tensors that feed the neural networks. Crime prediction becomes
video prediction.</p>
<p><strong>Part 4: Exploratory Data Analysis.</strong> Before throwing deep learning at
anything, we need to understand what patterns actually exist. When does crime
peak? Where does it cluster? How do different crime types behave differently?</p>
<p><strong>Part 5: Baseline Models.</strong> Simple benchmarks (historical averages, naive
persistence) so we know whether the deep learning is actually adding value or
just being fancy for the sake of it.</p>
<p><strong>Part 6: ConvLSTM Architecture.</strong> Building and training the primary model.
Three ConvLSTM layers, six-month lookback window, learning spatial hotspots and
temporal dynamics simultaneously.</p>
<p><strong>Part 7: ST-ResNet Architecture.</strong> The three-branch alternative that captures
closeness, periodicity, and long-term trend separately, then fuses them with
learned weights.</p>
<p><strong>Part 8: Model Evaluation and Comparison.</strong> Which model wins? By how much? And
more importantly, where do they fail?</p>
<p><strong>Part 9: Building the Dashboard.</strong> A 3D interactive map built with deck.gl
where you can watch crime patterns evolve over time. Dark theme, extruded
columns, time-lapse playback.</p>
<p><strong>Part 10: Deployment and Reflections.</strong> Shipping to Vercel, what worked, what
didn't, and what I'd do differently next time.</p>
<p>Every post will include code and real results. The whole codebase will be open
source. And I'll be upfront about the stuff that didn't work. Trust me, there's
plenty of it.</p>
<p>This is a hobby project. It's not a policing tool, it's not a product, and it's
definitely not claiming to solve crime. It's just me being curious about what's
sitting in a publicly available dataset and seeing how far you can push it with
some Python and a bit of patience.</p>
]]>
      </content:encoded>
      <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>