<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">
  <channel>
    <title>jonnonz.com</title>
    <link>https://jonnonz.com/</link>
    <atom:link href="https://jonnonz.com/feed.xml" rel="self" type="application/rss+xml"/>
    <description>Mirror of jonno.nz — John Gregoriadis</description>
    <lastBuildDate>Mon, 04 May 2026 10:11:07 GMT</lastBuildDate>
    <language>en</language>
    <generator>Lume v2.4.2</generator>
    <author>
      <name>John Gregoriadis</name>
      <uri>https://jonnonz.com</uri>
    </author>
    <item>
      <title>Product market fit isn't a stage, it's a gauntlet</title>
      <link>https://jonnonz.com/posts/product-market-fit-is-a-gauntlet/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/product-market-fit-is-a-gauntlet/</guid>
      <description>
        PMF gets sold as a milestone. It's actually a gauntlet that bends founders, breaks teams, and quietly poisons technical decisions. Most of the damage isn't from missing PMF — it's from how you behave while looking for it.
      </description>
      <content:encoded>
        <![CDATA[<p>Product market fit gets sold as a milestone. Find it and you're off to the
races. That's the bit nobody who's been through it actually believes.</p>
<p>PMF is a gauntlet. It eats teams, it bends founders, and it quietly poisons the
technical decisions you're proud of at the time. Most of the damage I've watched
done to good companies wasn't from missing PMF. It was from how they behaved
while they were looking for it.</p>
<p><img src="https://jonnonz.com/img/posts/pmf-gauntlet/gauntlet-loop.svg" alt="The PMF gauntlet loop — Vision feeding Hypothesis, Ship, Market signal, and Adapt, with three drag forces (rigid roadmap, comprehension debt, over-scaled architecture) pulling on the loop."></p>
<h2>It's not for everyone, and that's fine</h2>
<p>There's a particular kind of person who does well in the pre-PMF phase. High
tolerance for ambiguity, low need for closure. The deck never feels finished,
the metric you're chasing changes every six weeks, and the answer to &quot;what are
we doing in three months&quot; is &quot;depends.&quot;</p>
<p>Plenty of really good operators just cannot function in that environment. That's
not a character flaw — it's a stage mismatch. Some people thrive at zero to one.
Some thrive at one to ten. Almost no one thrives at both, and the industry
pretending otherwise has cost a lot of careers and a lot of sanity.</p>
<p>Naming this honestly so people can self-select is one of the kindest things a
founder can do. You're not letting someone down by saying &quot;this stage probably
isn't for you, but the next one will be.&quot; You're saving them eighteen months of
feeling broken.</p>
<h2>The variables you don't control will eat you alive</h2>
<p>Timing, market, economy, what your one big regulator decides on a Tuesday — none
of that is yours. You can have a sharp thesis and a great team and ship
something nobody buys, because something three layers above you shifted while
you were heads-down.</p>
<p>The only protection I've found against this is a vision the team is genuinely
bought into. Not the slide. Not the wall poster. The actual reason you all got
out of bed this morning. When the macro turns and the metric you were proud of
last quarter goes sideways, that vision is what stops the org from devouring
itself.</p>
<p>I've watched startups where the thesis was right but the timing was a year early
lose half their team in three months because nobody could explain why they were
still doing what they were doing. It wasn't a strategy problem. It was an
alignment problem dressed up as a strategy problem.</p>
<p>This is also where founders take the most damage personally. You can do
everything well and still get hit by something nobody could have predicted. If
your sense of self is tied to PMF being a verdict on you, that breaks people.
The ones I've seen come through it healthy treated PMF like weather they were
navigating, not a test they were passing.</p>
<h2>Agility is the actual moat at this stage</h2>
<p>Your moat isn't the product. It isn't the tech. It definitely isn't the brand.
Your moat is how fast the org can spot a shift in TAM or target market and
translate it into a product move.</p>
<p>Days, ideally. Weeks if you have to. Not quarters.</p>
<p>This is where the technical decisions made in the name of &quot;scaling&quot; quietly
cripple you. The microservices you split out before you needed to. The custom
infrastructure someone stood up because their last job had it. The platform
abstractions that mean a small UI change touches four repos. Each of those felt
disciplined at the time. Each is now a tax on the only thing you actually have —
speed.</p>
<p>Andreessen wrote the
<a href="https://pmarchive.com/guide_to_startups_part4.html">original PMF essay</a> almost
twenty years ago, and the line that's aged best is the bit about doing whatever
it takes — changing people, rewriting the product, moving markets. That's not a
license to be chaotic. It's a reminder that the org needs to be physically
capable of those moves. If your architecture, process, or contracts make
rewriting the product a six-month project, you've already lost the gauntlet
whether you know it yet or not.</p>
<p>I've got a strong opinion on this one: when in doubt, build it boring. Boring is
fast to change.</p>
<h2>The first cohort is a dance, and you have to lead</h2>
<p>The customers who signed up first kept you alive. They also signed up for a
slightly different company than the one you're now trying to become. That gap is
where a lot of startups quietly die.</p>
<p>Keep them too happy and you slow your evolution. Push too hard toward the new
vision and you churn the cohort that's funding your runway. The actual job is to
do both at once, which is why I sometimes call it internal schizophrenia. You're
a different company to them than you are to yourselves, and that's not a bug —
that's the mode you're operating in.</p>
<p>The skill is being honest with the early cohort about where you're going without
selling them something they didn't buy. The art is using their feedback to
sharpen the bigger vision rather than letting yourself be pulled back into being
their bespoke vendor. The dance is doing both of those without your team
thinking you've gone off-piste, because the gap between &quot;what we're shipping
today&quot; and &quot;where we're going&quot; looks weird from the inside.</p>
<h2>Where PMF teams quietly self-sabotage</h2>
<p>Three patterns I keep seeing.</p>
<p><img src="https://jonnonz.com/img/posts/pmf-gauntlet/discipline-vs-fragility.svg" alt="Discipline vs fragility — three patterns where what looks like discipline (microservices on day one, rigid quarterly planning, founder comprehension debt) becomes fragility (can't pivot, defending old assumptions, context never reaches the team)."></p>
<p>Engineering over-scales the architecture. The team builds for the company they
want to be in two years instead of the company they need to be this quarter. By
the time PMF actually shows up, the org can't move. Worse, the engineers feel
busy and capable the whole time it's happening, which is why it's so hard to
stop. Nobody is asking to slow down — everyone is shipping.</p>
<p>Product holds the roadmap too tightly. The roadmap <em>is</em> the experiment at this
stage. Treating it like a commitment is a category error. The product teams I've
seen do this well treat the roadmap like a hypothesis with version numbers —
last month's was wrong, this month's is less wrong, and that's how it's supposed
to feel. The ones who don't end up defending decisions they made when they knew
less.</p>
<p>Founder comprehension debt builds up faster than anyone notices. The founder is
heads-down on signal — every customer call, every dropped deal, every weird
pattern in the data lands in their head and gets metabolised on the spot. The
team is two beats behind, working from last week's mental model. Each individual
delay feels minor. The cumulative gap is the thing that kills decisions.</p>
<p>Each of these looks like discipline from the inside. Each of these is fragility
wearing discipline's clothes.</p>
<h2>AI changes the moat conversation, not the gauntlet</h2>
<p>Moats in the AI space are shifting quarter by quarter right now. Feature moats
have basically collapsed — anything you can describe in a screenshot can be
cloned in a weekend with the current generation of tools. What's
<a href="https://www.latitudemedia.com/news/in-the-age-of-ai-can-startups-still-build-a-moat/">actually defensible has moved</a>
toward proprietary data, deeply embedded workflows, distribution, trust, and
regulatory positioning.</p>
<p>For a founder in the PMF gauntlet that means the playbook is unreliable in a way
it wasn't five years ago. You can't just lift what worked for the last cohort of
SaaS winners and run it. You have to reason from first principles about where
your actual edge is going to come from over the next eighteen months, and place
chips accordingly.</p>
<p>The gauntlet itself hasn't changed. The chips you're placing have. That's harder
than it sounds, because most of us were trained in an era when the moat
conversation was settled.</p>
<h2>The unglamorous work that decides whether you survive</h2>
<p>The thing nobody tells you is that the founder or leader's most important job
during the PMF stretch isn't strategy or product or sales. It's getting the
context that's in your head out into the org while everyone is running at five
thousand miles an hour.</p>
<p>You will not feel like you have time for this. You won't. You have to carve it
out anyway. The teams I've seen come through PMF intact are the ones whose
leaders forced themselves to stop, write things down, repeat themselves more
than felt necessary, and trust that the slowdown was the work.</p>
<p>The teams that don't make it tend to look back and realise everyone was busy and
nobody knew why.</p>
]]>
      </content:encoded>
      <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Change management</title>
      <link>https://jonnonz.com/posts/change-management/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/change-management/</guid>
      <description>On the personal version of change management — the long, weird middle bit between who you were and who you're becoming.</description>
      <content:encoded>
        <![CDATA[<p>There's a whole business discipline called change management. Frameworks,
certifications, consultancies, the lot. Every big company has someone running it
during a restructure or a tech migration. Nobody runs it for you when your life
turns over.</p>
<p>Which is strange, because the personal version is the harder problem — and right
now, more people are facing it than at any point in recent memory.</p>
<p>More than
<a href="https://www.cnbc.com/2026/04/24/20k-job-cuts-at-meta-microsoft-raise-concern-of-ai-labor-crisis-.html">92,000 tech workers have been laid off in 2026 alone</a>,
bringing the total close to 900,000 since 2020. Meta cut 8,000 jobs last week.
Microsoft offered buyouts to 7% of its US workforce — the first time in its
51-year history. Oracle has started cuts that could reach 30,000 by year end.
Closer to home, Xero, Sharesies, Spark, One NZ and Eroad have all run their own
rounds. AI is the headline reason, but the impact lands the same regardless of
the cause: hundreds of thousands of people closing a laptop and discovering
their working identity has just been deleted.</p>
<p>That's a lot of people being handed a forced version of personal change
management without ever signing up for the course.</p>
<p>The business framing has the right insight buried in it.
<a href="https://wmbridges.com/about/what-is-transition/">William Bridges</a> made a
distinction in the 90s that most people miss: change is external, transition is
internal. Change is the new org chart, the redundancy email, the merger.
Transition is what happens inside people's heads while all that is going on.
Change can happen overnight. Transition takes as long as it takes.</p>
<p>Personal change management is just transition without the org chart.</p>
<p>I've been through a few years of it now. Not one big event — more like a slow
stack of endings, some chosen, some not. Companies, relationships, versions of
myself I'd been building for a decade. The kind of stretch where you don't
really notice you're changing until you look up one day and the old you is gone.</p>
<p>That's the part nobody warns you about. Real change isn't transformation. It's a
controlled demolition followed by a slow rebuild, with a long, weird middle bit
where neither the old you nor the new you is really there.</p>
<h2>Something has to die</h2>
<p>The thing that goes is usually the organising self. Whatever the old you was
arranged around — a fear, a need for approval, a story about who you had to be,
an ambition that was really a wound. When that goes, the structure it was
holding up collapses. That's the death. It's real.</p>
<p>What survives is everything that wasn't load-bearing on the old arrangement.
Your humour, your curiosity, the way you actually see people, the things you
genuinely care about. Those don't die because they weren't propping anything up.
They were just you, underneath.</p>
<p>The disorienting part is feeling like a stranger to yourself and entirely
continuous, at the same time. Both are true. The continuous parts are
continuous. The organising self is gone. You're in between.</p>
<h2>The middle is the work</h2>
<p>Bridges calls this the neutral zone. The old reality has gone, the new one isn't
there yet. He says it's the hardest phase to manage, and most organisations rush
through it because it looks unproductive. People do the same thing to
themselves.</p>
<p>The temptation is to build a new identity fast, because the empty space is
uncomfortable. Don't. Whatever you grab in a hurry will be made of whatever was
lying around — which usually means the old patterns sneak back in wearing new
clothes. Workaholism becomes &quot;building my legacy&quot;. Approval-seeking becomes
&quot;being of service&quot;. Avoidance becomes &quot;protecting my peace&quot;. Same machine, new
paint.</p>
<p>The test is always: is this coming from fear or from truth? You'll know. The
body knows before the mind does. Pay attention to the part of you that goes
quiet around certain people, certain projects, certain decisions. That's the
signal.</p>
<h2>Fearlessness is a side effect</h2>
<p>You don't get to fearless by trying. You get there by going through enough
endings that the bluff stops working.</p>
<p>Fear runs on a specific con: <em>if this thing happens, you won't survive it</em>. Not
literally die — but the you that exists now won't continue. You'll be broken,
finished, unrecognisable. The con works as long as it's untested. Then the thing
happens, and you go through it, and on the other side you notice you're still
here. Different, scarred, but continuous. The fear was lying about its hand.</p>
<p>After that, fear can still show up — it doesn't leave — but it can't run the
same con. You've seen the card it was holding. Next time it says <em>you won't
survive this</em>, some quiet part of you knows: I already did.</p>
<p>That's not the absence of fear. It's knowing you can act from what's true even
with the fear in the room.</p>
<h2>What's on the other side is ordinary</h2>
<p>Here's the bit that surprised me. Once the demolition is done and the rebuild
starts, what comes back isn't impressive. It's just real. Less reactive. Less
noise. Less performance. You stop needing to be seen a particular way, partly
because you've watched a few of those selves die and you don't trust the next
one enough to stake everything on it.</p>
<p>The goal of change management — the personal kind — isn't to become someone
admirable. It's to become someone who's the same alone as in public. Someone who
does the next true thing without announcing it. Most of the depth of this stuff
lives in the texture of regular days. How you handle a boring Tuesday. Whether
you rest when you're tired or push through to prove something to nobody.</p>
<p>Business change management has all this in it, and most people read it as a
project manager's manual. It's also a personal one. Endings, neutral zone, new
beginnings. Same shape, different blast radius.</p>
<p>The seeds that grow through the demolition are the ones worth tending. The rest
sorts itself out.</p>
]]>
      </content:encoded>
      <pubDate>Sat, 25 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Three Ways to Look at Time</title>
      <link>https://jonnonz.com/posts/three-ways-to-look-at-time/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/three-ways-to-look-at-time/</guid>
      <description>
        ST-ResNet decomposes crime patterns into three temporal scales and models each one separately. Clever architecture, but does it actually help with only four years of NZ data?
      </description>
      <content:encoded>
        <![CDATA[<p>ST-ResNet's core insight is that not all history is created equal.</p>
<p>When you're predicting crime in Auckland next month, three different kinds of
past information matter. What happened in the last couple of months: the recent
trend. What happened at the same time last year: the seasonal pattern. And
what's been happening over the longer term: whether crime is generally rising or
falling in an area.</p>
<p>ConvLSTM treats all of this as one continuous sequence and hopes the network
figures out which parts matter. <a href="https://arxiv.org/abs/1610.00081">ST-ResNet</a>
takes a more opinionated approach. It separates these three temporal scales
explicitly and gives each one its own dedicated neural network branch.</p>
<p>The original paper by Zhang et al. was about predicting crowd flows in Beijing.
People move through cities in patterns that look a lot like crime patterns:
daily rhythms, weekly cycles, long-term trends. The architecture
<a href="https://www.nature.com/articles/s41598-025-24559-7">translates well to crime data</a>,
with some modifications.</p>
<h2>Closeness, period, trend</h2>
<p>The three branches each look at different slices of history:</p>
<p><strong>Closeness</strong> captures what's been happening recently. For our monthly data,
this means the last 3 months. If South Auckland has been trending upward over
the last quarter, the closeness branch sees that momentum.</p>
<p><strong>Period</strong> captures seasonal patterns. It looks at the same month in previous
years. So to predict January 2026, it pulls in January 2025 and January 2024.
The assumption is that crime has an annual rhythm, and the same month tends to
look similar year to year.</p>
<p><strong>Trend</strong> captures longer-term shifts. It uses quarterly averages from further
back: broad strokes of whether an area is seeing more or less crime over time.
This is the slowest-moving signal.</p>
<p>Each branch independently processes its temporal slice through a stack of
residual convolutional blocks, then a learned fusion layer combines the three
outputs:</p>
<pre><code>prediction = W_c ∘ closeness + W_p ∘ period + W_t ∘ trend + bias
</code></pre>
<p>Where <code>∘</code> is element-wise multiplication and <code>W_c</code>, <code>W_p</code>, and <code>W_t</code> are
learned weight maps that vary by grid cell. This is a nice touch. It means the
model can decide that the CBD's crime is mostly driven by recent trends
(closeness), while a residential suburb might be more seasonal (period).
Different areas get different temporal recipes.</p>
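<p>In PyTorch terms, the fusion layer is tiny. A minimal sketch of the idea
(the grid shape and 32-channel width match the implementation spec further
down; the exact parameterisation is my assumption):</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class CellwiseFusion(nn.Module):
    """Per-cell weighted sum of the three branch outputs."""
    def __init__(self, channels=32, height=77, width=59):
        super().__init__()
        # One learned weight map per branch, broadcast across channels,
        # so every grid cell gets its own temporal recipe
        self.w_c = nn.Parameter(torch.ones(1, 1, height, width))
        self.w_p = nn.Parameter(torch.ones(1, 1, height, width))
        self.w_t = nn.Parameter(torch.ones(1, 1, height, width))
        self.bias = nn.Parameter(torch.zeros(1, channels, height, width))

    def forward(self, closeness, period, trend):
        # Element-wise (Hadamard) weighting, as in the formula above
        return (self.w_c * closeness + self.w_p * period
                + self.w_t * trend + self.bias)
</code></pre>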
<h2>Residual blocks</h2>
<p>Each branch uses residual convolutional units, the building blocks that made
<a href="https://arxiv.org/abs/1512.03385">ResNet</a> so successful in image recognition.</p>
<p>The key idea: instead of learning the full output at each layer, the network
learns the <em>residual</em>, the difference between input and output. The identity
shortcut connection means gradients flow cleanly through the network during
training, which lets you stack more layers without the signal degrading.</p>
<pre><code>ResUnit(X) = ReLU(Conv(ReLU(Conv(X))) + X)
</code></pre>
<p>That <code>+ X</code> at the end is the skip connection. If the layer has nothing useful to
add, it can learn weights near zero and just pass the input through. This makes
deeper networks stable, which matters when you're trying to learn spatial
features at multiple scales.</p>
<p>For our grid, I use 4 residual units per branch. Each unit has two 3×3
convolutional layers with 32 filters. That's deep enough to capture spatial
relationships across several kilometres without being so deep that the model
overfits on 36 months of training data.</p>
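<p>Each unit is a few lines of PyTorch. A sketch, assuming the two-conv,
32-filter, 3x3 setup just described:</p>
<pre><code class="language-python">import torch.nn as nn
import torch.nn.functional as F

class ResUnit(nn.Module):
    """One residual unit: two 3x3 convs plus the identity shortcut."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv2(F.relu(self.conv1(x)))
        return F.relu(out + x)  # "+ x" is the skip connection
</code></pre>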
<h2>The NZ-specific problem</h2>
<p>Here's where theory meets reality, and it gets a bit awkward.</p>
<p>ST-ResNet was designed for dense, high-frequency data. The Beijing crowd flow
paper used 30-minute intervals over months of data: thousands of timesteps. The
crime papers that report strong results typically use daily data over several
years.</p>
<p>We have 48 monthly timesteps. Total. The period branch (which looks at the
same month in previous years) has at most 3 examples of any given calendar
month (the same month in 2022, 2023 and 2024 when predicting into 2025/2026).
The trend branch is working with quarterly averages from a four-year window.
It's not a lot of temporal data for an architecture that's specifically
designed to decompose temporal patterns.</p>
<p>I had a feeling this would be the bottleneck, and it was.</p>
<h2>Implementation</h2>
<pre><code>Closeness branch:
  Input: last 3 months (3 × 6 channels = 18 input channels)
  → 4 ResUnits (32 filters, 3×3 kernels)
  → Output: 32 channels

Period branch:
  Input: same month from 2 prior years (2 × 6 = 12 input channels)
  → 4 ResUnits (32 filters, 3×3 kernels)
  → Output: 32 channels

Trend branch:
  Input: 2 quarterly averages (2 × 6 = 12 input channels)
  → 4 ResUnits (32 filters, 3×3 kernels)
  → Output: 32 channels

Fusion:
  → Learned weighted sum across branches
  → Conv2d(32, 6, 1×1) → 6 crime type predictions
</code></pre>
<p>Total parameters: roughly 180k. Slightly smaller than the ConvLSTM, which is
fine. ST-ResNet's power is supposed to come from the temporal decomposition, not
from model size.</p>
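<p>Putting the branches together, reusing the <code>ResUnit</code> and <code>CellwiseFusion</code>
sketches from above. The single-conv stem on each branch is my assumption;
everything else follows the spec:</p>
<pre><code class="language-python">class STResNet(nn.Module):
    """Three branches of ResUnits, fused per grid cell, projected
    to 6 crime types. A sketch, not the project's actual code."""
    def __init__(self, n_units=4, hidden=32, height=77, width=59):
        super().__init__()
        def branch(in_channels):
            stem = [nn.Conv2d(in_channels, hidden, 3, padding=1)]
            units = [ResUnit(hidden) for _ in range(n_units)]
            return nn.Sequential(*stem, *units)
        self.closeness = branch(3 * 6)  # last 3 months x 6 crime types
        self.period = branch(2 * 6)     # same month from 2 prior years
        self.trend = branch(2 * 6)      # 2 quarterly averages
        self.fusion = CellwiseFusion(hidden, height, width)
        self.head = nn.Conv2d(hidden, 6, 1)  # 1x1 channel projection

    def forward(self, xc, xp, xt):
        fused = self.fusion(self.closeness(xc),
                            self.period(xp),
                            self.trend(xt))
        return self.head(fused)
</code></pre>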
<p>Training uses the same setup as ConvLSTM: Adam optimiser, learning rate 1e-4,
MSE loss on <code>log1p</code>-transformed values, early stopping with patience of 15
epochs. On CPU, each run takes about 35 minutes, a bit faster than ConvLSTM
since there's no sequential recurrence to deal with.</p>
<h2>Results</h2>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>Hist. Avg MAE</th>
<th>ConvLSTM MAE</th>
<th>ST-ResNet MAE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.28</td>
<td>1.14</td>
<td>1.18</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.35</td>
<td>0.32</td>
<td>0.33</td>
</tr>
<tr>
<td>Assault</td>
<td>0.20</td>
<td>0.19</td>
<td>0.19</td>
</tr>
<tr>
<td>Robbery</td>
<td>0.04</td>
<td>0.04</td>
<td>0.04</td>
</tr>
<tr>
<td>Sexual</td>
<td>0.03</td>
<td>0.03</td>
<td>0.03</td>
</tr>
<tr>
<td>Harm</td>
<td>0.01</td>
<td>0.01</td>
<td>0.01</td>
</tr>
<tr>
<td><strong>All types</strong></td>
<td><strong>0.39</strong></td>
<td><strong>0.35</strong></td>
<td><strong>0.36</strong></td>
</tr>
</tbody>
</table>
<p>ST-ResNet beats the historical average but doesn't quite match ConvLSTM. The
aggregate MAE of 0.36 is a 7.7% improvement over the baseline, compared to
ConvLSTM's 10.3%.</p>
<p>That's not a terrible result, but it's not what I was hoping for.</p>
<h2>Why ConvLSTM wins here</h2>
<p>When I dug into the learned fusion weights, the story became clear. The
closeness branch dominates. It gets 60–70% of the weight across most grid cells.
The period branch gets 20–25%, and the trend branch barely contributes at
10–15%.</p>
<p>The model is basically saying: &quot;Recent months matter most, seasonal patterns
help a bit, and long-term trends are mostly noise.&quot; That's not a failure of the
architecture. It's a fair assessment of what's in the data.</p>
<p>With only 2–3 examples of each calendar month, the period branch can't reliably
learn seasonal patterns. It's overfitting to individual years rather than
extracting a stable seasonal signal. ConvLSTM handles this better because it
processes the full sequence and implicitly learns seasonality from the
continuous flow of months, without needing to explicitly align calendar periods.</p>
<p>The trend branch suffers even more. Quarterly averages over a four-year window
don't give it much to work with. In the original crowd flow papers with years of
half-hourly data, the trend branch captures genuine long-term shifts in
population movement. Here, it's essentially learning a constant.</p>
<h2>Where ST-ResNet does shine</h2>
<p>Despite losing on aggregate, ST-ResNet has one clear advantage: it's better at
predicting seasonal transitions.</p>
<p>ST-ResNet handles the months where crime shifts gears (the spring uptick in
September/October and the February dip) more gracefully than ConvLSTM does.
The period branch, sparse as its data is, does capture enough of the annual
rhythm to anticipate these transitions a bit earlier.</p>
<p>ConvLSTM tends to lag these transitions by about a month. It needs to &quot;see&quot; the
uptick starting before it predicts continuation. ST-ResNet, by explicitly
looking at last year's same month, can anticipate the shift before it fully
materialises in the recent sequence.</p>
<p>For an operational forecasting tool, that one-month lead time on seasonal
transitions could be valuable. But in our test set metrics, it's a small
advantage that doesn't overcome ST-ResNet's overall weaker performance on
month-to-month dynamics.</p>
<h2>Head to head</h2>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Historical Avg</th>
<th>ConvLSTM</th>
<th>ST-ResNet</th>
</tr>
</thead>
<tbody>
<tr>
<td>Overall MAE</td>
<td>0.39</td>
<td>0.35</td>
<td>0.36</td>
</tr>
<tr>
<td>Theft MAE</td>
<td>1.28</td>
<td>1.14</td>
<td>1.18</td>
</tr>
<tr>
<td>Training time (CPU)</td>
<td>N/A</td>
<td>~40 min</td>
<td>~35 min</td>
</tr>
<tr>
<td>Parameters</td>
<td>0</td>
<td>~200k</td>
<td>~180k</td>
</tr>
<tr>
<td>Seasonal transitions</td>
<td>Poor</td>
<td>Lagging</td>
<td>Better</td>
</tr>
<tr>
<td>Spatial dynamics</td>
<td>None</td>
<td>Good</td>
<td>Good</td>
</tr>
</tbody>
</table>
<p>ConvLSTM is the better model for this specific dataset. Not by a lot. We're
talking about small differences on already-small error values. But consistently
better on the main crime types that have enough signal to matter.</p>
<p>Neither model is a revelation. A 7–10% improvement over &quot;just use the historical
average&quot; is real but modest. Deep learning's strengths (learning complex
nonlinear dynamics from huge datasets) are somewhat wasted on 48 monthly
timesteps over a relatively low-crime city.</p>
<p>If I had daily data instead of monthly, or ten years instead of four, I'd expect
ST-ResNet to close the gap or pull ahead. Its architecture is fundamentally
sound. The temporal decomposition is a genuinely good idea. It's just starved of
the data it needs to shine.</p>
<p>Both models meaningfully beat the baselines. Both learn spatial patterns that
simple averages can't capture. And both are honest about the sparse crime types:
they predict near-zero and move on, which is the right call.</p>
<p>Next up: we'll take these predictions and build something you can actually look
at. A 3D interactive dashboard where you can watch crime patterns evolve across
Auckland over time. The modelling was the hard bit. Making it visual is the fun
bit.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 23 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What an hour of your attention is worth</title>
      <link>https://jonnonz.com/posts/what-an-hour-of-your-attention-is-worth/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/what-an-hour-of-your-attention-is-worth/</guid>
      <description>
        You pay Big Tech about $1,000 a year in attention. Here's how to read the meter — and why building your own is suddenly cheaper than opting out.
      </description>
      <content:encoded>
        <![CDATA[<p>I stood up a working social network for eight mates last weekend. Profile pages,
a shared feed, a photo wall, a jukebox bolted onto a spare domain. It took me a
Saturday, about forty bucks in Claude credits, and exactly zero
product-market-fit meetings.</p>
<p>The same weekend, Meta earned about six bucks off me. Google made ten. LinkedIn,
YouTube, TikTok, X — all quietly billing in the background, none of them sending
a receipt. If you add them all up for the average American, the annual total is
north of $1,000. You just never see it, because no money changes hands and no
invoice arrives.</p>
<p>The clever thing about &quot;free&quot; on the internet isn't that the trade doesn't
exist. It's that it's been designed so you can't see it. No money moves. No
invoice lands. No app shows you the meter ticking as you scroll. The exchange is
real — your attention and your data in, Instagram and Google and LinkedIn out —
but by the time the numbers get tallied, they live in a quarterly earnings
report you'll never read. So the trade feels weightless.</p>
<p>It isn't. You just can't see the price tag.</p>
<p>The strange thing is the price tag has been public the whole time. Every
platform listed on a stock exchange tells you, four times a year, exactly what
you're worth to them. You've just never been shown how to read it — and until
recently, the only practical alternative to reading it was &quot;live in a cabin.&quot;
That part has changed, and it's the part almost nobody is talking about.</p>
<p>The invisibility isn't an accident either. If Meta had to send you a cheque
every month for the money they made off you, you'd treat the relationship very
differently. You'd notice when the amount went up. You'd notice that the
teenager version of the payment looks nothing like the adult version. You'd
wonder why the Auckland cheque was ten times the Jakarta one for the exact same
hour of scrolling. The whole edifice of &quot;free&quot; rests on keeping the accounting
one-sided — they measure you in basis points to three decimal places, you
experience the trade as a vague sense of having lost your afternoon.</p>
<h2>The price tag they're legally required to print</h2>
<p>The number you want is called ARPU — average revenue per user. Every public
platform reports it, because investors demand it. The maths is blunt: take the
company's annual revenue, divide by monthly active users. What comes out is what
the platform earns off the average human who shows up, per year.</p>
<p>For Meta last year the global figure was about $52 per user. For YouTube's
ad-supported side, around $24. For
<a href="https://www.linkedin.com/posts/dshapero_earnings-update-to-close-out-our-2025-fiscal-activity-7361399679256858624-vVg7">LinkedIn it's $15 averaged across all 1.2B members</a>,
but much higher once you strip out the dormant accounts.</p>
<p>These aren't guesses from a watchdog group. They're from the companies
themselves, in the part of the earnings release where the whole purpose is to
convince shareholders each user is worth more than last quarter. The incentive
is to talk the number up, not down.</p>
<p>Whatever ARPU says, the reality on the ground probably isn't lower. If anything,
it's a floor.</p>
<h2>Your annual bill, itemised</h2>
<p>Rough figures, from the companies' own filings:</p>
<ul>
<li><strong>Meta</strong>: ~$52/yr global, ~$320 in the US</li>
<li><strong>Google (all products)</strong>: ~$100/yr globally, ~$500 US —
<a href="https://abc.xyz/investor/">$400B in revenue</a> across ~4B users spanning
Search, Android, YouTube, Cloud and Workspace combined</li>
<li><strong>YouTube ads alone</strong>: ~$24/yr global, ~$80 US</li>
<li><strong>LinkedIn</strong>: $15/yr averaged across all 1.2B members, but ~$57/yr across the
<a href="https://www.linkedin.com/posts/dshapero_earnings-update-to-close-out-our-2025-fiscal-activity-7361399679256858624-vVg7">310M monthly active ones</a></li>
<li><strong>TikTok</strong>: ~$16 global, ~$70 US — doubled in two years</li>
<li><strong>Snapchat, Reddit, Pinterest, X</strong>: all in the $10–30/user/yr range</li>
</ul>
<p>The geographic skew is the part most people miss. Meta's figure in the US is
roughly ten times what it is in Asia-Pacific. Europe sits in the middle at about
$92. Same product, same features, same algorithm — different rate card, because
ad buyers pay more to reach wealthier audiences. You are literally worth more in
Auckland than you are in Jakarta, and your feed is tuned accordingly.</p>
<p><img src="https://jonnonz.com/img/posts/arpu-meta-by-region.svg" alt="Meta ARPU by region — US $320, Europe $92, global $52, Asia-Pacific $32"></p>
<p>The same skew shows up across every ad-funded platform. The US rate card is the
one the rest of the world gets compared to:</p>
<p><img src="https://jonnonz.com/img/posts/arpu-us-vs-global.svg" alt="US vs global ARPU comparison — Google $500/$100, Meta $320/$52, YouTube $80/$24, TikTok $70/$16, LinkedIn $57/$15"></p>
<p>Marketplaces don't fit ARPU cleanly, but the extraction is still there if you
look for it. Uber and Lyft take around 20% of each fare. Airbnb combines host
and guest fees for about 14–16%. DoorDash and Uber Eats take closer to 25%.
Shopify's card take is 2.9% plus 30 cents per transaction. Different mechanism,
same game — a percentage of every transaction, quietly skimmed, never itemised.</p>
<h2>The meter, in dollars per hour</h2>
<p>ARPU is annual. Attention isn't spent in years though — it's spent in hours, in
the little windows between other things. So the honest conversion is to divide.</p>
<p>The average US Meta user burns about 200 hours a year across Facebook and
Instagram. $320 ÷ 200 = roughly $1.60 per hour of your attention. YouTube works
out to about $0.27/hour. TikTok $0.22. Snapchat cheaper still. Do the same sum
on global averages and Meta drops to around 26 cents an hour, YouTube to 8.</p>
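<p>If you want to sanity-check the sums, the conversion is one line of
arithmetic. The figures are the rounded US numbers quoted above, not official
rates:</p>
<pre><code class="language-python"># Back-of-the-envelope: annual ARPU divided by annual hours of attention.
meta_arpu_us = 320      # dollars per user per year, from Meta's filings
hours_per_year = 200    # rough US average across FB + Instagram

print(meta_arpu_us / hours_per_year)  # 1.6 dollars per hour of attention
</code></pre>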
<p>Those rates are only what the platform <em>earns</em> this year, mind. They aren't what
your data is ultimately <em>worth</em>. Everything you click and hover and pause on
feeds ad targeting across the wider web, plus — now — AI training corpora. ARPU
is the rent. The equity is bigger, and the equity compounds.</p>
<p>The AI-training bit is genuinely new and worth pausing on. For fifteen years the
data you generated on these platforms powered one thing: better ad targeting on
those same platforms. It was a closed loop. You scrolled, they learned, they
sold the targeting back to advertisers, the advertisers bought your attention
again. Bounded. Weird, but bounded.</p>
<p>That loop isn't bounded anymore. Your posts and comments and DMs are now
training data for models that will be sold, resold, and embedded into every
piece of software you touch for the next decade. The $320 Meta earned off you in
the US last year is a rounding error next to what the underlying corpus is worth
to the next generation of AI products. ARPU doesn't capture any of that. It's
literally last quarter's ad rent, with none of the capital gains on the asset.</p>
<p>Even the rent, laid out per hour, makes one thing obvious: you can see exactly
why every platform is obsessed with &quot;time spent&quot; as a north-star metric. If one
extra hour a week on Facebook is worth ~$83 a year per US user, multiplied
across three billion users, the maths for why the feed never stops scrolling is
not mysterious. The feed is a meter. Keeping it running is the business. Every
&quot;new feature&quot; that shows up in your settings — reels, shorts, a nudge to open
the app on your commute — is a hand on that meter.</p>
<p>Once you see it that way, a lot of product decisions stop looking like product
decisions.</p>
<h2>Run your own numbers</h2>
<p>The point of making the numbers this concrete is that you can plug in your own
usage and see what you personally throw into the machine each year. Drag the
sliders for how much time goes into each platform and watch the ledger tally up.
Rates are global averages.</p>
<section class="ledger" id="ledger" aria-label="The Ledger calculator"><style>
.ledger{--lb:#141f2e;--lb2:#1a2637;--li:#e4e9ee;--ld:#bfc8d2;--lm:#6a7d92;--la:#d4a853;--lbd:rgba(255,255,255,.08);--lbs:rgba(255,255,255,.16);max-width:38rem;background:var(--lb);border:1px solid var(--lbs);border-radius:.35rem;padding:1.15rem 1.25rem 1rem;margin:2rem auto;color:var(--ld);font-family:text,'Roboto',-apple-system,sans-serif;font-size:.88rem;line-height:1.5;text-align:left}
.ledger *,.ledger *::before,.ledger *::after{box-sizing:border-box}
.ledger h3,.ledger h4{font-family:inherit;margin:0;padding:0;color:inherit;font-weight:inherit;font-size:inherit;letter-spacing:0;line-height:1.2}
.ledger h3::before,.ledger h4::before{content:none}
.ledger p{margin:0;padding:0}
.ledger input{font:inherit;color:inherit;background:transparent;border:none;outline:none}
.l-head{display:flex;justify-content:space-between;align-items:baseline;gap:.75rem;padding-bottom:.75rem;margin-bottom:.9rem;border-bottom:1px solid var(--lbs)}
.l-title{font-family:serif,'Fraunces',Georgia,serif;font-weight:400;font-size:1.15rem;letter-spacing:-.015em;color:var(--li)}
.l-tag{font-family:code,'JetBrains Mono',monospace;font-size:.58rem;letter-spacing:.16em;text-transform:uppercase;color:var(--lm)}
.l-total{display:flex;align-items:baseline;justify-content:space-between;gap:1rem;padding:.85rem 1rem;background:#0e1623;border:1px solid var(--lbd);border-radius:.25rem;margin-bottom:1.1rem}
.l-total-amt{font-family:serif,'Fraunces',Georgia,serif;font-weight:400;font-size:1.9rem;line-height:1;color:var(--la);letter-spacing:-.02em;font-variant-numeric:tabular-nums}
.l-total-lab{font-family:code,'JetBrains Mono',monospace;font-size:.58rem;letter-spacing:.18em;text-transform:uppercase;color:var(--lm);text-align:right}
.l-sec{margin-top:1.15rem}
.l-sec:first-of-type{margin-top:0}
.l-row{padding:.9rem 0;border-bottom:1px dashed var(--lbd)}
.l-row:last-child{border-bottom:none}
.l-row-top{display:flex;align-items:baseline;justify-content:space-between;gap:.75rem;margin-bottom:.55rem}
.l-row-n{color:var(--li);font-size:.94rem;line-height:1.25}
.l-row-meta{font-family:code,'JetBrains Mono',monospace;font-size:.6rem;letter-spacing:.05em;color:var(--lm);margin-top:.15rem;display:block}
.l-row-a{font-family:serif,'Fraunces',Georgia,serif;font-size:1.05rem;color:var(--la);text-align:right;font-variant-numeric:tabular-nums;white-space:nowrap;flex-shrink:0}
.l-row-a.z{color:var(--lm)}
.l-row-c{display:flex;align-items:center;gap:.9rem}
.l-row-v{font-family:code,'JetBrains Mono',monospace;font-size:.66rem;letter-spacing:.05em;color:var(--ld);white-space:nowrap;min-width:6rem;text-align:right}
.l-sl{-webkit-appearance:none;appearance:none;flex:1;min-width:0;height:32px;background:transparent;cursor:pointer;padding:0;margin:0;touch-action:manipulation}
.l-sl::-webkit-slider-runnable-track{height:3px;background:var(--lbd);border-radius:2px}
.l-sl::-webkit-slider-thumb{-webkit-appearance:none;appearance:none;width:22px;height:22px;border-radius:50%;background:var(--la);border:3px solid var(--lb);margin-top:-10px;box-shadow:0 0 0 1px var(--la),0 2px 6px rgba(0,0,0,.3);cursor:grab}
.l-sl:active::-webkit-slider-thumb{cursor:grabbing;box-shadow:0 0 0 1px var(--la),0 0 0 6px rgba(212,168,83,.22)}
.l-sl::-moz-range-track{height:3px;background:var(--lbd);border-radius:2px}
.l-sl::-moz-range-thumb{width:22px;height:22px;border-radius:50%;background:var(--la);border:3px solid var(--lb);box-shadow:0 0 0 1px var(--la)}
.l-sl:focus::-webkit-slider-thumb{box-shadow:0 0 0 1px var(--la),0 0 0 6px rgba(212,168,83,.28)}
.l-notes{margin-top:1.25rem;padding-top:.9rem;border-top:1px solid var(--lbs)}
.l-notes h5{font-family:code,'JetBrains Mono',monospace;font-size:.56rem;letter-spacing:.18em;text-transform:uppercase;color:var(--lm);margin:0 0 .55rem;font-weight:400}
.l-notes p{font-family:serif,'Fraunces',Georgia,serif;font-size:.85rem;line-height:1.55;color:var(--ld);margin:0 0 .4rem;text-wrap:pretty}
.l-notes p:last-child{margin-bottom:0}
.l-notes strong{color:var(--li);font-weight:500}
@media (max-width:560px){
.ledger{padding:1rem .9rem;margin:1.5rem auto;font-size:.92rem}
.l-total{flex-direction:column;align-items:flex-start;gap:.25rem;padding:.75rem .9rem}
.l-total-lab{text-align:left}
.l-row{padding:1rem 0}
.l-row-c{gap:.75rem}
.l-row-v{min-width:5rem;font-size:.7rem}
.l-sl::-webkit-slider-thumb{width:26px;height:26px;margin-top:-12px}
.l-sl::-moz-range-thumb{width:26px;height:26px}
}
</style><div class="l-head"><h3 class="l-title">The Ledger</h3><span class="l-tag">global averages · per year</span></div><div class="l-total"><div class="l-total-amt" id="l-total">$0</div><div class="l-total-lab">extracted per year</div></div><div class="l-sec"><div id="l-attn-rows"></div></div><div class="l-notes"><h5>Notes on the method</h5><p><strong>ARPU is rent, not equity.</strong> What a platform earns this year isn't what the underlying data is worth across the wider web and AI training corpora.</p><p><strong>Averages hide heavy users.</strong> Freemium smears free and paying users into one figure. If you're all-in, you're worth more than average.</p><p><strong>Multi-product companies cheat the top line.</strong> Google's per-user number isn't all Search — it's Search plus Android plus YouTube plus Cloud.</p></div><script>(function(){var R={meta:.26,youtube:.08,tiktok:.05,x:.07,reddit:.12,snap:.07,pin:.15,li:.15,gq:.04};
var ATTN=[{id:'meta',name:'Meta (FB / IG / WhatsApp)',rate:'meta',unit:'hrs/day',mult:365,max:6,step:.25},{id:'youtube',name:'YouTube (ad-supported)',rate:'youtube',unit:'hrs/day',mult:365,max:6,step:.25},{id:'tiktok',name:'TikTok',rate:'tiktok',unit:'hrs/day',mult:365,max:6,step:.25},{id:'x',name:'X (Twitter)',rate:'x',unit:'hrs/day',mult:365,max:4,step:.25},{id:'reddit',name:'Reddit',rate:'reddit',unit:'hrs/day',mult:365,max:4,step:.25},{id:'snap',name:'Snapchat',rate:'snap',unit:'hrs/day',mult:365,max:4,step:.25},{id:'pin',name:'Pinterest',rate:'pin',unit:'hrs/day',mult:365,max:4,step:.25},{id:'li',name:'LinkedIn (free)',rate:'li',unit:'hrs/day',mult:365,max:2,step:.1},{id:'gq',name:'Google Search',rate:'gq',unit:'searches/day',mult:365,max:100,step:1}];
var state={attn:{}};
function fmt(n){n=Math.round(n);if(n===0)return'$0';if(n>=1000)return'$'+n.toLocaleString();return'$'+n;}
function recalc(){var tot=0;
ATTN.forEach(function(s){var v=state.attn[s.id]||0;var amt=v*R[s.rate]*s.mult;tot+=amt;var el=document.getElementById('l-a-'+s.id);if(el){el.textContent=fmt(amt);el.classList.toggle('z',amt<1);}var vl=document.getElementById('l-v-'+s.id);if(vl)vl.textContent=v+' '+s.unit;});
document.getElementById('l-total').textContent=fmt(tot);}
function renderAttn(){document.getElementById('l-attn-rows').innerHTML=ATTN.map(function(s){var meta='$'+R[s.rate].toFixed(2)+(s.unit==='searches/day'?' / search':' / hr');return '<div class="l-row"><div class="l-row-top"><div><span class="l-row-n">'+s.name+'</span><span class="l-row-meta">'+meta+'</span></div><span class="l-row-a z" id="l-a-'+s.id+'">$0</span></div><div class="l-row-c"><input type="range" class="l-sl" data-cat="attn" data-svc="'+s.id+'" min="0" max="'+s.max+'" step="'+s.step+'" value="0" aria-label="'+s.name+' '+s.unit+'"><span class="l-row-v" id="l-v-'+s.id+'">0 '+s.unit+'</span></div></div>';}).join('');}
function renderAll(){renderAttn();recalc();}
var root=document.getElementById('ledger');
root.addEventListener('input',function(e){var t=e.target;if(t.classList.contains('l-sl')){state[t.dataset.cat][t.dataset.svc]=parseFloat(t.value)||0;recalc();}});
renderAll();})();</script></section>
<p>The rates come from the earnings-report maths above — global ARPU divided by
average annual hours on the platform.</p>
<h2>The weekend social network</h2>
<p>Once the number has somewhere to sit, it's much harder to ignore.</p>
<p>Most people look at a total over $1,000/yr and go quiet for a second. Not
because any one platform is egregious — on a per-hour basis they really aren't —
but because the aggregate is real, and it's been invisible until now. That's the
first useful thing the exercise does. It makes a choice possible.</p>
<p>The obvious next move is to look at alternatives. Signal instead of WhatsApp.
Kagi or Brave Search instead of Google. Paid Spotify instead of ad-supported
Spotify. Bluesky or Mastodon instead of X. Fastmail instead of Gmail. None are
perfect, and some cost actual money — but once you can price what you're
currently &quot;not paying&quot;, the paid alternative often looks less expensive than it
did five minutes ago. Fastmail at
$5/month stops being a luxury when the honest comparison is &quot;$60/yr vs being the
product for an ad network that paid $500 for me last year.&quot;</p>
<p>That's the defensive move. It's the one everyone talks about, every time one of
these pieces gets written. You switch to the more honest vendor, you feel
slightly better, and the fundamental shape of the market doesn't move.</p>
<p>The more interesting move is what's happened on the <em>build</em> side, and it's the
part almost nobody has internalised yet.</p>
<p>Standing up a social app used to take a small team months. You needed a backend
engineer, a frontend engineer, a designer, probably a DevOps person, and a spare
three months. That was the real moat — not the network effects, not the
algorithm, but the sheer human-hours required to put a working thing on the
internet. That's why the only viable answer for twenty years was to build
something big enough to run ads against. Small social didn't exist because small
social couldn't pay the salaries.</p>
<p>With Claude Code, Cursor, v0, and Lovable, that equation has quietly inverted. A
profile page, a shared feed, a wall for photos, maybe a jukebox, a chat wall — a
MySpace-sized thing for you and a dozen friends, on a domain you own, with none
of it feeding anyone's ad platform — is a weekend. I know because I just did it.
Not as some Silicon Valley startup trying to replace Facebook. As a Saturday
project for eight mates, on a domain that cost twelve bucks, running on a box
that costs ten a month.</p>
<p>The bill of materials is embarrassingly short. A boring Postgres. A boring
Next.js app. Auth via magic link. Storage for photos. An LLM for the fiddly bits
nobody wants to write from scratch. All of it plumbed together in an afternoon
of prompting, an evening of cleanup, and a Sunday of adding the jukebox because
my mate Hamish wouldn't stop asking.</p>
<p>It is not good software. It is good <em>enough</em> software for eight humans who know
each other.</p>
<p>That qualifier is the whole thing. Facebook has to be good software at planet
scale because Facebook is selling ad impressions at planet scale. A group of
eight doesn't need p99 latency and a content moderation policy. A group of eight
needs a place to put photos from the weekend where the photos don't end up
training someone's image model in twelve months' time. Those are very different
engineering problems, and the second one is much, much easier than the first.</p>
<p>A lot of things genuinely don't work on the weekend version. There's no
recommendation algorithm. There's no real search. The feed is
reverse-chronological and that's it. When someone posts something at 3am nobody
sees it until the morning. There's no cleverness about which photos get surfaced
or which memories get resurrected. If you go on holiday for two weeks, you come
back to a feed that's exactly what your eight mates posted, in the order they
posted it.</p>
<p>That sounds like a limitation until you notice the thing it is not doing is
optimising for your engagement. Reverse-chronological across eight friends is
not a meter. It's a wall. You check it, you see what's there, you leave. There's
no reason for the software to try to keep you around because there's nobody
paying the software to keep you around. That inversion — from meter to wall — is
the entire point.</p>
<p>The thing that would have been a VC round in 2015 is now a side quest you finish
before the roast is in the oven. The tools genuinely got that much better in the
last eighteen months. We just haven't updated our intuitions yet about what that
means.</p>
<p>What it means, specifically, is that the ad-supported social network is no
longer the only technically viable answer. For twenty years it was. That was the
constraint the whole &quot;free web&quot; was built around. The constraint is gone, and
nobody has sent the memo.</p>
<p>The cheapest social network in 2026 is the one you and seven mates build on a
Saturday afternoon. It doesn't scale. It doesn't need to. It costs less than a
month of Netflix, produces no ad revenue for anyone, and feeds no one's training
set. You own the domain. You own the data. You own the product decisions — which
in practice means there are no product decisions, because nobody is trying to
squeeze another hour out of anyone's week.</p>
<p>None of this replaces the platforms, to be clear. You still need Gmail for the
recruiter, LinkedIn for the job hunt, YouTube for the tutorial, WhatsApp for the
group chat your family refuses to leave. The ad-supported internet isn't going
anywhere and I'm not pretending it is. What's changed is that it's no longer the
only game in town. For the circle of people you actually care about — the eight
mates, the cousins, the old uni flat — you don't have to hand them over to the
ad machine anymore. You can build them a room of their own, and the tools to
build that room have become trivial in a way we haven't fully absorbed yet.</p>
<p>The meter's been running your whole life. You just got the tools to turn it off.</p>
]]>
      </content:encoded>
      <pubDate>Tue, 21 Apr 2026 12:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Teaching a Neural Network to Watch Crime Like Video</title>
      <link>https://jonnonz.com/posts/teaching-a-neural-network-to-watch-crime-like-video/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/teaching-a-neural-network-to-watch-crime-like-video/</guid>
      <description>
        ConvLSTM was built for weather radar. Turns out predicting crime on a grid is basically the same problem. Here's how it works and what it learned.
      </description>
      <content:encoded>
        <![CDATA[<p>ConvLSTM was invented to predict rainstorms.</p>
<p>Specifically,
<a href="https://arxiv.org/abs/1506.04214">Shi et al. at the Hong Kong Observatory</a>
needed to forecast radar echo maps: 2D grids of rainfall intensity that evolve
over time. They had sequences of spatial images and wanted to predict the next
frames. Sound familiar?</p>
<p>That's exactly what we built in Part 3. Crime on a 500m grid, one frame per
month, six channels for crime types. The Auckland crime tensor is structurally
identical to a weather radar sequence. Same dimensionality, same prediction
task, just a very different domain.</p>
<h2>Why not regular LSTM?</h2>
<p>Standard LSTM networks are fantastic at learning sequences. They're the backbone
of a lot of time-series forecasting. But they have a fundamental problem with
spatial data: they need flat vectors as input.</p>
<p>To feed our 77×59 grid into a regular LSTM, we'd have to flatten it into a
vector of 4,543 values per crime type. That's 27,258 values per timestep across
all six channels. The network would process this as a sequence of big flat
vectors, with no concept that cell (10, 5) is <em>next to</em> cell (10, 6).</p>
<p>All the spatial structure (the fact that crime clusters, that hotspots have
neighbourhoods, that the CBD is a contiguous area) gets thrown away. The model
would have to rediscover spatial relationships from scratch, purely from
correlations in the flattened vector. With only 36 training months, that's not
happening.</p>
<h2>The convolutional trick</h2>
<p>ConvLSTM's insight is elegant. Take the standard LSTM equations (the input gate,
forget gate, output gate, cell state update) and replace every matrix
multiplication with a convolution operation.</p>
<p>In a regular LSTM:</p>
<pre><code>input_gate = sigmoid(W_xi * x_t + W_hi * h_{t-1} + b_i)
</code></pre>
<p>In ConvLSTM:</p>
<pre><code>input_gate = sigmoid(W_xi ∗ X_t + W_hi ∗ H_{t-1} + b_i)
</code></pre>
<p>That <code>∗</code> is a convolution instead of a matrix multiply. <code>X_t</code> is the full 2D
grid at time <code>t</code>, and <code>H_{t-1}</code> is the previous hidden state, also a 2D grid.
The convolution kernel slides across the spatial dimensions, so each cell's gate
values depend on its local neighbourhood.</p>
<p>This means the network naturally learns that a spike in cell (10, 5) might
affect predictions for cell (10, 6). Spatial proximity is baked into the
architecture. It doesn't need to learn it from data.</p>
<p>The kernel size controls how much spatial context each cell sees. A 3×3 kernel
means each cell looks at its immediate 8 neighbours. Stack multiple ConvLSTM
layers and the effective receptive field grows. Deeper layers can capture
relationships between cells that are several kilometres apart.</p>
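<p>The whole trick fits in about a dozen lines of PyTorch. A minimal sketch of
a ConvLSTM cell in the standard formulation (not necessarily line-for-line
what I ran):</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM gates computed with convolutions instead of matrix multiplies."""
    def __init__(self, in_channels, hidden_channels, kernel=3):
        super().__init__()
        # One conv computes all four gates (input, forget, output, candidate)
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state  # hidden and cell state, both (batch, hidden, H, W)
        i, f, o, g = torch.chunk(
            self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
</code></pre>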
<h2>Architecture choices</h2>
<p>Here's what I settled on after a fair bit of experimentation (which on CPU means
&quot;a lot of patient waiting&quot;):</p>
<pre><code>Input: (batch, 6, 6, 77, 59), 6 months, 6 crime types, 77×59 grid
  ↓
ConvLSTM2d(in=6, hidden=32, kernel=3×3, padding=1)
  ↓
BatchNorm2d
  ↓
ConvLSTM2d(in=32, hidden=32, kernel=3×3, padding=1)
  ↓
BatchNorm2d
  ↓
Conv2d(in=32, out=6, kernel=1×1), project to 6 crime type channels
  ↓
Output: (batch, 6, 77, 59), next month prediction
</code></pre>
<p>Two ConvLSTM layers with 32 hidden channels each. The 3×3 kernel gives each cell
a neighbourhood view, and stacking two layers means the effective receptive
field covers about 1–1.5 km. Enough to capture the spatial extent of most crime
hotspots.</p>
<p>Why only 32 hidden channels? This is where the CPU constraint actually helps. A
bigger model would be tempting with a GPU, but on a Ryzen 5 we need to keep it
tight. 32 channels gives us about 200k trainable parameters: small enough to
train in under an hour, large enough to learn meaningful spatial-temporal
patterns.</p>
<p>The 1×1 convolution at the end is a channel projection. It maps the 32 learned
features back to 6 crime type predictions.</p>
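<p>Stacked together, reusing the <code>ConvLSTMCell</code> sketch from above, the network
looks something like this. The unrolling loop over the input months is the
part a block diagram hides:</p>
<pre><code class="language-python">class ConvLSTMNet(nn.Module):
    """Two stacked ConvLSTM layers with BatchNorm, then a 1x1 projection."""
    def __init__(self, in_channels=6, hidden=32, out_channels=6):
        super().__init__()
        self.hidden = hidden
        self.cell1 = ConvLSTMCell(in_channels, hidden)
        self.cell2 = ConvLSTMCell(hidden, hidden)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.bn2 = nn.BatchNorm2d(hidden)
        self.head = nn.Conv2d(hidden, out_channels, 1)

    def forward(self, seq):  # seq: (batch, 6 months, 6 types, 77, 59)
        b, t, _, height, width = seq.shape
        h1 = c1 = seq.new_zeros(b, self.hidden, height, width)
        h2 = c2 = seq.new_zeros(b, self.hidden, height, width)
        for step in range(t):  # unroll over the 6-month window
            h1, c1 = self.cell1(seq[:, step], (h1, c1))
            h2, c2 = self.cell2(self.bn1(h1), (h2, c2))
        return self.head(self.bn2(h2))  # next-month prediction
</code></pre>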
<h2>Sequence length: six months</h2>
<p>The lookback window is six months. The model sees January through June and
predicts July. Then February through July to predict August. And so on.</p>
<p>Six months captures one half of the seasonal cycle, which turned out to be the
sweet spot. Shorter sequences (3 months) missed seasonal context. Longer
sequences (12 months) didn't improve results, likely because the model doesn't
have enough data to learn year-long dependencies with only 36 training months
total.</p>
<p>The training set gives us 30 sequences (months 1–6 predict 7, months 2–7 predict
8, all the way to months 30–35 predict 36). That's not a lot. Every sequence
counts.</p>
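<p>Building those sequences is a plain sliding window. A hypothetical helper
(the names are mine, not the project's):</p>
<pre><code class="language-python">import numpy as np

def make_sequences(tensor, lookback=6):
    # tensor: (months, 6 crime types, 77, 59)
    xs, ys = [], []
    for t in range(len(tensor) - lookback):
        xs.append(tensor[t:t + lookback])  # months t .. t+5 as input
        ys.append(tensor[t + lookback])    # month t+6 as the target
    return np.stack(xs), np.stack(ys)

# 36 training months gives 30 (input, target) pairs
</code></pre>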
<h2>Training details</h2>
<pre><code class="language-python">optimiser = Adam(lr=1e-4)
loss = MSE        # on log1p-transformed values
batch_size = 4    # small because each sequence is large
epochs = 150      # early stopping, patience=15
</code></pre>
<p>The <code>log1p</code> transformation from Part 3 is critical here. Raw crime counts range
from 0 to 50+. After <code>log1p</code>, the range compresses to 0–4. Without this, the
loss function would be dominated by the handful of high-count CBD cells, and the
model would essentially ignore the rest of the grid.</p>
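<p>A quick numeric check of what <code>log1p</code> does to the range:</p>
<pre><code class="language-python">import numpy as np

counts = np.array([0, 1, 5, 50])
print(np.log1p(counts))  # [0.  0.693  1.792  3.932] (rounded)
# the 0-50 raw range compresses to roughly 0-4
</code></pre>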
<p>Training on CPU takes about 40 minutes per run. Not fast, but manageable. I
could typically fit in 3–4 experimental runs per evening, which meant progress
was slow but steady. Each run I'd tweak one thing (kernel size, hidden channels,
learning rate) and compare validation MAE.</p>
<p>Early stopping triggers around epoch 80–100 in most runs. The model converges
relatively quickly, which makes sense given the small dataset and architecture.</p>
<h2>Results</h2>
<p>So how does ConvLSTM stack up against the baselines from Part 5?</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>Hist. Avg MAE</th>
<th>ConvLSTM MAE</th>
<th>Improvement</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.28</td>
<td>1.14</td>
<td>10.9%</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.35</td>
<td>0.32</td>
<td>8.6%</td>
</tr>
<tr>
<td>Assault</td>
<td>0.20</td>
<td>0.19</td>
<td>5.0%</td>
</tr>
<tr>
<td>Robbery</td>
<td>0.04</td>
<td>0.04</td>
<td>2.5%</td>
</tr>
<tr>
<td>Sexual</td>
<td>0.03</td>
<td>0.03</td>
<td>~0%</td>
</tr>
<tr>
<td>Harm</td>
<td>0.01</td>
<td>0.01</td>
<td>~0%</td>
</tr>
<tr>
<td><strong>All types</strong></td>
<td><strong>0.39</strong></td>
<td><strong>0.35</strong></td>
<td><strong>10.3%</strong></td>
</tr>
</tbody>
</table>
<p>A 10% improvement on the aggregate MAE. Not earth-shattering, but real.</p>
<p>Theft gets the biggest lift because there's the most signal to work with. The
model genuinely learns spatial dynamics that the historical average can't
capture. When a cluster of cells in South Auckland trends upward over several
months, ConvLSTM picks up on that momentum and adjusts its predictions
accordingly.</p>
<p>Burglary sees a decent improvement too, likely driven by the spatial correlation
with theft that we spotted in the EDA.</p>
<p>For the sparse crime types (robbery, sexual offences, harm) ConvLSTM basically
learns to predict near-zero, same as the baseline. There simply isn't enough
signal at 500m monthly resolution for these types. The model is honest about
what it doesn't know, which I actually respect.</p>
<h2>Where it shines and where it doesn't</h2>
<p>The improvement isn't uniform across the grid. ConvLSTM does best in the
transition zones: cells on the edges of established hotspots where crime counts
fluctuate month to month. It learns that these boundary cells tend to follow the
trend of their neighbours, which is exactly the kind of spatial-temporal pattern
it was designed to capture.</p>
<p>In the stable hotspot cores (the CBD, Manukau) the model performs about the same
as the baseline. Those cells are consistently high, and the historical average
already captures that well.</p>
<p>Where it properly struggles is with sudden spikes in normally quiet areas. A
cell that's been near-zero for months and then gets 5 thefts in one month: the
model doesn't see that coming. Neither does any other model, to be fair. Those
events are closer to random noise than learnable signal.</p>
<h2>Putting it in perspective</h2>
<p>A 10% MAE improvement is meaningful but modest.
<a href="https://arxiv.org/pdf/2502.07465v1">Recent ConvLSTM crime prediction papers</a>
report larger gains, but they typically work with much more data: years of daily
records across cities with higher crime density. Our setup is tougher. Monthly
resolution limits temporal signal, Auckland is relatively low-crime by global
standards, and we only have four years.</p>
<p>The model is also running on CPU with a deliberately small architecture. A
bigger model on a GPU might squeeze out more performance. But the point of this
project was always to see how far you can push it with modest resources, and a
10% beat over simple baselines feels like a real result.</p>
<p>The question now is whether ST-ResNet's different approach to temporal modelling
can do better. ConvLSTM processes time as one continuous sequence. ST-ResNet
breaks it into three separate temporal scales: closeness, period, and trend.
With a seasonal dataset like crime, that decomposition might be exactly what's
needed.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Open-Source Agent That Teaches Claude Code Your Architecture</title>
      <link>https://jonnonz.com/posts/open-source-agent-that-teaches-claude-code-your-architecture/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/open-source-agent-that-teaches-claude-code-your-architecture/</guid>
      <description>
        AI made software cheap to build but not cheap to scale. Evolutionary architecture with domain-aware AI agents is the missing piece.
      </description>
      <content:encoded>
        <![CDATA[<p>AI has made building software cheap. A solo founder with Claude Code or Cursor
can ship an MVP in a weekend that would've taken a small team a month two years
ago. I've watched this happen across the NZ startup scene. Ideas that used to
die in the &quot;can we afford to build it&quot; phase now get built over a long weekend.</p>
<p>This is mostly great. Velocity is what startups need. The cost of testing an
idea is now close to zero, and businesses prioritise speed accordingly.</p>
<p>The catch shows up when the idea works.</p>
<p>AI builds for <em>right now</em>. It optimises for the current prompt, the current
file, the current feature. It doesn't think about what happens when your billing
service needs to handle 10x the volume, or when your email notifications need to
move from inline calls to a queue. It doesn't plan for the evolutionary pressure
your system will face once it has users.</p>
<p>That's the gap I've been thinking about, and it's what led me to build
<a href="https://github.com/jonnonz1/domain-agents">domain-agents</a>.</p>
<h2>Give the tools their credit</h2>
<p>I want to be fair to the current generation of AI coding assistants. They're not
stupid about finding code.</p>
<p>Claude Code runs an agentic search loop (grep, glob, file reads) iterating
through your codebase to find what's relevant. Boris Cherny (who created Claude
Code) <a href="https://x.com/bcherny/status/2017824286489383315">has said</a> they tried
RAG with a local vector database early on and dropped it because agentic search
outperformed it. Cursor takes a different approach: it
<a href="https://read.engineerscodex.com/p/how-cursor-indexes-codebases-fast">chunks your codebase, generates embeddings</a>,
and stores them for semantic search so you can find code by concept rather than
keyword. Copilot combines semantic indexing with LSP-powered reference tracing
from VS Code.</p>
<p>The search works. If you ask Claude Code to find your billing service, it'll
find it. Ask Cursor for authentication logic and the embeddings will surface it
even if the code never uses the word &quot;authentication.&quot;</p>
<p>None of them understand the architecture those files live in.</p>
<p>All the information needed to understand domain relationships sits in the code:
import graphs, interface signatures, dependency patterns. These tools don't
extract or structure it that way. They find files one at a time. They don't map
out that your billing service depends on the email service, that
<code>BillingService</code> is consumed by two other domains, or that changing its
interface is a cross-domain event. The information is in the codebase. Nobody's
pulling it together.</p>
<p>And every session starts from zero. The AI learned your architecture yesterday
and forgot it today.</p>
<h2>Evolutionary architecture for the AI era</h2>
<p>My thesis: cheap AI-built MVPs plus expensive scaling problems point toward
evolutionary architecture with domain-based boundaries.</p>
<p>The idea isn't new. The reason it matters now is.</p>
<p>In an evolutionary architecture, you focus on clean interfaces between business
domains. Your email service exposes a contract like
<code>sendEmail(to, subject, body)</code>, and the rest of the system calls that interface.
Behind the interface, the implementation evolves through stages as your scaling
needs change:</p>
<pre><code class="language-mermaid">graph LR
    A[&quot;Inline\n(direct call)&quot;] --&gt; B[&quot;Async\n(fire &amp; forget)&quot;]
    B --&gt; C[&quot;Queued\n(BullMQ/SQS)&quot;]
    C --&gt; D[&quot;Separate Service&quot;]
    D --&gt; E[&quot;Distributed&quot;]
</code></pre>
<p>Day one, <code>sendEmail</code> is a function that calls Resend directly. Inline,
synchronous, dead simple. When traffic picks up, you drop the <code>await</code> and let it
run in the background. Later, you introduce BullMQ or SQS. Eventually it becomes
its own service. The interface stays put. Only the implementation behind it
changes.</p>
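<p>A minimal sketch of that contract stability, in Python for brevity (the
client and queue objects are illustrative stand-ins):</p>
<pre><code class="language-python"># Day one: the contract is sendEmail(to, subject, body), implemented inline.
class InlineEmailService:
    def __init__(self, client):
        self.client = client      # e.g. a Resend SDK client (stand-in)

    def send_email(self, to, subject, body):
        self.client.send(to=to, subject=subject, body=body)

# A later stage: identical contract, but the work lands on a queue and a
# worker does the sending. Nothing that calls send_email has to change.
class QueuedEmailService:
    def __init__(self, queue):
        self.queue = queue        # e.g. a BullMQ/SQS-style queue (stand-in)

    def send_email(self, to, subject, body):
        self.queue.enqueue({'to': to, 'subject': subject, 'body': body})
</code></pre>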
<p>This is the kind of evolution AI coding assistants are terrible at planning for.
They'll inline that email call because it works <em>right now</em>. They have no
concept of where this domain sits on its scaling trajectory.</p>
<h2>Where domain-agents fits in</h2>
<p><a href="https://github.com/jonnonz1/domain-agents">domain-agents</a> is a CLI tool that
runs static analysis on TypeScript codebases, discovers business domains, and
generates AI agent context files for Claude Code and Cursor.</p>
<pre><code class="language-bash">domain-agents discover .    # Analyse codebase → proposal.json
domain-agents init .        # Generate agents/*.md + AGENTS.md
domain-agents hooks claude  # Wire into Claude Code (rules + MCP server)
domain-agents hooks cursor  # Wire into Cursor (.mdc rules)
</code></pre>
<p>After setup, opening <code>src/billing/invoice.ts</code> in Claude Code loads the billing
domain agent into context. The AI now knows: billing depends on email (coupling
score 0.23), exposes <code>BillingService</code> consumed by 2 other domains, sits at the
&quot;inline&quot; scaling stage with a path toward async queuing, and has 3 tracked tech
debt items.</p>
<p>It plans work accordingly. The context was loaded before the first prompt, no
search required.</p>
<h2>Five signals, not one</h2>
<p>The discovery engine runs 5 analysis passes because no single signal identifies
business domains on its own.</p>
<p>Directory structure works for greenfield projects (<code>src/auth/</code>, <code>src/billing/</code>)
but fails for legacy MVC apps. Import graphs capture coupling but not business
intent. Package dependencies hint at external integrations but miss internal
domains.</p>
<pre><code class="language-mermaid">graph TD
    S[&quot;Structure Analysis&quot;] --&gt; O[&quot;Signal Orchestrator&quot;]
    I[&quot;Import Graph\n(TS Compiler API)&quot;] --&gt; O
    N[&quot;Naming Patterns&quot;] --&gt; O
    D[&quot;Dependency Mapping\n(npm → domain hints)&quot;] --&gt; O
    IF[&quot;Interface Detection&quot;] --&gt; O
    O --&gt; M[&quot;Merge Pipeline&quot;]
    M --&gt; R[&quot;Domain Proposal&quot;]
</code></pre>
<p><strong>Structure</strong> detects whether the codebase is feature-organised,
layer-organised, mixed, or flat. <strong>Import graph</strong> uses the TypeScript Compiler
API to parse each <code>.ts</code> file, resolve imports, and build a directed edge graph.
Type-only imports get weighted at 0.3 because they're a weaker coupling signal
than value imports. <strong>Naming patterns</strong> extract domain prefixes:
<code>auth.controller.ts</code> → &quot;auth&quot;. <strong>Dependency mapping</strong> maps npm packages to
domain hints (<code>stripe</code> → billing, <code>@sendgrid/mail</code> → email). <strong>Interface
detection</strong> identifies files imported across domain boundaries and calculates
coupling scores between domain pairs.</p>
<p>Each pass produces weighted signals. The orchestrator combines them with
confidence scoring: average signal strength plus a bonus for signal count,
capped at 0.99. Layer-organised codebases get a 0.85 multiplier because they're
harder to discover.</p>
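<p>As a sketch, that scoring reduces to a few lines. The 0.99 cap and the 0.85
multiplier are the real values from above; the size of the per-signal bonus is
illustrative:</p>
<pre><code class="language-python">def domain_confidence(signal_strengths, layer_organised):
    # Average signal strength, a bonus per corroborating signal, capped
    # at 0.99, penalised for layer-organised codebases.
    base = sum(signal_strengths) / len(signal_strengths)
    bonus = 0.05 * (len(signal_strengths) - 1)    # assumed bonus size
    confidence = min(base + bonus, 0.99)
    if layer_organised:
        confidence *= 0.85
    return confidence

print(domain_confidence([0.8, 0.7, 0.9], layer_organised=True))  # ~0.76
</code></pre>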
<h2>Most real codebases aren't clean</h2>
<p>Feature-organised codebases are easy. The directory structure <em>is</em> the domain.
But most real codebases look like this:</p>
<pre><code>src/
  controllers/
    auth.controller.ts
    billing.controller.ts
  services/
    auth.service.ts
    billing.service.ts
  models/
    invoice.model.ts
    user.model.ts
</code></pre>
<p>Here <code>auth.controller.ts</code>, <code>auth.service.ts</code>, and <code>auth.routes.ts</code> all belong to
the &quot;auth&quot; domain despite living in three different directories. domain-agents
uses naming pattern extraction cross-referenced with import graph cohesion to
cluster these. The <code>auth.*</code> files form a tight import cluster, which confirms
the naming signal.</p>
<h2>Merging is the hard bit</h2>
<p>Raw signals produce too many small, overlapping clusters. The orchestrator runs
a multi-phase normalisation pipeline.</p>
<p>Plurals merge: <code>journals</code> + <code>journal</code> → whichever has more files. Compound names
consolidate: <code>bank-balance</code> + <code>bank-statement</code> + <code>bank-transaction</code> →
<code>bank-accounts</code> (the largest cluster). Small clusters merge into their strongest
import target, but only if they have a dominant dependency: more than 40% of
imports from one target, and that target is at least 2x larger. This prevents
cascading, where A merges into B, B gets bigger and attracts C, C pulls in D.</p>
<p>Files that import from 3+ domains get moved to &quot;unassigned.&quot; These are coupling
hotspots: middleware, orchestrators, shared handlers. Assigning them to one
domain would mislead the AI, so the tool surfaces them for a human decision.
That's the right call for architectural boundaries.</p>
<p>The E2E test suite validates the complete pipeline against 3 fixture codebases
(feature-organised, layer-organised, mixed). Current benchmark: 100% activation
accuracy across all 3 patterns and all 3 activation levels (domain assignment,
glob matching, MCP lookup).</p>
<h2>Auto-activation, not search</h2>
<p>The integration into Claude Code and Cursor uses glob-based rule activation, the
native mechanism both tools already support.</p>
<p>Each domain gets a rule file with glob patterns in the frontmatter:</p>
<pre><code class="language-yaml">---
description: billing domain
globs:
  - src/billing/**
  - '**/billing.*'
  - '**/billing-*'
---
</code></pre>
<p>When Claude Code opens a file matching those globs, the domain context loads. No
MCP call, no background process, zero runtime overhead.</p>
<p>An <a href="https://modelcontextprotocol.io/">MCP server</a> complements the rules with 4
on-demand tools: <code>domain_lookup(file)</code>, <code>domain_context(name)</code>,
<code>domain_files(name)</code>, and <code>list_domains()</code>. A SessionStart hook prints a domain
summary at the start of every Claude Code session, so the AI has system-level
awareness from the first prompt.</p>
<h2>Agents as a team model</h2>
<p>This is the bit I'm most keen on long-term.</p>
<p>At Vend and Xero, teams owned domains. The billing team owned billing, the
integrations team owned integrations. Ownership meant knowing the interfaces,
the coupling points, the tech debt, and where things were headed. That knowledge
lived in people's heads and got passed on through code reviews, architecture
chats, and tribal memory.</p>
<p>Domain-specific AI agents formalise that same ownership model. An email agent
loads the email domain's interface contract, its coupling to other domains, its
current scaling stage, and its tracked tech debt. A billing agent carries the
same for billing. They work within their boundaries and flag when a change
crosses a domain line.</p>
<p>You don't need this from day one. Early on, one agent covers multiple areas. As
the product grows, agents split along the same lines engineering teams split: by
business domain. The operator (that's you) resolves conflicts where agents
disagree, the same way an engineering manager resolves cross-team dependencies.</p>
<p>The analogy is rough, but it captures how AI-assisted development scales past a
single person staring at a single context window.</p>
]]>
      </content:encoded>
      <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Claude Code Can Now Spawn Copies of Itself in Isolated VMs</title>
      <link>https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/</guid>
      <description>
        Wiring up an MCP server so Claude Code can spawn isolated VMs, the SSE-to-Streamable-HTTP migration that broke everything, streaming architecture, and what productionising this would actually look like.
      </description>
      <content:encoded>
        <![CDATA[<p>The moment this project went from &quot;fun weekend hack&quot; to something I actually use
every day was when I got the MCP server working. Claude Code on my laptop sends
a prompt to the orchestrator sitting under my desk, which boots a VM, runs
Claude Code inside it with full permissions, and streams the results back.
Claude delegating work to Claude.</p>
<p>It's a weird feeling watching it happen. You're in a conversation with Claude,
it decides a task needs isolation, calls the MCP tool, and a few seconds later
you can see a fresh VM spinning up in the dashboard. Like having an intern who
can clone themselves.</p>
<p><a href="https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/">Part 1</a>
covered why I built this.
<a href="https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/">Part 2</a> was the
guts of it — rootfs, networking, the guest agent. This last post is about the
interfaces, the streaming pipeline, and what I'd change if this needed to work
for more than just me.</p>
<h2>The MCP server</h2>
<p>The orchestrator exposes an <a href="https://modelcontextprotocol.io/">MCP</a> server with
eight tools. The main one is <code>run_task</code> — give it a prompt, optional config
(RAM, vCPUs, timeout, max turns), and it blocks until the task completes.
Returns the task ID, status, exit code, result files, cost, and the output
truncated to 4000 characters.</p>
<p>Two transport modes. Stdio for when Claude Code runs on the same machine:</p>
<pre><code class="language-json">{
  &quot;mcpServers&quot;: {
    &quot;orchestrator&quot;: {
      &quot;command&quot;: &quot;sudo&quot;,
      &quot;args&quot;: [&quot;/opt/firecracker/bin/orchestrator&quot;, &quot;mcp&quot;]
    }
  }
}
</code></pre>
<p>And Streamable HTTP for network access — Claude Code on any machine on the LAN
can use it:</p>
<pre><code class="language-json">{
  &quot;mcpServers&quot;: {
    &quot;orchestrator&quot;: {
      &quot;type&quot;: &quot;http&quot;,
      &quot;url&quot;: &quot;http://192.168.50.44:8081/mcp&quot;
    }
  }
}
</code></pre>
<p>The other tools are for poking around: <code>get_task_status</code>, <code>list_vms</code>,
<code>exec_in_vm</code> (run a command in a still-running VM), <code>read_vm_file</code>,
<code>destroy_vm</code>, <code>list_task_files</code>, and <code>get_task_file</code>. That last one is smart
about content types — text files come back as plain text, images come back as
base64 MCP image content so Claude can actually see screenshots the VM took.</p>
<pre><code class="language-go">if isImageMime(mimeType) {
    encoded := base64.StdEncoding.EncodeToString(data)
    return mcplib.NewToolResultImage(&quot;Screenshot from task &quot;+taskID, encoded, mimeType), nil
}
</code></pre>
<h2>The migration that broke everything</h2>
<p>This bit is worth telling because it'll save someone else the debugging time.</p>
<p>I originally built the MCP server with
<a href="https://github.com/mark3labs/mcp-go">mcp-go</a> v0.45.0 using SSE (Server-Sent
Events) transport. Worked great. Then Claude Code updated to expect the newer
Streamable HTTP transport, and everything fell over.</p>
<p>The failure mode was confusing. Claude Code would try to connect, attempt OAuth
discovery against the <code>/sse</code> endpoint, get a 404 (my server doesn't do OAuth),
and fail with:</p>
<pre><code>Error: HTTP 404: Invalid OAuth error response: SyntaxError: JSON Parse error: Unable to parse JSON string
</code></pre>
<p>Nothing in my code changed. The client just started speaking a different
protocol.</p>
<p>The fix was small once I understood it:</p>
<pre><code class="language-go">// Before — SSE transport
func (s *Server) ServeSSE(addr string) error {
    sseServer := server.NewSSEServer(s.mcpServer,
        server.WithBaseURL(&quot;http://&quot;+addr),
    )
    return sseServer.Start(addr)
}

// After — Streamable HTTP transport
func (s *Server) ServeHTTP(addr string) error {
    httpServer := server.NewStreamableHTTPServer(s.mcpServer,
        server.WithEndpointPath(&quot;/mcp&quot;),
        server.WithStateLess(true),
    )
    return httpServer.Start(addr)
}
</code></pre>
<p>Bumped mcp-go from v0.45.0 to v0.46.0, swapped the server constructor, changed
the endpoint from <code>/sse</code> to <code>/mcp</code>, updated the client config. Done. But
diagnosing &quot;OAuth error on a server that doesn't do OAuth&quot; — that bit took a
while.</p>
<h2>Output streaming</h2>
<p>When Claude Code runs inside a VM, its output needs to get from stdout inside
the guest all the way to a browser tab on my laptop. The path:</p>
<pre><code class="language-mermaid">flowchart LR
    A[&quot;Claude Code stdout&quot;] --&gt; B[&quot;Guest agent\nvsock frame&quot;]
    B --&gt; C[&quot;Host vsock client\nExecStream&quot;]
    C --&gt; D[&quot;Task runner\nOnEvent callback&quot;]
    D --&gt; E[&quot;Stream Hub\nring buffer + fan-out&quot;]
    E --&gt; F[&quot;WebSocket\nto browser&quot;]
</code></pre>
<p>The stream hub (<code>internal/stream/hub.go</code>) is a per-task pub/sub system. Each
task gets a stream with a 1000-event ring buffer. When a WebSocket client
connects, it gets all the buffered history first, then live events as they
arrive.</p>
<p>Fan-out is non-blocking:</p>
<pre><code class="language-go">for ch := range s.subscribers {
    select {
    case ch &lt;- event:
    default:
        // Subscriber is slow, drop the event
    }
}
</code></pre>
<p>A slow WebSocket client can't block the task runner. If the browser can't keep
up, it misses events. In practice this never happens because the bottleneck is
always Claude thinking, not the network.</p>
<h2>The web dashboard</h2>
<p>The React frontend is compiled to static files and embedded into the Go binary:</p>
<pre><code class="language-go">//go:embed all:web-dist
var webDistEmbed embed.FS
</code></pre>
<p>Single binary deployment. No nginx, no separate frontend server, no CORS
headaches in production. The API server falls through to <code>index.html</code> for
unknown paths, which gives you SPA client-side routing.</p>
<p>The most interesting page is the task detail view. Claude Code's
<code>--output-format stream-json</code> spits out one JSON object per line — thinking
blocks, text responses, tool calls, tool results, cost summaries. The dashboard
parses these into coloured blocks:</p>
<ul>
<li>Purple for thinking (Claude's internal reasoning)</li>
<li>Blue for text responses</li>
<li>Orange for tool calls (shows the tool name and input)</li>
<li>Grey for tool results (truncated to 2000 chars — some of these are enormous)</li>
<li>Green for the final result with cost</li>
</ul>
<p>A <code>useWebSocket</code> hook connects when the task is running and disconnects when
it's done. Green pulsing dot for live streaming. Auto-scroll to the bottom as
events arrive. Image files in the results get inline previews pointing at the
API's file download endpoint — so when Claude takes a screenshot inside the VM,
you see it immediately.</p>
<p>Dark theme. Orange accents. Obviously.</p>
<h2>What productionising looks like</h2>
<p>This runs on one box with no auth. It's a home lab project. But the gap between
&quot;works for me&quot; and &quot;works for a small team&quot; isn't as big as it looks.</p>
<p><strong>Persistence</strong> is the most obvious one. The task store is an in-memory Go map.
Orchestrator restarts? All task history gone. VM metadata already persists to
disk and gets recovered on startup — tasks should too. SQLite or bbolt, a few
hours of work. I just haven't needed it because I don't restart the process very
often.</p>
<p><strong>Task queue with backpressure.</strong> Right now tasks fire as goroutines with no
concurrency limit. Submit 20 tasks on a 30GB machine where each VM wants 2GB and
the last few fail because there's no memory left. A buffered channel or
semaphore would fix this. You could get fancier with priority queues — quick
code generation tasks ahead of long research tasks — but even a simple
concurrency cap would be enough.</p>
<p><strong>Authentication.</strong> The REST API and MCP server accept requests from anyone who
can reach the port. For a team: API keys at minimum, mTLS if you're serious
about it. The MCP spec supports auth flows now — that'd be the right way to do
it for the MCP endpoint.</p>
<p><strong>The OnEvent callback race.</strong> This one's a latent bug. The task runner's
<code>OnEvent</code> callback is stored on the runner struct, not passed per-task:</p>
<pre><code class="language-go">s.taskRunner.OnEvent = func(id string, event agent.StreamEvent) {
    taskStream.Publish(event)
}
s.taskRunner.Run(context.Background(), t)
</code></pre>
<p>Two simultaneous tasks overwrite each other's callbacks. It works today because
MCP tasks block (one at a time) and the API handler sets up the stream before
the goroutine runs. But it's the kind of thing that works until it doesn't. Fix
is trivial — pass the callback into <code>Run()</code> as a parameter.</p>
<p><strong>Graceful shutdown.</strong> There's no signal handler. Ctrl-C kills the process,
running VMs become orphans. They keep running as Firecracker processes — the
<code>recoverState()</code> function on next startup finds them and starts tracking them
again — but their tasks are lost. A proper signal handler would stop accepting
new tasks, wait for running ones to finish with a timeout, then tear everything
down cleanly.</p>
<p><strong>For real multi-user</strong> you'd want result storage on S3 or R2 instead of local
disk. A web auth layer. Per-user credential vaults so different people's Claude
tokens don't mix. Usage tracking and cost attribution.</p>
<p><strong>What I wouldn't change:</strong> the single-binary deployment, vsock for host-guest
communication, ephemeral VMs as the isolation model, the embedded frontend.
Those are the right calls regardless of scale. The architecture is sound — it's
the operational bits around it that need work.</p>
<p>Most of these are a weekend each. The project is about 3,200 lines of Go and 860
of TypeScript. It's not a big codebase. Adding persistence, auth, and a task
queue would maybe take it to 4,500 lines. Still fits in your head.</p>
<p>For now, it sits under my desk and boots VMs when I ask it to. Claude delegating
to Claude, in complete isolation, on hardware I own. That's enough.</p>
]]>
      </content:encoded>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>OpenHealth – Chat with Apple Health Data, Anywhere</title>
      <link>https://jonnonz.com/posts/openhealth-chat-with-apple-health-data/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/openhealth-chat-with-apple-health-data/</guid>
      <description>
        I built OpenHealth because Claude's and ChatGPT's Apple Health connectors are US-only. Drop your export.zip, get seven markdown files any LLM can read — browser, CLI, or phone-to-desktop over WebRTC.
      </description>
      <content:encoded>
        <![CDATA[<p>For years I've worn an Apple Watch and let my iPhone quietly hoover up my
resting heart rate, HRV, sleep stages, every workout, every nutrition log.
Millions of data points. And for most of that time, when I wanted to actually
<em>ask</em> something about my training — &quot;am I cooked this week?&quot;, &quot;has my recovery
gotten worse since Christmas?&quot; — I'd open ChatGPT and get an answer that was
basically vibes, because it couldn't see any of the data.</p>
<p>So I built <a href="https://github.com/jonnonz1/openhealth">openhealth</a>. It turns your
Apple Health export into seven short markdown files any LLM can read. Drop the
zip in your browser at
<a href="https://openhealth-axd.pages.dev/">openhealth-axd.pages.dev</a>, run the CLI, or
beam the zip straight from your iPhone over WebRTC. Paste the output into Claude
or ChatGPT and start asking the questions you actually wanted to ask.</p>
<p><img src="https://jonnonz.com/img/posts/openhealth/hero.png" alt="openhealth's web app — drop the zip, get seven markdown files, nothing uploaded"></p>
<h2>What's US-only and why that's annoying</h2>
<p>In January, Anthropic
<a href="https://www.macrumors.com/2026/01/22/claude-ai-adds-apple-health-connectivity/">shipped an Apple Health connector</a>
for Claude. OpenAI has one in ChatGPT. Both are US-only — if you're in New
Zealand like me, or the UK, EU, or Switzerland,
<a href="https://context-link.ai/blog/chatgpt-connectors">they're not available</a>. That's
a lot of people locked out of the most natural way to use this data.</p>
<p>And even if you are in the US, you're letting Anthropic or OpenAI decide what
the model reads, how it's framed, and what tier unlocks it. I wanted control
over the whole pipeline — including which LLM I feed it into.</p>
<h2>What I built</h2>
<p>openhealth ships three ways.</p>
<p><strong>A static web app.</strong> Drop <code>export.zip</code>, wait five seconds, download seven
files. The browser does the parse. There's no upload endpoint because there's no
server — the Cloudflare Pages site is static HTML plus a tiny Web Worker. Open
DevTools, watch the Network panel, nothing goes out.</p>
<p><strong>A Bun-compiled CLI.</strong> <code>openhealth ~/export.zip -o ./output</code> gets you seven
markdown files. <code>--bundle</code> concatenates them into one. <code>--clipboard</code> pushes that
bundle straight to your system clipboard so you can paste it into any chat
window. Zero deps beyond <code>saxes</code> for XML and <code>fflate</code> for unzip — even the
argument parsing is <code>node:util parseArgs</code>, not Commander. One binary, put it
wherever.</p>
<p><strong>A phone-to-desktop handoff over WebRTC.</strong> The desktop site renders a QR code.
Point your iPhone camera at it, Safari opens a tiny receiver page, pick the zip,
and it streams directly to your desktop browser over a DataChannel. The only
backend in the whole stack is a ~100-line Cloudflare Worker that relays the
WebRTC handshake — it never sees a byte of your health data.</p>
<p><img src="https://jonnonz.com/img/posts/openhealth/walkthrough.png" alt="Getting the export off your iPhone — six taps, or scan the desktop QR"></p>
<h2>How the parse actually works</h2>
<p>Apple's <code>export.xml</code> is
<a href="https://www.tdda.info/in-defence-of-xml-exporting-and-analysing-apple-health-data">properly huge</a>.
A long-term Watch user can easily have a 500MB–4GB file with millions of rows.
Most XML parsers build a tree in memory, which OOMs before they finish.</p>
<p>openhealth uses <a href="https://github.com/lddubeau/saxes">saxes</a> — a streaming SAX
parser in pure TypeScript. It's isomorphic, so the same parser runs in Bun,
Node, and the browser. I tested it against a synthetic 169MB / 1 million-record
export and it finished in about 5 seconds in Chrome, with the main-thread heap
staying around 5MB because the parse runs in a Web Worker.</p>
<p>The rest of the core is a small pipeline: stream XML, accumulate
per-record-type, roll up into weekly and monthly summaries, run each through a
writer that produces one markdown file. Every writer is snapshot-tested against
byte-for-byte expected output. 85 tests, TDD throughout.</p>
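<p>The streaming pattern itself fits in a few lines with Python's stdlib SAX
parser (openhealth does the equivalent in TypeScript with saxes). Counting
records per type stands in for the real accumulators:</p>
<pre><code class="language-python">import xml.sax

class RecordCounter(xml.sax.ContentHandler):
    # React to each &lt;Record&gt; element as it streams past and keep only
    # running aggregates, so memory stays flat regardless of file size.
    def __init__(self):
        self.totals = {}

    def startElement(self, name, attrs):
        if name == 'Record':
            rtype = attrs.get('type', 'unknown')
            self.totals[rtype] = self.totals.get(rtype, 0) + 1

handler = RecordCounter()
xml.sax.parse('export.xml', handler)   # streams; never builds a tree
print(sorted(handler.totals.items(), key=lambda kv: -kv[1])[:10])
</code></pre>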
<h2>What the seven files are</h2>
<p>Each one is deliberately small and shaped to be LLM-readable:</p>
<ul>
<li><code>health_profile.md</code> — baselines, data sources, long-term averages</li>
<li><code>weekly_summary.md</code> — current week plus a 4-week rolling comparison with
week-over-week deltas</li>
<li><code>workouts.md</code> — detailed log for the last 4 weeks: HR, duration, distance,
energy</li>
<li><code>body_composition.md</code> — weight trend, recent readings, nutrition averages</li>
<li><code>sleep_recovery.md</code> — nightly stages, 8-week averages, HRV, resting HR, SpO2
trends</li>
<li><code>cardio_fitness.md</code> — running log, HR-zone distribution, walking-speed trends</li>
<li><code>prompt.md</code> — a ready-to-paste system prompt that frames the other six as
coaching input</li>
</ul>
<p>Drop one file or all seven, depending on which chat model you're using.</p>
<h2>What it's actually good at</h2>
<p>Feeding real data to an LLM is a different experience from answering its
questions. When Claude can see that my resting HR has crept up 4bpm over the
last fortnight while my HRV has dropped and my training load stayed the same, it
gives a real answer — &quot;you're likely undercooked on recovery this week, here's
what I'd change&quot; — rather than a generic reminder to drink water.</p>
<p>It's especially good if you've got multiple devices in the mix. I've got data
from Apple Watch, the iPhone step counter, a Withings scale, and MyFitnessPal.
The parser picks the highest-trust source per metric — Apple Watch wins over
iPhone for steps, Watch sleep beats AutoSleep which beats Withings,
duplicate-weight entries on the same day get deduped. You feed in one zip and
get one coherent picture.</p>
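<p>The selection logic reduces to a per-metric trust ranking. A sketch with
illustrative rankings (the real table covers more metrics and sources):</p>
<pre><code class="language-python">TRUST = {
    'steps': ['Apple Watch', 'iPhone'],
    'sleep': ['Apple Watch', 'AutoSleep', 'Withings'],
}

def pick_reading(metric, readings):
    # From same-day readings of one metric, keep the highest-trust source.
    rank = {src: i for i, src in enumerate(TRUST.get(metric, []))}
    ranked = [r for r in readings if r['source'] in rank]
    if not ranked:
        return readings[0] if readings else None
    return min(ranked, key=lambda r: rank[r['source']])
</code></pre>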
<p>Ask it about your recovery, your training load, what you might be doing wrong,
how your sleep correlates with your long runs. It'll tell you — and it'll be
right more often than not.</p>
<h2>If privacy matters, go all the way</h2>
<p>openhealth itself never uploads your data. The web app parses in your browser
tab. The CLI runs locally. The WebRTC handoff stays peer-to-peer — the
Cloudflare Worker that relays the handshake never sees a byte of the file. Clone
the repo, diff the build output, and confirm it yourself.</p>
<p>When you paste the seven files into ChatGPT or Claude, <em>they</em> see the data.
That's the trade most people will take for convenience, and it's fine. But if
you don't want to make that trade, you don't have to — run the CLI and pipe the
bundle into a local model:</p>
<pre><code class="language-bash">openhealth ~/export.zip --bundle -o ./out
ollama run llama3 &lt; ./out/openhealth.md
</code></pre>
<p>Ollama, llama.cpp, LM Studio, whatever you run. Your health data never leaves
your laptop. The output is just markdown — it doesn't care what reads it.</p>
<p>That's why the shape is seven files and not an API. You pick what sees them.</p>
<p>I'm not a doctor. Neither is the model. Use this for thinking out loud about
your own training, not diagnosing anything.</p>
<p>MIT, source at
<a href="https://github.com/jonnonz1/openhealth">github.com/jonnonz1/openhealth</a>. Web
app at <a href="https://openhealth-axd.pages.dev/">openhealth-axd.pages.dev</a>. If you've
been sitting on a 200MB <code>export.zip</code> with nothing that'll open it, have a go.</p>
]]>
      </content:encoded>
      <pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The Future of Security Is an Open-Source Model That Detects and Acts on Threats</title>
      <link>https://jonnonz.com/posts/llm-kills-compromised-services-at-3am/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/llm-kills-compromised-services-at-3am/</guid>
      <description>
        Anthropic's Glasswing is for enterprises with six-figure security budgets. The rest of us need open-source security agents that learn our systems and act autonomously.
      </description>
      <content:encoded>
        <![CDATA[<p>Anthropic just dropped <a href="https://www.anthropic.com/glasswing">Project Glasswing</a>
— a big collaborative cybersecurity initiative with a shiny new model called
Claude Mythos Preview that can find zero-day vulnerabilities at scale. Twelve
major tech companies involved. $100M in credits. Found a 27-year-old flaw in
OpenBSD. Impressive stuff.</p>
<p>But let's be real about what's happening here. Anthropic trained a model so
capable at breaking into systems that they decided it was too dangerous to
release publicly. So they wrapped the release in a collaborative security
initiative. The security work is genuinely valuable. But it's also a smart way
to keep control of something they know is too powerful to let loose.</p>
<p>The part that actually matters, though, is who benefits. Glasswing is for the
big players. The companies with security teams, budgets, and the kind of
infrastructure that gets invited to sit at the table with AWS, Microsoft, and
Palo Alto Networks. What about the rest of us? The startups, the small SaaS
shops, the indie developers running production systems on a shoestring?</p>
<p>The internet is a
<a href="https://bigthink.com/books/how-the-dark-forest-theory-helps-us-understand-the-internet/">dark forest</a>.
That's not a metaphor anymore — it's becoming the literal reality. Bots,
scrapers, automated exploit chains, credential stuffing, AI-generated phishing.
A server goes up and within hours it's being scanned, fingerprinted, and probed
by systems that don't sleep. Visibility equals vulnerability. And AI is making
the attackers faster, cheaper, and more autonomous every month.</p>
<p>The
<a href="https://www.isc2.org/insights/2026/04/ai-driven-defense-and-autonomous-attacks">ISC2 put it plainly</a>
— both offence and defence now operate at speeds beyond human intervention. The
threats aren't people sitting at keyboards anymore. They're autonomous systems
running campaigns end-to-end.</p>
<p>So what do we do about it?</p>
<h2>Offensive security — but not the kind you're thinking</h2>
<p>When I say offensive security, I don't mean red-teaming or penetration testing.
I mean giving your systems the ability to fight back.</p>
<p>Picture an LLM that sits across your centralised logs — network traffic,
database queries, user interactions, access patterns — and builds an
understanding of what normal looks like for your system over weeks and months.
Not just pattern matching against known signatures. Actually understanding the
shape of healthy behaviour.</p>
<p>When something breaks the pattern, it doesn't just alert. It acts.</p>
<p>Disable a compromised account. Kill a service that's behaving strangely. Block a
database connection that shouldn't exist. Create an incident with full context
for a human to review. The response is proportional and immediate — not waiting
for someone to check their phone at 3am.</p>
<p>The architecture is pretty straightforward:</p>
<pre><code class="language-mermaid">graph TD
    A[Application Logs] --&gt; D[Secure Isolated Log Store]
    B[Network Traffic] --&gt; D
    C[Database Queries] --&gt; D
    D --&gt; F[Baseline Health Model]
    E[User Activity] --&gt; D
    F --&gt;|Anomaly Detected| G[LLM Analysis]
    G --&gt;|Analyse &amp; Plan| H{Threat Assessment}
    H --&gt;|Low| I[Alert &amp; Log]
    H --&gt;|Medium| J[Restrict &amp; Escalate]
    H --&gt;|High| K[Disable &amp; Isolate]
    I --&gt; L[Human Review]
    J --&gt; L
    K --&gt; L
</code></pre>
<p>The key is that the logging and analysis layer has to be isolated and secured
separately from the systems it's watching. If an attacker can compromise the
thing that's watching them, the whole model falls apart.</p>
<p>In practice that means separate infrastructure with its own auth boundary.
Ingestion is write-only — your application services push logs in but can never
read or modify what's already there. Append-only, immutable. The analysis layer
gets scoped service accounts that can read logs, fire alerts, and pull specific
emergency levers through a narrow API. Nothing else. If a compromised service
tries to reach the log store directly, it hits a wall.</p>
<p>None of this is exotic. Centralised logging, immutable storage, scoped IAM — the
building blocks exist. The hard part is wiring an LLM into that loop with the
right constraints. Enough access to act, not enough to make things worse.</p>
<h2>Adaptive, not rule-based</h2>
<p>Traditional security tooling runs on signatures and static rules. Known bad
patterns, blocklists, threshold alerts. That worked when threats were mostly
human-paced. It doesn't work when you're up against autonomous systems that
adapt faster than you can write rules.</p>
<p>The alternative is a system that learns what normal looks like for <em>your</em>
environment — not a generic baseline, but the actual shape of healthy behaviour
in your specific infrastructure. Traffic patterns, query frequencies, access
timing, user behaviour. Weeks of observation before it starts making decisions.</p>
<p>When something breaks the pattern, the response is proportional. A sudden spike
in unusual API calls might trigger deeper correlation — the system widens its
search, pulls in more signals, lowers its threshold for flagging related
activity. Repeated failed auth attempts from new IPs tighten access controls
automatically. A database connection that shouldn't exist gets killed.</p>
<p>This isn't a static ruleset you configure once and hope covers everything. It's
a system that develops behavioural intuition from running in your environment,
responding to your traffic. The difference matters — static rules are brittle
against novel attacks, while adaptive systems can catch anomalies they've never
seen before.</p>
<p>The baseline isn't magic. It's watching five things:</p>
<ul>
<li><strong>Rate</strong> — how many events per time window. A user who averages 50 API calls
per hour suddenly making 500 is a signal.</li>
<li><strong>Composition</strong> — what's in those events. The same user always hitting
/api/users and /api/orders suddenly hammering /api/admin/export.</li>
<li><strong>Cardinality</strong> — how many unique values. One IP hitting 3 endpoints is
normal. One IP cycling through 200 endpoints in an hour isn't.</li>
<li><strong>Latency</strong> — how fast things happen. Legitimate users pause, think, navigate.
Bots don't.</li>
<li><strong>Novelty</strong> — things the system has never seen. A new endpoint, a new
parameter, a user agent string that doesn't match anything in the training
window.</li>
</ul>
<p>Three layers of detection stack on top of each other. Layer one is simple
thresholds — hard caps that trigger immediately. Layer two is statistical
deviation — standard deviations from the learned baseline. Layer three is
correlation — looking across multiple signals simultaneously. A spike in rate
alone might be fine. A spike in rate plus unusual composition plus new source
IP? That's a pattern.</p>
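<p>Sketched in Python, the three layers stack roughly like this. Every
threshold here is illustrative, not a tuned value:</p>
<pre><code class="language-python">from statistics import mean, stdev

def assess(window, baseline):
    # Layer 1: hard caps that trigger immediately.
    if window['rate'] &gt; baseline['rate_hard_cap']:
        return 'high'

    # Layer 2: statistical deviation from the learned baseline.
    history = baseline['rate_history']
    z = (window['rate'] - mean(history)) / (stdev(history) or 1.0)

    # Layer 3: correlation across signals. One anomaly alone is weak
    # evidence; several together form a pattern.
    corroborating = sum([
        z &gt; 3.0,                                                   # rate
        window['new_endpoints'] &gt; 0,                               # novelty
        window['unique_endpoints'] &gt; baseline['cardinality_p99'],  # cardinality
    ])
    if corroborating &gt;= 2:
        return 'medium'
    return 'low' if corroborating == 1 else 'ok'
</code></pre>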
<h2>Learning to recognise yourself</h2>
<p>A pure anomaly detector would go nuts during deploys. New code paths, changed
response times, config reloads — all of it looks unusual. Same with cron jobs.
Your 3am batch job that hits the database hard every night would trigger alerts
every night.</p>
<p>Tolerance patterns solve this. The system learns to recognise you.</p>
<p>Mark a deploy event, and the system creates a tolerance window — elevated
thresholds for the next 30 minutes. Register a recurring cron job, and the
system expects that exact spike at that exact time. These aren't exceptions you
configure manually. They're patterns the system learns from watching.</p>
<p>After a few weeks, it knows when your weekly cache warm-up runs, when your daily
reports generate, when deploys happen. It stops bothering you about the things
you do on purpose.</p>
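<p>Mechanically, a tolerance window can be as simple as this sketch (the
30-minute deploy window is from above; the multiplier is illustrative):</p>
<pre><code class="language-python">import time

tolerance_windows = []   # (start, end, threshold multiplier)

def mark_deploy(duration_s=1800, multiplier=3.0):
    # A deploy marker opens a 30-minute window of elevated thresholds.
    now = time.time()
    tolerance_windows.append((now, now + duration_s, multiplier))

def effective_threshold(base):
    # Inside any live window the threshold scales up; outside, base applies.
    now = time.time()
    live = [m for (s, e, m) in tolerance_windows if s &lt;= now &lt;= e]
    return base * max(live, default=1.0)
</code></pre>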
<h2>The system gets cheaper over time</h2>
<p>Calling an LLM for every anomaly would be expensive. The trick is building
immune memory.</p>
<p>When the LLM analyses an anomaly and decides it's benign — say, a deploy spike
or a legitimate traffic surge — that verdict gets stored. Next time the same
pattern appears, the system recognises it. No LLM call needed.</p>
<p>This is how your security bill drops over the first few weeks. Early on,
everything is novel. The LLM gets called constantly. A month in, most anomalies
match patterns it's already seen. The LLM only gets called for genuinely new
situations.</p>
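<p>The cache itself is simple; the design decision is what goes into the
pattern signature. A sketch with guessed fields and a stubbed model call:</p>
<pre><code class="language-python">import hashlib, json

verdict_cache = {}   # pattern signature -&gt; cached verdict

def signature(anomaly):
    # Reduce an anomaly to a stable fingerprint so repeats can be matched.
    key = {k: anomaly[k] for k in ('kind', 'service', 'direction')}
    return hashlib.sha256(json.dumps(key, sort_keys=True).encode()).hexdigest()

def call_llm(anomaly):
    return 'benign'          # placeholder for the expensive model call

def analyse(anomaly):
    sig = signature(anomaly)
    if sig in verdict_cache:
        return verdict_cache[sig]   # seen before: no LLM call needed
    verdict = call_llm(anomaly)     # only genuinely novel patterns pay
    verdict_cache[sig] = verdict
    return verdict
</code></pre>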
<p>The more your system runs, the smarter it gets and the less it costs.</p>
<h2>Setup without a PhD</h2>
<p>The hardest part of any security tool is configuration. Getting thresholds
right. Understanding your traffic patterns before you can tell the tool what's
normal.</p>
<p><code>darkforest init</code> flips this. Point it at a log sample — a day's worth of
traffic, a week if you've got it — and Claude reads it. Not just parsing,
actually understanding the shape of your system. It figures out what your
endpoints are, what normal request rates look like, what user agents show up,
where your traffic comes from geographically.</p>
<p>Then it writes your config file for you.</p>
<p>You review it, tweak anything that looks wrong, and you're running. No
spreadsheets. No guesswork about what &quot;normal&quot; means for your specific stack.
The LLM that's going to watch your logs already understands them.</p>
<h2>This has to be open</h2>
<p>Glasswing is cool.
<a href="https://github.com/aliasrobotics/CAI">Open-source frameworks like CAI</a> are
making progress — but mostly on the offensive side, using LLMs for penetration
testing and vulnerability research. On the defensive side, the tooling barely
exists. There's no open-source equivalent for the kind of adaptive monitoring
and response I'm describing here.</p>
<p>The building blocks are around. Centralised logging is a solved problem. Open
standards for security event formats are maturing. Smaller open models are more
than capable of pattern analysis on local infrastructure. What's missing is the
glue — a framework that takes logs in, builds a baseline, detects anomalies, and
can actually respond. Something a small team can deploy without a six-figure
security budget.</p>
<p>The threats don't discriminate by company size. The defences shouldn't either.
This can't be proprietary or locked behind enterprise contracts.</p>
<p>The dark forest doesn't care how big your company is. The bots scanning your
infrastructure don't check your headcount before they attack. If the threats are
going to be this accessible, the defences need to be too.</p>
<p>I'm building this. An open-source security agent — adaptive, autonomous, acts
when something breaks the pattern. Small enough for a startup to run on their
own infrastructure. Centralised logging, open LLMs, scoped response actions. The
pieces are all there. I'm wiring them together now.</p>
<p>For v0.1, one real action working end-to-end: detect anomalous authentication
patterns, call the LLM for analysis, and disable the compromised account via
your identity provider's API. Not just alerting — actually responding while
you're asleep. That's the proof of concept that matches the headline.</p>
<p>I'm actively working on this and looking for early testers. If you want alpha
access when it's ready, or just want to follow along,
<a href="https://jonnonz.com/posts/llm-kills-compromised-services-at-3am/#newsletter">drop your email below</a>. I'll reach out when there's something to
try.</p>
]]>
      </content:encoded>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>I Spent 29 Hours Debugging iptables to Boot VMs in 4 Seconds</title>
      <link>https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/</guid>
      <description>
        Building a Firecracker MicroVM rootfs, wiring up TAP networking, debugging UFW's FORWARD chain, and writing a guest agent with raw vsock syscalls in Go.
      </description>
      <content:encoded>
        <![CDATA[<p>The first time I got a Firecracker VM to boot and respond to a vsock ping from
the host, I sat there grinning like an idiot. Typed a command on my machine, it
reached through a kernel-level socket into a completely separate Linux system
with its own kernel, and got a reply. Under a second.</p>
<p>That was about 30 hours into the project. The previous 29 were mostly fighting
with rootfs images and iptables rules.</p>
<p><a href="https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/">Part 1</a>
covered why I built this — Firecracker MicroVMs for running Claude Code in
full-permission isolation. This post is the actual build. Rootfs, networking,
the guest agent, and the streaming pipeline.</p>
<h2>Building the rootfs</h2>
<p>A Firecracker VM needs two things: an uncompressed Linux kernel (<code>vmlinux</code>, not
<code>bzImage</code> — there's no bootloader) and an ext4 filesystem image to use as the
root disk.</p>
<p>The kernel is straightforward — grab a prebuilt 6.1 LTS vmlinux. The rootfs took
more work.</p>
<p>It's a standard ext4 image with Debian Bookworm, and it needs everything Claude
Code might want: Node.js 24, Python 3.11, Chromium for browser automation, git,
curl, jq, and the full Claude Code CLI installed globally via npm. The image
ends up at about 4GB.</p>
<p>The guest agent — the Go binary that listens for commands from the host — lives
inside the rootfs as a systemd service:</p>
<pre><code class="language-bash">sudo mount /opt/firecracker/rootfs/base-rootfs.ext4 /mnt
sudo cp bin/agent /mnt/usr/local/bin/agent
sudo chmod +x /mnt/usr/local/bin/agent

sudo tee /mnt/etc/systemd/system/agent.service &lt;&lt;'EOF'
[Unit]
Description=Orchestrator Guest Agent
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/agent
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
EOF

sudo chroot /mnt systemctl enable agent.service
sudo umount /mnt
</code></pre>
<p>That <code>RestartSec=1</code> matters. If the agent crashes for any reason, systemd has it
back up in a second. The orchestrator polls vsock every 500ms waiting for the
agent, so even a crash during boot is barely noticeable.</p>
<p>You build this rootfs once, by hand. Every new VM gets a sparse copy of it.</p>
<h2>VM lifecycle</h2>
<p><code>internal/vm/manager.go</code> handles the whole lifecycle. It's sequential with
cleanup at each step — if anything fails, it tears down what it already set up
and returns the error.</p>
<pre><code class="language-mermaid">flowchart TD
    A[&quot;Copy rootfs (sparse)&quot;] --&gt; B[&quot;Mount &amp; inject network config&quot;]
    B --&gt; C[&quot;Create TAP device&quot;]
    C --&gt; D[&quot;Add iptables rules&quot;]
    D --&gt; E[&quot;Setup jailer chroot&quot;]
    E --&gt; F[&quot;Write Firecracker config JSON&quot;]
    F --&gt; G[&quot;Launch via jailer --daemonize&quot;]
    G --&gt; H[&quot;Find PID, save metadata&quot;]
    H --&gt; I[&quot;VM ready — poll vsock&quot;]
</code></pre>
<p>The sparse copy is the first thing that happens:</p>
<pre><code class="language-go">cmd := exec.Command(&quot;cp&quot;, &quot;--sparse=always&quot;, BaseRootfs, vm.RootfsPath)
</code></pre>
<p><code>--sparse=always</code> means zero blocks aren't allocated on disk. A 4GB image might
only use 2GB of actual disk space. Takes under a second on NVMe.</p>
<p>After copying, the rootfs gets mounted and three files are injected: a
systemd-networkd config with a static IP, <code>/etc/resolv.conf</code> for DNS, and
<code>/etc/hostname</code>. Then it's unmounted and copied again into the jailer chroot.</p>
<p>Yeah, that's two copies of the rootfs per VM. The first for network injection,
the second because the jailer expects everything inside its chroot. I could
collapse this into one copy by injecting the network config directly into the
chroot copy, but it's never been a bottleneck — sparse copy of 4GB takes less
time than Firecracker takes to boot. So I left it.</p>
<h2>The jailer</h2>
<p>Firecracker's jailer is a separate binary that creates a chroot, sets up minimal
<code>/dev</code> entries (kvm, net/tun, urandom), and runs the Firecracker process inside
it. The VM config is a JSON file:</p>
<pre><code class="language-go">vmConfig := map[string]interface{}{
    &quot;boot-source&quot;: map[string]interface{}{
        &quot;kernel_image_path&quot;: &quot;/vmlinux&quot;,
        &quot;boot_args&quot;:         &quot;console=ttyS0 reboot=k panic=1 pci=off init=/sbin/init&quot;,
    },
    &quot;drives&quot;: []map[string]interface{}{{
        &quot;drive_id&quot;:       &quot;rootfs&quot;,
        &quot;path_on_host&quot;:   &quot;/rootfs.ext4&quot;,
        &quot;is_root_device&quot;: true,
        &quot;is_read_only&quot;:   false,
    }},
    &quot;machine-config&quot;: map[string]interface{}{
        &quot;vcpu_count&quot;:  vm.VCPUs,
        &quot;mem_size_mib&quot;: vm.RamMB,
    },
    &quot;network-interfaces&quot;: []map[string]interface{}{{
        &quot;iface_id&quot;:      &quot;eth0&quot;,
        &quot;guest_mac&quot;:     &quot;06:00:AC:10:00:02&quot;,
        &quot;host_dev_name&quot;: netCfg.TapDev,
    }},
    &quot;vsock&quot;: map[string]interface{}{
        &quot;guest_cid&quot;: vm.VsockCID,
        &quot;uds_path&quot;:  &quot;/vsock.sock&quot;,
    },
}
</code></pre>
<p><code>pci=off</code> because Firecracker doesn't emulate PCI. Paths are relative to the
jailer chroot. The vsock entry creates a Unix domain socket at <code>/vsock.sock</code>
inside the chroot — that's how the host talks to the guest.</p>
<p>Launch looks like this:</p>
<pre><code class="language-go">cmd := exec.Command(JailerBin,
    &quot;--id&quot;, vm.JailID,
    &quot;--exec-file&quot;, FCBin,
    &quot;--uid&quot;, &quot;0&quot;, &quot;--gid&quot;, &quot;0&quot;,
    &quot;--cgroup-version&quot;, &quot;2&quot;,
    &quot;--daemonize&quot;,
    &quot;--&quot;,
    &quot;--config-file&quot;, &quot;/vm-config.json&quot;,
)
cmd.Run()
</code></pre>
<p>After launch there's a 2-second sleep — Firecracker needs a moment to start —
then the PID is found via <code>pgrep</code> and saved to a metadata file. If the
orchestrator restarts, it reads these metadata files and picks up where it left
off. VMs survive orchestrator crashes.</p>
<h2>Networking</h2>
<p>This is where I burned the most time. Not because the concepts are hard, but
because of one specific bug that had me questioning reality.</p>
<p>Each VM needs internet access for Claude Code to fetch packages, clone repos,
and hit the Anthropic API. The approach: each VM gets a Linux TAP device on the
host, a dedicated <code>/24</code> subnet, and iptables rules for NAT.</p>
<h3>IP allocation</h3>
<p>Subnets are deterministic, derived from the VM name using FNV-1a hashing:</p>
<pre><code class="language-go">func NetSlot(name string) int {
    h := fnv.New32a()
    h.Write([]byte(name))
    return int(h.Sum32()%253) + 1
}
</code></pre>
<p>VM named <code>task-a3bfca80</code> might hash to slot 61, giving it subnet
<code>172.16.61.0/24</code>, guest IP <code>172.16.61.2</code>, TAP IP <code>172.16.61.1</code>. No coordination
needed, no DHCP server, no IP pool to manage. The collision space is 253 slots —
more than enough for 12–13 concurrent VMs.
<h3>TAP devices</h3>
<p>A TAP device is a virtual ethernet interface. Firecracker attaches the guest's
<code>eth0</code> to it.</p>
<pre><code class="language-go">tap := &amp;netlink.Tuntap{
    LinkAttrs: netlink.LinkAttrs{Name: cfg.TapDev},
    Mode:      netlink.TUNTAP_MODE_TAP,
}
netlink.LinkAdd(tap)
addr, _ := netlink.ParseAddr(cfg.TapIP + &quot;/24&quot;)
link, _ := netlink.LinkByName(cfg.TapDev)
netlink.AddrAdd(link, addr)
netlink.LinkSetUp(link)
</code></pre>
<p>TAP names are <code>fc-&lt;vm-name&gt;</code>, truncated to 15 characters because Linux interface
names can't be longer. A fun constraint to discover at runtime.</p>
<h3>The iptables rules</h3>
<p>Three rules per VM:</p>
<pre><code class="language-go">// NAT — rewrite source IP when traffic exits the host
ipt.AppendUnique(&quot;nat&quot;, &quot;POSTROUTING&quot;,
    &quot;-s&quot;, cfg.Subnet, &quot;-o&quot;, cfg.HostIface, &quot;-j&quot;, &quot;MASQUERADE&quot;)

// FORWARD — allow outbound from TAP
ipt.Insert(&quot;filter&quot;, &quot;FORWARD&quot;, 1,
    &quot;-i&quot;, cfg.TapDev, &quot;-o&quot;, cfg.HostIface, &quot;-j&quot;, &quot;ACCEPT&quot;)

// FORWARD — allow established/related inbound
ipt.Insert(&quot;filter&quot;, &quot;FORWARD&quot;, 1,
    &quot;-i&quot;, cfg.HostIface, &quot;-o&quot;, cfg.TapDev,
    &quot;-m&quot;, &quot;state&quot;, &quot;--state&quot;, &quot;RELATED,ESTABLISHED&quot;, &quot;-j&quot;, &quot;ACCEPT&quot;)
</code></pre>
<p>See those <code>Insert</code> calls with position 1? That's the bug fix.</p>
<h3>The UFW bug</h3>
<p>I originally used <code>Append</code> for the FORWARD rules. Traffic from the VM would
leave the host fine (NAT worked), but return traffic got dropped. The VM could
resolve DNS but couldn't complete TCP handshakes. I spent an embarrassing amount
of time staring at <code>tcpdump</code> output before I figured it out.</p>
<p>Ubuntu's UFW adds a blanket <code>DROP</code> rule to the FORWARD chain. If you append your
ACCEPT rules, they land <em>after</em> UFW's DROP. They never match. The packets hit
the DROP rule first and get silently killed.</p>
<p><code>Insert</code> at position 1 puts the rules before UFW's. Return traffic flows, VMs
get internet access, everything works.</p>
<p>The traffic path through a working VM:</p>
<pre><code>Guest (172.16.61.2) → eth0 → TAP (fc-task-xxx) → FORWARD ACCEPT
→ NAT MASQUERADE (rewrite src to host IP) → host interface → internet
→ response → RELATED,ESTABLISHED → TAP → guest eth0
</code></pre>
<p>VMs can't reach each other. Each TAP device is point-to-point on its own <code>/24</code>.
There's no route between subnets.</p>
<h2>The guest agent</h2>
<p><code>cmd/agent/main.go</code> — 420 lines of Go. It's a static binary that starts on boot,
listens on vsock port 9001, and handles five request types: ping, exec,
write_files, read_file, and signal.</p>
<p>The interesting one is streaming exec.</p>
<p>When the orchestrator wants to run Claude Code, it sends an exec request with
<code>stream: true</code>. The agent spawns the command, reads stdout and stderr line by
line, and sends each line back as a framed event over the vsock connection. When
the process exits, it sends an exit event with the exit code.</p>
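<p>The events are easy to picture as types. The real definitions live in
<code>internal/agent/protocol.go</code>, so treat these shapes as assumptions:</p>
<pre><code class="language-go">import &quot;net&quot;

// Assumed event shape: one frame per line of output, one final exit frame.
type ExecEvent struct {
    Type string `json:&quot;type&quot;` // &quot;stdout&quot;, &quot;stderr&quot;, or &quot;exit&quot;
    Line string `json:&quot;line,omitempty&quot;`
    Code int    `json:&quot;code,omitempty&quot;`
}

// Each line read from the child becomes one framed event on the vsock
// connection, using the shared WriteFrame helper.
func streamLine(conn net.Conn, stream, line string) error {
    return WriteFrame(conn, ExecEvent{Type: stream, Line: line})
}

// When the process exits, a final frame carries the exit code.
func streamExit(conn net.Conn, code int) error {
    return WriteFrame(conn, ExecEvent{Type: &quot;exit&quot;, Code: code})
}
</code></pre>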
<p>Sounds straightforward. The tricky part is background processes.</p>
<p>Claude Code can start things that outlive the main command — dev servers, file
watchers, whatever it decides it needs. These child processes inherit the
stdout/stderr pipes. If the agent waits for the pipes to close (the normal
approach), it hangs forever because the children are still holding them open.</p>
<p>The fix has three parts:</p>
<pre><code class="language-go">// 1. Process group isolation
cmd.SysProcAttr = &amp;syscall.SysProcAttr{Setpgid: true}

// 2. Wait for the main process, not the pipes
&lt;-waitDone

// 3. Kill the entire process group
pgid, _ := syscall.Getpgid(cmd.Process.Pid)
syscall.Kill(-pgid, syscall.SIGTERM)
time.Sleep(500 * time.Millisecond)
syscall.Kill(-pgid, syscall.SIGKILL)
</code></pre>
<p><code>Setpgid: true</code> puts the command in its own process group. When the main process
exits, kill the group (<code>-pgid</code> means &quot;everything in this group&quot;). SIGTERM first,
wait half a second, then SIGKILL for anything that didn't listen.</p>
<p>Even after killing the group, there's a 3-second timeout waiting for the
pipe-reading goroutines to drain. If they're still stuck after that, move on and
send the exit event anyway. Can't let a hung pipe block the entire task.</p>
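<p>That drain is a plain select-with-deadline. A minimal sketch, assuming a
<code>readersDone</code> channel that closes when both pipe readers finish:</p>
<pre><code class="language-go">// readersDone, sendExit, and exitCode are assumed names for this sketch.
select {
case &lt;-readersDone:
    // stdout/stderr readers drained cleanly
case &lt;-time.After(3 * time.Second):
    // a grandchild still holds the pipes; give up and move on
}
sendExit(exitCode)
</code></pre>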
<p>The line-by-line reader uses a 256KB buffer because Claude Code's
<code>--output-format stream-json</code> can produce enormous single lines — tool results
that include the full contents of files it read.</p>
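<p>If the reader is a <code>bufio.Scanner</code> (an assumption; any line reader
hits the same limit), the default 64KB token cap would reject those lines with
<code>bufio.ErrTooLong</code>, so the buffer has to be raised explicitly:</p>
<pre><code class="language-go">scanner := bufio.NewScanner(stdout)
// Allow single lines up to 256KB; stream-json tool results can be enormous.
scanner.Buffer(make([]byte, 0, 64*1024), 256*1024)
for scanner.Scan() {
    emit(scanner.Text()) // emit is a stand-in for the framed-event send
}
</code></pre>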
<h2>Credential injection</h2>
<p>Before Claude Code runs, the orchestrator writes five things into the VM via
vsock:</p>
<p>OAuth credentials from the host's <code>~/.claude/.credentials.json</code> (mode 0600). A
settings file that allows all tools. An environment script that sets
<code>CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS=true</code>. Task metadata. And a marker file to
create the output directory.</p>
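<p>All five go through the agent's <code>write_files</code> handler. A sketch of
one such request, with assumed field names:</p>
<pre><code class="language-go">import (
    &quot;net&quot;
    &quot;os&quot;
)

// Assumed request shape for the write_files operation.
type FileSpec struct {
    Path string `json:&quot;path&quot;`
    Data []byte `json:&quot;data&quot;` // base64 on the wire, courtesy of encoding/json
    Mode uint32 `json:&quot;mode&quot;`
}

// injectCredentials copies the host's OAuth credentials into the guest.
func injectCredentials(conn net.Conn, credsPath string) error {
    creds, err := os.ReadFile(credsPath) // the host's ~/.claude/.credentials.json
    if err != nil {
        return err
    }
    return WriteFrame(conn, map[string]interface{}{
        &quot;type&quot;: &quot;write_files&quot;,
        &quot;files&quot;: []FileSpec{
            {Path: &quot;/root/.claude/.credentials.json&quot;, Data: creds, Mode: 0600},
        },
    })
}
</code></pre>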
<p>The prompt itself gets written to a temp file inside the VM to avoid shell
escaping nightmares, then referenced in the command:</p>
<pre><code class="language-go">claudeArgs := fmt.Sprintf(
    &quot;claude -p \&quot;$(cat %s)\&quot; --output-format stream-json --verbose&quot;,
    promptFile,
)
cmd := []string{&quot;bash&quot;, &quot;-c&quot;,
    &quot;source /etc/profile.d/claude.sh &amp;&amp; &quot; + claudeArgs}
</code></pre>
<p>When the VM is destroyed, the rootfs — containing the credentials — is deleted.
Credentials only exist for the lifetime of the task.</p>
<h2>Collecting results</h2>
<p>After Claude Code finishes, the orchestrator searches for files it created:</p>
<pre><code class="language-go">// Anything in the output directory
vsock.Exec(jailID, []string{&quot;find&quot;, outputDir, &quot;-type&quot;, &quot;f&quot;, &quot;-not&quot;, &quot;-name&quot;, &quot;.keep&quot;}, nil, &quot;/root&quot;)

// Any new files under /root, created after the prompt was written
vsock.Exec(jailID, []string{&quot;find&quot;, &quot;/root&quot;, &quot;-maxdepth&quot;, &quot;2&quot;, &quot;-type&quot;, &quot;f&quot;,
    &quot;-newer&quot;, &quot;/tmp/claude-prompt.txt&quot;}, nil, &quot;/root&quot;)
</code></pre>
<p>Each file gets downloaded via <code>vsock.ReadFile</code> and saved to
<code>/opt/firecracker/results/&lt;task-id&gt;/</code>. The runner also scans the accumulated
output for Claude's <code>total_cost_usd</code> field to record what the task cost in API
credits.</p>
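<p>The cost scan is just JSON sniffing over those accumulated lines. Roughly,
and assuming the cost arrives in the final result event:</p>
<pre><code class="language-go">import &quot;encoding/json&quot;

// extractCost walks the stream-json lines backwards, since the cost
// normally sits in the last result event.
func extractCost(lines []string) (float64, bool) {
    for i := len(lines) - 1; i &gt;= 0; i-- {
        var ev struct {
            TotalCostUSD *float64 `json:&quot;total_cost_usd&quot;`
        }
        if json.Unmarshal([]byte(lines[i]), &amp;ev) == nil &amp;&amp; ev.TotalCostUSD != nil {
            return *ev.TotalCostUSD, true
        }
    }
    return 0, false
}
</code></pre>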
<p>Then the VM is destroyed. Firecracker process killed, TAP device removed,
iptables rules deleted, jailer chroot deleted, VM state directory deleted. Clean
slate.</p>
<p>The whole cycle — boot, inject, run, collect, destroy — typically takes 30-120
seconds depending on how complex the prompt is. The 4-second boot and ~1-second
teardown are rounding errors compared to the time Claude actually spends
thinking.</p>
<p><a href="https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/">Part 3</a>
gets into the fun stuff — the MCP server that lets Claude delegate tasks to
itself, the streaming architecture, the web dashboard, and what productionising
this would actually look like.</p>
]]>
      </content:encoded>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Can You Beat Last Month?</title>
      <link>https://jonnonz.com/posts/can-you-beat-last-month/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/can-you-beat-last-month/</guid>
      <description>
        Before deep learning gets a chance, we need to know how well stupid-simple models perform. Turns out, they put up a real fight.
      </description>
      <content:encoded>
        <![CDATA[<p>Every machine learning project needs a reality check.</p>
<p>It's tempting to jump straight to the neural network. That's the exciting bit,
right? But if you don't establish what a dead-simple model can do first, you've
got no idea whether your fancy architecture is actually learning anything useful
or just being expensive.</p>
<p>So before ConvLSTM gets anywhere near this data, we're going to throw three
gloriously simple baselines at it and see how they do.</p>
<h2>Persistence: next month equals this month</h2>
<p>The dumbest possible model. To predict April, just use March's values. Every
cell, every crime type. Carbon copy.</p>
<p>It sounds ridiculous, but it works surprisingly well when patterns are stable.
And as we saw in the EDA, Auckland's crime hotspots are remarkably persistent.
The CBD doesn't suddenly go quiet. South Auckland doesn't randomly calm down.</p>
<p>On the six-month test set (August 2025 – January 2026):</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>MAE</th>
<th>RMSE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.42</td>
<td>3.18</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.38</td>
<td>0.91</td>
</tr>
<tr>
<td>Assault</td>
<td>0.22</td>
<td>0.64</td>
</tr>
<tr>
<td>Robbery</td>
<td>0.04</td>
<td>0.15</td>
</tr>
<tr>
<td>Sexual</td>
<td>0.03</td>
<td>0.12</td>
</tr>
<tr>
<td>Harm</td>
<td>0.01</td>
<td>0.04</td>
</tr>
</tbody>
</table>
<p>Those MAE numbers for theft and burglary look small until you remember that most
cells are zero. For the active cells (the ones we actually care about) the error
is larger. A busy CBD cell might have 35 thefts in one month and 28 the next.
Persistence would be off by 7 there, which is a 20% miss on an important
prediction.</p>
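<p>For concreteness, scoring persistence is nothing more than comparing each
month's grid to the one before it. A sketch, with an assumed tensor layout:</p>
<pre><code class="language-go">import &quot;math&quot;

// months[t][c][cell] holds the count for month t, crime type c, grid cell.
// Persistence predicts months[t] as a carbon copy of months[t-1].
func persistenceMAE(months [][][]float64, c int) float64 {
    var sum float64
    var n int
    for t := 1; t &lt; len(months); t++ {
        for cell := range months[t][c] {
            sum += math.Abs(months[t][c][cell] - months[t-1][c][cell])
            n++
        }
    }
    return sum / float64(n)
}
</code></pre>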
<h2>Seasonal naive: same month last year</h2>
<p>Instead of copying last month, copy the same month from the previous year.
January 2026 gets predicted from January 2025. This should capture seasonal
patterns: the summer spike, the February dip.</p>
<p>The catch? We only have four years of data. The test set months (August–January)
each have at most three prior examples of the same month. That's not a lot of
seasonal training data.</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>MAE</th>
<th>RMSE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.51</td>
<td>3.42</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.41</td>
<td>0.97</td>
</tr>
<tr>
<td>Assault</td>
<td>0.24</td>
<td>0.68</td>
</tr>
<tr>
<td>Robbery</td>
<td>0.05</td>
<td>0.17</td>
</tr>
<tr>
<td>Sexual</td>
<td>0.04</td>
<td>0.13</td>
</tr>
<tr>
<td>Harm</td>
<td>0.01</td>
<td>0.04</td>
</tr>
</tbody>
</table>
<p>Slightly worse than persistence across the board. That surprised me initially.
Shouldn't capturing seasonality help?</p>
<p>The issue is that the 2023-to-2025 decline we spotted in the EDA bites hard
here. If you predict January 2026 from January 2025, you're using data from a
period when crime was higher. The seasonal pattern is real, but the
year-over-year trend works against it. With more years of data, seasonal naive
would likely pull ahead.</p>
<h2>Historical average: the mean of all training months</h2>
<p>For each cell and crime type, take the average across all 36 training months.
This smooths out month-to-month noise and gives you a &quot;typical&quot; value for each
location.</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>MAE</th>
<th>RMSE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.28</td>
<td>2.95</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.35</td>
<td>0.84</td>
</tr>
<tr>
<td>Assault</td>
<td>0.20</td>
<td>0.58</td>
</tr>
<tr>
<td>Robbery</td>
<td>0.04</td>
<td>0.14</td>
</tr>
<tr>
<td>Sexual</td>
<td>0.03</td>
<td>0.11</td>
</tr>
<tr>
<td>Harm</td>
<td>0.01</td>
<td>0.04</td>
</tr>
</tbody>
</table>
<p>The best baseline. By averaging over three years, it smooths out the
month-to-month noise and the year-over-year trend simultaneously. It won't
capture seasonal peaks or sudden changes, but for the &quot;typical month&quot; prediction
it's solid.</p>
<h2>Why MAPE breaks down</h2>
<p>You might wonder why I'm not reporting MAPE (Mean Absolute Percentage Error).
It's the standard metric in a lot of forecasting work. The reason: sparse data.</p>
<p>MAPE divides the error by the actual value. When the actual value is zero (which
it is for 91.7% of our tensor) you get division by zero. And for cells with small
counts the percentages whipsaw: against an actual of 1, a prediction of 0 is a
100% error and a prediction of 2 is <em>also</em> a 100% error, while the same
one-count miss on a busy cell barely registers. The metric becomes wildly
unstable.</p>
<p>MAE and RMSE are more honest here. They tell you the absolute magnitude of your
errors in actual crime counts, which is what we care about. A miss of 3
victimisations means the same thing whether the cell usually has 5 or 50.</p>
<h2>The bar to clear</h2>
<p>Here's the scoreboard going forward. Any deep learning model needs to beat the
historical average baseline to justify its existence:</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>Historical Avg MAE</th>
<th>Historical Avg RMSE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>1.28</td>
<td>2.95</td>
</tr>
<tr>
<td>Burglary</td>
<td>0.35</td>
<td>0.84</td>
</tr>
<tr>
<td>Assault</td>
<td>0.20</td>
<td>0.58</td>
</tr>
<tr>
<td>All types</td>
<td>0.39</td>
<td>0.95</td>
</tr>
</tbody>
</table>
<p>Theft is the easiest to beat because there's the most signal: high counts, clear
spatial patterns, strong seasonality. Robbery, sexual offences, and harm are
essentially noise at this resolution. The models will probably predict near-zero
for those types and be mostly correct.</p>
<p>The real test will be the middle ground. Can ConvLSTM or ST-ResNet predict the
<em>changes</em> in theft and burglary better than a static average? Can they catch the
months where a cell spikes or dips? That's where simple baselines fall flat,
because they don't model dynamics at all.</p>
<p>If the deep learning can't meaningfully beat &quot;just use the average,&quot; then it's
not worth the CPU cycles. Or in my case, the many hours of a Ryzen 5 grinding
away.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Claude Code Running Claude Code in 4-Second Disposable VMs</title>
      <link>https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/</guid>
      <description>
        Anthropic has an internal platform for running Claude Code in full-permission isolation. I built my own version with Firecracker MicroVMs, Go, and a $400 Ryzen box.
      </description>
      <content:encoded>
        <![CDATA[<p>Running Claude Code with full permissions inside a Docker container is a
terrible idea. I did it anyway for about a week, then built something better.</p>
<p>Anthropic has an internal platform — people have been calling it
<a href="https://ai.gopubby.com/anthropics-antspace-the-secret-paas-nobody-was-supposed-to-find-a79ce1e02151">Antspace</a>
since it got reverse-engineered from the Claude Code source — that runs AI
coding tasks in isolated environments. It's part of a vertical stack they're
building internally: intent goes in, code comes out, and the agent never touches
the host machine.</p>
<p>I wanted that. Not the whole platform-as-a-service thing, just the core idea:
give Claude Code a prompt, let it run with zero permission restrictions, stream
the output back, grab any files it created, and destroy everything when it's
done. On a single Linux box sitting in my office.</p>
<p>The result is about 3,200 lines of Go and 860 lines of TypeScript. It boots a
fresh Linux VM in ~4 seconds, runs Claude Code inside it, and tears it down when
the task finishes. Three ways to use it: a CLI, a REST API with a web dashboard,
and an MCP server so Claude Code on other machines can delegate tasks to it.</p>
<p>This first post is about why I built it this way.
<a href="https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/">Part 2</a> and
<a href="https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/">Part 3</a> get
into the actual implementation.</p>
<h2>The container problem</h2>
<p><code>CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS=true</code> — that's the environment variable
that tells Claude Code to stop asking before it runs shell commands or writes
files. It just does whatever it thinks it needs to. For autonomous tasks, you
need this. Claude can't ask for confirmation when there's nobody watching.</p>
<p>The question is where you let it run.</p>
<p>Docker is the obvious first thought. Fast startup, everyone knows it, easy to
orchestrate. But containers share the host kernel. Every container on the
machine issues syscalls to the same Linux kernel, and a kernel vulnerability is
a vulnerability in every container on the host.
<a href="https://huggingface.co/blog/agentbox-master/firecracker-vs-docker-tech-boundary">The isolation boundary is the container runtime</a>,
not hardware — and that surface area is big.</p>
<p>For most workloads this is fine. Running a web server in Docker? No worries. But
running an AI agent that can execute arbitrary shell commands with root-level
permissions? That's a different threat model. A container escape gives you the
host. And you've just given the thing inside the container permission to try
anything.</p>
<p>Anthropic's own approach to
<a href="https://www.anthropic.com/engineering/claude-code-sandboxing">sandboxing Claude Code</a>
uses OS-level primitives — bubblewrap on Linux, Seatbelt on macOS — for
filesystem and network isolation. They report an 84% reduction in permission
prompts internally. That's smart for the normal use case where Claude is helping
you write code in your own project. But I wanted something more aggressive: full
isolation where even a kernel exploit can't reach the host.</p>
<h2>Why Firecracker</h2>
<p><a href="https://firecracker-microvm.github.io/">Firecracker</a> is what AWS built for
Lambda and Fargate. Each MicroVM is a real KVM-backed virtual machine with its
own guest kernel, its own memory space, and hardware-enforced isolation via
Intel VT-x or AMD-V. The attack surface is the KVM hypervisor — which the kernel
team at AWS has spent years minimising.</p>
<p>The trade-off is boot time. Containers start in under a second. Firecracker VMs
take about 4 seconds on my hardware once you account for the guest kernel boot,
systemd init, and the agent process starting up. For tasks that typically run
20-120 seconds, 4 seconds of overhead is nothing.</p>
<p>Each VM also copies a 4GB rootfs image. Sparse copies make this fast (&lt;1
second), but it does use disk. On a machine with a 1TB NVMe, I'm not losing
sleep over it.</p>
<p>The hardware is an AMD Ryzen 5 5600GT with 30GB of RAM. Nothing exotic. About
$400 worth of parts sitting under my desk. Each VM gets 2GB of RAM by default,
so I can run roughly 12-13 VMs concurrently before the host runs out of memory.</p>
<h2>Talking to a VM without a network</h2>
<p>This was my favourite bit to figure out.</p>
<p>The obvious way to communicate with a process inside a VM is SSH. Set up keys,
open a port, connect over the network. But SSH means key management, an open
network port inside the VM, and another service to configure. If the guest's
network breaks during a task, you've lost your control channel.</p>
<p><a href="https://github.com/firecracker-microvm/firecracker/blob/main/docs/vsock.md">vsock</a>
(AF_VSOCK, address family 40) is a kernel-level host-guest communication
channel. It doesn't touch the network stack. No IP addresses, no ports, no keys.
Firecracker exposes the guest's vsock as a Unix domain socket on the host side —
you connect to the socket, send <code>CONNECT &lt;port&gt;\n</code>, and you're talking directly
to a process inside the VM.</p>
<pre><code class="language-go">func Connect(jailID string, port int) (net.Conn, error) {
    socketPath := fmt.Sprintf(&quot;/srv/jailer/firecracker/%s/root/vsock.sock&quot;, jailID)
    conn, _ := net.Dial(&quot;unix&quot;, socketPath)
    conn.Write([]byte(fmt.Sprintf(&quot;CONNECT %d\n&quot;, port)))
    // Read &quot;OK &lt;port&gt;&quot; response
    return conn, nil
}
</code></pre>
<p>On the guest side, Go's standard library doesn't support AF_VSOCK — address
family 40 doesn't exist in the <code>net</code> package. So the guest agent uses raw
syscalls:</p>
<pre><code class="language-go">fd, _ := syscall.Socket(40, syscall.SOCK_STREAM, 0)  // AF_VSOCK = 40
// Manually construct struct sockaddr_vm (16 bytes)
sa := [16]byte{}
*(*uint16)(unsafe.Pointer(&amp;sa[0])) = 40          // family
*(*uint32)(unsafe.Pointer(&amp;sa[4])) = uint32(port) // port (9001)
*(*uint32)(unsafe.Pointer(&amp;sa[8])) = 0xFFFFFFFF   // VMADDR_CID_ANY
syscall.RawSyscall(syscall.SYS_BIND, uintptr(fd), uintptr(unsafe.Pointer(&amp;sa[0])), 16)
syscall.RawSyscall(syscall.SYS_LISTEN, uintptr(fd), 5, 0)
</code></pre>
<p>Yeah, that's <code>unsafe.Pointer</code> and manual struct layout. Not the prettiest Go
you'll ever write. But it works, it's fast, and the whole vsock layer is about
160 lines shared between both binaries.</p>
<p>The wire protocol is dead simple — length-prefixed JSON frames:</p>
<pre><code class="language-go">func WriteFrame(w io.Writer, v interface{}) error {
    data, _ := json.Marshal(v)
    binary.Write(w, binary.BigEndian, uint32(len(data)))
    w.Write(data)
    return nil
}
</code></pre>
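<p>The read side is the mirror image: pull the length prefix, then exactly that
many bytes. Something like:</p>
<pre><code class="language-go">func ReadFrame(r io.Reader, v interface{}) error {
    var n uint32
    if err := binary.Read(r, binary.BigEndian, &amp;n); err != nil {
        return err
    }
    data := make([]byte, n)
    if _, err := io.ReadFull(r, data); err != nil {
        return err
    }
    return json.Unmarshal(data, v)
}
</code></pre>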
<p>Each operation (ping, exec, write files, read file) opens a new connection,
sends one request, reads the response, and closes. Connection-per-request. Not
fancy, but vsock connections are local and effectively instant, so there's no
reason to complicate things with multiplexing.</p>
<h2>The shape of the thing</h2>
<p>The whole system is two Go binaries — the orchestrator (runs on the host) and
the agent (runs inside each VM).</p>
<pre><code class="language-mermaid">graph TD
    subgraph &quot;Host — orchestrator binary&quot;
        API[&quot;REST API + WebSocket :8080&quot;]
        MCP[&quot;MCP Server :8081&quot;]
        VM[&quot;VM Manager&quot;]
        NET[&quot;TAP + iptables&quot;]
        TASK[&quot;Task Runner&quot;]
        STREAM[&quot;Pub/Sub Hub&quot;]
        VSOCK[&quot;vsock Client&quot;]
    end

    subgraph &quot;Guest — agent binary&quot;
        AGENT[&quot;Guest Agent vsock:9001&quot;]
        CLAUDE[&quot;Claude Code&quot;]
    end

    API --&gt; TASK
    MCP --&gt; TASK
    TASK --&gt; VM
    VM --&gt; NET
    TASK --&gt; VSOCK
    VSOCK --&gt; AGENT
    AGENT --&gt; CLAUDE
    TASK --&gt; STREAM
    STREAM --&gt; API
</code></pre>
<p>The orchestrator is a single 14MB binary with the React dashboard embedded via
<code>//go:embed</code>. Copy it to a server, run it with sudo, done. Seven Go dependencies
total — chi for routing, netlink for TAP devices, go-iptables for firewall
rules, <a href="https://github.com/mark3labs/mcp-go">mcp-go</a> for the MCP protocol, and a
few others.</p>
<p>The agent is a 2.5MB static binary compiled with <code>CGO_ENABLED=0</code>. It ships
inside the VM's rootfs and starts via systemd on boot. Within about a second of
the VM coming up, the agent is listening on vsock port 9001 and ready to accept
commands.</p>
<p>They share exactly one file — <code>internal/agent/protocol.go</code> — which defines the
wire protocol types and framing functions. Everything else is independent.</p>
<h2>What a task looks like</h2>
<p>You give it a prompt. It does the rest.</p>
<ol>
<li>Generate a task ID and VM name</li>
<li>Copy the base rootfs image (sparse, &lt;1 second)</li>
<li>Inject network config into the rootfs</li>
<li>Create a TAP device and iptables rules for internet access</li>
<li>Launch Firecracker via the jailer</li>
<li>Poll vsock until the agent responds (~1 second)</li>
<li>Inject credentials and files via vsock</li>
<li>Run Claude Code with streaming output</li>
<li>Collect any files Claude created</li>
<li>Destroy the VM</li>
</ol>
<p>From the CLI it looks like this:</p>
<pre><code class="language-bash">sudo ./bin/orchestrator task run \
    --prompt &quot;Write a Python script that generates Fibonacci numbers&quot; \
    --ram 2048 \
    --vcpus 2 \
    --timeout 120
</code></pre>
<p>Output streams to your terminal in real time. When it's done:</p>
<pre><code>=== Task Complete ===
ID:     a3bfca80
Status: completed
Exit:   0
Cost:   $0.0582
Files:  [fibonacci.py]
</code></pre>
<p>The VM is gone. The rootfs is deleted. The TAP device and iptables rules are
cleaned up. All that's left is the result files in
<code>/opt/firecracker/results/a3bfca80/</code>.</p>
<p>Or you use the MCP server, and Claude Code on your laptop delegates the task to
a VM on the box under your desk. Claude spawning Claude. That bit is properly
cool, and I'll get into it in Part 3.</p>
<h2>Why Go</h2>
<p>Quick aside on this because people always ask.</p>
<p>Go produces static binaries. The agent needs to be a single file with zero
dependencies that runs inside a minimal Debian guest — <code>CGO_ENABLED=0</code> makes
this trivial. The orchestrator needs to manage concurrent VMs, and goroutines
are a natural fit for that. Syscall support is first-class, which matters when
you're doing raw vsock operations. And it compiles in about 2 seconds, which is
nice when you're iterating.</p>
<pre><code class="language-makefile">build-agent:
	CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o bin/agent -ldflags=&quot;-s -w&quot; ./cmd/agent
</code></pre>
<p>That <code>-ldflags=&quot;-s -w&quot;</code> strips debug info and DWARF tables, dropping the agent
binary from ~3.5MB to ~2.5MB. Every byte counts when you're baking it into a
rootfs that gets copied for every VM.</p>
<p><a href="https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/">Part 2</a> gets into
the actual build — the rootfs, the networking (including a fun bug with Ubuntu's
UFW that had me staring at iptables rules for an embarrassing amount of time),
the guest agent, and the streaming pipeline that gets Claude's output from
inside a VM to your browser.</p>
]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What if your browser built the UI for you?</title>
      <link>https://jonnonz.com/posts/what-if-your-browser-built-the-ui-for-you/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/what-if-your-browser-built-the-ui-for-you/</guid>
      <description>
        We're still shipping hand-crafted frontends while AI can generate entire interfaces. What if the browser itself generated the UI from an API manifest and your preferences?
      </description>
      <content:encoded>
        <![CDATA[<p>We're at a genuinely weird inflection point in frontend development. AI can
generate entire interfaces now. LLMs can reason about data and layout. And yet —
most SaaS products still ship hand-crafted React apps, each building its own UI,
its own accessibility layer, its own theme system, its own responsive
breakpoints. Not every service, but the vast majority.</p>
<p>That's a lot of duplicated effort for what's essentially the same job — showing
a human some data and letting them do stuff with it.</p>
<p>I've been thinking about this a lot lately, and I built a proof of concept to
test an idea: what if the browser itself generated the UI?</p>
<h2>Where we are right now</h2>
<p>The industry is circling this idea from multiple angles, but nobody's quite
landed on it yet.</p>
<p><a href="https://www.apollographql.com/docs/graphos/schema-design/guides/sdui/basics">Server-driven UI</a>
has been around for a while — Airbnb and others pioneered it for mobile, where
app store review cycles make shipping UI changes painful. The server sends down
a JSON tree describing what to render, and the client just follows instructions.
It's clever, but the server is still calling the shots.</p>
<p>Google recently shipped
<a href="https://developers.google.com/natively-adaptive-interfaces">Natively Adaptive Interfaces</a>
— a framework that uses AI agents to make accessibility a default rather than an
afterthought. Really cool idea, and the right instinct. But it's still operating
within a single app's boundaries. Your accessibility preferences don't carry
between Google's products and, say, your project management tool.</p>
<p>Then there's the
<a href="https://www.copilotkit.ai/blog/the-developer-s-guide-to-generative-ui-in-2026">generative UI</a>
wave — CopilotKit, Vercel's AI SDK, and others building frameworks where LLMs
generate components on the fly. These are powerful developer tools, but they're
still developer tools. The generation happens at build time or on the server.
The service is still in control.</p>
<p>See the pattern? Every approach keeps the power on the service side.</p>
<h2>Flip it</h2>
<p>Here's the idea behind the
<a href="https://github.com/jonnonz1/adaptive-browser">adaptive browser</a>: what if the
generation happened on <em>your</em> side?</p>
<p>Instead of a service shipping you a finished frontend, it publishes a manifest —
a structured description of what it can do. Its capabilities, endpoints, data
shapes, what actions are available. Think of it like an API spec, but semantic.
Not just &quot;here's a GET endpoint&quot; but &quot;here's a list of repositories, they're
sortable by stars and language, you can create, delete, star, or fork them.&quot;</p>
<p>Your browser takes that manifest, calls the actual APIs, gets real data back,
and then generates the UI based on your preferences. Your font size. Your colour
scheme. Your preferred layout (tables vs cards vs kanban). Your accessibility
needs. All applied universally, across every service.</p>
<p>The manifest for something like GitHub looks roughly like this — a service
describes its capabilities and the browser figures out the rest:</p>
<pre><code class="language-yaml">service:
  name: &quot;GitHub&quot;
  domain: &quot;api.github.com&quot;

capabilities:
  - id: &quot;repositories&quot;
    endpoints:
      - path: &quot;/user/repos&quot;
        semantic: &quot;list&quot;
        entity: &quot;repository&quot;
        sortable_fields: [name, updated_at, stargazers_count]
        actions: [create, delete, star, fork]
</code></pre>
<p>The browser takes that, fetches the data, and generates a bespoke interface —
using an LLM to reason about the best way to present it given who you are and
what you're trying to do.</p>
<h2>Why this matters more than it sounds</h2>
<p>When I was building the app store and integrations platforms at Xero, one of the
constant headaches was that every third-party integration had its own UI
patterns. Users had to learn a new interface for every app they connected. If
the browser was generating the UI from a shared set of preferences, that problem
just… goes away.</p>
<p>Accessibility is the big one though. Right now, accessibility is a feature that
gets bolted on — and often badly. When the browser generates the UI,
accessibility isn't a feature. It's the default. Your preferences — high
contrast, keyboard-first navigation, screen reader optimisation, larger text —
apply everywhere. Not because every developer remembered to implement them, but
because they're baked into how the UI gets generated in the first place.</p>
<p>Customisation becomes genuinely personal too. Not &quot;pick from three themes the
developer made&quot; but &quot;this is how I interact with software, full stop.&quot;</p>
<h2>The trade-off is real though</h2>
<p>Frontend complexity drops dramatically, but the complexity doesn't disappear —
it moves behind the API. And honestly, it probably increases.</p>
<p>API design becomes way more important. You can't just throw together some REST
endpoints and call it a day. Your manifest needs to be semantic — describing
what the data means, not just what shape it is. Data contracts between services
matter more. Versioning matters more.</p>
<pre><code class="language-mermaid">graph LR
    A[Service] --&gt;|Publishes manifest + APIs| B[Browser Agent]
    C[User Preferences] --&gt; B
    D[Org Guardrails] --&gt; B
    B --&gt;|Generates| E[Bespoke UI]
</code></pre>
<p>But here's the thing — this trade-off pushes us somewhere genuinely interesting.
If every service needs to describe itself semantically through APIs and
manifests, those APIs become the actual product surface. Not the frontend. The
APIs.</p>
<p>And once APIs are the product surface, sharing context between platforms becomes
the interesting problem. Your project management tool knows what you're working
on. Your email client knows who you're talking to. Your code editor knows what
you're building. Right now, none of these talk to each other in any meaningful
way because they're all locked behind their own UIs. In a manifest-driven world,
that context flows through the APIs — and your browser can stitch it all
together into something coherent.</p>
<h2>Where this is headed (IMHO)</h2>
<p>I reckon we're about 3-5 years from this being mainstream. The pieces are all
there — LLMs that can reason about UI,
<a href="https://www.builder.io/blog/ui-over-apis">standardisation efforts</a> around
sending UI intent over APIs, and a growing expectation from users that software
should adapt to them, not the other way around.</p>
<p>The services that win in this world won't be the ones with the prettiest
hand-crafted UI. They'll be the ones with the best APIs, the richest manifests,
and the most useful data. The frontend becomes a generated output, not a
hand-crafted input.</p>
<p>Organisations will set preference guardrails — &quot;our people can use dark or light
mode, must have destructive action confirmations, these fields are always
visible&quot; — while individuals customise within those bounds. Your browser becomes
your agent, not just a renderer.</p>
<p>I built the <a href="https://github.com/jonnonz1/adaptive-browser">adaptive browser</a> as
a proof of concept to test this thinking — it uses Claude to generate UIs from a
GitHub manifest and user preferences defined in YAML. It's rough, but the
direction feels right.</p>
<p>The frontend isn't dying. But what we think of as &quot;frontend development&quot; is
about to change. The interesting work moves to API design, semantic data
contracts, and building browsers smart enough to be genuine user agents.</p>
]]>
      </content:encoded>
      <pubDate>Sun, 05 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Stealing NanoClaw Patterns for Web Apps and SaaS</title>
      <link>https://jonnonz.com/posts/stealing-nanoclaw-patterns-for-webapps-and-saas/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/stealing-nanoclaw-patterns-for-webapps-and-saas/</guid>
      <description>
        Four architectural patterns from NanoClaw's tiny codebase that translate directly to production web apps and SaaS — credential sidecars, infrastructure-level isolation, polling loops, and Postgres as your message queue.
      </description>
      <content:encoded>
        <![CDATA[<p>In <a href="https://jonnonz.com/posts/nanoclaw-architecture-masterclass-in-doing-less/">Part 1</a> I pulled
apart NanoClaw's codebase and found six patterns that make an 8,000-line AI
assistant surprisingly robust. But NanoClaw is a single-user tool running on
your laptop. Surely these patterns fall apart once you've got real tenants, real
money, and real scale?</p>
<p>Nah. Four of them translate almost directly — and the ones that don't still
teach you something useful.</p>
<h2>The credential sidecar</h2>
<p>NanoClaw's credential proxy — where containers get a placeholder API key and a
localhost proxy injects the real one — sounds like a neat trick for a personal
tool. But this exact pattern is showing up in production Kubernetes deployments
right now.</p>
<p>The broader version is a
<a href="https://www.apistronghold.com/blog/phantom-token-pattern-production-ai-agents">sidecar proxy that handles credential injection</a>
for any service that needs API keys or tokens. Your application code never
touches the real secret. A sidecar container intercepts outbound requests, swaps
in credentials, and forwards them upstream.</p>
<p>At Vend we managed a bunch of third-party integrations — payment gateways,
shipping providers, accounting platforms. Each one had API keys that needed to
live somewhere. We went through the typical evolution: environment variables,
then a secrets manager, then a service that distributed keys at startup. Every
step was an improvement, but the keys still ended up <em>in the application's
memory</em>.</p>
<p>The sidecar approach skips that entirely. Your app sends requests with a
placeholder. The proxy — which is a separate process with its own security
boundary — does the credential swap. Even if your application gets compromised,
the real keys aren't there to steal.</p>
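<p>The mechanics fit in a few lines. A toy sketch of the swap, where the
placeholder convention and the names are mine rather than any particular
product's API:</p>
<pre><code class="language-go">import (
    &quot;net/http&quot;
    &quot;net/http/httputil&quot;
    &quot;net/url&quot;
)

// newCredentialSidecar returns a reverse proxy that swaps a placeholder
// bearer token for the real key before forwarding upstream. The app
// process never holds realKey in its own memory.
func newCredentialSidecar(upstream *url.URL, realKey string) http.Handler {
    proxy := httputil.NewSingleHostReverseProxy(upstream)
    orig := proxy.Director
    proxy.Director = func(r *http.Request) {
        orig(r)
        if r.Header.Get(&quot;Authorization&quot;) == &quot;Bearer PLACEHOLDER&quot; {
            r.Header.Set(&quot;Authorization&quot;, &quot;Bearer &quot;+realKey)
        }
    }
    return proxy
}
</code></pre>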
<p>If you're running any kind of multi-service architecture where services call
external APIs, this pattern is worth adopting. Your API gateway might already be
doing a version of it — the insight is making it explicit and consistent across
all outbound credential flows.</p>
<h2>Isolation as the security model</h2>
<p>This is the one I keep thinking about.</p>
<p>NanoClaw uses filesystem mounts to control what each container can see. No
application-level permission checks — the security model <em>is</em> the infrastructure
topology. If a container can't see a file, it can't access it. No bugs, no
missed checks, no escalation vulnerabilities.</p>
<p>In SaaS, we spend enormous amounts of time writing authorisation logic. Role
checks, permission middleware, tenant-scoping queries. And it works — until
someone forgets a WHERE clause.</p>
<p>AWS's own
<a href="https://docs.aws.amazon.com/whitepapers/latest/saas-architecture-fundamentals/tenant-isolation.html">SaaS tenant isolation guidance</a>
makes this point explicitly: authentication and authorisation are not the same
as isolation. The fact that a user logged in doesn't mean your system has
achieved tenant isolation. A
<a href="https://workos.com/blog/tenant-isolation-in-multi-tenant-systems">single missed tenant filter</a>
on a database query and you've got a cross-tenant data leak.</p>
<p>The NanoClaw-inspired approach is to push isolation down the stack. Separate
database schemas per tenant. Separate containers. Separate cloud accounts for
your highest-value customers. Not instead of application-level checks — but as a
backstop that catches the bugs your application-level checks inevitably have.</p>
<p>At Xero, working across the integrations and app store teams, I saw first-hand
how multi-tenant data isolation gets complicated fast. The teams that had the
fewest incidents were the ones where the infrastructure itself enforced
boundaries, not just the application code.</p>
<p>You don't need to go full NanoClaw and give every tenant their own container.
But you should be asking: if my application-level authorisation has a bug,
what's my second line of defence? If the answer is &quot;nothing&quot; — that's the
pattern to steal.</p>
<h2>Polling when it's the right call</h2>
<p>NanoClaw polls SQLite every 2 seconds. No WebSockets, no event bus, no pub/sub.
Just a loop that checks for new stuff.</p>
<p>The instinct for most teams is to treat polling as a temporary hack you'll
replace with &quot;proper&quot; event-driven architecture later. Yan Cui wrote a
<a href="https://theburningmonk.com/2025/05/understanding-push-vs-poll-in-event-driven-architectures/">solid breakdown of push vs poll in event-driven systems</a>
and the takeaway isn't that one is always better — it's that the right choice
depends on your throughput, ordering, and failure-handling requirements.</p>
<p>For a lot of internal systems, polling is the correct permanent answer.</p>
<p>Admin dashboards. Background job status. Internal reporting. Webhook retry
queues. Deployment pipelines. These systems don't need sub-second latency. They
need reliability and simplicity. A polling loop against your database gives you
both, with zero infrastructure overhead.</p>
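<p>The whole &quot;event system&quot; for those can be a ticker and a query. A sketch,
with an illustrative schema:</p>
<pre><code class="language-go">import (
    &quot;database/sql&quot;
    &quot;log&quot;
    &quot;time&quot;
)

// pollJobs checks for pending work every 2 seconds. The jobs table and
// handle func are stand-ins; handle should mark the job done so it
// isn't picked up again.
func pollJobs(db *sql.DB, handle func(id int64, payload string)) {
    ticker := time.NewTicker(2 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        var id int64
        var payload string
        err := db.QueryRow(
            `SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1`,
        ).Scan(&amp;id, &amp;payload)
        if err == sql.ErrNoRows {
            continue // nothing to do; check again next tick
        }
        if err != nil {
            log.Printf(&quot;poll failed: %v&quot;, err)
            continue
        }
        handle(id, payload)
    }
}
</code></pre>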
<p>At Xero we shipped multiple times per day, and some of the internal tooling that
supported continuous deployment was surprisingly simple under the hood. Cron
jobs. Polling loops. SQL queries on a timer. Not because anyone was cutting
corners — because the requirements genuinely didn't need anything more
sophisticated.</p>
<p>The trap is reaching for Kafka or RabbitMQ because you think you'll need it
eventually.
<a href="https://synmek.com/saas-architecture-for-startups-2025-guide">70% of startups fail due to premature scaling</a>.
The infrastructure you don't deploy is the infrastructure that never breaks.</p>
<h2>Your database is your message queue</h2>
<p>NanoClaw uses JSON files on the filesystem for inter-process communication.
Atomic rename, directory-based identity, simple polling to pick up new messages.
No Redis. No message broker.</p>
<p>That specific approach won't scale to a multi-tenant SaaS — but the <em>instinct</em>
behind it absolutely does. The instinct is: use the infrastructure you already
have.</p>
<p>For most web apps, that means Postgres. The
<a href="https://dagster.io/blog/skip-kafka-use-postgres-message-queue">Postgres-as-queue movement</a>
has been gaining serious traction, and tools like
<a href="https://github.com/pgmq/pgmq">PGMQ</a> make it practical. You get ACID guarantees,
you don't need to manage another service, and your queue is backed by the same
database you're already monitoring and backing up.</p>
<p>NanoClaw's
<a href="https://dev.to/constanta/crash-safe-json-at-scale-atomic-writes-recovery-without-a-db-3aic">atomic write pattern</a>
— write to a temp file, rename into place — maps to <code>INSERT INTO queue_table</code>
followed by a <code>SELECT ... FOR UPDATE SKIP LOCKED</code> consumer. Same principle: the
message either exists completely or doesn't exist at all. No partial state.</p>
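<p>The consumer side of that Postgres queue is a handful of lines. A sketch,
with an assumed table shape:</p>
<pre><code class="language-go">import &quot;database/sql&quot;

// claimMessage atomically claims one unprocessed row. SKIP LOCKED means
// concurrent consumers never block on, or double-claim, the same row.
func claimMessage(tx *sql.Tx) (id int64, body string, err error) {
    err = tx.QueryRow(`
        SELECT id, body FROM queue_table
        WHERE processed_at IS NULL
        ORDER BY id
        FOR UPDATE SKIP LOCKED
        LIMIT 1`).Scan(&amp;id, &amp;body)
    return
}
</code></pre>
<p>Mark the row processed and commit in the same transaction, and you keep the
same property as the atomic rename: a message is either fully handled or
untouched.</p>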
<p>The &quot;just add Redis&quot; reflex is strong in our industry. Sometimes it's the right
call. But I've seen plenty of teams introduce a message broker for a workload
that Postgres could've handled without breaking a sweat — and then spend the
next six months debugging consumer lag and dead letter queues.</p>
<h2>The real pattern</h2>
<p>The specific techniques matter less than the discipline behind them.</p>
<p>NanoClaw's developer looked at a 500,000-line framework and asked: what are my
<em>actual</em> constraints? Single user. Local machine. One AI provider. And then
built exactly the architecture those constraints required — nothing more.</p>
<p>Most teams don't do this. They build for imaginary scale, imaginary
multi-tenancy requirements, imaginary traffic spikes. They reach for Kubernetes
before they've outgrown a single server. They deploy event buses before they've
outgrown a polling loop. They write complex authorisation middleware before
they've considered whether infrastructure isolation would eliminate the problem
entirely.</p>
<p>The pattern worth stealing isn't the credential proxy or the polling loop or
Postgres-as-queue. It's the habit of understanding your constraints first and
letting them delete complexity from your architecture.</p>
<p>Hardest pattern to adopt, though. Because it means admitting you're smaller than
you think.</p>
]]>
      </content:encoded>
      <pubDate>Sun, 05 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What the Data Actually Shows</title>
      <link>https://jonnonz.com/posts/what-the-data-actually-shows/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/what-the-data-actually-shows/</guid>
      <description>
        Before throwing deep learning at Auckland crime data, you need to actually look at it. Seasonal patterns, spatial hotspots, and the sparsity problem.
      </description>
      <content:encoded>
        <![CDATA[<p>You can't just shove a tensor into a neural network and hope for the best.</p>
<p>I mean, you <em>can</em>. People do it all the time. But you'll have no idea whether
your model is learning something real or just memorising noise. Before we get
anywhere near ConvLSTM or ST-ResNet, we need to properly understand what
patterns actually exist in this data, and whether they're strong enough for a
model to learn.</p>
<p>This is the part that most ML blog posts skip. It's also the part that saves you
weeks of debugging later.</p>
<h2>When does crime happen?</h2>
<p>The monthly pattern across Auckland is surprisingly consistent year to year.
Crime peaks in late spring and early summer (October through January) and dips
in late summer through winter. February is reliably the quietest month at around
7,000–8,000 victimisations, while November and December regularly push past
9,000.</p>
<p>This tracks with what
<a href="https://link.springer.com/article/10.1007/s43762-023-00094-x">criminology research has found globally</a>:
warmer months mean more people out and about, more opportunities for property
crime, and more interpersonal conflict. It's a well-documented pattern called
seasonal variation in crime, and it shows up clearly in the NZ data.</p>
<p>The seasonal signal isn't uniform across crime types though. Theft drives most
of the swing. It surges in summer and drops in winter, accounting for nearly all
the monthly variance. Assault has its own rhythm. It peaks around the holiday
period (December–January) and shows a secondary bump in winter weekends,
probably pub-related. Burglary is flatter, with a slight winter uptick when
houses are dark earlier.</p>
<p>2023 was the peak year across the board, with a noticeable decline through 2024
and into early 2025. Whether that's a real trend or a reporting artefact, I
genuinely don't know. But it means the model's training data includes both an
upswing and a downswing, which is useful. It can't just learn &quot;crime always goes
up.&quot;</p>
<h2>Where does crime cluster?</h2>
<p>Crime in Auckland is not randomly distributed. That's obvious to anyone who
lives here, but it's worth quantifying.</p>
<p>Running a
<a href="https://www.publichealth.columbia.edu/research/population-health-methods/hot-spot-spatial-analysis">Moran's I test</a>
on our 500m grid confirms strong positive spatial autocorrelation. Cells with
high crime counts are surrounded by other high-crime cells. The Moran's I
statistic comes out at 0.43 (p &lt; 0.001), which means the clustering is highly
significant. Crime begets more crime in adjacent cells.</p>
<p>The hotspots are exactly where you'd expect. The CBD dominates: Queen Street,
Karangahape Road, and the surrounding blocks consistently light up across all
crime types. South Auckland corridors (Manukau, Ōtāhuhu, Papatoetoe) form a
second cluster, particularly for assault and robbery. Henderson in the west
shows up for burglary.</p>
<p>What's less obvious is how stable these hotspots are over time. The top 5% of
cells (about 227 cells) account for over 60% of all recorded crime across the
entire four-year period. These aren't random spikes. They're persistent. A cell
that's hot in 2022 is almost certainly still hot in 2025. That temporal
persistence is exactly what makes this data amenable to prediction. If hotspots
moved randomly month to month, no model could learn them.</p>
<h2>Crime type correlations</h2>
<p>The six channels in our tensor don't behave independently. Theft and burglary
show moderate positive correlation (r ≈ 0.52). Cells with lots of theft tend to
have more burglary too, which makes sense given similar opportunity structures
(commercial areas, transport hubs).</p>
<p>Assault correlates weakly with everything else (r ≈ 0.15–0.25). It has its own
spatial logic (nightlife areas, specific residential pockets) that doesn't align
neatly with property crime.</p>
<p>Robbery, sexual offences, and harm are so sparse at the 500m monthly resolution
that correlation analysis is basically meaningless. Most cells have zero counts
for these types in any given month. That sparsity is going to be a real headache
for the models.</p>
<h2>The sparsity problem, again</h2>
<p>We flagged this in Part 3: 91.7% of the tensor is zeros. But the EDA makes the
problem even clearer.</p>
<p>The distribution of non-zero cell values is heavily right-skewed. The median
non-zero value is 1. One crime, in one cell, in one month. The mean is about
2.3. A handful of cells (the CBD, Manukau) hit 30–50+ in peak months for theft.
The model needs to learn the difference between &quot;always zero&quot; cells,
&quot;occasionally one&quot; cells, and &quot;consistently busy&quot; cells.</p>
<p>If you plot the crime count distribution across non-zero cells, it follows
something close to a power law. A tiny number of cells carry an outsized share
of the signal. This is textbook
<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC7319308/">spatial concentration of crime</a>,
documented in basically every city ever studied.</p>
<p>For modelling, this means two things. First, aggregate metrics like RMSE will be
dominated by how well the model predicts the high-count cells. Second,
predicting &quot;zero&quot; for a sparse cell is almost always correct but completely
uninformative. We'll need to think carefully about what &quot;accuracy&quot; actually
means when we get to evaluation.</p>
<h2>What this means for the models</h2>
<p>The EDA tells us a few things that should directly shape how we build and
evaluate the models:</p>
<p>The seasonal signal is strong and consistent. A model that can't capture monthly
seasonality is worse than useless. It's worse than a calendar.</p>
<p>Spatial structure is real and persistent. Hotspots don't move much. A model that
learns static spatial patterns will get a lot of the way there, even without
understanding temporal dynamics.</p>
<p>We already know the CBD will have lots of theft next month. That's not what
we're trying to predict. The real value is in the margins: the cells that go
from quiet to active, or the months where a normally stable area spikes. That's
where deep learning might actually add something over simple baselines.</p>
<p>Speaking of which, we need baselines. Otherwise we won't know if ConvLSTM is
actually clever or just expensive. That's next.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>How I Built an SMS Gateway with a $20 Android Phone</title>
      <link>https://jonnonz.com/posts/built-an-sms-gateway-with-a-20-dollar-android-phone/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/built-an-sms-gateway-with-a-20-dollar-android-phone/</guid>
      <description>
        Turn any Android phone into a programmable SMS gateway for your SaaS — no per-message fees, no carrier contracts, no vendor lock-in.
      </description>
      <content:encoded>
        <![CDATA[<p>Twilio charges around $0.05–0.06 per SMS round-trip. Doesn't sound like much
until you're building an MVP that sends reminders, confirmations, and
notifications — suddenly you're looking at $50/month for a thousand messages.
For an app that's not making money yet, that's a dumb tax.</p>
<p>Here's what I did instead: grabbed a cheap Android phone, installed an
open-source app called
<a href="https://github.com/capcom6/android-sms-gateway">SMS Gateway for Android</a>, and
turned it into a full SMS gateway with a REST API. My SMS costs dropped to
whatever my mobile plan charges — which on plenty of prepaid plans is zero.
Unlimited texts.</p>
<p>This post walks through exactly how to wire it into a Next.js app, from first
install to receiving webhooks. The whole thing took an afternoon.</p>
<hr>
<h2>What You're Building</h2>
<p>By the end of this you'll have:</p>
<ul>
<li>An Android phone acting as your SMS gateway</li>
<li>A webhook endpoint receiving inbound SMS in real-time</li>
<li>Outbound SMS sent via a simple REST API call</li>
<li>A provider abstraction so you can swap between SMS Gateway, Twilio, or console
logging</li>
</ul>
<h2>Prerequisites</h2>
<ul>
<li>An Android phone (5.0+) with a SIM card</li>
<li>A Next.js app (I'm using 15 with App Router, but any backend works)</li>
<li>Node.js 18+</li>
<li>ngrok for testing with cloud mode</li>
</ul>
<hr>
<h2>Install SMS Gateway on Android</h2>
<ol>
<li>
<p>Install <strong>SMS Gateway for Android</strong> from the
<a href="https://play.google.com/store/apps/details?id=me.capcom.smsgateway">Google Play Store</a>
or grab the APK from
<a href="https://github.com/capcom6/android-sms-gateway/releases">GitHub Releases</a></p>
</li>
<li>
<p>Open the app and <strong>grant SMS permissions</strong> when prompted</p>
</li>
<li>
<p>You'll see the main screen with toggles for Local Server and Cloud Server:</p>
</li>
</ol>
<p><img src="https://jonnonz.com/img/posts/sms-gateway/screenshot.png" alt="SMS Gateway main screen"></p>
<p>The app supports two modes — local and cloud. Both work well, and I'll cover
each.</p>
<hr>
<h2>Local Server Mode</h2>
<p>Local mode runs an HTTP server directly on the phone. Your backend talks to it
over your local network. No cloud dependency, no third-party servers — the
simplest setup.</p>
<h3>Configure It</h3>
<p><img src="https://jonnonz.com/img/posts/sms-gateway/local-server.png" alt="Local server settings"> <em>Local server
configuration</em></p>
<ol>
<li>Toggle <strong>&quot;Local Server&quot;</strong> on</li>
<li>Go to <strong>Settings &gt; Local Server</strong> to configure:
<ul>
<li><strong>Port:</strong> 1024–65535 (default <code>8080</code>)</li>
<li><strong>Username:</strong> minimum 3 characters</li>
<li><strong>Password:</strong> minimum 8 characters</li>
</ul>
</li>
<li>Tap <strong>&quot;Offline&quot;</strong> — it changes to <strong>&quot;Online&quot;</strong></li>
<li>Note the <strong>local IP address</strong> displayed (e.g. <code>192.168.1.50</code>)</li>
</ol>
<p>Your phone is now running an HTTP server. Verify it:</p>
<pre><code class="language-bash"># Health check
curl http://192.168.1.50:8080/health

# Swagger docs
open http://192.168.1.50:8080/docs
</code></pre>
<h3>Send Your First SMS</h3>
<pre><code class="language-bash">curl -X POST http://192.168.1.50:8080/message \
  -u &quot;admin:yourpassword&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d '{
    &quot;textMessage&quot;: { &quot;text&quot;: &quot;Hello from my SMS gateway!&quot; },
    &quot;phoneNumbers&quot;: [&quot;+15551234567&quot;]
  }'
</code></pre>
<p>That's it. The phone sends the SMS from its own number, using your mobile plan's
rates.</p>
<h3>Register a Webhook for Inbound SMS</h3>
<p>To receive SMS messages as webhooks:</p>
<pre><code class="language-bash">curl -X POST http://192.168.1.50:8080/webhooks \
  -u &quot;admin:yourpassword&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d '{
    &quot;id&quot;: &quot;my-webhook&quot;,
    &quot;url&quot;: &quot;http://192.168.1.100:4000/api/sms/webhook&quot;,
    &quot;event&quot;: &quot;sms:received&quot;
  }'
</code></pre>
<p>Replace <code>192.168.1.100</code> with your dev machine's local IP. Both devices need to
be on the same WiFi network.</p>
<h3>Local Mode Gotchas</h3>
<ul>
<li><strong>AP isolation:</strong> Many routers — especially mesh networks and office WiFi —
block device-to-device traffic. If you can't reach the phone, check your
router settings for &quot;AP isolation&quot; or &quot;client isolation&quot; and disable it. This
one caught me out for a good 20 minutes.</li>
<li><strong>Battery optimisation:</strong> Android will kill the background server to save
battery. Disable battery optimisation for SMS Gateway in your phone settings.
<a href="https://dontkillmyapp.com/">dontkillmyapp.com</a> has device-specific
instructions — genuinely useful site.</li>
<li><strong>Keep it plugged in:</strong> During development and in production, the phone lives
on a charger. It's not going anywhere.</li>
</ul>
<hr>
<h2>Cloud Server Mode</h2>
<p>Cloud mode is easier to set up and works from anywhere — no local network
required. The phone connects to SMS Gateway's cloud relay (<code>api.sms-gate.app</code>),
and your backend talks to the same cloud API.</p>
<p><img src="https://jonnonz.com/img/posts/sms-gateway/cloud-server.png" alt="Cloud server settings"> <em>Cloud server
configuration</em></p>
<h3>Enable It</h3>
<ol>
<li>Toggle <strong>&quot;Cloud Server&quot;</strong> on in the app</li>
<li>Tap <strong>&quot;Offline&quot;</strong> — it connects and registers automatically</li>
<li>A <strong>username</strong> and <strong>password</strong> are auto-generated (visible in the Cloud
Server section)</li>
<li>Note these credentials — you'll need them for API calls</li>
</ol>
<p>The cloud uses a hybrid push architecture: Firebase Cloud Messaging as the
primary channel, Server-Sent Events as fallback, and 15-minute polling as a last
resort. It's well thought through.</p>
<h3>Send an SMS via Cloud API</h3>
<pre><code class="language-bash">curl -X POST https://api.sms-gate.app/3rdparty/v1/messages \
  -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d '{
    &quot;textMessage&quot;: { &quot;text&quot;: &quot;Hello from the cloud!&quot; },
    &quot;phoneNumbers&quot;: [&quot;+15551234567&quot;]
  }'
</code></pre>
<h3>Register a Webhook (Cloud Mode)</h3>
<p>Your webhook URL <strong>must be HTTPS</strong> in cloud mode. For local development, use
ngrok:</p>
<pre><code class="language-bash"># Start ngrok tunnel to your dev server
ngrok http 4000
# Output: https://abc123.ngrok.app

# Register the webhook
curl -X POST https://api.sms-gate.app/3rdparty/v1/webhooks \
  -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d '{
    &quot;url&quot;: &quot;https://abc123.ngrok.app/api/sms/webhook&quot;,
    &quot;event&quot;: &quot;sms:received&quot;
  }'
</code></pre>
<h3>Manage Webhooks</h3>
<pre><code class="language-bash"># List webhooks
curl -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \
  https://api.sms-gate.app/3rdparty/v1/webhooks

# Delete a webhook
curl -X DELETE -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \
  https://api.sms-gate.app/3rdparty/v1/webhooks/WEBHOOK_ID
</code></pre>
<hr>
<h2>The Code — Next.js Integration</h2>
<p>Here's how I integrated SMS Gateway into a Next.js app with a clean provider
abstraction. The idea is simple — swap providers without touching business
logic.</p>
<h3>Provider Interface</h3>
<pre><code class="language-typescript">// src/lib/sms/provider.ts

export interface InboundSms {
  from: string;
  body: string;
  receivedAt?: Date;
}

export interface SmsProvider {
  send(to: string, body: string): Promise&lt;string&gt;;
  parseWebhook(req: Request): Promise&lt;InboundSms | null&gt;;
  webhookResponse(replyText?: string): Response;
}

export async function getSmsProvider(): Promise&lt;SmsProvider&gt; {
  const provider = process.env.SMS_PROVIDER || &quot;sms-gate&quot;;

  switch (provider) {
    case &quot;sms-gate&quot;: {
      const { SmsGateProvider } = await import(&quot;./sms-gate&quot;);
      return new SmsGateProvider();
    }
    case &quot;console&quot;: {
      const { ConsoleProvider } = await import(&quot;./console&quot;);
      return new ConsoleProvider();
    }
    default:
      throw new Error(`Unknown SMS provider: ${provider}`);
  }
}
</code></pre>
<h3>SMS Gate Provider</h3>
<p>The provider handles both local and cloud API differences:</p>
<pre><code class="language-typescript">// src/lib/sms/sms-gate.ts

import type { InboundSms, SmsProvider } from &quot;./provider&quot;;

const SMSGATE_URL = process.env.SMSGATE_URL || &quot;http://localhost:8080&quot;;
const SMSGATE_USER = process.env.SMSGATE_USER || &quot;&quot;;
const SMSGATE_PASSWORD = process.env.SMSGATE_PASSWORD || &quot;&quot;;

export class SmsGateProvider implements SmsProvider {
  private headers(): Record&lt;string, string&gt; {
    const auth = Buffer.from(
      `${SMSGATE_USER}:${SMSGATE_PASSWORD}`,
    ).toString(&quot;base64&quot;);
    return {
      &quot;Content-Type&quot;: &quot;application/json&quot;,
      Authorization: `Basic ${auth}`,
    };
  }

  async send(to: string, body: string): Promise&lt;string&gt; {
    const isCloud = SMSGATE_URL.includes(&quot;api.sms-gate.app&quot;);
    const endpoint = isCloud
      ? `${SMSGATE_URL}/3rdparty/v1/messages`
      : `${SMSGATE_URL}/api/3rdparty/v1/message`;
    const payload = isCloud
      ? { textMessage: { text: body }, phoneNumbers: [to] }
      : { phoneNumbers: [to], message: body };

    const res = await fetch(endpoint, {
      method: &quot;POST&quot;,
      headers: this.headers(),
      body: JSON.stringify(payload),
    });

    if (!res.ok) {
      const err = await res.text();
      throw new Error(`SMS Gate send failed: ${res.status} ${err}`);
    }

    const data = await res.json();
    return data.id || &quot;sent&quot;;
  }

  async parseWebhook(req: Request): Promise&lt;InboundSms | null&gt; {
    try {
      const body = await req.json();

      if (body.event !== &quot;sms:received&quot; || !body.payload) {
        return null;
      }

      // The webhook payload documents `sender`; fall back to `phoneNumber`
      // in case an older app version sends that instead
      const { sender, phoneNumber, message, receivedAt } = body.payload;
      const from = sender || phoneNumber;
      if (!from || !message) return null;

      return {
        from,
        body: message,
        receivedAt: receivedAt ? new Date(receivedAt) : new Date(),
      };
    } catch {
      return null;
    }
  }

  webhookResponse(): Response {
    return new Response(JSON.stringify({ ok: true }), {
      headers: { &quot;Content-Type&quot;: &quot;application/json&quot; },
    });
  }
}
</code></pre>
<h3>Webhook Route</h3>
<p>A basic webhook handler that receives inbound SMS and replies:</p>
<pre><code class="language-typescript">// src/app/api/sms/webhook/route.ts

import { NextRequest } from &quot;next/server&quot;;
import { getSmsProvider } from &quot;@/lib/sms/provider&quot;;
// Bring your own lookup, e.g.:
// import { findUserByPhone } from &quot;@/lib/users&quot;;

export async function POST(req: NextRequest) {
  const provider = await getSmsProvider();
  const sms = await provider.parseWebhook(req);

  if (!sms) {
    return new Response(&quot;Bad request&quot;, { status: 400 });
  }

  const { from, body } = sms;

  // Look up the sender — replace with your own user lookup
  const user = await findUserByPhone(from);

  if (!user) {
    await provider.send(from, &quot;Hey! Text us back once you've signed up.&quot;);
    return provider.webhookResponse();
  }

  // Known user — do whatever your app needs
  console.log(`[SMS from ${from}]: ${body}`);
  await provider.send(from, &quot;Got it — we're on it!&quot;);
  return provider.webhookResponse();
}
</code></pre>
<h3>Console Provider (for Testing)</h3>
<p>For local development without a phone:</p>
<pre><code class="language-typescript">// src/lib/sms/console.ts

import type { InboundSms, SmsProvider } from &quot;./provider&quot;;

export class ConsoleProvider implements SmsProvider {
  async send(to: string, body: string): Promise&lt;string&gt; {
    console.log(`[SMS -&gt; ${to}] ${body}`);
    return `console-${Date.now()}`;
  }

  async parseWebhook(req: Request): Promise&lt;InboundSms | null&gt; {
    const data = await req.json();
    return {
      from: data.from || &quot;+15550000000&quot;,
      body: data.body || &quot;&quot;,
      receivedAt: new Date(),
    };
  }

  webhookResponse(): Response {
    return new Response(JSON.stringify({ ok: true }), {
      headers: { &quot;Content-Type&quot;: &quot;application/json&quot; },
    });
  }
}
</code></pre>
<h3>Environment Variables</h3>
<pre><code class="language-bash"># .env

# Provider: &quot;sms-gate&quot; | &quot;console&quot;
SMS_PROVIDER=sms-gate

# Local mode
SMSGATE_URL=http://192.168.1.50:8080
SMSGATE_USER=admin
SMSGATE_PASSWORD=yourpassword

# Cloud mode
# SMSGATE_URL=https://api.sms-gate.app
# SMSGATE_USER=auto-generated-username
# SMSGATE_PASSWORD=auto-generated-password
</code></pre>
<hr>
<h2>Webhook Payload Reference</h2>
<p>When someone texts your Android phone, SMS Gateway sends a POST to your webhook
URL:</p>
<pre><code class="language-json">{
  &quot;id&quot;: &quot;Ey6ECgOkVVFjz3CL48B8C&quot;,
  &quot;webhookId&quot;: &quot;LreFUt-Z3sSq0JufY9uWB&quot;,
  &quot;deviceId&quot;: &quot;your-device-id&quot;,
  &quot;event&quot;: &quot;sms:received&quot;,
  &quot;payload&quot;: {
    &quot;messageId&quot;: &quot;abc123&quot;,
    &quot;message&quot;: &quot;Hello!&quot;,
    &quot;sender&quot;: &quot;+15551234567&quot;,
    &quot;recipient&quot;: &quot;+15559876543&quot;,
    &quot;simNumber&quot;: 1,
    &quot;receivedAt&quot;: &quot;2026-04-01T12:41:59.000+00:00&quot;
  }
}
</code></pre>
<h3>Available Events</h3>
<table>
<thead>
<tr>
<th>Event</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>sms:received</code></td>
<td>Inbound SMS received</td>
</tr>
<tr>
<td><code>sms:sent</code></td>
<td>Outbound SMS sent</td>
</tr>
<tr>
<td><code>sms:delivered</code></td>
<td>Outbound SMS confirmed delivered</td>
</tr>
<tr>
<td><code>sms:failed</code></td>
<td>Outbound SMS failed</td>
</tr>
<tr>
<td><code>system:ping</code></td>
<td>Heartbeat — device still alive</td>
</tr>
</tbody>
</table>
<h3>Webhook Security</h3>
<p>SMS Gateway signs webhook payloads with HMAC-SHA256. Two headers are included:</p>
<ul>
<li><code>X-Signature</code> — hex-encoded HMAC-SHA256 signature</li>
<li><code>X-Timestamp</code> — Unix timestamp used in signing</li>
</ul>
<pre><code class="language-typescript">import crypto from &quot;crypto&quot;;

function verifyWebhook(
  signingKey: string,
  payload: string,
  timestamp: string,
  signature: string,
): boolean {
  const expected = crypto
    .createHmac(&quot;sha256&quot;, signingKey)
    .update(payload + timestamp)
    .digest(&quot;hex&quot;);
  const expectedBuf = Buffer.from(expected, &quot;hex&quot;);
  const signatureBuf = Buffer.from(signature, &quot;hex&quot;);
  // timingSafeEqual throws if the lengths differ, so guard first
  if (expectedBuf.length !== signatureBuf.length) return false;
  return crypto.timingSafeEqual(expectedBuf, signatureBuf);
}
</code></pre>
<h3>Retry Behaviour</h3>
<p>If your server doesn't respond with a 2xx within 30 seconds, SMS Gateway
retries with exponential backoff — starting at 10 seconds, doubling each time,
up to 14 attempts (~2 days). Solid default behaviour; you don't need to
configure anything.</p>
<hr>
<h2>Testing the Full Flow</h2>
<h3>1. Start Your Dev Server</h3>
<pre><code class="language-bash">npm run dev
# Next.js running at http://localhost:4000
</code></pre>
<h3>2. Expose It (Cloud Mode)</h3>
<pre><code class="language-bash">ngrok http 4000
# https://abc123.ngrok.app -&gt; http://localhost:4000
</code></pre>
<h3>3. Register the Webhook</h3>
<pre><code class="language-bash">curl -X POST https://api.sms-gate.app/3rdparty/v1/webhooks \
  -u &quot;USERNAME:PASSWORD&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d '{
    &quot;url&quot;: &quot;https://abc123.ngrok.app/api/sms/webhook&quot;,
    &quot;event&quot;: &quot;sms:received&quot;
  }'
</code></pre>
<h3>4. Send a Text</h3>
<p>Text your Android phone from another phone. You should see:</p>
<ol>
<li>SMS Gateway receives the text</li>
<li>Webhook fires to your ngrok URL</li>
<li>Your Next.js server processes it</li>
<li>A reply SMS is sent back via the API</li>
<li>The sender's phone receives the reply</li>
</ol>
<p>That moment when the reply lands on your phone — genuinely satisfying.</p>
<h3>Test Without a Phone</h3>
<pre><code class="language-bash"># Simulate an inbound SMS with the console provider
SMS_PROVIDER=console npm run dev

curl -X POST http://localhost:4000/api/sms/webhook \
  -H &quot;Content-Type: application/json&quot; \
  -d '{&quot;from&quot;: &quot;+15551234567&quot;, &quot;body&quot;: &quot;Hello&quot;}'
</code></pre>
<hr>
<h2>Production Considerations</h2>
<h3>The Phone Setup</h3>
<ul>
<li><strong>Dedicated device:</strong> Use a cheap Android phone ($20) with a prepaid SIM. It
stays on a charger, connected to WiFi. That's its whole life now.</li>
<li><strong>Battery optimisation off:</strong> Disable battery optimisation for SMS Gateway or
Android will kill it. <a href="https://dontkillmyapp.com/">dontkillmyapp.com</a> for your
specific device.</li>
<li><strong>Auto-start:</strong> Enable &quot;start on boot&quot; in the SMS Gateway app settings.</li>
<li><strong>Monitoring:</strong> Register a <code>system:ping</code> webhook to alert if the device goes
offline.</li>
</ul>
<h3>Local vs Cloud</h3>
<table>
<thead>
<tr>
<th></th>
<th>Local</th>
<th>Cloud</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Latency</strong></td>
<td>Lower (direct)</td>
<td>Slightly higher (relay)</td>
</tr>
<tr>
<td><strong>Network</strong></td>
<td>Same network required</td>
<td>Works from anywhere</td>
</tr>
<tr>
<td><strong>Privacy</strong></td>
<td>Messages never leave your network</td>
<td>Messages transit through SMS Gateway's servers</td>
</tr>
<tr>
<td><strong>Reliability</strong></td>
<td>Depends on your network</td>
<td>Adds FCM/SSE redundancy</td>
</tr>
<tr>
<td><strong>Cost</strong></td>
<td>Free</td>
<td>Free (community tier)</td>
</tr>
</tbody>
</table>
<p>I use <strong>cloud mode in production</strong> because my server's hosted on Railway and
can't reach the phone's local network. For development on the same WiFi, local
mode is simpler and faster.</p>
<h3>Cost Comparison</h3>
<table>
<thead>
<tr>
<th>Provider</th>
<th>SMS Cost</th>
<th>Monthly (1,000 msgs)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Twilio</td>
<td>~$0.05/msg</td>
<td>~$50</td>
</tr>
<tr>
<td>SMS Gateway + Prepaid SIM</td>
<td>$0/msg (unlimited plan)</td>
<td>~$8 (plan cost)</td>
</tr>
</tbody>
</table>
<p>That's an <strong>80%+ saving</strong>, and the gap only widens with volume — 10,000
messages a month is still just your plan cost.</p>
<hr>
<p>It's worth knowing this is a whole category now. <a href="https://httpsms.com/">httpSMS</a>
and <a href="https://textbee.dev/">textbee</a> do similar things. I went with
<a href="https://github.com/capcom6/android-sms-gateway">SMS Gateway for Android</a>
because the local mode is properly useful for development, the
<a href="https://docs.sms-gate.app/">documentation</a> is solid, and it's actively
maintained — v1.56.0 dropped in March 2026.</p>
<p>For an MVP, the maths is obvious. A $20 phone and an $8/month plan gets you a
programmable SMS gateway that you fully control. No per-message fees, no carrier
contracts, no vendor lock-in. If you outgrow it, swap the provider interface to
Twilio and you're done — that's why the abstraction exists.</p>
<p><strong>Links:</strong></p>
<ul>
<li><a href="https://github.com/capcom6/android-sms-gateway">SMS Gateway for Android on GitHub</a></li>
<li><a href="https://docs.sms-gate.app/">SMS Gateway Documentation</a></li>
<li><a href="https://play.google.com/store/apps/details?id=me.capcom.smsgateway">Google Play Store listing</a></li>
</ul>
]]>
      </content:encoded>
      <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Wrangling a Million Crime Records</title>
      <link>https://jonnonz.com/posts/wrangling-a-million-crime-records/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/wrangling-a-million-crime-records/</guid>
      <description>
        NZ Police's crime dataset is publicly available, but it's UTF-16 encoded, full of trailing periods, and 32% of records don't know what time the crime happened. Here's how we cleaned it up.
      </description>
      <content:encoded>
        <![CDATA[<p>The very first thing NZ Police's crime dataset teaches you is that government
data is never straightforward.</p>
<p>You download the CSV from
<a href="https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz/victimisation-time-and-place">policedata.nz</a>,
expecting to do a quick <code>pd.read_csv()</code> and start exploring. Instead you get a
503MB file encoded in UTF-16 Little Endian with tab delimiters. Not a regular
CSV. Not even close. This is a legacy format from old Excel exports and most
tools just silently corrupt it if you try to read it as UTF-8.</p>
<pre><code class="language-python">df = pd.read_csv(&quot;data.csv&quot;, encoding=&quot;utf-16-le&quot;, sep=&quot;\t&quot;)
</code></pre>
<p>That one line took longer to figure out than I'd like to admit.</p>
<h2>What's actually in here</h2>
<p>Once you get past the encoding, there's a lot to work with. 1,154,102 rows
covering every reported victimisation in New Zealand from February 2022 through
January 2026. Each row tells you the crime type (ANZSOC Division), where it
happened (down to meshblock level), when it happened (month, day of week, hour
of day), and sometimes what weapon was involved.</p>
<p>There are 20 columns, but five of them are useless: three are duplicates of
&quot;Year Month&quot; and two are constants that add zero information. Every area name
and territorial authority has a trailing period stuck on the end: &quot;Auckland.&quot;,
&quot;Woodglen.&quot;, &quot;Christchurch City.&quot;. A quirk of the export that'll break any
geographic join if you don't strip them.</p>
<p>And meshblock IDs? Some are 6 digits, some are 7. Stats NZ boundary files use
7-digit codes consistently, so shorter ones need zero-padding. The kind of thing
that's invisible until your join silently drops 19% of your records and you
spend an afternoon figuring out why.</p>
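<p>Both fixes are one-liners once you know they're needed. A minimal sketch, with
guessed column names rather than the raw export's actual headers:</p>
<pre><code class="language-python"># Strip the trailing periods the export sticks on every name
for col in [&quot;area_unit&quot;, &quot;territorial_authority&quot;]:
    df[col] = df[col].str.rstrip(&quot;.&quot;)

# Zero-pad meshblock IDs to 7 digits so boundary joins don't silently miss
df[&quot;meshblock&quot;] = df[&quot;meshblock&quot;].astype(str).str.zfill(7)
</code></pre>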
<h2>What the missing data tells you</h2>
<p>That bit actually made me stop and think. 32.2% of records have the hour of day
recorded as 99 (unknown). Another 23.2% have the day of week as &quot;UNKNOWN&quot;.</p>
<p>At first this looks like a data quality problem. But it's not. It's telling you
something about the nature of the crime. If someone breaks into your house while
you're at work, you come home to find your stuff gone. Was it 9am or 2pm? You've
got no idea, and neither do the police.</p>
<p>Property crimes (theft, burglary) make up the bulk of these unknowns. Assault,
by contrast, almost always has a precise time because there's a victim present
when it happens. The absence of data is itself a signal about what kind of crime
you're looking at.</p>
<p>78.6% of location type values are &quot;.&quot; (effectively missing). That column is
sparsely populated but still useful for the roughly one in five records that
have it.</p>
<h2>Cleaning it up</h2>
<p>We built a modular pipeline where each cleaning step is its own function.
Nothing fancy, just practical:</p>
<pre><code class="language-python">def ingest() -&gt; pd.DataFrame:
    df = load_raw_csv(RAW_CSV)            # UTF-16 LE, tab-delimited
    df = drop_redundant_columns(df)        # Remove 5 useless columns
    df = rename_columns(df)                # snake_case everything
    df = parse_dates(df)                   # &quot;July 2022&quot; → datetime
    df = clean_strings(df)                 # Strip trailing periods
    df = clean_meshblocks(df)              # Zero-pad to 7 digits
    df = encode_unknowns(df)              # 99 → NaN, &quot;UNKNOWN&quot; → NaN
    df = map_crime_types(df)               # ANZSOC Division → short enum
    return df
</code></pre>
<p>Each function does one thing. If something breaks, you know exactly where. If
someone wants to understand the pipeline, they can read it top to bottom in
about thirty seconds. I've been bitten enough times by monolithic data scripts
that I'm allergic to them now.</p>
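<p>For a flavour of what lives inside those steps, here's a sketch of two of
them. The column names are assumptions about what things look like
post-rename:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

def parse_dates(df: pd.DataFrame) -&gt; pd.DataFrame:
    # &quot;July 2022&quot; → a proper month-granularity datetime
    df[&quot;date&quot;] = pd.to_datetime(df[&quot;year_month&quot;], format=&quot;%B %Y&quot;)
    return df

def encode_unknowns(df: pd.DataFrame) -&gt; pd.DataFrame:
    # 99 is the &quot;unknown hour&quot; sentinel; make it an honest NaN
    df[&quot;hour&quot;] = df[&quot;hour&quot;].replace(99, np.nan)
    df[&quot;day_of_week&quot;] = df[&quot;day_of_week&quot;].replace(&quot;UNKNOWN&quot;, np.nan)
    return df
</code></pre>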
<p>The crime type mapping turns six ANZSOC Division values into short enums:</p>
<table>
<thead>
<tr>
<th>Crime Type</th>
<th>Count</th>
<th>Share</th>
</tr>
</thead>
<tbody>
<tr>
<td>Theft</td>
<td>761,977</td>
<td>66.0%</td>
</tr>
<tr>
<td>Burglary</td>
<td>247,034</td>
<td>21.4%</td>
</tr>
<tr>
<td>Assault</td>
<td>115,383</td>
<td>10.0%</td>
</tr>
<tr>
<td>Robbery</td>
<td>14,860</td>
<td>1.3%</td>
</tr>
<tr>
<td>Sexual</td>
<td>13,943</td>
<td>1.2%</td>
</tr>
<tr>
<td>Harm</td>
<td>905</td>
<td>0.1%</td>
</tr>
</tbody>
</table>
<p>That 66% theft number is going to haunt us when we get to model training. Any
loss function you throw at this data will overwhelmingly optimise for predicting
theft, because that's two-thirds of everything. The class imbalance is real and
it matters.</p>
<h2>503MB to 6.3MB</h2>
<p>The cleaned output goes to
<a href="https://www.datacamp.com/tutorial/apache-parquet">Apache Parquet</a> with snappy
compression. The result?</p>
<ul>
<li><strong>Input</strong>: 503MB CSV (UTF-16, 20 columns)</li>
<li><strong>Output</strong>: 6.3MB Parquet (21 columns including derived fields)</li>
<li><strong>Compression</strong>: ~80x</li>
</ul>
<p>That's not a typo. Parquet's columnar storage is dramatically more efficient
than row-oriented CSV, especially when you've got columns full of repeated
values like crime types and territorial authorities. The file loads in under a
second compared to 3+ seconds for the CSV. When you're iterating on analysis and
loading this data hundreds of times, that adds up fast.</p>
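<p>The conversion itself is barely any code (pandas with <code>pyarrow</code>
installed; the filename is illustrative):</p>
<pre><code class="language-python"># Snappy is the default codec; being explicit documents the choice
df.to_parquet(&quot;crimes_clean.parquet&quot;, compression=&quot;snappy&quot;)

# Reloading takes well under a second
df = pd.read_parquet(&quot;crimes_clean.parquet&quot;)
</code></pre>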
<p>The 21 output columns include the original 16 we kept plus five derived ones: a
proper datetime, year, month, day-of-week as an integer, and the short crime
type enum.</p>
<h2>Sanity checks</h2>
<p>Before calling the data clean, we verify everything that matters:</p>
<ul>
<li>Row count: 1,154,102 (all rows preserved, nothing dropped)</li>
<li>No nulls in key columns: crime_type, date, area_unit, territorial_authority,
meshblock</li>
<li>Date range: Feb 2022 to Jan 2026 (all 48 months present)</li>
<li>Auckland: 412,669 records, 36% of total (exactly where it should be)</li>
<li>Theft: 761,977 records, 66% (as expected)</li>
<li>No trailing periods anywhere in area names</li>
<li>All meshblock IDs are 7 digits</li>
<li>Max hour value is 23 (no more 99s leaking through)</li>
</ul>
<p>You want these checks automated and running every time you regenerate the data.
Future you will thank past you when something upstream changes and a check
catches it.</p>
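<p>In practice that's a function full of asserts at the end of the pipeline. A
sketch, again with assumed column names:</p>
<pre><code class="language-python">def sanity_check(df: pd.DataFrame) -&gt; None:
    # Hard-fail the pipeline if an upstream change breaks an invariant
    assert len(df) == 1_154_102, &quot;row count changed&quot;
    assert df[&quot;crime_type&quot;].notna().all(), &quot;null crime types&quot;
    assert df[&quot;meshblock&quot;].str.len().eq(7).all(), &quot;bad meshblock width&quot;
    assert df[&quot;hour&quot;].max() &lt;= 23, &quot;a 99 leaked through&quot;
</code></pre>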
<h2>What's next</h2>
<p>We've got clean, compressed crime data, but the records only have meshblock IDs
and area unit names. No coordinates. No shapes on a map. In the next post, we'll
download Stats NZ geographic boundary files and join them to our crime records,
giving every victimisation a place in physical space.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Crime as Video</title>
      <link>https://jonnonz.com/posts/crime-as-video/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/crime-as-video/</guid>
      <description>
        Turn a million geo-tagged crime records into a 4D tensor by overlaying a 500m grid on Auckland. Crime prediction becomes video prediction.
      </description>
      <content:encoded>
        <![CDATA[<p>This is where the project gets properly fun.</p>
<p>We've got 1.15 million clean crime records. Every one of them has coordinates:
either precise meshblock centroids or area unit fallbacks from Part 2. But a bag
of lat/lon points isn't what a neural network wants. ConvLSTM and ST-ResNet are
fundamentally image-processing architectures. They expect regular 2D grids, rows
and columns, like pixels in a photograph.</p>
<p>So our job now is to convert the messy reality of crime locations into clean,
regular &quot;crime images&quot; that a convolutional network can actually consume. And
once you see it framed that way, crime prediction becomes video prediction. Each
month is a frame. Each grid cell is a pixel. The brightness is the crime count.</p>
<h2>Choosing 500m</h2>
<p>This is the single most consequential decision in the entire data pipeline. Get
the grid resolution wrong and everything downstream suffers.</p>
<p>Too fine (say 100m cells) and the vast majority of cells are empty in any given
month. The model sees an ocean of zeros with occasional spikes, which is
incredibly hard to learn from. Too coarse (say 2km) and you've blurred away the
spatial patterns you're trying to detect. &quot;Auckland CBD&quot; and &quot;Ponsonby&quot; become
the same cell, which is useless.</p>
<p>We computed Auckland's urban crime extent from the meshblock centroids (5th to
95th percentile to exclude outliers like Great Barrier Island):</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Urban extent</td>
<td>27.7 km × 36.9 km</td>
</tr>
<tr>
<td>Grid resolution</td>
<td>500m × 500m</td>
</tr>
<tr>
<td>Grid dimensions</td>
<td>77 rows × 59 columns</td>
</tr>
<tr>
<td>Total cells</td>
<td>4,543</td>
</tr>
</tbody>
</table>
<p>At 500m, each cell covers roughly a few city blocks. That's fine enough to
distinguish a commercial strip from a residential street, but coarse enough that
most cells accumulate at least some crime over the 48-month period. It's a sweet
spot, and it's consistent with what
<a href="https://arxiv.org/abs/2502.07465">recent crime forecasting research</a> uses for
similar models in US cities.</p>
<h2>Simple maths, no spatial joins</h2>
<p>Working in NZTM2000 (the coordinate system we set up in Part 2, where units are
metres) makes the next bit easy. Assigning a crime to a grid cell is just floor
division:</p>
<pre><code class="language-python">grid_j = floor((x - xmin) / 500)  # column index
grid_i = floor((y - ymin) / 500)  # row index
</code></pre>
<p>No spatial joins, no polygon intersection, no geopandas overhead. Just
arithmetic. It processes all 400k Auckland records in under a second.</p>
<p>For the ~22% of Auckland records that didn't get meshblock coordinates in Part
2, we fall back to area unit centroids converted to NZTM2000. Those records land
at the centre of their suburb rather than their exact location. Less precise,
but dropping them entirely would be worse.</p>
<p>The result: 354,387 of 412,669 Auckland records (86.2%) fall within the grid.
The remaining 14% are in Auckland's outer fringes (Great Barrier Island, rural
Rodney, the edges of the Waitakere Ranges) beyond our urban bounding box. That's
fine. We're modelling urban crime patterns, not rural ones.</p>
<h2>The 4D tensor</h2>
<p>With every crime assigned to a cell, we aggregate by grid position, month, and
crime type:</p>
<pre><code>(grid_i, grid_j, month, crime_type) → sum(victimisations)
</code></pre>
<p>This gives us a 4D tensor:</p>
<table>
<thead>
<tr>
<th>Dimension</th>
<th>Size</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>T (time)</td>
<td>48</td>
<td>Months: Feb 2022 – Jan 2026</td>
</tr>
<tr>
<td>H (height)</td>
<td>77</td>
<td>Grid rows (south → north)</td>
</tr>
<tr>
<td>W (width)</td>
<td>59</td>
<td>Grid columns (west → east)</td>
</tr>
<tr>
<td>C (channels)</td>
<td>6</td>
<td>Crime types: theft, burglary, assault, robbery, sexual, harm</td>
</tr>
</tbody>
</table>
<p>Think of it as a 48-frame video with 6 colour channels. A regular video has 3
channels: red, green, blue. Ours has 6: theft, burglary, assault, robbery,
sexual offences, harm. Each pixel's brightness in a given channel tells you how
many of that crime type happened in that 500m cell during that month.</p>
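<p>Building it is essentially one <code>np.add.at</code> call once the aggregation is
done. The <code>counts</code> frame and its column names here are assumptions, not
the actual pipeline's names:</p>
<pre><code class="language-python">import numpy as np

T, H, W, C = 48, 77, 59, 6
tensor = np.zeros((T, H, W, C), dtype=np.float32)

# counts: one row per (month_idx, grid_i, grid_j, crime_idx) with a count n
np.add.at(
    tensor,
    (counts.month_idx, counts.grid_i, counts.grid_j, counts.crime_idx),
    counts.n,
)
</code></pre>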
<p>I genuinely love this framing. It takes a complicated spatial-temporal
prediction problem and maps it onto something that decades of computer vision
research already knows how to handle.</p>
<h2>91.7% zeros</h2>
<p>The tensor is overwhelmingly empty. 91.7% of all cells are zero.</p>
<p>This makes complete sense if you think about it. Most 500m squares in Auckland
don't have a single reported crime in any given month. Crime clusters:
commercial corridors, transport hubs, specific residential pockets. The non-zero
8.3% is where all the signal lives.</p>
<p>The sparsity does create a training challenge though. If the model just
predicted zero everywhere, it'd be right 91.7% of the time. Useless, but
technically accurate. That's why we'll use <code>log1p</code> normalisation during
training. It compresses the range from [0, 50+] to [0, ~4], giving the model a
more balanced gradient to learn from. And it's why the loss function needs to
care more about the non-zero cells than the empty ones.</p>
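<p>The normalisation is one call each way; <code>pred</code> here is a stand-in for
whatever the model outputs:</p>
<pre><code class="language-python">X = np.log1p(tensor)          # counts in [0, 50+] → roughly [0, 4]
counts_pred = np.expm1(pred)  # invert model output back to counts
</code></pre>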
<p>The upside of all those zeros is storage. The
<a href="https://numpy.org/doc/stable/reference/generated/numpy.savez_compressed.html">compressed numpy format</a>
handles sparse data beautifully. The full 4D tensor saves to just 0.2 MB.
Compare that to the 21.9 MB Parquet from Part 2.</p>
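<p>Saving and loading is equally small (filename illustrative):</p>
<pre><code class="language-python">np.savez_compressed(&quot;crime_tensor.npz&quot;, tensor=tensor)
tensor = np.load(&quot;crime_tensor.npz&quot;)[&quot;tensor&quot;]
</code></pre>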
<h2>Train, validate, test</h2>
<p>We split the 48 months temporally. No shuffling, no random sampling:</p>
<table>
<thead>
<tr>
<th>Set</th>
<th>Months</th>
<th>Range</th>
</tr>
</thead>
<tbody>
<tr>
<td>Train</td>
<td>36</td>
<td>Feb 2022 – Jan 2025</td>
</tr>
<tr>
<td>Validation</td>
<td>6</td>
<td>Feb 2025 – Jul 2025</td>
</tr>
<tr>
<td>Test</td>
<td>6</td>
<td>Aug 2025 – Jan 2026</td>
</tr>
</tbody>
</table>
<p>The model trains on three years, tunes on six months, and gets evaluated on the
most recent six months it's never seen. There's no spatial leakage either. We
don't hold out specific grid cells. The model has to predict all locations for
future months simultaneously.</p>
<p>This is the only honest way to evaluate a time-series model. If you randomly
shuffle months into train and test, the model can memorise seasonal patterns and
look brilliant without actually learning anything useful about temporal
dynamics.</p>
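<p>In code, the split is nothing more than slicing the time axis. No indices to
shuffle, nothing to randomise:</p>
<pre><code class="language-python">train = tensor[:36]    # Feb 2022 – Jan 2025
val = tensor[36:42]    # Feb 2025 – Jul 2025
test = tensor[42:]     # Aug 2025 – Jan 2026
</code></pre>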
<h2>What the tensor reveals</h2>
<p>Even at this aggregate level, clear patterns jump out.</p>
<p>February tends to be the quietest month (~7–8k victimisations across Auckland),
while October through January (spring and early summer) consistently peaks at
8.5–9.5k. 2023 was the peak year across the board, with a gradual decline
through 2024 and into 2025.</p>
<p>Theft accounts for 72% of the tensor values (283k victimisations), burglary 17%
(68k), and assault 9% (34k). That theft dominance from Part 1, the 66% figure,
gets even more pronounced when you focus on Auckland, because theft clusters
harder in urban areas than other crime types do.</p>
<h2>What's next</h2>
<p>The tensor is built. The model input is ready. But before throwing deep learning
at anything, we need to properly understand what patterns actually exist in this
data: when does crime peak, where does it cluster, and how do different crime
types behave differently. Next post: exploratory data analysis.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Giving Crime a Place on the Map</title>
      <link>https://jonnonz.com/posts/giving-crime-a-place-on-the-map/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/giving-crime-a-place-on-the-map/</guid>
      <description>
        Crime records come with names and codes but no coordinates. Here's how we joined 1.15 million records to Stats NZ boundary files and gave every crime a place in physical space.
      </description>
      <content:encoded>
        <![CDATA[<p>A crime record that says &quot;Woodglen, meshblock 0284305&quot; is useless for spatial
modelling. It's a name and a number. You can't plot it, you can't measure
distances from it, and you definitely can't feed it to a neural network that
thinks in grid cells.</p>
<p>To do anything spatial, every record needs actual coordinates: latitude,
longitude, or ideally metres on a proper projection. That means downloading
Stats NZ's geographic boundary files and joining them to our crime data.</p>
<h2>NZ's geographic hierarchy</h2>
<p>New Zealand has a neat nested system of geographic units maintained by
<a href="https://datafinder.stats.govt.nz/layer/92197-meshblock-2018-generalised/">Stats NZ</a>:</p>
<pre><code class="language-mermaid">graph TD
    A[&quot;Region (16)&quot;] --&gt; B[&quot;Territorial Authority (67)&quot;]
    B --&gt; C[&quot;Area Unit / SA2 (~2,000)&quot;]
    C --&gt; D[&quot;Meshblock (~53,000)&quot;]
</code></pre>
<p>Regions are the big ones: Auckland, Canterbury, Wellington. Territorial
authorities are your cities and districts. Area units are roughly suburb-sized.
And meshblocks are the smallest unit, about 100 people each, roughly a city
block. Our crime data uses area units and meshblocks, so those are the layers we
need.</p>
<p>There's a gotcha here. Stats NZ replaced &quot;Area Units&quot; with &quot;Statistical Area 2&quot;
(SA2) in 2018 as part of a geographic classification overhaul. But the NZ Police
crime data still uses the old area unit names. So we need the <strong>2017 vintage</strong>
boundary files, not the current ones. Use the wrong vintage and your join
silently fails on hundreds of area units. Ask me how I know.</p>
<h2>Three boundary files</h2>
<p>We downloaded three layers from
<a href="https://datafinder.stats.govt.nz/">Stats NZ DataFinder</a> via their WFS API:</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Features</th>
<th>Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>Area Unit 2017 (generalised)</td>
<td>2,004</td>
<td>88 MB</td>
</tr>
<tr>
<td>Meshblock 2018 (generalised)</td>
<td>53,589</td>
<td>213 MB</td>
</tr>
<tr>
<td>Territorial Authority 2023</td>
<td>68</td>
<td>34 MB</td>
</tr>
</tbody>
</table>
<p>All three come in <a href="https://epsg.io/2193">EPSG:2193</a>, which is
<a href="https://www.linz.govt.nz/guidance/geodetic-system/coordinate-systems-used-new-zealand/projections/new-zealand-transverse-mercator-2000-nztm2000">NZTM2000</a>,
New Zealand's official projected coordinate system. The units are metres, not
degrees. This matters a lot later when we need to build a &quot;500m grid&quot;. You want
that to be 500 actual metres, not some approximation based on latitude.</p>
<p>We use generalised (simplified) versions rather than high-definition. The
full-resolution meshblock layer is over a gigabyte. For centroid calculations
and spatial joins, the generalised versions are more than accurate enough.</p>
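<p>Loading a layer and computing centroids in geopandas takes a few lines. The
filename is illustrative; the CRS assertion is the bit worth keeping:</p>
<pre><code class="language-python">import geopandas as gpd

mb = gpd.read_file(&quot;meshblock_2018_generalised.gpkg&quot;)
assert mb.crs.to_epsg() == 2193  # NZTM2000: units are metres

# Centroids of the generalised polygons are accurate enough for our joins
mb[&quot;cx&quot;] = mb.geometry.centroid.x
mb[&quot;cy&quot;] = mb.geometry.centroid.y
</code></pre>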
<h2>The area unit join: 99.4%</h2>
<p>Joining crime records to area unit boundaries by name was almost perfect.
1,146,721 of 1,154,102 records matched, 99.4%.</p>
<p>Only two area unit codes failed:</p>
<ul>
<li><code>999999</code>: the official &quot;unspecified&quot; catch-all (7,331 records)</li>
<li><code>-29</code>: a straight-up data entry error (50 records)</li>
</ul>
<p>That's a genuinely excellent result. The unmatched records aren't a bug in our
pipeline. They're unlocatable crimes that the police couldn't assign to a
specific area. Nothing we can do about those, and nothing we should try to do.</p>
<h2>The meshblock join: 81.2%</h2>
<p>The meshblock join came in lower at 81.2%, with 937,604 records matched out of
1,154,102.</p>
<p>This is expected and it's fine. Here's why: NZ meshblock boundaries get revised
with every census. We're using 2018 boundaries, but our crime data runs through
January 2026. Any crime from 2023 onwards might reference a 2023-vintage
meshblock code that simply doesn't exist in the 2018 file. Some meshblocks get
split, some get merged, some get renumbered entirely.</p>
<p>81.2% still gives us fine-grained coordinates for the vast majority of records.
For the ~19% that miss, we fall back to the area unit centroid. It's less
precise (suburb-level instead of block-level) but better than dropping the
records entirely.</p>
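<p>The fallback itself is a pair of <code>fillna</code> calls, assuming meshblock and
area unit centroid columns named roughly like this:</p>
<pre><code class="language-python"># Meshblock centroid where we have one, suburb centroid otherwise
df[&quot;x&quot;] = df[&quot;mb_x&quot;].fillna(df[&quot;au_x&quot;])
df[&quot;y&quot;] = df[&quot;mb_y&quot;].fillna(df[&quot;au_y&quot;])
</code></pre>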
<h2>Two coordinate systems</h2>
<p>This is one of those things that seems like a minor detail but will bite you
hard if you get it wrong. We use two coordinate reference systems throughout the
project:</p>
<p><strong>NZTM2000 (EPSG:2193)</strong> for all spatial analysis. The units are metres, which
makes grid construction trivial: a 500m cell is literally 500 units on each
axis. Distance calculations are straightforward. No need to worry about the fact
that a degree of longitude means different things at different latitudes.</p>
<p><strong>WGS84 (EPSG:4326)</strong> for the frontend dashboard only. deck.gl and MapLibre
expect coordinates in degrees (latitude/longitude), which is the standard for
web mapping.</p>
<p>The rule is simple: do everything in NZTM2000, convert to WGS84 at the very end
when exporting for the dashboard. Mixing coordinate systems mid-pipeline is a
recipe for bugs that are incredibly annoying to track down.</p>
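<p>In geopandas the rule is easy to follow because the conversion is a single
call at export time, with <code>gdf</code> standing in for the enriched
GeoDataFrame:</p>
<pre><code class="language-python"># Everything upstream stays in EPSG:2193; convert once for the dashboard
gdf_wgs84 = gdf.to_crs(epsg=4326)
gdf_wgs84.to_file(&quot;dashboard_export.geojson&quot;, driver=&quot;GeoJSON&quot;)
</code></pre>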
<h2>The output</h2>
<p>Each crime record now has up to 8 new geographic columns: area unit centroids,
meshblock centroids, and areas in both coordinate systems. The enriched dataset
saves as <code>crimes_with_geo.parquet</code> at 21.9 MB with 29 columns.</p>
<p>Quick sanity check: Auckland's mean crime centroid lands at lat -36.90, lon
174.78. Right in the middle of the urban area. If that number had come back as
somewhere in the Waikato, we'd know something went wrong.</p>
<h2>What's next</h2>
<p>Every crime record now has a place in physical space. But individual points
aren't what the neural network needs. It needs a regular grid. In the next post,
we'll overlay a 500m × 500m grid on Auckland, count crimes per cell per month,
and build the 4D tensor that turns crime prediction into a video prediction
problem.</p>
]]>
      </content:encoded>
      <pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Predicting Crime in Aotearoa</title>
      <link>https://jonnonz.com/posts/predicting-crime-in-aotearoa/</link>
      <guid isPermaLink="false">https://jonnonz.com/posts/predicting-crime-in-aotearoa/</guid>
      <description>NZ Police publish over a million crime records openly. What happens when you point deep learning at them?</description>
      <content:encoded>
        <![CDATA[<p>NZ Police publish every recorded victimisation in the country, over a million
records, and most people have no idea.</p>
<p>I stumbled across
<a href="https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz">policedata.nz</a>
a while back and was surprised by how much is there. Every reported theft,
assault, burglary, robbery, broken down by location, time of day, day of week,
and month. All the way down to meshblock level, which is roughly a city block.
Updated monthly.
<a href="https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz">Creative Commons licensed</a>.
You can just... use it.</p>
<p>So naturally I started wondering: what happens if you point deep learning at
this?</p>
<h2>A million rows of crime</h2>
<p>The dataset I pulled covers February 2022 through January 2026. Four years,
1,154,102 records across the whole country. The breakdown is roughly what you'd
expect: theft dominates at 66%, followed by burglary at 21% and assault at 10%.
The remaining sliver covers robbery, sexual offences, and harm/endangerment.</p>
<p>What makes it interesting for modelling is the spatial granularity. Each record
maps to one of 42,778 meshblocks (tiny geographic units defined by Stats NZ).
That's detailed enough to see patterns at a neighbourhood level, not just
&quot;Auckland has more crime than Tauranga&quot; (which, yeah, obviously).</p>
<p>Auckland alone accounts for about 36% of all recorded crime. Then there's a long
tail: Wellington, Christchurch, Hamilton, and then it drops off fast. NZ's urban
geography is weird like that. One mega-city and a bunch of mid-size towns.</p>
<h2>The idea</h2>
<p>The core question is pretty simple. Given the crime patterns of the last few
months, can we predict what the next month looks like?</p>
<p>This isn't Minority Report. Nobody's getting arrested for crimes they haven't
committed. It's pattern recognition on publicly available statistics, the same
kind of modelling people do with weather data or traffic flows.</p>
<p>The neat trick is how you frame it. If you overlay a grid on a city (say 500m by
500m cells) and count crimes per cell per month, you get something that looks a
lot like a video. Each month is a frame. Each cell is a pixel. The brightness is
the crime count.</p>
<p>Predicting next month's crime becomes a video prediction problem. And there are
some really cool deep learning architectures built exactly for that.</p>
<h2>ConvLSTM and ST-ResNet</h2>
<p>The two models I'm building are ConvLSTM and ST-ResNet. Don't worry if those
sound like gibberish. The short version: they're neural networks designed to
learn patterns that are both spatial (where things cluster) and temporal (how
those clusters change over time).</p>
<p><strong>ConvLSTM</strong> is the primary model. A standard LSTM network is great at learning
sequences. It's the architecture behind a lot of language and time-series
models. ConvLSTM swaps out the matrix multiplications for convolutions, which
means it can process grid-structured data. Feed it the last six months of crime
grids and it learns both the shape of hotspots and how they evolve.
<a href="https://arxiv.org/abs/2502.07465">Recent research</a> has shown these work well
for crime forecasting across multiple US cities.</p>
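<p>Part 6 builds this properly, but the core idea fits in a few lines of
PyTorch. This is an illustrative cell, not the model I'm training:</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    &quot;&quot;&quot;An LSTM cell whose gate computations are convolutions.&quot;&quot;&quot;

    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        # One conv produces all four gates at once
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state  # hidden and cell state: (batch, hid_ch, H, W)
        gates = self.gates(torch.cat([x, h], dim=1))
        i, f, g, o = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)  # updated cell state
        h = o * torch.tanh(c)          # new hidden state is also the output
        return h, c
</code></pre>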
<p><strong>ST-ResNet</strong> takes a different angle. Instead of one sequential view, it
captures three temporal perspectives: what happened recently, what happened at
the same time last year, and what's the long-term trend. Each gets its own
branch of residual convolutional networks, and a learned fusion layer combines
them. The
<a href="https://ojs.aaai.org/index.php/AAAI/article/view/10735">original paper</a> was for
crowd flow prediction in Beijing, but the architecture
<a href="https://www.nature.com/articles/s41598-025-24559-7">translates well to crime data</a>.</p>
<h2>Why NZ?</h2>
<p>Almost all published crime prediction research uses US data. Chicago, Los
Angeles, New York. A
<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC7319308/">systematic review of spatial crime forecasting</a>
makes this pretty clear. The models are well-studied, but they're trained on
American cities with American urban patterns.</p>
<p>New Zealand doesn't look like that. Our cities are smaller, more spread out, and
the distribution is completely different. Auckland dominates in a way that no
single US city does relative to the rest of the country. The spatial patterns
here are their own thing, and I couldn't find anyone who'd applied these deep
learning approaches to NZ data.</p>
<p>That's what got me keen. Not because I think I'll beat the published benchmarks.
Those researchers have GPUs and PhD students; I have a Ryzen 5 desktop with no
graphics card. But applying known techniques to new geography is useful work,
and nobody else seems to have done it.</p>
<h2>No GPU, no problem (mostly)</h2>
<p>All of this runs on my desktop, an AMD Ryzen 5 5600GT with 12 threads and 30GB
of RAM. No GPU at all. That sounds limiting, but the Auckland 500m grid works
out to about 60 by 80 cells. The ConvLSTM model ends up around 5 million
parameters, which trains in under an hour on CPU. You don't always need a beefy
rig.</p>
<p>It does mean being smart about model sizing and not going crazy with
hyperparameter searches. But for a hobby project, it's more than enough.</p>
<h2>What's coming</h2>
<p>This is the first post in a ten-part series covering the whole project end to
end.</p>
<p><strong>Part 1: Data Acquisition and Exploration.</strong> We start with a 503MB CSV file
from NZ Police that's UTF-16 encoded (because of course it is), has trailing
periods on area names, and 32% of records with unknown hour-of-day. We'll
wrangle it into a clean, typed Parquet file and get our first look at what's
actually in there.</p>
<p><strong>Part 2: Geographic Data Pipeline.</strong> Crime records come with meshblock IDs, but
no coordinates. We'll join them to Stats NZ geographic boundary files using
geopandas, giving every record a place on the map.</p>
<p><strong>Part 3: Spatiotemporal Grid Construction.</strong> This is where it gets fun. We
overlay a 500m by 500m grid on Auckland, count crimes per cell per month, and
build the 4D tensors that feed the neural networks. Crime prediction becomes
video prediction.</p>
<p><strong>Part 4: Exploratory Data Analysis.</strong> Before throwing deep learning at
anything, we need to understand what patterns actually exist. When does crime
peak? Where does it cluster? How do different crime types behave differently?</p>
<p><strong>Part 5: Baseline Models.</strong> Simple benchmarks (historical averages, naive
persistence) so we know whether the deep learning is actually adding value or
just being fancy for the sake of it.</p>
<p><strong>Part 6: ConvLSTM Architecture.</strong> Building and training the primary model.
Three ConvLSTM layers, six-month lookback window, learning spatial hotspots and
temporal dynamics simultaneously.</p>
<p><strong>Part 7: ST-ResNet Architecture.</strong> The three-branch alternative that captures
closeness, periodicity, and long-term trend separately, then fuses them with
learned weights.</p>
<p><strong>Part 8: Model Evaluation and Comparison.</strong> Which model wins? By how much? And
more importantly, where do they fail?</p>
<p><strong>Part 9: Building the Dashboard.</strong> A 3D interactive map built with deck.gl
where you can watch crime patterns evolve over time. Dark theme, extruded
columns, time-lapse playback.</p>
<p><strong>Part 10: Deployment and Reflections.</strong> Shipping to Vercel, what worked, what
didn't, and what I'd do differently next time.</p>
<p>Every post will include code and real results. The whole codebase will be open
source. And I'll be upfront about the stuff that didn't work. Trust me, there's
plenty of it.</p>
<p>This is a hobby project. It's not a policing tool, it's not a product, and it's
definitely not claiming to solve crime. It's just me being curious about what's
sitting in a publicly available dataset and seeing how far you can push it with
some Python and a bit of patience.</p>
]]>
      </content:encoded>
      <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>