{"version":"https://jsonfeed.org/version/1","title":"jonnonz.com","home_page_url":"https://jonnonz.com/","feed_url":"https://jonnonz.com/feed.json","description":"Mirror of jonno.nz — John Gregoriadis","author":{"name":"John Gregoriadis","url":"https://jonnonz.com"},"items":[{"id":"https://jonnonz.com/posts/product-market-fit-is-a-gauntlet/","url":"https://jonnonz.com/posts/product-market-fit-is-a-gauntlet/","title":"Product market fit isn't a stage, it's a gauntlet","content_html":"<p>Product market fit gets sold as a milestone. Find it and you're off to the\nraces. That's the bit nobody who's been through it actually believes.</p>\n<p>PMF is a gauntlet. It eats teams, it bends founders, and it quietly poisons the\ntechnical decisions you're proud of at the time. Most of the damage I've watched\ndone to good companies wasn't from missing PMF. It was from how they behaved\nwhile they were looking for it.</p>\n<p><img src=\"https://jonnonz.com/img/posts/pmf-gauntlet/gauntlet-loop.svg\" alt=\"The PMF gauntlet loop — Vision feeding Hypothesis, Ship, Market signal, and Adapt, with three drag forces (rigid roadmap, comprehension debt, over-scaled architecture) pulling on the loop.\"></p>\n<h2>It's not for everyone, and that's fine</h2>\n<p>There's a particular kind of person who does well in the pre-PMF phase. High\ntolerance for ambiguity, low need for closure. The deck never feels finished,\nthe metric you're chasing changes every six weeks, and the answer to &quot;what are\nwe doing in three months&quot; is &quot;depends.&quot;</p>\n<p>Plenty of really good operators just cannot function in that environment. That's\nnot a character flaw — it's a stage mismatch. Some people thrive at zero to one.\nSome thrive at one to ten. Almost no one thrives at both, and the industry\npretending otherwise has cost a lot of careers and a lot of sanity.</p>\n<p>Naming this honestly so people can self-select is one of the kindest things a\nfounder can do. You're not letting someone down by saying &quot;this stage probably\nisn't for you, but the next one will be.&quot; You're saving them eighteen months of\nfeeling broken.</p>\n<h2>The variables you don't control will eat you alive</h2>\n<p>Timing, market, economy, what your one big regulator decides on a Tuesday — none\nof that is yours. You can have a sharp thesis and a great team and ship\nsomething nobody buys, because something three layers above you shifted while\nyou were heads-down.</p>\n<p>The only protection I've found against this is a vision the team is genuinely\nbought into. Not the slide. Not the wall poster. The actual reason you all got\nout of bed this morning. When the macro turns and the metric you were proud of\nlast quarter goes sideways, that vision is what stops the org from devouring\nitself.</p>\n<p>I've watched startups where the thesis was right but the timing was a year early\nlose half their team in three months because nobody could explain why they were\nstill doing what they were doing. It wasn't a strategy problem. It was an\nalignment problem dressed up as a strategy problem.</p>\n<p>This is also where founders take the most damage personally. You can do\neverything well and still get hit by something nobody could have predicted. If\nyour sense of self is tied to PMF being a verdict on you, that breaks people.\nThe ones I've seen come through it healthy treated PMF like weather they were\nnavigating, not a test they were passing.</p>\n<h2>Agility is the actual moat at this stage</h2>\n<p>Your moat isn't the product. 
It isn't the tech. It definitely isn't the brand.\nYour moat is how fast the org can spot a shift in TAM or target market and\ntranslate it into a product move.</p>\n<p>Days, ideally. Weeks if you have to. Not quarters.</p>\n<p>This is where the technical decisions made in the name of &quot;scaling&quot; quietly\ncripple you. The microservices you split out before you needed to. The custom\ninfrastructure someone stood up because their last job had it. The platform\nabstractions that mean a small UI change touches four repos. Each of those felt\ndisciplined at the time. Each is now a tax on the only thing you actually have —\nspeed.</p>\n<p>Andreessen wrote the\n<a href=\"https://pmarchive.com/guide_to_startups_part4.html\">original PMF essay</a> almost\ntwenty years ago, and the line that's aged best is the bit about doing whatever\nit takes — changing people, rewriting the product, moving markets. That's not a\nlicense to be chaotic. It's a reminder that the org needs to be physically\ncapable of those moves. If your architecture, process, or contracts make\nrewriting the product a six-month project, you've already lost the gauntlet\nwhether you know it yet or not.</p>\n<p>I've got a strong opinion on this one: when in doubt, build it boring. Boring is\nfast to change.</p>\n<h2>The first cohort is a dance, and you have to lead</h2>\n<p>The customers who signed up first kept you alive. They also signed up for a\nslightly different company than the one you're now trying to become. That gap is\nwhere a lot of startups quietly die.</p>\n<p>Keep them too happy and you slow your evolution. Push too hard toward the new\nvision and you churn the cohort that's funding your runway. The actual job is to\ndo both at once, which is why I sometimes call it internal schizophrenia. You're\na different company to them than you are to yourselves, and that's not a bug —\nthat's the mode you're operating in.</p>\n<p>The skill is being honest with the early cohort about where you're going without\nselling them something they didn't buy. The art is using their feedback to\nsharpen the bigger vision rather than letting yourself be pulled back into being\ntheir bespoke vendor. The dance is doing both of those without your team\nthinking you've gone off-piste, because the gap between &quot;what we're shipping\ntoday&quot; and &quot;where we're going&quot; looks weird from the inside.</p>\n<h2>Where PMF teams quietly self-sabotage</h2>\n<p>Three patterns I keep seeing.</p>\n<p><img src=\"https://jonnonz.com/img/posts/pmf-gauntlet/discipline-vs-fragility.svg\" alt=\"Discipline vs fragility — three patterns where what looks like discipline (microservices on day one, rigid quarterly planning, founder comprehension debt) becomes fragility (can't pivot, defending old assumptions, context never reaches the team).\"></p>\n<p>Engineering over-scales the architecture. The team builds for the company they\nwant to be in two years instead of the company they need to be this quarter. By\nthe time PMF actually shows up, the org can't move. Worse, the engineers feel\nbusy and capable the whole time it's happening, which is why it's so hard to\nstop. Nobody is asking to slow down — everyone is shipping.</p>\n<p>Product holds the roadmap too tightly. The roadmap <em>is</em> the experiment at this\nstage. Treating it like a commitment is a category error. 
The product teams I've\nseen do this well treat the roadmap like a hypothesis with version numbers —\nlast month's was wrong, this month's is less wrong, and that's how it's supposed\nto feel. The ones who don't end up defending decisions they made when they knew\nless.</p>\n<p>Founder comprehension debt builds up faster than anyone notices. The founder is\nheads-down on signal — every customer call, every dropped deal, every weird\npattern in the data lands in their head and gets metabolised on the spot. The\nteam is two beats behind, working from last week's mental model. Each individual\ndelay feels minor. The cumulative gap is the thing that kills decisions.</p>\n<p>Each of these looks like discipline from the inside. Each of these is fragility\nwearing discipline's clothes.</p>\n<h2>AI changes the moat conversation, not the gauntlet</h2>\n<p>Moats in the AI space are shifting quarter by quarter right now. Feature moats\nhave basically collapsed — anything you can describe in a screenshot can be\ncloned in a weekend with the current generation of tools. What's\n<a href=\"https://www.latitudemedia.com/news/in-the-age-of-ai-can-startups-still-build-a-moat/\">actually defensible has moved</a>\ntoward proprietary data, deeply embedded workflows, distribution, trust, and\nregulatory positioning.</p>\n<p>For a founder in the PMF gauntlet that means the playbook is unreliable in a way\nit wasn't five years ago. You can't just lift what worked for the last cohort of\nSaaS winners and run it. You have to reason from first principles about where\nyour actual edge is going to come from over the next eighteen months, and place\nchips accordingly.</p>\n<p>The gauntlet itself hasn't changed. The chips you're placing have. That's harder\nthan it sounds, because most of us were trained in an era when the moat\nconversation was settled.</p>\n<h2>The unglamorous work that decides whether you survive</h2>\n<p>The thing nobody tells you is that the founder or leader's most important job\nduring the PMF stretch isn't strategy or product or sales. It's getting the\ncontext that's in your head out into the org while everyone is running at five\nthousand miles an hour.</p>\n<p>You will not feel like you have time for this. You won't. You have to carve it\nout anyway. The teams I've seen come through PMF intact are the ones whose\nleaders forced themselves to stop, write things down, repeat themselves more\nthan felt necessary, and trust that the slowdown was the work.</p>\n<p>The teams that don't make it tend to look back and realise everyone was busy and\nnobody knew why.</p>\n","date_published":"Fri, 01 May 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/change-management/","url":"https://jonnonz.com/posts/change-management/","title":"Change management","content_html":"<p>There's a whole business discipline called change management. Frameworks,\ncertifications, consultancies, the lot. Every big company has someone running it\nduring a restructure or a tech migration. Nobody runs it for you when your life\nturns over.</p>\n<p>Which is strange, because the personal version is the harder problem — and right\nnow, more people are facing it than at any point in recent memory.</p>\n<p>More than\n<a href=\"https://www.cnbc.com/2026/04/24/20k-job-cuts-at-meta-microsoft-raise-concern-of-ai-labor-crisis-.html\">92,000 tech workers have been laid off in 2026 alone</a>,\nbringing the total close to 900,000 since 2020. 
Meta cut 8,000 jobs last week.\nMicrosoft offered buyouts to 7% of its US workforce — the first time in its\n51-year history. Oracle has started cuts that could reach 30,000 by year end.\nCloser to home, Xero, Sharesies, Spark, One NZ and Eroad have all run their own\nrounds. AI is the headline reason, but the impact lands the same regardless of\nthe cause: hundreds of thousands of people closing a laptop and discovering\ntheir working identity has just been deleted.</p>\n<p>That's a lot of people being handed a forced version of personal change\nmanagement without ever signing up for the course.</p>\n<p>The business framing has the right insight buried in it.\n<a href=\"https://wmbridges.com/about/what-is-transition/\">William Bridges</a> made a\ndistinction in the 90s that most people miss: change is external, transition is\ninternal. Change is the new org chart, the redundancy email, the merger.\nTransition is what happens inside people's heads while all that is going on.\nChange can happen overnight. Transition takes as long as it takes.</p>\n<p>Personal change management is just transition without the org chart.</p>\n<p>I've been through a few years of it now. Not one big event — more like a slow\nstack of endings, some chosen, some not. Companies, relationships, versions of\nmyself I'd been building for a decade. The kind of stretch where you don't\nreally notice you're changing until you look up one day and the old you is gone.</p>\n<p>That's the part nobody warns you about. Real change isn't transformation. It's a\ncontrolled demolition followed by a slow rebuild, with a long, weird middle bit\nwhere neither the old you nor the new you is really there.</p>\n<h2>Something has to die</h2>\n<p>The thing that goes is usually the organising self. Whatever the old you was\narranged around — a fear, a need for approval, a story about who you had to be,\nan ambition that was really a wound. When that goes, the structure it was\nholding up collapses. That's the death. It's real.</p>\n<p>What survives is everything that wasn't load-bearing on the old arrangement.\nYour humour, your curiosity, the way you actually see people, the things you\ngenuinely care about. Those don't die because they weren't propping anything up.\nThey were just you, underneath.</p>\n<p>The disorienting part is feeling like a stranger to yourself and entirely\ncontinuous, at the same time. Both are true. The continuous parts are\ncontinuous. The organising self is gone. You're in between.</p>\n<h2>The middle is the work</h2>\n<p>Bridges calls this the neutral zone. The old reality has gone, the new one isn't\nthere yet. He says it's the hardest phase to manage, and most organisations rush\nthrough it because it looks unproductive. People do the same thing to\nthemselves.</p>\n<p>The temptation is to build a new identity fast, because the empty space is\nuncomfortable. Don't. Whatever you grab in a hurry will be made of whatever was\nlying around — which usually means the old patterns sneak back in wearing new\nclothes. Workaholism becomes &quot;building my legacy&quot;. Approval-seeking becomes\n&quot;being of service&quot;. Avoidance becomes &quot;protecting my peace&quot;. Same machine, new\npaint.</p>\n<p>The test is always: is this coming from fear or from truth? You'll know. The\nbody knows before the mind does. Pay attention to the part of you that goes\nquiet around certain people, certain projects, certain decisions. 
That's the\nsignal.</p>\n<h2>Fearlessness is a side effect</h2>\n<p>You don't get to fearless by trying. You get there by going through enough\nendings that the bluff stops working.</p>\n<p>Fear runs on a specific con: <em>if this thing happens, you won't survive it</em>. Not\nliterally die — but the you that exists now won't continue. You'll be broken,\nfinished, unrecognisable. The con works as long as it's untested. Then the thing\nhappens, and you go through it, and on the other side you notice you're still\nhere. Different, scarred, but continuous. The fear was lying about its hand.</p>\n<p>After that, fear can still show up — it doesn't leave — but it can't run the\nsame con. You've seen the card it was holding. Next time it says <em>you won't\nsurvive this</em>, some quiet part of you knows: I already did.</p>\n<p>That's not the absence of fear. It's knowing you can act from what's true even\nwith the fear in the room.</p>\n<h2>What's on the other side is ordinary</h2>\n<p>Here's the bit that surprised me. Once the demolition is done and the rebuild\nstarts, what comes back isn't impressive. It's just real. Less reactive. Less\nnoise. Less performance. You stop needing to be seen a particular way, partly\nbecause you've watched a few of those selves die and you don't trust the next\none enough to stake everything on it.</p>\n<p>The goal of change management — the personal kind — isn't to become someone\nadmirable. It's to become someone who's the same alone as in public. Someone who\ndoes the next true thing without announcing it. Most of the depth of this stuff\nlives in the texture of regular days. How you handle a boring Tuesday. Whether\nyou rest when you're tired or push through to prove something to nobody.</p>\n<p>Business change management has all this in it, and most people read it as a\nproject manager's manual. It's also a personal one. Endings, neutral zone, new\nbeginnings. Same shape, different blast radius.</p>\n<p>The seeds that grow through the demolition are the ones worth tending. The rest\nsorts itself out.</p>\n","date_published":"Sat, 25 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/three-ways-to-look-at-time/","url":"https://jonnonz.com/posts/three-ways-to-look-at-time/","title":"Three Ways to Look at Time","content_html":"<p>ST-ResNet's core insight is that not all history is created equal.</p>\n<p>When you're predicting crime in Auckland next month, three different kinds of\npast information matter. What happened in the last couple of months: the recent\ntrend. What happened at the same time last year: the seasonal pattern. And\nwhat's been happening over the longer term: whether crime is generally rising or\nfalling in an area.</p>\n<p>ConvLSTM treats all of this as one continuous sequence and hopes the network\nfigures out which parts matter. <a href=\"https://arxiv.org/abs/1610.00081\">ST-ResNet</a>\ntakes a more opinionated approach. It separates these three temporal scales\nexplicitly and gives each one its own dedicated neural network branch.</p>\n<p>The original paper by Zhang et al. was about predicting crowd flows in Beijing.\nPeople move through cities in patterns that look a lot like crime patterns:\ndaily rhythms, weekly cycles, long-term trends. 
The architecture\n<a href=\"https://www.nature.com/articles/s41598-025-24559-7\">translates well to crime data</a>,\nwith some modifications.</p>\n<h2>Closeness, period, trend</h2>\n<p>The three branches each look at different slices of history:</p>\n<p><strong>Closeness</strong> captures what's been happening recently. For our monthly data,\nthis means the last 3 months. If South Auckland has been trending upward over\nthe last quarter, the closeness branch sees that momentum.</p>\n<p><strong>Period</strong> captures seasonal patterns. It looks at the same month in previous\nyears. So to predict January 2026, it pulls in January 2025 and January 2024.\nThe assumption is that crime has an annual rhythm, and the same month tends to\nlook similar year to year.</p>\n<p><strong>Trend</strong> captures longer-term shifts. It uses quarterly averages from further\nback: broad strokes of whether an area is seeing more or less crime over time.\nThis is the slowest-moving signal.</p>\n<p>Each branch independently processes its temporal slice through a stack of\nresidual convolutional blocks, then a learned fusion layer combines the three\noutputs:</p>\n<pre><code>prediction = W_c · closeness + W_p · period + W_t · trend + bias\n</code></pre>\n<p>Where <code>W_c</code>, <code>W_p</code>, and <code>W_t</code> are learned weights that vary by grid cell. This\nis a nice touch. It means the model can decide that the CBD's crime is mostly\ndriven by recent trends (closeness), while a residential suburb might be more\nseasonal (period). Different areas get different temporal recipes.</p>\n<h2>Residual blocks</h2>\n<p>Each branch uses residual convolutional units, the building blocks that made\n<a href=\"https://arxiv.org/abs/1512.03385\">ResNet</a> so successful in image recognition.</p>\n<p>The key idea: instead of learning the full output at each layer, the network\nlearns the <em>residual</em>, the difference between input and output. The identity\nshortcut connection means gradients flow cleanly through the network during\ntraining, which lets you stack more layers without the signal degrading.</p>\n<pre><code>ResUnit(X) = ReLU(Conv(ReLU(Conv(X))) + X)\n</code></pre>\n<p>That <code>+ X</code> at the end is the skip connection. If the layer has nothing useful to\nadd, it can learn weights near zero and just pass the input through. This makes\ndeeper networks stable, which matters when you're trying to learn spatial\nfeatures at multiple scales.</p>\n<p>For our grid, I use 4 residual units per branch. Each unit has two 3×3\nconvolutional layers with 32 filters. That's deep enough to capture spatial\nrelationships across several kilometres without being so deep that the model\noverfits on 36 months of training data.</p>\n<h2>The NZ-specific problem</h2>\n<p>Here's where theory meets reality, and it gets a bit awkward.</p>\n<p>ST-ResNet was designed for dense, high-frequency data. The Beijing crowd flow\npaper used 30-minute intervals over months of data: thousands of timesteps. The\ncrime papers that report strong results typically use daily data over several\nyears.</p>\n<p>We have 48 monthly timesteps. Total. The period branch (which looks at the same\nmonth in previous years) has at most 3 data points per month (2022, 2023, 2024\nto predict 2025/2026). The trend branch is working with quarterly averages from\na four-year window. 
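</p>\n<p>To make the scarcity concrete, here's roughly how the three input slices fall\nout of a 48-month tensor. A minimal sketch: the post's actual indexing code\nisn't shown, so the exact windows (especially for trend) are my assumptions.</p>\n<pre><code class=\"language-python\">import numpy as np\n\n# 48 months of (6 crime types, 77, 59) grids, oldest first\nframes = np.zeros((48, 6, 77, 59), dtype=np.float32)\n\nt = 47  # predict the final month from everything before it\n\ncloseness = frames[t-3:t]            # last 3 months, shape (3, 6, 77, 59)\nperiod = frames[[t-12, t-24]]        # same month 1 and 2 years back, (2, 6, 77, 59)\ntrend = np.stack([                   # 2 quarterly means from further back (assumed windows)\n    frames[t-6:t-3].mean(axis=0),\n    frames[t-9:t-6].mean(axis=0),\n])                                   # (2, 6, 77, 59)\n\n# Fold months into channels, matching the branch inputs listed below\nx_c = closeness.reshape(-1, 77, 59)  # 18 channels\nx_p = period.reshape(-1, 77, 59)     # 12 channels\nx_t = trend.reshape(-1, 77, 59)      # 12 channels\n</code></pre>\n<p>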
It's not a lot of temporal data for an architecture that's\nspecifically designed to decompose temporal patterns.</p>\n<p>I had a feeling this would be the bottleneck, and it was.</p>\n<h2>Implementation</h2>\n<pre><code>Closeness branch:\n  Input: last 3 months (3 × 6 channels = 18 input channels)\n  → 4 ResUnits (32 filters, 3×3 kernels)\n  → Output: 32 channels\n\nPeriod branch:\n  Input: same month from 2 prior years (2 × 6 = 12 input channels)\n  → 4 ResUnits (32 filters, 3×3 kernels)\n  → Output: 32 channels\n\nTrend branch:\n  Input: 2 quarterly averages (2 × 6 = 12 input channels)\n  → 4 ResUnits (32 filters, 3×3 kernels)\n  → Output: 32 channels\n\nFusion:\n  → Learned weighted sum across branches\n  → Conv2d(32, 6, 1×1) → 6 crime type predictions\n</code></pre>\n<p>Total parameters: roughly 180k. Slightly smaller than the ConvLSTM, which is\nfine. ST-ResNet's power is supposed to come from the temporal decomposition, not\nfrom model size.</p>\n<p>Training uses the same setup as ConvLSTM: Adam optimiser, learning rate 1e-4,\nMSE loss on <code>log1p</code>-transformed values, early stopping with patience of 15\nepochs. On CPU, each run takes about 35 minutes, a bit faster than ConvLSTM\nsince there's no sequential recurrence to deal with.</p>\n<h2>Results</h2>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>Hist. Avg MAE</th>\n<th>ConvLSTM MAE</th>\n<th>ST-ResNet MAE</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>1.28</td>\n<td>1.14</td>\n<td>1.18</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>0.35</td>\n<td>0.32</td>\n<td>0.33</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>0.20</td>\n<td>0.19</td>\n<td>0.19</td>\n</tr>\n<tr>\n<td>Robbery</td>\n<td>0.04</td>\n<td>0.04</td>\n<td>0.04</td>\n</tr>\n<tr>\n<td>Sexual</td>\n<td>0.03</td>\n<td>0.03</td>\n<td>0.03</td>\n</tr>\n<tr>\n<td>Harm</td>\n<td>0.01</td>\n<td>0.01</td>\n<td>0.01</td>\n</tr>\n<tr>\n<td><strong>All types</strong></td>\n<td><strong>0.39</strong></td>\n<td><strong>0.35</strong></td>\n<td><strong>0.36</strong></td>\n</tr>\n</tbody>\n</table>\n<p>ST-ResNet beats the historical average but doesn't quite match ConvLSTM. The\naggregate MAE of 0.36 is a 7.7% improvement over the baseline, compared to\nConvLSTM's 10.3%.</p>\n<p>That's not a terrible result, but it's not what I was hoping for.</p>\n<h2>Why ConvLSTM wins here</h2>\n<p>When I dug into the learned fusion weights, the story became clear. The\ncloseness branch dominates. It gets 60–70% of the weight across most grid cells.\nThe period branch gets 20–25%, and the trend branch barely contributes at\n10–15%.</p>\n<p>The model is basically saying: &quot;Recent months matter most, seasonal patterns\nhelp a bit, and long-term trends are mostly noise.&quot; That's not a failure of the\narchitecture. It's a fair assessment of what's in the data.</p>\n<p>With only 2–3 examples of each calendar month, the period branch can't reliably\nlearn seasonal patterns. It's overfitting to individual years rather than\nextracting a stable seasonal signal. ConvLSTM handles this better because it\nprocesses the full sequence and implicitly learns seasonality from the\ncontinuous flow of months, without needing to explicitly align calendar periods.</p>\n<p>The trend branch suffers even more. Quarterly averages over a four-year window\ndon't give it much to work with. In the original crowd flow papers with years of\nhalf-hourly data, the trend branch captures genuine long-term shifts in\npopulation movement. 
Here, it's essentially learning a constant.</p>\n<h2>Where ST-ResNet does shine</h2>\n<p>Despite losing on aggregate, ST-ResNet has one clear advantage: it's better at\npredicting seasonal transitions.</p>\n<p>The months where crime shifts gears (the spring uptick in September/October and\nthe February dip) ST-ResNet handles more gracefully than ConvLSTM. The period\nbranch, sparse as its data is, does capture enough of the annual rhythm to\nanticipate these transitions a bit earlier.</p>\n<p>ConvLSTM tends to lag these transitions by about a month. It needs to &quot;see&quot; the\nuptick starting before it predicts continuation. ST-ResNet, by explicitly\nlooking at last year's same month, can anticipate the shift before it fully\nmaterialises in the recent sequence.</p>\n<p>For an operational forecasting tool, that one-month lead time on seasonal\ntransitions could be valuable. But in our test set metrics, it's a small\nadvantage that doesn't overcome ST-ResNet's overall weaker performance on\nmonth-to-month dynamics.</p>\n<h2>Head to head</h2>\n<table>\n<thead>\n<tr>\n<th>Metric</th>\n<th>Historical Avg</th>\n<th>ConvLSTM</th>\n<th>ST-ResNet</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Overall MAE</td>\n<td>0.39</td>\n<td>0.35</td>\n<td>0.36</td>\n</tr>\n<tr>\n<td>Theft MAE</td>\n<td>1.28</td>\n<td>1.14</td>\n<td>1.18</td>\n</tr>\n<tr>\n<td>Training time (CPU)</td>\n<td>N/A</td>\n<td>~40 min</td>\n<td>~35 min</td>\n</tr>\n<tr>\n<td>Parameters</td>\n<td>0</td>\n<td>~200k</td>\n<td>~180k</td>\n</tr>\n<tr>\n<td>Seasonal transitions</td>\n<td>Poor</td>\n<td>Lagging</td>\n<td>Better</td>\n</tr>\n<tr>\n<td>Spatial dynamics</td>\n<td>None</td>\n<td>Good</td>\n<td>Good</td>\n</tr>\n</tbody>\n</table>\n<p>ConvLSTM is the better model for this specific dataset. Not by a lot. We're\ntalking about small differences on already-small error values. But consistently\nbetter on the main crime types that have enough signal to matter.</p>\n<p>Neither model is a revelation. A 7–10% improvement over &quot;just use the historical\naverage&quot; is real but modest. Deep learning's strengths (learning complex\nnonlinear dynamics from huge datasets) are somewhat wasted on 48 monthly\ntimesteps over a relatively low-crime city.</p>\n<p>If I had daily data instead of monthly, or ten years instead of four, I'd expect\nST-ResNet to close the gap or pull ahead. Its architecture is fundamentally\nsound. The temporal decomposition is a genuinely good idea. It's just starved of\nthe data it needs to shine.</p>\n<p>Both models meaningfully beat the baselines. Both learn spatial patterns that\nsimple averages can't capture. And both are honest about the sparse crime types:\nthey predict near-zero and move on, which is the right call.</p>\n<p>Next up: we'll take these predictions and build something you can actually look\nat. A 3D interactive dashboard where you can watch crime patterns evolve across\nAuckland over time. The modelling was the hard bit. Making it visual is the fun\nbit.</p>\n","date_published":"Thu, 23 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/what-an-hour-of-your-attention-is-worth/","url":"https://jonnonz.com/posts/what-an-hour-of-your-attention-is-worth/","title":"What an hour of your attention is worth","content_html":"<p>I stood up a working social network for eight mates last weekend. Profile pages,\na shared feed, a photo wall, a jukebox bolted onto a spare domain. 
It took me a\nSaturday, about forty bucks in Claude credits, and exactly zero\nproduct-market-fit meetings.</p>\n<p>The same weekend, Meta earned about six bucks off me. Google made ten. LinkedIn,\nYouTube, TikTok, X — all quietly billing in the background, none of them sending\na receipt. If you add them all up for the average American, the annual total is\nnorth of $1,000. You just never see it, because no money changes hands and no\ninvoice arrives.</p>\n<p>The clever thing about &quot;free&quot; on the internet isn't that the trade doesn't\nexist. It's that it's been designed so you can't see it. No money moves. No\ninvoice lands. No app shows you the meter ticking as you scroll. The exchange is\nreal — your attention and your data in, Instagram and Google and LinkedIn out —\nbut by the time the numbers get tallied, they live in a quarterly earnings\nreport you'll never read. So the trade feels weightless.</p>\n<p>It isn't. You just can't see the price tag.</p>\n<p>The strange thing is the price tag has been public the whole time. Every\nplatform listed on a stock exchange tells you, four times a year, exactly what\nyou're worth to them. You've just never been shown how to read it — and until\nrecently, the only practical alternative to reading it was &quot;live in a cabin.&quot;\nThat part has changed, and it's the part almost nobody is talking about.</p>\n<p>The invisibility isn't an accident either. If Meta had to send you a cheque\nevery month for the money they made off you, you'd treat the relationship very\ndifferently. You'd notice when the amount went up. You'd notice that the\nteenager version of the payment looks nothing like the adult version. You'd\nwonder why the Auckland cheque was ten times the Jakarta one for the exact same\nhour of scrolling. The whole edifice of &quot;free&quot; rests on keeping the accounting\none-sided — they measure you in basis points to three decimal places, you\nexperience the trade as a vague sense of having lost your afternoon.</p>\n<h2>The price tag they're legally required to print</h2>\n<p>The number you want is called ARPU — average revenue per user. Every public\nplatform reports it, because investors demand it. The maths is blunt: take the\ncompany's annual revenue, divide by monthly active users. What comes out is what\nthe platform earns off the average human who shows up, per year.</p>\n<p>For Meta last year the global figure was about $52 per user. For YouTube's\nad-supported side, around $24. For\n<a href=\"https://www.linkedin.com/posts/dshapero_earnings-update-to-close-out-our-2025-fiscal-activity-7361399679256858624-vVg7\">LinkedIn it's $15 averaged across all 1.2B members</a>,\nbut much higher once you strip out the dormant accounts.</p>\n<p>These aren't guesses from a watchdog group. They're from the companies\nthemselves, in the part of the earnings release where the whole purpose is to\nconvince shareholders each user is worth more than last quarter. The incentive\nis to talk the number up, not down.</p>\n<p>Whatever ARPU says, the reality on the ground probably isn't lower. 
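</p>\n<p>The division is short enough to sanity-check yourself. A sketch with\nillustrative figures, picked to land near the ~$52 global Meta number above\nrather than taken from any filing:</p>\n<pre><code class=\"language-python\"># Back-of-envelope ARPU: annual revenue over monthly active users\nannual_revenue = 165e9        # USD, assumed for illustration\nmonthly_active_users = 3.2e9  # assumed for illustration\n\nprint(round(annual_revenue / monthly_active_users))  # about 52 dollars a year\n</code></pre>\n<p>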
If anything,\nit's a floor.</p>\n<h2>Your annual bill, itemised</h2>\n<p>Rough figures, from the companies' own filings:</p>\n<ul>\n<li><strong>Meta</strong>: ~$52/yr global, ~$320 in the US</li>\n<li><strong>Google (all products)</strong>: ~$100/yr globally, ~$500 US —\n<a href=\"https://abc.xyz/investor/\">$400B in revenue</a> across ~4B users spanning\nSearch, Android, YouTube, Cloud and Workspace combined</li>\n<li><strong>YouTube ads alone</strong>: ~$24/yr global, ~$80 US</li>\n<li><strong>LinkedIn</strong>: $15/yr averaged across all 1.2B members, but ~$57/yr across the\n<a href=\"https://www.linkedin.com/posts/dshapero_earnings-update-to-close-out-our-2025-fiscal-activity-7361399679256858624-vVg7\">310M monthly active ones</a></li>\n<li><strong>TikTok</strong>: ~$16 global, ~$70 US — doubled in two years</li>\n<li><strong>Snapchat, Reddit, Pinterest, X</strong>: all in the $10–30/user/yr range</li>\n</ul>\n<p>The geographic skew is the part most people miss. Meta's figure in the US is\nroughly ten times what it is in Asia-Pacific. Europe sits in the middle at about\n$92. Same product, same features, same algorithm — different rate card, because\nad buyers pay more to reach wealthier audiences. You are literally worth more in\nAuckland than you are in Jakarta, and your feed is tuned accordingly.</p>\n<p><img src=\"https://jonnonz.com/img/posts/arpu-meta-by-region.svg\" alt=\"Meta ARPU by region — US $320, Europe $92, global $52, Asia-Pacific $32\"></p>\n<p>The same skew shows up across every ad-funded platform. The US rate card is the\none the rest of the world gets compared to:</p>\n<p><img src=\"https://jonnonz.com/img/posts/arpu-us-vs-global.svg\" alt=\"US vs global ARPU comparison — Google $500/$100, Meta $320/$52, YouTube $80/$24, TikTok $70/$16, LinkedIn $57/$15\"></p>\n<p>Marketplaces don't fit ARPU cleanly, but the extraction is still there if you\nlook for it. Uber and Lyft take around 20% of each fare. Airbnb combines host\nand guest fees for about 14–16%. DoorDash and Uber Eats take closer to 25%.\nShopify's card take is 2.9% plus 30 cents per transaction. Different mechanism,\nsame game — a percentage of every transaction, quietly skimmed, never itemised.</p>\n<h2>The meter, in dollars per hour</h2>\n<p>ARPU is annual. Attention isn't spent in years though — it's spent in hours, in\nthe little windows between other things. So the honest conversion is to divide.</p>\n<p>The average US Meta user burns about 200 hours a year across Facebook and\nInstagram. $320 ÷ 200 = roughly $1.60 per hour of your attention. YouTube works\nout to about $0.27/hour. TikTok $0.22. Snapchat cheaper still. Do the same sum\non global averages and Meta drops to around 26 cents an hour, YouTube to 8.</p>\n<p>Those rates are only what the platform <em>earns</em> this year, mind. They aren't what\nyour data is ultimately <em>worth</em>. Everything you click and hover and pause on\nfeeds ad targeting across the wider web, plus — now — AI training corpora. ARPU\nis the rent. The equity is bigger, and the equity compounds.</p>\n<p>The AI-training bit is genuinely new and worth pausing on. For fifteen years the\ndata you generated on these platforms powered one thing: better ad targeting on\nthose same platforms. It was a closed loop. You scrolled, they learned, they\nsold the targeting back to advertisers, the advertisers bought your attention\nagain. Bounded. Weird, but bounded.</p>\n<p>That loop isn't bounded anymore. 
Your posts and comments and DMs are now\ntraining data for models that will be sold, resold, and embedded into every\npiece of software you touch for the next decade. The $320 Meta earned off you in\nthe US last year is a rounding error next to what the underlying corpus is worth\nto the next generation of AI products. ARPU doesn't capture any of that. It's\nliterally last quarter's ad rent, with none of the capital gains on the asset.</p>\n<p>Even the rent, laid out per hour, makes one thing obvious: you can see exactly\nwhy every platform is obsessed with &quot;time spent&quot; as a north-star metric. If one\nextra hour a week on Facebook is worth ~$83 a year per US user, multiplied\nacross three billion users, the maths for why the feed never stops scrolling is\nnot mysterious. The feed is a meter. Keeping it running is the business. Every\n&quot;new feature&quot; that shows up in your settings — reels, shorts, a nudge to open\nthe app on your commute — is a hand on that meter.</p>\n<p>Once you see it that way, a lot of product decisions stop looking like product\ndecisions.</p>\n<h2>Run your own numbers</h2>\n<p>The point of making the numbers this concrete is that you can plug in your own\nusage and see what you personally throw into the machine each year. Drag the\nsliders for how much time goes into each platform and watch the ledger tally up.\nRates are global averages.</p>\n<section class=\"ledger\" id=\"ledger\" aria-label=\"The Ledger calculator\"><style>\n.ledger{--lb:#141f2e;--lb2:#1a2637;--li:#e4e9ee;--ld:#bfc8d2;--lm:#6a7d92;--la:#d4a853;--lbd:rgba(255,255,255,.08);--lbs:rgba(255,255,255,.16);max-width:38rem;background:var(--lb);border:1px solid var(--lbs);border-radius:.35rem;padding:1.15rem 1.25rem 1rem;margin:2rem auto;color:var(--ld);font-family:text,'Roboto',-apple-system,sans-serif;font-size:.88rem;line-height:1.5;text-align:left}\n.ledger *,.ledger *::before,.ledger *::after{box-sizing:border-box}\n.ledger h3,.ledger h4{font-family:inherit;margin:0;padding:0;color:inherit;font-weight:inherit;font-size:inherit;letter-spacing:0;line-height:1.2}\n.ledger h3::before,.ledger h4::before{content:none}\n.ledger p{margin:0;padding:0}\n.ledger input{font:inherit;color:inherit;background:transparent;border:none;outline:none}\n.l-head{display:flex;justify-content:space-between;align-items:baseline;gap:.75rem;padding-bottom:.75rem;margin-bottom:.9rem;border-bottom:1px solid var(--lbs)}\n.l-title{font-family:serif,'Fraunces',Georgia,serif;font-weight:400;font-size:1.15rem;letter-spacing:-.015em;color:var(--li)}\n.l-tag{font-family:code,'JetBrains Mono',monospace;font-size:.58rem;letter-spacing:.16em;text-transform:uppercase;color:var(--lm)}\n.l-total{display:flex;align-items:baseline;justify-content:space-between;gap:1rem;padding:.85rem 1rem;background:#0e1623;border:1px solid var(--lbd);border-radius:.25rem;margin-bottom:1.1rem}\n.l-total-amt{font-family:serif,'Fraunces',Georgia,serif;font-weight:400;font-size:1.9rem;line-height:1;color:var(--la);letter-spacing:-.02em;font-variant-numeric:tabular-nums}\n.l-total-lab{font-family:code,'JetBrains Mono',monospace;font-size:.58rem;letter-spacing:.18em;text-transform:uppercase;color:var(--lm);text-align:right}\n.l-sec{margin-top:1.15rem}\n.l-sec:first-of-type{margin-top:0}\n.l-row{padding:.9rem 0;border-bottom:1px dashed 
var(--lbd)}\n.l-row:last-child{border-bottom:none}\n.l-row-top{display:flex;align-items:baseline;justify-content:space-between;gap:.75rem;margin-bottom:.55rem}\n.l-row-n{color:var(--li);font-size:.94rem;line-height:1.25}\n.l-row-meta{font-family:code,'JetBrains Mono',monospace;font-size:.6rem;letter-spacing:.05em;color:var(--lm);margin-top:.15rem;display:block}\n.l-row-a{font-family:serif,'Fraunces',Georgia,serif;font-size:1.05rem;color:var(--la);text-align:right;font-variant-numeric:tabular-nums;white-space:nowrap;flex-shrink:0}\n.l-row-a.z{color:var(--lm)}\n.l-row-c{display:flex;align-items:center;gap:.9rem}\n.l-row-v{font-family:code,'JetBrains Mono',monospace;font-size:.66rem;letter-spacing:.05em;color:var(--ld);white-space:nowrap;min-width:6rem;text-align:right}\n.l-sl{-webkit-appearance:none;appearance:none;flex:1;min-width:0;height:32px;background:transparent;cursor:pointer;padding:0;margin:0;touch-action:manipulation}\n.l-sl::-webkit-slider-runnable-track{height:3px;background:var(--lbd);border-radius:2px}\n.l-sl::-webkit-slider-thumb{-webkit-appearance:none;appearance:none;width:22px;height:22px;border-radius:50%;background:var(--la);border:3px solid var(--lb);margin-top:-10px;box-shadow:0 0 0 1px var(--la),0 2px 6px rgba(0,0,0,.3);cursor:grab}\n.l-sl:active::-webkit-slider-thumb{cursor:grabbing;box-shadow:0 0 0 1px var(--la),0 0 0 6px rgba(212,168,83,.22)}\n.l-sl::-moz-range-track{height:3px;background:var(--lbd);border-radius:2px}\n.l-sl::-moz-range-thumb{width:22px;height:22px;border-radius:50%;background:var(--la);border:3px solid var(--lb);box-shadow:0 0 0 1px var(--la)}\n.l-sl:focus::-webkit-slider-thumb{box-shadow:0 0 0 1px var(--la),0 0 0 6px rgba(212,168,83,.28)}\n.l-notes{margin-top:1.25rem;padding-top:.9rem;border-top:1px solid var(--lbs)}\n.l-notes h5{font-family:code,'JetBrains Mono',monospace;font-size:.56rem;letter-spacing:.18em;text-transform:uppercase;color:var(--lm);margin:0 0 .55rem;font-weight:400}\n.l-notes p{font-family:serif,'Fraunces',Georgia,serif;font-size:.85rem;line-height:1.55;color:var(--ld);margin:0 0 .4rem;text-wrap:pretty}\n.l-notes p:last-child{margin-bottom:0}\n.l-notes strong{color:var(--li);font-weight:500}\n@media (max-width:560px){\n.ledger{padding:1rem .9rem;margin:1.5rem auto;font-size:.92rem}\n.l-total{flex-direction:column;align-items:flex-start;gap:.25rem;padding:.75rem .9rem}\n.l-total-lab{text-align:left}\n.l-row{padding:1rem 0}\n.l-row-c{gap:.75rem}\n.l-row-v{min-width:5rem;font-size:.7rem}\n.l-sl::-webkit-slider-thumb{width:26px;height:26px;margin-top:-12px}\n.l-sl::-moz-range-thumb{width:26px;height:26px}\n}\n</style><div class=\"l-head\"><h3 class=\"l-title\">The Ledger</h3><span class=\"l-tag\">global averages · per year</span></div><div class=\"l-total\"><div class=\"l-total-amt\" id=\"l-total\">$0</div><div class=\"l-total-lab\">extracted per year</div></div><div class=\"l-sec\"><div id=\"l-attn-rows\"></div></div><div class=\"l-notes\"><h5>Notes on the method</h5><p><strong>ARPU is rent, not equity.</strong> What a platform earns this year isn't what the underlying data is worth across the wider web and AI training corpora.</p><p><strong>Averages hide heavy users.</strong> Freemium smears free and paying users into one figure. 
If you're all-in, you're worth more than average.</p><p><strong>Multi-product companies cheat the top line.</strong> Google's per-user number isn't all Search — it's Search plus Android plus YouTube plus Cloud.</p></div><script>(function(){var R={meta:.26,youtube:.08,tiktok:.05,x:.07,reddit:.12,snap:.07,pin:.15,li:.15,gq:.04};\nvar ATTN=[{id:'meta',name:'Meta (FB / IG / WhatsApp)',rate:'meta',unit:'hrs/day',mult:365,max:6,step:.25},{id:'youtube',name:'YouTube (ad-supported)',rate:'youtube',unit:'hrs/day',mult:365,max:6,step:.25},{id:'tiktok',name:'TikTok',rate:'tiktok',unit:'hrs/day',mult:365,max:6,step:.25},{id:'x',name:'X (Twitter)',rate:'x',unit:'hrs/day',mult:365,max:4,step:.25},{id:'reddit',name:'Reddit',rate:'reddit',unit:'hrs/day',mult:365,max:4,step:.25},{id:'snap',name:'Snapchat',rate:'snap',unit:'hrs/day',mult:365,max:4,step:.25},{id:'pin',name:'Pinterest',rate:'pin',unit:'hrs/day',mult:365,max:4,step:.25},{id:'li',name:'LinkedIn (free)',rate:'li',unit:'hrs/day',mult:365,max:2,step:.1},{id:'gq',name:'Google Search',rate:'gq',unit:'searches/day',mult:365,max:100,step:1}];\nvar state={attn:{}};\nfunction fmt(n){n=Math.round(n);if(n===0)return'$0';if(n>=1000)return'$'+n.toLocaleString();return'$'+n;}\nfunction recalc(){var tot=0;\nATTN.forEach(function(s){var v=state.attn[s.id]||0;var amt=v*R[s.rate]*s.mult;tot+=amt;var el=document.getElementById('l-a-'+s.id);if(el){el.textContent=fmt(amt);el.classList.toggle('z',amt<1);}var vl=document.getElementById('l-v-'+s.id);if(vl)vl.textContent=v+' '+s.unit;});\ndocument.getElementById('l-total').textContent=fmt(tot);}\nfunction renderAttn(){document.getElementById('l-attn-rows').innerHTML=ATTN.map(function(s){var meta='$'+R[s.rate].toFixed(2)+(s.unit==='searches/day'?' / search':' / hr');return '<div class=\"l-row\"><div class=\"l-row-top\"><div><span class=\"l-row-n\">'+s.name+'</span><span class=\"l-row-meta\">'+meta+'</span></div><span class=\"l-row-a z\" id=\"l-a-'+s.id+'\">$0</span></div><div class=\"l-row-c\"><input type=\"range\" class=\"l-sl\" data-cat=\"attn\" data-svc=\"'+s.id+'\" min=\"0\" max=\"'+s.max+'\" step=\"'+s.step+'\" value=\"0\" aria-label=\"'+s.name+' '+s.unit+'\"><span class=\"l-row-v\" id=\"l-v-'+s.id+'\">0 '+s.unit+'</span></div></div>';}).join('');}\nfunction renderAll(){renderAttn();recalc();}\nvar root=document.getElementById('ledger');\nroot.addEventListener('input',function(e){var t=e.target;if(t.classList.contains('l-sl')){state[t.dataset.cat][t.dataset.svc]=parseFloat(t.value)||0;recalc();}});\nrenderAll();})();</script></section>\n<p>The rates come from the earnings-report maths above — global ARPU divided by\naverage annual hours on the platform.</p>\n<h2>The weekend social network</h2>\n<p>Once the number has somewhere to sit, it's much harder to ignore.</p>\n<p>Most people look at a total over $1,000/yr and go quiet for a second. Not\nbecause any one platform is egregious — on a per-hour basis they really aren't —\nbut because the aggregate is real, and it's been invisible until now. That's the\nfirst useful thing the exercise does. It makes a choice possible.</p>\n<p>The obvious next move is to look at alternatives. Signal instead of WhatsApp.\nKagi or Brave Search instead of Google. Paid Spotify instead of ad-supported\nSpotify. Bluesky or Mastodon instead of X. Fastmail instead of Gmail. None are\nperfect, and some cost actual money — but once you can price what you're\ncurrently &quot;not paying&quot;, the paid alternative often looks less expensive than it\ndid five minutes ago. 
Fastmail at\n$5/month stops being a luxury when the honest comparison is &quot;$60/yr vs being the\nproduct for an ad network that paid $500 for me last year.&quot;</p>\n<p>That's the defensive move. It's the one everyone talks about, every time one of\nthese pieces gets written. You switch to the more honest vendor, you feel\nslightly better, and the fundamental shape of the market doesn't move.</p>\n<p>The more interesting move is what's happened on the <em>build</em> side, and it's the\npart almost nobody has internalised yet.</p>\n<p>Standing up a social app used to take a small team months. You needed a backend\nengineer, a frontend engineer, a designer, probably a DevOps person, and a spare\nthree months. That was the real moat — not the network effects, not the\nalgorithm, but the sheer human-hours required to put a working thing on the\ninternet. That's why the only viable answer for twenty years was to build\nsomething big enough to run ads against. Small social didn't exist because small\nsocial couldn't pay the salaries.</p>\n<p>With Claude Code, Cursor, v0, and Lovable, that equation has quietly inverted. A\nprofile page, a shared feed, a wall for photos, maybe a jukebox, a chat wall — a\nMySpace-sized thing for you and a dozen friends, on a domain you own, with none\nof it feeding anyone's ad platform — is a weekend. I know because I just did it.\nNot as some Silicon Valley startup trying to replace Facebook. As a Saturday\nproject for eight mates, on a domain that cost twelve bucks, running on a box\nthat costs ten a month.</p>\n<p>The bill of materials is embarrassingly short. A boring Postgres. A boring\nNext.js app. Auth via magic link. Storage for photos. An LLM for the fiddly bits\nnobody wants to write from scratch. All of it plumbed together in an afternoon\nof prompting, an evening of cleanup, and a Sunday of adding the jukebox because\nmy mate Hamish wouldn't stop asking.</p>\n<p>It is not good software. It is good <em>enough</em> software for eight humans who know\neach other.</p>\n<p>That qualifier is the whole thing. Facebook has to be good software at planet\nscale because Facebook is selling ad impressions at planet scale. A group of\neight doesn't need p99 latency and a content moderation policy. A group of eight\nneeds a place to put photos from the weekend where the photos don't end up\ntraining someone's image model in twelve months' time. Those are very different\nengineering problems, and the second one is much, much easier than the first.</p>\n<p>A lot of things genuinely don't work on the weekend version. There's no\nrecommendation algorithm. There's no real search. The feed is\nreverse-chronological and that's it. When someone posts something at 3am nobody\nsees it until the morning. There's no cleverness about which photos get surfaced\nor which memories get resurrected. If you go on holiday for two weeks, you come\nback to a feed that's exactly what your eight mates posted, in the order they\nposted it.</p>\n<p>That sounds like a limitation until you notice the thing it is not doing is\noptimising for your engagement. Reverse-chronological across eight friends is\nnot a meter. It's a wall. You check it, you see what's there, you leave. There's\nno reason for the software to try to keep you around because there's nobody\npaying the software to keep you around. That inversion — from meter to wall — is\nthe entire point.</p>\n<p>The thing that would have been a VC round in 2015 is now a side quest you finish\nbefore the roast is in the oven. 
The tools genuinely got that much better in the\nlast eighteen months. We just haven't updated our intuitions yet about what that\nmeans.</p>\n<p>What it means, specifically, is that the ad-supported social network is no\nlonger the only technically viable answer. For twenty years it was. That was the\nconstraint the whole &quot;free web&quot; was built around. The constraint is gone, and\nnobody has sent the memo.</p>\n<p>The cheapest social network in 2026 is the one you and seven mates build on a\nSaturday afternoon. It doesn't scale. It doesn't need to. It costs less than a\nmonth of Netflix, produces no ad revenue for anyone, and feeds no one's training\nset. You own the domain. You own the data. You own the product decisions — which\nin practice means there are no product decisions, because nobody is trying to\nsqueeze another hour out of anyone's week.</p>\n<p>None of this replaces the platforms, to be clear. You still need Gmail for the\nrecruiter, LinkedIn for the job hunt, YouTube for the tutorial, WhatsApp for the\ngroup chat your family refuses to leave. The ad-supported internet isn't going\nanywhere and I'm not pretending it is. What's changed is that it's no longer the\nonly game in town. For the circle of people you actually care about — the eight\nmates, the cousins, the old uni flat — you don't have to hand them over to the\nad machine anymore. You can build them a room of their own, and the tools to\nbuild that room have become trivial in a way we haven't fully absorbed yet.</p>\n<p>The meter's been running your whole life. You just got the tools to turn it off.</p>\n","date_published":"Tue, 21 Apr 2026 12:00:00 GMT"},{"id":"https://jonnonz.com/posts/teaching-a-neural-network-to-watch-crime-like-video/","url":"https://jonnonz.com/posts/teaching-a-neural-network-to-watch-crime-like-video/","title":"Teaching a Neural Network to Watch Crime Like Video","content_html":"<p>ConvLSTM was invented to predict rainstorms.</p>\n<p>Specifically,\n<a href=\"https://arxiv.org/abs/1506.04214\">Shi et al.</a>, working with the Hong\nKong Observatory, needed to forecast radar echo maps: 2D grids of rainfall\nintensity that evolve over time. They had sequences of spatial images and\nwanted to predict the next frames. Sound familiar?</p>\n<p>That's exactly what we built in Part 3. Crime on a 500m grid, one frame per\nmonth, six channels for crime types. The Auckland crime tensor is structurally\nidentical to a weather radar sequence. Same dimensionality, same prediction\ntask, just a very different domain.</p>\n<h2>Why not regular LSTM?</h2>\n<p>Standard LSTM networks are fantastic at learning sequences. They're the backbone\nof a lot of time-series forecasting. But they have a fundamental problem with\nspatial data: they need flat vectors as input.</p>\n<p>To feed our 77×59 grid into a regular LSTM, we'd have to flatten it into a\nvector of 4,543 values per crime type. That's 27,258 values per timestep across\nall six channels. The network would process this as a sequence of big flat\nvectors, with no concept that cell (10, 5) is <em>next to</em> cell (10, 6).</p>\n<p>All the spatial structure (the fact that crime clusters, that hotspots have\nneighbourhoods, that the CBD is a contiguous area) gets thrown away. The model\nwould have to rediscover spatial relationships from scratch, purely from\ncorrelations in the flattened vector. With only 36 training months, that's not\nhappening.</p>\n<h2>The convolutional trick</h2>\n<p>ConvLSTM's insight is elegant. 
Take the standard LSTM equations (the input gate,\nforget gate, output gate, cell state update) and replace every matrix\nmultiplication with a convolution operation.</p>\n<p>In a regular LSTM:</p>\n<pre><code>input_gate = sigmoid(W_xi * x_t + W_hi * h_{t-1} + b_i)\n</code></pre>\n<p>In ConvLSTM:</p>\n<pre><code>input_gate = sigmoid(W_xi ∗ X_t + W_hi ∗ H_{t-1} + b_i)\n</code></pre>\n<p>That <code>∗</code> is a convolution instead of a matrix multiply. <code>X_t</code> is the full 2D\ngrid at time <code>t</code>, and <code>H_{t-1}</code> is the previous hidden state, also a 2D grid.\nThe convolution kernel slides across the spatial dimensions, so each cell's gate\nvalues depend on its local neighbourhood.</p>\n<p>This means the network naturally learns that a spike in cell (10, 5) might\naffect predictions for cell (10, 6). Spatial proximity is baked into the\narchitecture. It doesn't need to learn it from data.</p>\n<p>The kernel size controls how much spatial context each cell sees. A 3×3 kernel\nmeans each cell looks at its immediate 8 neighbours. Stack multiple ConvLSTM\nlayers and the effective receptive field grows. Deeper layers can capture\nrelationships between cells that are several kilometres apart.</p>\n<h2>Architecture choices</h2>\n<p>Here's what I settled on after a fair bit of experimentation (which on CPU means\n&quot;a lot of patient waiting&quot;):</p>\n<pre><code>Input: (batch, 6, 6, 77, 59), 6 months, 6 crime types, 77×59 grid\n  ↓\nConvLSTM2d(in=6, hidden=32, kernel=3×3, padding=1)\n  ↓\nBatchNorm2d\n  ↓\nConvLSTM2d(in=32, hidden=32, kernel=3×3, padding=1)\n  ↓\nBatchNorm2d\n  ↓\nConv2d(in=32, out=6, kernel=1×1), project to 6 crime type channels\n  ↓\nOutput: (batch, 6, 77, 59), next month prediction\n</code></pre>\n<p>Two ConvLSTM layers with 32 hidden channels each. The 3×3 kernel gives each cell\na neighbourhood view, and stacking two layers means the effective receptive\nfield covers about 1–1.5 km. Enough to capture the spatial extent of most crime\nhotspots.</p>\n<p>Why only 32 hidden channels? This is where the CPU constraint actually helps. A\nbigger model would be tempting with a GPU, but on a Ryzen 5 we need to keep it\ntight. 32 channels gives us about 200k trainable parameters: small enough to\ntrain in under an hour, large enough to learn meaningful spatial-temporal\npatterns.</p>\n<p>The 1×1 convolution at the end is a channel projection. It maps the 32 learned\nfeatures back to 6 crime type predictions.</p>\n<h2>Sequence length: six months</h2>\n<p>The lookback window is six months. The model sees January through June and\npredicts July. Then February through July to predict August. And so on.</p>\n<p>Six months captures one half of the seasonal cycle, which turned out to be the\nsweet spot. Shorter sequences (3 months) missed seasonal context. Longer\nsequences (12 months) didn't improve results, likely because the model doesn't\nhave enough data to learn year-long dependencies with only 36 training months\ntotal.</p>\n<p>The training set gives us 30 sequences (months 1–6 predict 7, months 2–7 predict\n8, all the way to months 30–35 predict 36). That's not a lot. Every sequence\ncounts.</p>\n<h2>Training details</h2>\n<pre><code class=\"language-python\">optimiser = Adam(lr=1e-4)\nloss = MSE  # on log1p-transformed values\nbatch_size = 4  # small because sequences are large\nepochs = 150 with early stopping (patience=15)\n</code></pre>\n<p>The <code>log1p</code> transformation from Part 3 is critical here. Raw crime counts range\nfrom 0 to 50+. 
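</p>\n<p>A quick check of what the transform does to that scale (not from the post,\njust numpy):</p>\n<pre><code class=\"language-python\">import numpy as np\n\ncounts = np.array([0, 1, 5, 20, 50])\nprint(np.log1p(counts).round(2))  # 0, 0.69, 1.79, 3.04, 3.93\n</code></pre>\n<p>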
After <code>log1p</code>, the range compresses to 0–4. Without this, the\nloss function would be dominated by the handful of high-count CBD cells, and the\nmodel would essentially ignore the rest of the grid.</p>\n<p>Training on CPU takes about 40 minutes per run. Not fast, but manageable. I\ncould typically fit in 3–4 experimental runs per evening, which meant progress\nwas slow but steady. Each run I'd tweak one thing (kernel size, hidden channels,\nlearning rate) and compare validation MAE.</p>\n<p>Early stopping triggers around epoch 80–100 in most runs. The model converges\nrelatively quickly, which makes sense given the small dataset and architecture.</p>\n<h2>Results</h2>\n<p>So how does ConvLSTM stack up against the baselines from Part 5?</p>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>Hist. Avg MAE</th>\n<th>ConvLSTM MAE</th>\n<th>Improvement</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>1.28</td>\n<td>1.14</td>\n<td>10.9%</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>0.35</td>\n<td>0.32</td>\n<td>8.6%</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>0.20</td>\n<td>0.19</td>\n<td>5.0%</td>\n</tr>\n<tr>\n<td>Robbery</td>\n<td>0.04</td>\n<td>0.04</td>\n<td>2.5%</td>\n</tr>\n<tr>\n<td>Sexual</td>\n<td>0.03</td>\n<td>0.03</td>\n<td>~0%</td>\n</tr>\n<tr>\n<td>Harm</td>\n<td>0.01</td>\n<td>0.01</td>\n<td>~0%</td>\n</tr>\n<tr>\n<td><strong>All types</strong></td>\n<td><strong>0.39</strong></td>\n<td><strong>0.35</strong></td>\n<td><strong>10.3%</strong></td>\n</tr>\n</tbody>\n</table>\n<p>A 10% improvement on the aggregate MAE. Not earth-shattering, but real.</p>\n<p>Theft gets the biggest lift because there's the most signal to work with. The\nmodel genuinely learns spatial dynamics that the historical average can't\ncapture. When a cluster of cells in South Auckland trends upward over several\nmonths, ConvLSTM picks up on that momentum and adjusts its predictions\naccordingly.</p>\n<p>Burglary sees a decent improvement too, likely driven by the spatial correlation\nwith theft that we spotted in the EDA.</p>\n<p>For the sparse crime types (robbery, sexual offences, harm) ConvLSTM basically\nlearns to predict near-zero, same as the baseline. There simply isn't enough\nsignal at 500m monthly resolution for these types. The model is honest about\nwhat it doesn't know, which I actually respect.</p>\n<h2>Where it shines and where it doesn't</h2>\n<p>The improvement isn't uniform across the grid. ConvLSTM does best in the\ntransition zones: cells on the edges of established hotspots where crime counts\nfluctuate month to month. It learns that these boundary cells tend to follow the\ntrend of their neighbours, which is exactly the kind of spatial-temporal pattern\nit was designed to capture.</p>\n<p>In the stable hotspot cores (the CBD, Manukau) the model performs about the same\nas the baseline. Those cells are consistently high, and the historical average\nalready captures that well.</p>\n<p>Where it properly struggles is with sudden spikes in normally quiet areas. A\ncell that's been near-zero for months and then gets 5 thefts in one month: the\nmodel doesn't see that coming. Neither does any other model, to be fair. 
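</p>\n<p>A rough way to see why: if a cell's history really were a Poisson process\nwith a mean of 0.2 thefts a month (an assumed figure, for illustration), a\nmonth with 5 would be a deep tail event.</p>\n<pre><code class=\"language-python\">from scipy.stats import poisson\n\n# P(5 or more thefts in a month | historical mean of 0.2 per month)\nprint(poisson.sf(4, mu=0.2))  # roughly 2e-06\n</code></pre>\n<p>Nothing in the cell's own past points at a month like that. 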
Those\nevents are closer to random noise than learnable signal.</p>\n<h2>Putting it in perspective</h2>\n<p>A 10% MAE improvement is meaningful but modest.\n<a href=\"https://arxiv.org/pdf/2502.07465v1\">Recent ConvLSTM crime prediction papers</a>\nreport larger gains, but they typically work with much more data: years of daily\nrecords across cities with higher crime density. Our setup is tougher. Monthly\nresolution limits temporal signal, Auckland is relatively low-crime by global\nstandards, and we only have four years.</p>\n<p>The model is also running on CPU with a deliberately small architecture. A\nbigger model on a GPU might squeeze out more performance. But the point of this\nproject was always to see how far you can push it with modest resources, and a\n10% beat over simple baselines feels like a real result.</p>\n<p>The question now is whether ST-ResNet's different approach to temporal modelling\ncan do better. ConvLSTM processes time as one continuous sequence. ST-ResNet\nbreaks it into three separate temporal scales: closeness, period, and trend.\nWith a seasonal dataset like crime, that decomposition might be exactly what's\nneeded.</p>\n","date_published":"Thu, 16 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/open-source-agent-that-teaches-claude-code-your-architecture/","url":"https://jonnonz.com/posts/open-source-agent-that-teaches-claude-code-your-architecture/","title":"Open-Source Agent That Teaches Claude Code Your Architecture","content_html":"<p>AI has made building software cheap. A solo founder with Claude Code or Cursor\ncan ship an MVP in a weekend that would've taken a small team a month two years\nago. I've watched this happen across the NZ startup scene. Ideas that used to\ndie in the &quot;can we afford to build it&quot; phase now get built over a long weekend.</p>\n<p>This is mostly great. Velocity is what startups need. Cost of testing an idea is\nnow close to zero, and the business prioritises speed.</p>\n<p>The catch shows up when the idea works.</p>\n<p>AI builds for <em>right now</em>. It optimises for the current prompt, the current\nfile, the current feature. It doesn't think about what happens when your billing\nservice needs to handle 10x the volume, or when your email notifications need to\nmove from inline calls to a queue. It doesn't plan for the evolutionary pressure\nyour system will face once it has users.</p>\n<p>That's the gap I've been thinking about, and it's what led me to build\n<a href=\"https://github.com/jonnonz1/domain-agents\">domain-agents</a>.</p>\n<h2>Give the tools their credit</h2>\n<p>I want to be fair to the current generation of AI coding assistants. They're not\nstupid about finding code.</p>\n<p>Claude Code runs an agentic search loop (grep, glob, file reads) iterating\nthrough your codebase to find what's relevant. Boris Cherny (who created Claude\nCode) <a href=\"https://x.com/bcherny/status/2017824286489383315\">has said</a> they tried\nRAG with a local vector database early on and dropped it because agentic search\noutperformed it. Cursor takes a different approach: it\n<a href=\"https://read.engineerscodex.com/p/how-cursor-indexes-codebases-fast\">chunks your codebase, generates embeddings</a>,\nand stores them for semantic search so you can find code by concept rather than\nkeyword. Copilot combines semantic indexing with LSP-powered reference tracing\nfrom VS Code.</p>\n<p>The search works. If you ask Claude Code to find your billing service, it'll\nfind it. 
Ask Cursor for authentication logic and the embeddings will surface it\neven if the code never uses the word &quot;authentication.&quot;</p>\n<p>None of them understand the architecture those files live in.</p>\n<p>All the information needed to understand domain relationships sits in the code:\nimport graphs, interface signatures, dependency patterns. These tools don't\nextract or structure it that way. They find files one at a time. They don't map\nout that your billing service depends on the email service, that\n<code>BillingService</code> is consumed by two other domains, or that changing its\ninterface is a cross-domain event. The information is in the codebase. Nobody's\npulling it together.</p>\n<p>And every session starts from zero. The AI learned your architecture yesterday\nand forgot it today.</p>\n<h2>Evolutionary architecture for the AI era</h2>\n<p>My thesis: cheap AI-built MVPs plus expensive scaling problems point toward\nevolutionary architecture with domain-based boundaries.</p>\n<p>The idea isn't new. The reason it matters now is.</p>\n<p>In an evolutionary architecture, you focus on clean interfaces between business\ndomains. Your email service exposes a contract like\n<code>sendEmail(to, subject, body)</code>, and the rest of the system calls that interface.\nBehind the interface, the implementation evolves through stages as your scaling\nneeds change:</p>\n<pre><code class=\"language-mermaid\">graph LR\n    A[&quot;Inline\\n(direct call)&quot;] --&gt; B[&quot;Async\\n(fire &amp; forget)&quot;]\n    B --&gt; C[&quot;Queued\\n(BullMQ/SQS)&quot;]\n    C --&gt; D[&quot;Separate Service&quot;]\n    D --&gt; E[&quot;Distributed&quot;]\n</code></pre>\n<p>Day one, <code>sendEmail</code> is a function that calls Resend directly. Inline,\nsynchronous, dead simple. When traffic picks up, you drop the <code>await</code> and let it\nrun in the background. Later, you introduce BullMQ or SQS. Eventually it becomes\nits own service. The interface stays put. Only the implementation behind it\nchanges.</p>\n<p>This is the kind of evolution AI coding assistants are terrible at planning for.\nThey'll inline that email call because it works <em>right now</em>. They have no\nconcept of where this domain sits on its scaling trajectory.</p>\n<h2>Where domain-agents fits in</h2>\n<p><a href=\"https://github.com/jonnonz1/domain-agents\">domain-agents</a> is a CLI tool that\nruns static analysis on TypeScript codebases, discovers business domains, and\ngenerates AI agent context files for Claude Code and Cursor.</p>\n<pre><code class=\"language-bash\">domain-agents discover .    # Analyse codebase → proposal.json\ndomain-agents init .        # Generate agents/*.md + AGENTS.md\ndomain-agents hooks claude  # Wire into Claude Code (rules + MCP server)\ndomain-agents hooks cursor  # Wire into Cursor (.mdc rules)\n</code></pre>\n<p>After setup, opening <code>src/billing/invoice.ts</code> in Claude Code loads the billing\ndomain agent into context. The AI now knows: billing depends on email (coupling\nscore 0.23), exposes <code>BillingService</code> consumed by 2 other domains, sits at the\n&quot;inline&quot; scaling stage with a path toward async queuing, and has 3 tracked tech\ndebt items.</p>\n<p>It plans work accordingly. 
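</p>\n<p>For a sense of what that context looks like on disk, here's the rough shape of a generated <code>agents/billing.md</code>. Illustrative, not the tool's exact output:</p>\n<pre><code class=\"language-markdown\"># billing\n\n- Depends on: email (coupling 0.23)\n- Exposes: BillingService, consumed by 2 domains\n- Scaling stage: inline, next step async (fire &amp; forget)\n- Tech debt: 3 tracked items\n</code></pre>\n<p>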
The context was loaded before the first prompt, no\nsearch required.</p>\n<h2>Five signals, not one</h2>\n<p>The discovery engine runs 5 analysis passes because no single signal identifies\nbusiness domains on its own.</p>\n<p>Directory structure works for greenfield projects (<code>src/auth/</code>, <code>src/billing/</code>)\nbut fails for legacy MVC apps. Import graphs capture coupling but not business\nintent. Package dependencies hint at external integrations but miss internal\ndomains.</p>\n<pre><code class=\"language-mermaid\">graph TD\n    S[&quot;Structure Analysis&quot;] --&gt; O[&quot;Signal Orchestrator&quot;]\n    I[&quot;Import Graph\\n(TS Compiler API)&quot;] --&gt; O\n    N[&quot;Naming Patterns&quot;] --&gt; O\n    D[&quot;Dependency Mapping\\n(npm → domain hints)&quot;] --&gt; O\n    IF[&quot;Interface Detection&quot;] --&gt; O\n    O --&gt; M[&quot;Merge Pipeline&quot;]\n    M --&gt; R[&quot;Domain Proposal&quot;]\n</code></pre>\n<p><strong>Structure</strong> detects whether the codebase is feature-organised,\nlayer-organised, mixed, or flat. <strong>Import graph</strong> uses the TypeScript Compiler\nAPI to parse each <code>.ts</code> file, resolve imports, and build a directed edge graph.\nType-only imports get weighted at 0.3 because they're a weaker coupling signal\nthan value imports. <strong>Naming patterns</strong> extract domain prefixes:\n<code>auth.controller.ts</code> → &quot;auth&quot;. <strong>Dependency mapping</strong> maps npm packages to\ndomain hints (<code>stripe</code> → billing, <code>@sendgrid/mail</code> → email). <strong>Interface\ndetection</strong> identifies files imported across domain boundaries and calculates\ncoupling scores between domain pairs.</p>\n<p>Each pass produces weighted signals. The orchestrator combines them with\nconfidence scoring: average signal strength plus a bonus for signal count,\ncapped at 0.99. Layer-organised codebases get an 0.85 multiplier because they're\nharder to discover.</p>\n<h2>Most real codebases aren't clean</h2>\n<p>Feature-organised codebases are easy. The directory structure <em>is</em> the domain.\nBut most real codebases look like this:</p>\n<pre><code>src/\n  controllers/\n    auth.controller.ts\n    billing.controller.ts\n  services/\n    auth.service.ts\n    billing.service.ts\n  models/\n    invoice.model.ts\n    user.model.ts\n</code></pre>\n<p>Here <code>auth.controller.ts</code>, <code>auth.service.ts</code>, and <code>auth.routes.ts</code> all belong to\nthe &quot;auth&quot; domain despite living in three different directories. domain-agents\nuses naming pattern extraction cross-referenced with import graph cohesion to\ncluster these. The <code>auth.*</code> files form a tight import cluster, which confirms\nthe naming signal.</p>\n<h2>Merging is the hard bit</h2>\n<p>Raw signals produce too many small, overlapping clusters. The orchestrator runs\na multi-phase normalisation pipeline.</p>\n<p>Plurals merge: <code>journals</code> + <code>journal</code> → whichever has more files. Compound names\nconsolidate: <code>bank-balance</code> + <code>bank-statement</code> + <code>bank-transaction</code> →\n<code>bank-accounts</code> (the largest cluster). Small clusters merge into their strongest\nimport target, but only if they have a dominant dependency: more than 40% of\nimports from one target, and that target is at least 2x larger. 
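</p>\n<p>The rule itself is small. A Python sketch of the logic (the real implementation is TypeScript, and these names are mine):</p>\n<pre><code class=\"language-python\">def merge_target(imports: dict[str, int], size: int,\n                 domain_sizes: dict[str, int]) -&gt; str | None:\n    &quot;&quot;&quot;Return the domain a small cluster should merge into, or None.&quot;&quot;&quot;\n    total = sum(imports.values())\n    if total == 0:\n        return None\n    target, count = max(imports.items(), key=lambda kv: kv[1])\n    if count / total &lt;= 0.4:  # needs a dominant dependency: over 40% of imports\n        return None\n    if domain_sizes[target] &lt; 2 * size:  # and a target at least 2x larger\n        return None\n    return target\n</code></pre>\n<p>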
This prevents\ncascading, where A merges into B, B gets bigger and attracts C, C pulls in D.</p>\n<p>Files that import from 3+ domains get moved to &quot;unassigned.&quot; These are coupling\nhotspots: middleware, orchestrators, shared handlers. Assigning them to one\ndomain would mislead the AI, so the tool surfaces them for a human decision.\nThat's the right call for architectural boundaries.</p>\n<p>The E2E test suite validates the complete pipeline against 3 fixture codebases\n(feature-organised, layer-organised, mixed). Current benchmark: 100% activation\naccuracy across all 3 patterns and all 3 activation levels (domain assignment,\nglob matching, MCP lookup).</p>\n<h2>Auto-activation, not search</h2>\n<p>The integration into Claude Code and Cursor uses glob-based rule activation, the\nnative mechanism both tools already support.</p>\n<p>Each domain gets a rule file with glob patterns in the frontmatter:</p>\n<pre><code class=\"language-yaml\">---\ndescription: billing domain\nglobs:\n  - src/billing/**\n  - **/billing.*\n  - **/billing-*\n---\n</code></pre>\n<p>When Claude Code opens a file matching those globs, the domain context loads. No\nMCP call, no background process, zero runtime overhead.</p>\n<p>An <a href=\"https://modelcontextprotocol.io/\">MCP server</a> complements the rules with 4\non-demand tools: <code>domain_lookup(file)</code>, <code>domain_context(name)</code>,\n<code>domain_files(name)</code>, and <code>list_domains()</code>. A SessionStart hook prints a domain\nsummary at the start of every Claude Code session, so the AI has system-level\nawareness from the first prompt.</p>\n<h2>Agents as a team model</h2>\n<p>This is the bit I'm most keen on long-term.</p>\n<p>At Vend and Xero, teams owned domains. The billing team owned billing, the\nintegrations team owned integrations. Ownership meant knowing the interfaces,\nthe coupling points, the tech debt, and where things were headed. That knowledge\nlived in people's heads and got passed on through code reviews, architecture\nchats, and tribal memory.</p>\n<p>Domain-specific AI agents formalise that same ownership model. An email agent\nloads the email domain's interface contract, its coupling to other domains, its\ncurrent scaling stage, and its tracked tech debt. A billing agent carries the\nsame for billing. They work within their boundaries and flag when a change\ncrosses a domain line.</p>\n<p>You don't need this from day one. Early on, one agent covers multiple areas. As\nthe product grows, agents split along the same lines engineering teams split: by\nbusiness domain. The operator (that's you) resolves conflicts where agents\ndisagree, the same way an engineering manager resolves cross-team dependencies.</p>\n<p>The analogy is rough, but it captures how AI-assisted development scales past a\nsingle person staring at a single context window.</p>\n","date_published":"Wed, 15 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/","url":"https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/","title":"Claude Code Can Now Spawn Copies of Itself in Isolated VMs","content_html":"<p>The moment this project went from &quot;fun weekend hack&quot; to something I actually use\nevery day was when I got the MCP server working. 
Claude Code on my laptop sends\na prompt to the orchestrator sitting under my desk, which boots a VM, runs\nClaude Code inside it with full permissions, and streams the results back.\nClaude delegating work to Claude.</p>\n<p>It's a weird feeling watching it happen. You're in a conversation with Claude,\nit decides a task needs isolation, calls the MCP tool, and a few seconds later\nyou can see a fresh VM spinning up in the dashboard. Like having an intern who\ncan clone themselves.</p>\n<p><a href=\"https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/\">Part 1</a>\ncovered why I built this.\n<a href=\"https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/\">Part 2</a> was the\nguts of it — rootfs, networking, the guest agent. This last post is about the\ninterfaces, the streaming pipeline, and what I'd change if this needed to work\nfor more than just me.</p>\n<h2>The MCP server</h2>\n<p>The orchestrator exposes an <a href=\"https://modelcontextprotocol.io/\">MCP</a> server with\neight tools. The main one is <code>run_task</code> — give it a prompt, optional config\n(RAM, vCPUs, timeout, max turns), and it blocks until the task completes.\nReturns the task ID, status, exit code, result files, cost, and the output\ntruncated to 4000 characters.</p>\n<p>Two transport modes. Stdio for when Claude Code runs on the same machine:</p>\n<pre><code class=\"language-json\">{\n  &quot;mcpServers&quot;: {\n    &quot;orchestrator&quot;: {\n      &quot;command&quot;: &quot;sudo&quot;,\n      &quot;args&quot;: [&quot;/opt/firecracker/bin/orchestrator&quot;, &quot;mcp&quot;]\n    }\n  }\n}\n</code></pre>\n<p>And Streamable HTTP for network access — Claude Code on any machine on the LAN\ncan use it:</p>\n<pre><code class=\"language-json\">{\n  &quot;mcpServers&quot;: {\n    &quot;orchestrator&quot;: {\n      &quot;type&quot;: &quot;http&quot;,\n      &quot;url&quot;: &quot;http://192.168.50.44:8081/mcp&quot;\n    }\n  }\n}\n</code></pre>\n<p>The other tools are for poking around: <code>get_task_status</code>, <code>list_vms</code>,\n<code>exec_in_vm</code> (run a command in a still-running VM), <code>read_vm_file</code>,\n<code>destroy_vm</code>, <code>list_task_files</code>, and <code>get_task_file</code>. That last one is smart\nabout content types — text files come back as plain text, images come back as\nbase64 MCP image content so Claude can actually see screenshots the VM took.</p>\n<pre><code class=\"language-go\">if isImageMime(mimeType) {\n    encoded := base64.StdEncoding.EncodeToString(data)\n    return mcplib.NewToolResultImage(&quot;Screenshot from task &quot;+taskID, encoded, mimeType), nil\n}\n</code></pre>\n<h2>The migration that broke everything</h2>\n<p>This bit is worth telling because it'll save someone else the debugging time.</p>\n<p>I originally built the MCP server with\n<a href=\"https://github.com/mark3labs/mcp-go\">mcp-go</a> v0.45.0 using SSE (Server-Sent\nEvents) transport. Worked great. Then Claude Code updated to expect the newer\nStreamable HTTP transport, and everything fell over.</p>\n<p>The failure mode was confusing. Claude Code would try to connect, attempt OAuth\ndiscovery against the <code>/sse</code> endpoint, get a 404 (my server doesn't do OAuth),\nand fail with:</p>\n<pre><code>Error: HTTP 404: Invalid OAuth error response: SyntaxError: JSON Parse error: Unable to parse JSON string\n</code></pre>\n<p>Nothing in my code changed. 
The client just started speaking a different\nprotocol.</p>\n<p>The fix was small once I understood it:</p>\n<pre><code class=\"language-go\">// Before — SSE transport\nfunc (s *Server) ServeSSE(addr string) error {\n    sseServer := server.NewSSEServer(s.mcpServer,\n        server.WithBaseURL(&quot;http://&quot;+addr),\n    )\n    return sseServer.Start(addr)\n}\n\n// After — Streamable HTTP transport\nfunc (s *Server) ServeHTTP(addr string) error {\n    httpServer := server.NewStreamableHTTPServer(s.mcpServer,\n        server.WithEndpointPath(&quot;/mcp&quot;),\n        server.WithStateLess(true),\n    )\n    return httpServer.Start(addr)\n}\n</code></pre>\n<p>Bumped mcp-go from v0.45.0 to v0.46.0, swapped the server constructor, changed\nthe endpoint from <code>/sse</code> to <code>/mcp</code>, updated the client config. Done. But\ndiagnosing &quot;OAuth error on a server that doesn't do OAuth&quot; — that bit took a\nwhile.</p>\n<h2>Output streaming</h2>\n<p>When Claude Code runs inside a VM, its output needs to get from stdout inside\nthe guest all the way to a browser tab on my laptop. The path:</p>\n<pre><code class=\"language-mermaid\">flowchart LR\n    A[&quot;Claude Code stdout&quot;] --&gt; B[&quot;Guest agent\\nvsock frame&quot;]\n    B --&gt; C[&quot;Host vsock client\\nExecStream&quot;]\n    C --&gt; D[&quot;Task runner\\nOnEvent callback&quot;]\n    D --&gt; E[&quot;Stream Hub\\nring buffer + fan-out&quot;]\n    E --&gt; F[&quot;WebSocket\\nto browser&quot;]\n</code></pre>\n<p>The stream hub (<code>internal/stream/hub.go</code>) is a per-task pub/sub system. Each\ntask gets a stream with a 1000-event ring buffer. When a WebSocket client\nconnects, it gets all the buffered history first, then live events as they\narrive.</p>\n<p>Fan-out is non-blocking:</p>\n<pre><code class=\"language-go\">for ch := range s.subscribers {\n    select {\n    case ch &lt;- event:\n    default:\n        // Subscriber is slow, drop the event\n    }\n}\n</code></pre>\n<p>A slow WebSocket client can't block the task runner. If the browser can't keep\nup, it misses events. In practice this never happens because the bottleneck is\nalways Claude thinking, not the network.</p>\n<h2>The web dashboard</h2>\n<p>The React frontend is compiled to static files and embedded into the Go binary:</p>\n<pre><code class=\"language-go\">//go:embed all:web-dist\nvar webDistEmbed embed.FS\n</code></pre>\n<p>Single binary deployment. No nginx, no separate frontend server, no CORS\nheadaches in production. The API server falls through to <code>index.html</code> for\nunknown paths, which gives you SPA client-side routing.</p>\n<p>The most interesting page is the task detail view. Claude Code's\n<code>--output-format stream-json</code> spits out one JSON object per line — thinking\nblocks, text responses, tool calls, tool results, cost summaries. The dashboard\nparses these into coloured blocks:</p>\n<ul>\n<li>Purple for thinking (Claude's internal reasoning)</li>\n<li>Blue for text responses</li>\n<li>Orange for tool calls (shows the tool name and input)</li>\n<li>Grey for tool results (truncated to 2000 chars — some of these are enormous)</li>\n<li>Green for the final result with cost</li>\n</ul>\n<p>A <code>useWebSocket</code> hook connects when the task is running and disconnects when\nit's done. Green pulsing dot for live streaming. Auto-scroll to the bottom as\nevents arrive. 
Image files in the results get inline previews pointing at the\nAPI's file download endpoint — so when Claude takes a screenshot inside the VM,\nyou see it immediately.</p>\n<p>Dark theme. Orange accents. Obviously.</p>\n<h2>What productionising looks like</h2>\n<p>This runs on one box with no auth. It's a home lab project. But the gap between\n&quot;works for me&quot; and &quot;works for a small team&quot; isn't as big as it looks.</p>\n<p><strong>Persistence</strong> is the most obvious one. The task store is an in-memory Go map.\nOrchestrator restarts? All task history gone. VM metadata already persists to\ndisk and gets recovered on startup — tasks should too. SQLite or bbolt, a few\nhours of work. I just haven't needed it because I don't restart the process very\noften.</p>\n<p><strong>Task queue with backpressure.</strong> Right now tasks fire as goroutines with no\nconcurrency limit. Submit 20 tasks on a 30GB machine where each VM wants 2GB and\nthe last few fail because there's no memory left. A buffered channel or\nsemaphore would fix this. You could get fancier with priority queues — quick\ncode generation tasks ahead of long research tasks — but even a simple\nconcurrency cap would be enough.</p>\n<p><strong>Authentication.</strong> The REST API and MCP server accept requests from anyone who\ncan reach the port. For a team: API keys at minimum, mTLS if you're serious\nabout it. The MCP spec supports auth flows now — that'd be the right way to do\nit for the MCP endpoint.</p>\n<p><strong>The OnEvent callback race.</strong> This one's a latent bug. The task runner's\n<code>OnEvent</code> callback is stored on the runner struct, not passed per-task:</p>\n<pre><code class=\"language-go\">s.taskRunner.OnEvent = func(id string, event agent.StreamEvent) {\n    taskStream.Publish(event)\n}\ns.taskRunner.Run(context.Background(), t)\n</code></pre>\n<p>Two simultaneous tasks overwrite each other's callbacks. It works today because\nMCP tasks block (one at a time) and the API handler sets up the stream before\nthe goroutine runs. But it's the kind of thing that works until it doesn't. Fix\nis trivial — pass the callback into <code>Run()</code> as a parameter.</p>\n<p><strong>Graceful shutdown.</strong> There's no signal handler. Ctrl-C kills the process,\nrunning VMs become orphans. They keep running as Firecracker processes — the\n<code>recoverState()</code> function on next startup finds them and starts tracking them\nagain — but their tasks are lost. A proper signal handler would stop accepting\nnew tasks, wait for running ones to finish with a timeout, then tear everything\ndown cleanly.</p>\n<p><strong>For real multi-user</strong> you'd want result storage on S3 or R2 instead of local\ndisk. A web auth layer. Per-user credential vaults so different people's Claude\ntokens don't mix. Usage tracking and cost attribution.</p>\n<p><strong>What I wouldn't change:</strong> the single-binary deployment, vsock for host-guest\ncommunication, ephemeral VMs as the isolation model, the embedded frontend.\nThose are the right calls regardless of scale. The architecture is sound — it's\nthe operational bits around it that need work.</p>\n<p>Most of these are a weekend each. The project is about 3,200 lines of Go and 860\nof TypeScript. It's not a big codebase. Adding persistence, auth, and a task\nqueue would maybe take it to 4,500 lines. Still fits in your head.</p>\n<p>For now, it sits under my desk and boots VMs when I ask it to. 
Claude delegating\nto Claude, in complete isolation, on hardware I own. That's enough.</p>\n","date_published":"Mon, 13 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/openhealth-chat-with-apple-health-data/","url":"https://jonnonz.com/posts/openhealth-chat-with-apple-health-data/","title":"OpenHealth – Chat with Apple Health Data, Anywhere","content_html":"<p>For years I've worn an Apple Watch and let my iPhone quietly hoover up my\nresting heart rate, HRV, sleep stages, every workout, every nutrition log.\nMillions of data points. And for most of that time, when I wanted to actually\n<em>ask</em> something about my training — &quot;am I cooked this week?&quot;, &quot;has my recovery\ngotten worse since Christmas?&quot; — I'd open ChatGPT and get an answer that was\nbasically vibes, because it couldn't see any of the data.</p>\n<p>So I built <a href=\"https://github.com/jonnonz1/openhealth\">openhealth</a>. It turns your\nApple Health export into seven short markdown files any LLM can read. Drop the\nzip in your browser at\n<a href=\"https://openhealth-axd.pages.dev/\">openhealth-axd.pages.dev</a>, run the CLI, or\nbeam the zip straight from your iPhone over WebRTC. Paste the output into Claude\nor ChatGPT and start asking the questions you actually wanted to ask.</p>\n<p><img src=\"https://jonnonz.com/img/posts/openhealth/hero.png\" alt=\"openhealth's web app — drop the zip, get seven markdown files, nothing uploaded\"></p>\n<h2>What's US-only and why that's annoying</h2>\n<p>In January, Anthropic\n<a href=\"https://www.macrumors.com/2026/01/22/claude-ai-adds-apple-health-connectivity/\">shipped an Apple Health connector</a>\nfor Claude. OpenAI has one in ChatGPT. Both are US-only — if you're in New\nZealand like me, or the UK, EU, or Switzerland,\n<a href=\"https://context-link.ai/blog/chatgpt-connectors\">they're not available</a>. That's\na lot of people locked out of the most natural way to use this data.</p>\n<p>And even if you are in the US, you're letting Anthropic or OpenAI decide what\nthe model reads, how it's framed, and what tier unlocks it. I wanted control\nover the whole pipeline — including which LLM I feed it into.</p>\n<h2>What I built</h2>\n<p>openhealth ships three ways.</p>\n<p><strong>A static web app.</strong> Drop <code>export.zip</code>, wait five seconds, download seven\nfiles. The browser does the parse. There's no upload endpoint because there's no\nserver — the Cloudflare Pages site is static HTML plus a tiny Web Worker. Open\nDevTools, watch the Network panel, nothing goes out.</p>\n<p><strong>A Bun-compiled CLI.</strong> <code>openhealth ~/export.zip -o ./output</code> gets you seven\nmarkdown files. <code>--bundle</code> concatenates them into one. <code>--clipboard</code> pushes that\nbundle straight to your system clipboard so you can paste it into any chat\nwindow. Zero deps beyond <code>saxes</code> for XML and <code>fflate</code> for unzip — even the\nargument parsing is <code>node:util parseArgs</code>, not Commander. One binary, put it\nwherever.</p>\n<p><strong>A phone-to-desktop handoff over WebRTC.</strong> The desktop site renders a QR code.\nPoint your iPhone camera at it, Safari opens a tiny receiver page, pick the zip,\nand it streams directly to your desktop browser over a DataChannel. 
The only\nbackend in the whole stack is a ~100-line Cloudflare Worker that relays the\nWebRTC handshake — it never sees a byte of your health data.</p>\n<p><img src=\"https://jonnonz.com/img/posts/openhealth/walkthrough.png\" alt=\"Getting the export off your iPhone — six taps, or scan the desktop QR\"></p>\n<h2>How the parse actually works</h2>\n<p>Apple's <code>export.xml</code> is\n<a href=\"https://www.tdda.info/in-defence-of-xml-exporting-and-analysing-apple-health-data\">properly huge</a>.\nA long-term Watch user can easily have a 500MB–4GB file with millions of rows.\nMost XML parsers build a tree in memory, which OOMs before they finish.</p>\n<p>openhealth uses <a href=\"https://github.com/lddubeau/saxes\">saxes</a> — a streaming SAX\nparser in pure TypeScript. It's isomorphic, so the same parser runs in Bun,\nNode, and the browser. I tested it against a synthetic 169MB / 1 million-record\nexport and it finished in about 5 seconds in Chrome, with the main-thread heap\nstaying around 5MB because the parse runs in a Web Worker.</p>\n<p>The rest of the core is a small pipeline: stream XML, accumulate\nper-record-type, roll up into weekly and monthly summaries, run each through a\nwriter that produces one markdown file. Every writer is snapshot-tested against\nbyte-for-byte expected output. 85 tests, TDD throughout.</p>\n<h2>What the seven files are</h2>\n<p>Each one is deliberately small and shaped to be LLM-readable:</p>\n<ul>\n<li><code>health_profile.md</code> — baselines, data sources, long-term averages</li>\n<li><code>weekly_summary.md</code> — current week plus a 4-week rolling comparison with\nweek-over-week deltas</li>\n<li><code>workouts.md</code> — detailed log for the last 4 weeks: HR, duration, distance,\nenergy</li>\n<li><code>body_composition.md</code> — weight trend, recent readings, nutrition averages</li>\n<li><code>sleep_recovery.md</code> — nightly stages, 8-week averages, HRV, resting HR, SpO2\ntrends</li>\n<li><code>cardio_fitness.md</code> — running log, HR-zone distribution, walking-speed trends</li>\n<li><code>prompt.md</code> — a ready-to-paste system prompt that frames the other six as\ncoaching input</li>\n</ul>\n<p>Drop one file or all seven, depending on which chat model you're using.</p>\n<h2>What it's actually good at</h2>\n<p>Feeding real data to an LLM is a different experience from answering its\nquestions. When Claude can see that my resting HR has crept up 4bpm over the\nlast fortnight while my HRV has dropped and my training load stayed the same, it\ngives a real answer — &quot;you're likely undercooked on recovery this week, here's\nwhat I'd change&quot; — rather than a generic reminder to drink water.</p>\n<p>It's especially good if you've got multiple devices in the mix. I've got data\nfrom Apple Watch, the iPhone step counter, a Withings scale, and MyFitnessPal.\nThe parser picks the highest-trust source per metric — Apple Watch wins over\niPhone for steps, Watch sleep beats AutoSleep which beats Withings,\nduplicate-weight entries on the same day get deduped. You feed in one zip and\nget one coherent picture.</p>\n<p>Ask it about your recovery, your training load, what you might be doing wrong,\nhow your sleep correlates with your long runs. It'll tell you — and it'll be\nright more often than not.</p>\n<h2>If privacy matters, go all the way</h2>\n<p>openhealth itself never uploads your data. The web app parses in your browser\ntab. The CLI runs locally. 
The WebRTC handoff stays peer-to-peer — the\nCloudflare Worker that relays the handshake never sees a byte of the file. Clone\nthe repo, diff the build output, and confirm it yourself.</p>\n<p>When you paste the seven files into ChatGPT or Claude, <em>they</em> see the data.\nThat's the trade most people will take for convenience, and it's fine. But if\nyou don't want to make that trade, you don't have to — run the CLI and pipe the\nbundle into a local model:</p>\n<pre><code class=\"language-bash\">openhealth ~/export.zip --bundle -o ./out\nollama run llama3 &lt; ./out/openhealth.md\n</code></pre>\n<p>Ollama, llama.cpp, LM Studio, whatever you run. Your health data never leaves\nyour laptop. The output is just markdown — it doesn't care what reads it.</p>\n<p>That's why the shape is seven files and not an API. You pick what sees them.</p>\n<p>I'm not a doctor. Neither is the model. Use this for thinking out loud about\nyour own training, not diagnosing anything.</p>\n<p>MIT, source at\n<a href=\"https://github.com/jonnonz1/openhealth\">github.com/jonnonz1/openhealth</a>. Web\napp at <a href=\"https://openhealth-axd.pages.dev/\">openhealth-axd.pages.dev</a>. If you've\nbeen sitting on a 200MB <code>export.zip</code> with nothing that'll open it, have a go.</p>\n","date_published":"Mon, 13 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/llm-kills-compromised-services-at-3am/","url":"https://jonnonz.com/posts/llm-kills-compromised-services-at-3am/","title":"The Future of Security Is an Open-Source Model That Detects and Acts on Threats","content_html":"<p>Anthropic just dropped <a href=\"https://www.anthropic.com/glasswing\">Project Glasswing</a>\n— a big collaborative cybersecurity initiative with a shiny new model called\nClaude Mythos Preview that can find zero-day vulnerabilities at scale. Twelve\nmajor tech companies involved. $100M in credits. Found a 27-year-old flaw in\nOpenBSD. Impressive stuff.</p>\n<p>But let's be real about what's happening here. Anthropic trained a model so\ncapable at breaking into systems that they decided it was too dangerous to\nrelease publicly. So they wrapped the release in a collaborative security\ninitiative. The security work is genuinely valuable. But it's also a smart way\nto keep control of something they know is too powerful to let loose.</p>\n<p>The part that actually matters, though, is who benefits. Glasswing is for the\nbig players. The companies with security teams, budgets, and the kind of\ninfrastructure that gets invited to sit at the table with AWS, Microsoft, and\nPalo Alto Networks. What about the rest of us? The startups, the small SaaS\nshops, the indie developers running production systems on a shoestring?</p>\n<p>The internet is a\n<a href=\"https://bigthink.com/books/how-the-dark-forest-theory-helps-us-understand-the-internet/\">dark forest</a>.\nThat's not a metaphor anymore — it's becoming the literal reality. Bots,\nscrapers, automated exploit chains, credential stuffing, AI-generated phishing.\nA server goes up and within hours it's being scanned, fingerprinted, and probed\nby systems that don't sleep. Visibility equals vulnerability. And AI is making\nthe attackers faster, cheaper, and more autonomous every month.</p>\n<p>The\n<a href=\"https://www.isc2.org/insights/2026/04/ai-driven-defense-and-autonomous-attacks\">ISC2 put it plainly</a>\n— both offence and defence now operate at speeds beyond human intervention. The\nthreats aren't people sitting at keyboards anymore. 
They're autonomous systems\nrunning campaigns end-to-end.</p>\n<p>So what do we do about it?</p>\n<h2>Offensive security — but not the kind you're thinking</h2>\n<p>When I say offensive security, I don't mean red-teaming or penetration testing.\nI mean giving your systems the ability to fight back.</p>\n<p>Picture an LLM that sits across your centralised logs — network traffic,\ndatabase queries, user interactions, access patterns — and builds an\nunderstanding of what normal looks like for your system over weeks and months.\nNot just pattern matching against known signatures. Actually understanding the\nshape of healthy behaviour.</p>\n<p>When something breaks the pattern, it doesn't just alert. It acts.</p>\n<p>Disable a compromised account. Kill a service that's behaving strangely. Block a\ndatabase connection that shouldn't exist. Create an incident with full context\nfor a human to review. The response is proportional and immediate — not waiting\nfor someone to check their phone at 3am.</p>\n<p>The architecture is pretty straightforward:</p>\n<pre><code class=\"language-mermaid\">graph TD\n    A[Application Logs] --&gt; D[Secure Isolated Log Store]\n    B[Network Traffic] --&gt; D\n    C[Database Queries] --&gt; D\n    D --&gt; F[Baseline Health Model]\n    E[User Activity] --&gt; D\n    F --&gt;|Anomaly Detected| G[LLM Analysis]\n    G --&gt;|Analyse &amp; Plan| H{Threat Assessment}\n    H --&gt;|Low| I[Alert &amp; Log]\n    H --&gt;|Medium| J[Restrict &amp; Escalate]\n    H --&gt;|High| K[Disable &amp; Isolate]\n    I --&gt; L[Human Review]\n    J --&gt; L\n    K --&gt; L\n</code></pre>\n<p>The key is that the logging and analysis layer has to be isolated and secured\nseparately from the systems it's watching. If an attacker can compromise the\nthing that's watching them, the whole model falls apart.</p>\n<p>In practice that means separate infrastructure with its own auth boundary.\nIngestion is write-only — your application services push logs in but can never\nread or modify what's already there. Append-only, immutable. The analysis layer\ngets scoped service accounts that can read logs, fire alerts, and pull specific\nemergency levers through a narrow API. Nothing else. If a compromised service\ntries to reach the log store directly, it hits a wall.</p>\n<p>None of this is exotic. Centralised logging, immutable storage, scoped IAM — the\nbuilding blocks exist. The hard part is wiring an LLM into that loop with the\nright constraints. Enough access to act, not enough to make things worse.</p>\n<h2>Adaptive, not rule-based</h2>\n<p>Traditional security tooling runs on signatures and static rules. Known bad\npatterns, blocklists, threshold alerts. That worked when threats were mostly\nhuman-paced. It doesn't work when you're up against autonomous systems that\nadapt faster than you can write rules.</p>\n<p>The alternative is a system that learns what normal looks like for <em>your</em>\nenvironment — not a generic baseline, but the actual shape of healthy behaviour\nin your specific infrastructure. Traffic patterns, query frequencies, access\ntiming, user behaviour. Weeks of observation before it starts making decisions.</p>\n<p>When something breaks the pattern, the response is proportional. A sudden spike\nin unusual API calls might trigger deeper correlation — the system widens its\nsearch, pulls in more signals, lowers its threshold for flagging related\nactivity. Repeated failed auth attempts from new IPs tighten access controls\nautomatically. 
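</p>\n<p>The dispatch at the end of that loop doesn't need to be clever. A sketch of the shape in Python, with made-up action names:</p>\n<pre><code class=\"language-python\">ACTIONS = {\n    &quot;low&quot;:    [&quot;alert_and_log&quot;],\n    &quot;medium&quot;: [&quot;restrict_access&quot;, &quot;escalate&quot;],\n    &quot;high&quot;:   [&quot;disable_account&quot;, &quot;isolate_service&quot;],\n}\n\ndef respond(threat_level: str) -&gt; list[str]:\n    # proportional response; every path ends with a human-review incident\n    return ACTIONS[threat_level] + [&quot;open_incident&quot;]\n\nprint(respond(&quot;medium&quot;))  # ['restrict_access', 'escalate', 'open_incident']\n</code></pre>\n<p>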
A database connection that shouldn't exist gets killed.</p>\n<p>This isn't a static ruleset you configure once and hope covers everything. It's\na system that develops behavioural intuition from running in your environment,\nresponding to your traffic. The difference matters — static rules are brittle\nagainst novel attacks, while adaptive systems can catch anomalies they've never\nseen before.</p>\n<p>The baseline isn't magic. It's watching five things:</p>\n<ul>\n<li><strong>Rate</strong> — how many events per time window. A user who averages 50 API calls\nper hour suddenly making 500 is a signal.</li>\n<li><strong>Composition</strong> — what's in those events. The same user always hitting\n/api/users and /api/orders suddenly hammering /api/admin/export.</li>\n<li><strong>Cardinality</strong> — how many unique values. One IP hitting 3 endpoints is\nnormal. One IP cycling through 200 endpoints in an hour isn't.</li>\n<li><strong>Latency</strong> — how fast things happen. Legitimate users pause, think, navigate.\nBots don't.</li>\n<li><strong>Novelty</strong> — things the system has never seen. A new endpoint, a new\nparameter, a user agent string that doesn't match anything in the training\nwindow.</li>\n</ul>\n<p>Three layers of detection stack on top of each other. Layer one is simple\nthresholds — hard caps that trigger immediately. Layer two is statistical\ndeviation — standard deviations from the learned baseline. Layer three is\ncorrelation — looking across multiple signals simultaneously. A spike in rate\nalone might be fine. A spike in rate plus unusual composition plus new source\nIP? That's a pattern.</p>\n<h2>Learning to recognise yourself</h2>\n<p>A pure anomaly detector would go nuts during deploys. New code paths, changed\nresponse times, config reloads — all of it looks unusual. Same with cron jobs.\nYour 3am batch job that hits the database hard every night would trigger alerts\nevery night.</p>\n<p>Tolerance patterns solve this. The system learns to recognise you.</p>\n<p>Mark a deploy event, and the system creates a tolerance window — elevated\nthresholds for the next 30 minutes. Register a recurring cron job, and the\nsystem expects that exact spike at that exact time. These aren't exceptions you\nconfigure manually. They're patterns the system learns from watching.</p>\n<p>After a few weeks, it knows when your weekly cache warm-up runs, when your daily\nreports generate, when deploys happen. It stops bothering you about the things\nyou do on purpose.</p>\n<h2>The system gets cheaper over time</h2>\n<p>Calling an LLM for every anomaly would be expensive. The trick is building\nimmune memory.</p>\n<p>When the LLM analyses an anomaly and decides it's benign — say, a deploy spike\nor a legitimate traffic surge — that verdict gets stored. Next time the same\npattern appears, the system recognises it. No LLM call needed.</p>\n<p>This is how your security bill drops over the first few weeks. Early on,\neverything is novel. The LLM gets called constantly. A month in, most anomalies\nmatch patterns it's already seen. The LLM only gets called for genuinely new\nsituations.</p>\n<p>The more your system runs, the smarter it gets and the less it costs.</p>\n<h2>Setup without a PhD</h2>\n<p>The hardest part of any security tool is configuration. Getting thresholds\nright. Understanding your traffic patterns before you can tell the tool what's\nnormal.</p>\n<p><code>darkforest init</code> flips this. 
Point it at a log sample — a day's worth of\ntraffic, a week if you've got it — and Claude reads it. Not just parsing,\nactually understanding the shape of your system. It figures out what your\nendpoints are, what normal request rates look like, what user agents show up,\nwhere your traffic comes from geographically.</p>\n<p>Then it writes your config file for you.</p>\n<p>You review it, tweak anything that looks wrong, and you're running. No\nspreadsheets. No guesswork about what &quot;normal&quot; means for your specific stack.\nThe LLM that's going to watch your logs already understands them.</p>\n<h2>This has to be open</h2>\n<p>Glasswing is cool.\n<a href=\"https://github.com/aliasrobotics/CAI\">Open-source frameworks like CAI</a> are\nmaking progress — but mostly on the offensive side, using LLMs for penetration\ntesting and vulnerability research. On the defensive side, the tooling barely\nexists. There's no open-source equivalent for the kind of adaptive monitoring\nand response I'm describing here.</p>\n<p>The building blocks are around. Centralised logging is a solved problem. Open\nstandards for security event formats are maturing. Smaller open models are more\nthan capable of pattern analysis on local infrastructure. What's missing is the\nglue — a framework that takes logs in, builds a baseline, detects anomalies, and\ncan actually respond. Something a small team can deploy without a six-figure\nsecurity budget.</p>\n<p>The threats don't discriminate by company size. The defences shouldn't either.\nThis can't be proprietary or locked behind enterprise contracts.</p>\n<p>The dark forest doesn't care how big your company is. The bots scanning your\ninfrastructure don't check your headcount before they attack. If the threats are\ngoing to be this accessible, the defences need to be too.</p>\n<p>I'm building this. An open-source security agent — adaptive, autonomous, acts\nwhen something breaks the pattern. Small enough for a startup to run on their\nown infrastructure. Centralised logging, open LLMs, scoped response actions. The\npieces are all there. I'm wiring them together now.</p>\n<p>For v0.1, one real action working end-to-end: detect anomalous authentication\npatterns, call the LLM for analysis, and disable the compromised account via\nyour identity provider's API. Not just alerting — actually responding while\nyou're asleep. That's the proof of concept that matches the headline.</p>\n<p>I'm actively working on this and looking for early testers. If you want alpha\naccess when it's ready, or just want to follow along,\n<a href=\"https://jonnonz.com/posts/llm-kills-compromised-services-at-3am/#newsletter\">drop your email below</a>. I'll reach out when there's something to\ntry.</p>\n","date_published":"Sat, 11 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/","url":"https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/","title":"I Spent 29 Hours Debugging iptables to Boot VMs in 4 Seconds","content_html":"<p>The first time I got a Firecracker VM to boot and respond to a vsock ping from\nthe host, I sat there grinning like an idiot. Typed a command on my machine, it\nreached through a kernel-level socket into a completely separate Linux system\nwith its own kernel, and got a reply. Under a second.</p>\n<p>That was about 30 hours into the project. 
The previous 29 were mostly fighting\nwith rootfs images and iptables rules.</p>\n<p><a href=\"https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/\">Part 1</a>\ncovered why I built this — Firecracker MicroVMs for running Claude Code in\nfull-permission isolation. This post is the actual build. Rootfs, networking,\nthe guest agent, and the streaming pipeline.</p>\n<h2>Building the rootfs</h2>\n<p>A Firecracker VM needs two things: an uncompressed Linux kernel (<code>vmlinux</code>, not\n<code>bzImage</code> — there's no bootloader) and an ext4 filesystem image to use as the\nroot disk.</p>\n<p>The kernel is straightforward — grab a prebuilt 6.1 LTS vmlinux. The rootfs took\nmore work.</p>\n<p>It's a standard ext4 image with Debian Bookworm, and it needs everything Claude\nCode might want: Node.js 24, Python 3.11, Chromium for browser automation, git,\ncurl, jq, and the full Claude Code CLI installed globally via npm. The image\nends up at about 4GB.</p>\n<p>The guest agent — the Go binary that listens for commands from the host — lives\ninside the rootfs as a systemd service:</p>\n<pre><code class=\"language-bash\">sudo mount /opt/firecracker/rootfs/base-rootfs.ext4 /mnt\nsudo cp bin/agent /mnt/usr/local/bin/agent\nsudo chmod +x /mnt/usr/local/bin/agent\n\nsudo tee /mnt/etc/systemd/system/agent.service &lt;&lt;'EOF'\n[Unit]\nDescription=Orchestrator Guest Agent\nAfter=network.target\n\n[Service]\nType=simple\nExecStart=/usr/local/bin/agent\nRestart=always\nRestartSec=1\n\n[Install]\nWantedBy=multi-user.target\nEOF\n\nsudo chroot /mnt systemctl enable agent.service\nsudo umount /mnt\n</code></pre>\n<p>That <code>RestartSec=1</code> matters. If the agent crashes for any reason, systemd has it\nback up in a second. The orchestrator polls vsock every 500ms waiting for the\nagent, so even a crash during boot is barely noticeable.</p>\n<p>You build this rootfs once, by hand. Every new VM gets a sparse copy of it.</p>\n<h2>VM lifecycle</h2>\n<p><code>internal/vm/manager.go</code> handles the whole lifecycle. It's sequential with\ncleanup at each step — if anything fails, it tears down what it already set up\nand returns the error.</p>\n<pre><code class=\"language-mermaid\">flowchart TD\n    A[&quot;Copy rootfs (sparse)&quot;] --&gt; B[&quot;Mount &amp; inject network config&quot;]\n    B --&gt; C[&quot;Create TAP device&quot;]\n    C --&gt; D[&quot;Add iptables rules&quot;]\n    D --&gt; E[&quot;Setup jailer chroot&quot;]\n    E --&gt; F[&quot;Write Firecracker config JSON&quot;]\n    F --&gt; G[&quot;Launch via jailer --daemonize&quot;]\n    G --&gt; H[&quot;Find PID, save metadata&quot;]\n    H --&gt; I[&quot;VM ready — poll vsock&quot;]\n</code></pre>\n<p>The sparse copy is the first thing that happens:</p>\n<pre><code class=\"language-go\">cmd := exec.Command(&quot;cp&quot;, &quot;--sparse=always&quot;, BaseRootfs, vm.RootfsPath)\n</code></pre>\n<p><code>--sparse=always</code> means zero blocks aren't allocated on disk. A 4GB image might\nonly use 2GB of actual disk space. Takes under a second on NVMe.</p>\n<p>After copying, the rootfs gets mounted and three files are injected: a\nsystemd-networkd config with a static IP, <code>/etc/resolv.conf</code> for DNS, and\n<code>/etc/hostname</code>. Then it's unmounted and copied again into the jailer chroot.</p>\n<p>Yeah, that's two copies of the rootfs per VM. The first for network injection,\nthe second because the jailer expects everything inside its chroot. 
I could\ncollapse this into one copy by injecting the network config directly into the\nchroot copy, but it's never been a bottleneck — sparse copy of 4GB takes less\ntime than Firecracker takes to boot. So I left it.</p>\n<h2>The jailer</h2>\n<p>Firecracker's jailer is a separate binary that creates a chroot, sets up minimal\n<code>/dev</code> entries (kvm, net/tun, urandom), and runs the Firecracker process inside\nit. The VM config is a JSON file:</p>\n<pre><code class=\"language-go\">vmConfig := map[string]interface{}{\n    &quot;boot-source&quot;: map[string]interface{}{\n        &quot;kernel_image_path&quot;: &quot;/vmlinux&quot;,\n        &quot;boot_args&quot;:         &quot;console=ttyS0 reboot=k panic=1 pci=off init=/sbin/init&quot;,\n    },\n    &quot;drives&quot;: []map[string]interface{}{{\n        &quot;drive_id&quot;:       &quot;rootfs&quot;,\n        &quot;path_on_host&quot;:   &quot;/rootfs.ext4&quot;,\n        &quot;is_root_device&quot;: true,\n        &quot;is_read_only&quot;:   false,\n    }},\n    &quot;machine-config&quot;: map[string]interface{}{\n        &quot;vcpu_count&quot;:  vm.VCPUs,\n        &quot;mem_size_mib&quot;: vm.RamMB,\n    },\n    &quot;network-interfaces&quot;: []map[string]interface{}{{\n        &quot;iface_id&quot;:      &quot;eth0&quot;,\n        &quot;guest_mac&quot;:     &quot;06:00:AC:10:00:02&quot;,\n        &quot;host_dev_name&quot;: netCfg.TapDev,\n    }},\n    &quot;vsock&quot;: map[string]interface{}{\n        &quot;guest_cid&quot;: vm.VsockCID,\n        &quot;uds_path&quot;:  &quot;/vsock.sock&quot;,\n    },\n}\n</code></pre>\n<p><code>pci=off</code> because Firecracker doesn't emulate PCI. Paths are relative to the\njailer chroot. The vsock entry creates a Unix domain socket at <code>/vsock.sock</code>\ninside the chroot — that's how the host talks to the guest.</p>\n<p>Launch looks like this:</p>\n<pre><code class=\"language-go\">cmd := exec.Command(JailerBin,\n    &quot;--id&quot;, vm.JailID,\n    &quot;--exec-file&quot;, FCBin,\n    &quot;--uid&quot;, &quot;0&quot;, &quot;--gid&quot;, &quot;0&quot;,\n    &quot;--cgroup-version&quot;, &quot;2&quot;,\n    &quot;--daemonize&quot;,\n    &quot;--&quot;,\n    &quot;--config-file&quot;, &quot;/vm-config.json&quot;,\n)\ncmd.Run()\n</code></pre>\n<p>After launch there's a 2-second sleep — Firecracker needs a moment to start —\nthen the PID is found via <code>pgrep</code> and saved to a metadata file. If the\norchestrator restarts, it reads these metadata files and picks up where it left\noff. VMs survive orchestrator crashes.</p>\n<h2>Networking</h2>\n<p>This is where I burned the most time. Not because the concepts are hard, but\nbecause of one specific bug that had me questioning reality.</p>\n<p>Each VM needs internet access for Claude Code to fetch packages, clone repos,\nand hit the Anthropic API. The approach: each VM gets a Linux TAP device on the\nhost, a dedicated <code>/24</code> subnet, and iptables rules for NAT.</p>\n<h3>IP allocation</h3>\n<p>Subnets are deterministic, derived from the VM name using FNV-1a hashing:</p>\n<pre><code class=\"language-go\">func NetSlot(name string) int {\n    h := fnv.New32a()\n    h.Write([]byte(name))\n    return int(h.Sum32()%253) + 1\n}\n</code></pre>\n<p>VM named <code>task-a3bfca80</code> might hash to slot 61, giving it subnet\n<code>172.16.61.0/24</code>, guest IP <code>172.16.61.2</code>, TAP IP <code>172.16.61.1</code>. No coordination\nneeded, no DHCP server, no IP pool to manage. 
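</p>\n<p>The same derivation in a few lines of Python, if you want to see where a name lands. An illustrative port of the Go above:</p>\n<pre><code class=\"language-python\">def net_slot(name: str) -&gt; int:\n    h = 2166136261  # FNV-1a 32-bit offset basis\n    for b in name.encode():\n        h = ((h ^ b) * 16777619) &amp; 0xFFFFFFFF  # xor byte, multiply by FNV prime\n    return h % 253 + 1\n\nslot = net_slot(&quot;task-a3bfca80&quot;)\nprint(f&quot;subnet 172.16.{slot}.0/24, guest 172.16.{slot}.2, tap 172.16.{slot}.1&quot;)\n</code></pre>\n<p>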
The collision space is 253 slots —\nmore than enough for 12-13 concurrent VMs.</p>\n<h3>TAP devices</h3>\n<p>A TAP device is a virtual ethernet interface. Firecracker attaches the guest's\n<code>eth0</code> to it.</p>\n<pre><code class=\"language-go\">tap := &amp;netlink.Tuntap{\n    LinkAttrs: netlink.LinkAttrs{Name: cfg.TapDev},\n    Mode:      netlink.TUNTAP_MODE_TAP,\n}\nnetlink.LinkAdd(tap)\naddr, _ := netlink.ParseAddr(cfg.TapIP + &quot;/24&quot;)\nlink, _ := netlink.LinkByName(cfg.TapDev)\nnetlink.AddrAdd(link, addr)\nnetlink.LinkSetUp(link)\n</code></pre>\n<p>TAP names are <code>fc-&lt;vm-name&gt;</code>, truncated to 15 characters because Linux interface\nnames can't be longer. A fun constraint to discover at runtime.</p>\n<h3>The iptables rules</h3>\n<p>Three rules per VM:</p>\n<pre><code class=\"language-go\">// NAT — rewrite source IP when traffic exits the host\nipt.AppendUnique(&quot;nat&quot;, &quot;POSTROUTING&quot;,\n    &quot;-s&quot;, cfg.Subnet, &quot;-o&quot;, cfg.HostIface, &quot;-j&quot;, &quot;MASQUERADE&quot;)\n\n// FORWARD — allow outbound from TAP\nipt.Insert(&quot;filter&quot;, &quot;FORWARD&quot;, 1,\n    &quot;-i&quot;, cfg.TapDev, &quot;-o&quot;, cfg.HostIface, &quot;-j&quot;, &quot;ACCEPT&quot;)\n\n// FORWARD — allow established/related inbound\nipt.Insert(&quot;filter&quot;, &quot;FORWARD&quot;, 1,\n    &quot;-i&quot;, cfg.HostIface, &quot;-o&quot;, cfg.TapDev,\n    &quot;-m&quot;, &quot;state&quot;, &quot;--state&quot;, &quot;RELATED,ESTABLISHED&quot;, &quot;-j&quot;, &quot;ACCEPT&quot;)\n</code></pre>\n<p>See those <code>Insert</code> calls with position 1? That's the bug fix.</p>\n<h3>The UFW bug</h3>\n<p>I originally used <code>Append</code> for the FORWARD rules. Traffic from the VM would\nleave the host fine (NAT worked), but return traffic got dropped. The VM could\nresolve DNS but couldn't complete TCP handshakes. I spent an embarrassing amount\nof time staring at <code>tcpdump</code> output before I figured it out.</p>\n<p>Ubuntu's UFW adds a blanket <code>DROP</code> rule to the FORWARD chain. If you append your\nACCEPT rules, they land <em>after</em> UFW's DROP. They never match. The packets hit\nthe DROP rule first and get silently killed.</p>\n<p><code>Insert</code> at position 1 puts the rules before UFW's. Return traffic flows, VMs\nget internet access, everything works.</p>\n<p>The traffic path through a working VM:</p>\n<pre><code>Guest (172.16.61.2) → eth0 → TAP (fc-task-xxx) → FORWARD ACCEPT\n→ NAT MASQUERADE (rewrite src to host IP) → host interface → internet\n→ response → RELATED,ESTABLISHED → TAP → guest eth0\n</code></pre>\n<p>VMs can't reach each other. Each TAP device is point-to-point on its own <code>/24</code>.\nThere's no route between subnets.</p>\n<h2>The guest agent</h2>\n<p><code>cmd/agent/main.go</code> — 420 lines of Go. It's a static binary that starts on boot,\nlistens on vsock port 9001, and handles five request types: ping, exec,\nwrite_files, read_file, and signal.</p>\n<p>The interesting one is streaming exec.</p>\n<p>When the orchestrator wants to run Claude Code, it sends an exec request with\n<code>stream: true</code>. The agent spawns the command, reads stdout and stderr line by\nline, and sends each line back as a framed event over the vsock connection. When\nthe process exits, it sends an exit event with the exit code.</p>\n<p>Sounds straightforward. 
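</p>\n<p>The naive version of that loop fits in a dozen lines. A Python sketch of the idea (the real agent is Go, and the frame fields here are invented):</p>\n<pre><code class=\"language-python\">import json\nimport subprocess\n\ndef stream_exec(cmd: list[str], send) -&gt; None:\n    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)\n    for line in proc.stdout:  # one framed event per output line\n        send(json.dumps({&quot;type&quot;: &quot;stdout&quot;, &quot;data&quot;: line.rstrip(&quot;\\n&quot;)}))\n    send(json.dumps({&quot;type&quot;: &quot;exit&quot;, &quot;code&quot;: proc.wait()}))\n\nstream_exec([&quot;echo&quot;, &quot;hello&quot;], print)\n</code></pre>\n<p>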
The tricky part is background processes.</p>\n<p>Claude Code can start things that outlive the main command — dev servers, file\nwatchers, whatever it decides it needs. These child processes inherit the\nstdout/stderr pipes. If the agent waits for the pipes to close (the normal\napproach), it hangs forever because the children are still holding them open.</p>\n<p>The fix has three parts:</p>\n<pre><code class=\"language-go\">// 1. Process group isolation\ncmd.SysProcAttr = &amp;syscall.SysProcAttr{Setpgid: true}\n\n// 2. Wait for the main process, not the pipes\n&lt;-waitDone\n\n// 3. Kill the entire process group\npgid, _ := syscall.Getpgid(cmd.Process.Pid)\nsyscall.Kill(-pgid, syscall.SIGTERM)\ntime.Sleep(500 * time.Millisecond)\nsyscall.Kill(-pgid, syscall.SIGKILL)\n</code></pre>\n<p><code>Setpgid: true</code> puts the command in its own process group. When the main process\nexits, kill the group (<code>-pgid</code> means &quot;everything in this group&quot;). SIGTERM first,\nwait half a second, then SIGKILL for anything that didn't listen.</p>\n<p>Even after killing the group, there's a 3-second timeout waiting for the\npipe-reading goroutines to drain. If they're still stuck after that, move on and\nsend the exit event anyway. Can't let a hung pipe block the entire task.</p>\n<p>The line-by-line reader uses a 256KB buffer because Claude Code's\n<code>--output-format stream-json</code> can produce enormous single lines — tool results\nthat include the full contents of files it read.</p>\n<h2>Credential injection</h2>\n<p>Before Claude Code runs, the orchestrator writes five things into the VM via\nvsock:</p>\n<p>OAuth credentials from the host's <code>~/.claude/.credentials.json</code> (mode 0600). A\nsettings file that allows all tools. An environment script that sets\n<code>CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS=true</code>. Task metadata. And a marker file to\ncreate the output directory.</p>\n<p>The prompt itself gets written to a temp file inside the VM to avoid shell\nescaping nightmares, then referenced in the command:</p>\n<pre><code class=\"language-go\">claudeArgs := fmt.Sprintf(\n    &quot;claude -p \\&quot;$(cat %s)\\&quot; --output-format stream-json --verbose&quot;,\n    promptFile,\n)\ncmd := []string{&quot;bash&quot;, &quot;-c&quot;,\n    &quot;source /etc/profile.d/claude.sh &amp;&amp; &quot; + claudeArgs}\n</code></pre>\n<p>When the VM is destroyed, the rootfs — containing the credentials — is deleted.\nCredentials only exist for the lifetime of the task.</p>\n<h2>Collecting results</h2>\n<p>After Claude Code finishes, the orchestrator searches for files it created:</p>\n<pre><code class=\"language-go\">// Anything in the output directory\nvsock.Exec(jailID, []string{&quot;find&quot;, outputDir, &quot;-type&quot;, &quot;f&quot;, &quot;-not&quot;, &quot;-name&quot;, &quot;.keep&quot;}, nil, &quot;/root&quot;)\n\n// Any new files under /root, created after the prompt was written\nvsock.Exec(jailID, []string{&quot;find&quot;, &quot;/root&quot;, &quot;-maxdepth&quot;, &quot;2&quot;, &quot;-type&quot;, &quot;f&quot;,\n    &quot;-newer&quot;, &quot;/tmp/claude-prompt.txt&quot;}, nil, &quot;/root&quot;)\n</code></pre>\n<p>Each file gets downloaded via <code>vsock.ReadFile</code> and saved to\n<code>/opt/firecracker/results/&lt;task-id&gt;/</code>. The runner also scans the accumulated\noutput for Claude's <code>total_cost_usd</code> field to record what the task cost in API\ncredits.</p>\n<p>Then the VM is destroyed. 
Firecracker process killed, TAP device removed,\niptables rules deleted, jailer chroot deleted, VM state directory deleted. Clean\nslate.</p>\n<p>The whole cycle — boot, inject, run, collect, destroy — typically takes 30-120\nseconds depending on how complex the prompt is. The 4-second boot and ~1-second\nteardown are rounding errors compared to the time Claude actually spends\nthinking.</p>\n<p><a href=\"https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/\">Part 3</a>\ngets into the fun stuff — the MCP server that lets Claude delegate tasks to\nitself, the streaming architecture, the web dashboard, and what productionising\nthis would actually look like.</p>\n","date_published":"Fri, 10 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/can-you-beat-last-month/","url":"https://jonnonz.com/posts/can-you-beat-last-month/","title":"Can You Beat Last Month?","content_html":"<p>Every machine learning project needs a reality check.</p>\n<p>It's tempting to jump straight to the neural network. That's the exciting bit,\nright? But if you don't establish what a dead-simple model can do first, you've\ngot no idea whether your fancy architecture is actually learning anything useful\nor just being expensive.</p>\n<p>So before ConvLSTM gets anywhere near this data, we're going to throw three\ngloriously simple baselines at it and see how they do.</p>\n<h2>Persistence: next month equals this month</h2>\n<p>The dumbest possible model. To predict April, just use March's values. Every\ncell, every crime type. Carbon copy.</p>\n<p>It sounds ridiculous, but it works surprisingly well when patterns are stable.\nAnd as we saw in the EDA, Auckland's crime hotspots are remarkably persistent.\nThe CBD doesn't suddenly go quiet. South Auckland doesn't randomly calm down.</p>\n<p>On the six-month test set (August 2025 – January 2026):</p>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>MAE</th>\n<th>RMSE</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>1.42</td>\n<td>3.18</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>0.38</td>\n<td>0.91</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>0.22</td>\n<td>0.64</td>\n</tr>\n<tr>\n<td>Robbery</td>\n<td>0.04</td>\n<td>0.15</td>\n</tr>\n<tr>\n<td>Sexual</td>\n<td>0.03</td>\n<td>0.12</td>\n</tr>\n<tr>\n<td>Harm</td>\n<td>0.01</td>\n<td>0.04</td>\n</tr>\n</tbody>\n</table>\n<p>Those MAE numbers for theft and burglary look small until you remember that most\ncells are zero. For the active cells (the ones we actually care about) the error\nis larger. A busy CBD cell might have 35 thefts in one month and 28 the next.\nPersistence would be off by 7 there, which is a 20% miss on an important\nprediction.</p>\n<h2>Seasonal naive: same month last year</h2>\n<p>Instead of copying last month, copy the same month from the previous year.\nJanuary 2026 gets predicted from January 2025. This should capture seasonal\npatterns: the summer spike, the February dip.</p>\n<p>The catch? We only have four years of data. The test set months (August–January)\neach have at most three prior examples of the same month. 
That's not a lot of\nseasonal training data.</p>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>MAE</th>\n<th>RMSE</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>1.51</td>\n<td>3.42</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>0.41</td>\n<td>0.97</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>0.24</td>\n<td>0.68</td>\n</tr>\n<tr>\n<td>Robbery</td>\n<td>0.05</td>\n<td>0.17</td>\n</tr>\n<tr>\n<td>Sexual</td>\n<td>0.04</td>\n<td>0.13</td>\n</tr>\n<tr>\n<td>Harm</td>\n<td>0.01</td>\n<td>0.04</td>\n</tr>\n</tbody>\n</table>\n<p>Slightly worse than persistence across the board. That surprised me initially.\nShouldn't capturing seasonality help?</p>\n<p>The issue is that the 2023-to-2025 decline we spotted in the EDA bites hard\nhere. If you predict January 2026 from January 2025, you're using data from a\nperiod when crime was higher. The seasonal pattern is real, but the\nyear-over-year trend works against it. With more years of data, seasonal naive\nwould likely pull ahead.</p>\n<h2>Historical average: the mean of all training months</h2>\n<p>For each cell and crime type, take the average across all 36 training months.\nThis smooths out month-to-month noise and gives you a &quot;typical&quot; value for each\nlocation.</p>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>MAE</th>\n<th>RMSE</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>1.28</td>\n<td>2.95</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>0.35</td>\n<td>0.84</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>0.20</td>\n<td>0.58</td>\n</tr>\n<tr>\n<td>Robbery</td>\n<td>0.04</td>\n<td>0.14</td>\n</tr>\n<tr>\n<td>Sexual</td>\n<td>0.03</td>\n<td>0.11</td>\n</tr>\n<tr>\n<td>Harm</td>\n<td>0.01</td>\n<td>0.04</td>\n</tr>\n</tbody>\n</table>\n<p>The best baseline. By averaging over three years, it smooths out the\nmonth-to-month noise and the year-over-year trend simultaneously. It won't\ncapture seasonal peaks or sudden changes, but for the &quot;typical month&quot; prediction\nit's solid.</p>\n<h2>Why MAPE breaks down</h2>\n<p>You might wonder why I'm not reporting MAPE (Mean Absolute Percentage Error).\nIt's the standard metric in a lot of forecasting work. The reason: sparse data.</p>\n<p>MAPE divides the error by the actual value. When the actual value is zero (which\nit is for 91.7% of our tensor) you get division by zero. Even for cells with\nsmall counts (1 or 2 crimes), a prediction of 0 gives you 100% MAPE while a\nprediction of 2 gives you 0–100%. The metric becomes wildly unstable.</p>\n<p>MAE and RMSE are more honest here. They tell you the absolute magnitude of your\nerrors in actual crime counts, which is what we care about. A miss of 3\nvictimisations means the same thing whether the cell usually has 5 or 50.</p>\n<h2>The bar to clear</h2>\n<p>Here's the scoreboard going forward. Any deep learning model needs to beat the\nhistorical average baseline to justify its existence:</p>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>Historical Avg MAE</th>\n<th>Historical Avg RMSE</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>1.28</td>\n<td>2.95</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>0.35</td>\n<td>0.84</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>0.20</td>\n<td>0.58</td>\n</tr>\n<tr>\n<td>All types</td>\n<td>0.39</td>\n<td>0.95</td>\n</tr>\n</tbody>\n</table>\n<p>Theft is the easiest to beat because there's the most signal: high counts, clear\nspatial patterns, strong seasonality. Robbery, sexual offences, and harm are\nessentially noise at this resolution. 
The models will probably predict near-zero\nfor those types and be mostly correct.</p>\n<p>The real test will be the middle ground. Can ConvLSTM or ST-ResNet predict the\n<em>changes</em> in theft and burglary better than a static average? Can they catch the\nmonths where a cell spikes or dips? That's where simple baselines fall flat,\nbecause they don't model dynamics at all.</p>\n<p>If the deep learning can't meaningfully beat &quot;just use the average,&quot; then it's\nnot worth the CPU cycles. Or in my case, the many hours of a Ryzen 5 grinding\naway.</p>\n","date_published":"Thu, 09 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/","url":"https://jonnonz.com/posts/claude-code-running-claude-code-in-4-second-disposable-vms/","title":"Claude Code Running Claude Code in 4-Second Disposable VMs","content_html":"<p>Running Claude Code with full permissions inside a Docker container is a\nterrible idea. I did it anyway for about a week, then built something better.</p>\n<p>Anthropic has an internal platform — people have been calling it\n<a href=\"https://ai.gopubby.com/anthropics-antspace-the-secret-paas-nobody-was-supposed-to-find-a79ce1e02151\">Antspace</a>\nsince it got reverse-engineered from the Claude Code source — that runs AI\ncoding tasks in isolated environments. It's part of a vertical stack they're\nbuilding internally: intent goes in, code comes out, and the agent never touches\nthe host machine.</p>\n<p>I wanted that. Not the whole platform-as-a-service thing, just the core idea:\ngive Claude Code a prompt, let it run with zero permission restrictions, stream\nthe output back, grab any files it created, and destroy everything when it's\ndone. On a single Linux box sitting in my office.</p>\n<p>The result is about 3,200 lines of Go and 860 lines of TypeScript. It boots a\nfresh Linux VM in ~4 seconds, runs Claude Code inside it, and tears it down when\nthe task finishes. Three ways to use it: a CLI, a REST API with a web dashboard,\nand an MCP server so Claude Code on other machines can delegate tasks to it.</p>\n<p>This first post is about why I built it this way.\n<a href=\"https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/\">Part 2</a> and\n<a href=\"https://jonnonz.com/posts/claude-code-can-now-spawn-copies-of-itself-in-isolated-vms/\">Part 3</a> get\ninto the actual implementation.</p>\n<h2>The container problem</h2>\n<p><code>CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS=true</code> — that's the environment variable\nthat tells Claude Code to stop asking before it runs shell commands or writes\nfiles. It just does whatever it thinks it needs to. For autonomous tasks, you\nneed this. Claude can't ask for confirmation when there's nobody watching.</p>\n<p>The question is where you let it run.</p>\n<p>Docker is the obvious first thought. Fast startup, everyone knows it, easy to\norchestrate. But containers share the host kernel. Every container on the\nmachine issues syscalls to the same Linux kernel, and a kernel vulnerability is\na vulnerability in every container on the host.\n<a href=\"https://huggingface.co/blog/agentbox-master/firecracker-vs-docker-tech-boundary\">The isolation boundary is the container runtime</a>,\nnot hardware — and that surface area is big.</p>\n<p>For most workloads this is fine. Running a web server in Docker? No worries. But\nrunning an AI agent that can execute arbitrary shell commands with root-level\npermissions? That's a different threat model. 
A container escape gives you the\nhost. And you've just given the thing inside the container permission to try\nanything.</p>\n<p>Anthropic's own approach to\n<a href=\"https://www.anthropic.com/engineering/claude-code-sandboxing\">sandboxing Claude Code</a>\nuses OS-level primitives — bubblewrap on Linux, Seatbelt on macOS — for\nfilesystem and network isolation. They report an 84% reduction in permission\nprompts internally. That's smart for the normal use case where Claude is helping\nyou write code in your own project. But I wanted something more aggressive: full\nisolation where even a kernel exploit can't reach the host.</p>\n<h2>Why Firecracker</h2>\n<p><a href=\"https://firecracker-microvm.github.io/\">Firecracker</a> is what AWS built for\nLambda and Fargate. Each MicroVM is a real KVM-backed virtual machine with its\nown guest kernel, its own memory space, and hardware-enforced isolation via\nIntel VT-x or AMD-V. The attack surface is the KVM hypervisor — which the kernel\nteam at AWS has spent years minimising.</p>\n<p>The trade-off is boot time. Containers start in under a second. Firecracker VMs\ntake about 4 seconds on my hardware once you account for the guest kernel boot,\nsystemd init, and the agent process starting up. For tasks that typically run\n20-120 seconds, 4 seconds of overhead is nothing.</p>\n<p>Each VM also copies a 4GB rootfs image. Sparse copies make this fast (&lt;1\nsecond), but it does use disk. On a machine with a 1TB NVMe, I'm not losing\nsleep over it.</p>\n<p>The hardware is an AMD Ryzen 5 5600GT with 30GB of RAM. Nothing exotic. About\n$400 worth of parts sitting under my desk. Each VM gets 2GB of RAM by default,\nso I can run roughly 12-13 VMs concurrently before the host runs out of memory.</p>\n<h2>Talking to a VM without a network</h2>\n<p>This was my favourite bit to figure out.</p>\n<p>The obvious way to communicate with a process inside a VM is SSH. Set up keys,\nopen a port, connect over the network. But SSH means key management, an open\nnetwork port inside the VM, and another service to configure. If the guest's\nnetwork breaks during a task, you've lost your control channel.</p>\n<p><a href=\"https://github.com/firecracker-microvm/firecracker/blob/main/docs/vsock.md\">vsock</a>\n(AF_VSOCK, address family 40) is a kernel-level host-guest communication\nchannel. It doesn't touch the network stack. No IP addresses, no ports, no keys.\nFirecracker exposes the guest's vsock as a Unix domain socket on the host side —\nyou connect to the socket, send <code>CONNECT &lt;port&gt;\\n</code>, and you're talking directly\nto a process inside the VM.</p>\n<pre><code class=\"language-go\">func Connect(jailID string, port int) (net.Conn, error) {\n    socketPath := fmt.Sprintf(&quot;/srv/jailer/firecracker/%s/root/vsock.sock&quot;, jailID)\n    conn, _ := net.Dial(&quot;unix&quot;, socketPath)\n    conn.Write([]byte(fmt.Sprintf(&quot;CONNECT %d\\n&quot;, port)))\n    // Read &quot;OK &lt;port&gt;&quot; response\n    return conn, nil\n}\n</code></pre>\n<p>On the guest side, Go's standard library doesn't support AF_VSOCK — address\nfamily 40 doesn't exist in the <code>net</code> package. 
So the guest agent uses raw\nsyscalls:</p>\n<pre><code class=\"language-go\">fd, _ := syscall.Socket(40, syscall.SOCK_STREAM, 0)  // AF_VSOCK = 40\n// Manually construct struct sockaddr_vm (16 bytes)\nsa := [16]byte{}\n*(*uint16)(unsafe.Pointer(&amp;sa[0])) = 40          // family\n*(*uint32)(unsafe.Pointer(&amp;sa[4])) = uint32(port) // port (9001)\n*(*uint32)(unsafe.Pointer(&amp;sa[8])) = 0xFFFFFFFF   // VMADDR_CID_ANY\nsyscall.RawSyscall(syscall.SYS_BIND, uintptr(fd), uintptr(unsafe.Pointer(&amp;sa[0])), 16)\nsyscall.RawSyscall(syscall.SYS_LISTEN, uintptr(fd), 5, 0)\n</code></pre>\n<p>Yeah, that's <code>unsafe.Pointer</code> and manual struct layout. Not the prettiest Go\nyou'll ever write. But it works, it's fast, and the whole vsock layer is about\n160 lines shared between both binaries.</p>\n<p>The wire protocol is dead simple — length-prefixed JSON frames:</p>\n<pre><code class=\"language-go\">func WriteFrame(w io.Writer, v interface{}) error {\n    data, _ := json.Marshal(v)\n    binary.Write(w, binary.BigEndian, uint32(len(data)))\n    w.Write(data)\n    return nil\n}\n</code></pre>\n<p>Each operation (ping, exec, write files, read file) opens a new connection,\nsends one request, reads the response, and closes. Connection-per-request. Not\nfancy, but vsock connections are local and effectively instant, so there's no\nreason to complicate things with multiplexing.</p>\n<h2>The shape of the thing</h2>\n<p>The whole system is two Go binaries — the orchestrator (runs on the host) and\nthe agent (runs inside each VM).</p>\n<pre><code class=\"language-mermaid\">graph TD\n    subgraph &quot;Host — orchestrator binary&quot;\n        API[&quot;REST API + WebSocket :8080&quot;]\n        MCP[&quot;MCP Server :8081&quot;]\n        VM[&quot;VM Manager&quot;]\n        NET[&quot;TAP + iptables&quot;]\n        TASK[&quot;Task Runner&quot;]\n        STREAM[&quot;Pub/Sub Hub&quot;]\n        VSOCK[&quot;vsock Client&quot;]\n    end\n\n    subgraph &quot;Guest — agent binary&quot;\n        AGENT[&quot;Guest Agent vsock:9001&quot;]\n        CLAUDE[&quot;Claude Code&quot;]\n    end\n\n    API --&gt; TASK\n    MCP --&gt; TASK\n    TASK --&gt; VM\n    VM --&gt; NET\n    TASK --&gt; VSOCK\n    VSOCK --&gt; AGENT\n    AGENT --&gt; CLAUDE\n    TASK --&gt; STREAM\n    STREAM --&gt; API\n</code></pre>\n<p>The orchestrator is a single 14MB binary with the React dashboard embedded via\n<code>//go:embed</code>. Copy it to a server, run it with sudo, done. Seven Go dependencies\ntotal — chi for routing, netlink for TAP devices, go-iptables for firewall\nrules, <a href=\"https://github.com/mark3labs/mcp-go\">mcp-go</a> for the MCP protocol, and a\nfew others.</p>\n<p>The agent is a 2.5MB static binary compiled with <code>CGO_ENABLED=0</code>. It ships\ninside the VM's rootfs and starts via systemd on boot. Within about a second of\nthe VM coming up, the agent is listening on vsock port 9001 and ready to accept\ncommands.</p>\n<p>They share exactly one file — <code>internal/agent/protocol.go</code> — which defines the\nwire protocol types and framing functions. Everything else is independent.</p>\n<h2>What a task looks like</h2>\n<p>You give it a prompt. 
It does the rest.</p>\n<ol>\n<li>Generate a task ID and VM name</li>\n<li>Copy the base rootfs image (sparse, &lt;1 second)</li>\n<li>Inject network config into the rootfs</li>\n<li>Create a TAP device and iptables rules for internet access</li>\n<li>Launch Firecracker via the jailer</li>\n<li>Poll vsock until the agent responds (~1 second)</li>\n<li>Inject credentials and files via vsock</li>\n<li>Run Claude Code with streaming output</li>\n<li>Collect any files Claude created</li>\n<li>Destroy the VM</li>\n</ol>\n<p>From the CLI it looks like this:</p>\n<pre><code class=\"language-bash\">sudo ./bin/orchestrator task run \\\n    --prompt &quot;Write a Python script that generates Fibonacci numbers&quot; \\\n    --ram 2048 \\\n    --vcpus 2 \\\n    --timeout 120\n</code></pre>\n<p>Output streams to your terminal in real time. When it's done:</p>\n<pre><code>=== Task Complete ===\nID:     a3bfca80\nStatus: completed\nExit:   0\nCost:   $0.0582\nFiles:  [fibonacci.py]\n</code></pre>\n<p>The VM is gone. The rootfs is deleted. The TAP device and iptables rules are\ncleaned up. All that's left is the result files in\n<code>/opt/firecracker/results/a3bfca80/</code>.</p>\n<p>Or you use the MCP server, and Claude Code on your laptop delegates the task to\na VM on the box under your desk. Claude spawning Claude. That bit is properly\ncool, and I'll get into it in Part 3.</p>\n<h2>Why Go</h2>\n<p>Quick aside on this because people always ask.</p>\n<p>Go produces static binaries. The agent needs to be a single file with zero\ndependencies that runs inside a minimal Debian guest — <code>CGO_ENABLED=0</code> makes\nthis trivial. The orchestrator needs to manage concurrent VMs, and goroutines\nare a natural fit for that. Syscall support is first-class, which matters when\nyou're doing raw vsock operations. And it compiles in about 2 seconds, which is\nnice when you're iterating.</p>\n<pre><code class=\"language-makefile\">build-agent:\n\tCGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o bin/agent -ldflags=&quot;-s -w&quot; ./cmd/agent\n</code></pre>\n<p>That <code>-ldflags=&quot;-s -w&quot;</code> strips debug info and DWARF tables, dropping the agent\nbinary from ~3.5MB to ~2.5MB. Every byte counts when you're baking it into a\nrootfs that gets copied for every VM.</p>\n<p><a href=\"https://jonnonz.com/posts/29-hours-debugging-iptables-to-boot-vms-in-4-seconds/\">Part 2</a> gets into\nthe actual build — the rootfs, the networking (including a fun bug with Ubuntu's\nUFW that had me staring at iptables rules for an embarrassing amount of time),\nthe guest agent, and the streaming pipeline that gets Claude's output from\ninside a VM to your browser.</p>\n","date_published":"Wed, 08 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/what-if-your-browser-built-the-ui-for-you/","url":"https://jonnonz.com/posts/what-if-your-browser-built-the-ui-for-you/","title":"What if your browser built the UI for you?","content_html":"<p>We're at a genuinely weird inflection point in frontend development. AI can\ngenerate entire interfaces now. LLMs can reason about data and layout. And yet —\nmost SaaS products still ship hand-crafted React apps, each building its own UI,\nits own accessibility layer, its own theme system, its own responsive\nbreakpoints. 
Not every service, but the vast majority.</p>\n<p>That's a lot of duplicated effort for what's essentially the same job — showing\na human some data and letting them do stuff with it.</p>\n<p>I've been thinking about this a lot lately, and I built a proof of concept to\ntest an idea: what if the browser itself generated the UI?</p>\n<h2>Where we are right now</h2>\n<p>The industry is circling this idea from multiple angles, but nobody's quite\nlanded on it yet.</p>\n<p><a href=\"https://www.apollographql.com/docs/graphos/schema-design/guides/sdui/basics\">Server-driven UI</a>\nhas been around for a while — Airbnb and others pioneered it for mobile, where\napp store review cycles make shipping UI changes painful. The server sends down\na JSON tree describing what to render, and the client just follows instructions.\nIt's clever, but the server is still calling the shots.</p>\n<p>Google recently shipped\n<a href=\"https://developers.google.com/natively-adaptive-interfaces\">Natively Adaptive Interfaces</a>\n— a framework that uses AI agents to make accessibility a default rather than an\nafterthought. Really cool idea, and the right instinct. But it's still operating\nwithin a single app's boundaries. Your accessibility preferences don't carry\nbetween Google's products and, say, your project management tool.</p>\n<p>Then there's the\n<a href=\"https://www.copilotkit.ai/blog/the-developer-s-guide-to-generative-ui-in-2026\">generative UI</a>\nwave — CopilotKit, Vercel's AI SDK, and others building frameworks where LLMs\ngenerate components on the fly. These are powerful developer tools, but they're\nstill developer tools. The generation happens at build time or on the server.\nThe service is still in control.</p>\n<p>See the pattern? Every approach keeps the power on the service side.</p>\n<h2>Flip it</h2>\n<p>Here's the idea behind the\n<a href=\"https://github.com/jonnonz1/adaptive-browser\">adaptive browser</a>: what if the\ngeneration happened on <em>your</em> side?</p>\n<p>Instead of a service shipping you a finished frontend, it publishes a manifest —\na structured description of what it can do. Its capabilities, endpoints, data\nshapes, what actions are available. Think of it like an API spec, but semantic.\nNot just &quot;here's a GET endpoint&quot; but &quot;here's a list of repositories, they're\nsortable by stars and language, you can create, delete, star, or fork them.&quot;</p>\n<p>Your browser takes that manifest, calls the actual APIs, gets real data back,\nand then generates the UI based on your preferences. Your font size. Your colour\nscheme. Your preferred layout (tables vs cards vs kanban). Your accessibility\nneeds.\n
All applied universally, across every service.</p>\n<p>The manifest for something like GitHub looks roughly like this — a service\ndescribes its capabilities and the browser figures out the rest:</p>\n<pre><code class=\"language-yaml\">service:\n  name: &quot;GitHub&quot;\n  domain: &quot;api.github.com&quot;\n\ncapabilities:\n  - id: &quot;repositories&quot;\n    endpoints:\n      - path: &quot;/user/repos&quot;\n        semantic: &quot;list&quot;\n        entity: &quot;repository&quot;\n        sortable_fields: [name, updated_at, stargazers_count]\n        actions: [create, delete, star, fork]\n</code></pre>\n<p>The browser takes that, fetches the data, and generates a bespoke interface —\nusing an LLM to reason about the best way to present it given who you are and\nwhat you're trying to do.</p>\n<h2>Why this matters more than it sounds</h2>\n<p>When I was building the app store and integrations platforms at Xero, one of the\nconstant headaches was that every third-party integration had its own UI\npatterns. Users had to learn a new interface for every app they connected. If\nthe browser was generating the UI from a shared set of preferences, that problem\njust… goes away.</p>\n<p>Accessibility is the big one though. Right now, accessibility is a feature that\ngets bolted on — and often badly. When the browser generates the UI,\naccessibility isn't a feature. It's the default. Your preferences — high\ncontrast, keyboard-first navigation, screen reader optimisation, larger text —\napply everywhere. Not because every developer remembered to implement them, but\nbecause they're baked into how the UI gets generated in the first place.</p>\n<p>Customisation becomes genuinely personal too. Not &quot;pick from three themes the\ndeveloper made&quot; but &quot;this is how I interact with software, full stop.&quot;</p>\n<h2>The trade-off is real though</h2>\n<p>Frontend complexity drops dramatically, but the complexity doesn't disappear —\nit moves behind the API. And honestly, it probably increases.</p>\n<p>API design becomes way more important. You can't just throw together some REST\nendpoints and call it a day. Your manifest needs to be semantic — describing\nwhat the data means, not just what shape it is. Data contracts between services\nmatter more. Versioning matters more.</p>\n<pre><code class=\"language-mermaid\">graph LR\n    A[Service] --&gt;|Publishes manifest + APIs| B[Browser Agent]\n    C[User Preferences] --&gt; B\n    D[Org Guardrails] --&gt; B\n    B --&gt;|Generates| E[Bespoke UI]\n</code></pre>\n<p>But here's the thing — this trade-off pushes us somewhere genuinely interesting.\nIf every service needs to describe itself semantically through APIs and\nmanifests, those APIs become the actual product surface. Not the frontend. The\nAPIs.</p>\n<p>And once APIs are the product surface, sharing context between platforms becomes\nthe interesting problem. Your project management tool knows what you're working\non. Your email client knows who you're talking to. Your code editor knows what\nyou're building. Right now, none of these talk to each other in any meaningful\nway because they're all locked behind their own UIs. In a manifest-driven world,\nthat context flows through the APIs — and your browser can stitch it all\ntogether into something coherent.</p>\n<h2>Where this is headed (IMHO)</h2>\n<p>I reckon we're about 3-5 years from this being mainstream. 
The pieces are all\nthere — LLMs that can reason about UI,\n<a href=\"https://www.builder.io/blog/ui-over-apis\">standardisation efforts</a> around\nsending UI intent over APIs, and a growing expectation from users that software\nshould adapt to them, not the other way around.</p>\n<p>The services that win in this world won't be the ones with the prettiest\nhand-crafted UI. They'll be the ones with the best APIs, the richest manifests,\nand the most useful data. The frontend becomes a generated output, not a\nhand-crafted input.</p>\n<p>Organisations will set preference guardrails — &quot;our people can use dark or light\nmode, must have destructive action confirmations, these fields are always\nvisible&quot; — while individuals customise within those bounds. Your browser becomes\nyour agent, not just a renderer.</p>\n<p>I built the <a href=\"https://github.com/jonnonz1/adaptive-browser\">adaptive browser</a> as\na proof of concept to test this thinking — it uses Claude to generate UIs from a\nGitHub manifest and user preferences defined in YAML. It's rough, but the\ndirection feels right.</p>\n<p>The frontend isn't dying. But what we think of as &quot;frontend development&quot; is\nabout to change. The interesting work moves to API design, semantic data\ncontracts, and building browsers smart enough to be genuine user agents.</p>\n","date_published":"Sun, 05 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/stealing-nanoclaw-patterns-for-webapps-and-saas/","url":"https://jonnonz.com/posts/stealing-nanoclaw-patterns-for-webapps-and-saas/","title":"Stealing NanoClaw Patterns for Web Apps and SaaS","content_html":"<p>In <a href=\"https://jonnonz.com/posts/nanoclaw-architecture-masterclass-in-doing-less/\">Part 1</a> I pulled\napart NanoClaw's codebase and found six patterns that make an 8,000-line AI\nassistant surprisingly robust. But NanoClaw is a single-user tool running on\nyour laptop. Surely these patterns fall apart once you've got real tenants, real\nmoney, and real scale?</p>\n<p>Nah. Four of them translate almost directly — and the ones that don't still\nteach you something useful.</p>\n<h2>The credential sidecar</h2>\n<p>NanoClaw's credential proxy — where containers get a placeholder API key and a\nlocalhost proxy injects the real one — sounds like a neat trick for a personal\ntool. But this exact pattern is showing up in production Kubernetes deployments\nright now.</p>\n<p>The broader version is a\n<a href=\"https://www.apistronghold.com/blog/phantom-token-pattern-production-ai-agents\">sidecar proxy that handles credential injection</a>\nfor any service that needs API keys or tokens. Your application code never\ntouches the real secret. A sidecar container intercepts outbound requests, swaps\nin credentials, and forwards them upstream.</p>\n<p>At Vend we managed a bunch of third-party integrations — payment gateways,\nshipping providers, accounting platforms. Each one had API keys that needed to\nlive somewhere. We went through the typical evolution: environment variables,\nthen a secrets manager, then a service that distributed keys at startup. Every\nstep was an improvement, but the keys still ended up <em>in the application's\nmemory</em>.</p>\n<p>The sidecar approach skips that entirely. Your app sends requests with a\nplaceholder. The proxy — which is a separate process with its own security\nboundary — does the credential swap. 
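</p>\n<p>A sketch of the shape in Go (not NanoClaw's code; the upstream host, the\nplaceholder, and the env var are all hypothetical):</p>\n<pre><code class=\"language-go\">upstream, _ := url.Parse(&quot;https://api.example.com&quot;)\nproxy := httputil.NewSingleHostReverseProxy(upstream)\n\norig := proxy.Director\nproxy.Director = func(req *http.Request) {\n    orig(req)\n    // The app only ever sends a placeholder; the sidecar holds the real key.\n    if req.Header.Get(&quot;Authorization&quot;) == &quot;Bearer PLACEHOLDER&quot; {\n        req.Header.Set(&quot;Authorization&quot;, &quot;Bearer &quot;+os.Getenv(&quot;REAL_API_KEY&quot;))\n    }\n}\n\nhttp.ListenAndServe(&quot;127.0.0.1:9900&quot;, proxy)\n</code></pre>\n<p>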
Even if your application gets compromised,\nthe real keys aren't there to steal.</p>\n<p>If you're running any kind of multi-service architecture where services call\nexternal APIs, this pattern is worth adopting. Your API gateway might already be\ndoing a version of it — the insight is making it explicit and consistent across\nall outbound credential flows.</p>\n<h2>Isolation as the security model</h2>\n<p>This is the one I keep thinking about.</p>\n<p>NanoClaw uses filesystem mounts to control what each container can see. No\napplication-level permission checks — the security model <em>is</em> the infrastructure\ntopology. If a container can't see a file, it can't access it. No bugs, no\nmissed checks, no escalation vulnerabilities.</p>\n<p>In SaaS, we spend enormous amounts of time writing authorisation logic. Role\nchecks, permission middleware, tenant-scoping queries. And it works — until\nsomeone forgets a WHERE clause.</p>\n<p>AWS's own\n<a href=\"https://docs.aws.amazon.com/whitepapers/latest/saas-architecture-fundamentals/tenant-isolation.html\">SaaS tenant isolation guidance</a>\nmakes this point explicitly: authentication and authorisation are not the same\nas isolation. The fact that a user logged in doesn't mean your system has\nachieved tenant isolation. A\n<a href=\"https://workos.com/blog/tenant-isolation-in-multi-tenant-systems\">single missed tenant filter</a>\non a database query and you've got a cross-tenant data leak.</p>\n<p>The NanoClaw-inspired approach is to push isolation down the stack. Separate\ndatabase schemas per tenant. Separate containers. Separate cloud accounts for\nyour highest-value customers. Not instead of application-level checks — but as a\nbackstop that catches the bugs your application-level checks inevitably have.</p>\n<p>At Xero, working across the integrations and app store teams, I saw first-hand\nhow multi-tenant data isolation gets complicated fast. The teams that had the\nfewest incidents were the ones where the infrastructure itself enforced\nboundaries, not just the application code.</p>\n<p>You don't need to go full NanoClaw and give every tenant their own container.\nBut you should be asking: if my application-level authorisation has a bug,\nwhat's my second line of defence? If the answer is &quot;nothing&quot; — that's the\npattern to steal.</p>\n<h2>Polling when it's the right call</h2>\n<p>NanoClaw polls SQLite every 2 seconds. No WebSockets, no event bus, no pub/sub.\nJust a loop that checks for new stuff.</p>\n<p>The instinct for most teams is to treat polling as a temporary hack you'll\nreplace with &quot;proper&quot; event-driven architecture later. Yan Cui wrote a\n<a href=\"https://theburningmonk.com/2025/05/understanding-push-vs-poll-in-event-driven-architectures/\">solid breakdown of push vs poll in event-driven systems</a>\nand the takeaway isn't that one is always better — it's that the right choice\ndepends on your throughput, ordering, and failure-handling requirements.</p>\n<p>For a lot of internal systems, polling is the correct permanent answer.</p>\n<p>Admin dashboards. Background job status. Internal reporting. Webhook retry\nqueues. Deployment pipelines. These systems don't need sub-second latency. They\nneed reliability and simplicity. A polling loop against your database gives you\nboth, with zero infrastructure overhead.</p>\n<p>At Xero we shipped multiple times per day, and some of the internal tooling that\nsupported continuous deployment was surprisingly simple under the hood. Cron\njobs. Polling loops. 
SQL queries on a timer. Not because anyone was cutting\ncorners — because the requirements genuinely didn't need anything more\nsophisticated.</p>\n<p>The trap is reaching for Kafka or RabbitMQ because you think you'll need it\neventually.\n<a href=\"https://synmek.com/saas-architecture-for-startups-2025-guide\">70% of startups fail due to premature scaling</a>.\nThe infrastructure you don't deploy is the infrastructure that never breaks.</p>\n<h2>Your database is your message queue</h2>\n<p>NanoClaw uses JSON files on the filesystem for inter-process communication.\nAtomic rename, directory-based identity, simple polling to pick up new messages.\nNo Redis. No message broker.</p>\n<p>That specific approach won't scale to a multi-tenant SaaS — but the <em>instinct</em>\nbehind it absolutely does. The instinct is: use the infrastructure you already\nhave.</p>\n<p>For most web apps, that means Postgres. The\n<a href=\"https://dagster.io/blog/skip-kafka-use-postgres-message-queue\">Postgres-as-queue movement</a>\nhas been gaining serious traction, and tools like\n<a href=\"https://github.com/pgmq/pgmq\">PGMQ</a> make it practical. You get ACID guarantees,\nyou don't need to manage another service, and your queue is backed by the same\ndatabase you're already monitoring and backing up.</p>\n<p>NanoClaw's\n<a href=\"https://dev.to/constanta/crash-safe-json-at-scale-atomic-writes-recovery-without-a-db-3aic\">atomic write pattern</a>\n— write to a temp file, rename into place — maps to <code>INSERT INTO queue_table</code>\nfollowed by a <code>SELECT ... FOR UPDATE SKIP LOCKED</code> consumer. Same principle: the\nmessage either exists completely or doesn't exist at all. No partial state.</p>\n<p>The &quot;just add Redis&quot; reflex is strong in our industry. Sometimes it's the right\ncall. But I've seen plenty of teams introduce a message broker for a workload\nthat Postgres could've handled without breaking a sweat — and then spend the\nnext six months debugging consumer lag and dead letter queues.</p>\n<h2>The real pattern</h2>\n<p>The specific techniques matter less than the discipline behind them.</p>\n<p>NanoClaw's developer looked at a 500,000-line framework and asked: what are my\n<em>actual</em> constraints? Single user. Local machine. One AI provider. And then\nbuilt exactly the architecture those constraints required — nothing more.</p>\n<p>Most teams don't do this. They build for imaginary scale, imaginary\nmulti-tenancy requirements, imaginary traffic spikes. They reach for Kubernetes\nbefore they've outgrown a single server. They deploy event buses before they've\noutgrown a polling loop. They write complex authorisation middleware before\nthey've considered whether infrastructure isolation would eliminate the problem\nentirely.</p>\n<p>The pattern worth stealing isn't the credential proxy or the polling loop or\nPostgres-as-queue. It's the habit of understanding your constraints first and\nletting them delete complexity from your architecture.</p>\n<p>Hardest pattern to adopt, though. Because it means admitting you're smaller than\nyou think.</p>\n","date_published":"Sun, 05 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/what-the-data-actually-shows/","url":"https://jonnonz.com/posts/what-the-data-actually-shows/","title":"What the Data Actually Shows","content_html":"<p>You can't just shove a tensor into a neural network and hope for the best.</p>\n<p>I mean, you <em>can</em>. People do it all the time. 
But you'll have no idea whether\nyour model is learning something real or just memorising noise. Before we get\nanywhere near ConvLSTM or ST-ResNet, we need to properly understand what\npatterns actually exist in this data, and whether they're strong enough for a\nmodel to learn.</p>\n<p>This is the part that most ML blog posts skip. It's also the part that saves you\nweeks of debugging later.</p>\n<h2>When does crime happen?</h2>\n<p>The monthly pattern across Auckland is surprisingly consistent year to year.\nCrime peaks in late spring and early summer (October through January) and dips\nin late summer through winter. February is reliably the quietest month at around\n7,000–8,000 victimisations, while November and December regularly push past\n9,000.</p>\n<p>This tracks with what\n<a href=\"https://link.springer.com/article/10.1007/s43762-023-00094-x\">criminology research has found globally</a>:\nwarmer months mean more people out and about, more opportunities for property\ncrime, and more interpersonal conflict. It's a well-documented pattern called\nseasonal variation in crime, and it shows up clearly in the NZ data.</p>\n<p>The seasonal signal isn't uniform across crime types though. Theft drives most\nof the swing. It surges in summer and drops in winter, accounting for nearly all\nthe monthly variance. Assault has its own rhythm. It peaks around the holiday\nperiod (December–January) and shows a secondary bump in winter weekends,\nprobably pub-related. Burglary is flatter, with a slight winter uptick when\nhouses are dark earlier.</p>\n<p>2023 was the peak year across the board, with a noticeable decline through 2024\nand into early 2025. Whether that's a real trend or a reporting artefact, I\ngenuinely don't know. But it means the model's training data includes both an\nupswing and a downswing, which is useful. It can't just learn &quot;crime always goes\nup.&quot;</p>\n<h2>Where does crime cluster?</h2>\n<p>Crime in Auckland is not randomly distributed. That's obvious to anyone who\nlives here, but it's worth quantifying.</p>\n<p>Running a\n<a href=\"https://www.publichealth.columbia.edu/research/population-health-methods/hot-spot-spatial-analysis\">Moran's I test</a>\non our 500m grid confirms strong positive spatial autocorrelation. Cells with\nhigh crime counts are surrounded by other high-crime cells. The Moran's I\nstatistic comes out at 0.43 (p &lt; 0.001), which means the clustering is highly\nsignificant. Crime begets more crime in adjacent cells.</p>\n<p>The hotspots are exactly where you'd expect. The CBD dominates: Queen Street,\nKarangahape Road, and the surrounding blocks consistently light up across all\ncrime types. South Auckland corridors (Manukau, Ōtāhuhu, Papatoetoe) form a\nsecond cluster, particularly for assault and robbery. Henderson in the west\nshows up for burglary.</p>\n<p>What's less obvious is how stable these hotspots are over time. The top 5% of\ncells (about 227 cells) account for over 60% of all recorded crime across the\nentire four-year period. These aren't random spikes. They're persistent. A cell\nthat's hot in 2022 is almost certainly still hot in 2025. That temporal\npersistence is exactly what makes this data amenable to prediction. If hotspots\nmoved randomly month to month, no model could learn them.</p>\n<h2>Crime type correlations</h2>\n<p>The six channels in our tensor don't behave independently. Theft and burglary\nshow moderate positive correlation (r ≈ 0.52). 
Cells with lots of theft tend to\nhave more burglary too, which makes sense given similar opportunity structures\n(commercial areas, transport hubs).</p>\n<p>Assault correlates weakly with everything else (r ≈ 0.15–0.25). It has its own\nspatial logic (nightlife areas, specific residential pockets) that doesn't align\nneatly with property crime.</p>\n<p>Robbery, sexual offences, and harm are so sparse at the 500m monthly resolution\nthat correlation analysis is basically meaningless. Most cells have zero counts\nfor these types in any given month. That sparsity is going to be a real headache\nfor the models.</p>\n<h2>The sparsity problem, again</h2>\n<p>We flagged this in Part 3: 91.7% of the tensor is zeros. But the EDA makes the\nproblem even clearer.</p>\n<p>The distribution of non-zero cell values is heavily right-skewed. The median\nnon-zero value is 1. One crime, in one cell, in one month. The mean is about\n2.3. A handful of cells (the CBD, Manukau) hit 30–50+ in peak months for theft.\nThe model needs to learn the difference between &quot;always zero&quot; cells,\n&quot;occasionally one&quot; cells, and &quot;consistently busy&quot; cells.</p>\n<p>If you plot the crime count distribution across non-zero cells, it follows\nsomething close to a power law. A tiny number of cells carry an outsized share\nof the signal. This is textbook\n<a href=\"https://pmc.ncbi.nlm.nih.gov/articles/PMC7319308/\">spatial concentration of crime</a>,\ndocumented in basically every city ever studied.</p>\n<p>For modelling, this means two things. First, aggregate metrics like RMSE will be\ndominated by how well the model predicts the high-count cells. Second,\npredicting &quot;zero&quot; for a sparse cell is almost always correct but completely\nuninformative. We'll need to think carefully about what &quot;accuracy&quot; actually\nmeans when we get to evaluation.</p>\n<h2>What this means for the models</h2>\n<p>The EDA tells us a few things that should directly shape how we build and\nevaluate the models:</p>\n<p>The seasonal signal is strong and consistent. A model that can't capture monthly\nseasonality is worse than useless. It's worse than a calendar.</p>\n<p>Spatial structure is real and persistent. Hotspots don't move much. A model that\nlearns static spatial patterns will get a lot of the way there, even without\nunderstanding temporal dynamics.</p>\n<p>We already know the CBD will have lots of theft next month. That's not what\nwe're trying to predict. The real value is in the margins: the cells that go\nfrom quiet to active, or the months where a normally stable area spikes. That's\nwhere deep learning might actually add something over simple baselines.</p>\n<p>Speaking of which, we need baselines. Otherwise we won't know if ConvLSTM is\nactually clever or just expensive. That's next.</p>\n","date_published":"Thu, 02 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/built-an-sms-gateway-with-a-20-dollar-android-phone/","url":"https://jonnonz.com/posts/built-an-sms-gateway-with-a-20-dollar-android-phone/","title":"How I Built an SMS Gateway with a $20 Android Phone","content_html":"<p>Twilio charges around $0.05–0.06 per SMS round-trip. 
Doesn't sound like much\nuntil you're building an MVP that sends reminders, confirmations, and\nnotifications — suddenly you're looking at $50/month for a thousand messages.\nFor an app that's not making money yet, that's a dumb tax.</p>\n<p>Here's what I did instead: grabbed a cheap Android phone, installed an\nopen-source app called\n<a href=\"https://github.com/capcom6/android-sms-gateway\">SMS Gateway for Android</a>, and\nturned it into a full SMS gateway with a REST API. My SMS costs dropped to\nwhatever my mobile plan charges — which on plenty of prepaid plans is zero.\nUnlimited texts.</p>\n<p>This post walks through exactly how to wire it into a Next.js app, from first\ninstall to receiving webhooks. The whole thing took an afternoon.</p>\n<hr>\n<h2>What You're Building</h2>\n<p>By the end of this you'll have:</p>\n<ul>\n<li>An Android phone acting as your SMS gateway</li>\n<li>A webhook endpoint receiving inbound SMS in real-time</li>\n<li>Outbound SMS sent via a simple REST API call</li>\n<li>A provider abstraction so you can swap between SMS Gateway, Twilio, or console\nlogging</li>\n</ul>\n<h2>Prerequisites</h2>\n<ul>\n<li>An Android phone (5.0+) with a SIM card</li>\n<li>A Next.js app (I'm using 15 with App Router, but any backend works)</li>\n<li>Node.js 18+</li>\n<li>ngrok for testing with cloud mode</li>\n</ul>\n<hr>\n<h2>Install SMS Gateway on Android</h2>\n<ol>\n<li>\n<p>Install <strong>SMS Gateway for Android</strong> from the\n<a href=\"https://play.google.com/store/apps/details?id=me.capcom.smsgateway\">Google Play Store</a>\nor grab the APK from\n<a href=\"https://github.com/capcom6/android-sms-gateway/releases\">GitHub Releases</a></p>\n</li>\n<li>\n<p>Open the app and <strong>grant SMS permissions</strong> when prompted</p>\n</li>\n<li>\n<p>You'll see the main screen with toggles for Local Server and Cloud Server:</p>\n</li>\n</ol>\n<p><img src=\"https://jonnonz.com/img/posts/sms-gateway/screenshot.png\" alt=\"SMS Gateway main screen\"></p>\n<p>The app supports two modes — local and cloud. Both work well, and I'll cover\neach.</p>\n<hr>\n<h2>Local Server Mode</h2>\n<p>Local mode runs an HTTP server directly on the phone. Your backend talks to it\nover your local network. No cloud dependency, no third-party servers — the\nsimplest setup.</p>\n<h3>Configure It</h3>\n<p><img src=\"https://jonnonz.com/img/posts/sms-gateway/local-server.png\" alt=\"Local server settings\"> <em>Local server\nconfiguration</em></p>\n<ol>\n<li>Toggle <strong>&quot;Local Server&quot;</strong> on</li>\n<li>Go to <strong>Settings &gt; Local Server</strong> to configure:\n<ul>\n<li><strong>Port:</strong> 1024–65535 (default <code>8080</code>)</li>\n<li><strong>Username:</strong> minimum 3 characters</li>\n<li><strong>Password:</strong> minimum 8 characters</li>\n</ul>\n</li>\n<li>Tap <strong>&quot;Offline&quot;</strong> — it changes to <strong>&quot;Online&quot;</strong></li>\n<li>Note the <strong>local IP address</strong> displayed (e.g. <code>192.168.1.50</code>)</li>\n</ol>\n<p>Your phone is now running an HTTP server. 
Verify it:</p>\n<pre><code class=\"language-bash\"># Health check\ncurl http://192.168.1.50:8080/health\n\n# Swagger docs\nopen http://192.168.1.50:8080/docs\n</code></pre>\n<h3>Send Your First SMS</h3>\n<pre><code class=\"language-bash\">curl -X POST http://192.168.1.50:8080/message \\\n  -u &quot;admin:yourpassword&quot; \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{\n    &quot;textMessage&quot;: { &quot;text&quot;: &quot;Hello from my SMS gateway!&quot; },\n    &quot;phoneNumbers&quot;: [&quot;+15551234567&quot;]\n  }'\n</code></pre>\n<p>That's it. The phone sends the SMS from its own number, using your mobile plan's\nrates.</p>\n<h3>Register a Webhook for Inbound SMS</h3>\n<p>To receive SMS messages as webhooks:</p>\n<pre><code class=\"language-bash\">curl -X POST http://192.168.1.50:8080/webhooks \\\n  -u &quot;admin:yourpassword&quot; \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{\n    &quot;id&quot;: &quot;my-webhook&quot;,\n    &quot;url&quot;: &quot;http://192.168.1.100:4000/api/sms/webhook&quot;,\n    &quot;event&quot;: &quot;sms:received&quot;\n  }'\n</code></pre>\n<p>Replace <code>192.168.1.100</code> with your dev machine's local IP. Both devices need to\nbe on the same WiFi network.</p>\n<h3>Local Mode Gotchas</h3>\n<ul>\n<li><strong>AP isolation:</strong> Many routers — especially mesh networks and office WiFi —\nblock device-to-device traffic. If you can't reach the phone, check your\nrouter settings for &quot;AP isolation&quot; or &quot;client isolation&quot; and disable it. This\none caught me out for a good 20 minutes.</li>\n<li><strong>Battery optimisation:</strong> Android will kill the background server to save\nbattery. Disable battery optimisation for SMS Gateway in your phone settings.\n<a href=\"https://dontkillmyapp.com/\">dontkillmyapp.com</a> has device-specific\ninstructions — genuinely useful site.</li>\n<li><strong>Keep it plugged in:</strong> During development and in production, the phone lives\non a charger. It's not going anywhere.</li>\n</ul>\n<hr>\n<h2>Cloud Server Mode</h2>\n<p>Cloud mode is easier to set up and works from anywhere — no local network\nrequired. The phone connects to SMS Gateway's cloud relay (<code>api.sms-gate.app</code>),\nand your backend talks to the same cloud API.</p>\n<p><img src=\"https://jonnonz.com/img/posts/sms-gateway/cloud-server.png\" alt=\"Cloud server settings\"> <em>Cloud server\nconfiguration</em></p>\n<h3>Enable It</h3>\n<ol>\n<li>Toggle <strong>&quot;Cloud Server&quot;</strong> on in the app</li>\n<li>Tap <strong>&quot;Offline&quot;</strong> — it connects and registers automatically</li>\n<li>A <strong>username</strong> and <strong>password</strong> are auto-generated (visible in the Cloud\nServer section)</li>\n<li>Note these credentials — you'll need them for API calls</li>\n</ol>\n<p>The cloud uses a hybrid push architecture: Firebase Cloud Messaging as the\nprimary channel, Server-Sent Events as fallback, and 15-minute polling as a last\nresort. 
It's well thought through.</p>\n<h3>Send an SMS via Cloud API</h3>\n<pre><code class=\"language-bash\">curl -X POST https://api.sms-gate.app/3rdparty/v1/messages \\\n  -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{\n    &quot;textMessage&quot;: { &quot;text&quot;: &quot;Hello from the cloud!&quot; },\n    &quot;phoneNumbers&quot;: [&quot;+15551234567&quot;]\n  }'\n</code></pre>\n<h3>Register a Webhook (Cloud Mode)</h3>\n<p>Your webhook URL <strong>must be HTTPS</strong> in cloud mode. For local development, use\nngrok:</p>\n<pre><code class=\"language-bash\"># Start ngrok tunnel to your dev server\nngrok http 4000\n# Output: https://abc123.ngrok.app\n\n# Register the webhook\ncurl -X POST https://api.sms-gate.app/3rdparty/v1/webhooks \\\n  -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{\n    &quot;url&quot;: &quot;https://abc123.ngrok.app/api/sms/webhook&quot;,\n    &quot;event&quot;: &quot;sms:received&quot;\n  }'\n</code></pre>\n<h3>Manage Webhooks</h3>\n<pre><code class=\"language-bash\"># List webhooks\ncurl -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \\\n  https://api.sms-gate.app/3rdparty/v1/webhooks\n\n# Delete a webhook\ncurl -X DELETE -u &quot;YOUR_USERNAME:YOUR_PASSWORD&quot; \\\n  https://api.sms-gate.app/3rdparty/v1/webhooks/WEBHOOK_ID\n</code></pre>\n<hr>\n<h2>The Code — Next.js Integration</h2>\n<p>Here's how I integrated SMS Gateway into a Next.js app with a clean provider\nabstraction. The idea is simple — swap providers without touching business\nlogic.</p>\n<h3>Provider Interface</h3>\n<pre><code class=\"language-typescript\">// src/lib/sms/provider.ts\n\nexport interface InboundSms {\n  from: string;\n  body: string;\n  receivedAt?: Date;\n}\n\nexport interface SmsProvider {\n  send(to: string, body: string): Promise&lt;string&gt;;\n  parseWebhook(req: Request): Promise&lt;InboundSms | null&gt;;\n  webhookResponse(replyText?: string): Response;\n}\n\nexport async function getSmsProvider(): Promise&lt;SmsProvider&gt; {\n  const provider = process.env.SMS_PROVIDER || &quot;sms-gate&quot;;\n\n  switch (provider) {\n    case &quot;sms-gate&quot;: {\n      const { SmsGateProvider } = await import(&quot;./sms-gate&quot;);\n      return new SmsGateProvider();\n    }\n    case &quot;console&quot;: {\n      const { ConsoleProvider } = await import(&quot;./console&quot;);\n      return new ConsoleProvider();\n    }\n    default:\n      throw new Error(`Unknown SMS provider: ${provider}`);\n  }\n}\n</code></pre>\n<h3>SMS Gate Provider</h3>\n<p>The provider handles both local and cloud API differences:</p>\n<pre><code class=\"language-typescript\">// src/lib/sms/sms-gate.ts\n\nimport type { InboundSms, SmsProvider } from &quot;./provider&quot;;\n\nconst SMSGATE_URL = process.env.SMSGATE_URL || &quot;http://localhost:8080&quot;;\nconst SMSGATE_USER = process.env.SMSGATE_USER || &quot;&quot;;\nconst SMSGATE_PASSWORD = process.env.SMSGATE_PASSWORD || &quot;&quot;;\n\nexport class SmsGateProvider implements SmsProvider {\n  private headers(): Record&lt;string, string&gt; {\n    const auth = Buffer.from(\n      `${SMSGATE_USER}:${SMSGATE_PASSWORD}`,\n    ).toString(&quot;base64&quot;);\n    return {\n      &quot;Content-Type&quot;: &quot;application/json&quot;,\n      Authorization: `Basic ${auth}`,\n    };\n  }\n\n  async send(to: string, body: string): Promise&lt;string&gt; {\n    const isCloud = SMSGATE_URL.includes(&quot;api.sms-gate.app&quot;);\n    const endpoint = 
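\n    // Cloud and local servers expose different paths and payload shapes,\n    // so both are derived from the configured URL:\n    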
isCloud\n      ? `${SMSGATE_URL}/3rdparty/v1/messages`\n      : `${SMSGATE_URL}/api/3rdparty/v1/message`;\n    const payload = isCloud\n      ? { textMessage: { text: body }, phoneNumbers: [to] }\n      : { phoneNumbers: [to], message: body };\n\n    const res = await fetch(endpoint, {\n      method: &quot;POST&quot;,\n      headers: this.headers(),\n      body: JSON.stringify(payload),\n    });\n\n    if (!res.ok) {\n      const err = await res.text();\n      throw new Error(`SMS Gate send failed: ${res.status} ${err}`);\n    }\n\n    const data = await res.json();\n    return data.id || &quot;sent&quot;;\n  }\n\n  async parseWebhook(req: Request): Promise&lt;InboundSms | null&gt; {\n    try {\n      const body = await req.json();\n\n      if (body.event !== &quot;sms:received&quot; || !body.payload) {\n        return null;\n      }\n\n      const { phoneNumber, message, receivedAt } = body.payload;\n      if (!phoneNumber || !message) return null;\n\n      return {\n        from: phoneNumber,\n        body: message,\n        receivedAt: receivedAt ? new Date(receivedAt) : new Date(),\n      };\n    } catch {\n      return null;\n    }\n  }\n\n  webhookResponse(): Response {\n    return new Response(JSON.stringify({ ok: true }), {\n      headers: { &quot;Content-Type&quot;: &quot;application/json&quot; },\n    });\n  }\n}\n</code></pre>\n<h3>Webhook Route</h3>\n<p>A basic webhook handler that receives inbound SMS and replies:</p>\n<pre><code class=\"language-typescript\">// src/app/api/sms/webhook/route.ts\n\nimport { NextRequest } from &quot;next/server&quot;;\nimport { getSmsProvider } from &quot;@/lib/sms/provider&quot;;\n\nexport async function POST(req: NextRequest) {\n  const provider = await getSmsProvider();\n  const sms = await provider.parseWebhook(req);\n\n  if (!sms) {\n    return new Response(&quot;Bad request&quot;, { status: 400 });\n  }\n\n  const { from, body } = sms;\n\n  // Look up the sender — replace with your own user lookup\n  const user = await findUserByPhone(from);\n\n  if (!user) {\n    await provider.send(from, &quot;Hey! 
Text us back once you've signed up.&quot;);\n    return provider.webhookResponse();\n  }\n\n  // Known user — do whatever your app needs\n  console.log(`[SMS from ${from}]: ${body}`);\n  await provider.send(from, &quot;Got it — we're on it!&quot;);\n  return provider.webhookResponse();\n}\n</code></pre>\n<h3>Console Provider (for Testing)</h3>\n<p>For local development without a phone:</p>\n<pre><code class=\"language-typescript\">// src/lib/sms/console.ts\n\nimport type { InboundSms, SmsProvider } from &quot;./provider&quot;;\n\nexport class ConsoleProvider implements SmsProvider {\n  async send(to: string, body: string): Promise&lt;string&gt; {\n    console.log(`[SMS -&gt; ${to}] ${body}`);\n    return `console-${Date.now()}`;\n  }\n\n  async parseWebhook(req: Request): Promise&lt;InboundSms | null&gt; {\n    const data = await req.json();\n    return {\n      from: data.from || &quot;+15550000000&quot;,\n      body: data.body || &quot;&quot;,\n      receivedAt: new Date(),\n    };\n  }\n\n  webhookResponse(): Response {\n    return new Response(JSON.stringify({ ok: true }), {\n      headers: { &quot;Content-Type&quot;: &quot;application/json&quot; },\n    });\n  }\n}\n</code></pre>\n<h3>Environment Variables</h3>\n<pre><code class=\"language-bash\"># .env\n\n# Provider: &quot;sms-gate&quot; | &quot;console&quot;\nSMS_PROVIDER=sms-gate\n\n# Local mode\nSMSGATE_URL=http://192.168.1.50:8080\nSMSGATE_USER=admin\nSMSGATE_PASSWORD=yourpassword\n\n# Cloud mode\n# SMSGATE_URL=https://api.sms-gate.app\n# SMSGATE_USER=auto-generated-username\n# SMSGATE_PASSWORD=auto-generated-password\n</code></pre>\n<hr>\n<h2>Webhook Payload Reference</h2>\n<p>When someone texts your Android phone, SMS Gateway sends a POST to your webhook\nURL:</p>\n<pre><code class=\"language-json\">{\n  &quot;id&quot;: &quot;Ey6ECgOkVVFjz3CL48B8C&quot;,\n  &quot;webhookId&quot;: &quot;LreFUt-Z3sSq0JufY9uWB&quot;,\n  &quot;deviceId&quot;: &quot;your-device-id&quot;,\n  &quot;event&quot;: &quot;sms:received&quot;,\n  &quot;payload&quot;: {\n    &quot;messageId&quot;: &quot;abc123&quot;,\n    &quot;message&quot;: &quot;Hello!&quot;,\n    &quot;sender&quot;: &quot;+15551234567&quot;,\n    &quot;recipient&quot;: &quot;+15559876543&quot;,\n    &quot;simNumber&quot;: 1,\n    &quot;receivedAt&quot;: &quot;2026-04-01T12:41:59.000+00:00&quot;\n  }\n}\n</code></pre>\n<h3>Available Events</h3>\n<table>\n<thead>\n<tr>\n<th>Event</th>\n<th>Description</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><code>sms:received</code></td>\n<td>Inbound SMS received</td>\n</tr>\n<tr>\n<td><code>sms:sent</code></td>\n<td>Outbound SMS sent</td>\n</tr>\n<tr>\n<td><code>sms:delivered</code></td>\n<td>Outbound SMS confirmed delivered</td>\n</tr>\n<tr>\n<td><code>sms:failed</code></td>\n<td>Outbound SMS failed</td>\n</tr>\n<tr>\n<td><code>system:ping</code></td>\n<td>Heartbeat — device still alive</td>\n</tr>\n</tbody>\n</table>\n<h3>Webhook Security</h3>\n<p>SMS Gateway signs webhook payloads with HMAC-SHA256. 
Two headers are included:</p>\n<ul>\n<li><code>X-Signature</code> — hex-encoded HMAC-SHA256 signature</li>\n<li><code>X-Timestamp</code> — Unix timestamp used in signing</li>\n</ul>\n<pre><code class=\"language-typescript\">import crypto from &quot;crypto&quot;;\n\nfunction verifyWebhook(\n  signingKey: string,\n  payload: string,\n  timestamp: string,\n  signature: string,\n): boolean {\n  const expected = crypto\n    .createHmac(&quot;sha256&quot;, signingKey)\n    .update(payload + timestamp)\n    .digest(&quot;hex&quot;);\n  const expectedBuf = Buffer.from(expected, &quot;hex&quot;);\n  const signatureBuf = Buffer.from(signature, &quot;hex&quot;);\n  // timingSafeEqual throws on length mismatch, so guard first\n  if (expectedBuf.length !== signatureBuf.length) return false;\n  return crypto.timingSafeEqual(expectedBuf, signatureBuf);\n}\n</code></pre>
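<p>Wiring that into the route looks something like this. A sketch, not gospel: the env var name is my own invention, and note that you need the raw request body, not re-serialised JSON, or the signature won't match:</p>\n<pre><code class=\"language-typescript\">// Sketch only: header names per the docs above, env var name assumed\nimport { NextRequest } from &quot;next/server&quot;;\n\nexport async function POST(req: NextRequest) {\n  const raw = await req.text(); // raw body, exactly as signed\n  const ok = verifyWebhook(\n    process.env.SMSGATE_WEBHOOK_SIGNING_KEY!,\n    raw,\n    req.headers.get(&quot;X-Timestamp&quot;) ?? &quot;&quot;,\n    req.headers.get(&quot;X-Signature&quot;) ?? &quot;&quot;,\n  );\n  if (!ok) return new Response(&quot;Invalid signature&quot;, { status: 401 });\n\n  const body = JSON.parse(raw);\n  // ...handle the event as in the webhook route earlier\n}\n</code></pre>\n<h3>Retry Behaviour</h3>\n<p>If your server doesn't respond 2xx within 30 seconds, SMS Gateway retries with\nexponential backoff — starting at 10 seconds, doubling each time, up to 14\nattempts (~2 days). Solid default behaviour, you don't need to configure\nanything.</p>\n<hr>\n<h2>Testing the Full Flow</h2>\n<h3>1. Start Your Dev Server</h3>\n<pre><code class=\"language-bash\">npm run dev\n# Next.js running at http://localhost:4000\n</code></pre>\n<h3>2. Expose It (Cloud Mode)</h3>\n<pre><code class=\"language-bash\">ngrok http 4000\n# https://abc123.ngrok.app -&gt; http://localhost:4000\n</code></pre>\n<h3>3. Register the Webhook</h3>\n<pre><code class=\"language-bash\">curl -X POST https://api.sms-gate.app/3rdparty/v1/webhooks \\\n  -u &quot;USERNAME:PASSWORD&quot; \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{\n    &quot;url&quot;: &quot;https://abc123.ngrok.app/api/sms/webhook&quot;,\n    &quot;event&quot;: &quot;sms:received&quot;\n  }'\n</code></pre>\n<h3>4. Send a Text</h3>\n<p>Text your Android phone from another phone. You should see:</p>\n<ol>\n<li>SMS Gateway receives the text</li>\n<li>Webhook fires to your ngrok URL</li>\n<li>Your Next.js server processes it</li>\n<li>A reply SMS is sent back via the API</li>\n<li>The sender's phone receives the reply</li>\n</ol>\n<p>That moment when the reply lands on your phone — genuinely satisfying.</p>\n<h3>Test Without a Phone</h3>\n<pre><code class=\"language-bash\"># Simulate an inbound SMS with the console provider\nSMS_PROVIDER=console npm run dev\n\ncurl -X POST http://localhost:4000/api/sms/webhook \\\n  -H &quot;Content-Type: application/json&quot; \\\n  -d '{&quot;from&quot;: &quot;+15551234567&quot;, &quot;body&quot;: &quot;Hello&quot;}'\n</code></pre>\n<hr>\n<h2>Production Considerations</h2>\n<h3>The Phone Setup</h3>\n<ul>\n<li><strong>Dedicated device:</strong> Use a cheap Android phone ($20) with a prepaid SIM. It\nsits on a charger plugged into power and WiFi. That's its whole life now.</li>\n<li><strong>Battery optimisation off:</strong> Disable battery optimisation for SMS Gateway or\nAndroid will kill it. 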
<a href=\"https://dontkillmyapp.com/\">dontkillmyapp.com</a> for your\nspecific device.</li>\n<li><strong>Auto-start:</strong> Enable &quot;start on boot&quot; in the SMS Gateway app settings.</li>\n<li><strong>Monitoring:</strong> Register a <code>system:ping</code> webhook to alert if the device goes\noffline.</li>\n</ul>\n<h3>Local vs Cloud</h3>\n<table>\n<thead>\n<tr>\n<th></th>\n<th>Local</th>\n<th>Cloud</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><strong>Latency</strong></td>\n<td>Lower (direct)</td>\n<td>Slightly higher (relay)</td>\n</tr>\n<tr>\n<td><strong>Network</strong></td>\n<td>Same network required</td>\n<td>Works from anywhere</td>\n</tr>\n<tr>\n<td><strong>Privacy</strong></td>\n<td>Messages never leave your network</td>\n<td>Messages transit through SMS Gateway's servers</td>\n</tr>\n<tr>\n<td><strong>Reliability</strong></td>\n<td>Depends on your network</td>\n<td>Adds FCM/SSE redundancy</td>\n</tr>\n<tr>\n<td><strong>Cost</strong></td>\n<td>Free</td>\n<td>Free (community tier)</td>\n</tr>\n</tbody>\n</table>\n<p>I use <strong>cloud mode in production</strong> because my server's hosted on Railway and\ncan't reach the phone's local network. For development on the same WiFi, local\nmode is simpler and faster.</p>\n<h3>Cost Comparison</h3>\n<table>\n<thead>\n<tr>\n<th>Provider</th>\n<th>SMS Cost</th>\n<th>Monthly (1,000 msgs)</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Twilio</td>\n<td>~$0.05/msg</td>\n<td>~$50</td>\n</tr>\n<tr>\n<td>SMS Gateway + Prepaid SIM</td>\n<td>$0/msg (unlimited plan)</td>\n<td>~$8 (plan cost)</td>\n</tr>\n</tbody>\n</table>\n<p>That's an <strong>80%+ saving</strong>, and it scales linearly — 10,000 messages a month is\nstill just your plan cost.</p>\n<hr>\n<p>It's worth knowing this is a whole category now. <a href=\"https://httpsms.com/\">httpSMS</a>\nand <a href=\"https://textbee.dev/\">textbee</a> do similar things. I went with\n<a href=\"https://github.com/capcom6/android-sms-gateway\">SMS Gateway for Android</a>\nbecause the local mode is properly useful for development, the\n<a href=\"https://docs.sms-gate.app/\">documentation</a> is solid, and it's actively\nmaintained — v1.56.0 dropped in March 2026.</p>\n<p>For an MVP, the maths is obvious. A $20 phone and an $8/month plan gets you a\nprogrammable SMS gateway that you fully control. No per-message fees, no carrier\ncontracts, no vendor lock-in. If you outgrow it, swap the provider interface to\nTwilio and you're done — that's why the abstraction exists.</p>\n<p><strong>Links:</strong></p>\n<ul>\n<li><a href=\"https://github.com/capcom6/android-sms-gateway\">SMS Gateway for Android on GitHub</a></li>\n<li><a href=\"https://docs.sms-gate.app/\">SMS Gateway Documentation</a></li>\n<li><a href=\"https://play.google.com/store/apps/details?id=me.capcom.smsgateway\">Google Play Store listing</a></li>\n</ul>\n","date_published":"Thu, 02 Apr 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/wrangling-a-million-crime-records/","url":"https://jonnonz.com/posts/wrangling-a-million-crime-records/","title":"Wrangling a Million Crime Records","content_html":"<p>The very first thing NZ Police's crime dataset teaches you is that government\ndata is never straightforward.</p>\n<p>You download the CSV from\n<a href=\"https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz/victimisation-time-and-place\">policedata.nz</a>,\nexpecting to do a quick <code>pd.read_csv()</code> and start exploring. 
Instead you get a\n503MB file encoded in UTF-16 Little Endian with tab delimiters. Not a regular\nCSV. Not even close. This is a legacy format from old Excel exports and most\ntools just silently corrupt it if you try to read it as UTF-8.</p>\n<pre><code class=\"language-python\">df = pd.read_csv(&quot;data.csv&quot;, encoding=&quot;utf-16-le&quot;, sep=&quot;\\t&quot;)\n</code></pre>\n<p>That one line took longer to figure out than I'd like to admit.</p>\n<h2>What's actually in here</h2>\n<p>Once you get past the encoding, there's a lot to work with. 1,154,102 rows\ncovering every reported victimisation in New Zealand from February 2022 through\nJanuary 2026. Each row tells you the crime type (ANZSOC Division), where it\nhappened (down to meshblock level), when it happened (month, day of week, hour\nof day), and sometimes what weapon was involved.</p>\n<p>There are 20 columns, but five of them are useless: three are duplicates of\n&quot;Year Month&quot; and two are constants that add zero information. Every area name\nand territorial authority has a trailing period stuck on the end: &quot;Auckland.&quot;,\n&quot;Woodglen.&quot;, &quot;Christchurch City.&quot;. A quirk of the export that'll break any\ngeographic join if you don't strip them.</p>\n<p>And meshblock IDs? Some are 6 digits, some are 7. Stats NZ boundary files use\n7-digit codes consistently, so shorter ones need zero-padding. The kind of thing\nthat's invisible until your join silently drops 19% of your records and you\nspend an afternoon figuring out why.</p>
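<p>Both quirks are one-liners in pandas once you know they're there. A sketch, with illustrative column names:</p>\n<pre><code class=\"language-python\"># Sketch: fixing the two export quirks (column names are illustrative)\ndf[&quot;area_unit&quot;] = df[&quot;area_unit&quot;].str.rstrip(&quot;.&quot;)\ndf[&quot;territorial_authority&quot;] = df[&quot;territorial_authority&quot;].str.rstrip(&quot;.&quot;)\n\n# Keep meshblocks as strings and zero-pad to 7 digits so joins\n# against Stats NZ boundary files don't silently drop rows\ndf[&quot;meshblock&quot;] = df[&quot;meshblock&quot;].astype(str).str.zfill(7)\n</code></pre>\n<h2>What the missing data tells you</h2>\n<p>That bit actually made me stop and think. 32.2% of records have the hour of day\nrecorded as 99 (unknown). Another 23.2% have the day of week as &quot;UNKNOWN&quot;.</p>\n<p>At first this looks like a data quality problem. But it's not. It's telling you\nsomething about the nature of the crime. If someone breaks into your house while\nyou're at work, you come home to find your stuff gone. Was it 9am or 2pm? You've\ngot no idea, and neither do the police.</p>\n<p>Property crimes (theft, burglary) make up the bulk of these unknowns. Assault,\nby contrast, almost always has a precise time because there's a victim present\nwhen it happens. The absence of data is itself a signal about what kind of crime\nyou're looking at.</p>\n<p>78.6% of location type values are &quot;.&quot; (effectively missing). That column is\nsparsely populated but still useful for the roughly one in five records that\nhave it.</p>\n<h2>Cleaning it up</h2>\n<p>We built a modular pipeline where each cleaning step is its own function.\nNothing fancy, just practical:</p>\n<pre><code class=\"language-python\">def ingest() -&gt; pd.DataFrame:\n    df = load_raw_csv(RAW_CSV)            # UTF-16 LE, tab-delimited\n    df = drop_redundant_columns(df)        # Remove 5 useless columns\n    df = rename_columns(df)                # snake_case everything\n    df = parse_dates(df)                   # &quot;July 2022&quot; → datetime\n    df = clean_strings(df)                 # Strip trailing periods\n    df = clean_meshblocks(df)              # Zero-pad to 7 digits\n    df = encode_unknowns(df)              # 99 → NaN, &quot;UNKNOWN&quot; → NaN\n    df = map_crime_types(df)               # ANZSOC Division → short enum\n    return df\n</code></pre>\n<p>Each function does one thing. If something breaks, you know exactly where. If\nsomeone wants to understand the pipeline, they can read it top to bottom in\nabout thirty seconds. 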
I've been bitten enough times by monolithic data scripts\nthat I'm allergic to them now.</p>\n<p>The crime type mapping turns six ANZSOC Division values into short enums:</p>\n<table>\n<thead>\n<tr>\n<th>Crime Type</th>\n<th>Count</th>\n<th>Share</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Theft</td>\n<td>761,977</td>\n<td>66.0%</td>\n</tr>\n<tr>\n<td>Burglary</td>\n<td>247,034</td>\n<td>21.4%</td>\n</tr>\n<tr>\n<td>Assault</td>\n<td>115,383</td>\n<td>10.0%</td>\n</tr>\n<tr>\n<td>Robbery</td>\n<td>14,860</td>\n<td>1.3%</td>\n</tr>\n<tr>\n<td>Sexual</td>\n<td>13,943</td>\n<td>1.2%</td>\n</tr>\n<tr>\n<td>Harm</td>\n<td>905</td>\n<td>0.1%</td>\n</tr>\n</tbody>\n</table>\n<p>That 66% theft number is going to haunt us when we get to model training. Any\nloss function you throw at this data will overwhelmingly optimise for predicting\ntheft, because that's two-thirds of everything. The class imbalance is real and\nit matters.</p>\n<h2>503MB to 6.3MB</h2>\n<p>The cleaned output goes to\n<a href=\"https://www.datacamp.com/tutorial/apache-parquet\">Apache Parquet</a> with snappy\ncompression. The result?</p>\n<ul>\n<li><strong>Input</strong>: 503MB CSV (UTF-16, 20 columns)</li>\n<li><strong>Output</strong>: 6.3MB Parquet (21 columns including derived fields)</li>\n<li><strong>Compression</strong>: ~80x</li>\n</ul>\n<p>That's not a typo. Parquet's columnar storage is dramatically more efficient\nthan row-oriented CSV, especially when you've got columns full of repeated\nvalues like crime types and territorial authorities. The file loads in under a\nsecond compared to 3+ seconds for the CSV. When you're iterating on analysis and\nloading this data hundreds of times, that adds up fast.</p>\n<p>The 21 output columns include the original 16 we kept plus five derived ones: a\nproper datetime, year, month, day-of-week as an integer, and the short crime\ntype enum.</p>\n<h2>Sanity checks</h2>\n<p>Before calling the data clean, we verify everything that matters:</p>\n<ul>\n<li>Row count: 1,154,102 (all rows preserved, nothing dropped)</li>\n<li>No nulls in key columns: crime_type, date, area_unit, territorial_authority,\nmeshblock</li>\n<li>Date range: Feb 2022 to Jan 2026 (all 48 months present)</li>\n<li>Auckland: 412,669 records, 36% of total (exactly where it should be)</li>\n<li>Theft: 761,977 records, 66% (as expected)</li>\n<li>No trailing periods anywhere in area names</li>\n<li>All meshblock IDs are 7 digits</li>\n<li>Max hour value is 23 (no more 99s leaking through)</li>\n</ul>\n<p>You want these checks automated and running every time you regenerate the data.\nFuture you will thank past you when something upstream changes and a check\ncatches it.</p>
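<p>Wired into the end of the pipeline, the checks are just assertions. A sketch; the column names are illustrative and the expected values are the ones listed above:</p>\n<pre><code class=\"language-python\">def validate(df: pd.DataFrame) -&gt; None:\n    # Cheap invariants that catch upstream changes early\n    assert len(df) == 1_154_102, f&quot;unexpected row count: {len(df)}&quot;\n    key = [&quot;crime_type&quot;, &quot;date&quot;, &quot;area_unit&quot;, &quot;territorial_authority&quot;, &quot;meshblock&quot;]\n    assert df[key].notna().all().all(), &quot;nulls in key columns&quot;\n    assert df[&quot;date&quot;].dt.to_period(&quot;M&quot;).nunique() == 48, &quot;missing months&quot;\n    assert not df[&quot;area_unit&quot;].str.endswith(&quot;.&quot;).any(), &quot;trailing periods survived&quot;\n    assert (df[&quot;meshblock&quot;].str.len() == 7).all(), &quot;unpadded meshblock IDs&quot;\n    assert df[&quot;hour&quot;].max() &lt;= 23, &quot;unknown-hour 99s leaking through&quot;\n</code></pre>\n<h2>What's next</h2>\n<p>We've got clean, compressed crime data, but the records only have meshblock IDs\nand area unit names. No coordinates. No shapes on a map. In the next post, we'll\ndownload Stats NZ geographic boundary files and join them to our crime records,\ngiving every victimisation a place in physical space.</p>\n","date_published":"Thu, 26 Mar 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/crime-as-video/","url":"https://jonnonz.com/posts/crime-as-video/","title":"Crime as Video","content_html":"<p>This is where the project gets properly fun.</p>\n<p>We've got 1.15 million clean crime records. Every one of them has coordinates:\neither precise meshblock centroids or area unit fallbacks from Part 2. But a bag\nof lat/lon points isn't what a neural network wants. 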
ConvLSTM and ST-ResNet are\nfundamentally image-processing architectures. They expect regular 2D grids, rows\nand columns, like pixels in a photograph.</p>\n<p>So our job now is to convert the messy reality of crime locations into clean,\nregular &quot;crime images&quot; that a convolutional network can actually consume. And\nonce you see it framed that way, crime prediction becomes video prediction. Each\nmonth is a frame. Each grid cell is a pixel. The brightness is the crime count.</p>\n<h2>Choosing 500m</h2>\n<p>This is the single most consequential decision in the entire data pipeline. Get\nthe grid resolution wrong and everything downstream suffers.</p>\n<p>Too fine (say 100m cells) and the vast majority of cells are empty in any given\nmonth. The model sees an ocean of zeros with occasional spikes, which is\nincredibly hard to learn from. Too coarse (say 2km) and you've blurred away the\nspatial patterns you're trying to detect. &quot;Auckland CBD&quot; and &quot;Ponsonby&quot; become\nthe same cell, which is useless.</p>\n<p>We computed Auckland's urban crime extent from the meshblock centroids (5th to\n95th percentile to exclude outliers like Great Barrier Island):</p>\n<table>\n<thead>\n<tr>\n<th>Metric</th>\n<th>Value</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Urban extent</td>\n<td>27.7 km × 36.9 km</td>\n</tr>\n<tr>\n<td>Grid resolution</td>\n<td>500m × 500m</td>\n</tr>\n<tr>\n<td>Grid dimensions</td>\n<td>77 rows × 59 columns</td>\n</tr>\n<tr>\n<td>Total cells</td>\n<td>4,543</td>\n</tr>\n</tbody>\n</table>\n<p>At 500m, each cell covers roughly a few city blocks. That's fine enough to\ndistinguish a commercial strip from a residential street, but coarse enough that\nmost cells accumulate at least some crime over the 48-month period. It's a sweet\nspot, and it's consistent with what\n<a href=\"https://arxiv.org/abs/2502.07465\">recent crime forecasting research</a> uses for\nsimilar models in US cities.</p>\n<h2>Simple maths, no spatial joins</h2>\n<p>Working in NZTM2000 (the coordinate system we set up in Part 2, where units are\nmetres) makes the next bit easy. Assigning a crime to a grid cell is just floor\ndivision:</p>\n<pre><code class=\"language-python\">grid_j = floor((x - xmin) / 500)  # column index\ngrid_i = floor((y - ymin) / 500)  # row index\n</code></pre>\n<p>No spatial joins, no polygon intersection, no geopandas overhead. Just\narithmetic. It processes all 400k Auckland records in under a second.</p>
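<p>Vectorised over the whole dataframe it's barely more code. A sketch, with illustrative column names; the bounds come from the percentile extent above:</p>\n<pre><code class=\"language-python\">import numpy as np\n\nCELL = 500  # metres; NZTM2000 units are metres, so no conversion\n\n# Urban bounding box from the 5th/95th percentile extent\nxmin, xmax = np.percentile(df[&quot;x&quot;], [5, 95])\nymin, ymax = np.percentile(df[&quot;y&quot;], [5, 95])\n\n# Floor division over every record at once\ndf[&quot;grid_j&quot;] = ((df[&quot;x&quot;] - xmin) // CELL).astype(int)\ndf[&quot;grid_i&quot;] = ((df[&quot;y&quot;] - ymin) // CELL).astype(int)\n\n# Keep only records that land inside the 77 × 59 grid\ninside = df[&quot;grid_i&quot;].between(0, 76) &amp; df[&quot;grid_j&quot;].between(0, 58)\ngrid_df = df[inside]\n</code></pre>\n<p>For the ~22% of Auckland records that didn't get meshblock coordinates in Part\n2, we fall back to area unit centroids converted to NZTM2000. Those records land\nat the centre of their suburb rather than their exact location. Less precise,\nbut dropping them entirely would be worse.</p>\n<p>The result: 354,387 of 412,669 Auckland records (85.9%) fall within the grid.\nThe remaining 14% are in Auckland's outer fringes (Great Barrier Island, rural\nRodney, the edges of the Waitakere Ranges) beyond our urban bounding box. That's\nfine. 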
We're modelling urban crime patterns, not rural ones.</p>\n<h2>The 4D tensor</h2>\n<p>With every crime assigned to a cell, we aggregate by grid position, month, and\ncrime type:</p>\n<pre><code>(grid_i, grid_j, month, crime_type) → sum(victimisations)\n</code></pre>\n<p>This gives us a 4D tensor:</p>\n<table>\n<thead>\n<tr>\n<th>Dimension</th>\n<th>Size</th>\n<th>Meaning</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>T (time)</td>\n<td>48</td>\n<td>Months: Feb 2022 – Jan 2026</td>\n</tr>\n<tr>\n<td>H (height)</td>\n<td>77</td>\n<td>Grid rows (south → north)</td>\n</tr>\n<tr>\n<td>W (width)</td>\n<td>59</td>\n<td>Grid columns (west → east)</td>\n</tr>\n<tr>\n<td>C (channels)</td>\n<td>6</td>\n<td>Crime types: theft, burglary, assault, robbery, sexual, harm</td>\n</tr>\n</tbody>\n</table>\n<p>Think of it as a 48-frame video with 6 colour channels. A regular video has 3\nchannels: red, green, blue. Ours has 6: theft, burglary, assault, robbery,\nsexual offences, harm. Each pixel's brightness in a given channel tells you how\nmany of that crime type happened in that 500m cell during that month.</p>\n<p>I genuinely love this framing. It takes a complicated spatial-temporal\nprediction problem and maps it onto something that decades of computer vision\nresearch already knows how to handle.</p>
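<p>Building it is a few lines of numpy. A sketch with illustrative column names; <code>np.add.at</code> does unbuffered scatter-adds, so repeated (month, cell, type) combinations accumulate instead of overwriting:</p>\n<pre><code class=\"language-python\">import numpy as np\n\nT, H, W, C = 48, 77, 59, 6  # months, grid rows, grid cols, crime types\n\ntensor = np.zeros((T, H, W, C), dtype=np.float32)\nnp.add.at(\n    tensor,\n    (\n        grid_df[&quot;month_idx&quot;].to_numpy(),  # 0..47\n        grid_df[&quot;grid_i&quot;].to_numpy(),\n        grid_df[&quot;grid_j&quot;].to_numpy(),\n        grid_df[&quot;crime_idx&quot;].to_numpy(),  # 0..5\n    ),\n    grid_df[&quot;victimisations&quot;].to_numpy(),\n)\n</code></pre>\n<h2>91.7% zeros</h2>\n<p>The tensor is overwhelmingly empty. 91.7% of all cells are zero.</p>\n<p>This makes complete sense if you think about it. Most 500m squares in Auckland\ndon't have a single reported crime in any given month. Crime clusters:\ncommercial corridors, transport hubs, specific residential pockets. The non-zero\n8.3% is where all the signal lives.</p>\n<p>The sparsity does create a training challenge though. If the model just\npredicted zero everywhere, it'd be right 91.7% of the time. Useless, but\ntechnically accurate. That's why we'll use <code>log1p</code> normalisation during\ntraining. It compresses the range from [0, 50+] to [0, ~4], giving the model a\nmore balanced gradient to learn from. And it's why the loss function needs to\ncare more about the non-zero cells than the empty ones.</p>\n<p>The upside of all those zeros is storage. The\n<a href=\"https://numpy.org/doc/stable/reference/generated/numpy.savez_compressed.html\">compressed numpy format</a>\nhandles sparse data beautifully. The full 4D tensor saves to just 0.2 MB.\nCompare that to the 21.9 MB Parquet from Part 2.</p>\n<h2>Train, validate, test</h2>\n<p>We split the 48 months temporally. No shuffling, no random sampling:</p>\n<table>\n<thead>\n<tr>\n<th>Set</th>\n<th>Months</th>\n<th>Range</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Train</td>\n<td>36</td>\n<td>Feb 2022 – Jan 2025</td>\n</tr>\n<tr>\n<td>Validation</td>\n<td>6</td>\n<td>Feb 2025 – Jul 2025</td>\n</tr>\n<tr>\n<td>Test</td>\n<td>6</td>\n<td>Aug 2025 – Jan 2026</td>\n</tr>\n</tbody>\n</table>\n<p>The model trains on three years, tunes on six months, and gets evaluated on the\nmost recent six months it's never seen. There's no spatial leakage either. We\ndon't hold out specific grid cells. The model has to predict all locations for\nfuture months simultaneously.</p>\n<p>This is the only honest way to evaluate a time-series model. 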
If you randomly\nshuffle months into train and test, the model can memorise seasonal patterns and\nlook brilliant without actually learning anything useful about temporal\ndynamics.</p>
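<p>In code, the split is nothing more than slicing the time axis, which is exactly the point:</p>\n<pre><code class=\"language-python\"># Contiguous slices along T, no shuffling anywhere\ntrain = tensor[:36]    # Feb 2022 to Jan 2025\nval   = tensor[36:42]  # Feb 2025 to Jul 2025\ntest  = tensor[42:]    # Aug 2025 to Jan 2026\n</code></pre>\n<h2>What the tensor reveals</h2>\n<p>Even at this aggregate level, clear patterns jump out.</p>\n<p>February tends to be the quietest month (~7–8k victimisations across Auckland),\nwhile October through January (spring and early summer) consistently peaks at\n8.5–9.5k. 2023 was the peak year across the board, with a gradual decline\nthrough 2024 and into 2025.</p>\n<p>Theft accounts for 72% of the tensor values (283k victimisations), burglary 17%\n(68k), and assault 9% (34k). That theft dominance from Part 1, the 66% figure,\ngets even more pronounced when you focus on Auckland, because theft clusters\nharder in urban areas than other crime types do.</p>\n<h2>What's next</h2>\n<p>The tensor is built. The model input is ready. But before throwing deep learning\nat anything, we need to properly understand what patterns actually exist in this\ndata: when does crime peak, where does it cluster, and how do different crime\ntypes behave differently. Next post: exploratory data analysis.</p>\n","date_published":"Thu, 26 Mar 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/giving-crime-a-place-on-the-map/","url":"https://jonnonz.com/posts/giving-crime-a-place-on-the-map/","title":"Giving Crime a Place on the Map","content_html":"<p>A crime record that says &quot;Woodglen, meshblock 0284305&quot; is useless for spatial\nmodelling. It's a name and a number. You can't plot it, you can't measure\ndistances from it, and you definitely can't feed it to a neural network that\nthinks in grid cells.</p>\n<p>To do anything spatial, every record needs actual coordinates: latitude,\nlongitude, or ideally metres on a proper projection. That means downloading\nStats NZ's geographic boundary files and joining them to our crime data.</p>\n<h2>NZ's geographic hierarchy</h2>\n<p>New Zealand has a neat nested system of geographic units maintained by\n<a href=\"https://datafinder.stats.govt.nz/layer/92197-meshblock-2018-generalised/\">Stats NZ</a>:</p>\n<pre><code class=\"language-mermaid\">graph TD\n    A[&quot;Region (16)&quot;] --&gt; B[&quot;Territorial Authority (67)&quot;]\n    B --&gt; C[&quot;Area Unit / SA2 (~2,000)&quot;]\n    C --&gt; D[&quot;Meshblock (~53,000)&quot;]\n</code></pre>\n<p>Regions are the big ones: Auckland, Canterbury, Wellington. Territorial\nauthorities are your cities and districts. Area units are roughly suburb-sized.\nAnd meshblocks are the smallest unit, about 100 people each, roughly a city\nblock. Our crime data uses area units and meshblocks, so those are the layers we\nneed.</p>\n<p>There's a gotcha here. Stats NZ replaced &quot;Area Units&quot; with &quot;Statistical Area 2&quot;\n(SA2) in 2018 as part of a geographic classification overhaul. But the NZ Police\ncrime data still uses the old area unit names. So we need the <strong>2017 vintage</strong>\nboundary files, not the current ones. Use the wrong vintage and your join\nsilently fails on hundreds of area units. 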
Ask me how I know.</p>\n<h2>Three boundary files</h2>\n<p>We downloaded three layers from\n<a href=\"https://datafinder.stats.govt.nz/\">Stats NZ DataFinder</a> via their WFS API:</p>\n<table>\n<thead>\n<tr>\n<th>Layer</th>\n<th>Features</th>\n<th>Size</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Area Unit 2017 (generalised)</td>\n<td>2,004</td>\n<td>88 MB</td>\n</tr>\n<tr>\n<td>Meshblock 2018 (generalised)</td>\n<td>53,589</td>\n<td>213 MB</td>\n</tr>\n<tr>\n<td>Territorial Authority 2023</td>\n<td>68</td>\n<td>34 MB</td>\n</tr>\n</tbody>\n</table>\n<p>All three come in <a href=\"https://epsg.io/2193\">EPSG:2193</a>, which is\n<a href=\"https://www.linz.govt.nz/guidance/geodetic-system/coordinate-systems-used-new-zealand/projections/new-zealand-transverse-mercator-2000-nztm2000\">NZTM2000</a>,\nNew Zealand's official projected coordinate system. The units are metres, not\ndegrees. This matters a lot later when we need to build a &quot;500m grid&quot;. You want\nthat to be 500 actual metres, not some approximation based on latitude.</p>\n<p>We use generalised (simplified) versions rather than high-definition. The\nfull-resolution meshblock layer is over a gigabyte. For centroid calculations\nand spatial joins, the generalised versions are more than accurate enough.</p>\n<h2>The area unit join: 99.4%</h2>\n<p>Joining crime records to area unit boundaries by name was almost perfect.\n1,146,721 of 1,154,102 records matched, 99.4%.</p>\n<p>Only two area unit codes failed:</p>\n<ul>\n<li><code>999999</code>: the official &quot;unspecified&quot; catch-all (7,331 records)</li>\n<li><code>-29</code>: a straight-up data entry error (50 records)</li>\n</ul>\n<p>That's a genuinely excellent result. The unmatched records aren't a bug in our\npipeline. They're unlocatable crimes that the police couldn't assign to a\nspecific area. Nothing we can do about those, and nothing we should try to do.</p>
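<p>The join itself is a plain merge on the cleaned names. A sketch; the layer filename and boundary column names are illustrative, not the real ones:</p>\n<pre><code class=\"language-python\">import geopandas as gpd\n\n# Illustrative file and column names\nau = gpd.read_file(&quot;area_unit_2017.gpkg&quot;).to_crs(2193)\nau[&quot;au_centroid&quot;] = au.geometry.centroid\n\ncrimes = crimes.merge(\n    au[[&quot;AU2017_NAME&quot;, &quot;au_centroid&quot;]],\n    left_on=&quot;area_unit&quot;,  # trailing periods already stripped in Part 1\n    right_on=&quot;AU2017_NAME&quot;,\n    how=&quot;left&quot;,\n)\nmatch_rate = crimes[&quot;au_centroid&quot;].notna().mean()  # ~0.994\n</code></pre>\n<h2>The meshblock join: 81.2%</h2>\n<p>The meshblock join came in lower at 81.2%, with 937,604 records matched out of\n1,154,102.</p>\n<p>This is expected and it's fine. Here's why: NZ meshblock boundaries get revised\nwith every census. We're using 2018 boundaries, but our crime data runs through\nJanuary 2026. Any crime from 2023 onwards might reference a 2023-vintage\nmeshblock code that simply doesn't exist in the 2018 file. Some meshblocks get\nsplit, some get merged, some get renumbered entirely.</p>\n<p>81.2% still gives us fine-grained coordinates for the vast majority of records.\nFor the ~19% that miss, we fall back to the area unit centroid. It's less\nprecise (suburb-level instead of block-level) but better than dropping the\nrecords entirely.</p>\n<h2>Two coordinate systems</h2>\n<p>This is one of those things that seems like a minor detail but will bite you\nhard if you get it wrong. We use two coordinate reference systems throughout the\nproject:</p>\n<p><strong>NZTM2000 (EPSG:2193)</strong> for all spatial analysis. The units are metres, which\nmakes grid construction trivial: a 500m cell is literally 500 units on each\naxis. Distance calculations are straightforward. No need to worry about the fact\nthat a degree of longitude means different things at different latitudes.</p>\n<p><strong>WGS84 (EPSG:4326)</strong> for the frontend dashboard only. deck.gl and MapLibre\nexpect coordinates in degrees (latitude/longitude), which is the standard for\nweb mapping.</p>\n<p>The rule is simple: do everything in NZTM2000, convert to WGS84 at the very end\nwhen exporting for the dashboard. 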
Mixing coordinate systems mid-pipeline is a\nrecipe for bugs that are incredibly annoying to track down.</p>
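<p>The conversion itself is one call at export time. A sketch, assuming point geometries and illustrative column names:</p>\n<pre><code class=\"language-python\"># Everything internal stays in NZTM2000; convert once, at export time\ndashboard = gdf.to_crs(4326)  # WGS84 lat/lon for deck.gl / MapLibre\ndashboard[&quot;lon&quot;] = dashboard.geometry.x\ndashboard[&quot;lat&quot;] = dashboard.geometry.y\n</code></pre>\n<h2>The output</h2>\n<p>Each crime record now has up to 8 new geographic columns: area unit centroids,\nmeshblock centroids, and areas in both coordinate systems. The enriched dataset\nsaves as <code>crimes_with_geo.parquet</code> at 21.9 MB with 29 columns.</p>\n<p>Quick sanity check: Auckland's mean crime centroid lands at lat -36.90, lon\n174.78. Right in the middle of the urban area. If that number had come back as\nsomewhere in the Waikato, we'd know something went wrong.</p>\n<h2>What's next</h2>\n<p>Every crime record now has a place in physical space. But individual points\naren't what the neural network needs. It needs a regular grid. In the next post,\nwe'll overlay a 500m × 500m grid on Auckland, count crimes per cell per month,\nand build the 4D tensor that turns crime prediction into a video prediction\nproblem.</p>\n","date_published":"Thu, 26 Mar 2026 00:00:00 GMT"},{"id":"https://jonnonz.com/posts/predicting-crime-in-aotearoa/","url":"https://jonnonz.com/posts/predicting-crime-in-aotearoa/","title":"Predicting Crime in Aotearoa","content_html":"<p>NZ Police publish every recorded victimisation in the country, over a million\nrecords, and most people have no idea.</p>\n<p>I stumbled across\n<a href=\"https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz\">policedata.nz</a>\na while back and was surprised by how much is there. Every reported theft,\nassault, burglary, robbery, broken down by location, time of day, day of week,\nand month. All the way down to meshblock level, which is roughly a city block.\nUpdated monthly.\n<a href=\"https://www.police.govt.nz/about-us/publications-statistics/data-and-statistics/policedatanz\">Creative Commons licensed</a>.\nYou can just... use it.</p>\n<p>So naturally I started wondering: what happens if you point deep learning at\nthis?</p>\n<h2>A million rows of crime</h2>\n<p>The dataset I pulled covers February 2022 through January 2026. Four years,\n1,154,102 records across the whole country. The breakdown is roughly what you'd\nexpect: theft dominates at 66%, followed by burglary at 21% and assault at 10%.\nThe remaining sliver covers robbery, sexual offences, and harm/endangerment.</p>\n<p>What makes it interesting for modelling is the spatial granularity. Each record\nmaps to one of 42,778 meshblocks (tiny geographic units defined by Stats NZ).\nThat's detailed enough to see patterns at a neighbourhood level, not just\n&quot;Auckland has more crime than Tauranga&quot; (which, yeah, obviously).</p>\n<p>Auckland alone accounts for about 36% of all recorded crime. Then there's a long\ntail: Wellington, Christchurch, Hamilton, and then it drops off fast. NZ's urban\ngeography is weird like that. One mega-city and a bunch of mid-size towns.</p>\n<h2>The idea</h2>\n<p>The core question is pretty simple. Given the crime patterns of the last few\nmonths, can we predict what the next month looks like?</p>\n<p>This isn't Minority Report. Nobody's getting arrested for crimes they haven't\ncommitted. It's pattern recognition on publicly available statistics, the same\nkind of modelling people do with weather data or traffic flows.</p>\n<p>The neat trick is how you frame it. If you overlay a grid on a city (say 500m by\n500m cells) and count crimes per cell per month, you get something that looks a\nlot like a video. Each month is a frame. Each cell is a pixel. 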
The brightness is\nthe crime count.</p>\n<p>Predicting next month's crime becomes a video prediction problem. And there are\nsome really cool deep learning architectures built exactly for that.</p>\n<h2>ConvLSTM and ST-ResNet</h2>\n<p>The two models I'm building are ConvLSTM and ST-ResNet. Don't worry if those\nsound like gibberish. The short version: they're neural networks designed to\nlearn patterns that are both spatial (where things cluster) and temporal (how\nthose clusters change over time).</p>\n<p><strong>ConvLSTM</strong> is the primary model. A standard LSTM network is great at learning\nsequences. It's the architecture behind a lot of language and time-series\nmodels. ConvLSTM swaps out the matrix multiplications for convolutions, which\nmeans it can process grid-structured data. Feed it the last six months of crime\ngrids and it learns both the shape of hotspots and how they evolve.\n<a href=\"https://arxiv.org/abs/2502.07465\">Recent research</a> has shown these work well\nfor crime forecasting across multiple US cities.</p>
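<p>To make that concrete, here's roughly what such a model looks like in Keras. A sketch only; the real architecture, and even the framework, is Part 6's problem:</p>\n<pre><code class=\"language-python\"># Minimal ConvLSTM sketch, not the trained model from this series\nfrom tensorflow import keras\nfrom tensorflow.keras import layers\n\nH, W = 80, 60  # roughly the Auckland grid; exact dims come in Part 3\n\nmodel = keras.Sequential([\n    layers.Input(shape=(6, H, W, 6)),  # 6-month lookback, 6 crime channels\n    layers.ConvLSTM2D(32, (3, 3), padding=&quot;same&quot;, return_sequences=True),\n    layers.ConvLSTM2D(32, (3, 3), padding=&quot;same&quot;, return_sequences=False),\n    layers.Conv2D(6, (1, 1), activation=&quot;relu&quot;),  # next month, one map per type\n])\nmodel.compile(optimizer=&quot;adam&quot;, loss=&quot;mse&quot;)\n</code></pre>\n<p><strong>ST-ResNet</strong> takes a different angle. Instead of one sequential view, it\ncaptures three temporal perspectives: what happened recently, what happened at\nthe same time last year, and what's the long-term trend. Each gets its own\nbranch of residual convolutional networks, and a learned fusion layer combines\nthem. The\n<a href=\"https://ojs.aaai.org/index.php/AAAI/article/view/10735\">original paper</a> was for\ncrowd flow prediction in Beijing, but the architecture\n<a href=\"https://www.nature.com/articles/s41598-025-24559-7\">translates well to crime data</a>.</p>\n<h2>Why NZ?</h2>\n<p>Almost all published crime prediction research uses US data. Chicago, Los\nAngeles, New York. A\n<a href=\"https://pmc.ncbi.nlm.nih.gov/articles/PMC7319308/\">systematic review of spatial crime forecasting</a>\nmakes this pretty clear. The models are well-studied, but they're trained on\nAmerican cities with American urban patterns.</p>\n<p>New Zealand doesn't look like that. Our cities are smaller, more spread out, and\nthe distribution is completely different. Auckland dominates in a way that no\nsingle US city does relative to the rest of the country. The spatial patterns\nhere are their own thing, and I couldn't find anyone who'd applied these deep\nlearning approaches to NZ data.</p>\n<p>That's what got me keen. Not because I think I'll beat the published benchmarks.\nThose researchers have GPUs and PhD students; I have a Ryzen 5 desktop with no\ngraphics card. But applying known techniques to new geography is useful work,\nand nobody else seems to have done it.</p>\n<h2>No GPU, no problem (mostly)</h2>\n<p>All of this runs on my desktop, an AMD Ryzen 5 5600GT with 12 threads and 30GB\nof RAM. No GPU at all. That sounds limiting, but the Auckland 500m grid works\nout to about 60 by 80 cells. The ConvLSTM model ends up around 5 million\nparameters, which trains in under an hour on CPU. You don't always need a beefy\nrig.</p>\n<p>It does mean being smart about model sizing and not going crazy with\nhyperparameter searches. But for a hobby project, it's more than enough.</p>\n<h2>What's coming</h2>\n<p>This is the first post in a ten-part series covering the whole project end to\nend.</p>\n<p><strong>Part 1: Data Acquisition and Exploration.</strong> We start with a 503MB CSV file\nfrom NZ Police that's UTF-16 encoded (because of course it is), has trailing\nperiods on area names, and 32% of records with unknown hour-of-day. 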
We'll\nwrangle it into a clean, typed Parquet file and get our first look at what's\nactually in there.</p>\n<p><strong>Part 2: Geographic Data Pipeline.</strong> Crime records come with meshblock IDs, but\nno coordinates. We'll join them to Stats NZ geographic boundary files using\ngeopandas, giving every record a place on the map.</p>\n<p><strong>Part 3: Spatiotemporal Grid Construction.</strong> This is where it gets fun. We\noverlay a 500m by 500m grid on Auckland, count crimes per cell per month, and\nbuild the 4D tensors that feed the neural networks. Crime prediction becomes\nvideo prediction.</p>\n<p><strong>Part 4: Exploratory Data Analysis.</strong> Before throwing deep learning at\nanything, we need to understand what patterns actually exist. When does crime\npeak? Where does it cluster? How do different crime types behave differently?</p>\n<p><strong>Part 5: Baseline Models.</strong> Simple benchmarks (historical averages, naive\npersistence) so we know whether the deep learning is actually adding value or\njust being fancy for the sake of it.</p>\n<p><strong>Part 6: ConvLSTM Architecture.</strong> Building and training the primary model.\nThree ConvLSTM layers, six-month lookback window, learning spatial hotspots and\ntemporal dynamics simultaneously.</p>\n<p><strong>Part 7: ST-ResNet Architecture.</strong> The three-branch alternative that captures\ncloseness, periodicity, and long-term trend separately, then fuses them with\nlearned weights.</p>\n<p><strong>Part 8: Model Evaluation and Comparison.</strong> Which model wins? By how much? And\nmore importantly, where do they fail?</p>\n<p><strong>Part 9: Building the Dashboard.</strong> A 3D interactive map built with deck.gl\nwhere you can watch crime patterns evolve over time. Dark theme, extruded\ncolumns, time-lapse playback.</p>\n<p><strong>Part 10: Deployment and Reflections.</strong> Shipping to Vercel, what worked, what\ndidn't, and what I'd do differently next time.</p>\n<p>Every post will include code and real results. The whole codebase will be open\nsource. And I'll be upfront about the stuff that didn't work. Trust me, there's\nplenty of it.</p>\n<p>This is a hobby project. It's not a policing tool, it's not a product, and it's\ndefinitely not claiming to solve crime. It's just me being curious about what's\nsitting in a publicly available dataset and seeing how far you can push it with\nsome Python and a bit of patience.</p>\n","date_published":"Tue, 24 Mar 2026 00:00:00 GMT"}]}