Raising AI Kids: Issue 15

Small Models on 16GB: How Qwen3.6 and Gemma4 Run on a $600 Machine


David had just sat down with his coffee when Sam burst in holding his phone.

"Dad, look — I got it to summarize a whole article in like two seconds," Sam said, showing him something on screen. "It's local. No internet."

"Wait, on your phone?"

"No no, on the Mac Mini in the closet. I set up Ollama last night. It's running Qwen3.6. Took like twenty minutes to install."

David stared. "The little $600 computer?"

"The very same." Sam grinned. "17 billion parameters. Zero cloud cost." He paused. "And remember that 'digital guest room' thing you were talking about last week?" He gestured at the closet. "That's it. That's our DMZ."

David put down his coffee. "Show me."

And that was how Sam taught his dad about the most important AI trend most people haven't noticed yet — and the answer to a question David hadn't finished asking: How do you run your Family AI Operating System without sending your data to anyone?

Not in some research lab. On the shelf where the old printer used to live. For six hundred dollars.


What Are These Models, Really?

Two big releases changed the local AI landscape in early 2026: Qwen3.6 (from Alibaba's Qwen team, April 2026) and Google Gemma 4. Both are open-weight models — meaning anyone can download and run them. No API key, no per-token billing, no data leaving your house.

Let's be precise about what they are and what they aren't.

Qwen3.6 comes in several sizes. The one making headlines is the Qwen3.6-35B-A3B — a Mixture-of-Experts model with 35 billion total parameters but only about 3 billion "active" parameters working on any given token. That's the key: MoE models activate only a fraction of their brain for each token, making them dramatically more efficient than their parameter count suggests. It also ships with 262,144 token native context — extendable to over 1 million with YaRN — plus multimodal input support and an Apache 2.0 open-weight license.

On a Mac Mini M4 with 16GB of unified memory — the base $599 model — you can run this model quantized to 4-bit precision (Q4_K_M quantization). It fits. It works. You get about 17 tokens per second through Ollama, or up to 60 tokens per second on a machine with more RAM. That's not fast enough to feel like ChatGPT, but it's fast enough to be useful for real tasks.

Gemma 4 followed a few weeks later. Google's latest Gemma family includes a 26B MoE model (3.8B active params) and a 31B dense model, both under Apache 2.0 license. Early benchmarks on Apple Silicon show the 26B hitting 85 tokens per second on an M4 Pro MacBook Pro. One blogger swapped his Qwen setup for Gemma 4 and saw message classification drop from 8.5 seconds to 1.9 seconds. That's not a marginal improvement — that's a category change.

Key insight: MoE (Mixture-of-Experts) models are the reason a $600 Mac Mini can run a "35 billion parameter" model. Only a slice of the model activates per token, so memory requirements drop dramatically. This is the technical trick that makes local AI practical today.

The Architecture Trick: Why MoE Doesn't Slow You Down

"Wait — if the model has 35 billion parameters, how does it fit in 16GB of RAM?"

Good question. The trick is in how MoE is built. Think of it like this: when a token comes in, two things happen. First, the attention layers — the part that decides what to look at — touch everything. This stays on your GPU. Second, a tiny router decides: "Which 2 of my 32 experts should handle this token?" Then it loads just those 2 experts from system RAM, uses them, and moves on.

The other 30 experts? They sit in RAM doing nothing. They don't get loaded, they don't get used, they don't cost you anything.

This is why a model with 35 billion total parameters can run on a $600 Mac Mini. You're not loading all 35B into VRAM — you're only ever actively using the 3B or so that the router selects. The rest is in system RAM, and for most tokens, your machine barely notices it's there.

The numbers bear this out. Running Qwen3.6-35B-A3B with 18 MoE layers offloaded to CPU RAM uses about 14.6GB of VRAM on a modern GPU — with a 262,144 token context window. That's not an edge case; it's the design. Drop that to 18 layers instead of 61, and your GPU has room to breathe.

The catch: if you push throughput high — running multiple tokens in parallel so the router has to load different experts simultaneously — you'll feel the CPU-to-GPU transfer bottleneck. For a single conversation, it's seamless. For a busy classification pipeline running dozens of parallel requests, you want more GPU headroom or fewer parallel streams.


How to Run Them: Ollama and LM Studio

You don't need a computer science degree. Two tools have made local AI genuinely accessible:

Ollama (ollama.com) is the simplest path. One download, one command line to pull a model, one command to run it. For Qwen3.6-35B-A3B: ollama pull qwen3.6:35b. That's it. Then ollama run qwen3.6:35b and you're chatting with a local AI. It runs as a background service and exposes an OpenAI-compatible API, so lots of tools can use it.

LM Studio is the GUI option — a desktop app with a model browser, chat interface, and local server. Good if you prefer clicking to typing. Both are free. Both work on Mac, Windows, and Linux.

The catch: these tools don't come with models. You download them separately, and some models are 20GB+ in size. Your hard drive fills up fast if you're not careful. A 2TB external SSD costs about $150 and solves this problem permanently.

What about the little models? Qwen3.6 also comes in 2B, 4B, and 9B sizes that run on machines with far less RAM. Gemma 4 has 2B and 4B versions too. These fit easily on a laptop with 8GB. They can't do everything a 35B can, but for classification, short summaries, and fast triage work, they're surprisingly capable — and blazing fast (50-80 tokens/sec on Apple Silicon).

Quantization: What the Numbers Mean

When people say a model is "Q4_K_M" or "Q4_K_XL," they're talking about quantization — how the model's weights are stored in memory.

Full precision (32-bit float) gives maximum quality but needs enormous RAM. Quantization shrinks numbers from 32-bit to 4-bit — roughly 8x compression — with minimal noticeable quality loss for most tasks. Q4_K_M is the most common sweet spot: good quality, reasonable file size, wide tool support.

One developer tested UD-IQ2_XXS quantization (very aggressive, very small) on Qwen3.6-35B on a MacBook with 18GB RAM and said it performed "much better than expected." For simple classification tasks, aggressive quantization barely matters. For nuanced reasoning, you notice the drop. Know your use case.

Apple's MLX framework handles quantization differently and optimizes for Apple Silicon in ways that give real speedups. Benchmarks show MLX hitting 90-108 tokens/sec via HTTP on an M4 Max vs. 43 tok/sec for Ollama on the same hardware. The difference: MLX talks to Apple's GPUs natively; Ollama wraps llama.cpp with a Go layer that adds overhead.

Real numbers to know: On an M4 Max (128GB RAM), MLX Python API hits ~130 tok/s native and ~90-108 tok/s via HTTP. llama.cpp hits ~71 tok/s. Ollama hits ~43 tok/s. The differences matter less for classification (fast, short outputs) than for long-generation tasks. Ollama's simplicity is worth the speed trade for many use cases.

What These Models Can and Can't Do

Be honest about this with your kids. Local models at this size are genuinely capable for specific tasks. They're not Claude Opus. They're not GPT-4.2. They're something more like "very competent assistant for defined tasks."

Where local models shine:

Classification and triage: Is this message a question, a greeting, or an FYI? Should this email go in the urgent pile or the read-later pile? These are short, structured tasks that local models do well and fast — often under 2 seconds. One developer reduced message classification from 30 seconds to under 1 second just by adding "think": false to the API call. The model doesn't need to reason through simple classification.

Context compression: If someone sends a 500-word message, a local model can condense it to 30 words before your main AI sees it. Same understanding, fewer tokens — and that means lower cloud costs. One power user compresses his entire day's worth of automation signals before his nightshift Claude planning call, saving roughly 15x in tokens.

Signal processing and memory consolidation: Local models can cluster, merge, and defragment data. They're good at taking a pile of daily notes and turning them into structured summaries.

Privacy-sensitive work: No data leaves your house. This matters when you're processing personal messages, health information, or anything you'd rather not send to a cloud API.

Where they struggle:

Complex reasoning on novel problems. For anything requiring deep multi-step reasoning, you want a frontier model. A 35B model can follow instructions and produce good output, but it doesn't "think" the way a larger model does. Asking a local model to plan a complex project, debug subtle code bugs, or reason through an ambiguous situation often ends in plausible-sounding but wrong output.

Long context windows under load. Some models advertise 256K context, but on 16GB of RAM with other programs running, you're doing well to keep 4-8K stable without slow prefills. TurboQuant (a Google technique now in llama.cpp and MLX) helps significantly — 4x less active memory for equivalent accuracy — but it's still a constraint.


The Math That Makes This Worth It

Let's talk money, because this is where it gets interesting.

Claude Opus 4.6 costs $15 per million input tokens and $18 per million output tokens via API. If you're running an AI agent that processes hundreds of messages a day, that adds up fast. Claude Max subscription ($100/month for 5x usage) helps, but even that has limits.

One power user reported reducing Claude API calls by 30-40% by routing routine work through local models. Classification, compression, triage, and fallback tasks — all handled locally, zero per-token cost. His $600 Mac Mini paid for itself in two months of saved API calls.

For a family with kids who want to experiment with AI: the model runs locally, the cost is electricity (about $0.10/day if you run it constantly), and nothing goes to the cloud. A teenager can build a classification pipeline, try prompts, experiment with different models — all without burning through an API budget.

For context on cloud pricing: GPT-4.1 nano runs about $0.10 per million tokens (input) — cheap by AI standards. But if your kid runs 10,000 classification queries a day at 50 tokens each, that's $0.50/day just in API costs. Over a month, that's $15 in calls. The Mac Mini costs $0.10/day in electricity and handles unlimited queries.

Privacy and Safety: The Hidden Feature

Here's the part that doesn't show up in benchmarks: when AI runs locally, your data stays local.

Think about what your family sends to cloud AI APIs: messages, documents, notes, questions. For most uses, that's fine. But for younger kids experimenting, for sensitive family topics, for anything you'd rather not have in someone else's training data — local is different. The data goes to your Mac Mini's RAM, processes, and disappears when you turn it off. No third party, no training on your inputs, no accidental data exposure.

This is also a great sandbox for learning. A kid can try prompts, see what the model does well, see where it fails — without every experiment going to a cloud service. It's lower stakes, more private, and teaches the same lessons about AI capabilities and limitations.

Safety still applies: Local models can produce harmful outputs just like cloud models. They're not magically safe. If you're setting this up for kids, apply the same safety guidelines you'd use for any AI tool. The privacy advantage is real, but it's not a safety override.

What to Actually Buy (and What to Skip)

If you want to do this today and you're starting from scratch, here's what matters:

The Mac Mini M4 with 16GB is the entry point. At $599, it's the most cost-effective local AI machine available. It fits Qwen3.6-35B-A3B at Q4 quantization. It's not fast (17 tok/s), but it works. The 24GB version ($799) gives you headroom for higher-quality quantization or larger models. The M4 Pro versions are faster but cost significantly more.

Get an external SSD. Models are large. A 2TB NVMe SSD costs about $150 and means you never have to manage storage anxiety. Put your models on it and forget about space.

Skip the 8GB machines. You can run smaller models (2B, 4B, 9B) on 8GB, but you'll hit constant memory limits. The 16GB baseline is where "real" models start working.

On Windows, any modern gaming PC with an NVIDIA RTX 3060 or better handles these models well via llama.cpp. The GPU acceleration makes a real difference. But a dedicated Mac Mini setup is simpler and uses far less power — something to consider if the machine runs 24/7.


The Three-Tier System Smart People Are Using

Once you start using local models seriously, you end up building a routing system. One developer described it as three tiers:

Fast tier (2-4B models): Runs on every incoming message, classifies type and urgency in under 2 seconds. If it's a greeting or an FYI, the full AI never sees it. Simple gatekeeping that saves enormous amounts of downstream processing.

Primary tier (4-9B models): Handles summarization, context compression, and mid-complexity tasks. Compresses a 500-word message into 30 words for the main AI. Runs the morning summary. The sweet spot for most daily work.

Heavy tier (35B models): Complex compression, fallback when cloud AI is rate-limited or down at 3am, and anything requiring actual multi-step reasoning. Handles the nightshift summary that gets sent to the cloud AI for planning.

You don't need all three to get started. But even starting with just the fast tier — one small model running on every message — changes how you think about AI architecture. It stops being one big tool and becomes a pipeline of specialized helpers.


Do Now

This week, spend twenty minutes installing Ollama on a spare computer and running one small model. You don't need a Mac Mini — an old laptop with 16GB RAM works for the smaller models.

  1. Download Ollama from ollama.com (free)
  2. Open a terminal and run: ollama pull qwen3.6:4b (about 2.5GB, pulls in 5 minutes)
  3. Run it: ollama run qwen3.6:4b
  4. Ask it something simple: "Classify this message: 'Hey, want to grab lunch tomorrow?'" Watch what happens.

That's it. You've now got a local AI running. Show your kid. Let them try a few prompts. Notice how fast it responds, how nothing leaves your house, how there's no per-token meter running. That's the starting point. Everything else builds from here.

Sam went back to his Mac Mini setup and started piping different types of requests through different model sizes. He showed David how the 4B model handled quick classification, how the 35B handled the complex summaries, and how everything downstream got faster because the expensive AI only saw what was actually worth its time.

"This is basically a bouncer for our AI pipeline," Sam said.

"That's exactly what it is," David said. And he meant it as a compliment.