Saturday, 18 April 2026 · Vol 19

Gemma 4 just replaced my whole local LLM stack

Local LLMs have mostly been a novelty for me. I’ve used them ever since they became convenient, but still, mostly for the novelty. I’d run one, then sit there thinking, hey, this is happening on my computer! It’s cool. But for most practical use cases, that was about it.

Because when I’m serious about something, I still reach for a major chatbot like Claude, Gemini, or ChatGPT. You don’t even need to ask why. For a model to run at a reasonable speed on my computer or yours, it has to be small. Small models are nowhere near as capable or accurate as the big cloud ones. So if you actually want a local LLM to be practical, you usually need a ridiculous rig: a big GPU, a capable CPU, and lots and lots of RAM. Most of us don’t have that. After all, the AI infrastructure companies are busy buying all the RAM so we can’t. It’s an uphill battle.


The story is not quite that bleak, though. Recently, Google dropped the Gemma 4 line of models. This is different from Gemini. It’s not the model powering the chatbot, but the two do share some DNA. The biggest surprise with Gemma 4 is that it’s fully open-source. An Apache license! That alone makes it stand out.

That’s not all, either. Gemma 4 is built on an interesting mixture-of-experts (MoE) architecture. In simple terms, only a fraction of the model’s parameters activate for each token, which lets Gemma deliver something closer to the precision of a 26B model while running at the speed of a 4B one.
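To make that concrete, here’s a toy sketch of the general top-k MoE idea: a gate scores every expert, but only the top few actually run for a given token. This is purely illustrative and not Gemma 4’s real architecture; the expert functions and gate weights are made up.

```python
# Toy top-k mixture-of-experts layer: every expert is scored, but only k run.
# Illustrative sketch of the general MoE idea, NOT Gemma 4's actual internals.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_layer(token, experts, gate_weights, k=2):
    """Route one token vector through the top-k scoring experts."""
    # The gate scores every expert with a dot product against the token
    scores = softmax([sum(w * x for w, x in zip(ws, token)) for ws in gate_weights])
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    # Only the chosen experts execute, so per-token compute scales with k,
    # not with the total expert count (total params >> active params)
    total = sum(scores[i] for i in top)
    out = [0.0] * len(token)
    for i in top:
        y = experts[i](token)
        for j, yj in enumerate(y):
            out[j] += (scores[i] / total) * yj
    return out

# Four trivial "experts" that just scale the token differently
experts = [lambda v, s=s: [s * x for x in v] for s in (0.5, 1.0, 2.0, 4.0)]
gates = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.5, 0.3]]

print(moe_layer([1.0, 2.0], experts, gates, k=2))
```

The key line is the `top = sorted(...)[:k]` selection: the model stores all the experts (that’s the 26B-class quality), but each token only pays for `k` of them (that’s the 4B-class speed).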

Gemma 4 also comes in smaller variants like E4B and E2B. These are meant for much smaller hardware, maybe even something like a Raspberry Pi. But the important thing is this: they’re not really comparable to the models we’ve had before at this size. That’s what makes them interesting.

Putting Gemma 4 to the test

Start to finish

I wanted to explain how Google pulled this off and why Gemma 4 feels so different, but honestly, it’s easier to just show it in action.

I already had Continue.dev set up in VS Code for local coding, OpenClaw set up with LM Studio, and Aider in case I want something closer to Claude Code running locally. All of these can be pointed at Gemma 4 through LM Studio, and setup is straightforward thanks to LM Studio’s OpenAI-compatible API endpoint.
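Because LM Studio exposes an OpenAI-style endpoint, any tool (or ten lines of Python) can talk to the model. Here’s a minimal stdlib-only sketch; the port is LM Studio’s default, and the model id is an assumption, so copy the exact name from LM Studio’s own UI.

```python
# Minimal sketch of calling LM Studio's local OpenAI-compatible endpoint
# using only the standard library.
import json
import urllib.request

LMSTUDIO_URL = "http://127.0.0.1:1234/v1/chat/completions"  # LM Studio's default port
MODEL = "google/gemma-4-e4b"  # hypothetical id; use the name shown in LM Studio

def build_payload(prompt, model=MODEL):
    """OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def ask(prompt):
    """POST the prompt to the local server and return the model's reply text."""
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# ask("Respond to this paragraph...")  # needs LM Studio serving a model locally
```

This is the same request shape Continue.dev, Aider, and OpenClaw send under the hood, which is why pointing them all at one LM Studio server works.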

For this test, I’m using the Gemma 4 E4B model. For context, my setup has a 12GB RX 6700XT and 64GB of DDR4 RAM. I know my GPU can handle more than a tiny model, but I’m sticking with this one because I want to fully offload it to the GPU. I’ve tried the bigger brother, and it just wasn’t as fast.

As a basic test, let’s start with a writing prompt. This is one of the main perks of local LLMs too: you can write into them privately, without any of it being used for training. And with abliterated local models, you can also get around some of the restrictive censoring and refusal behavior you see in major chatbots.

So I asked Gemma to respond to a quote.

Respond to this paragraph. Argue against it, but use the exact same tone. Do not address it directly.

The response was good! I won’t get into the argument itself here. Gemma 4 E4B produced it in 0.26 seconds, though it did think for 5 seconds beforehand. That’s on my desktop with the RX 6700 XT.

Gemma 4 responding to a prompt on a MacBook in LM Studio. Credit: Amir Bohlooli / MUO

Running on my M2 MacBook with 16GB of RAM, it answered in 1.21s and it was decent.

That should give you an idea of why this is exciting to me. One of the biggest prospects of a capable local LLM, at least for me, is using it with my journal. That’s why I’ve been so relentlessly trying to bring an LLM into Obsidian. I write a lot in my journal, and sometimes it’s nice to get another perspective on what I’ve written. I can’t ask a human because it’s deeply private. But a local LLM is private in a way cloud tools simply aren’t. For that exact reason, I can’t share that experiment here, so let’s do something else with it instead.

Text isn’t all Gemma can do

Put the eyes to work

Gemma 4 E2B in LM Studio's model picker. Credit: Amir Bohlooli / MUO

This time, I’m loading up the even leaner Gemma 4 E2B model. This one is only 4GB. You could run it on a phone.

The Gemma line checks all the right boxes for a local model stack: thinking, tools, and vision. Let’s use that last part.

I wanted to use Gemma’s vision capabilities to rename some photos and replace their filenames with natural descriptive text. I could’ve done this through VS Code, or with OpenClaw, or with Aider. There are a lot of ways to use the model for a coding task like this. But this time I used the LM Studio chat panel with a prompt like this:

I need a Python script that loops through all images in the current folder (jpg, jpeg, png, gif, webp, bmp), sends each one as a base64-encoded image to a local OpenAI-compatible API at http://127.0.0.1:1234/v1 using model google/gemma-4-e2b, and asks the model to describe the image in a specific natural phrase under 100 characters with no hyphens. The script should rename each file to that description, keeping the original file extension lowercase, skipping if a file with that name already exists, and printing progress as it goes.

Gemma responded in 0.54 seconds. Not even a full second. I had my code, and it even told me which pip package I needed to install to make it work. I installed it, ran the .py file, and it worked.
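A script along those lines might look like this. To be clear, this is my own sketch of the prompt’s requirements, not the exact code Gemma generated; it uses only the standard library, and the endpoint URL and model id are the ones from the prompt.

```python
# Sketch of the photo-renaming task: describe each image via the local
# vision model, then rename the file to that description.
import base64
import json
import re
import urllib.request
from pathlib import Path

API_URL = "http://127.0.0.1:1234/v1/chat/completions"
MODEL = "google/gemma-4-e2b"
EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"}

def describe(path):
    """Send one base64-encoded image to the local endpoint, return its description."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    body = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in one natural phrase "
                         "under 100 characters. No hyphens."},
                # image/jpeg is a simplification; derive the MIME type
                # from the suffix if you need to be strict
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
    req = urllib.request.Request(API_URL, data=json.dumps(body).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"].strip()

def safe_name(description):
    """Strip punctuation (including hyphens) and cap at 100 characters."""
    name = re.sub(r"[^\w\s]", "", description).strip()
    return name[:100] or "image"

def rename_all(folder="."):
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in EXTS:
            continue
        new_path = path.with_name(safe_name(describe(path)) + path.suffix.lower())
        if new_path.exists():
            print(f"skip (exists): {new_path.name}")
            continue
        print(f"{path.name} -> {new_path.name}")
        path.rename(new_path)

# rename_all()  # run with LM Studio serving the model locally
```

The `safe_name` cleanup step matters more than it looks: model output is free text, and anything that goes into a filename needs punctuation stripped and a length cap before `rename` touches the disk.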

Results of the local LLM renaming photos. Credit: Amir Bohlooli / MUO

Lo and behold, my photos were renamed.

It didn’t like the .heic files from my iPhone, and it got the battery one wrong, but for everything else, it nailed it. It saved me the hassle of coming up with descriptions for a pile of photos. More importantly, it was fast. I didn’t have to upload my photos to someone else’s machine. That means, first, the photos stay on my machine, and second, there’s no upload bandwidth wasted sending them to the internet. That matters even more when you’re dealing with lots of large photos.

The tiniest Gemma 4 model is very competent for its size, but you do need to respect the limits. The biggest weakness in a local LLM stack is context size. I simply can’t match what Gemini or ChatGPT offer there. So if I want something ambitious, like a full solar system simulator in one response, the local model is going to struggle. But if I want it to debug code, it can absolutely do that. In one of those same tests, instead of using Claude to debug ChatGPT’s code, I used Gemma. And it found the bug on the first try. Not bad at all.

Local LLMs are practical now

I can see it now. I’ve already added a larger Gemma model as a failover in my OpenClaw setup. I’m definitely going to keep using that script to batch rename photos. I’m going to use a stronger and faster local model with my journals. Local LLMs are becoming more and more usable. I’m already thinking about building a local meeting transcriber and summarizer with it.

My hardware is overkill for a tiny model like Gemma 4 E2B. But that’s part of the point. You do not need absurd hardware to get a smooth experience out of this thing. Most modern computers can run it just fine.

A local LLM is no longer just a toy you boot up to marvel at the fact that it works offline. It’s no longer just a novelty you show off because it feels futuristic. For the first time in a while, it feels like something I can actually fold into my daily setup and keep there.
