Google Gemma 4: The Open Model Family That Changed Our Entire Stack
When we reviewed Gemma 2 9B a few weeks ago, we called it “the best small model for European business AI.” We meant it. It ran our scheduling tasks, fit on modest hardware, and handled instructions with surprising reliability. Then Google released Gemma 4 on April 2, 2026, and made Gemma 2 look like a rough draft.
This is the model family we have been waiting for. Not because it is perfect — it is not — but because it finally closes the gap between what small open models can do and what businesses actually need them to do. We replaced Gemma 2 in production within 48 hours. Here is what happened, what impressed us, and where the limits are.

Four Variants, Four Use Cases
Gemma 4 is not a single model. It is a family of four, each designed for a different tier of hardware and workload. This is Google doing what Google does best: scaling a single architecture across wildly different resource budgets.
| Variant | Total Params | Effective / Active | Context | Audio | Ollama Size | Arena AI Rank |
|---|---|---|---|---|---|---|
| E2B | 5.1B (w/ embeddings) | 2.3B effective | 128K | Yes | ~3.5 GB | — |
| E4B | 8B (w/ embeddings) | 4.5B effective | 128K | Yes | 9.6 GB | — |
| 26B MoE | 25.2B total | 3.8B active (8/128 experts) | 256K | No | 18 GB | #6 open |
| 31B Dense | 30.7B | 30.7B (all dense) | 256K | No | 20 GB | #3 open |
The E2B and E4B variants are multimodal — they accept text, images, and audio as input and produce text output. The 26B and 31B handle text and images only. All four support 140+ languages, configurable thinking modes, native function calling, structured JSON output, and system instructions.
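As a concrete sketch of the structured-output support, here is how a schema-constrained request could look through the `ollama` Python client. The model tag comes from this article; the exact `format` behavior (passing a JSON schema) is an assumption based on Ollama's structured-output feature, and `run()` is only a placeholder that requires a local server.

```python
import json

# Model tag from the article; assumes a local Ollama server and the
# ollama Python client (`pip install ollama`).
MODEL = "gemma4:e4b"

def build_request(prompt: str, schema: dict) -> dict:
    """Assemble a chat request asking for schema-constrained JSON output."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "format": schema,  # Ollama structured outputs accept a JSON schema
    }

schema = {
    "type": "object",
    "properties": {
        "language": {"type": "string"},
        "sentiment": {"type": "string"},
    },
    "required": ["language", "sentiment"],
}

req = build_request("Classify: 'El servicio fue excelente.'", schema)

def run(request: dict) -> dict:
    """Network call; only works with a running Ollama server."""
    import ollama
    resp = ollama.chat(**request)
    return json.loads(resp["message"]["content"])
```

Because the output is schema-constrained, downstream agents can consume it directly instead of parsing freeform text.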
```mermaid
xychart-beta
    title "Gemma 4 Variants — Parameters vs Memory"
    x-axis ["E2B (2.3B)", "E4B (4.5B)", "26B MoE", "31B Dense"]
    y-axis "Memory Required (GB)" 0 --> 25
    bar [3.5, 9.6, 18, 20]
```
That 31B Dense model sitting at #3 on the Arena AI leaderboard among all open models worldwide is not a typo. The 26B MoE variant holds #6. Google’s own claim that Gemma 4 “outcompetes models 20x its size” sounds like marketing until you see the benchmark numbers.
How to Run Each Variant with Ollama
Getting started takes one command per variant:
```bash
# E2B — ultralight, our scheduling workhorse
ollama pull gemma4:e2b

# E4B — medium-duty, content and briefings
ollama pull gemma4:e4b

# 26B MoE — heavy reasoning with sparse activation
ollama pull gemma4:26b

# 31B Dense — maximum quality, needs beefy hardware
ollama pull gemma4:31b
```
The E2B pulls in under 4 GB. The E4B sits at 9.6 GB — tight on a 16 GB machine but comfortable on 32 GB. The 26B and 31B variants need 18-20 GB of VRAM or unified memory, which puts them squarely in Mac Studio or dedicated GPU territory. For hardware guidance, see our edge AI hardware guide.
What We Actually Run in Production
At VORLUX AI, we run Gemma 4 E2B and E4B on a Mac Mini M4 as part of our local AI infrastructure. Here is how they fit:
Gemma 4 E2B is our primary scheduling model. It handles 58 orchestrator jobs — task routing, status updates, lightweight classification, and JSON-structured outputs for downstream agents. At 2.3B effective parameters, it is absurdly fast. Response times average under 800ms for typical scheduling prompts. It replaced Gemma 2 9B for these tasks and uses roughly half the memory.
Gemma 4 E4B is our medium-duty model for briefings, content drafting, and multi-step analysis. When a task needs more reasoning than E2B can provide but does not justify pulling in a 26B+ model, E4B handles it. The 128K context window means we can feed it full documents without chunking.
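Before dropping chunking logic, it is worth a sanity check that a document actually fits in the 128K window. This sketch uses the common (and rough) heuristic of about four characters per token for English text; the reserve figure is our own choice to leave room for the system prompt and the response.

```python
def fits_in_context(
    text: str,
    context_tokens: int = 128_000,
    chars_per_token: float = 4.0,
) -> bool:
    """Rough fit check: ~4 characters per token is a common English
    heuristic; reserve space for the system prompt and the reply."""
    reserve = 2_000
    return len(text) / chars_per_token <= context_tokens - reserve
```

Anything that fails the check still goes through our old chunking path.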
The Mac Mini M4 runs both simultaneously with room to spare. That would have been unthinkable a year ago.
What Gemma 4 Does Exceptionally Well
Function calling and structured output. Native support, not bolted on. We feed Gemma 4 a tool schema and it returns valid JSON function calls consistently. This matters enormously for agent orchestration — no more regex parsing of freeform text.
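A minimal sketch of the pattern, assuming Ollama's tool-calling request/response shape: we pass a tool schema and read structured calls back instead of regexing text. The `schedule_task` tool is hypothetical, invented here for illustration.

```python
# Hypothetical orchestrator tool, for illustration only.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "schedule_task",
        "description": "Queue a job in the orchestrator.",
        "parameters": {
            "type": "object",
            "properties": {
                "job_id": {"type": "string"},
                "priority": {"type": "integer"},
            },
            "required": ["job_id"],
        },
    },
}]

def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Pull (name, arguments) pairs out of an Ollama-style chat response."""
    calls = response.get("message", {}).get("tool_calls") or []
    return [(c["function"]["name"], c["function"]["arguments"]) for c in calls]
```

The arguments arrive as a parsed dict, so the orchestrator dispatches on the tool name directly.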
Instruction following. The configurable thinking modes let us toggle between fast responses (thinking off) and deliberate reasoning (thinking on) per request. For scheduling, we keep thinking off. For content analysis, we turn it on.
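The per-request toggle can be expressed as a tiny kwargs builder. We assume here that Gemma 4's thinking mode is wired through Ollama's `think` flag; check your client version before relying on it.

```python
def chat_args(prompt: str, deliberate: bool) -> dict:
    """Build chat kwargs; `think` toggles deliberate reasoning per request
    (assumes Gemma 4 exposes its thinking mode via Ollama's think flag)."""
    return {
        "model": "gemma4:e4b",
        "messages": [{"role": "user", "content": prompt}],
        "think": deliberate,
    }
```

Scheduling calls use `deliberate=False`; content analysis flips it on.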
Multilingual performance. With 140+ languages, our Spanish and English workflows run on the same model without fine-tuning. For a consultancy based in Valencia serving Spanish SMEs, this is not a nice-to-have — it is essential.
Audio input on E2B/E4B. We have not deployed this in production yet, but the ability to process audio natively opens doors for meeting transcription, voice-driven workflows, and accessibility features without a separate speech-to-text pipeline.
Where It Falls Short — Honest Limits
We promised honest reviews and we meant it.
Deep reasoning and complex coding. For multi-step mathematical proofs or competitive programming challenges, Gemma 4 31B is strong but still trails Llama 3.3 70B and Qwen 2.5 72B Coder. If your primary workload is code generation, Qwen remains the better specialized choice. Gemma 4 is a generalist that happens to code well — it is not a coding specialist.
The 26B MoE trade-off. The Mixture-of-Experts architecture is brilliant for efficiency — only 3.8B of the 25.2B parameters activate per token. But MoE models can be unpredictable on tasks that fall between expert boundaries. We have seen occasional inconsistency on hybrid tasks that the 31B Dense handles cleanly.
No text generation from images or audio. Gemma 4 can understand images and audio as input, but it only generates text. If you need image generation or audio synthesis, you still need separate models.
VRAM pressure on larger variants. The 31B Dense at 20 GB is tight on a 32 GB Mac. Running it alongside other models requires careful memory management. Check our Q2 2026 model comparison for side-by-side VRAM budgets.
Gemma 4 vs Gemma 2: Is It Worth Upgrading?
Unequivocally yes. The E2B alone makes Gemma 2 9B redundant for most scheduling and classification tasks — it is faster, smaller, and more capable. The 128K context window (up from 8K on Gemma 2) eliminates the chunking workarounds we used to need. Function calling support means we deleted hundreds of lines of output-parsing code. And the multilingual quality jumped from “functional” to “genuinely good.”
If you are currently running Gemma 2, the migration path is straightforward: pull the new model, test your prompts, and switch. We did it in a weekend.
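The "test your prompts" step can be as simple as a prompt-regression harness that replays your existing prompts against both models and flags degraded answers. This is a sketch of the idea; `judge` is a deliberately crude placeholder check, not a real evaluation method.

```python
# Replay the same prompt suite against old and new models and compare.
PROMPTS = [
    "Summarise: meeting moved to 15:00 Friday.",
    "Classify intent: 'cancel my order'",
]

def judge(old: str, new: str) -> bool:
    """Placeholder acceptance check: the new answer must be non-empty
    and not drastically shorter than the old one."""
    return bool(new.strip()) and len(new) >= 0.5 * len(old)

def regression_report(
    old_outputs: list[str], new_outputs: list[str]
) -> dict[str, bool]:
    """Map each prompt to pass/fail for the old -> new migration."""
    return {p: judge(o, n) for p, o, n in zip(PROMPTS, old_outputs, new_outputs)}
```

In practice you would swap `judge` for whatever acceptance criteria your workflows already use.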
Who Should Use Which Variant?
- E2B: Edge devices, scheduling, classification, IoT, mobile. Anything where speed and size matter more than depth.
- E4B: SME workstations, content generation, briefings, customer support. The sweet spot for most business use cases.
- 26B MoE: Research, analysis, long-document processing. Great when you need 256K context but want to keep memory reasonable.
- 31B Dense: Maximum quality on demanding tasks. Translation, complex analysis, multi-turn reasoning. Worth it if you have the hardware.
Related reading
- Google Gemma 3: The First Multimodal Open Model That Fits on a Mac Mini
- Automate Code Reviews with AI: n8n + Ollama Workflow Tutorial
- NVIDIA Releases AITune: An Open-Source Inference Toolkit That Automatically Finds the Fastest Inference Backend for Any PyTorch Model — Summary
Getting Started
Gemma 4 is available now on Ollama and Google’s official blog has the full technical details. All variants use the Gemma license, which permits commercial use.
If you want help deploying Gemma 4 on your own hardware — whether that is a single Mac Mini or a fleet of edge devices — that is exactly what we do. We build local AI systems for European SMEs that keep data on-premises and costs predictable. See our services or get in touch for a free consultation.
The era of needing massive cloud budgets for capable AI is ending. Gemma 4 is proof.