Google unveils Gemma 4 12B, an open multimodal model that runs without the cloud

Google has introduced Gemma 4 12B, an open multimodal AI model with 12 billion parameters designed to run locally on consumer hardware.

According to the company's blog, the model is built to operate directly on a user's device, requiring only a laptop with 16 GB of RAM or unified memory and no connection to the cloud. Gemma 4 12B sits in the middle of the Gemma 4 lineup, positioned between the lighter E4B version and the more powerful 26B model, which is built on a Mixture of Experts architecture.
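A rough back-of-the-envelope check shows why the 16 GB figure is plausible. The sketch below estimates the memory needed for the weights alone at a few common precisions. The bit widths are assumptions, since the article does not say which numeric format Google has in mind, and the estimate ignores activations, the attention cache, and the operating system's own memory use.

```python
# Rough weight-only memory estimate for a 12-billion-parameter model.
# Assumption: the bit widths below are illustrative; the article does not
# state which precision the 16 GB figure refers to.

PARAMS = 12_000_000_000  # 12 billion parameters, as stated in the article


def weight_memory_gb(params: int, bits_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights would need about 24 GB and would not fit in 16 GB, while 8-bit (about 12 GB) or 4-bit (about 6 GB) weights would. That suggests some form of reduced-precision weights is involved, though the article does not confirm it.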

The model's key technical distinction is its departure from the traditional approach to multimodal processing. Most multimodal models rely on separate encoder modules that convert images and audio into a representation the language model can consume. Gemma 4 12B instead feeds images and audio directly into the language model. A compact embedding module handles image input, while raw audio signals are projected into the same representational space as text. The result is lower latency, reduced memory requirements, and a simpler pipeline for working with different types of data.
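The idea of placing every modality in the same representational space as text can be illustrated with a toy example. The sketch below is not Gemma 4 12B's architecture: the dimensions, the random weights, and helper names such as `build_sequence` are all hypothetical. It only shows the general pattern of linearly projecting image patches and raw audio frames to the same width as text embeddings, so that one model can read them as a single sequence.

```python
# Toy illustration of feeding several modalities into one token sequence.
# Assumption: all sizes, weights, and helper names here are hypothetical and
# chosen for readability; this is not Gemma 4 12B's actual architecture.
import random

D_MODEL = 8        # hypothetical shared embedding width
PATCH_PIXELS = 4   # hypothetical number of pixel values per image patch
FRAME_SAMPLES = 6  # hypothetical number of raw audio samples per frame

rng = random.Random(0)


def random_matrix(rows: int, cols: int) -> list[list[float]]:
    """Stand-in for learned projection weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def project(vector: list[float], weights: list[list[float]]) -> list[float]:
    """Linear projection: vector (len = rows) times weights (rows x cols)."""
    return [
        sum(v * weights[i][j] for i, v in enumerate(vector))
        for j in range(len(weights[0]))
    ]


def chunk(values: list[float], size: int) -> list[list[float]]:
    """Split a flat list into consecutive chunks, dropping any remainder."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]


# Hypothetical parameters: a text embedding table plus one projection each
# for image patches and raw audio frames.
text_table = random_matrix(100, D_MODEL)  # vocabulary of 100 tokens
image_proj = random_matrix(PATCH_PIXELS, D_MODEL)
audio_proj = random_matrix(FRAME_SAMPLES, D_MODEL)


def build_sequence(token_ids, pixels, waveform):
    """Return one list of D_MODEL-wide vectors covering all three inputs."""
    text = [text_table[t] for t in token_ids]
    image = [project(p, image_proj) for p in chunk(pixels, PATCH_PIXELS)]
    audio = [project(f, audio_proj) for f in chunk(waveform, FRAME_SAMPLES)]
    return image + audio + text


if __name__ == "__main__":
    seq = build_sequence(
        token_ids=[5, 17, 42],
        pixels=[0.1] * 16,    # 16 pixel values -> 4 patches
        waveform=[0.0] * 18,  # 18 samples -> 3 frames
    )
    print(len(seq), "positions, each of width", len(seq[0]))
```

In this toy setup, a separate encoder network is replaced by a single matrix multiplication per modality, which is one way to picture where the latency and memory savings described above could come from.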

Despite its smaller size, Gemma 4 12B performs close to the level of the much larger 26B model. It scored 77.2% on the MMLU Pro benchmark and 78.8% on GPQA Diamond. It is also the first mid-sized model in the Gemma family with native audio support: it can recognize speech and distinguish between speakers, and it can analyze video as well. In one of Google's demonstrations, the model parsed a five-minute clip from a Google I/O presentation.

Gemma 4 12B also supports a 256,000-token context window, a reasoning mode that works through a problem step by step before answering, and calls to external functions, all of which are important for building AI agents.
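For readers unfamiliar with function calling, the application side of the pattern can be sketched in a few lines. The JSON shape and the `get_weather` tool below are hypothetical; the article does not describe the format Gemma 4 12B actually emits, so this is only a generic illustration of how an agent hands a model's request to ordinary code.

```python
# Minimal function-calling loop on the application side.
# Assumption: the JSON shape {"name": ..., "arguments": {...}} and the
# get_weather tool are hypothetical; the article does not describe the
# format Gemma 4 12B actually emits.
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"Sunny in {city}"


# Registry of functions the model is allowed to call.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Parse a tool call emitted by the model and run the matching function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


if __name__ == "__main__":
    # Stand-in for text the model might produce when it decides to call a tool.
    fake_model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
    print(dispatch(fake_model_output))
```

In a real agent, the string returned by `dispatch` would be passed back to the model as part of the conversation so it can compose a final answer.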

The model is distributed under the open Apache 2.0 license and is available on major platforms including Hugging Face, Kaggle, Ollama, LM Studio, and Google AI Edge. According to Google, the Gemma 4 family has collectively been downloaded more than 150 million times. The practical significance of the release lies in the ability to run advanced AI locally, processing text, images, audio, and video without sending data to the cloud. The approach is particularly relevant for scenarios where privacy is critical, including healthcare, financial services, and the handling of internal corporate documents.