Wasm + LLMs: Running Small Models in the Browser with llama.cpp on WebAssembly
WasmHub Team
June 16, 2026 · 6 min read
Loading a 7-billion-parameter language model used to require a server rack. In 2026, it requires a browser tab and a static asset. The mechanism is WebAssembly: the same sandboxed, deterministic runtime that powers Shopify Functions and Fastly Compute also gives you a private LLM runtime that lives entirely on the user's machine. No backend, no API key, no telemetry leaving the page.
Why Wasm Is the Right Shape for In-Browser LLMs
Three things converged to make "run an LLM in the browser" plausible: small base models (Llama 3.2 1B, Qwen2.5 0.5B–3B, Gemma 3 1B, SmolLM), aggressive quantization (1.5-bit through 8-bit integer, with Q4_K_M as the practical default), and a portable runtime that could host the inference loop without an OS install. WebAssembly is that runtime, and four of its properties fit the job:
- Sandboxed and deterministic. A Wasm module can't read the filesystem, open a socket, or call the network unless the host gives it an explicit import. The model can't exfiltrate the prompt.
- No server round-trip. Tokens stream directly to the UI. No cold-start penalty, no rate limit, no per-token bill. The only network cost is the initial model download.
- Single static asset. A quantized GGUF file plus a .wasm runtime, both served from a CDN. Hosting is a bucket and a <script> tag.
- Uniform performance across platforms. The same binary runs in Chrome, Firefox, Safari, and Edge. Memory64, the 64-bit addressing extension that lets a single module hold more than 4 GB of linear memory, is on by default in Chrome 133+ (see Memory64 in 2026).
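The sandboxing point is easy to see with a hand-assembled module. The sketch below is illustrative only and is not taken from llama.cpp or wllama: the module bytes, the env.log import, and the run export are made up for this example. The module can reach the host through exactly one function, and instantiation fails if the host withholds it.

```javascript
// A tiny hand-assembled Wasm module (hypothetical, for illustration only).
// It imports one host function, env.log(i32), and exports run(), which calls log(42).
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x08, 0x02, 0x60, 0x01, 0x7f, 0x00, 0x60, 0x00, 0x00, // types: (i32)->(), ()->()
  0x02, 0x0b, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import module "env"
  0x03, 0x6c, 0x6f, 0x67, 0x00, 0x00,                         //   field "log", func, type 0
  0x03, 0x02, 0x01, 0x01,                                     // one local function, type 1
  0x07, 0x07, 0x01, 0x03, 0x72, 0x75, 0x6e, 0x00, 0x01,       // export "run" = func 1
  0x0a, 0x08, 0x01, 0x06, 0x00, 0x41, 0x2a, 0x10, 0x00, 0x0b, // body: i32.const 42; call 0
]);

// Synchronous compilation is fine for a module this small.
const wasmModule = new WebAssembly.Module(bytes);

// The host decides what the module can reach: here, a single logging function.
const seen = [];
const instance = new WebAssembly.Instance(wasmModule, {
  env: { log: (value) => seen.push(value) },
});
instance.exports.run();
console.log(seen); // [ 42 ]

// Without the import, instantiation throws: there is no ambient capability to fall back on.
let rejected = false;
try {
  new WebAssembly.Instance(wasmModule, {});
} catch (err) {
  rejected = true;
}
console.log(rejected); // true
```

A real inference runtime imports far more than one function, but the rule is the same: anything not on the import list is out of reach.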
The cost is throughput. A pure-Wasm inference loop runs in the low single-digit tokens per second for a 7B Q4 model. That's enough for chat, autocomplete, summarization, and on-device classification. It is not enough for real-time voice or long-context reasoning. The honest tradeoff is in the closing section.
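To put those rates in user-facing terms, here is a rough back-of-the-envelope calculation. It uses the token rates quoted in this post and an assumed 128-token reply, and it ignores prompt processing and model load time.

```javascript
// Seconds needed to generate `tokens` tokens at a steady decode rate (tokens/sec).
// Ignores prompt processing and model load time.
function secondsToGenerate(tokens, tokensPerSecond) {
  return tokens / tokensPerSecond;
}

const replyTokens = 128; // assumed reply length, for illustration
for (const rate of [3, 5, 30]) {
  const seconds = secondsToGenerate(replyTokens, rate).toFixed(1);
  console.log(`${rate} tok/s: ${seconds} s for ${replyTokens} tokens`);
}
```

At 3 to 5 tokens per second a short reply takes roughly 25 to 45 seconds, which is tolerable when the text streams in as it is produced but rules out anything conversational in real time.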
Two Paths in 2026
The "run an LLM in the browser via Wasm" world has two real options. Which one you pick depends on whether WebGPU is available.
Vanilla llama.cpp on Wasm. The llama.cpp project ships Wasm builds. Two community bindings make the browser path ergonomic: ngxson/wllama (a TypeScript wrapper on npm) and tangledgroup/llama-cpp-wasm. Both run the inference loop in pure Wasm — no GPU, no WebGPU, no navigator.gpu check at the door. Expect roughly 3–5 tokens/sec for a 7B Q4_K_M model on a 2023-class laptop. The same code path works in any modern browser.
WebLLM (MLC AI). WebLLM v0.2.83 (April 24 2026) uses Wasm for the model library and WebGPU for the heavy matmuls. Same model, same Q4 quantization, but the token rate jumps to 30+ tokens/sec on a discrete GPU. The catch: the user must be on a WebGPU-capable browser, and you have to handle the fallback case where navigator.gpu is undefined. (Chrome, Edge, and recent Safari Technology Preview builds qualify; Firefox is still landing it.)
For a post titled "Wasm + LLMs," the vanilla llama.cpp path is the honest anchor — it works everywhere. WebLLM is the fast lane for users who can take it.
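The fallback decision can be sketched as a small function. This is a minimal sketch, not code from WebLLM or wllama: pickBackend and its return values are hypothetical names. It relies only on the standard WebGPU entry points, navigator.gpu and requestAdapter().

```javascript
// Decide which runtime to load. `nav` is the browser's `navigator`, passed in
// so the function can also be exercised outside a browser.
async function pickBackend(nav) {
  // navigator.gpu is undefined on browsers without WebGPU.
  if (!nav || !nav.gpu) return "wasm";
  try {
    // requestAdapter() resolves to null when no suitable GPU is available.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm"; // treat any failure as "no WebGPU"
  }
}

// In a page: const backend = await pickBackend(navigator);
```

Checking for the adapter, and not just for the navigator.gpu property, matters because a browser can expose the API on a machine where no usable GPU exists.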
A Worked Example: Loading Qwen2.5-1B in the Browser
Here's the entire stack: a static HTML page, the wllama package, and a GGUF model file hosted on the same origin. Twenty lines of JavaScript is enough to load a 1B-parameter model and stream a completion.
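    import { Wllama } from "@wllama/wllama";

    // Assumes the wllama 2.x API and that both Wasm builds are hosted under /wllama/.
    const wllama = new Wllama({
      "single-thread/wllama.wasm": "/wllama/single-thread/wllama.wasm",
      "multi-thread/wllama.wasm": "/wllama/multi-thread/wllama.wasm",
    }, { allowOffline: true });

    await wllama.loadModelFromUrl(
      "/models/qwen2.5-1b-instruct-q4_k_m.gguf",
      { n_ctx: 2048, n_threads: navigator.hardwareConcurrency }
    );

    // The chat template stored in the GGUF wraps these messages in the model's chat markup.
    const messages = [{ role: "user", content: "Explain WebAssembly in one sentence." }];
    const stream = await wllama.createChatCompletion(messages, {
      nPredict: 128, sampling: { temp: 0.7 }, stream: true,
    });

    // Render the reply into the page as it streams.
    const output = document.querySelector("#output"); // assumes an element with id="output"
    for await (const chunk of stream) {
      output.textContent = chunk.currentText; // text generated so far
    }

The first await is the expensive one: the browser fetches a ~700 MB GGUF file and the Wasm runtime, compiles the runtime, and copies the model weights into linear memory. The chat-completion call returns a token-by-token async iterator, so your UI code consumes it the same way it would consume a Server-Sent Events stream from a backend. No server, no API key, no prompt leaving the page.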
The model's location matters. A CDN with HTTP/3 can be the difference between a 30-second and a 2-minute cold load on a fresh visit. Service workers help: cache the GGUF in the Cache API on first load and the second visit skips the download entirely. The OPFS backend that WebLLM's documentation describes is worth porting to the vanilla path for the same reason.
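The caching step might look like the sketch below. It uses the standard Cache API (caches.open, match, put). The function name, the cache name, and the injectable parameters are assumptions made for illustration, and the function works the same from a page or a service worker.

```javascript
// Return the model bytes, downloading them only if they are not cached yet.
// `cacheStorage` and `fetchFn` default to the browser globals; they are
// parameters so the function can be exercised with stand-ins.
async function loadModelBytes(
  url,
  cacheStorage = globalThis.caches,
  fetchFn = globalThis.fetch
) {
  const cache = await cacheStorage.open("gguf-models-v1"); // hypothetical cache name
  let response = await cache.match(url);
  if (!response) {
    response = await fetchFn(url);
    if (!response.ok) throw new Error(`Download failed: ${response.status}`);
    // A Response body can be read only once, so store a clone and read the original.
    await cache.put(url, response.clone());
  }
  return new Uint8Array(await response.arrayBuffer());
}

// In a page: const bytes = await loadModelBytes("/models/qwen2.5-1b-instruct-q4_k_m.gguf");
```

Reading the whole file into one buffer is the simplest form. For multi-gigabyte models, streaming the response body or storing the file in OPFS avoids holding a second full copy in JavaScript memory.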
The Size and Quantization Story
Quantization is what makes browser inference possible. A 7B model in FP16 is ~14 GB; in Q4_K_M it's ~4 GB; in Q2_K it drops below 3 GB at meaningful quality loss. The menu for 2026:
- 1B parameters, Q4_K_M: ~700 MB. Fits in 32-bit linear memory. Llama 3.2 1B, Qwen2.5-1.5B, Gemma 3 1B.
- 3B parameters, Q4_K_M: ~2 GB. The sweet spot for chat-quality output on a phone.
- 7B parameters, Q4_K_M: ~4 GB. Hits the 32-bit Wasm ceiling: you need Memory64 (wasm64-unknown-unknown, enabled in wasm-pack 0.15.0 and wasm-bindgen 0.2.120).
- 13B+ parameters: not practical in the browser yet.
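The sizes in this menu follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below reproduces that estimate and checks it against the 4 GiB that a 32-bit linear memory can address. The bits-per-weight figures and the working-memory budget are rough assumptions for illustration; real GGUF files vary with architecture and tensor mix.

```javascript
const WASM32_LIMIT_BYTES = 2 ** 32; // 4 GiB: the most a 32-bit linear memory can address

// Rough average bits per weight. Assumed values for illustration only.
const BITS_PER_WEIGHT = { F16: 16, Q4_K_M: 4.8, Q2_K: 3.0 };

// Approximate weight size in bytes for a parameter count and quantization.
function estimateModelBytes(params, quant) {
  return (params * BITS_PER_WEIGHT[quant]) / 8;
}

// True if weights plus working memory (KV cache, scratch buffers) fit under 4 GiB.
function fitsInWasm32(modelBytes, overheadBytes) {
  return modelBytes + overheadBytes <= WASM32_LIMIT_BYTES;
}

const overhead = 0.5e9; // assumed working-memory budget
for (const [label, params] of [["1B", 1e9], ["3B", 3e9], ["7B", 7e9]]) {
  const bytes = estimateModelBytes(params, "Q4_K_M");
  const size = (bytes / 1e9).toFixed(1) + " GB";
  console.log(label, size, fitsInWasm32(bytes, overhead) ? "fits in wasm32" : "needs Memory64");
}
```

Under these assumptions the 7B weights alone come to about 4.2 GB, just under the 4 GiB limit, and any realistic working memory pushes the total over it.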
The Memory64 in 2026 post walks through the Rust-toolchain changes; on the JavaScript side you only need to instantiate the module with the memory64 flag set on the memory descriptor.
Who Else Is Doing This
The vanilla-llama.cpp-on-Wasm path is one of several in-browser LLM stories in production:
- Hugging Face Transformers.js runs ONNX models in the browser. CPU inference defaults to Wasm (dtype: 'q8'); WebGPU is opt-in.
- ONNX Runtime Web powers Transformers.js and a long list of independent apps. Since v1.19.0, SIMD and threads are required; WebGPU and WebNN are experimental.
- WebLLM is the path most browser-based coding assistants have chosen: faster tokens, OpenAI-compatible streaming.
- WasmEdge 0.17.0 (May 18 2026) ships WASI-NN with a native llama.cpp plugin — the same Wasm module that runs in a browser can run on an edge gateway.
- Cloudflare Workers AI is the server-side complement: 50+ open-source models behind a pay-per-use API. The model packaging format overlaps; the runtime is V8 isolates.
When Not to Use This
Browser-side Wasm inference is the right answer for privacy-sensitive workloads (PII redaction, on-device summarization, personal-assistant chat), low-latency UI features (autocomplete, intent classification, prompt rewriting before a server call), and offline-capable apps (browser extensions, field tools). It's the wrong answer for anything that needs more than ~10 tokens/sec, anything processing more than a few thousand tokens of context, and anything serving more than one user per machine. For those, use a server. The Wasm runtime stays the same — you just host it where there's a GPU.
The interesting design question is no longer "can you run an LLM in the browser": you can, in about twenty lines of JavaScript. The interesting question is what to do with the privacy and zero-cost-inference properties that gives you. That answer is product-specific, and it's the part no runtime can give you.