llm webassembly llama.cpp inference browser

Wasm + LLMs: Running Small Models in the Browser with llama.cpp on WebAssembly

WasmHub Team

June 16, 2026 · 6 min read

Loading a 7-billion-parameter language model used to require a server rack. In 2026, it requires a browser tab and a static asset. The mechanism is WebAssembly: the same sandboxed, deterministic runtime that powers Shopify Functions and Fastly Compute also gives you a private LLM runtime that lives entirely on the user's machine. No backend, no API key, no telemetry leaving the page.

Why Wasm Is the Right Shape for In-Browser LLMs

Three things converged to make "run an LLM in the browser" plausible: small base models (Llama 3.2 1B, Qwen2.5 0.5B–3B, Gemma 3 1B, SmolLM), aggressive quantization (1.5-bit through 8-bit integer, with Q4_K_M as the practical default), and a portable runtime that can host the inference loop without an OS install. WebAssembly is that runtime, and it brings four properties:

Sandboxed and deterministic. A Wasm module can't read the filesystem, open a socket, or call the network unless the host gives it an explicit import. The model can't exfiltrate the prompt.
No server round-trip. Tokens stream directly to the UI. No cold-start penalty, no rate limit, no per-token bill. The only network cost is the initial model download.
Single static asset. A quantized GGUF file plus a .wasm runtime, both served from a CDN. Hosting is a bucket and a <script> tag.
Portable across browsers. The same binary runs in Chrome, Firefox, Safari, and Edge, though throughput still depends on the device. Memory64, the 64-bit addressing extension that lets a single module hold more than 4 GB of linear memory, is on by default in Chrome 133+ (see Memory64 in 2026).
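The explicit-import rule is easy to see with a hand-assembled module. The sketch below is illustrative only: the byte array encodes a tiny module (nothing to do with llama.cpp) that imports one function and exports another that calls it. The names env, log, and run are made up for this example.

```javascript
// Hand-encoded Wasm module, equivalent to this text form:
//   (module
//     (import "env" "log" (func $log (param i32)))
//     (func (export "run") (call $log (i32.const 42))))
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x08, 0x02, 0x60, 0x01, 0x7f, 0x00, 0x60, 0x00, 0x00, // types: (i32)->(), ()->()
  0x02, 0x0b, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import module "env"
  0x03, 0x6c, 0x6f, 0x67, 0x00, 0x00,                         //   field "log", func type 0
  0x03, 0x02, 0x01, 0x01,                                     // one local function, type 1
  0x07, 0x07, 0x01, 0x03, 0x72, 0x75, 0x6e, 0x00, 0x01,       // export "run" = func 1
  0x0a, 0x08, 0x01, 0x06, 0x00, 0x41, 0x2a, 0x10, 0x00, 0x0b, // body: i32.const 42; call 0
]);

const wasmModule = new WebAssembly.Module(bytes);

// The host decides what the module can reach: here, a single logging function.
const seen = [];
const instance = new WebAssembly.Instance(wasmModule, {
  env: { log: (value) => seen.push(value) },
});
instance.exports.run();

// Without the import object, instantiation fails outright.
let refused = false;
try {
  new WebAssembly.Instance(wasmModule, {});
} catch {
  refused = true;
}
```

The module has no way to name fetch, the DOM, or a file handle on its own. Anything it touches has to appear in that second argument.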

The cost is throughput. A pure-Wasm inference loop runs at low single-digit tokens per second for a 7B Q4 model. That's enough for chat, autocomplete, summarization, and on-device classification. It is not enough for real-time voice or long-context reasoning. The honest tradeoff is in the closing section.

Two Paths in 2026

The "run an LLM in the browser via Wasm" world has two real options. Which one you pick depends on whether WebGPU is available.

Vanilla llama.cpp on Wasm. The llama.cpp project ships Wasm builds. Two community bindings make the browser path ergonomic: ngxson/wllama (a TypeScript wrapper on npm) and tangledgroup/llama-cpp-wasm. Both run the inference loop in pure Wasm — no GPU, no WebGPU, no navigator.gpu check at the door. Expect roughly 3–5 tokens/sec for a 7B Q4_K_M model on a 2023-class laptop. The same code path works in any modern browser.

WebLLM (MLC AI). WebLLM v0.2.83 (April 24, 2026) uses Wasm for the model library and WebGPU for the heavy matmuls. Same model, same Q4 quantization, but the token rate jumps to 30+ tokens/sec on a discrete GPU. The catch: the user must be on a WebGPU-capable browser, and you have to handle the fallback case where navigator.gpu is undefined. (Chrome, Edge, and recent Safari Technology Preview builds qualify; Firefox is still landing it.)
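The fallback check can be sketched in a few lines. This is a minimal sketch, not code from either project: pickBackend and its return values "webgpu" and "wasm" are names invented here. It relies only on the standard navigator.gpu.requestAdapter() call, which resolves to null when no adapter is available.

```javascript
// Decide which inference path to load. `nav` is injectable so the logic
// can be exercised outside a browser; by default it is the real navigator.
async function pickBackend(nav = globalThis.navigator) {
  // navigator.gpu is undefined in browsers without WebGPU.
  if (!nav || !nav.gpu) return "wasm";
  try {
    // Even with navigator.gpu present, the adapter can be null
    // (blocklisted driver, headless environment, and so on).
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm";
  }
}
```

A page would call this once at startup and lazy-load either WebLLM or the pure-Wasm binding based on the result, so users only download the runtime they can use.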

For a post titled "Wasm + LLMs," the vanilla llama.cpp path is the honest anchor — it works everywhere. WebLLM is the fast lane for users who can take it.
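To put the two token rates in user-facing terms, here is a small back-of-the-envelope helper. The 128-token reply length is an assumption chosen for illustration; the rates are the ones quoted above.

```javascript
// Time to stream a reply, ignoring prompt processing and model load.
function secondsToGenerate(tokens, tokensPerSecond) {
  if (!(tokensPerSecond > 0)) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  return tokens / tokensPerSecond;
}

const replyTokens = 128; // assumed length of a short chat answer

const estimates = {
  wasmSlow: secondsToGenerate(replyTokens, 3),  // pure Wasm, low end
  wasmFast: secondsToGenerate(replyTokens, 5),  // pure Wasm, high end
  webgpu: secondsToGenerate(replyTokens, 30),   // WebLLM on a discrete GPU
};

console.log(estimates);
```

For a 7B model, that is roughly 26 to 43 seconds on the pure-Wasm path against a little over 4 seconds with WebGPU, which is why the fallback is worth detecting rather than assuming.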

A Worked Example: Loading Qwen2.5-1B in the Browser

Here's the entire stack: a static HTML page, the wllama package, and a GGUF model file hosted on the same origin. Under 30 lines of JavaScript are enough to load a 1B-parameter model and stream a completion.

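import { Wllama } from "@wllama/wllama";

// Map wllama's asset names to where this page serves the .wasm builds.
// Keys follow wllama 2.x; the URLs are placeholders for your own hosting.
const wllama = new Wllama(
  {
    "single-thread/wllama.wasm": "/wllama/single-thread/wllama.wasm",
    "multi-thread/wllama.wasm": "/wllama/multi-thread/wllama.wasm",
  },
  { allowOffline: true }
);

await wllama.loadModelFromUrl(
  "/models/qwen2.5-1b-instruct-q4_k_m.gguf",
  { n_ctx: 2048, n_threads: navigator.hardwareConcurrency }
);

// Pass messages rather than a hand-built ChatML string; the chat template comes from the GGUF.
const messages = [{ role: "user", content: "Explain WebAssembly in one sentence." }];

// stream: true yields chunks as they are generated (option names as in wllama 2.x).
const stream = await wllama.createChatCompletion(messages, {
  nPredict: 128, sampling: { temp: 0.7 }, stream: true,
});

const output = document.querySelector("#output"); // assumes an element with id="output"
for await (const chunk of stream) {
  output.textContent = chunk.currentText; // full text generated so far
}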

The loadModelFromUrl call is the expensive one: the browser fetches a ~700 MB GGUF file and the Wasm runtime, compiles the runtime, and copies the model into linear memory (there is no mmap inside the sandbox). The chat-completion call is a token-by-token async iterator, so your UI code consumes it the same way it would consume a Server-Sent Events stream from a backend. No server, no API key, no prompt leaving the page.
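The consuming side of that iterator does not depend on any particular library. As a minimal sketch, renderStream below accepts any async iterable of text pieces and hands the accumulated text to a callback; both renderStream and the fakeTokens generator standing in for a model are hypothetical names used only here.

```javascript
// Accumulate streamed pieces and report the running text after each one.
async function renderStream(stream, onText) {
  let text = "";
  for await (const piece of stream) {
    text += piece;
    onText(text); // in a page: element.textContent = text
  }
  return text;
}

// Stand-in for a model: yields three pieces, one per "token".
async function* fakeTokens() {
  yield "Wasm ";
  yield "is ";
  yield "portable.";
}

const updates = [];
const done = renderStream(fakeTokens(), (text) => updates.push(text));
```

Because the callback receives the full text so far rather than a delta, the UI can overwrite a single element on every update without tracking state of its own.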

The model's location matters. A CDN with HTTP/3 can be the difference between a 30-second and a 2-minute cold load on a fresh visit. Service workers help: cache the GGUF in the Cache API on first load and the second visit skips the download entirely. The OPFS backend that WebLLM's documentation describes is worth porting to the vanilla path for the same reason.
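A cache-first fetch for the model file might look like the sketch below. It uses only the standard Cache API (caches.open, match, put); the function name fetchModelCached and the cache name "gguf-models-v1" are assumptions for illustration, and storage quotas for files this large vary by browser.

```javascript
// Return the model from the Cache API if present, otherwise download and store it.
// cacheStorage and fetchFn are injectable so the logic can run outside a browser.
async function fetchModelCached(url, {
  cacheName = "gguf-models-v1",        // hypothetical cache name
  cacheStorage = globalThis.caches,
  fetchFn = globalThis.fetch,
} = {}) {
  // No Cache API (older browser, insecure origin): fall back to a plain fetch.
  if (!cacheStorage) return fetchFn(url);

  const cache = await cacheStorage.open(cacheName);
  const hit = await cache.match(url);
  if (hit) return hit;

  const response = await fetchFn(url);
  // Only store successful responses; clone because a body can be read once.
  if (response.ok) await cache.put(url, response.clone());
  return response;
}
```

The same function works from a service worker's fetch handler or directly from page code. Bumping the cache name is a simple way to invalidate old model files after an upgrade.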

The Size and Quantization Story

Quantization is what makes browser inference possible. A 7B model in FP16 is ~14 GB; in Q4_K_M it's ~4 GB; in Q2_K it drops below 3 GB at meaningful quality loss. The menu for 2026:

1B parameters, Q4_K_M: ~700 MB. Fits in 32-bit linear memory. Llama 3.2 1B, Qwen2.5-1.5B, Gemma 3 1B.
3B parameters, Q4_K_M: ~2 GB. The sweet spot for chat-quality output on a phone.
7B parameters, Q4_K_M: ~4 GB. Hits the 32-bit Wasm ceiling — you need Memory64 (wasm64-unknown-unknown, enabled in wasm-pack 0.15.0 and wasm-bindgen 0.2.120).
13B+ parameters: not practical in the browser yet.
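The sizes in this menu follow from parameters times bits per weight. The sketch below redoes that arithmetic. Two numbers in it are assumptions rather than figures from this post: roughly 4.85 effective bits per weight for Q4_K_M (K-quants mix block types, so the real figure varies by model) and a flat 0.5 GB allowance for the KV cache and runtime.

```javascript
const WASM32_LIMIT_BYTES = 4 * 1024 ** 3; // 4 GiB of 32-bit linear memory

const BITS_PER_WEIGHT = {
  FP16: 16,
  Q8_0: 8.5,     // 32 int8 weights plus one fp16 scale per block
  Q4_K_M: 4.85,  // approximate; assumption for this sketch
};

// Bytes needed for the weights alone.
function estimateWeightBytes(params, bitsPerWeight) {
  return (params * bitsPerWeight) / 8;
}

// True when weights plus a rough overhead allowance exceed 32-bit memory.
function needsMemory64(params, bitsPerWeight, overheadBytes = 0.5e9) {
  return estimateWeightBytes(params, bitsPerWeight) + overheadBytes > WASM32_LIMIT_BYTES;
}

for (const params of [1e9, 3e9, 7e9]) {
  const gb = estimateWeightBytes(params, BITS_PER_WEIGHT.Q4_K_M) / 1e9;
  console.log(`${params / 1e9}B Q4_K_M: ~${gb.toFixed(1)} GB, Memory64: ${needsMemory64(params, BITS_PER_WEIGHT.Q4_K_M)}`);
}
```

Under these assumptions the 7B weights alone land just under 4 GiB, and it is the working memory on top that pushes the total past the 32-bit ceiling.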

The Memory64 in 2026 post walks through the Rust-toolchain changes; on the JavaScript side you only need to create the module's memory with a 64-bit address type in the memory descriptor, and to check first that the browser supports it.
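Before shipping a 64-bit build, it is worth probing for support at runtime. A minimal sketch, assuming the standard binary encoding from the Memory64 proposal: the probe is a 13-byte module that declares a single i64-addressed memory, and WebAssembly.validate reports whether the engine accepts it. The function name supportsMemory64 is made up here.

```javascript
// Returns true when the engine accepts a module with a 64-bit memory.
function supportsMemory64() {
  if (typeof WebAssembly !== "object") return false;
  const probe = new Uint8Array([
    0, 97, 115, 109, 1, 0, 0, 0, // "\0asm" magic + version 1
    5, 3, 1,                     // memory section, 3 bytes, one memory
    4, 1,                        // limits flag 0x04 (i64 addresses), min 1 page
  ]);
  return WebAssembly.validate(probe);
}

console.log("Memory64 available:", supportsMemory64());
```

When the probe fails, the page can fall back to a model small enough for 32-bit memory instead of failing at instantiation time.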

Who Else Is Doing This

The vanilla-llama.cpp-on-Wasm path is one of several in-browser LLM stories in production:

Hugging Face Transformers.js runs ONNX models in the browser. CPU inference defaults to Wasm (dtype: 'q8'); WebGPU is opt-in.
ONNX Runtime Web powers Transformers.js and a long list of independent apps. Since v1.19.0, SIMD and threads are required; WebGPU and WebNN are experimental.
WebLLM is the path most browser-based coding assistants have chosen: faster tokens, OpenAI-compatible streaming.
WasmEdge 0.17.0 (May 18, 2026) ships WASI-NN with a native llama.cpp plugin, so the same Wasm module that runs in a browser can run on an edge gateway.
Cloudflare Workers AI is the server-side complement: 50+ open-source models behind a pay-per-use API. The model packaging format overlaps; the runtime is V8 isolates.

When Not to Use This

Browser-side Wasm inference is the right answer for privacy-sensitive workloads (PII redaction, on-device summarization, personal-assistant chat), low-latency UI features (autocomplete, intent classification, prompt rewriting before a server call), and offline-capable apps (browser extensions, field tools). It's the wrong answer for anything that needs more than ~10 tokens/sec, anything processing more than a few thousand tokens of context, and anything serving more than one user per machine. For those, use a server. The Wasm runtime stays the same — you just host it where there's a GPU.
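Those three limits can be written down as a checklist. The sketch below is one way to encode them: recommendRuntime and its field names are hypothetical, and the 4096-token cutoff is an assumed stand-in for "a few thousand tokens".

```javascript
// Apply the three rules of thumb and explain which ones were tripped.
function recommendRuntime({ tokensPerSecond, contextTokens, usersPerMachine }) {
  const reasons = [];
  if (tokensPerSecond > 10) reasons.push("needs more than ~10 tokens/sec");
  if (contextTokens > 4096) reasons.push("context beyond a few thousand tokens"); // assumed cutoff
  if (usersPerMachine > 1) reasons.push("more than one user per machine");
  return { target: reasons.length > 0 ? "server" : "browser", reasons };
}

// Autocomplete in a note-taking app: slow is fine, context is short.
console.log(recommendRuntime({ tokensPerSecond: 5, contextTokens: 1024, usersPerMachine: 1 }));

// Real-time voice with long history: every limit is exceeded.
console.log(recommendRuntime({ tokensPerSecond: 40, contextTokens: 16000, usersPerMachine: 1 }));
```

Returning the reasons alongside the verdict keeps the decision reviewable when the thresholds shift with newer hardware.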

The interesting design question is no longer "can you run an LLM in the browser" — you can, in 30 lines of JavaScript. The interesting question is what to do with the privacy and zero-cost-inference properties that gives you. That answer is product-specific, and it's the part no runtime can give you.

Tagged in: llm webassembly llama.cpp inference browser

Related Posts

Memory64 in 2026: Breaking the 4GB Ceiling in Browsers and Servers

The State of WebAssembly in 2026

WebAssembly Component Model in Production: A 2026 Field Report
