WebAssembly Performance Tips: Making Your Wasm Code Faster
WasmHub Team
April 9, 2025 · 7 min read
"Just compile to Wasm and it'll be fast" is a myth. WebAssembly provides a ceiling of near-native performance, but most modules hit nowhere near that ceiling without deliberate optimization. The good news is that the gains are often dramatic — it's not uncommon to see 3–10× improvements from applying a handful of techniques to a naïve first implementation.
This guide covers the optimization techniques that matter most in practice, roughly ordered from "do this first" to "reach for this when you've exhausted the easier wins."
1. Profile Before Optimizing
This rule applies to all optimization work, but it matters especially for Wasm because the bottlenecks are often not where you expect them to be.
Chrome DevTools has first-class Wasm profiling. Open the Performance panel, record a profile, and you'll see Wasm function names inline in the flame chart (assuming your build includes DWARF debug info):
# Rust: include debug info in release builds for profiling
# In Cargo.toml:
# [profile.release]
# debug = 1
cargo build --release
# Emscripten: include source maps
emcc my_app.c -O2 -g -o my_app.js
Wasmtime includes a built-in profiler for server-side workloads:
wasmtime run --profile=jitdump my_module.wasm
# Produces a perf-compatible jitdump file for use with `perf report`
Look for two things in the profile: hot functions (expected) and unexpectedly hot JS↔Wasm boundary crossings (very common and very fixable).
2. Minimize JavaScript/Wasm Boundary Crossings
Every call across the JS/Wasm boundary has overhead: the engine must validate arguments, potentially copy data, and switch execution contexts. For one-off calls this is negligible, but in a hot loop — say, processing video frames at 60 fps — thousands of boundary crossings per frame add up fast.
Anti-pattern: calling Wasm per pixel
// Slow — one boundary crossing per pixel
for (let i = 0; i < pixels.length; i += 4) {
  const luma = Module._compute_luma(pixels[i], pixels[i+1], pixels[i+2])
  pixels[i] = pixels[i+1] = pixels[i+2] = luma
}
Better pattern: pass the entire buffer once
// Fast — one boundary crossing per frame
const ptr = Module._alloc(pixels.byteLength)
new Uint8Array(Module.HEAPU8.buffer, ptr, pixels.byteLength).set(pixels)
Module._process_frame(ptr, pixels.byteLength)
pixels.set(new Uint8Array(Module.HEAPU8.buffer, ptr, pixels.byteLength))
Module._free(ptr)
The rule: keep loops inside the Wasm module. The JS side should call the module once per logical operation, not once per element.
3. Use Linear Memory Efficiently
WebAssembly's memory model is a flat array of bytes. How you lay out data in that array has a large effect on cache performance — the same cache-friendliness principles from native C apply here.
Prefer struct-of-arrays over array-of-structs for SIMD-friendly workloads:
// Array-of-structs — interleaved data, poor SIMD utilization
struct Particle { x: f32, y: f32, vx: f32, vy: f32, mass: f32 }
let particles: Vec<Particle> = vec![...];
// Struct-of-arrays — each field is contiguous, great for SIMD
struct ParticleSystem {
    x: Vec<f32>,
    y: Vec<f32>,
    vx: Vec<f32>,
    vy: Vec<f32>,
    mass: Vec<f32>,
}
When you update all x positions in a loop, the struct-of-arrays layout means you're reading a contiguous slice of memory — exactly what SIMD and prefetchers love.
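To make the comparison concrete, here is a minimal sketch of the per-field update loop over the struct-of-arrays layout (reusing the `ParticleSystem` shape above, with `mass` omitted for brevity):

```rust
// Struct-of-arrays particle system: each field is its own contiguous Vec,
// so a per-field update loop reads and writes sequential memory.
struct ParticleSystem {
    x: Vec<f32>,
    y: Vec<f32>,
    vx: Vec<f32>,
    vy: Vec<f32>,
}

impl ParticleSystem {
    // Advance every particle by one timestep. Each zip walks two
    // contiguous slices in lockstep, a pattern LLVM can often auto-vectorize.
    fn step(&mut self, dt: f32) {
        for (x, vx) in self.x.iter_mut().zip(&self.vx) {
            *x += vx * dt;
        }
        for (y, vy) in self.y.iter_mut().zip(&self.vy) {
            *y += vy * dt;
        }
    }
}
```

The array-of-structs version would instead force a strided access pattern (a 20-byte stride with the five-field Particle above), wasting cache bandwidth whenever only one field is touched.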
Avoid fragmentation. Linear memory allocators can fragment over time if you're frequently allocating and freeing objects of different sizes. For hot paths, prefer arena allocators (bump-pointer allocation, free everything at once) over general-purpose malloc.
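A bump-pointer arena of the kind described above can be sketched in a few lines of Rust. This is illustrative only (fixed capacity, 8-byte alignment, no growth), not a production allocator:

```rust
// A minimal bump arena: each allocation is a bounds check plus a pointer
// increment, and reset() frees everything at once.
struct BumpArena {
    buf: Vec<u8>,
    offset: usize,
}

impl BumpArena {
    fn with_capacity(cap: usize) -> Self {
        BumpArena { buf: vec![0; cap], offset: 0 }
    }

    // Hand out a zero-initialized slice of `len` bytes, or None if full.
    fn alloc(&mut self, len: usize) -> Option<&mut [u8]> {
        // Round the current offset up to 8-byte alignment.
        let start = (self.offset + 7) & !7;
        let end = start.checked_add(len)?;
        if end > self.buf.len() {
            return None;
        }
        self.offset = end;
        Some(&mut self.buf[start..end])
    }

    // Free everything in O(1): the whole point of an arena.
    fn reset(&mut self) {
        self.offset = 0;
    }
}
```

Resetting between frames keeps the hot path free of malloc/free entirely; crates like bumpalo provide a production-grade version of this pattern.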
4. Enable SIMD
Fixed-width SIMD is a shipped Wasm proposal supported in all major browsers and runtimes. It lets you process 4× floats, 8× i16s, or 16× i8s in a single instruction — the difference between a loop that takes 10 ms and one that takes 2 ms.
In Rust, use the std::arch::wasm32 intrinsics or the nightly portable-SIMD API (std::simd, which superseded the now-deprecated packed_simd crate), or simply enable auto-vectorization:
# .cargo/config.toml — enable SIMD for the wasm32 target
[target.wasm32-unknown-unknown]
rustflags = ["-C", "target-feature=+simd128"]
[target.wasm32-wasip1]
rustflags = ["-C", "target-feature=+simd128"]
With this flag set, LLVM's auto-vectorizer will emit SIMD instructions for eligible loops. For more control, use the intrinsics directly:
use std::arch::wasm32::*;
/// Dot product, processing 4 f32 lanes per iteration.
/// Requires building with `-C target-feature=+simd128`.
pub fn dot_product_simd(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % 4, 0);
    let mut acc = f32x4_splat(0.0);
    for i in (0..a.len()).step_by(4) {
        // SAFETY: the asserts above guarantee i + 4 <= len for both slices;
        // v128_load permits unaligned reads.
        let (va, vb) = unsafe {
            (
                v128_load(a.as_ptr().add(i) as *const v128),
                v128_load(b.as_ptr().add(i) as *const v128),
            )
        };
        acc = f32x4_add(acc, f32x4_mul(va, vb));
    }
    // Horizontal sum of the 4 lanes
    f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc)
}
In Emscripten, pass -msimd128 to emcc. Native Wasm SIMD intrinsics live in the wasm_simd128.h header, and Emscripten can also translate many existing SSE/SSE2 intrinsics automatically under the same flag.
5. Tune Compiler Optimization Flags
The default debug build of a Wasm module is 5–20× slower than a properly optimized release build. Make sure you're actually measuring the optimized binary.
Rust:
# Cargo.toml
[profile.release]
opt-level = 3
lto = "fat" # Link-time optimization across all crates
codegen-units = 1 # Single codegen unit — slower compile, better output
strip = "symbols" # Remove debug symbols for smaller binary
panic = "abort" # Replace unwinding with abort — saves ~10 KB
Emscripten:
# -O3: maximum optimization
# --closure 1: minify the JS glue with the Closure Compiler
# -flto: enable link-time optimization
emcc my_app.c -O3 --closure 1 -flto -o my_app.js
Post-compilation: wasm-opt
wasm-opt from the Binaryen toolkit runs additional passes on the compiled .wasm binary, often yielding another 10–20% improvement in both speed and size:
wasm-opt -O4 --enable-simd input.wasm -o output.wasm
# Or use the shrink-focused preset:
wasm-opt -Oz input.wasm -o output.wasm
Emscripten's -O3 already runs wasm-opt internally. For Rust, wasm-pack runs wasm-opt on release builds automatically; for plain cargo builds, invoke wasm-opt yourself as a post-build step.
6. Use Streaming Instantiation
A subtle but measurable optimization for browser loads: use WebAssembly.instantiateStreaming instead of WebAssembly.instantiate. The streaming variant compiles the binary while it's still downloading, overlapping network transfer and compilation:
// Slow — downloads everything, then compiles
const bytes = await fetch('/module.wasm').then(r => r.arrayBuffer())
const { instance } = await WebAssembly.instantiate(bytes)
// Fast — compiles while downloading (requires correct MIME type)
const { instance } = await WebAssembly.instantiateStreaming(
fetch('/module.wasm') // server must return Content-Type: application/wasm
)For repeated instantiation (e.g., creating multiple workers running the same module), compile once to a WebAssembly.Module and instantiate multiple times:
// Compile once — expensive
const module = await WebAssembly.compileStreaming(fetch('/module.wasm'))
// Instantiate multiple times — cheap
const instances = await Promise.all(
  Array.from({ length: 4 }, () => WebAssembly.instantiate(module, imports))
)
7. Avoid Unnecessary Allocations in Hot Paths
Every malloc/free call in a hot path is overhead. For frequently-called functions that work on buffers, consider:
Pre-allocate and reuse:
pub struct ImageProcessor {
    scratch_buffer: Vec<u8>, // reused across calls
}
impl ImageProcessor {
    pub fn process(&mut self, input: &[u8], output: &mut [u8]) {
        // scratch_buffer is reused — no allocation per call
        self.scratch_buffer.resize(input.len(), 0);
        // ... work using self.scratch_buffer ...
    }
}
Use stack allocation for small, known-size buffers:
// Stack-allocated — no heap involvement
let mut coefficients = [0.0f32; 64];
// Fill and use coefficients...
For Emscripten-compiled C code, alloca works in Wasm and is a good option for temporary buffers in tight loops.
8. Benchmark Correctly
A few common measurement mistakes that produce misleading results:
Warm up the engine before measuring. Wasm engines use tiered compilation: early calls may run baseline-compiled code before the optimizing tier kicks in, so the first iterations are not representative. Run your workload 3–5 times before recording numbers:
// Don't measure the first call — it may include compilation
for (let i = 0; i < 5; i++) {
  processFrame(data) // warm up
}
// Now measure
const t0 = performance.now()
for (let i = 0; i < 100; i++) {
  processFrame(data)
}
const t1 = performance.now()
console.log(`Average: ${(t1 - t0) / 100} ms`)
Use performance.now(), not Date.now(). performance.now() is monotonic and offers sub-millisecond resolution (browsers coarsen it somewhat to mitigate timing attacks), while Date.now() has 1 ms resolution and can jump when the system clock is adjusted.
Measure in a Web Worker. The main thread can be preempted by rendering, event handlers, and GC. Workers give a more stable measurement environment.
Applying these techniques systematically — profile, eliminate boundary crossings, improve memory layout, enable SIMD, tune compiler flags — typically yields workloads that are 5–20× faster than a naïve first implementation. WebAssembly's performance ceiling is close to native; with the right techniques, you can get most of the way there.