Tech 1/8/2026

[Tech Series 02] From TensorFlow.js to WebLLM: Evolution of Web ML

Tracing the evolution of Web ML, from TensorFlow.js, an early in-browser machine learning library, to WebLLM, which runs Large Language Models (LLMs) entirely in the browser.

Running machine learning in a web browser was unimaginable just a few years ago: JavaScript was perceived as slow, and browsers were not suited to heavy computation. Yet Web ML technology has evolved at an astonishing pace, overturning those assumptions.

1. The Dawn: Emergence of TensorFlow.js (2018)

In 2018, Google released TensorFlow.js to the world. This was like a Big Bang for Web ML, bringing deep learning, which was the exclusive domain of Python, into the JavaScript ecosystem.

Developers could now train models directly in the browser or load pre-trained models for inference. Initial applications focused on relatively lightweight models such as image classification, pose detection (PoseNet), and face recognition, but this was enough to prove the possibility that “AI can run on the web.”

2. The Leap: WebAssembly (Wasm) Acceleration

Despite improvements in JavaScript engines (V8, etc.), matrix operations for deep learning remained heavy. To break through this, WebAssembly (Wasm) technology was introduced.

WebAssembly compiles code written in C++ or Rust into a binary format executable in browsers, providing execution speeds close to native. This significantly improved TensorFlow.js backend performance, laying the foundation for running more complex and sophisticated models on the web.
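To make that concrete, here is a minimal, hand-assembled Wasm module (the classic `add` function) instantiated from JavaScript. Real Web ML backends ship far larger binaries compiled from C++ or Rust, but the load-and-run mechanism is the same:

```javascript
// Minimal WebAssembly module exporting add(a, b) -> a + b,
// written out byte by byte so the example is self-contained.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function section
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: i32.add
]);

// Compile and instantiate synchronously (fine for a 41-byte module;
// large real-world binaries should use WebAssembly.instantiateStreaming).
const instance = new WebAssembly.Instance(new WebAssembly.Module(wasmBytes));
console.log(instance.exports.add(2, 3)); // → 5
```

This runs unchanged in any modern browser or in Node.js; TensorFlow.js's Wasm backend is the same idea scaled up to hand-optimized kernels.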

3. The Revolution: WebGPU and Large Language Models (LLM)

Now, we are facing another revolution: the emergence of WebGPU. Where WebGL "borrowed" an API designed for graphics rendering to do AI computation, WebGPU is a next-generation standard designed from the ground up for general-purpose GPU computing (compute shaders).

Through WebGPU, browsers can access high-performance GPU resources directly and efficiently. This dramatically increased computation speeds, finally making it possible to run Large Language Models (LLMs) with billions of parameters, like Llama 3 and Gemma, in a single browser tab through projects like WebLLM.

Realays stands at the peak of this technological advancement, actively adopting WebGPU and the latest Web ML technologies to provide users with fast and powerful web AI services they’ve never experienced before.

Performance Comparison by Generation

TensorFlow.js (2018)

  • Model Size: Mainly lightweight models under 10MB
  • Inference Speed: ~100-200ms for MobileNet (CPU)
  • Main Use Cases: Image classification, pose detection, object recognition

WebAssembly Acceleration (2020)

  • Performance Gain: 2-3x improvement in CPU computation speed
  • Model Support: Support for more complex CNN and RNN models
  • Main Use Cases: Real-time video processing, speech recognition

WebGPU + WebLLM (2023-present)

  • Model Size: Multi-GB LLMs supported (Llama 3, Gemma, etc.)
  • Inference Speed: 10-100x improvement with GPU acceleration
  • Main Use Cases: Chatbots, text generation, image generation AI

Real Implementation Examples

1. Pose Detection with MoveNet

import * as poseDetection from "@tensorflow-models/pose-detection";

async function detectPose(video) {
  // Load MoveNet model
  const detector = await poseDetection.createDetector(
    poseDetection.SupportedModels.MoveNet,
  );

  // Detect poses from video
  const poses = await detector.estimatePoses(video);

  return poses;
}

2. Chatbot with WebLLM

import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function initChatbot() {
  // Download, cache, and compile the model (requires WebGPU)
  const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC");

  // OpenAI-style chat completion API
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
  });
  console.log(reply.choices[0].message.content);
}

Technology Selection Guide

When to use TensorFlow.js?

  • Lightweight models (< 50MB)
  • Need broad browser compatibility
  • Sufficient performance on CPU

When to use WebGPU?

  • Large models (> 100MB)
  • Real-time processing required
  • Can target modern browsers

When to use WebAssembly?

  • CPU-intensive operations
  • Prepare for environments without GPU
  • Support legacy browsers
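The three checklists above can be collapsed into a single selection function. This is an illustrative sketch, not a library API; the thresholds (50 MB, 100 MB) come straight from the lists:

```javascript
// Choose a Web ML backend from model size and environment capabilities.
// Names and thresholds are illustrative, mirroring the guide above.
function pickBackend({ modelSizeMB, hasWebGPU, hasWasm, needsRealtime = false }) {
  // Large models or real-time workloads want GPU compute
  if (hasWebGPU && (modelSizeMB > 100 || needsRealtime)) return "webgpu";
  // Lightweight models run fine on TensorFlow.js's default WebGL/CPU path
  if (modelSizeMB < 50) return "tfjs-default";
  // CPU-heavy work without a usable GPU: WebAssembly
  if (hasWasm) return "wasm";
  return "cpu";
}

pickBackend({ modelSizeMB: 4000, hasWebGPU: true, hasWasm: true }); // → "webgpu"
pickBackend({ modelSizeMB: 80, hasWebGPU: false, hasWasm: true });  // → "wasm"
```

In a real application you would combine this with runtime feature detection rather than hard-coded flags.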

Performance Optimization Tips

1. Model Quantization

Quantization is applied when the model is converted for the web, not at load time (loadGraphModel has no quantization option). With the tensorflowjs_converter CLI:

# Store weights as uint8, cutting the download to roughly 1/4;
# --quantize_float16 halves the size with less accuracy loss
tensorflowjs_converter --quantize_uint8 \
  --input_format=tf_saved_model ./saved_model ./web_model
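For intuition, the size savings are simple arithmetic (`weightSizeMB` is a hypothetical helper, and the parameter count is an approximate MobileNet-scale figure):

```javascript
// Approximate weight download size: parameter count x bytes per parameter.
// float32 = 4 bytes, float16 = 2, int8 = 1.
function weightSizeMB(paramCount, bytesPerParam) {
  return (paramCount * bytesPerParam) / (1024 * 1024);
}

// A MobileNet-scale model (~4.2M parameters):
console.log(weightSizeMB(4.2e6, 4)); // ~16 MB as float32
console.log(weightSizeMB(4.2e6, 1)); // ~4 MB as int8
```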

2. WebWorker Utilization

// Run inference in a background thread so the main thread stays responsive;
// ml-worker.js loads the model and posts predictions back via postMessage.
const worker = new Worker("ml-worker.js");
worker.postMessage({ image: imageData });
worker.onmessage = (e) => {
  console.log("Prediction:", e.data);
};

3. Model Caching

// Try the cached copy in IndexedDB first; fall back to the
// network and cache the model for next time.
let model;
try {
  model = await tf.loadLayersModel("indexeddb://my-model");
} catch {
  model = await tf.loadLayersModel("https://cdn.com/model.json");
  await model.save("indexeddb://my-model");
}

Limitations and Solutions

Limitation 1: Browser Memory Constraints

Problem: Out of memory when loading large models

Solutions:

  • Apply model quantization
  • Use streaming load approach
  • Dynamically load only necessary parts

Limitation 2: Initial Loading Time

Problem: Time consumed downloading models

Solutions:

  • Progressive Loading
  • Service Worker caching
  • Use CDN
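Progressive loading usually means fetching the weight file in byte-range shards; a sketch of the range math (hypothetical helper, not a library API):

```javascript
// Split a file of totalBytes into inclusive [start, end] byte ranges
// suitable for HTTP Range requests (e.g. "Range: bytes=0-4194303").
function shardRanges(totalBytes, shardBytes) {
  const ranges = [];
  for (let start = 0; start < totalBytes; start += shardBytes) {
    ranges.push([start, Math.min(start + shardBytes, totalBytes) - 1]);
  }
  return ranges;
}

console.log(shardRanges(10, 4)); // → [[0, 3], [4, 7], [8, 9]]
```

Each range can be fetched in parallel or on demand, and cached individually by a Service Worker.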

Limitation 3: GPU Access Restrictions

Problem: Not all browsers support WebGPU

Solutions:

  • Implement fallback strategy
  • Support WebGL backend in parallel
  • Provide CPU-optimized version
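A minimal capability cascade for the fallback strategy (a sketch; it checks for globals rather than probing actual contexts):

```javascript
// Pick the best backend the current environment exposes.
// Pass `globalThis` in a page; an explicit object makes it testable.
function detectBackend(g = globalThis) {
  if (g.navigator && g.navigator.gpu) return "webgpu"; // WebGPU available
  if ("WebGLRenderingContext" in g) return "webgl";    // WebGL fallback
  if ("WebAssembly" in g) return "wasm";               // CPU via Wasm
  return "cpu";                                        // plain JS last resort
}
```

In production you would also await navigator.gpu.requestAdapter() and verify it returns a non-null adapter, since the property can exist without usable hardware.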

Future Outlook

Expected Developments

  1. Larger Models: 100B+ parameter models running in browsers
  2. Multimodal: Simultaneous processing of text, images, and voice
  3. Federated Learning: Local training with user data, then aggregation
  4. Real-time Collaborative AI: P2P distribution of AI computation across multiple browsers

Industry Impact

  • Education: AI tutors for personalized learning
  • Healthcare: Real-time health monitoring
  • Creative: Browser-based AI design tools
  • Gaming: Intelligent NPC dialogue and behavior

Frequently Asked Questions

Q: How fast is it compared to server AI?
A: On-device inference avoids the network round-trip entirely, so simple predictions can return faster than a server call. For large, complex models, servers with dedicated accelerators still have the advantage.

Q: When will WebGPU be available on all browsers?
A: Chrome and Edge have shipped it since 2023. Safari and Firefox began shipping support in 2025.

Q: Does WebGPU work on mobile?
A: On Android, Chrome enables WebGPU by default since version 121. On iOS, support arrives with Safari 26.


In the next post, we’ll cover the Web ML development process step by step in detail.
