Tech 1/8/2026

[Tech Series 02] From TensorFlow.js to WebLLM: Evolution of Web ML

Tracing the evolution of Web ML, from TensorFlow.js, an early in-browser machine learning library, to WebLLM, which runs Large Language Models (LLMs) entirely in the browser.

Running machine learning in a web browser was unimaginable just a few years ago: JavaScript was perceived as slow, and browsers were not suited to heavy computation. Yet Web ML technology has evolved at an astonishing pace, overturning those assumptions.

1. The Dawn: Emergence of TensorFlow.js (2018)

In 2018, Google released TensorFlow.js to the world. This was like a Big Bang for Web ML, bringing deep learning, which was the exclusive domain of Python, into the JavaScript ecosystem.

Developers could now train models directly in the browser or load pre-trained models for inference. Initial applications focused on relatively lightweight models such as image classification, pose detection (PoseNet), and face recognition, but this was enough to prove the possibility that “AI can run on the web.”

2. The Leap: WebAssembly (Wasm) Acceleration

Despite improvements in JavaScript engines (V8, etc.), matrix operations for deep learning remained heavy. To break through this, WebAssembly (Wasm) technology was introduced.

WebAssembly compiles code written in C++ or Rust into a binary format executable in browsers, providing execution speeds close to native. This significantly improved TensorFlow.js backend performance, laying the foundation for running more complex and sophisticated models on the web.
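To make that concrete, here is a minimal, hand-assembled Wasm module (the classic `add` function) instantiated from JavaScript. Real Web ML backends ship far larger binaries compiled from C++ or Rust, but the load-and-run mechanism is the same:

```javascript
// Minimal WebAssembly module exporting add(a, b) -> a + b,
// written out byte by byte so the example is self-contained.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function section
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: i32.add
]);

// Compile and instantiate synchronously (fine for a 41-byte module;
// large real-world binaries should use WebAssembly.instantiateStreaming).
const instance = new WebAssembly.Instance(new WebAssembly.Module(wasmBytes));
console.log(instance.exports.add(2, 3)); // → 5
```

This runs unchanged in any modern browser or in Node.js; TensorFlow.js's Wasm backend is the same idea scaled up to hand-optimized kernels.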

3. The Revolution: WebGPU and Large Language Models (LLM)

Now, we are facing another revolution: the emergence of WebGPU. Where WebGL "borrowed" an API designed for graphics rendering to do AI computation, WebGPU is a next-generation standard designed from the ground up for general-purpose GPU computing (compute shaders).

Through WebGPU, browsers can access high-performance GPU resources directly and efficiently. This dramatically increased computation speeds, finally making it possible to run Large Language Models (LLMs) with billions of parameters, like Llama 3 and Gemma, in a single browser tab through projects like WebLLM.

Realays stands at the peak of this technological advancement, actively adopting WebGPU and the latest Web ML technologies to provide users with fast and powerful web AI services they’ve never experienced before.

Performance Comparison by Generation

TensorFlow.js (2018)

  • Model Size: Mainly lightweight models under 10MB
  • Inference Speed: ~100-200ms for MobileNet (CPU)
  • Main Use Cases: Image classification, pose detection, object recognition

WebAssembly Acceleration (2020)

  • Performance Gain: 2-3x improvement in CPU computation speed
  • Model Support: Support for more complex CNN and RNN models
  • Main Use Cases: Real-time video processing, speech recognition

WebGPU + WebLLM (2023-present)

  • Model Size: Multi-GB LLMs supported (Llama 3, Gemma, etc.)
  • Inference Speed: 10-100x improvement with GPU acceleration
  • Main Use Cases: Chatbots, text generation, image generation AI

Real Implementation Examples

1. Pose Detection with MoveNet

import * as poseDetection from "@tensorflow-models/pose-detection";

async function detectPose(video) {
  // Load MoveNet model
  const detector = await poseDetection.createDetector(
    poseDetection.SupportedModels.MoveNet,
  );

  // Detect poses from video
  const poses = await detector.estimatePoses(video);

  return poses;
}

2. Chatbot with WebLLM

import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function initChatbot() {
  // Download, cache, and compile the model (requires WebGPU)
  const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f32_1-MLC");

  // OpenAI-style chat completion API
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
  });
  console.log(reply.choices[0].message.content);
}

Technology Selection Guide

When to use TensorFlow.js?

  • Lightweight models (< 50MB)
  • Need broad browser compatibility
  • Sufficient performance on CPU

When to use WebGPU?

  • Large models (> 100MB)
  • Real-time processing required
  • Can target modern browsers

When to use WebAssembly?

  • CPU-intensive operations
  • Prepare for environments without GPU
  • Support legacy browsers
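The three checklists above can be collapsed into a single selection function. This is an illustrative sketch, not a library API; the thresholds (50 MB, 100 MB) come straight from the lists:

```javascript
// Choose a Web ML backend from model size and environment capabilities.
// Names and thresholds are illustrative, mirroring the guide above.
function pickBackend({ modelSizeMB, hasWebGPU, hasWasm, needsRealtime = false }) {
  // Large models or real-time workloads want GPU compute
  if (hasWebGPU && (modelSizeMB > 100 || needsRealtime)) return "webgpu";
  // Lightweight models run fine on TensorFlow.js's default WebGL/CPU path
  if (modelSizeMB < 50) return "tfjs-default";
  // CPU-heavy work without a usable GPU: WebAssembly
  if (hasWasm) return "wasm";
  return "cpu";
}

pickBackend({ modelSizeMB: 4000, hasWebGPU: true, hasWasm: true }); // → "webgpu"
pickBackend({ modelSizeMB: 80, hasWebGPU: false, hasWasm: true });  // → "wasm"
```

In a real application you would combine this with runtime feature detection rather than hard-coded flags.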

Performance Optimization Tips

1. Model Quantization

Quantization is applied when the model is converted for the web, not at load time (loadGraphModel has no quantization option). With the tensorflowjs_converter CLI:

# Store weights as uint8, cutting the download to roughly 1/4;
# --quantize_float16 halves the size with less accuracy loss
tensorflowjs_converter --quantize_uint8 \
  --input_format=tf_saved_model ./saved_model ./web_model
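For intuition, the size savings are simple arithmetic (`weightSizeMB` is a hypothetical helper, and the parameter count is an approximate MobileNet-scale figure):

```javascript
// Approximate weight download size: parameter count x bytes per parameter.
// float32 = 4 bytes, float16 = 2, int8 = 1.
function weightSizeMB(paramCount, bytesPerParam) {
  return (paramCount * bytesPerParam) / (1024 * 1024);
}

// A MobileNet-scale model (~4.2M parameters):
console.log(weightSizeMB(4.2e6, 4)); // ~16 MB as float32
console.log(weightSizeMB(4.2e6, 1)); // ~4 MB as int8
```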

2. WebWorker Utilization

// Run inference in a background thread so the main thread stays responsive;
// ml-worker.js loads the model and posts predictions back via postMessage.
const worker = new Worker("ml-worker.js");
worker.postMessage({ image: imageData });
worker.onmessage = (e) => {
  console.log("Prediction:", e.data);
};

3. Model Caching

// Try the cached copy in IndexedDB first; fall back to the
// network and cache the model for next time.
let model;
try {
  model = await tf.loadLayersModel("indexeddb://my-model");
} catch {
  model = await tf.loadLayersModel("https://cdn.com/model.json");
  await model.save("indexeddb://my-model");
}

Limitations and Solutions

Limitation 1: Browser Memory Constraints

Problem: Out of memory when loading large models

Solutions:

  • Apply model quantization
  • Use streaming load approach
  • Dynamically load only necessary parts

Limitation 2: Initial Loading Time

Problem: Time consumed downloading models

Solutions:

  • Progressive Loading
  • Service Worker caching
  • Use CDN
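Progressive loading usually means fetching the weight file in byte-range shards; a sketch of the range math (hypothetical helper, not a library API):

```javascript
// Split a file of totalBytes into inclusive [start, end] byte ranges
// suitable for HTTP Range requests (e.g. "Range: bytes=0-4194303").
function shardRanges(totalBytes, shardBytes) {
  const ranges = [];
  for (let start = 0; start < totalBytes; start += shardBytes) {
    ranges.push([start, Math.min(start + shardBytes, totalBytes) - 1]);
  }
  return ranges;
}

console.log(shardRanges(10, 4)); // → [[0, 3], [4, 7], [8, 9]]
```

Each range can be fetched in parallel or on demand, and cached individually by a Service Worker.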

Limitation 3: GPU Access Restrictions

Problem: Not all browsers support WebGPU

Solutions:

  • Implement fallback strategy
  • Support WebGL backend in parallel
  • Provide CPU-optimized version
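A minimal capability cascade for the fallback strategy (a sketch; it checks for globals rather than probing actual contexts):

```javascript
// Pick the best backend the current environment exposes.
// Pass `globalThis` in a page; an explicit object makes it testable.
function detectBackend(g = globalThis) {
  if (g.navigator && g.navigator.gpu) return "webgpu"; // WebGPU available
  if ("WebGLRenderingContext" in g) return "webgl";    // WebGL fallback
  if ("WebAssembly" in g) return "wasm";               // CPU via Wasm
  return "cpu";                                        // plain JS last resort
}
```

In production you would also await navigator.gpu.requestAdapter() and verify it returns a non-null adapter, since the property can exist without usable hardware.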

Future Outlook

Expected Developments

  1. Larger Models: 100B+ parameter models running in browsers
  2. Multimodal: Simultaneous processing of text, images, and voice
  3. Federated Learning: Local training with user data, then aggregation
  4. Real-time Collaborative AI: P2P distribution of AI computation across multiple browsers

Industry Impact

  • Education: AI tutors for personalized learning
  • Healthcare: Real-time health monitoring
  • Creative: Browser-based AI design tools
  • Gaming: Intelligent NPC dialogue and behavior

Frequently Asked Questions

Q: How fast is it compared to server AI?
A: On-device inference avoids the network round-trip entirely, so simple predictions can return faster than a server call. For large, complex models, servers with dedicated accelerators still have the advantage.

Q: When will WebGPU be available on all browsers?
A: Chrome and Edge have shipped it since 2023. Safari and Firefox began shipping support in 2025.

Q: Does WebGPU work on mobile?
A: On Android, Chrome enables WebGPU by default since version 121. On iOS, support arrives with Safari 26.


In the next post, we’ll cover the Web ML development process step by step in detail.
