Tech 1/11/2026

[Tech Series 03] Web ML Development Process: From Model Conversion to Deployment

A step-by-step explanation of the development workflow for converting AI models created in Python and integrating them into a form that runs on the web.

To run AI models in a web browser, simply copying and pasting code is not enough. A systematic conversion and deployment pipeline is required to bring models developed in Python-centric AI research environments (PyTorch, TensorFlow) to JavaScript and WebGPU environments.

1. Model Training and Selection (Python Environment)

The beginning of every Web ML project is the same as any AI project.

  • Training: Train a model suitable for your purpose using PyTorch or TensorFlow.
  • Selection: Find a suitable pre-trained model for your project on model hubs like Hugging Face.

At this stage, you need to consider not just the model’s performance but also its size. Web environments have constraints on download speed and memory.
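To make the size constraint concrete, a rough download-time estimate helps when budgeting model size. The helper below is illustrative only; the function name and the example numbers are assumptions, not from a real API.

```javascript
// Rough download-time estimate for a model file, to sanity-check size budgets.
function estimateDownloadSeconds(sizeMB, bandwidthMbps) {
  const sizeMegabits = sizeMB * 8; // 1 byte = 8 bits
  return sizeMegabits / bandwidthMbps;
}

// A 100 MB model on a 20 Mbps connection takes about 40 seconds:
console.log(estimateDownloadSeconds(100, 20)); // 40
```

Even a mid-sized model can take tens of seconds on an average mobile connection, which is why size considerations come before anything else.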

Considerations for Web ML

When selecting or training a model for web deployment, keep these factors in mind:

Model Size:

  • Smaller models (< 50MB): Can be loaded instantly
  • Medium models (50-200MB): Require progress indicators
  • Large models (> 200MB): Need streaming or chunked loading
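The size tiers above can be expressed as a simple strategy picker. This is a sketch: the tier boundaries come from the list, but the function name and return values are made up for illustration.

```javascript
// Map a model's size to a loading strategy, mirroring the tiers above.
function loadingStrategy(sizeMB) {
  if (sizeMB < 50) return "instant";
  if (sizeMB <= 200) return "progress-indicator";
  return "chunked";
}

console.log(loadingStrategy(30));  // "instant"
console.log(loadingStrategy(120)); // "progress-indicator"
console.log(loadingStrategy(500)); // "chunked"
```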

Architecture Choices:

  • MobileNet: Optimized for mobile and web (lightweight)
  • EfficientNet: Good balance of accuracy and size
  • DistilBERT: Compressed version of BERT for NLP tasks
  • SqueezeNet: Extremely compact image classification

2. Model Conversion

Browsers cannot directly execute .pt (PyTorch) or .h5 (Keras) files. Converting them (porting) to web-friendly formats is therefore essential.

TensorFlow.js Converter

Converts TensorFlow’s SavedModel format into model.json and binary shard files that can be loaded on the web. Weight compression can also be performed during this process.

# Convert TensorFlow SavedModel to TensorFlow.js
tensorflowjs_converter \
  --input_format=tf_saved_model \
  --output_format=tfjs_graph_model \
  /path/to/saved_model \
  /path/to/tfjs_model

Benefits:

  • Direct integration with TensorFlow.js
  • Built-in quantization support
  • Optimized for web execution

ONNX (Open Neural Network Exchange)

PyTorch models are usually converted to an intermediate format called ONNX. Using ONNX Runtime Web, this format can be executed directly in the browser, with broad support across frameworks and browsers.

# Export PyTorch model to ONNX
import torch

model = YourModel()
model.eval()  # switch to inference mode before export

# Dummy input matching the model's expected input shape (batch, channels, H, W)
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])

Benefits:

  • Framework-agnostic (works with PyTorch, TensorFlow, etc.)
  • Excellent optimization tools
  • Wide industry adoption

MLC-LLM

For recently popular LLMs, use the TVM compiler-based MLC-LLM toolchain. This analyzes Python models, compiles them into WebGPU shader code, and packages weights in an efficient structure.

# Compile LLM for Web using MLC-LLM
mlc_llm compile model.json \
  --target webgpu \
  --quantization q4f16_1 \
  --output compiled_model

Benefits:

  • Enables billion-parameter models in browsers
  • Advanced quantization techniques
  • WebGPU optimization

3. Web Application Integration and Deployment


Converted models are deployed as static files to web servers or CDNs (Content Delivery Networks). Frontend applications then load and use them.

Loading Models in JavaScript

// TensorFlow.js model loading example
// Note: models converted with --output_format=tfjs_graph_model are loaded
// with tf.loadGraphModel; tf.loadLayersModel is for Keras/Layers-format models.
async function loadModel() {
  console.log("Starting model load...");

  // Load from CDN
  const model = await tf.loadGraphModel(
    "https://cdn.example.com/my-model/model.json",
  );

  console.log("Model loaded successfully!");
  return model;
}

// ONNX Runtime Web example
async function loadONNXModel() {
  const session = await ort.InferenceSession.create(
    "https://cdn.example.com/model.onnx",
  );
  return session;
}

User Experience (UX) Considerations

At this stage, developers must pay attention to user experience (UX). Model files can range from tens of MB to several GB, so it’s crucial to:

1. Display Loading Progress

async function loadModelWithProgress(url) {
  const response = await fetch(url);
  const reader = response.body.getReader();
  // Content-Length may be absent (e.g. chunked transfer); guard against NaN
  const contentLength = +response.headers.get("Content-Length") || 0;

  let receivedLength = 0;
  const chunks = [];

  while (true) {
    const { done, value } = await reader.read();

    if (done) break;

    chunks.push(value);
    receivedLength += value.length;

    // Update progress bar (skip when the total size is unknown)
    if (contentLength) {
      const progress = (receivedLength / contentLength) * 100;
      updateProgressBar(progress);
    }
  }

  return new Blob(chunks);
}

2. Implement Caching Strategy

// Cache model using Service Worker
self.addEventListener("fetch", (event) => {
  if (event.request.url.includes("/models/")) {
    event.respondWith(
      caches.open("model-cache-v1").then((cache) => {
        return cache.match(event.request).then((cached) => {
          return (
            cached ||
            fetch(event.request).then((response) => {
              // Only cache successful responses
              if (response.ok) {
                cache.put(event.request, response.clone());
              }
              return response;
            })
          );
        });
      }),
    );
  }
});

3. Lazy Loading

Load models only when needed, not on initial page load:

// Load model on user interaction
let model = null;

button.addEventListener("click", async () => {
  if (!model) {
    showLoadingSpinner();
    model = await loadModel();
    hideLoadingSpinner();
  }

  const result = await model.predict(input);
  displayResult(result);
});
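One subtlety with click-triggered loading: two rapid clicks can start two downloads before the first finishes. A minimal memoizing wrapper (a hypothetical helper, not from the post) makes concurrent calls share one in-flight promise:

```javascript
// Wrap an async loader so concurrent calls share one in-flight promise.
function loadOnce(loader) {
  let promise = null;
  return () => {
    if (promise === null) {
      promise = loader(); // first call starts the load
    }
    return promise; // later calls reuse the same promise
  };
}

// Usage sketch: const getModel = loadOnce(loadModel); const model = await getModel();
```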

4. Optimization and Performance Tuning

Model Optimization Techniques

Quantization:

# Quantize TensorFlow.js model to 8-bit
tensorflowjs_converter \
  --input_format=tf_saved_model \
  --output_format=tfjs_graph_model \
  --quantization_bytes=1 \
  /path/to/saved_model \
  /path/to/quantized_model

Pruning: Remove unnecessary connections in the neural network to reduce size without significant accuracy loss.

Knowledge Distillation: Train a smaller “student” model to mimic a larger “teacher” model.

Deployment Best Practices

  1. Use CDN: Deploy models to a CDN for global low-latency access
  2. Enable Compression: Serve models with gzip or brotli compression
  3. Version Control: Implement model versioning for updates
  4. Fallback Strategy: Provide server-side inference as fallback
  5. Monitor Performance: Track loading times and inference speed
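Model versioning (practice 3) can be as simple as encoding the version in the CDN path, so old clients stay pinned while new ones pick up updates. The URL scheme below is an assumption for illustration, not a real CDN layout:

```javascript
// Build a versioned model URL so releases are immutable and cache-friendly.
function modelUrl(baseUrl, name, version) {
  return `${baseUrl}/models/${name}/v${version}/model.json`;
}

console.log(modelUrl("https://cdn.example.com", "classifier", "1.2.0"));
// "https://cdn.example.com/models/classifier/v1.2.0/model.json"
```

Because each version lives at its own immutable URL, the files can be served with long-lived cache headers without risking stale models.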

5. Testing and Validation

Browser Compatibility Testing

Test across different browsers and devices:

// Check WebGPU support
const hasWebGPU = "gpu" in navigator;

// Check WebGL support
const canvas = document.createElement("canvas");
const hasWebGL = !!(
  canvas.getContext("webgl") || canvas.getContext("experimental-webgl")
);

// Select appropriate backend
if (hasWebGPU) {
  await tf.setBackend("webgpu");
} else if (hasWebGL) {
  await tf.setBackend("webgl");
} else {
  await tf.setBackend("cpu");
}
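The detection logic above boils down to a priority list, which is easier to unit-test when pulled out as a pure function (the function name is made up for illustration):

```javascript
// Pure version of the backend-selection priority above: WebGPU > WebGL > CPU.
function chooseBackend(hasWebGPU, hasWebGL) {
  if (hasWebGPU) return "webgpu";
  if (hasWebGL) return "webgl";
  return "cpu";
}

console.log(chooseBackend(true, true));   // "webgpu"
console.log(chooseBackend(false, true));  // "webgl"
console.log(chooseBackend(false, false)); // "cpu"
```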

Performance Benchmarking

async function benchmarkInference() {
  const warmupRuns = 5;
  const benchmarkRuns = 100;

  // Warmup (first runs include shader compilation and memory allocation)
  for (let i = 0; i < warmupRuns; i++) {
    const out = model.predict(sampleInput);
    await out.data(); // wait for GPU work to actually finish
    out.dispose();
  }

  // Benchmark
  const start = performance.now();
  for (let i = 0; i < benchmarkRuns; i++) {
    const out = model.predict(sampleInput);
    await out.data(); // .data() forces the GPU to complete before timing
    out.dispose();
  }
  const end = performance.now();

  const avgTime = (end - start) / benchmarkRuns;
  console.log(`Average inference time: ${avgTime.toFixed(2)}ms`);
}
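An average alone can hide occasional slow frames. If you record each run's timing into an array inside the loop, a small summary helper (a sketch; names are assumptions) reports a percentile alongside the mean:

```javascript
// Summarize a list of per-run timings (ms): mean plus a nearest-rank percentile.
function summarizeTimings(timings, percentile = 95) {
  const sorted = [...timings].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, t) => sum + t, 0) / sorted.length;
  // Nearest-rank percentile index
  const idx = Math.min(
    sorted.length - 1,
    Math.ceil((percentile / 100) * sorted.length) - 1,
  );
  return { mean, [`p${percentile}`]: sorted[idx] };
}

console.log(summarizeTimings([10, 12, 11, 50])); // { mean: 20.75, p95: 50 }
```

The p95 value is usually what users perceive as "how slow it gets", so it is worth tracking next to the average.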

Conclusion

The Web ML development process requires careful planning and execution at every stage—from model selection and conversion to deployment and optimization. The key to success is balancing model performance, file size, and user experience.

Success Checklist:

  • ✅ Choose or train appropriately sized models
  • ✅ Convert to web-compatible formats
  • ✅ Implement loading progress indicators
  • ✅ Enable caching for repeat visits
  • ✅ Optimize with quantization
  • ✅ Test across browsers and devices
  • ✅ Monitor real-world performance

By following this systematic approach, you can deliver AI-powered web experiences that are fast, reliable, and privacy-preserving.


In the next post, we’ll dive deep into production-level optimization techniques to make your Web ML applications blazing fast.
