Tech 2026. 1. 14.

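[Tech Series 04] Production-Level Web ML Optimization

This post covers the optimization techniques (quantization, memory management, and more) that address the performance problems you run into when applying Web ML to a real service.

Production-Level Web ML Optimization Techniques

Going beyond a working Web ML implementation to a production-level service that actually satisfies users takes rigorous optimization. Compared with a high-performance server, the browser is constrained in memory, compute, and network bandwidth alike.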

1. Quantization: The Magic of Size and Speed

Quantization represents a model's weights and activations at lower precision to shrink the model and speed up inference.
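To make the idea concrete, here is a minimal sketch of affine (asymmetric) int8 quantization in plain JavaScript. The function names `quantizeInt8` and `dequantizeInt8` and the sample weights are illustrative assumptions, not part of any library; real toolchains usually add per-channel scales and calibrated ranges.

```javascript
// Affine (asymmetric) quantization of float values to int8.
// Each value is approximated as: real ≈ scale * (q - zeroPoint), q in [-128, 127].
function quantizeInt8(values) {
  let min = 0; // the range always includes 0 so that 0.0 stays exactly representable
  let max = 0;
  for (const v of values) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  const scale = (max - min) / 255 || 1; // fall back to 1 for all-zero input
  const zeroPoint = Math.round(-128 - min / scale);
  const q = new Int8Array(values.length);
  for (let i = 0; i < values.length; i++) {
    const x = Math.round(values[i] / scale) + zeroPoint;
    q[i] = Math.max(-128, Math.min(127, x)); // clamp to the int8 range
  }
  return { q, scale, zeroPoint };
}

function dequantizeInt8({ q, scale, zeroPoint }) {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) out[i] = scale * (q[i] - zeroPoint);
  return out;
}

const weights = new Float32Array([-1.2, -0.3, 0.0, 0.45, 2.1]); // sample values
const packed = quantizeInt8(weights);
const restored = dequantizeInt8(packed);

console.log(weights.byteLength, packed.q.byteLength); // 20 5
console.log(Array.from(restored));
```

The restored values differ from the originals by at most about half of `scale`, which is the precision given up in exchange for storing one byte per weight instead of four.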

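Float32 → Int8 Conversion

AI models typically store their weights as 32-bit floating point (float32). Converting them to 8-bit integers (int8) has the following effects.

Effects:

Model size reduced by 75% (32 bits → 8 bits per weight)
Lower memory usage
Shorter download time
Faster inference (where the runtime and hardware provide int8 kernels, integer arithmetic is faster than floating point)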

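# PyTorch static (post-training) quantization example
import torch
import torch.quantization

# Prepare the trained model. MyModel is your own nn.Module; for eager-mode
# static quantization its forward pass must be wrapped in QuantStub/DeQuantStub.
model_fp32 = MyModel()
model_fp32.load_state_dict(torch.load('model.pth'))
model_fp32.eval()  # post-training quantization expects eval mode

# Quantization config ('fbgemm' targets x86 CPUs)
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model_fp32, inplace=True)

# Run calibration data through the model to compute quantization parameters
# (calibration_data is an iterable of representative input batches)
with torch.no_grad():
    for data in calibration_data:
        model_fp32(data)

# Convert to a quantized model
model_int8 = torch.quantization.convert(model_fp32, inplace=False)

# Save
torch.save(model_int8.state_dict(), 'model_quantized.pth')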

Dynamic Quantization

Weights are quantized ahead of time, while activations are quantized dynamically at runtime.
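The split described above can be sketched in plain JavaScript: the weights are quantized once up front, while each incoming activation vector gets a scale computed from its own range at call time. `quantizeSymmetric` and `dynamicQuantizedDot` are hypothetical names, and symmetric per-tensor scaling is an assumption chosen for brevity. Production runtimes use optimized int8 kernels rather than a loop like this.

```javascript
// Symmetric int8 quantization: real ≈ scale * q, with q in [-127, 127].
function quantizeSymmetric(values) {
  let maxAbs = 0;
  for (const v of values) maxAbs = Math.max(maxAbs, Math.abs(v));
  const scale = maxAbs / 127 || 1; // fall back to 1 for all-zero input
  const q = Int8Array.from(values, (v) => Math.round(v / scale));
  return { q, scale };
}

// Ahead of time: quantize the weights once and keep only { q, scale }.
const weights = quantizeSymmetric([0.5, -1.0, 0.25, 0.75]);

// At runtime: quantize the activations using a scale derived from this input,
// accumulate the products as integers, then rescale back to a real value.
function dynamicQuantizedDot(activations, w) {
  const a = quantizeSymmetric(activations);
  let acc = 0;
  for (let i = 0; i < a.q.length; i++) acc += a.q[i] * w.q[i];
  return acc * a.scale * w.scale;
}

const x = [2.0, 1.0, -4.0, 0.5]; // sample activations
const exact = 0.5 * 2.0 + -1.0 * 1.0 + 0.25 * -4.0 + 0.75 * 0.5; // -0.625
const approx = dynamicQuantizedDot(x, weights);

console.log({ exact, approx });
```

Because the activation scale is recomputed per input, no calibration dataset is needed. The cost is the extra pass over the activations on every call.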

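// Using a quantized model in TensorFlow.js
// Note: weights quantized by the TensorFlow.js converter are restored to
// float32 when the model loads, so the main gain here is download size.
const model = await tf.loadGraphModel(
  "https://example.com/model_quantized/model.json",
);

// Inference with the quantized model (predict() is synchronous, so no await is needed)
const result = model.predict(inputTensor);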

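Accuracy vs. Size Trade-off

Quantization level	Model size	Accuracy loss (approx., model-dependent)	Recommended use
Float32 (original)	100%	0%	Precision is the top priority
Float16	50%	~0.1%	High-precision inference
Int8	25%	~1%	Most cases
Int4	12.5%	~3%	Size reduction is the priority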
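The size column follows directly from the bit width per weight. As a rough worked example, the sketch below estimates weight storage and download time for a hypothetical 5M-parameter model on a hypothetical 10 Mbps connection. Both figures and both function names are assumptions for illustration.

```javascript
// Approximate weight storage for a model, ignoring quantization metadata
// (scales, zero points) and any non-weight data in the file.
function estimateModelBytes(numParams, bitsPerWeight) {
  return Math.ceil((numParams * bitsPerWeight) / 8);
}

// Transfer time for a payload at a given bandwidth in megabits per second.
function downloadSeconds(bytes, megabitsPerSecond) {
  return (bytes * 8) / (megabitsPerSecond * 1e6);
}

const numParams = 5e6; // hypothetical 5M-parameter model
const bandwidthMbps = 10; // hypothetical connection speed
const baseline = estimateModelBytes(numParams, 32);

for (const [name, bits] of [["float32", 32], ["float16", 16], ["int8", 8], ["int4", 4]]) {
  const bytes = estimateModelBytes(numParams, bits);
  const percent = (100 * bytes) / baseline;
  const seconds = downloadSeconds(bytes, bandwidthMbps);
  console.log(`${name}: ${(bytes / 1e6).toFixed(1)} MB, ${percent}% of float32, ~${seconds.toFixed(1)} s`);
}
```

For this hypothetical model, float32 weights come to 20 MB and int8 weights to 5 MB, which matches the 100% and 25% rows of the table.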

Practical tip: quantize to int8 first and measure accuracy. If the loss is within your acceptable range, then consider further optimization.
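One simple way to do the measurement step is to run the original and quantized models on the same inputs and count how often their top predictions agree. The sketch below assumes classification outputs collected as arrays of logits. `top1Agreement`, the sample outputs, and the 2% tolerance are all hypothetical. Agreement with the float32 model is only a proxy, so checking accuracy against labeled data is still worthwhile when labels are available.

```javascript
// Index of the largest value in one row of logits.
function argmax(row) {
  let best = 0;
  for (let i = 1; i < row.length; i++) {
    if (row[i] > row[best]) best = i;
  }
  return best;
}

// Fraction of samples where both models pick the same top class.
function top1Agreement(referenceLogits, candidateLogits) {
  if (referenceLogits.length === 0 || referenceLogits.length !== candidateLogits.length) {
    throw new Error("Both outputs must cover the same non-empty set of samples");
  }
  let same = 0;
  for (let i = 0; i < referenceLogits.length; i++) {
    if (argmax(referenceLogits[i]) === argmax(candidateLogits[i])) same++;
  }
  return same / referenceLogits.length;
}

// Hypothetical outputs for four samples from the float32 and int8 models.
const fp32Outputs = [[2.1, 0.3, -1.0], [0.2, 1.7, 0.1], [0.9, 1.0, 0.8], [-0.5, 0.1, 2.2]];
const int8Outputs = [[2.0, 0.4, -0.9], [0.3, 1.6, 0.1], [1.0, 0.9, 0.8], [-0.4, 0.2, 2.1]];

const agreement = top1Agreement(fp32Outputs, int8Outputs); // 3 of 4 samples agree
const tolerance = 0.02; // hypothetical acceptable disagreement
console.log(agreement >= 1 - tolerance ? "within tolerance" : "accuracy regression", agreement);
```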

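2. Memory Management: Fighting the Browser's Limits

Web browsers limit the memory available to each tab, so running large models requires care.

Explicitly Releasing Tensor Memory

Tensors in TensorFlow.js are not reclaimed by JavaScript garbage collection (with the WebGL backend, their data lives in GPU memory), so you must explicitly release the tensors you have used.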

// ❌ Bad: memory leak
async function badInference(inputData) {
  const tensor = tf.tensor(inputData);
  const result = model.predict(tensor);
  return result.arraySync();
  // tensor and result are never disposed!
}

// ✅ Good: explicit cleanup
async function goodInference(inputData) {
  // tf.tidy takes a synchronous callback and disposes every tensor created
  // inside it except the one it returns
  const result = tf.tidy(() => model.predict(tf.tensor(inputData)));
  try {
    return await result.array(); // asynchronous readback avoids blocking the thread
  } finally {
    result.dispose(); // release the output once its values are copied out
  }
}

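Monitoring Memory Usage

// Check current memory usage
const memInfo = tf.memory();
console.log(`Allocated tensors: ${memInfo.numTensors}`);
console.log(`Bytes in use: ${memInfo.numBytes}`);
// `unreliable` means the byte count may be inaccurate on this backend; it is not a leak indicator
console.log(`Byte count reliable: ${memInfo.unreliable ? "no" : "yes"}`);

// Periodic memory check (1000 is an arbitrary threshold; tune it for your model)
setInterval(() => {
  if (tf.memory().numTensors > 1000) {
    console.warn("Possible memory leak!");
  }
}, 5000);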
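A fixed threshold only catches leaks once they are large. A more targeted check, sketched below, compares the tensor count before and after a single call. `countLeakedTensors` is a hypothetical helper. It takes the library object as a parameter, so it only assumes a `memory()` method returning `{ numTensors }`, which `tf.memory()` provides.

```javascript
// Runs `fn` once and reports how many tensors it left allocated.
// `tfLike` is any object with memory() -> { numTensors }, for example `tf` itself.
async function countLeakedTensors(tfLike, fn) {
  const before = tfLike.memory().numTensors;
  await fn();
  return tfLike.memory().numTensors - before;
}

// Usage in an app (assumes `tf`, a loaded model, and an inference function):
//   const leaked = await countLeakedTensors(tf, () => goodInference(sample));
//   if (leaked > 0) console.warn(`Inference leaked ${leaked} tensors`);
```

A result of 0 after a warm-up call suggests the code path cleans up after itself. Note that the first call to a model may allocate tensors that are intentionally kept, so it is better to check a second or third call.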

Batch Size Optimization

When memory is tight, reduce the batch size.

// Split large inputs into batches
async function processBatch(dataArray, batchSize = 32) {
  const results = [];

  for (let i = 0; i < dataArray.length; i += batchSize) {
    const batch = dataArray.slice(i, i + batchSize);
    const batchTensor = tf.tensor(batch);
    const predictions = model.predict(batchTensor); // predict() is synchronous

    results.push(...(await predictions.array()));

    // Release memory right after each batch
    batchTensor.dispose();
    predictions.dispose();
  }

  return results;
}
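To pick a starting batch size, a quick back-of-the-envelope calculation helps. The sketch below derives an upper bound from a memory budget and the input shape. The 224×224×3 shape, the 64 MiB budget, and the helper names are assumptions for illustration. It counts input tensors only, and intermediate activations need additional memory, so treat the result as a ceiling rather than a recommendation.

```javascript
// Bytes needed for one input sample of the given shape (float32 by default).
function bytesPerSample(shape, bytesPerElement = 4) {
  return shape.reduce((count, dim) => count * dim, 1) * bytesPerElement;
}

// Largest batch whose input tensor fits in the budget (never less than 1).
function maxBatchSize(budgetBytes, shape, bytesPerElement = 4) {
  return Math.max(1, Math.floor(budgetBytes / bytesPerSample(shape, bytesPerElement)));
}

const inputShape = [224, 224, 3]; // hypothetical image input
const budgetBytes = 64 * 1024 * 1024; // hypothetical 64 MiB budget for inputs

console.log(bytesPerSample(inputShape)); // 602112
console.log(maxBatchSize(budgetBytes, inputShape)); // 111
```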

3. Web Worker: Preventing UI Blocking

ML inference is compute-intensive, so running it on the main thread can freeze the UI.

Moving Inference to a Web Worker

// ml-worker.js
importScripts("https://cdn.jsdelivr.net/npm/@tensorflow/tfjs");

let model;

self.addEventListener("message", async (e) => {
  const { type, data } = e.data;

  if (type === "load") {
    model = await tf.loadGraphModel(data.modelUrl);
    self.postMessage({ type: "loaded" });
  } else if (type === "predict") {
    const tensor = tf.tensor(data.input);
    const result = model.predict(tensor);
    const output = await result.array();

    tensor.dispose();
    result.dispose();

    self.postMessage({ type: "result", output });
  }
});

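// main.js
const worker = new Worker("ml-worker.js");

// Register the handler before posting any message
worker.addEventListener("message", (e) => {
  if (e.data.type === "loaded") {
    // Run inference only after the model has finished loading; a "predict"
    // message sent earlier would reach the worker while `model` is still undefined
    // (imageData is your input data)
    worker.postMessage({
      type: "predict",
      data: { input: imageData },
    });
  } else if (e.data.type === "result") {
    console.log("Result:", e.data.output);
    updateUI(e.data.output); // updateUI is your own rendering function
  }
});

// Load the model
worker.postMessage({
  type: "load",
  data: { modelUrl: "model.json" },
});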
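When several predictions can be in flight at once, it helps to correlate each result with its request. The sketch below wraps a worker in a promise-based client. It assumes a small protocol extension that is not in the worker code above: every message carries an `id` that the worker echoes back, and failures are reported as `{ id, type: "error", error }`. `createWorkerClient` is a hypothetical helper name.

```javascript
// Wraps a worker-like object (postMessage + addEventListener) so that each
// request returns a promise resolved by the reply carrying the same id.
function createWorkerClient(worker) {
  let nextId = 0;
  const pending = new Map(); // id -> { resolve, reject }

  worker.addEventListener("message", (e) => {
    const { id, type, output, error } = e.data;
    const entry = pending.get(id);
    if (!entry) return; // not a reply to one of our requests
    pending.delete(id);
    if (type === "error") entry.reject(new Error(error));
    else entry.resolve(output);
  });

  return {
    request(type, data) {
      return new Promise((resolve, reject) => {
        const id = nextId++;
        pending.set(id, { resolve, reject });
        worker.postMessage({ id, type, data });
      });
    },
  };
}

// Usage in an app (assumes the worker echoes `id` in its replies):
//   const client = createWorkerClient(new Worker("ml-worker.js"));
//   await client.request("load", { modelUrl: "model.json" });
//   const output = await client.request("predict", { input: imageData });
```

With this shape, the caller can simply `await` the load before requesting a prediction, and concurrent predictions resolve independently even if replies arrive out of order.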

Image Processing with OffscreenCanvas

Transfer a canvas to the worker so that image preprocessing also runs in the background.

// Main thread
const canvas = document.getElementById("myCanvas");
const offscreen = canvas.transferControlToOffscreen();
worker.postMessage({ type: "canvas", canvas: offscreen }, [offscreen]);
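Once the worker holds the canvas, a typical preprocessing step is turning its RGBA pixel bytes into the float array a model expects. Below is a minimal sketch of that conversion as a pure function. `rgbaToNormalizedRgb` is a hypothetical name, and scaling to [0, 1] is an assumption, since some models expect [-1, 1] or mean/std normalization instead. In a worker, the input would typically come from `getImageData(...).data` on the canvas's 2D context.

```javascript
// Converts RGBA bytes (as found in ImageData.data) to a Float32Array of
// RGB values scaled to [0, 1], dropping the alpha channel.
function rgbaToNormalizedRgb(rgba, width, height) {
  if (rgba.length !== width * height * 4) {
    throw new Error("Pixel buffer does not match width * height * 4");
  }
  const out = new Float32Array(width * height * 3);
  for (let p = 0, o = 0; p < rgba.length; p += 4) {
    out[o++] = rgba[p] / 255; // R
    out[o++] = rgba[p + 1] / 255; // G
    out[o++] = rgba[p + 2] / 255; // B
  }
  return out;
}

// A 2x1 image: one red pixel and one light-blue pixel.
const pixels = new Uint8ClampedArray([255, 0, 0, 255, 0, 128, 255, 255]);
const input = rgbaToNormalizedRgb(pixels, 2, 1);
console.log(Array.from(input));
```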

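4. WebGPU: Next-Generation Acceleration

WebGPU, the successor to WebGL, provides more powerful control over the GPU.

Enabling the WebGPU Backend

// Using WebGPU in TensorFlow.js
// (the @tensorflow/tfjs-backend-webgpu package must be loaded alongside tfjs)
const enabled = await tf.setBackend("webgpu"); // resolves to false if initialization fails
await tf.ready();

console.log("Current backend:", tf.getBackend()); // 'webgpu' when enabled is true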
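Since WebGPU is not available everywhere, a fallback chain is a natural companion to the snippet above. The sketch below tries backends in order of preference and returns the first that initializes. `pickBackend` is a hypothetical helper and the preference order is an assumption. It takes the library object as a parameter and relies only on `setBackend(name)` resolving to a boolean and on `ready()`, both of which TensorFlow.js provides.

```javascript
// Tries each backend in order and returns the name of the first one that works.
// `tfLike` is any object with setBackend(name) -> Promise<boolean> and ready().
async function pickBackend(tfLike, preferred = ["webgpu", "webgl", "wasm", "cpu"]) {
  for (const name of preferred) {
    try {
      if (await tfLike.setBackend(name)) {
        await tfLike.ready();
        return name;
      }
    } catch (err) {
      // Backend not registered or failed to initialize: try the next one
    }
  }
  throw new Error("No usable backend found");
}

// Usage in an app (assumes the corresponding backend packages are loaded):
//   const backend = await pickBackend(tf);
//   console.log("Using backend:", backend);
```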

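Performance Comparison

Backend	Inference time (ms)	Characteristics
CPU	850	Supported in every browser, slowest
WebGL	45	Supported in most browsers, reasonable performance
WebGPU	12	Limited browser support (Chrome/Edge first), best performance
WASM + SIMD	120	Faster than the plain CPU backend, works without a GPU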
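Figures like these depend heavily on the model and the device, so it is worth measuring on your own targets. Below is a minimal timing harness, assuming nothing beyond the standard `performance.now()`. `benchmark` and its `warmup`/`runs` options are hypothetical names. Warm-up runs are excluded because the first GPU-backed calls often include one-time setup such as shader compilation.

```javascript
// Times an async function over several runs after a warm-up phase
// and reports the median, minimum, and maximum in milliseconds.
async function benchmark(fn, { warmup = 3, runs = 10 } = {}) {
  for (let i = 0; i < warmup; i++) await fn();

  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await fn();
    times.push(performance.now() - start);
  }

  times.sort((a, b) => a - b);
  const mid = Math.floor(times.length / 2);
  const median = times.length % 2 ? times[mid] : (times[mid - 1] + times[mid]) / 2;
  return { median, min: times[0], max: times[times.length - 1] };
}

// Usage in an app (assumes `model` and `input`); reading the data back makes
// sure asynchronous GPU work is included in the measurement:
//   const stats = await benchmark(async () => {
//     const out = model.predict(input);
//     await out.data();
//     out.dispose();
//   });
```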

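5. Model Architecture Optimization

Pruning

Pruning removes low-importance weights to make the model lighter. Unstructured pruning sets those weights to zero, so the file only shrinks once the result is compressed or stored in a sparse format.

import torch
import torch.nn.utils.prune as prune

# `model` is your trained nn.Module with conv1, conv2, and fc layers

# Prune a specific layer (remove the 30% of weights with the smallest L1 magnitude)
prune.l1_unstructured(model.conv1, name='weight', amount=0.3)

# Global pruning
parameters_to_prune = [
    (model.conv1, 'weight'),
    (model.conv2, 'weight'),
    (model.fc, 'weight'),
]

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.4  # remove 40% of all listed weights
)

# Make the pruning permanent (drops the mask and keeps the zeroed weights)
for module, name in parameters_to_prune:
    prune.remove(module, name)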
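To show what magnitude-based pruning does to the numbers themselves, here is a small plain-JavaScript sketch that zeroes the fraction of weights with the smallest absolute values. `pruneByMagnitude` and the sample weights are illustrative assumptions. It mirrors the idea of L1 unstructured pruning, not the PyTorch implementation.

```javascript
// Returns a copy of `weights` with the `amount` fraction of smallest-magnitude
// entries set to zero (unstructured magnitude pruning).
function pruneByMagnitude(weights, amount) {
  const count = Math.round(weights.length * amount); // how many weights to zero
  const order = Array.from(weights.keys()).sort(
    (a, b) => Math.abs(weights[a]) - Math.abs(weights[b]),
  );
  const pruned = Float32Array.from(weights);
  for (let i = 0; i < count; i++) pruned[order[i]] = 0;
  return pruned;
}

const w = new Float32Array([0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6, 0.07, -0.5]);
const pruned = pruneByMagnitude(w, 0.3);
console.log(Array.from(pruned));
```

Here the three smallest-magnitude weights (0.01, -0.05, and 0.07) become zero while the rest are untouched. The array is still the same length, which is why the savings only materialize after compression or sparse encoding.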

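Knowledge Distillation

Knowledge distillation transfers what a large teacher model has learned to a small student model.

import torch.nn.functional as F

# Loss for training the student model
def distillation_loss(student_logits, teacher_logits, true_labels, temperature=3.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction='batchmean'
    ) * (temperature ** 2)

    # Hard targets: ordinary cross-entropy against the true labels
    hard_loss = F.cross_entropy(student_logits, true_labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss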
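The role of the temperature in the loss above is easier to see numerically. The sketch below applies a temperature-scaled softmax to a set of hypothetical teacher logits. The `softmax` helper and the logit values are assumptions for illustration.

```javascript
// Softmax with temperature: higher temperatures flatten the distribution.
function softmax(logits, temperature = 1) {
  const scaled = logits.map((z) => z / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const teacherLogits = [6.0, 2.0, 1.0]; // hypothetical teacher output
const hard = softmax(teacherLogits, 1);
const soft = softmax(teacherLogits, 3);

console.log("T=1:", hard.map((p) => p.toFixed(3)));
console.log("T=3:", soft.map((p) => p.toFixed(3)));
```

At temperature 1 nearly all probability sits on the top class. At temperature 3 the other classes receive visibly more mass, and that relative ranking of "wrong" classes is the extra signal the student learns from.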

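6. Practical Optimization Checklist

Model level:

Quantization (int8 or beyond)
Pruning (remove unnecessary weights)
Knowledge distillation (switch to a lightweight model)
Layer fusion (merge Conv + BN + ReLU)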
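Layer fusion works because batch normalization at inference time is just a per-channel scale and shift, which can be folded into the preceding convolution's weights and bias. The sketch below checks this for a single output channel applied to one flattened input patch. `foldBatchNorm`, the parameter values, and the input are all illustrative assumptions.

```javascript
// Folds inference-time batch norm into the preceding layer's weights and bias:
//   s  = gamma / sqrt(variance + eps)
//   w' = w * s
//   b' = (b - mean) * s + beta
function foldBatchNorm(weights, bias, { gamma, beta, mean, variance, eps = 1e-5 }) {
  const s = gamma / Math.sqrt(variance + eps);
  return { weights: weights.map((w) => w * s), bias: (bias - mean) * s + beta };
}

const dot = (w, x) => w.reduce((acc, wi, i) => acc + wi * x[i], 0);
const relu = (v) => Math.max(0, v);

// One output channel of a convolution, applied to one flattened input patch.
const conv = { weights: [0.2, -0.5, 0.1], bias: 0.3 };
const bn = { gamma: 1.5, beta: -0.2, mean: 0.4, variance: 0.25, eps: 1e-5 };
const patch = [3.0, -1.0, 2.0];

// Unfused: Conv -> BN -> ReLU as three separate steps.
const convOut = dot(conv.weights, patch) + conv.bias;
const unfused = relu((bn.gamma * (convOut - bn.mean)) / Math.sqrt(bn.variance + bn.eps) + bn.beta);

// Fused: a single multiply-add with folded parameters, then ReLU.
const folded = foldBatchNorm(conv.weights, conv.bias, bn);
const fused = relu(dot(folded.weights, patch) + folded.bias);

console.log({ unfused, fused });
```

Both paths produce the same value up to floating-point rounding, but the fused version needs one pass over the data instead of three.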

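Runtime level:

Use the WebGPU backend (when supported)
Move inference to a Web Worker
Manage memory explicitly (tf.tidy)
Optimize batch size

Deployment level:

Gzip compression for model files
CDN caching (long cache lifetime)
Progressive loading
Offline support via Service Worker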
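For the caching and offline items, a cache-first lookup is one common approach for model files, which tend to be large and rarely change. The sketch below is written against injected `cache` and `fetchFn` parameters so that the logic stands on its own. It assumes only the `match`/`put` methods of the browser Cache interface and a `Response`-like object with `ok` and `clone()`. `cacheFirst` and the cache name in the usage comment are hypothetical.

```javascript
// Returns the cached response for `url` if present; otherwise fetches it,
// stores a copy of successful responses, and returns the network response.
async function cacheFirst(cache, fetchFn, url) {
  const hit = await cache.match(url);
  if (hit) return hit;

  const response = await fetchFn(url);
  if (response.ok) {
    await cache.put(url, response.clone()); // keep a copy; the body can be read only once
  }
  return response;
}

// Usage in a browser or Service Worker (the cache name is an assumption):
//   const cache = await caches.open("ml-models-v1");
//   const response = await cacheFirst(cache, fetch, "/models/model.json");
```

Versioning the cache name (or the model URL) gives a simple way to invalidate old weights when a new model ships.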

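Wrapping Up

Production Web ML is the work of closing the gap between something that simply "works" and something that "satisfies users."

Core principles:

Measure, then optimize: profile first instead of guessing
User experience first: load time < 3 s, inference time < 100 ms
Incremental improvement: apply one optimization at a time and measure its effect
Fallback strategy: keep a CPU backend ready for environments without GPU support

In the next post, we will share how Realays applied these techniques to a real service.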
