Socaity Docs

Serverless vs Dedicated GPU

Choose the right compute model for your workload. Serverless scales to zero; dedicated gives you a warm, always-on GPU.

Two deployment paths

Pick the shape that matches your traffic. Both are deployed with the same APIPod CLI command; only the compute section of apipod.json differs.

Serverless

Scale to zero · Pay only when active

  • Auto-scale 0 → 100 workers
  • €0 when idle
  • Public or private endpoints
Provider: RunPod

Dedicated

Always-on · Zero cold start

  • Always warm GPU memory
  • Fixed monthly pricing
  • Private by default
Providers: Azure, Scaleway, RunPod

$ apipod --build --provider runpod

Quick Comparison

| Aspect | Dedicated GPU | Serverless GPU |
| --- | --- | --- |
| Billing | Hourly, always running | Scales to zero when idle; pay-per-call when running |
| Cold start | None (always warm) | 5–20s after idle period |
| Scaling | Manual replica management | Automatic, 0–100 workers |
| Predictable latency | Yes, consistent P95 | Only after warm-up |
| DevOps overhead | Monitor and scale manually | None |
| Ideal traffic pattern | Steady, high-volume | Bursty or unpredictable |
| Min monthly cost | Fixed hourly rate × 24 × 30 | Zero when unused |
| SLA | 99.9% on Pro+ | 99.9% on Pro+ (excludes cold start) |

Cold Start Deep-Dive

A cold start happens when a serverless worker has scaled to zero and a new request arrives. Before the request can be served, the container image must be pulled, the container started, and the model weights loaded into GPU VRAM.

Container Pull (2–8s)

Docker image layers fetched from registry. Cached after first pull.

Container Start (1–3s)

Runtime initialised, Python environment loaded.

Model Load (2–10s)

Weights transferred from disk/S3 to GPU VRAM. Biggest variable.
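
As a rough check, the three stage ranges above can be summed to estimate the total cold-start window. The sketch below is illustrative only: the function name and the `image_cached` flag are assumptions, and treating a cached image as a zero-cost pull is a simplification. The uncached total it produces (5 to 21 seconds) is close to the 5–20s quoted in the Quick Comparison.

```python
# Stage ranges in seconds, copied from the breakdown above.
STAGES = {
    "container_pull": (2, 8),
    "container_start": (1, 3),
    "model_load": (2, 10),
}


def cold_start_range(stages, image_cached=False):
    """Return (best, worst) total cold-start time in seconds.

    image_cached=True skips the pull stage, assuming (as a simplification)
    that a cached image costs roughly nothing to fetch.
    """
    best = worst = 0
    for name, (low, high) in stages.items():
        if image_cached and name == "container_pull":
            continue
        best += low
        worst += high
    return best, worst


print(cold_start_range(STAGES))                     # (5, 21)
print(cold_start_range(STAGES, image_cached=True))  # (3, 13)
```

Model load has the widest range, so reducing the size of the weights is usually the first lever to try.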

Cost Scenarios

The table below shows which mode wins at each traffic level, holding GPU type and average inference time fixed. For current per-GPU rates and worked monthly examples, see socaity.ai/Pricing.

| Usage pattern | Daily requests | Dedicated cost shape | Serverless cost shape | Winner |
| --- | --- | --- | --- | --- |
| Hobby / Low traffic | < 200 | Full hourly rate × 720 | Tiny: only active seconds | Serverless |
| Growing startup | 1,000 – 5,000 | Full hourly rate × 720 | Low: active seconds add up | Serverless |
| Scale-up | 20,000 – 50,000 | Full hourly rate × 720 | Medium: close to dedicated | Break-even |
| High-volume production | > 100,000 | Multi-GPU hourly × 720 | High: active seconds dominate | Dedicated |
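
The break-even point follows from the two cost shapes: dedicated costs the hourly rate × 720 regardless of traffic, while serverless costs grow with active seconds. The sketch below works this out under stated assumptions. The rates and the one-second inference time are hypothetical placeholders chosen for illustration, not Socaity prices, and the model ignores cold starts and multi-GPU setups.

```python
HOURS_PER_MONTH = 720  # 24 h x 30 days, as in the scenarios above
DAYS_PER_MONTH = 30


def dedicated_monthly(hourly_rate, gpus=1):
    """Fixed cost: every GPU is billed for every hour of the month."""
    return hourly_rate * HOURS_PER_MONTH * gpus


def serverless_monthly(daily_requests, seconds_per_request, rate_per_second):
    """Variable cost: only active GPU seconds are billed."""
    return daily_requests * DAYS_PER_MONTH * seconds_per_request * rate_per_second


def break_even_daily_requests(hourly_rate, seconds_per_request, rate_per_second):
    """Daily request count at which both modes cost the same."""
    cost_per_daily_request = DAYS_PER_MONTH * seconds_per_request * rate_per_second
    return dedicated_monthly(hourly_rate) / cost_per_daily_request


# Hypothetical numbers, for illustration only.
HOURLY = 0.50        # EUR per GPU-hour, dedicated
PER_SECOND = 0.0004  # EUR per active GPU-second, serverless
SECONDS = 1.0        # average inference time per request

print(round(break_even_daily_requests(HOURLY, SECONDS, PER_SECOND)))  # 30000
```

Longer inference times or a higher per-second rate pull the break-even point down, so it is worth rerunning the calculation with the current rates for your GPU type.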

Decision Guide

Answer these four questions to choose the right mode.

Question 1

Do you have > 50,000 requests per day?

Yes → Dedicated is likely cheaper. No → Serverless.

Question 2

Is latency critical — sub-second SLA?

Yes → Dedicated, or Serverless with min_replicas: 1 so that one worker stays warm. No → Serverless.

Question 3

Is traffic unpredictable or bursty?

Yes → Serverless handles bursts automatically. Dedicated does not.

Question 4

Do you want zero infrastructure management?

Yes → Serverless. Dedicated requires you to monitor and scale replicas yourself.
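
The four questions can also be written down as a small helper. This is a minimal sketch, not part of APIPod: the function and key names are made up for illustration, and because the guide gives no "No" branch for questions 3 and 4, the sketch returns "either" there as an assumption. It reports one recommendation per question rather than merging them, since the guide does not rank the questions.

```python
def recommend(daily_requests, latency_critical, bursty_traffic, zero_ops):
    """Map the four answers to the per-question recommendation from the guide."""
    return {
        # Question 1: more than 50,000 requests per day?
        "volume": "dedicated" if daily_requests > 50_000 else "serverless",
        # Question 2: sub-second latency SLA?
        "latency": "dedicated or min_replicas: 1" if latency_critical else "serverless",
        # Question 3: unpredictable or bursty traffic? ("either" is an assumption)
        "burstiness": "serverless" if bursty_traffic else "either",
        # Question 4: zero infrastructure management? ("either" is an assumption)
        "operations": "serverless" if zero_ops else "either",
    }


answers = recommend(
    daily_requests=3_000,
    latency_critical=False,
    bursty_traffic=True,
    zero_ops=True,
)
for question, mode in answers.items():
    print(f"{question}: {mode}")
```

When the answers disagree, for example high volume combined with bursty traffic, the cost and latency questions usually deserve the most weight.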

APIPod Configuration

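Switch between modes by editing the compute section of your apipod.json file. In the examples below, serverless uses min_replicas and max_replicas, while dedicated uses a fixed replicas count and a health_check_path.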

Serverless (default)

apipod.json
{
  "service": {
    "name": "my-model",
    "provider": "socaity"
  },
  "compute": {
    "mode": "serverless",
    "gpu": "A4000",
    "min_replicas": 0,
    "max_replicas": 10
  }
}

Dedicated (always-on)

apipod.json
{
  "service": {
    "name": "my-model",
    "provider": "socaity"
  },
  "compute": {
    "mode": "dedicated",
    "gpu": "A4000",
    "replicas": 2,
    "health_check_path": "/health"
  }
}
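
Before deploying, it can help to check that the compute section matches the chosen mode. The sketch below is a hypothetical pre-flight check, not an APIPod feature: the key sets are inferred only from the two example files above and should not be read as an official schema.

```python
import json

# Keys seen in the two example files above; this is not an official schema.
MODE_KEYS = {
    "serverless": {"min_replicas", "max_replicas"},
    "dedicated": {"replicas", "health_check_path"},
}


def check_compute(config):
    """Return a list of problems found in the "compute" section."""
    compute = config.get("compute", {})
    mode = compute.get("mode")
    if mode not in MODE_KEYS:
        return [f"unknown compute.mode: {mode!r}"]
    present = set(compute)
    other = next(m for m in MODE_KEYS if m != mode)
    problems = [f"missing compute.{key}" for key in sorted(MODE_KEYS[mode] - present)]
    problems += [
        f"compute.{key} belongs to {other} mode"
        for key in sorted(MODE_KEYS[other] & present)
    ]
    return problems


example = json.loads("""
{
  "service": {"name": "my-model", "provider": "socaity"},
  "compute": {
    "mode": "dedicated",
    "gpu": "A4000",
    "replicas": 2,
    "health_check_path": "/health"
  }
}
""")

print(check_compute(example))  # []
```

An empty list means the file uses the same keys as the matching example. Anything else points at a key that was probably left over when switching modes.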

SLA Summary

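| Plan | Mode | Uptime SLA | P95 latency guarantee |
| --- | --- | --- | --- |
| Free | Serverless | No SLA | No guarantee |
| Plus | Serverless | 99.5% | No guarantee |
| Pro | Serverless | 99.9% | < 30s (warm) |
| Pro | Dedicated | 99.9% | < 2s (warm) |
| Ultimate | Dedicated | 99.99% | < 1s (warm) |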
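
To make the uptime percentages above concrete, they can be converted into a monthly downtime budget. This is a minimal sketch that assumes a 720-hour month (24 × 30, the same figure used in the cost scenarios). How downtime is actually measured and credited depends on the SLA terms, which this page does not spell out.

```python
HOURS_PER_MONTH = 720  # assumption: 24 h x 30 days


def downtime_minutes(uptime_percent, hours=HOURS_PER_MONTH):
    """Maximum downtime per month, in minutes, allowed by an uptime SLA."""
    return (100 - uptime_percent) / 100 * hours * 60


for pct in (99.5, 99.9, 99.99):
    print(f"{pct}% -> {downtime_minutes(pct):.1f} min/month")
# 99.5% -> 216.0 min/month
# 99.9% -> 43.2 min/month
# 99.99% -> 4.3 min/month
```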