Serverless vs Dedicated GPU
Choose the right compute model for your workload. Serverless scales to zero; dedicated gives you a warm, always-on GPU.
Pick the shape that matches your traffic. Both are deployed with the same APIPod CLI command — only the compute.mode in apipod.json differs.
Serverless
Scale to zero · Pay only when active
- Auto-scale 0 → 100 workers
- €0 when idle
- Public or private endpoints
Dedicated
Always-on · Zero cold start
- Always warm GPU memory
- Fixed monthly pricing
- Private by default
$ apipod --build --provider runpod

| Aspect | Dedicated GPU | Serverless GPU |
|---|---|---|
| Billing | Hourly, always running | Scales to zero when idle; pay-per-call when running |
| Cold Start | None (always warm) | 5–20s after idle period |
| Scaling | Manual replica management | Automatic 0–∞ replicas |
| Predictable latency | Yes — consistent P95 | Only after warm-up |
| DevOps overhead | Monitor & scale manually | None |
| Ideal traffic pattern | Steady, high-volume | Bursty or unpredictable |
| Min monthly cost | Fixed hourly rate × 24 × 30 | Zero when unused |
| SLA | 99.9% on Pro+ | 99.9% on Pro+ (excludes cold start) |
A cold start happens when a serverless worker has scaled to zero and a new request arrives. The container must be pulled, launched, and the model weights loaded into GPU VRAM.
Container Pull (2–8s)
Docker image layers fetched from registry. Cached after first pull.
Container Start (1–3s)
Runtime initialised, Python environment loaded.
Model Load (2–10s)
Weights transferred from disk/S3 to GPU VRAM. Biggest variable.
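The three stages above can be added up to get a rough cold-start budget. The sketch below is only arithmetic on the ranges quoted on this page; the `skip` parameter is a hypothetical convenience for modelling a cached image, and treating a cached pull as taking zero seconds is an assumption, not a measured figure.

```python
# Stage durations in seconds as (best case, worst case), copied from the breakdown above.
COLD_START_STAGES = {
    "container_pull": (2, 8),
    "container_start": (1, 3),
    "model_load": (2, 10),
}


def cold_start_range(stages, skip=()):
    """Sum the best-case and worst-case durations of every stage not in `skip`."""
    lows = [lo for name, (lo, hi) in stages.items() if name not in skip]
    highs = [hi for name, (lo, hi) in stages.items() if name not in skip]
    return sum(lows), sum(highs)


# Full cold start: all three stages run.
full = cold_start_range(COLD_START_STAGES)

# Assumption: a cached image makes the pull stage negligible.
cached = cold_start_range(COLD_START_STAGES, skip=("container_pull",))

print(f"full cold start:   {full[0]}-{full[1]} s")
print(f"with cached image: {cached[0]}-{cached[1]} s")
```

The full range comes out at 5 to 21 seconds, which lines up with the 5–20s figure in the comparison table. Model load is the widest range, so it is usually the first stage worth optimising.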
Set min_replicas: 1 in your APIPod config to keep one warm worker running. This eliminates cold starts at a fixed hourly cost.

The table below shows which mode wins at each traffic level, holding GPU type and average inference time fixed. For current per-GPU rates and worked monthly examples, see socaity.ai/Pricing.
| Usage Pattern | Daily Requests | Dedicated cost shape | Serverless cost shape | Winner |
|---|---|---|---|---|
| Hobby / Low traffic | < 200 | Full hourly rate × 720 | Tiny — only active seconds | Serverless |
| Growing startup | 1,000 – 5,000 | Full hourly rate × 720 | Low — active seconds add up | Serverless |
| Scale-up | 20,000 – 50,000 | Full hourly rate × 720 | Medium — close to dedicated | Break-even |
| High-volume production | > 100,000 | Multi-GPU hourly × 720 | High — active seconds dominate | Dedicated |
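The cost shapes in the table above can be turned into a rough break-even estimate. This is a minimal sketch: the prices and the inference time are placeholder values invented for illustration, not APIPod or Socaity rates, and the model ignores cold starts, idle billing granularity, and multi-GPU setups.

```python
HOURS_PER_MONTH = 720  # 24 h x 30 days, matching the "x 720" in the table above


def dedicated_monthly_cost(hourly_rate, replicas=1):
    """Dedicated bills every hour of the month, regardless of traffic."""
    return hourly_rate * HOURS_PER_MONTH * replicas


def serverless_monthly_cost(daily_requests, seconds_per_request, rate_per_second):
    """Serverless bills only the seconds a worker is actively serving requests."""
    return daily_requests * 30 * seconds_per_request * rate_per_second


def break_even_daily_requests(hourly_rate, seconds_per_request, rate_per_second):
    """Daily request count at which both modes cost the same per month."""
    return dedicated_monthly_cost(hourly_rate) / (30 * seconds_per_request * rate_per_second)


# Hypothetical numbers for illustration only; look up real rates before deciding.
HOURLY_RATE = 0.50         # assumed dedicated price per GPU-hour
RATE_PER_SECOND = 0.0004   # assumed serverless price per active GPU-second
SECONDS_PER_REQUEST = 1.0  # assumed average inference time

for daily in (200, 5_000, 30_000, 100_000):
    dedicated = dedicated_monthly_cost(HOURLY_RATE)
    serverless = serverless_monthly_cost(daily, SECONDS_PER_REQUEST, RATE_PER_SECOND)
    print(f"{daily:>7} req/day  dedicated={dedicated:8.2f}  serverless={serverless:8.2f}")

break_even = break_even_daily_requests(HOURLY_RATE, SECONDS_PER_REQUEST, RATE_PER_SECOND)
print(f"break-even at about {round(break_even)} requests/day")
```

With these placeholder values the break-even lands inside the Scale-up band of the table. Different rates or longer inference times move it substantially, so rerun the estimate with your own numbers.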
Answer these four questions to choose the right mode.
Question 1
Do you have > 50,000 requests per day?
Question 2
Is latency critical — sub-second SLA?
Question 3
Is traffic unpredictable or bursty?
Question 4
Do you want zero infrastructure management?
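The four questions above can be sketched as a small helper. The mapping from answers to modes is an assumption inferred from the comparison table on this page (high volume and strict latency favour dedicated; bursty traffic and zero ops favour serverless). The function name and the simple vote count are hypothetical, not part of APIPod.

```python
def recommend_mode(high_volume, latency_critical, bursty, zero_ops):
    """Suggest a compute mode from yes/no answers to the four questions.

    high_volume:      more than 50,000 requests per day
    latency_critical: sub-second latency SLA required
    bursty:           traffic is unpredictable or bursty
    zero_ops:         no infrastructure management wanted
    """
    # Assumed weighting: each "yes" counts as one vote for the mode it favours.
    dedicated_votes = int(high_volume) + int(latency_critical)
    serverless_votes = int(bursty) + int(zero_ops)

    if dedicated_votes > serverless_votes:
        return "dedicated"
    if serverless_votes > dedicated_votes:
        return "serverless"
    # A tie means neither mode clearly wins on these answers alone.
    return "either"


print(recommend_mode(high_volume=True, latency_critical=True, bursty=False, zero_ops=False))
print(recommend_mode(high_volume=False, latency_critical=False, bursty=True, zero_ops=True))
print(recommend_mode(high_volume=True, latency_critical=False, bursty=True, zero_ops=False))
```

When the result is a tie, serverless with min_replicas: 1 is a plausible middle ground, since it keeps one warm worker while still scaling automatically.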
Switch between modes in your apipod.json file.
Serverless (default)

    {
      "service": {
        "name": "my-model",
        "provider": "socaity"
      },
      "compute": {
        "mode": "serverless",
        "gpu": "A4000",
        "min_replicas": 0,
        "max_replicas": 10
      }
    }

Dedicated (always-on)

    {
      "service": {
        "name": "my-model",
        "provider": "socaity"
      },
      "compute": {
        "mode": "dedicated",
        "gpu": "A4000",
        "replicas": 2,
        "health_check_path": "/health"
      }
    }

| Plan | Mode | Uptime SLA | P95 Latency Guarantee |
|---|---|---|---|
| Free | Serverless | No SLA | No guarantee |
| Plus | Serverless | 99.5% | No guarantee |
| Pro | Serverless | 99.9% | < 30s (warm) |
| Pro | Dedicated | 99.9% | < 2s (warm) |
| Ultimate | Dedicated | 99.99% | < 1s (warm) |
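The uptime percentages in the table above are easier to compare as a downtime budget. The sketch below converts each one to allowed minutes of downtime, assuming a 30-day measurement window; the actual SLA terms may define the window and exclusions differently.

```python
def monthly_downtime_minutes(uptime_percent, days=30):
    """Minutes of downtime permitted by an uptime percentage over `days` days."""
    total_minutes = days * 24 * 60
    return (1 - uptime_percent / 100) * total_minutes


# Uptime figures taken from the plan table above.
for plan, uptime in (("Plus", 99.5), ("Pro", 99.9), ("Ultimate", 99.99)):
    minutes = monthly_downtime_minutes(uptime)
    print(f"{plan:<9} {uptime:>6}%  ->  {minutes:6.1f} min/month")
```

Under that assumption, 99.5% allows about 216 minutes per month, 99.9% about 43 minutes, and 99.99% a little over 4 minutes.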