Serverless vs Dedicated GPU

Serverless scales workers to zero between requests and bills only for active GPU seconds. Dedicated keeps one worker warm at all times and bills a fixed monthly rate. This page helps you pick between them.

We recommend serverless for most integrations. Start there. Move to dedicated only after you measure steady traffic above roughly 50,000 requests per day, or when you need a P95 latency below the cold-start window.

Availability today: serverless runs on RunPod EU. Dedicated runs on RunPod EU through the Socaity dashboard, and as a local FastAPI server via APIPod. Dedicated on Scaleway and Azure is on the roadmap; the open-source resolver currently raises NotImplementedError for both. Pick serverless if you need to be live this quarter.

Two deployment paths

APIPod ships both backends in the same package. The orchestrator and compute mode you pass to the CLI (for example --compute serverless or --compute dedicated) decide which router runs. Your handler code does not change.

Serverless

Scale to zero · Pay only when active

Auto-scale 0 → 100 workers
€0 when idle
Public or private endpoints

Provider:

RunPod

Dedicated

Always-on · Zero cold start

Always warm GPU memory
Fixed monthly pricing
Private by default

Providers:

RunPod (available today)

Azure (roadmap)

Scaleway (roadmap)

$ apipod --build --provider runpod

Quick Comparison

Aspect	Dedicated GPU	Serverless GPU
Billing	Fixed monthly, always running	Scale to zero; pay per active GPU-second
Cold start	None (worker stays warm)	Under 4s for cached images; up to ~20s on first pull
Scaling	Single warm worker per deployment today	RunPod autoscales workers from zero on demand
Predictable latency	Yes, no cold-start tail	Yes once warm, with a cold-start tail after idle
DevOps overhead	You own the GPU instance lifecycle	RunPod handles the worker lifecycle
Ideal traffic pattern	Steady, high-volume, latency-sensitive	Bursty or unpredictable, including idle periods
Min monthly cost	Fixed hourly rate over 720 hours (24 hours, 30 days)	€0 when idle
Availability today	RunPod EU via dashboard, FastAPI for local/self-hosted	RunPod EU

Cold start deep-dive

A cold start happens when a serverless worker has scaled to zero and the next request arrives. RunPod pulls the container, starts the runtime, and loads model weights into GPU VRAM before APIPod's handler runs. The first request after idle pays this latency; subsequent requests do not, until the worker scales back to zero. The Socaity landing page quotes a cold start under 4 seconds for cached images and small models; the figures below show the wider envelope for first-pull or large-weights cases.

Container pull (2-8s)

RunPod fetches the Docker image layers. Cached on subsequent cold starts in the same region.

Container start (1-3s)

RunPod starts the runtime and APIPod loads the Python environment baked into the image.

Model load (2-10s)

Your handler transfers weights from disk into GPU VRAM. This is the dominant variable: large models or remote storage push the time toward the upper bound.
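To see how the three phases combine, the short Python sketch below adds up the ranges listed above. It is an illustration only, not part of APIPod: the function name is hypothetical, and it assumes that a cached image skips the pull phase entirely, which may be optimistic in practice.

```python
# Phase ranges in seconds, copied from the breakdown above.
PHASES = {
    "container_pull": (2.0, 8.0),
    "container_start": (1.0, 3.0),
    "model_load": (2.0, 10.0),
}


def cold_start_window(image_cached: bool) -> tuple[float, float]:
    """Return the (best, worst) cold-start latency in seconds.

    Assumption: a cached image contributes no pull time at all.
    """
    low = high = 0.0
    for name, (phase_low, phase_high) in PHASES.items():
        if image_cached and name == "container_pull":
            continue  # skip the pull phase when layers are already cached
        low += phase_low
        high += phase_high
    return low, high


if __name__ == "__main__":
    print("first pull:", cold_start_window(image_cached=False))
    print("cached image:", cold_start_window(image_cached=True))
```

Under these assumptions the first-pull envelope is 5 to 21 seconds and the cached-image envelope is 3 to 13 seconds. The cached best case of 3 seconds is consistent with the under-4-second figure quoted for small models, and the first-pull worst case is in line with the roughly 20 seconds in the comparison table.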

Keep-alive: configure a non-zero minimum-worker count on the RunPod endpoint to keep one worker warm. This trades the cold-start window for a fixed hourly cost. Set this in the Socaity dashboard or directly on the RunPod console; APIPod itself does not own the autoscaling policy.

Cost scenarios

These scenarios show which mode wins at each traffic level, holding GPU type and average inference time fixed. For current per-GPU rates and worked monthly examples, see socaity.ai/Pricing.

Usage Pattern	Daily Requests	Dedicated cost shape	Serverless cost shape	Winner
Hobby / low traffic	< 200	Full hourly rate over 720 hours	Pennies, only active seconds bill	Serverless
Growing startup	1,000 to 5,000	Full hourly rate over 720 hours	Low, active seconds add up	Serverless
Scale-up	20,000 to 50,000	Full hourly rate over 720 hours	Medium, approaching dedicated	Break-even
High-volume production	> 100,000	Multi-GPU hourly over 720 hours	High, active seconds dominate	Dedicated
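The break-even point in the table can be estimated with a few lines of arithmetic. The Python sketch below is a rough model, not Socaity's billing logic: the function names are hypothetical, the rates in the example are placeholders rather than real prices, and it ignores seconds billed during cold starts.

```python
HOURS_PER_MONTH = 720  # 24 hours x 30 days, as in the table above
DAYS_PER_MONTH = 30


def dedicated_monthly_cost(hourly_rate: float) -> float:
    """A dedicated worker bills every hour of the month, busy or idle."""
    return hourly_rate * HOURS_PER_MONTH


def serverless_monthly_cost(
    requests_per_day: float,
    seconds_per_request: float,
    rate_per_gpu_second: float,
) -> float:
    """Serverless bills only the GPU seconds spent serving requests."""
    active_seconds = requests_per_day * DAYS_PER_MONTH * seconds_per_request
    return active_seconds * rate_per_gpu_second


def break_even_requests_per_day(
    hourly_rate: float,
    seconds_per_request: float,
    rate_per_gpu_second: float,
) -> float:
    """Daily request volume at which both modes cost the same."""
    cost_per_daily_request = (
        DAYS_PER_MONTH * seconds_per_request * rate_per_gpu_second
    )
    return dedicated_monthly_cost(hourly_rate) / cost_per_daily_request


if __name__ == "__main__":
    # Placeholder rates for illustration only; use the real ones from
    # socaity.ai/Pricing.
    print(round(break_even_requests_per_day(0.50, 1.0, 0.0004)))
```

With these made-up inputs (0.50 per hour dedicated, 0.0004 per GPU-second serverless, 1 second per request), the break-even lands at about 30,000 requests per day. Longer inference times or a higher per-second rate pull the break-even down; shorter ones push it up.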

Decision guide

Run through these four questions in order. The first "yes" that lands you on dedicated is your answer; otherwise default to serverless.

Question 1

Do you have steady traffic above 50,000 requests per day?

Yes: dedicated is usually cheaper. No: serverless.

Question 2

Do you need a P95 below the cold-start window?

Yes: dedicated, or keep one warm worker on the RunPod endpoint. No: serverless.

Question 3

Is traffic unpredictable or bursty?

Yes: serverless. RunPod scales workers up from zero on demand.

Question 4

Do you want zero infrastructure management?

Yes: serverless. Dedicated still requires you to size and monitor the worker.
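The four questions can also be written as a small function, which is handy if you want to keep the decision next to your traffic metrics. This is a sketch of the guide above, not an APIPod API: the Workload class and pick_compute_mode function are hypothetical names. The returned mode strings match the values accepted by the --compute flag.

```python
from dataclasses import dataclass

STEADY_TRAFFIC_THRESHOLD = 50_000  # requests per day, from Question 1


@dataclass
class Workload:
    steady_requests_per_day: int
    needs_p95_below_cold_start: bool
    bursty: bool = False
    wants_zero_infra: bool = False


def pick_compute_mode(w: Workload) -> tuple[str, str]:
    """Return (mode, reason), checking the questions in the guide's order."""
    if w.steady_requests_per_day > STEADY_TRAFFIC_THRESHOLD:
        return "dedicated", "steady traffic above the threshold"
    if w.needs_p95_below_cold_start:
        # Keeping one warm worker on the serverless endpoint is the
        # alternative the guide mentions for this case.
        return "dedicated", "P95 target below the cold-start window"
    if w.bursty:
        return "serverless", "bursty or unpredictable traffic"
    if w.wants_zero_infra:
        return "serverless", "no infrastructure management"
    return "serverless", "default"


if __name__ == "__main__":
    print(pick_compute_mode(Workload(80_000, False)))
    print(pick_compute_mode(Workload(1_000, False, bursty=True)))
```

Only the first two questions can move the answer to dedicated; the last two confirm the serverless default and supply the reason.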

APIPod configuration

Switch between modes with the --orchestrator, --compute, and --provider flags. The same handler code runs in either backend; only the resolver behind it changes.

Serverless on RunPod

# Start a serverless backend (RunPod handler) locally for testing.
socaity start \
  --orchestrator socaity \
  --compute serverless \
  --provider runpod

Dedicated (FastAPI)

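# Start a dedicated, always-on FastAPI server on port 8000.
# --host 0.0.0.0 listens on all network interfaces; use 127.0.0.1 to
# accept connections from localhost only.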
socaity start \
  --orchestrator local \
  --compute dedicated \
  --provider localhost \
  --host 0.0.0.0 \
  --port 8000
