Cold starts
Serverless workers spin down when idle. The first request after a period of inactivity pays a startup cost. Here is what that cost is and how to control it.
See also: Serverless vs Dedicated · Job system · Billing (cold starts are not charged)
When a serverless worker has scaled to zero and a new request arrives, the runtime has to do three things before it can serve the request: download the container image layers, start the container and initialise the Python environment, then load the model weights from disk or object storage into GPU VRAM. The cumulative time for those three steps is the cold start. Requests that arrive at an already-running worker — warm starts — skip all of it.
Industry-typical cold starts on GPU serverless range from 10s to 30s depending on image size and weight volume. Socaity targets under 4s by combining image layer caching, weight prefetching, and regional warm pools. Provider baselines (without those mitigations) vary: RunPod typically 5–15s, Azure 6–18s, Scaleway 8–20s.
The <4s cold start figure is the target for Socaity-hosted models with caching active. Your own APIPod deployments depend on your image size and weight volume, so measure before relying on a number.
Container pull (2–8s)
Docker image layers fetched from the registry. Cached after the first pull — only changed layers are re-downloaded.
Container start (1–3s)
Runtime initialised, Python environment loaded. The cost here is fixed regardless of model size.
Model load (2–10s)
Weights transferred from disk or object storage into GPU VRAM. The biggest variable — a 7B model loads faster than a 70B model.
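Summing the per-phase ranges gives a rough end-to-end window for a cold start with no mitigations. The sketch below is only arithmetic over the figures listed above; `PHASES` and `cold_start_range` are illustrative names, not part of any API.

```python
# Phase ranges in seconds, copied from the breakdown above.
PHASES = {
    "container_pull": (2, 8),
    "container_start": (1, 3),
    "model_load": (2, 10),
}


def cold_start_range(phases):
    """Return (best_case, worst_case) total seconds across all phases."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


best, worst = cold_start_range(PHASES)
print(f"cold start: {best}-{worst}s")  # cold start: 5-21s
```

Model load contributes the widest range, which is why weight size tends to dominate the total.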
- Pre-warmed worker pools — a small pool of initialised workers is kept ready in each active region, absorbing the first wave of cold requests without a full cold-start penalty.
- Image layer caching — container layers are cached at the registry level. After the first pull, only changed layers are downloaded on subsequent starts.
- Weight prefetching — model weights for catalog models are pre-staged on regional block storage, removing the S3 round-trip from the startup path.
- Regional warm pools — warm pools are maintained per region. Requests routed to eu-west-1 draw from the EU warm pool, not from a global queue.
| Use case | Cold start matters? | Recommendation |
|---|---|---|
| Real-time chat | Yes | Use dedicated GPU or set min_replicas: 1 to keep one warm worker. |
| Batch image generation | No | Serverless is fine. The cold start amortises across the batch. |
| Long-running jobs (video, audio) | No | A 4s startup on a 5-minute job is under 1.5% overhead. |
| Webhook receiver with SLA | Yes | Dedicated, or serverless with min_replicas: 1 and a warm-up cron. |
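The "does it matter?" column comes down to the ratio of cold-start time to job duration. A minimal sketch of that arithmetic follows; `cold_start_overhead` is a hypothetical helper, and the 2-second chat reply is an assumed figure for contrast, not a measured one.

```python
def cold_start_overhead(cold_start_s: float, job_s: float) -> float:
    """Cold-start time as a percentage of the job's own duration."""
    if job_s <= 0:
        raise ValueError("job_s must be positive")
    return 100.0 * cold_start_s / job_s


# 4s startup on a 5-minute job (the long-running row in the table).
print(round(cold_start_overhead(4, 5 * 60), 2))  # 1.33

# Same startup on an assumed 2s chat reply: startup dwarfs the work itself.
print(round(cold_start_overhead(4, 2), 2))  # 200.0
```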
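Log the time between job submission and the first byte of response on the client side, and have your APIPod handler report its own inference time so you can separate startup from inference. Compare the end-to-end figure against your warm-request P95 to see the actual cold-start delta for your image and weight size.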
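As a rough sketch of that comparison, assume you have already collected end-to-end latencies on the client: one for a request that hit a scaled-to-zero worker and several for warm requests. The helper names and sample values here are hypothetical.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) computed in integers to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cold_start_delta(cold_latency_s, warm_latencies_s):
    """Extra seconds the cold request took beyond the warm P95."""
    return cold_latency_s - p95(warm_latencies_s)


# Hypothetical client-side measurements, in seconds.
warm = [0.8, 0.9, 1.0, 1.1, 1.2, 0.9, 1.0, 1.0, 1.1, 1.3]
cold = 5.3
print(round(cold_start_delta(cold, warm), 2))  # 4.0
```

The handler-side timer that follows covers inference only, so it complements this client-side figure without replacing it.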
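
    import time

    from apipod import APIPod

    app = APIPod()

    # `model` is assumed to be loaded once at module import,
    # so its load time is part of the cold start, not of each request.

    @app.endpoint("/generate")
    def run(prompt: str) -> dict:
        # Time the inference step only; startup cost is not visible from here.
        t0 = time.perf_counter()
        # --- your model inference here ---
        result = model.generate(prompt)
        # ---------------------------------
        elapsed_ms = (time.perf_counter() - t0) * 1000
        return {
            "result": result,
            "inference_ms": round(elapsed_ms, 1),
        }

Pattern 1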
Minimum replica count
Set min_replicas: 1 in apipod.json to keep one worker permanently warm. You pay the hourly GPU rate for that worker whether it is busy or not — trade idle cost for zero cold starts.
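To put a number on the idle cost, multiply the hourly GPU rate by the hours in a month. The sketch below assumes a 730-hour average month and uses a made-up hourly rate; substitute the rate from your own billing page.

```python
HOURS_PER_MONTH = 730  # average month: 8760 hours / 12


def monthly_idle_cost(hourly_rate: float, min_replicas: int = 1) -> float:
    """Cost of keeping min_replicas workers warm for a whole month."""
    return hourly_rate * HOURS_PER_MONTH * min_replicas


# Hypothetical rate of 1.50 per GPU-hour, one warm worker.
print(monthly_idle_cost(1.50))  # 1095.0
```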
Pattern 2
Warm-up cron
Send a lightweight no-op request to your service every few minutes via a cron job. Keeps the worker alive without the fixed cost of min_replicas. The worker still scales down after a longer idle window.
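A minimal sketch of such a warm-up pinger, using only the standard library. The URL in the usage comment is a placeholder, and the no-op route itself is an assumption: point it at whatever cheap endpoint your service exposes.

```python
import time
import urllib.request


def ping(url: str, timeout_s: float = 10.0) -> int:
    """Send one lightweight GET and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def keep_warm(send, interval_s: float, rounds: int, sleep=time.sleep) -> int:
    """Call send() `rounds` times, sleeping interval_s between calls.

    Network errors are counted as misses rather than raised, so a single
    failed ping does not stop the loop. Returns the number of successes.
    """
    ok = 0
    for i in range(rounds):
        try:
            send()
            ok += 1
        except OSError:  # urllib's URLError and HTTPError subclass OSError
            pass
        if i < rounds - 1:
            sleep(interval_s)
    return ok


# Usage (placeholder URL, pings every 4 minutes, three times):
# keep_warm(lambda: ping("https://example.invalid/health"), 240, 3)
```

If a real cron scheduler drives the pings, a single `ping(...)` call per invocation is enough and the loop is unnecessary.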
Pattern 3
Request-time prefetch
For bursty workloads, submit a warm-up job at the start of a session before the real request arrives. The warm-up pays the cold start; the real request arrives warm.
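One way to sketch this pattern is to fire the warm-up in a background thread when the session opens, then wait for it before sending the first real request. `Session`, `warm_up`, and `submit` are hypothetical names; the two callables stand in for whatever client calls your application already makes.

```python
from concurrent.futures import ThreadPoolExecutor


class Session:
    """Starts a warm-up call in the background as soon as it is created."""

    def __init__(self, warm_up, submit):
        self._submit = submit
        self._pool = ThreadPoolExecutor(max_workers=1)
        # Kick off the warm-up immediately; it pays the cold start.
        self._warm = self._pool.submit(warm_up)

    def request(self, payload):
        # Wait until the warm-up has finished so this request lands on a
        # running worker. A failed warm-up is ignored, not propagated.
        try:
            self._warm.result()
        except Exception:
            pass
        return self._submit(payload)

    def close(self):
        self._pool.shutdown(wait=True)
```

Waiting on the warm-up is a design choice: sending the real request while the worker is still starting would most likely just queue behind the same cold start.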