Socaity Docs

Cold starts

Serverless workers spin down when idle. The first request after a period of inactivity pays a startup cost. Here is what that cost is and how to control it.

See also: Serverless vs Dedicated · Job system · Billing (cold starts are not charged)

What a cold start is

When a serverless worker has scaled to zero and a new request arrives, the runtime has to do three things before it can serve the request: download the container image layers, start the container and initialise the Python environment, then load the model weights from disk or object storage into GPU VRAM. The cumulative time for those three steps is the cold start. Requests that arrive at an already-running worker — warm starts — skip all of it.

Why it matters

Industry-typical cold starts on GPU serverless range from 10s to 30s depending on image size and weight volume. Socaity targets under 4s by combining image layer caching, weight prefetching, and regional warm pools. Provider baselines (without those mitigations) vary: RunPod typically 5–15s, Azure 6–18s, Scaleway 8–20s. (Source: providers.vue deep comparison table and HostingLandingHero.vue stat strip.)

The three phases

Container pull (2–8s)

Docker image layers are fetched from the registry. Layers are cached after the first pull, so only changed layers are re-downloaded.

Container start (1–3s)

Runtime initialised, Python environment loaded. The cost here is fixed regardless of model size.

Model load (2–10s)

Weights are transferred from disk or object storage into GPU VRAM. This is the biggest variable: a 7B model loads faster than a 70B model.
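
Summing the per-phase ranges above gives a rough envelope for a cold start before any mitigation. The following is a minimal sketch of that arithmetic; the dictionary keys and the function name are illustrative, and the numbers are simply the ranges quoted in the phase headings.

```python
# Per-phase (min, max) durations in seconds, copied from the headings above.
PHASES = {
    "container_pull": (2, 8),
    "container_start": (1, 3),
    "model_load": (2, 10),
}


def cold_start_envelope(phases: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Return the (best case, worst case) total across all phases."""
    best = sum(lo for lo, _ in phases.values())
    worst = sum(hi for _, hi in phases.values())
    return best, worst


best, worst = cold_start_envelope(PHASES)
print(f"cold start envelope: {best}-{worst}s")  # prints "cold start envelope: 5-21s"
```

Model load has the widest range of the three, so it is usually the first phase worth looking at when a cold start is slower than expected.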

How Socaity reduces cold starts

  • Pre-warmed worker pools — a small pool of initialised workers is kept ready in each active region, absorbing the first wave of cold requests without a full cold-start penalty.
  • Image layer caching — container layers are cached at the registry level. After the first pull, only changed layers are downloaded on subsequent starts.
  • Weight prefetching — model weights for catalog models are pre-staged on regional block storage, removing the S3 round-trip from the startup path.
  • Regional warm pools — warm pools are maintained per region. Requests routed to eu-west-1 draw from the EU warm pool, not from a global queue.

When cold starts matter and when they do not

| Use case | Cold start matters? | Recommendation |
|---|---|---|
| Real-time chat | Yes | Use dedicated GPU or set min_replicas: 1 to keep one warm worker. |
| Batch image generation | No | Serverless is fine. The cold start amortises across the batch. |
| Long-running jobs (video, audio) | No | A 4s startup on a 5-minute job is under 1.5% overhead. |
| Webhook receiver with SLA | Yes | Dedicated, or serverless with min_replicas: 1 and a warm-up cron. |
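
The overhead figure for long-running jobs can be checked with a one-line calculation. The sketch below is a hedged illustration, not part of any Socaity API: the function name and the `jobs_per_worker` parameter are assumptions introduced here to show how one cold start amortises across a batch.

```python
def cold_start_overhead(cold_start_s: float, job_s: float, jobs_per_worker: int = 1) -> float:
    """Cold start time as a fraction of the useful work done on one worker.

    One cold start is paid once, then spread over `jobs_per_worker` jobs
    that each take `job_s` seconds.
    """
    if job_s <= 0 or jobs_per_worker < 1:
        raise ValueError("job_s must be positive and jobs_per_worker at least 1")
    return cold_start_s / (job_s * jobs_per_worker)


# The example from the table: a 4s startup on a 5-minute (300s) job.
print(f"{cold_start_overhead(4, 300):.2%}")  # prints "1.33%"
```

The same function shows why short, latency-sensitive requests are the problem case: with a job time of a second or two and no batching, the cold start is larger than the work itself.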

How to measure cold starts in your service

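Record the time between job submission and the first byte of response on the calling side, and report inference time from your APIPod handler so the two can be separated. Compare the first request after an idle period against your warm-request P95 to see the actual cold-start delta for your image and weight size.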

python — log first-request timing
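import logging
import time

from apipod import APIPod

logger = logging.getLogger(__name__)
app = APIPod()

# `model` must be loaded before the first request (for example at module
# import), so its load time counts towards the cold start and not towards
# the per-request timing below.

@app.endpoint("/generate")
def run(prompt: str) -> dict:
    t0 = time.perf_counter()

    # --- your model inference here ---
    result = model.generate(prompt)
    # ---------------------------------

    elapsed_ms = (time.perf_counter() - t0) * 1000
    # Log the timing so warm and cold requests can be compared later.
    logger.info("inference_ms=%.1f", elapsed_ms)
    return {
        "result": result,
        "inference_ms": round(elapsed_ms, 1),
    }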
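
The handler only sees inference time, so the cold-start delta has to be computed from latencies measured on the calling side. The sketch below assumes you have already collected total latencies in seconds (submission to first byte) for warm requests and for the first request after an idle period; `timed`, `p95` and `cold_start_delta` are hypothetical helper names, not APIPod functions.

```python
import statistics
import time
from typing import Callable, Sequence


def timed(call: Callable[[], object]) -> float:
    """Run `call` once and return how long it took in seconds."""
    t0 = time.perf_counter()
    call()
    return time.perf_counter() - t0


def p95(samples: Sequence[float]) -> float:
    """95th percentile of the samples (needs at least two data points)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def cold_start_delta(first_request_s: float, warm_samples_s: Sequence[float]) -> float:
    """Extra latency of the first request compared with the warm P95."""
    return first_request_s - p95(warm_samples_s)
```

In use, wrap your own client call in `timed(...)` once after the service has been idle and several times while it is warm, then pass the results to `cold_start_delta`.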

Mitigation patterns

Pattern 1: Minimum replica count

Set min_replicas: 1 in apipod.json to keep one worker permanently warm. You pay the hourly GPU rate for that worker whether it is busy or not — trade idle cost for zero cold starts.

Pattern 2: Warm-up cron

Send a lightweight no-op request to your service every few minutes via a cron job. This keeps the worker alive without the fixed cost of min_replicas. The worker still scales down once the pings stop and the idle window elapses.
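
A warm-up ping can be as small as a single HTTP GET. The sketch below uses only the Python standard library; `WARMUP_URL` is a hypothetical placeholder, and `keep_warm` is an illustrative loop for environments without cron. The right interval depends on the idle timeout of your deployment, which is not specified here, so treat the 240 seconds in the usage note as an assumption.

```python
import time
import urllib.request
from typing import Callable

# Hypothetical URL: replace with a cheap endpoint of your own service.
WARMUP_URL = "https://example.invalid/health"


def http_ping(url: str = WARMUP_URL, timeout_s: float = 10.0) -> int:
    """Send one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def keep_warm(
    ping: Callable[[], object],
    interval_s: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `ping` `rounds` times, `interval_s` apart; return the success count."""
    ok = 0
    for i in range(rounds):
        try:
            ping()
            ok += 1
        except OSError:
            # urllib errors and timeouts subclass OSError; a failed ping
            # is skipped and the loop carries on.
            pass
        if i < rounds - 1:
            sleep(interval_s)
    return ok
```

With cron, schedule a script that calls `http_ping()` once per run. Without cron, something like `keep_warm(http_ping, interval_s=240, rounds=15)` covers roughly an hour of activity.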

Pattern 3: Request-time prefetch

For bursty workloads, submit a warm-up job at the start of a session before the real request arrives. The warm-up pays the cold start; the real request arrives warm.
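
One way to structure request-time prefetch on the client is to fire the warm-up in the background as soon as the session opens. The sketch below is an assumption-laden illustration: `Session` is a hypothetical class, and `send` stands in for whatever function submits a job to your service. It waits for the warm-up to finish before sending the real request, which is one possible design choice rather than documented platform behaviour.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class Session:
    """Starts a warm-up call in the background when the session opens."""

    def __init__(self, send: Callable[[str], str], warmup_payload: str = "warmup"):
        self._send = send
        self._pool = ThreadPoolExecutor(max_workers=1)
        # Submit the warm-up immediately; the caller is not blocked on it.
        self._warmup: Future = self._pool.submit(send, warmup_payload)

    def request(self, payload: str) -> str:
        """Send the real request once the warm-up has completed."""
        try:
            # Returns at once if the warm-up already finished.
            self._warmup.result()
        except Exception:
            # A failed warm-up should not block the real request.
            pass
        return self._send(payload)

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

Open the `Session` when the user starts interacting (for example when a page loads), so that the cold start overlaps with the time the user spends preparing the first real request.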