Cold starts

A cold start is the time between a request reaching a scaled-to-zero serverless endpoint and the worker being ready to run inference. The runtime layer (RunPod, in Socaity's case) handles the cold start. Socaity orchestrates which workers run and where. The live site advertises cold starts under 4 seconds for catalog models in the EU region.

See also: Serverless vs Dedicated, Job system, Billing (cold starts are not charged).

What a cold start is

When a serverless worker has scaled to zero and a new request arrives, the runtime does three things before it can serve that request. It pulls the container image layers, starts the container and initialises the Python environment, then loads the model weights from disk or object storage into GPU VRAM. The cumulative time for those three steps is the cold start. Requests that land on an already-running worker (warm starts) skip all of it.

It helps to know how the layers split. In serverless mode APIPod does very little: it exposes a handler(job) function that RunPod's worker process invokes. Pulling images, caching weights, keeping workers warm, and autoscaling all live at the RunPod layer. The numbers below are what the runtime layer delivers when Socaity routes a request to it.

Why it matters

A typical GPU serverless cold start runs 10 to 30 seconds depending on image size and weight volume. Socaity's hosted catalog targets under 4 seconds by routing to RunPod endpoints with image caching enabled and weights pre-staged on regional block storage. Custom APIPod deployments on RunPod can hit similar numbers, but the floor depends on your image size, your weight size, and whether you have pre-warmed workers configured on the RunPod endpoint.

The <4s cold start figure applies to Socaity-hosted catalog models. Your own APIPod deployments depend on image size, weight volume, and RunPod endpoint configuration. Measure before relying on a number.
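As a rough way to reason about the weight-size part of that floor, the sketch below estimates how long it takes to move a model's weights at a given sustained read throughput. The 2 bytes per parameter default assumes 16-bit weights, and the throughput value is a placeholder assumption, not a RunPod or Socaity figure; substitute a number you have measured on your own endpoint.

```python
def weight_size_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    bytes_per_param=2.0 assumes 16-bit (fp16/bf16) weights.
    """
    return params_billion * bytes_per_param


def estimated_load_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Lower bound on load time: data to move divided by sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput_gb_per_s must be positive")
    return size_gb / throughput_gb_per_s


# ASSUMPTION: 2 GB/s is an illustrative throughput, not a measured value.
ASSUMED_THROUGHPUT_GB_PER_S = 2.0

for params_b in (7, 70):
    size = weight_size_gb(params_b)
    secs = estimated_load_seconds(size, ASSUMED_THROUGHPUT_GB_PER_S)
    print(f"{params_b}B params: ~{size:.0f} GB, ~{secs:.0f} s at the assumed throughput")
```

The estimate is only a lower bound on the transfer itself: it ignores image pull, container start, and any deserialisation work, which is why measuring on the real endpoint still matters.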

The three phases

Container pull (2–8s)

Docker image layers fetched from the registry. Cached after the first pull, so only changed layers are re-downloaded on subsequent starts.

Container start (1–3s)

Runtime initialised, Python environment loaded. The cost here is roughly fixed regardless of model size.

Model load (2–10s)

Weights transferred from disk or object storage into GPU VRAM. This is the dominant variable. A 7B model loads faster than a 70B model.
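To see how the three phases add up, the sketch below sums the best-case and worst-case figures listed above. The phase names used as dictionary keys are illustrative labels, and the ranges are copied from this page rather than measured.

```python
# Phase ranges in seconds, copied from the three phases above.
PHASES = {
    "container_pull": (2, 8),
    "container_start": (1, 3),
    "model_load": (2, 10),
}


def cold_start_range(phases: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the best-case and worst-case durations across all phases."""
    best = sum(low for low, _ in phases.values())
    worst = sum(high for _, high in phases.values())
    return best, worst


best, worst = cold_start_range(PHASES)
print(f"cold start: {best} to {worst} s")

# Share of the worst case spent in each phase.
for name, (_, high) in PHASES.items():
    print(f"{name}: {high / worst:.0%} of worst case")
```

With these ranges the total comes to 5 to 21 seconds, and model load is the largest single share of the worst case, which is why weight size is the first thing to look at when trimming a cold start.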

How the runtime layer reduces cold starts

The mitigations below are RunPod-layer features. Socaity selects catalog endpoints that have them configured; APIPod users configure them on their own RunPod endpoints through the RunPod console.

Pre-warmed workers. A RunPod endpoint can keep one or more workers always-on through its active-workers setting. The first request after idle skips the pull and load phases entirely.
Image layer caching. RunPod caches container layers at the registry level. After the first pull, only changed layers are downloaded on subsequent starts.
Weight pre-staging. Model weights for Socaity catalog models are staged on regional block storage, removing the object-store round-trip from the load phase.
Regional routing. Socaity routes EU traffic to EU endpoints, so any cold start is paid on a worker in the same region rather than on a cross-continental pull.

When cold starts matter and when they do not

Use case	Cold start matters?	Recommendation
Real-time chat	Yes	Use dedicated GPU, or configure active workers on the RunPod endpoint to keep one worker always warm.
Batch image generation	No	Serverless is fine. The cold start amortises across the batch.
Long-running jobs (video, audio)	No	A 4s startup on a 5-minute job is under 1.5% overhead.
Latency-sensitive webhook receiver	Yes	Dedicated, or serverless with one active worker held warm and a warm-up cron as a safety net.
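The overhead figures in the table are simple ratios. The sketch below reproduces the 4-second-on-5-minutes case and shows how a cold start amortises over a batch; the function names are illustrative, and the batch calculation assumes every item in the batch is served by the same worker after a single cold start.

```python
def cold_start_overhead(cold_start_s: float, job_s: float) -> float:
    """Cold start as a fraction of the job's own run time."""
    if job_s <= 0:
        raise ValueError("job_s must be positive")
    return cold_start_s / job_s


def amortised_cold_start_s(cold_start_s: float, batch_size: int) -> float:
    """Cold-start seconds attributable to each item of a batch.

    ASSUMPTION: one cold start, then the whole batch runs on that worker.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return cold_start_s / batch_size


# 4 s startup on a 5-minute (300 s) job.
print(f"{cold_start_overhead(4, 300):.2%}")

# 4 s startup spread over a batch of 100 images.
print(f"{amortised_cold_start_s(4, 100):.2f} s per image")
```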

How to measure cold starts in your service

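Your APIPod handler only sees inference time, so measure the cold start from the client: log the time between job submission and the first byte of the response. Compare that against your warm-request P95 to see the actual cold-start delta for your image and weight size. Returning the inference time from the handler, as in the example below, lets you separate model time from startup and queueing time.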

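import time
from apipod import APIPod

app = APIPod()


class _PlaceholderModel:
    """Stand-in so the example is complete; swap in your real model."""

    def generate(self, prompt: str) -> str:
        return prompt


# Loaded once at import time, during the model-load phase of a cold start.
model = _PlaceholderModel()


@app.endpoint("/generate")
def run(prompt: str) -> dict:
    # Time only the inference. Container pull and model load
    # happen before this function is invoked, so they are
    # not visible to your handler.
    t0 = time.perf_counter()
    result = model.generate(prompt)
    elapsed_ms = (time.perf_counter() - t0) * 1000

    return {
        "result": result,
        "inference_ms": round(elapsed_ms, 1),
    }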
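Because the handler cannot see the pull and load phases, the cold-start delta has to be computed on the client side. The following is a minimal sketch of that comparison, assuming a hypothetical `send_request` callable that submits one job to your service and blocks until the response arrives; it is not part of APIPod or the Socaity SDK.

```python
import statistics
import time
from typing import Callable


def time_call(send_request: Callable[[], object]) -> float:
    """Return the wall-clock duration of one request in milliseconds."""
    t0 = time.perf_counter()
    send_request()
    return (time.perf_counter() - t0) * 1000


def p95(samples_ms: list[float]) -> float:
    """95th percentile of the samples; needs at least two data points."""
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def cold_start_delta_ms(cold_ms: float, warm_samples_ms: list[float]) -> float:
    """Extra latency of one cold request over the warm P95."""
    return cold_ms - p95(warm_samples_ms)


# Usage sketch (send_request is hypothetical and supplied by you):
#   cold_ms = time_call(send_request)                     # first call after idle
#   warm_ms = [time_call(send_request) for _ in range(20)]
#   print(cold_start_delta_ms(cold_ms, warm_ms))
```

Take the cold sample only after the endpoint has been idle long enough to scale to zero; otherwise the first call is just another warm request.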

Mitigation patterns

Pattern 1

Active workers on the RunPod endpoint

Keep one or more workers always-on through your RunPod endpoint settings. You pay the hourly GPU rate for those workers whether they are busy or not. The trade-off is idle cost in exchange for zero cold starts on the first N concurrent requests, where N is the number of active workers.
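The trade can be put into numbers with a short sketch like the one below. The hourly rate is a made-up placeholder rather than a RunPod price, and the overflow calculation assumes each worker serves one request at a time.

```python
def active_worker_cost(workers: int, hourly_rate: float, hours: float) -> float:
    """Fixed cost of keeping `workers` always-on for `hours`."""
    return workers * hourly_rate * hours


def requests_paying_cold_start(concurrent_requests: int, active_workers: int) -> int:
    """Concurrent requests that overflow the warm pool.

    ASSUMPTION: one in-flight request per worker.
    """
    return max(0, concurrent_requests - active_workers)


# ASSUMPTION: 0.50 per GPU-hour is a placeholder, not a real price.
PLACEHOLDER_HOURLY_RATE = 0.50

monthly = active_worker_cost(workers=1, hourly_rate=PLACEHOLDER_HOURLY_RATE, hours=24 * 30)
print(f"one active worker for 30 days: {monthly:.2f}")
print(f"cold starts in a burst of 5 with 2 active workers: {requests_paying_cold_start(5, 2)}")
```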

Pattern 2

Warm-up cron

Send a lightweight no-op request to your service every few minutes via a cron job. As long as the interval is shorter than the endpoint's idle timeout, the worker stays alive without the fixed cost of an active worker. If the pings stop or arrive too far apart, the worker still scales down.
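A warm-up script for such a cron job could look like the sketch below, using only the Python standard library. The URL, the bearer-token header, and the empty-prompt payload are all placeholders chosen for illustration; use whatever request your own service treats as cheap.

```python
import json
import urllib.request
from typing import Optional

# ASSUMPTION: placeholder URL; replace with your service's endpoint.
WARMUP_URL = "https://example.invalid/generate"


def build_warmup_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a small POST request intended to keep a worker warm."""
    # ASSUMPTION: an empty prompt is a cheap no-op for your handler.
    body = json.dumps({"prompt": ""}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def ping(url: str, token: Optional[str] = None, timeout_s: float = 30.0) -> int:
    """Send the warm-up request and return the HTTP status code."""
    request = build_warmup_request(url, token)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status


# Call ping(WARMUP_URL) from the script your cron job runs, for example
# with a crontab entry such as:
#   */4 * * * * /usr/bin/python3 /path/to/warmup.py
```

Each ping still runs on the worker, so keep the no-op as cheap as the handler allows.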

Pattern 3

Request-time prefetch

For bursty workloads, submit a warm-up job at the start of a session before the real request arrives. The warm-up pays the cold start; the real request lands on the now-warm worker.
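One way to structure the prefetch on the client is to fire the warm-up in the background when the session opens. The sketch below assumes a hypothetical `submit_job` callable that sends one job to your service and blocks until it finishes; the `Session` class and its parameter names are illustrative, not an existing SDK type.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class Session:
    """Fires a warm-up job when the session opens, before any real request."""

    def __init__(self, submit_job: Callable[[str], object], warmup_prompt: str = ""):
        self._submit_job = submit_job
        self._pool = ThreadPoolExecutor(max_workers=1)
        # Runs in the background so session setup is not blocked by the cold start.
        self._warmup: Future = self._pool.submit(submit_job, warmup_prompt)

    def generate(self, prompt: str, wait_for_warmup: bool = True) -> object:
        """Send the real request, optionally after the warm-up has finished."""
        if wait_for_warmup:
            # Blocks until the warm-up completes; a failed warm-up is ignored
            # and the real request is sent anyway.
            self._warmup.exception()
        return self._submit_job(prompt)

    def close(self) -> None:
        """Release the background thread."""
        self._pool.shutdown(wait=True)


# Usage sketch (submit_job is hypothetical and supplied by you):
#   session = Session(submit_job)      # warm-up starts here
#   ...user types a prompt...
#   result = session.generate(prompt)  # lands on the now-warm worker
#   session.close()
```

The pattern only pays off when there is real time between session start and the first request; if the user submits immediately, the real request simply waits behind the warm-up.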

Next steps

Concept

Serverless vs Dedicated GPU

When the cold-start hit makes dedicated cheaper than serverless.

Concept

Job system

How Socaity queues, polls, and finishes a single inference call.

Platform

Billing

Why cold-start seconds are not charged on active-only billing.
