Deploy Serverless

Deploy your APIPod service to serverless GPU. Pay only for the GPU seconds you use, scale to zero when idle, and handle traffic spikes automatically.

The flow at a glance

Four steps end to end. You build and push from your laptop, then deploy and call the service through Socaity.

1. Build: on your laptop
2. Push to registry: Docker Hub
3. Deploy: socaity.ai dashboard
4. Call: SDK or Test API

Before you open the wizard

Get four things in place before you start. Apart from the Socaity account and billing, none of it is Socaity-specific. If you have run a Docker container on any cloud, most of this is already done.

For the local setup that socaity build needs (Docker daemon, disk space, Apple Silicon notes, HF_TOKEN for gated weights), see the Prerequisites in APIPod Getting Started.

A Socaity account

An APIPod service image

Build it with socaity build. See Build. The output is a Docker image with your service and its runtime.

A Docker registry account

Docker Hub is the simplest. Push your image so Socaity can pull it. Public images need no credentials; private images need a Personal Access Token.

Billing set up

Deploys consume GPU time. Add a payment method at /account/billing. Idle time is free.

Available today

Deploy from the socaity.ai dashboard

The supported path today. Build locally with the socaity CLI, push to a registry, then deploy from socaity.ai/account/hosting.

Step 1: Build and push the image

Build your service image with socaity build, then push it to a registry (Docker Hub or any public/private registry Socaity can pull from).

# 1. Build your service image
socaity build

# 2. Authenticate to your registry, then push (Docker Hub shown).
#    docker push fails without a prior docker login.
docker login
docker tag apipod-myservice yourhandle/myservice:latest
docker push yourhandle/myservice:latest

# 3. Open https://socaity.ai/account/hosting and follow Step 2 below.

Step 2: Open the deploy wizard

The wizard at socaity.ai/account/hosting walks you through three sub-steps. No provider console needed. Socaity handles the RunPod, Scaleway, and Azure integration for you.

2.1 Source

Tell Socaity where to pull the container.

Docker image: paste a full image reference such as yourhandle/myservice:latest. The wizard validates the username/repository:tag format inline.
GitHub repository: coming soon. Deploy straight from a push, no Docker step.
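If you script your releases, you can catch a malformed image reference before pasting it into the wizard. The wizard's exact validation rules are not documented here, so the pattern below is an assumption: a simplified approximation of the username/repository:tag shape (lowercase name parts, a tag of up to 128 characters, no registry host). `looks_like_image_ref` is a hypothetical helper, not part of any Socaity tool.

```python
import re

# Simplified approximation of username/repository:tag. Name parts are
# lowercase letters and digits, optionally joined by single '.', '_' or '-'.
# The tag starts with a letter, digit, or underscore and is at most 128 chars.
_NAME = r"[a-z0-9]+(?:[._-][a-z0-9]+)*"
_TAG = r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}"
IMAGE_REF = re.compile(rf"{_NAME}/{_NAME}:{_TAG}")


def looks_like_image_ref(ref: str) -> bool:
    """Return True if ref has the username/repository:tag shape."""
    return IMAGE_REF.fullmatch(ref.strip()) is not None


if __name__ == "__main__":
    for ref in ["yourhandle/myservice:latest", "myservice:latest", "yourhandle/myservice"]:
        print(ref, looks_like_image_ref(ref))
```

Only the first reference passes: the second has no username and the third has no tag. Docker itself accepts a few forms this rejects (for example repeated dashes), so treat a failure as "double-check", not as proof the wizard will refuse it.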

Private images. If your image is private on Docker Hub, expand Docker Hub authentication and paste a Personal Access Token with read scope. Public images need nothing.

2.2 Configure

Set the service identity and reserve the hardware it runs on.

Field	What to set
Service name	Human-readable label shown in My Services. Used to identify the endpoint.
GPU hardware	Pick the smallest class that fits your model. Larger GPUs cost more per second.
Container disk	In GB. At least the size of your model weights plus runtime overhead. A 27B FP8 model needs roughly 30 GB.
Environment variables	Optional key/value pairs (HF_TOKEN, OPENAI_API_KEY, custom settings). Treated as secrets, never baked into the image.

The smallest GPU and disk that fit your model give the cheapest service. You can resize later by redeploying.
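The Container disk row boils down to one line of arithmetic: 1e9 parameters at 1 byte each is about 1 GB, so the weights take parameters (in billions) times bytes per parameter, in GB. A minimal sketch, assuming a 3 GB runtime allowance for OS layers, Python, and CUDA libraries. That default is an assumption chosen to match the 27B FP8 example; CUDA base images are often larger, so measure your own.

```python
def estimate_container_disk_gb(
    params_billion: float,
    bytes_per_param: float,
    runtime_overhead_gb: float = 3.0,  # assumption; measure your own image
) -> float:
    """Rough container disk size: raw weight bytes plus a runtime allowance.

    1e9 parameters * bytes_per_param bytes = bytes_per_param GB (decimal),
    so the weights alone take params_billion * bytes_per_param GB.
    """
    return params_billion * bytes_per_param + runtime_overhead_gb


if __name__ == "__main__":
    # 27B parameters at FP8 (1 byte per parameter); FP16/BF16 would be 2.0
    print(estimate_container_disk_gb(27, 1.0))
```

With the defaults, a 27B FP8 model comes out at 30 GB, in line with the rough figure in the table. Round up rather than down: running out of disk mid-boot fails the deploy.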

2.3 Deploy

Review the summary and hit Deploy Service. The wizard creates the deployment order and redirects you to the service detail page, where provisioning starts.

Step 3: Watch provisioning

The service detail page polls the backend and shows progress in four checkpoints. First-time deploys take longer because the image has to be pulled and the model weights loaded onto the GPU.

Hosting record persisted

Instant. The service exists in Socaity. You can see it in My Services even while the pod is still booting.

OpenAPI spec fetched from pod

The worker pulls your image, boots the container, and reports the API schema. First-time deploys take several minutes on a cold start.

Service validated

The schema is parsed and the endpoint is registered. From this point on you can call the service from the SDK.

Endpoints registered

The dashboard shows the endpoint URL and a Test API button. Use the SDK or curl to send your first request.

Service Validated means your endpoint is registered and callable from the SDK. The endpoint URL appears at the top of the service detail page and under the Endpoints tab.
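If you deploy from a script, you may want to wait until the service is callable before sending traffic. Socaity's status API is not described on this page, so the sketch below takes a caller-supplied `is_ready()` check (hypothetical) and polls it with exponential backoff until it succeeds or a timeout expires. The 15-minute default is an assumption sized for slow first-time cold starts.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 900.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() with exponential backoff.

    Returns True as soon as is_ready() does, or False once timeout_s has
    elapsed. sleep is injectable so the function can be tested without waiting.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        sleep(min(delay, remaining))  # never sleep past the deadline
        delay = min(delay * 2, max_delay_s)  # back off: 2 s, 4 s, 8 s, ... capped
```

In practice `is_ready` could send a small request to the endpoint URL shown on the service detail page and return True on a successful response. Backing off keeps the check cheap during a multi-minute first boot.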

Step 4: Call your service

Hit the Test API button on the service detail page to send a request from the browser, or copy the endpoint URL and call it from the SDK. See Python SDK or SocaitySDK CLI.

Available today

Optimize and tune

These knobs live in your apipod-deploy/apipod.json and take effect on the next build.

Cold start tuning

Serverless containers spin down after a configurable idle period. Use these strategies to keep cold starts under 5 seconds for production workloads.

Reduce image size

Remove unused model weights and test fixtures from your Docker image. On a 1 Gbit/s pull, every GB saved cuts roughly 8 s of transfer time from a cold start (1 GB at 125 MB/s).
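To see what image size costs you on your own link, a back-of-the-envelope sketch of raw transfer time. It assumes a steady link and ignores layer decompression, extraction, and registry throttling, all of which add time on top. `efficiency` is a hypothetical knob for the share of the link a pull actually gets.

```python
def estimate_pull_seconds(image_gb: float, link_gbit_per_s: float, efficiency: float = 1.0) -> float:
    """Raw transfer time to pull image_gb gigabytes over a link_gbit_per_s link."""
    if link_gbit_per_s <= 0:
        raise ValueError("link_gbit_per_s must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    gigabits = image_gb * 8  # 1 byte = 8 bits
    return gigabits / (link_gbit_per_s * efficiency)


if __name__ == "__main__":
    for size_gb in (5, 10, 30):
        print(f"{size_gb} GB -> {estimate_pull_seconds(size_gb, 1.0):.0f} s at 1 Gbit/s")
```

Because the estimate is linear in image size, it also tells you directly what trimming a given number of GB buys you.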

Increase idle timeout

Set idle_timeout to 300 to keep containers warm for 5 minutes after the last request. Balances cold-start vs. idle cost.
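The idle-timeout trade-off is easy to explore against your own traffic. A minimal simulation, assuming a single worker and instantaneous requests (real requests take time and may fan out to more workers): a gap longer than `idle_timeout` means the container scaled down and the next request starts cold, while shorter gaps are time spent warm and waiting. `simulate_idle_timeout` is a hypothetical helper. How warm seconds are billed depends on your plan; see Pricing.

```python
from typing import Sequence, Tuple


def simulate_idle_timeout(request_times_s: Sequence[float], idle_timeout_s: float) -> Tuple[int, float]:
    """Return (cold_starts, warm_idle_seconds) for one worker."""
    if not request_times_s:
        return 0, 0.0
    times = sorted(request_times_s)
    cold_starts = 1  # the first request always finds a cold container
    warm_idle = 0.0
    for prev, cur in zip(times, times[1:]):
        gap = cur - prev
        if gap > idle_timeout_s:
            cold_starts += 1            # container scaled to zero in between
            warm_idle += idle_timeout_s  # it idled for the full timeout first
        else:
            warm_idle += gap             # still warm when the next request arrived
    warm_idle += idle_timeout_s  # after the last request it idles out once more
    return cold_starts, warm_idle


if __name__ == "__main__":
    requests = [0, 30, 400, 420]  # seconds
    for timeout in (60, 300, 400):
        print(timeout, simulate_idle_timeout(requests, timeout))
```

Replaying a day of request timestamps through a few timeout values shows where extra warm time stops buying fewer cold starts.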

Cache models on network storage

Mount a persistent volume and download model weights once. Subsequent container starts skip the download.

Use keep-warm workers

Set min_workers to 1 in apipod.json. One permanently warm worker removes cold starts for traffic a single worker can absorb, at a fixed daily cost. Bursts that scale beyond it still start additional containers cold.

Bake weights vs download at runtime

Baking model weights into the image gives a faster cold start but a bigger image. Downloading at runtime keeps the image small but makes the first boot slow. For scale-to-zero, prefer baking.

An example serverless block in apipod.json:

{
  "serverless": {
    "min_workers": 1,
    "max_workers": 10,
    "idle_timeout": 60
  }
}

Keep-warm workers are billed at the same rate as active jobs while running. See socaity.ai/Pricing for current per-GPU rates.
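The "fixed daily cost" of keep-warm workers is simple arithmetic: warm workers times the per-GPU-second rate times 86,400 seconds. This is the floor you pay even with zero traffic, since any requests the warm worker serves fall within the same billed seconds. The rate in the sketch is a made-up placeholder, not a Socaity price; take the real one for your GPU class from socaity.ai/Pricing.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def keep_warm_daily_cost(min_workers: int, price_per_gpu_second: float) -> float:
    """Daily cost of min_workers warm workers billed per GPU-second around the clock."""
    if min_workers < 0 or price_per_gpu_second < 0:
        raise ValueError("min_workers and price_per_gpu_second must be non-negative")
    return min_workers * price_per_gpu_second * SECONDS_PER_DAY


if __name__ == "__main__":
    hypothetical_rate = 0.0005  # per GPU-second; placeholder, NOT a real rate
    print(f"{keep_warm_daily_cost(1, hypothetical_rate):.2f} per day")
```

Comparing this number with the cost of the cold starts you would otherwise accept is the practical way to decide between min_workers 0 and 1.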

Bake weights vs download at runtime

Where the model weights live decides how long the first request takes. Two paths:

Approach	Cold start	Image size	Update story
Bake into the image	Fast. Weights are already on disk when the container boots.	Large. Tens of GB for big models.	Rebuild + repush on every weight change.
Download at runtime	Slow on first boot (minutes for big models). Subsequent boots benefit from the worker's local cache.	Small. Just code and runtime.	Update by changing the weight source, no image rebuild.

For gated Hugging Face weights set HF_TOKEN in the wizard's Environment variables (Step 2.2). Mount a persistent volume to cache downloaded weights across cold starts.
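With runtime downloads on a persistent volume, the service should fetch weights only when the cache is empty. A minimal download-once sketch, assuming the volume is mounted at a path you choose and `download` is your own fetch function (both hypothetical here). The completion marker is written only after the download returns, so a container killed mid-download retries on its next boot instead of trusting a half-written cache.

```python
from pathlib import Path
from typing import Callable


def ensure_weights(cache_dir: Path, download: Callable[[Path], None]) -> Path:
    """Fetch model weights into cache_dir once and reuse them on later boots."""
    marker = cache_dir / ".download-complete"
    if marker.exists():
        return cache_dir  # warm path: weights already on the persistent volume
    cache_dir.mkdir(parents=True, exist_ok=True)
    download(cache_dir)  # e.g. pull gated weights from Hugging Face with HF_TOKEN
    marker.touch()       # only reached if download() did not raise
    return cache_dir
```

At startup you might call `ensure_weights(Path("/models/my-model"), fetch_from_hub)`, where the path and `fetch_from_hub` stand in for your mount point and download code (huggingface_hub's `snapshot_download` is one option for the latter). The sketch does not lock against two workers downloading into the same volume at once.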

GPU sizing by model

Rough starting point. VRAM headroom depends on quantisation (FP16, FP8, GPTQ), batch size, and KV cache. Verify on a small load test before pinning a class.

Model size	Suggested GPU	Notes
<7B params	`T4 / A10G`	Small LLMs, image gen, embeddings. 16-24 GB VRAM is enough at FP16.
7B-13B	`A100 40GB`	Llama 3 8B, Mistral 7B, Gemma 7B. Room for batching and KV cache.
13B-30B	`A100 80GB / H100`	Larger LLMs at FP16. FP8 quantisation halves VRAM and fits on A100 40GB.
30B-70B	`H100 80GB`	Llama 3 70B, Qwen 72B. FP8 or GPTQ recommended.
70B+	`8x H100 (multi-GPU)`	Frontier models. Needs tensor parallelism. Coordinate with Socaity for capacity.
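To sanity-check a row before a load test, estimate weight memory (parameters times bytes per parameter) and add headroom for KV cache, activations, and the CUDA context. A rough sketch under stated assumptions: 20% headroom is a guess that long contexts or big batches will exceed, and 0.5 bytes per parameter for 4-bit GPTQ ignores its small quantisation-scale overhead. The VRAM figures are the single-GPU sizes of the classes named in the table.

```python
from typing import Optional

BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "gptq-4bit": 0.5}

# (GPU class, VRAM in GB), smallest first
GPUS = [("T4", 16), ("A10G", 24), ("A100 40GB", 40), ("A100 80GB", 80), ("H100 80GB", 80)]


def weights_vram_gb(params_billion: float, precision: str) -> float:
    """VRAM taken by the weights alone, in GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def smallest_single_gpu(params_billion: float, precision: str, headroom: float = 0.2) -> Optional[str]:
    """Smallest listed GPU whose VRAM covers weights plus a headroom fraction."""
    needed = weights_vram_gb(params_billion, precision) * (1 + headroom)
    for name, vram_gb in GPUS:
        if needed <= vram_gb:
            return name
    return None  # no single GPU fits: quantise further or go multi-GPU


if __name__ == "__main__":
    for params, precision in [(7, "fp16"), (13, "fp16"), (70, "fp16"), (70, "gptq-4bit")]:
        print(params, precision, smallest_single_gpu(params, precision))
```

The result is a starting class for the load test the paragraph above recommends, not a substitute for it.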

Environment variables in production

Two ways to feed runtime secrets to your service. Pick one and stick to it. Never bake secrets into the image.

From the dashboard: add them under Environment variables in Step 2.2 above. Stored encrypted, injected at container start.
From apipod.json: useful when the same set of variables travels with the project. The dashboard merges its values on top.

{
  "env": {
    "HF_TOKEN": "hf_xxxxxxxxxxxxxxxx",
    "OPENAI_API_KEY": "sk-xxxxxxxxxxxxxxxx"
  }
}
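"The dashboard merges its values on top" most naturally reads as a per-key override: a variable set in both places takes the dashboard value, and file-only keys pass through. Assuming that reading, the effective environment can be sketched as a plain dict merge. `effective_env` is a hypothetical helper for reasoning about precedence, not a Socaity API.

```python
import json
from typing import Dict


def effective_env(apipod_json_text: str, dashboard_env: Dict[str, str]) -> Dict[str, str]:
    """Merge the "env" block of apipod.json with dashboard variables.

    In {**a, **b}, keys from b win, so dashboard values override file values
    of the same name while file-only keys are kept unchanged.
    """
    file_env = json.loads(apipod_json_text).get("env", {})
    return {**file_env, **dashboard_env}


if __name__ == "__main__":
    apipod_json = '{"env": {"HF_TOKEN": "from-file", "LOG_LEVEL": "info"}}'
    print(effective_env(apipod_json, {"HF_TOKEN": "from-dashboard"}))
```

One consequence of this precedence: rotating a secret in the dashboard is enough, even if a stale value is still sitting in apipod.json.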

Pricing

Pricing for serverless deployments is pay-per-call, based on the active billing mode of the underlying service and the GPU type chosen at deploy time. Storage and network egress are billed separately by the provider. Current rates per GPU and per service are published at socaity.ai/Pricing.

Idle cost is zero once the container has scaled to zero after the idle timeout. The exception is keep-warm workers (min_workers above 0), which are billed while they run.

Coming soon

Unified `socaity deploy` CLI

A one-shot CLI to build, push, and deploy in a single command. Everything below describes the planned shape. None of it ships today. Use the dashboard flow above for now.

Status today. socaity build is live and handles the image build (it delegates to APIPod under the hood). The unified socaity deploy command does not exist yet. The flag tables and config files below are reference for what will land.

Planned flags

Flag	Default	Description
`--serverless`	`false`	Deploy as a serverless endpoint (scales to zero).
`--provider <name>`	`socaity`	Target provider: runpod or socaity.
`--gpu <type>`	`A100`	GPU class to provision: T4, A10G, A100, or H100.
`--min-workers <n>`	`0`	Minimum warm workers (0 = full scale-to-zero).
`--max-workers <n>`	`10`	Maximum concurrent replicas.
`--idle-timeout <s>`	`60`	Seconds of idle before the container scales to zero.
`--name <str>`	`project name`	Human-readable name shown in the dashboard.

Planned provider configuration

You will be able to configure provider credentials once, either with socaity login or through environment variables. Saved credentials will live in ~/.socaity/config.toml.

# Interactive login (saves credentials to ~/.socaity/config.toml)
socaity login

# Or set environment variables directly
export RUNPOD_API_KEY="your-runpod-key"
export SOCAITY_API_KEY="your-socaity-key"

A planned ~/.socaity/config.toml would look like this:

[providers.runpod]
api_key = "rp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

[providers.socaity]
api_key = "sk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

[defaults]
provider = "runpod"
gpu      = "A100"
