We are still building this section. Content may be incomplete.

Monitor and Optimize

Advanced

10 min

Track job latency in the Socaity SDK, read cost and endpoint health in the Socaity dashboard, and pick the smallest GPU that meets your latency target. This page covers what you can observe today and where to find it.

Where each signal lives. The Python SDK exposes only per-job timing, via job.runtime_info. Cost, billing history, replica counts, and endpoint health live in the Socaity dashboard. There is no per-call cost field on the SDK response object.

Video tutorial: coming soon.

Read Job Timing from the SDK

Every job on a RunPod- or Replicate-backed service returns a runtime_info tuple of (delay_seconds, execution_seconds). delay_seconds covers queue and cold-start time. execution_seconds is the billable GPU window. Read both after the job finishes.
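A minimal sketch of reading the tuple, assuming only what is stated above: a finished job object carries a runtime_info attribute shaped as (delay_seconds, execution_seconds). JobTiming, read_timing, and StubJob are illustrative names, not part of the SDK. The stub stands in for a real job so the snippet runs without credentials.

```python
from typing import NamedTuple, Optional


class JobTiming(NamedTuple):
    delay_seconds: float      # queue and cold-start time
    execution_seconds: float  # billable GPU window

    @property
    def total_seconds(self) -> float:
        """Wall-clock time from submission to completion."""
        return self.delay_seconds + self.execution_seconds


def read_timing(job) -> Optional[JobTiming]:
    """Return the timing of a finished job, or None if it has no runtime_info."""
    info = getattr(job, "runtime_info", None)
    if info is None:
        return None
    delay, execution = info
    return JobTiming(float(delay), float(execution))


class StubJob:
    """Hypothetical stand-in for a finished SDK job."""
    runtime_info = (2.5, 7.0)


timing = read_timing(StubJob())
print(f"queued {timing.delay_seconds:.1f}s, ran {timing.execution_seconds:.1f}s")
```

Returning None when the attribute is missing keeps the helper safe for services that are not RunPod- or Replicate-backed.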

The SDK does not expose cost, credits, or tokens_used on the response object. To attribute spend, log execution_seconds per call and reconcile against the dashboard's per-endpoint daily total.
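One way to keep the per-call log is a small in-process ledger that sums execution_seconds per endpoint and day. This is a sketch, not an SDK feature: ExecutionLedger is a hypothetical helper, and bucketing days in UTC is an assumption that may not match how the dashboard draws its day boundaries.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Optional


class ExecutionLedger:
    """Sums billable execution seconds per (endpoint, UTC day)."""

    def __init__(self) -> None:
        self._totals = defaultdict(float)

    def record(self, endpoint: str, execution_seconds: float,
               finished_at: Optional[datetime] = None) -> None:
        # Assumption: days are bucketed in UTC. Pass timezone-aware datetimes;
        # naive ones are interpreted as local time by astimezone().
        when = finished_at or datetime.now(timezone.utc)
        day = when.astimezone(timezone.utc).date().isoformat()
        self._totals[(endpoint, day)] += execution_seconds

    def daily_total(self, endpoint: str, day: str) -> float:
        """Total execution seconds for an endpoint on an ISO date (YYYY-MM-DD)."""
        return self._totals.get((endpoint, day), 0.0)


ledger = ExecutionLedger()
noon = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
ledger.record("example-endpoint", 1.5, noon)
ledger.record("example-endpoint", 2.0, noon)
print(ledger.daily_total("example-endpoint", "2024-01-15"))
```

Multiply the daily total by the per-second GPU price from the pricing page to get a figure you can compare with the dashboard.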

Polling and Timeout Defaults

The SDK polls job status every 1 second with a total timeout of 3,600 seconds. The poller tolerates up to four consecutive errors and raises on the fifth. get_result(timeout_s=...) returns None on timeout instead of raising, so check for None before using the result.
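Because a timeout surfaces as None rather than an exception, it can help to convert it into an explicit error at the call site. The wrapper below is a sketch that relies only on the get_result(timeout_s=...) behaviour described above. JobTimedOut, result_or_raise, and StubJob are hypothetical names used for illustration.

```python
class JobTimedOut(TimeoutError):
    """Raised when get_result() returned None, which signals a timeout."""


def result_or_raise(job, timeout_s: float):
    """Return the job result, or raise JobTimedOut if none arrived in time."""
    result = job.get_result(timeout_s=timeout_s)
    if result is None:
        raise JobTimedOut(f"no result within {timeout_s} s")
    return result


class StubJob:
    """Hypothetical stand-in that returns a fixed value from get_result()."""

    def __init__(self, result):
        self._result = result

    def get_result(self, timeout_s=None):
        return self._result


print(result_or_raise(StubJob("done"), timeout_s=30))
```

The trade-off is that a job whose legitimate result is None becomes indistinguishable from a timeout, so this pattern only suits services that always return a value.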

The Socaity Dashboard

Open https://socaity.ai/account for the panels the SDK does not expose. Bookmark these:

Endpoints: replica count, GPU class, observed cold-start percentiles.
Jobs: per-job status, duration, and billed amount (computed server-side, not on the SDK response).
Cost: daily and monthly spend by endpoint, end-of-month projection.
Alerts: spend caps and latency thresholds.

Check Endpoint Health (APIPod)

Every APIPod FastAPI service exposes GET /health. The endpoint returns one of: INITIALIZING, BOOTING, RUNNING, BUSY, ERROR. Use it from any uptime monitor or load-balancer probe.
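A probe can be sketched with the standard library alone. The five state names come from the list above; everything else here is an assumption. In particular, the response body format is not specified on this page, so parse_health accepts either a bare state string or a JSON object with a hypothetical "status" key. Treating BUSY as healthy is also a judgment call.

```python
import json
import urllib.request

HEALTH_STATES = {"INITIALIZING", "BOOTING", "RUNNING", "BUSY", "ERROR"}
# Assumption: a BUSY worker is alive and should not trip an uptime alert.
SERVING_STATES = {"RUNNING", "BUSY"}


def parse_health(body: str) -> str:
    """Extract the state from a plain-text or JSON /health response body."""
    text = body.strip()
    try:
        decoded = json.loads(text)
    except json.JSONDecodeError:
        decoded = text  # not JSON, treat the body as the bare state name
    if isinstance(decoded, dict):
        decoded = decoded.get("status", "")  # "status" key is an assumption
    state = str(decoded).strip().upper()
    if state not in HEALTH_STATES:
        raise ValueError(f"unexpected health state: {body!r}")
    return state


def is_serving(state: str) -> bool:
    """True when the service can accept or is processing requests."""
    return state in SERVING_STATES


def check_health(base_url: str, timeout_s: float = 5.0) -> str:
    """GET <base_url>/health and return the parsed state."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return parse_health(resp.read().decode("utf-8"))
```

Call check_health("http://localhost:8000") against a running service (the URL is a placeholder) and alert when is_serving returns False for longer than your expected boot time.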

Cold Starts: Where the Mitigations Live

A cold start is the gap between a request arriving and a warm worker being ready to serve it. APIPod itself does not implement autoscaling or warm-pool management. RunPod owns the lifecycle for serverless deployments: cold start, scale-to-zero, scale-up, and the warm-pool count are all RunPod settings, not APIPod settings. The main lever is the number of always-warm workers, which trades cold-start exposure against cost:

Lever (set on RunPod)	Effect	Cost impact
No always-warm workers	Full serverless. Cold start on every idle period.	Lowest
One always-warm worker	Zero cold starts for single-stream traffic.	Medium
N always-warm workers	Cold starts only above N concurrent requests.	Higher

Configure the warm-pool count on the RunPod endpoint that backs your Socaity deployment. The Socaity dashboard surfaces the observed cold-start percentiles per endpoint so you can verify the change took effect.
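To cross-check the dashboard figures from the client side, you can compute percentiles over the delay_seconds values you log per call. The sketch below uses the standard library's statistics.quantiles. delay_percentiles is a hypothetical helper, and since delay_seconds mixes queue time with cold-start time, the result is an upper bound on cold-start latency rather than the same metric the dashboard reports.

```python
import statistics
from typing import Dict, Iterable, Sequence


def delay_percentiles(delays: Iterable[float],
                      points: Sequence[int] = (50, 95)) -> Dict[int, float]:
    """Return the requested percentiles (1..99) of logged delay_seconds values."""
    samples = sorted(delays)
    if len(samples) < 2:
        raise ValueError("need at least two delay samples")
    # quantiles(n=100) yields 99 cut points; cut point p-1 is the p-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in points}


# Synthetic delays for illustration only, not measurements.
print(delay_percentiles([0.2, 0.3, 0.2, 12.0, 0.4, 0.3, 0.2, 11.5, 0.3, 0.2]))
```

Compare the p95 before and after changing the warm-worker count. A drop toward the p50 suggests most requests now land on a warm worker.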

GPU Right-Sizing

Picking the correct GPU class is the highest-impact cost lever. The rule: pick the smallest GPU where VRAM is not the bottleneck and inference time meets your latency target.

Overpaying for VRAM is the most common waste pattern. A model that uses 8 GB of VRAM does not benefit from an A100 (80 GB). It fits on a T4 (16 GB) or an A10G (24 GB), so choose between those on latency.

GPU	VRAM	Best for
T4	16 GB	Small models up to 8 GB VRAM, batch inference
A10G	24 GB	Most diffusion models, SDXL, FLUX Schnell
A100 (40 GB)	40 GB	Large LLMs, video generation
A100 (80 GB)	80 GB	Very large models, multi-modal inference
H100	80 GB	Highest throughput, training, FLUX Dev
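The right-sizing rule can be sketched as a lookup over the table above. The VRAM figures are taken from the table. The ordering of GPU_CLASSES is assumed to run from cheapest to most expensive, which you should confirm against the pricing page. smallest_fitting_gpu is a hypothetical helper, and the latency numbers are ones you measure yourself by benchmarking your model on each class.

```python
from typing import Mapping, Optional

# (name, VRAM in GB), assumed ordered from cheapest to most expensive.
GPU_CLASSES = [
    ("T4", 16),
    ("A10G", 24),
    ("A100 (40 GB)", 40),
    ("A100 (80 GB)", 80),
    ("H100", 80),
]


def smallest_fitting_gpu(
    peak_vram_gb: float,
    measured_latency_s: Optional[Mapping[str, float]] = None,
    latency_target_s: Optional[float] = None,
) -> Optional[str]:
    """Return the first GPU class that fits the model and meets the latency target.

    If no latency data is given, only the VRAM constraint is applied.
    GPUs without a measurement are skipped when a target is set.
    """
    for name, vram_gb in GPU_CLASSES:
        if vram_gb < peak_vram_gb:
            continue  # model does not fit
        if measured_latency_s is not None and latency_target_s is not None:
            latency = measured_latency_s.get(name)
            if latency is None or latency > latency_target_s:
                continue  # unmeasured or too slow
        return name
    return None


# Illustrative benchmark numbers, not real measurements.
print(smallest_fitting_gpu(8, {"T4": 4.0, "A10G": 1.5}, latency_target_s=2.0))
```

Use peak VRAM under your real batch size, not the size of the weights on disk, since activations and caches add to the footprint.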

Current per-GPU pricing lives at socaity.ai/Pricing.

Spend Alerts

Set a monthly spend cap or a per-endpoint alert from the Alerts panel in the dashboard. Socaity emails you (and optionally pauses the endpoint) when the threshold is hit.

Optimization Checklist

Log job.runtime_info per call. Reconcile against the dashboard's daily cost total.
Profile VRAM usage and downgrade the GPU class if headroom exceeds 40%.
Keep at least one warm replica on RunPod for latency-sensitive endpoints with steady traffic.
Batch requests where possible. GPU throughput scales with batch size until compute or VRAM saturates.
Pre-extract embeddings and voice profiles. Do not re-process the same inputs.
Use deterministic seeds for cache-able outputs.
Tune the RunPod scale-down delay so warm replicas stay alive after a burst without billing for long idle periods.
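The caching items in the checklist can be sketched as a content-addressed cache: hash the endpoint name and the full parameter set, seed included, and reuse the stored result when the same key recurs. cache_key and ResultCache are hypothetical helpers, the in-memory dict stands in for whatever store you use, and the approach only pays off when a fixed seed makes the output reproducible.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def cache_key(endpoint: str, params: Dict[str, Any]) -> str:
    """Stable SHA-256 key over the endpoint and JSON-serialisable parameters."""
    payload = json.dumps(
        {"endpoint": endpoint, "params": params},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResultCache:
    """In-memory cache; swap the dict for disk or object storage as needed."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def get_or_compute(self, endpoint: str, params: Dict[str, Any],
                       compute: Callable[[], Any]) -> Any:
        key = cache_key(endpoint, params)
        if key not in self._store:
            self._store[key] = compute()  # only pay for GPU time on a miss
        return self._store[key]


calls = []
cache = ResultCache()


def fake_inference():
    """Stand-in for a real SDK call, records each invocation."""
    calls.append(1)
    return "image-bytes"


cache.get_or_compute("example-endpoint", {"prompt": "a cat", "seed": 42}, fake_inference)
cache.get_or_compute("example-endpoint", {"seed": 42, "prompt": "a cat"}, fake_inference)
print(len(calls))
```

The second call hits the cache even though the dictionary keys arrive in a different order, because the key is built from sorted JSON.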

Next steps

Job System: how queue time and execution time map onto runtime_info.
APIPod lifecycle: the five health states returned by /health.
Deploy to Cloud: pick a provider and a GPU class for a new endpoint.
Social Media Pipeline: an end-to-end tutorial that uses the patterns on this page.
