Monitor and Optimize
Use the Socaity dashboard to track costs, diagnose cold starts, right-size your GPU selection, and cut spending without sacrificing throughput.
Open https://socaity.ai/account to see a live view of your running endpoints, job history, cost accrual, and replica status. Key panels to bookmark:
- Endpoints: replica count, GPU class, cold-start p50/p95.
- Jobs: status, duration, billed cost per job.
- Cost: daily/monthly spend by endpoint, projection to end of month.
- Alerts: set spend caps and latency thresholds.
A cold start occurs when a request arrives and no warm replica is available. The platform must pull the container image, allocate a GPU, and run your model initialisation. Three levers control cold-start frequency:
| Lever | Effect | Cost impact |
|---|---|---|
| No always-warm replicas | Full serverless: the first request after each idle period hits a cold start | Lowest |
| One always-warm replica | Zero cold starts for single-stream traffic | Medium |
| N always-warm replicas | Cold starts only above N concurrent requests | Higher |
Configure these per endpoint in the dashboard. Cold-start percentiles (p50, p95, p99) are visible in the Endpoints panel and update in near real time.
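The table above can be read as a simple rule: requests beyond the number of warm replicas wait for a cold start. The following is a minimal sketch of that rule, assuming each replica serves one request at a time (an assumption made for illustration, not documented Socaity behaviour). The function name `cold_starts_needed` is hypothetical and not part of any Socaity SDK.

```python
def cold_starts_needed(concurrent_requests: int, warm_replicas: int) -> int:
    """Estimate how many concurrent requests must wait for a cold start.

    Assumption: one request per replica at a time. Real endpoints may
    queue or batch requests, which would lower this number.
    """
    if concurrent_requests < 0 or warm_replicas < 0:
        raise ValueError("counts must be non-negative")
    # Requests up to the warm replica count are served immediately;
    # every request above that needs a new replica to start.
    return max(0, concurrent_requests - warm_replicas)


if __name__ == "__main__":
    for warm in (0, 1, 4):
        print(f"warm={warm}: {cold_starts_needed(3, warm)} cold start(s) for 3 concurrent requests")
```

Comparing the estimate against your observed peak concurrency gives a rough starting value for N before you tune it against the cold-start percentiles in the Endpoints panel.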
Choosing the correct GPU class is the highest-impact cost optimisation. The guiding principle: pick the smallest GPU where VRAM is not the bottleneck and inference time meets your SLA.
| GPU | VRAM | Best for |
|---|---|---|
| T4 | 16 GB | Small models ≤ 8 GB VRAM, batch inference |
| A10G | 24 GB | Most diffusion models, SDXL, FLUX Schnell |
| A100 (40 GB) | 40 GB | Large LLMs, video generation |
| A100 (80 GB) | 80 GB | Very large models, multi-modal inference |
| H100 | 80 GB | Highest throughput, training, FLUX Dev |
Current per-GPU pricing lives at socaity.ai/Pricing.
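The guiding principle can be sketched as a small selection routine: walk the GPU classes from smallest to largest and return the first one that fits the model in VRAM and meets the latency SLA. The VRAM figures are copied from the table above; the ordering is assumed to track price (check the pricing page), and the latency numbers are values you would measure yourself. `pick_gpu` is a hypothetical helper, not a Socaity API.

```python
from typing import Mapping, Optional

# VRAM per GPU class in GB, taken from the table above.
# Assumption: this order also runs from cheapest to most expensive.
GPU_VRAM_GB = [
    ("T4", 16),
    ("A10G", 24),
    ("A100 (40 GB)", 40),
    ("A100 (80 GB)", 80),
    ("H100", 80),
]


def pick_gpu(
    peak_vram_gb: float,
    measured_latency_s: Mapping[str, float],
    sla_s: float,
) -> Optional[str]:
    """Return the smallest GPU class that fits the model and meets the SLA.

    measured_latency_s maps GPU class name to your own benchmark result.
    Classes without a measurement are skipped rather than guessed.
    """
    for name, vram_gb in GPU_VRAM_GB:
        if peak_vram_gb > vram_gb:
            continue  # model does not fit in this class
        latency = measured_latency_s.get(name)
        if latency is not None and latency <= sla_s:
            return name
    return None  # nothing benchmarked so far satisfies both constraints


if __name__ == "__main__":
    benchmarks = {"T4": 4.0, "A10G": 1.5}  # illustrative numbers only
    print(pick_gpu(peak_vram_gb=10.0, measured_latency_s=benchmarks, sla_s=2.0))
```

In this illustration the model fits on a T4, but the T4 misses the 2-second SLA, so the routine moves up one class.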
Set a monthly spend cap or a per-endpoint alert from the Alerts panel in the dashboard. The platform will email you (and optionally pause the endpoint) when the threshold is hit.
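When choosing a cap, it can help to sanity-check it against a rough month-end projection. The sketch below uses plain linear extrapolation from spend to date. How the dashboard computes its own projection is not documented here, so treat this as an independent estimate; both function names are hypothetical.

```python
import calendar
from datetime import date


def project_month_end(spend_to_date: float, today: date) -> float:
    """Linearly extrapolate spend so far to the end of the current month.

    Assumption: today counts as a fully elapsed day and spend is
    roughly uniform across days.
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def cap_at_risk(spend_to_date: float, today: date, monthly_cap: float) -> bool:
    """True if the linear projection exceeds the monthly cap."""
    return project_month_end(spend_to_date, today) > monthly_cap


if __name__ == "__main__":
    today = date(2024, 4, 10)  # illustrative date, April has 30 days
    print(project_month_end(100.0, today))
    print(cap_at_risk(100.0, today, monthly_cap=250.0))
```

Bursty workloads break the uniform-spend assumption, so a projection like this is most useful as an early warning rather than a forecast.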
- Profile VRAM usage and downgrade the GPU class if peak usage leaves more than 40% headroom.
- Keep at least one warm replica for latency-sensitive endpoints with steady traffic.
- Batch requests where possible; GPU throughput scales with batch size up to a point.
- Pre-extract embeddings and voice profiles; avoid re-processing the same inputs.
- Use deterministic seeds so that identical requests produce cacheable outputs.
- Tune the scale-down delay so warm replicas stay alive after a burst without paying for long idle periods.
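The 40% headroom rule from the checklist above comes down to one line of arithmetic. A minimal sketch, assuming headroom is defined as the unused fraction of total VRAM at peak usage (the checklist does not spell out the definition); both function names are hypothetical.

```python
def vram_headroom(peak_used_gb: float, total_gb: float) -> float:
    """Fraction of VRAM left unused at peak, between 0.0 and 1.0."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    if not 0 <= peak_used_gb <= total_gb:
        raise ValueError("peak_used_gb must be between 0 and total_gb")
    return (total_gb - peak_used_gb) / total_gb


def should_consider_downgrade(
    peak_used_gb: float, total_gb: float, threshold: float = 0.40
) -> bool:
    """True if unused VRAM at peak exceeds the threshold (default 40%)."""
    return vram_headroom(peak_used_gb, total_gb) > threshold


if __name__ == "__main__":
    # Illustrative numbers: a model peaking at 13 GB on a 24 GB A10G.
    print(round(vram_headroom(13.0, 24.0), 3))
    print(should_consider_downgrade(13.0, 24.0))
```

A positive result only says a downgrade is worth testing: the smaller class must still hold the peak usage and meet your latency SLA.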
- Where to find job history, cost data, and endpoint health in the dashboard
- How cold starts work and the three levers that control them
- How to right-size GPU selection by VRAM and SLA
- How to set spend alerts to prevent runaway costs