Socaity Docs

Monitor and Optimize

Advanced
10 min

Use the Socaity dashboard to track costs, diagnose cold starts, right-size your GPU selection, and cut spending without sacrificing throughput.

The Socaity Dashboard

Open https://socaity.ai/account to see a live view of your running endpoints, job history, cost accrual, and replica status. Key panels to bookmark:

  • Endpoints: replica count, GPU class, cold-start p50/p95.
  • Jobs: status, duration, billed cost per job.
  • Cost: daily/monthly spend by endpoint, projection to end of month.
  • Alerts: set spend caps and latency thresholds.

Understanding Cold Starts

A cold start occurs when a request arrives and no warm replica is available. The platform must pull the container image, allocate a GPU, and run your model initialisation. Three levers control cold-start frequency:

Lever                    | Effect                                               | Cost impact
No always-warm replicas  | Full serverless: a cold start after each idle period | Lowest
One always-warm replica  | Zero cold starts for single-stream traffic           | Medium
N always-warm replicas   | Cold starts only above N concurrent requests         | Higher

Configure these in the dashboard, per endpoint. Cold-start percentiles (p50, p95, p99) are visible in the Endpoints panel and update in near-real-time.
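
To choose a value for N, it can help to replay observed concurrency against the rule in the table (cold starts only above N concurrent requests). The sketch below is a simplified offline model, not a Socaity API: `overflow_fraction` and the sample data are hypothetical, and the model ignores replicas that are still warm from a recent scale-up.

```python
from typing import Sequence


def overflow_fraction(concurrency_samples: Sequence[int], warm_replicas: int) -> float:
    """Fraction of samples whose concurrency exceeds the always-warm replica count.

    Under the simplified model, these are the moments when a new request
    could trigger a cold start.
    """
    if not concurrency_samples:
        raise ValueError("need at least one concurrency sample")
    over = sum(1 for c in concurrency_samples if c > warm_replicas)
    return over / len(concurrency_samples)


# Hypothetical data: concurrent requests observed at ten sampling points.
samples = [0, 1, 1, 2, 1, 0, 3, 1, 1, 2]

for n in range(4):
    print(f"{n} warm replicas -> {overflow_fraction(samples, n):.0%} of samples exceed capacity")
```

With this sample data, moving from zero to one warm replica removes most of the exposure, while each further replica buys less. That diminishing return is what to weigh against the cost column in the table.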

GPU Right-Sizing

Choosing the correct GPU class is the highest-impact cost optimisation. The guiding principle: pick the smallest GPU where VRAM is not the bottleneck and inference time meets your SLA.

GPU           | VRAM  | Best for
T4            | 16 GB | Small models (≤ 8 GB VRAM), batch inference
A10G          | 24 GB | Most diffusion models, SDXL, FLUX Schnell
A100 (40 GB)  | 40 GB | Large LLMs, video generation
A100 (80 GB)  | 80 GB | Very large models, multi-modal inference
H100          | 80 GB | Highest throughput, training, FLUX Dev

Current per-GPU pricing lives at socaity.ai/Pricing.
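
The VRAM half of the guiding principle can be sketched as a lookup over the table above. This is an illustrative helper, not part of any Socaity SDK: the name `smallest_fitting_gpu`, the 10% safety margin, and the assumption that the table is ordered from cheapest to most expensive are all mine. It checks capacity only; whether inference time meets your SLA still has to be measured on the candidate GPU.

```python
from typing import Optional

# VRAM capacities in GB, copied from the table above.
# Assumption: the table order runs from the smallest/cheapest class upward.
GPU_VRAM_GB = [
    ("T4", 16),
    ("A10G", 24),
    ("A100 (40 GB)", 40),
    ("A100 (80 GB)", 80),
    ("H100", 80),
]


def smallest_fitting_gpu(peak_vram_gb: float, margin: float = 0.10) -> Optional[str]:
    """Return the first GPU class whose VRAM covers peak usage plus a safety margin.

    Returns None when no single GPU in the table is large enough.
    """
    needed = peak_vram_gb * (1 + margin)
    for name, vram_gb in GPU_VRAM_GB:
        if vram_gb >= needed:
            return name
    return None


print(smallest_fitting_gpu(7))    # small model
print(smallest_fitting_gpu(20))   # mid-size diffusion model
print(smallest_fitting_gpu(75))   # too large once the margin is applied
```

A peak of 22 GB illustrates why the margin matters: it nominally fits in 24 GB, but with 10% added the helper moves up to the next class.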

Spend Alerts

Set a monthly spend cap or a per-endpoint alert from the Alerts panel in the dashboard. The platform will email you (and optionally pause the endpoint) when the threshold is hit.
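
When picking a cap, it helps to sanity-check it against your current burn rate. The sketch below uses a plain linear extrapolation of month-to-date spend; how the dashboard computes its own projection is not described here, so treat this as an assumption. The function names and figures are hypothetical.

```python
import calendar
from datetime import date


def project_month_end_spend(spend_to_date: float, today: date) -> float:
    """Linearly extrapolate month-to-date spend to the last day of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def cap_status(spend_to_date: float, monthly_cap: float, today: date) -> str:
    """Classify spend against a monthly cap: 'cap_hit', 'on_track_to_exceed', or 'ok'."""
    if spend_to_date >= monthly_cap:
        return "cap_hit"
    if project_month_end_spend(spend_to_date, today) > monthly_cap:
        return "on_track_to_exceed"
    return "ok"


today = date(2024, 6, 10)  # June has 30 days
print(project_month_end_spend(120.0, today))
print(cap_status(120.0, 300.0, today))
```

Here 120 spent in the first 10 days of a 30-day month projects to 360, which is above a cap of 300 even though the cap has not yet been hit. Linear extrapolation overestimates when traffic is front-loaded and underestimates when it is growing, so it is a rough guide only.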

Optimisation Checklist

  • Profile peak VRAM usage and downgrade the GPU class if more than 40% of VRAM is unused.
  • Keep at least one warm replica for latency-sensitive endpoints with steady traffic.
  • Batch requests where possible; GPU throughput scales with batch size up to a point.
  • Pre-extract embeddings and voice profiles; avoid re-processing the same inputs.
  • Use deterministic seeds for cache-able outputs.
  • Tune the scale-down delay so warm replicas stay alive after a burst without paying for long idle periods.
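
The two caching items in the checklist can be sketched on the client side as follows. With a fixed seed, the same model and parameters should produce the same output, so the request can be keyed by a hash and served from a cache. Everything here is hypothetical: `cache_key`, `cached_generate`, and the `run_job` callable stand in for whatever submits the job in your code, and the in-memory dict would normally be a persistent store.

```python
import hashlib
import json
from typing import Any, Callable, Dict

_cache: Dict[str, Any] = {}  # in-memory for illustration only


def cache_key(model: str, params: dict, seed: int) -> str:
    """Stable hash of the request; sort_keys makes it independent of dict ordering."""
    payload = json.dumps(
        {"model": model, "params": params, "seed": seed},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_generate(model: str, params: dict, seed: int,
                    run_job: Callable[[str, dict, int], Any]) -> Any:
    """Run the job only when this exact (model, params, seed) has not been seen."""
    key = cache_key(model, params, seed)
    if key not in _cache:
        _cache[key] = run_job(model, params, seed)
    return _cache[key]


# Usage with a stand-in job function that counts how often it is invoked.
calls = []

def fake_job(model: str, params: dict, seed: int) -> str:
    calls.append(seed)
    return f"{model}-output-{seed}"

cached_generate("demo-model", {"prompt": "a red fox", "steps": 4}, 42, fake_job)
cached_generate("demo-model", {"steps": 4, "prompt": "a red fox"}, 42, fake_job)  # cache hit
cached_generate("demo-model", {"prompt": "a red fox", "steps": 4}, 43, fake_job)  # new seed
print(len(calls))  # number of jobs actually run
```

This only saves money if outputs really are reproducible for a given seed, which depends on the model and runtime; verify that before relying on the cache.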

What You Learned

  • Where to find job history, cost data, and endpoint health in the dashboard
  • How cold starts work and the three levers that control them
  • How to right-size GPU selection by VRAM and SLA
  • How to set spend alerts to prevent runaway costs