Deploy an LLM API

Intermediate

30 min

Wrap a Hugging Face chat model with APIPod, run it locally, and deploy it as an OpenAI-compatible /chat endpoint — without writing a single line of server code. We use Qwen2.5-0.5B-Instruct for the walkthrough; swap in any other causal LM the same way.

Alpha SDK — Code examples on this page are illustrative. The Socaity SDK is in alpha and APIs may change. Always check the Python SDK reference for current syntax.

🎬 Video Tutorial — Coming soon.

Step 1 — Set Up the Python Environment

LLM dependencies have strict version constraints. The combinations below are the ones we have validated end-to-end on macOS (Intel) and Linux. Other versions may work but are not officially supported.

Use Python 3.10, 3.11 or 3.12. torch wheels are not yet available for Python 3.14, so the install step fails on that interpreter.

# pick one of 3.10 / 3.11 / 3.12 — 3.14 is NOT supported yet
python3.12 -m venv .venv
source .venv/bin/activate
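
To confirm that the activated interpreter falls inside the validated range, a short check like the one below can help. It is a minimal sketch: the helper name is_supported is ours for illustration and is not part of APIPod.

```python
import sys

# Minor versions of Python 3 that this guide has validated.
SUPPORTED_MINORS = {10, 11, 12}


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter is Python 3.10, 3.11 or 3.12."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor in SUPPORTED_MINORS


if __name__ == "__main__":
    status = "supported" if is_supported() else "NOT supported"
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]} is {status} by this guide")
```

Run it with the virtual environment active so that it inspects the interpreter inside .venv rather than the system one.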

Step 2 — Install Dependencies

The pins below are required. transformers 5.x needs torch >= 2.4, which has no Intel-Mac wheel, so we stay on transformers 4.x and torch 2.2. accelerate is not optional either: the loader in Step 3 passes device_map and low_cpu_mem_usage, and transformers raises an ImportError for those options when accelerate is missing.

# requirements.txt
apipod>=1.0
torch>=2.2,<2.3
transformers>=4.40,<4.46
accelerate>=0.26
numpy<2

pip install -r requirements.txt
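
After installing, it can be useful to verify that the resolved versions really sit inside the pinned ranges, since a stray global package or a later pip install can silently break them. The sketch below uses only the standard library; the PINS table mirrors the requirements above, while parse and check_pins are illustrative helper names of our own. It compares major and minor numbers only, which is enough for these particular pins.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

# (inclusive lower bound, exclusive upper bound); None means unbounded.
PINS = {
    "apipod": ((1, 0), None),
    "torch": ((2, 2), (2, 3)),
    "transformers": ((4, 40), (4, 46)),
    "accelerate": ((0, 26), None),
    "numpy": (None, (2,)),
}


def parse(version_string):
    """Reduce '2.2.2+cpu' or '4.45.0.dev0' to a (major, minor) tuple."""
    parts = []
    for piece in version_string.split(".")[:2]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits or 0))
    return tuple(parts)


def check_pins(get_version=version):
    """Return a list of human-readable problems; empty means all pins hold."""
    problems = []
    for name, (low, high) in PINS.items():
        try:
            found = parse(get_version(name))
        except PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if low is not None and found < low:
            problems.append(f"{name}: {found} is below {low}")
        if high is not None and found >= high:
            problems.append(f"{name}: {found} is not below {high}")
    return problems


if __name__ == "__main__":
    for line in check_pins() or ["all pins satisfied"]:
        print(line)
```

The get_version parameter exists only so the check can be exercised with a fake lookup; in normal use the default importlib.metadata.version is what you want.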

Step 3 — Write the Service File

One Python file does everything. The model loads once in lifespan, not on every request — this is the most common pitfall when wrapping LLMs. The endpoint accepts and returns the OpenAI chat schema, so any OpenAI-compatible client works out of the box.
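
The load-once idea is easier to see in isolation. The sketch below uses only the standard library and a stand-in load_model function (hypothetical, replacing the real from_pretrained call) to show that the expensive load runs a single time no matter how many requests are handled. In a real service the framework enters the lifespan context for you; here it is entered by hand purely for demonstration.

```python
import asyncio
from contextlib import asynccontextmanager

load_calls = 0  # counts how often the expensive load runs


def load_model():
    """Stand-in for an expensive from_pretrained() call."""
    global load_calls
    load_calls += 1
    return lambda prompt: f"echo: {prompt}"


class State:
    model = None


state = State()


@asynccontextmanager
async def lifespan(app):
    state.model = load_model()  # runs once, at startup
    yield                       # the server handles requests here
    state.model = None          # runs once, at shutdown


def handle_request(prompt):
    # Handlers only read the shared state; they never load the model.
    return state.model(prompt)


async def main():
    async with lifespan(app=None):
        return [handle_request(p) for p in ("a", "b", "c")]


replies = asyncio.run(main())
print(replies, "loads:", load_calls)
```

Moving the load_model call into handle_request would make load_calls grow with every request, which is the pitfall the lifespan hook avoids.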

# service.py
from contextlib import asynccontextmanager
from datetime import datetime
import uuid

import torch
from apipod import APIPod
from apipod.common import schemas
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer


class State:
    model = None
    tokenizer = None

state = State()


@asynccontextmanager
async def lifespan(app: FastAPI):
    name = "Qwen/Qwen2.5-0.5B-Instruct"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    state.tokenizer = AutoTokenizer.from_pretrained(name)
    state.model = AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=dtype,
        device_map="auto" if device == "cuda" else None,
        low_cpu_mem_usage=True,
    )
    if device == "cpu":
        state.model = state.model.to(device)
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


app = APIPod(backend="fastapi", lifespan=lifespan)


@app.endpoint(path="/chat")
def chat(payload: schemas.ChatCompletionRequest):
    messages = [m.model_dump() for m in payload.messages]
    prompt = state.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = state.tokenizer(prompt, return_tensors="pt").to(state.model.device)

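    # Treat temperature=0 as greedy decoding instead of silently using 0.7.
    temperature = 0.7 if payload.temperature is None else payload.temperature
    if temperature > 0:
        sampling = {"do_sample": True, "temperature": temperature}
    else:
        sampling = {"do_sample": False}

    with torch.no_grad():
        output = state.model.generate(
            **inputs,
            max_new_tokens=payload.max_tokens or 512,
            pad_token_id=state.tokenizer.eos_token_id,
            **sampling,
        )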

    reply = state.tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()

    return schemas.ChatCompletionResponse(
        id=f"chatcmpl-{uuid.uuid4().hex[:8]}",
        object="chat.completion",
        created=int(datetime.now().timestamp()),
        model=payload.model or "Qwen/Qwen2.5-0.5B-Instruct",
        choices=[schemas.ChatCompletionChoice(
            index=0,
            message=schemas.ChatCompletionMessage(role="assistant", content=reply),
            finish_reason="stop",
        )],
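        # Report real token counts: generate() returns prompt + completion.
        usage=schemas.Usage(
            prompt_tokens=inputs["input_ids"].shape[1],
            completion_tokens=output.shape[1] - inputs["input_ids"].shape[1],
            total_tokens=output.shape[1],
        ),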
    )


if __name__ == "__main__":
    app.start()
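
It may help to see roughly what apply_chat_template produces before the text is tokenized. Qwen models use a ChatML-style layout, and the function below is a simplified illustration of that layout written by hand. It is not the tokenizer's actual template, which is authoritative and may additionally inject a default system message or handle tool calls; render_chatml is a hypothetical name used only here.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render OpenAI-style messages in a ChatML-like layout.

    Simplified illustration; the tokenizer's own template is authoritative.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Say hello."},
])
print(prompt)
```

The trailing open assistant turn is what add_generation_prompt=True contributes: without it the model has no cue that it should answer rather than continue the user's message.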

The request and response classes live in apipod.common.schemas. They include ChatCompletionRequest, ChatCompletionResponse, EmbeddingRequest, and the streaming counterparts.
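
For decoder-only models, generate returns the prompt tokens followed by the newly generated ones, which is why the service slices the output at the prompt length before decoding. The same lengths give the numbers an OpenAI-style usage block expects. The sketch below shows the arithmetic with plain lists standing in for tensors; split_completion is an illustrative helper, not an APIPod function.

```python
def split_completion(prompt_ids, output_ids):
    """Separate generated tokens from the echoed prompt and count both."""
    prompt_len = len(prompt_ids)
    completion_ids = output_ids[prompt_len:]
    usage = {
        "prompt_tokens": prompt_len,
        "completion_tokens": len(completion_ids),
        "total_tokens": len(output_ids),
    }
    return completion_ids, usage


# Toy token ids: a 4-token prompt followed by 3 generated tokens.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]
completion_ids, usage = split_completion(prompt_ids, output_ids)
print(completion_ids, usage)
```

With real tensors the equivalents are inputs["input_ids"].shape[1] for the prompt length and output.shape[1] for the total.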

Step 4 — Serve Locally

Run the service on your machine. The first start downloads the model weights (roughly 1 GB for Qwen2.5-0.5B-Instruct) and loads them into memory, so expect it to take noticeably longer than later starts.

python service.py
# INFO:     Uvicorn running on http://0.0.0.0:8000

In another terminal, send a chat request:

curl -s -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "max_tokens": 30
  }'

You should get a JSON response that looks like this:

{
  "id": "chatcmpl-b970afca",
  "object": "chat.completion",
  "created": 1779093208,
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
    },
    "finish_reason": "stop"
  }]
}
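
The same request can be sent from Python without extra dependencies. The sketch below mirrors the curl command using urllib from the standard library and extracts the assistant text from a response shaped like the one above. The helper names build_request, extract_reply and ask are our own, and the URL assumes the default local address from this step.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:8000/chat"  # local address from Step 4


def build_request(prompt, max_tokens=30, url=CHAT_URL):
    """Build the same POST request as the curl command, without sending it."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a chat-completion style dict."""
    return response_body["choices"][0]["message"]["content"]


def ask(prompt):
    """Send the request; requires the service from Step 4 to be running."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return extract_reply(json.load(resp))
```

With the service running, print(ask("Say hello in one short sentence.")) should print a short greeting. The generous timeout is deliberate, since CPU generation can be slow.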

Visit http://localhost:8000/docs for an auto-generated Swagger UI you can use to try the endpoint from the browser.

Steps 5–7 — Coming Soon. Building a container image with apipod --build, deploying to a serverless GPU provider, and calling the live endpoint via the OpenAI client are validated for image and audio models today. The LLM path is being tested for the next release — the section below is a preview.

Step 5 — Build a Container Image (Coming Soon)

apipod --build will package your code, the pinned dependencies, and the Hugging Face cache into a container image ready for GPU deployment.

apipod --build service.py

Step 6 — Deploy to a Provider (Coming Soon)

The same apipod --build command will push directly to a serverless GPU provider — RunPod, Scaleway, or Azure depending on your account.

apipod --build service.py --provider runpod --compute serverless

After a successful deploy you will get a public endpoint URL:

✓ Image pushed to registry
✓ Endpoint registered
✓ Live at: https://api.socaity.ai/endpoints/qwen-chat/chat

Step 7 — Call Your Deployed Endpoint (Coming Soon)

The deployed endpoint will speak the OpenAI chat protocol, so you can point any OpenAI client at it. Just change the base_url and use your Socaity API key as the bearer token.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.socaity.ai/endpoints/qwen-chat",
    api_key=os.getenv("SOCAITY_API_KEY"),
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about deployment."}],
    max_tokens=60,
)

print(response.choices[0].message.content)
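
For debugging it helps to know what the OpenAI client actually puts on the wire: it appends /chat/completions to base_url and sends the API key as a Bearer token in the Authorization header. The sketch below builds the equivalent request by hand with the standard library, without sending it. The base URL is the preview value from this page, build_chat_request is an illustrative helper, and whether the deployed route accepts exactly this path is an assumption until Steps 5-7 are released.

```python
import json
import os
import urllib.request

# Preview URL from this page; the live route may differ once Steps 5-7 ship.
BASE_URL = "https://api.socaity.ai/endpoints/qwen-chat"


def build_chat_request(base_url, api_key, messages, model, max_tokens=60):
    """Build the HTTP request an OpenAI-style client would send."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    BASE_URL,
    os.getenv("SOCAITY_API_KEY", "sk-placeholder"),  # placeholder if unset
    [{"role": "user", "content": "Write a haiku about deployment."}],
    model="Qwen/Qwen2.5-0.5B-Instruct",
)
print(request.full_url)
```

If a deployed endpoint returns 404 for the OpenAI client but works with curl, comparing the printed URL against the route shown in the deploy output is a sensible first check.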

What You Built

Pinned a Python environment that actually runs Hugging Face LLMs
Wrote a single-file APIPod service with an OpenAI-compatible /chat endpoint
Loaded the model once at startup using lifespan
Verified the service locally with curl and Swagger UI
