Socaity Docs

Deploy an LLM API

Intermediate
30 min

Wrap a Hugging Face chat model with APIPod, run it locally, and deploy it as an OpenAI-compatible /chat endpoint without writing a single line of server code. We use Qwen2.5-0.5B-Instruct for the walkthrough; swap in any other causal LM the same way.

Step 1: Set Up the Python Environment

LLM dependencies have strict version constraints. The combinations below are the ones we have validated end-to-end on macOS (Intel) and Linux. Other versions may work but are not officially supported.

terminal
# pick one of 3.10 / 3.11 / 3.12; 3.14 is NOT supported yet
python3.12 -m venv .venv
source .venv/bin/activate
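
If you want to confirm that the activated interpreter is one of the validated versions, a small check like the following can help. It is only a sketch: the supported set is copied from the comment above, and `is_supported` is a hypothetical helper name, not part of APIPod.

```python
import sys

# Validated minor versions, copied from the comment in the venv step above.
SUPPORTED = {(3, 10), (3, 11), (3, 12)}


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter's (major, minor) is in the validated set."""
    return (version_info[0], version_info[1]) in SUPPORTED


if __name__ == "__main__":
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    if is_supported():
        print(f"Python {current} is in the validated range.")
    else:
        print(f"Python {current} is outside the validated range (3.10 to 3.12).")
```

Run it with the virtual environment active, so that it reports the interpreter the service will actually use.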

Step 2: Install Dependencies

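The pins below are required. transformers 5.x needs torch >= 2.4, which has no Intel-Mac wheel, so we stay on transformers 4.x and torch 2.2. accelerate is not installed as a hard dependency of transformers, but the loader needs it as soon as you pass device_map or low_cpu_mem_usage=True, as the service file does. numpy is pinned below 2 because the torch 2.2 wheels were built against NumPy 1.x.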

terminal
pip install -r requirements.txt
requirements.txt
apipod>=1.0
torch>=2.2,<2.3
transformers>=4.40,<4.46
accelerate>=0.26
numpy<2
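
After installing, it can be useful to confirm that pip resolved versions inside the pinned ranges. The sketch below reads installed versions with the standard library's `importlib.metadata` and compares them against the bounds from requirements.txt. The `PINS` table, `parse`, and `satisfies` are illustrative helpers written for this page. The comparison is deliberately simple: numeric segments only, with no pre-release handling.

```python
import re
from importlib import metadata

# Bounds copied from requirements.txt: (inclusive lower, exclusive upper).
# apipod is omitted here; add it the same way if you want it checked.
PINS = {
    "torch": ((2, 2), (2, 3)),
    "transformers": ((4, 40), (4, 46)),
    "accelerate": ((0, 26), None),
    "numpy": (None, (2,)),
}


def parse(version: str) -> tuple:
    """Turn '2.2.2+cpu' into (2, 2, 2); stops at the first non-numeric segment."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(version: str, lower, upper) -> bool:
    """Check lower <= version < upper, where either bound may be None."""
    v = parse(version)
    if lower is not None and v < lower:
        return False
    if upper is not None and v >= upper:
        return False
    return True


if __name__ == "__main__":
    for name, (lower, upper) in PINS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed")
            continue
        status = "ok" if satisfies(installed, lower, upper) else "outside pin"
        print(f"{name}: {installed} ({status})")
```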

Step 3: Write the Service File

One Python file does everything. The model loads once, in lifespan, rather than on every request; reloading it per request is the most common pitfall when wrapping LLMs. The endpoint accepts and returns the OpenAI chat schema, so any OpenAI-compatible client works out of the box.

python
# service.py
from contextlib import asynccontextmanager
from datetime import datetime
import uuid

import torch
from apipod import APIPod
from apipod.common import schemas
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer


class State:
    model = None
    tokenizer = None

state = State()


@asynccontextmanager
async def lifespan(app: FastAPI):
    name = "Qwen/Qwen2.5-0.5B-Instruct"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    state.tokenizer = AutoTokenizer.from_pretrained(name)
    state.model = AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=dtype,
        device_map="auto" if device == "cuda" else None,
        low_cpu_mem_usage=True,
    )
    if device == "cpu":
        state.model = state.model.to(device)
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


app = APIPod(backend="fastapi", lifespan=lifespan)


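@app.endpoint(path="/chat")
def chat(payload: schemas.ChatCompletionRequest):
    # Render the OpenAI-style messages with the model's own chat template.
    messages = [m.model_dump() for m in payload.messages]
    prompt = state.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = state.tokenizer(prompt, return_tensors="pt").to(state.model.device)
    prompt_tokens = int(inputs["input_ids"].shape[1])

    # Check for None explicitly so a requested temperature of 0 is not
    # replaced by the default; 0 falls back to greedy decoding.
    temperature = payload.temperature if payload.temperature is not None else 0.7
    sampling = (
        {"do_sample": True, "temperature": temperature}
        if temperature > 0
        else {"do_sample": False}
    )

    with torch.no_grad():
        output = state.model.generate(
            **inputs,
            max_new_tokens=payload.max_tokens or 512,
            pad_token_id=state.tokenizer.eos_token_id,
            **sampling,
        )

    # generate() returns prompt + completion; keep only the new tokens.
    new_tokens = output[0][prompt_tokens:]
    completion_tokens = int(new_tokens.shape[0])
    reply = state.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    return schemas.ChatCompletionResponse(
        id=f"chatcmpl-{uuid.uuid4().hex[:8]}",
        object="chat.completion",
        created=int(datetime.now().timestamp()),
        model=payload.model or "Qwen/Qwen2.5-0.5B-Instruct",
        choices=[schemas.ChatCompletionChoice(
            index=0,
            message=schemas.ChatCompletionMessage(role="assistant", content=reply),
            finish_reason="stop",
        )],
        # Report real token counts instead of zeros.
        usage=schemas.Usage(
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
        ),
    )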


if __name__ == "__main__":
    app.start()
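
To see why loading in lifespan matters, here is a toy sketch that needs neither APIPod nor a real model. `load_model` is a hypothetical stand-in for the expensive `from_pretrained` call, and `handle_request` stands in for the endpoint. The counter shows that three requests trigger a single load.

```python
import asyncio
from contextlib import asynccontextmanager

load_count = 0  # how many times the (fake) model was loaded


def load_model():
    """Stand-in for AutoModelForCausalLM.from_pretrained; counts its calls."""
    global load_count
    load_count += 1
    return lambda prompt: f"echo: {prompt}"


class State:
    model = None


state = State()


@asynccontextmanager
async def lifespan(app=None):
    state.model = load_model()  # runs once, before the first request
    yield
    state.model = None  # runs once, at shutdown


def handle_request(prompt: str) -> str:
    """A request handler that only reads the preloaded model."""
    return state.model(prompt)


async def main():
    # Everything inside the "async with" corresponds to the server's uptime.
    async with lifespan():
        return [handle_request(p) for p in ("a", "b", "c")]


replies = asyncio.run(main())
print(replies, "loads:", load_count)
# prints: ['echo: a', 'echo: b', 'echo: c'] loads: 1
```

Had `load_model()` been called inside `handle_request`, the counter would end at 3. With real weights, each of those loads would take seconds and roughly double peak memory under concurrent requests.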

Step 4: Serve Locally

Run the service on your machine. The first start downloads the model weights (roughly 1 GB for Qwen2.5-0.5B-Instruct) and loads them into memory, so expect it to take noticeably longer than later starts.

terminal
python service.py
# INFO:     Uvicorn running on http://0.0.0.0:8000

In another terminal, send a chat request:

terminal
curl -s -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "max_tokens": 30
  }'
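
If you prefer Python over curl, the same request can be sent with the standard library alone. This is a sketch that assumes the service is listening on localhost:8000 as shown above. `build_chat_request` and `send` are hypothetical helper names introduced here.

```python
import json
import urllib.request


def build_chat_request(url: str, content: str, max_tokens: int = 30) -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    body = {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_chat_request(
        "http://localhost:8000/chat", "Say hello in one short sentence."
    )
    try:
        print(send(req))
    except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
        print(f"Request failed (is the service running?): {exc}")
```

The generous timeout is deliberate: on CPU, the first generation after startup can take a while.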

You should get a JSON response that looks like this:

json
{
  "id": "chatcmpl-b970afca",
  "object": "chat.completion",
  "created": 1779093208,
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
    },
    "finish_reason": "stop"
  }]
}
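
Client code usually only needs the assistant text from that response. The sketch below pulls it out of the sample above and fails loudly if the shape is not what it expects. `extract_reply` is a hypothetical helper, and the checks cover only the fields shown in the sample.

```python
import json

# The sample response from above, verbatim.
SAMPLE = """
{
  "id": "chatcmpl-b970afca",
  "object": "chat.completion",
  "created": 1779093208,
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
    },
    "finish_reason": "stop"
  }]
}
"""


def extract_reply(response: dict) -> str:
    """Return the first choice's assistant text, or raise ValueError."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    message = choices[0].get("message") or {}
    if message.get("role") != "assistant":
        raise ValueError(f"unexpected role: {message.get('role')!r}")
    return message.get("content") or ""


print(extract_reply(json.loads(SAMPLE)))
# prints: Hello! How can I assist you today?
```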

Step 5: Build a Container Image (Coming Soon)

apipod --build will package your code, the pinned dependencies, and the Hugging Face cache into a container image ready for GPU deployment.

terminal
apipod --build service.py

Step 6: Deploy to a Provider (Coming Soon)

The same apipod --build command will push directly to a serverless GPU provider: RunPod, Scaleway, or Azure, depending on your account.

terminal
apipod --build service.py --provider runpod --compute serverless

After a successful deploy you will get a public endpoint URL:

terminal
✓ Image pushed to registry
✓ Endpoint registered
✓ Live at: https://api.socaity.ai/endpoints/qwen-chat/chat

Step 7: Call Your Deployed Endpoint (Coming Soon)

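The deployed endpoint will speak the OpenAI chat protocol, so you can point any OpenAI client at it. Just change the base_url and use your Socaity API key as the bearer token. Note that the OpenAI Python client posts to {base_url}/chat/completions, so the example below assumes the hosted endpoint will route that path to your /chat handler.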

python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.socaity.ai/endpoints/qwen-chat",
    api_key=os.getenv("SOCAITY_API_KEY"),
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about deployment."}],
    max_tokens=60,
)

print(response.choices[0].message.content)
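
One detail worth guarding against: os.getenv returns None when SOCAITY_API_KEY is unset, and the OpenAI Python client, to our knowledge, then falls back to the OPENAI_API_KEY environment variable or raises an error. A small guard makes the failure explicit. `require_env` is a hypothetical helper, not part of APIPod or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before calling the endpoint."
        )
    return value


# Usage with the client from the example above (assumed, not executed here):
#   client = OpenAI(
#       base_url="https://api.socaity.ai/endpoints/qwen-chat",
#       api_key=require_env("SOCAITY_API_KEY"),
#   )
```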

What You Built

  • Pinned a Python environment that actually runs Hugging Face LLMs
  • Wrote a single-file APIPod service with an OpenAI-compatible /chat endpoint
  • Loaded the model once at startup using lifespan
  • Verified the service locally with curl (the Swagger UI that FastAPI backends typically serve at /docs works too)