Deploy an LLM API
Wrap a Hugging Face chat model with APIPod, run it locally, and deploy it as an OpenAI-compatible /chat endpoint, all without writing a single line of server code. We use Qwen2.5-0.5B-Instruct for the walkthrough; swap in any other causal LM the same way.
LLM dependencies have strict version constraints. The combinations below are the ones we have validated end-to-end on macOS (Intel) and Linux. Other versions may work but are not officially supported.
Use Python 3.10, 3.11, or 3.12. Python 3.14 is not supported yet: it has no matching torch wheels, so the install step will fail.

# pick one of 3.10 / 3.11 / 3.12; 3.14 is NOT supported yet
python3.12 -m venv .venv
source .venv/bin/activate

The pins below are required. transformers 5.x needs torch >= 2.4, which has no Intel-Mac wheel, so we stay on transformers 4.x and torch 2.2. accelerate is also required: the Hugging Face loader needs it for the device_map and low_cpu_mem_usage options, but transformers does not install it for you.
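With the virtual environment active, a quick check can confirm that the interpreter is one of the validated versions before anything is installed. This is only a minimal sketch: the SUPPORTED set mirrors the versions named in this guide, and is_supported is an illustrative helper, not part of APIPod.

```python
import sys

# Minor versions this guide lists as validated (3.10, 3.11, 3.12).
SUPPORTED = {(3, 10), (3, 11), (3, 12)}


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if (major, minor) is one of the validated versions."""
    return (version_info[0], version_info[1]) in SUPPORTED


if __name__ == "__main__":
    current = f"{sys.version_info[0]}.{sys.version_info[1]}"
    if is_supported():
        print(f"Python {current} is one of the validated versions.")
    else:
        print(f"Python {current} is not validated here; use 3.10, 3.11, or 3.12.")
```

Run it with the interpreter from the virtual environment, so that the version it reports is the one pip will install into.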
pip install -r requirements.txt

# requirements.txt
apipod>=1.0
torch>=2.2,<2.3
transformers>=4.40,<4.46
accelerate>=0.26
numpy<2

One Python file does everything. The model loads once in lifespan, not on every request; reloading it per request is the most common pitfall when wrapping LLMs. The endpoint accepts and returns the OpenAI chat schema, so any OpenAI-compatible client works out of the box.
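The load-once idea can be seen in isolation with the standard library alone. The sketch below is not the APIPod service itself: load_model is a hypothetical stand-in for an expensive model load, and handle_request plays the role of an endpoint. It only illustrates that the load runs once, however many requests are handled.

```python
import asyncio
from contextlib import asynccontextmanager

load_count = 0  # counts how often the expensive load runs


class State:
    model = None


state = State()


def load_model():
    """Hypothetical stand-in for an expensive model load (no real weights)."""
    global load_count
    load_count += 1
    return lambda text: text.upper()


@asynccontextmanager
async def lifespan(app=None):
    state.model = load_model()  # runs once, at startup
    yield
    state.model = None  # runs once, at shutdown


def handle_request(text: str) -> str:
    # Handlers only read the already-loaded model; they never reload it.
    return state.model(text)


async def main():
    async with lifespan():
        return [handle_request(t) for t in ("hi", "there", "again")]


replies = asyncio.run(main())
print(replies, load_count)
```

Three requests are served but load_count ends at 1. Moving the load_model call into handle_request would raise it to 3, which is the per-request reload the paragraph above warns about.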
# service.py
from contextlib import asynccontextmanager
from datetime import datetime
import uuid

import torch
from apipod import APIPod
from apipod.common import schemas
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer


class State:
    model = None
    tokenizer = None


state = State()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: load the tokenizer and model once, not on every request.
    name = "Qwen/Qwen2.5-0.5B-Instruct"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    state.tokenizer = AutoTokenizer.from_pretrained(name)
    state.model = AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=dtype,
        device_map="auto" if device == "cuda" else None,
        low_cpu_mem_usage=True,
    )
    if device == "cpu":
        state.model = state.model.to(device)
    yield
    # Shutdown: release cached GPU memory if a GPU was used.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


app = APIPod(backend="fastapi", lifespan=lifespan)


@app.endpoint(path="/chat")
def chat(payload: schemas.ChatCompletionRequest):
    # Render the chat messages into the model's prompt format.
    messages = [m.model_dump() for m in payload.messages]
    prompt = state.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = state.tokenizer(prompt, return_tensors="pt").to(state.model.device)
    with torch.no_grad():
        output = state.model.generate(
            **inputs,
            max_new_tokens=payload.max_tokens or 512,
            temperature=payload.temperature or 0.7,
            do_sample=True,
            pad_token_id=state.tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, skipping the prompt.
    reply = state.tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
    return schemas.ChatCompletionResponse(
        id=f"chatcmpl-{uuid.uuid4().hex[:8]}",
        object="chat.completion",
        created=int(datetime.now().timestamp()),
        model=payload.model or "Qwen/Qwen2.5-0.5B-Instruct",
        choices=[schemas.ChatCompletionChoice(
            index=0,
            message=schemas.ChatCompletionMessage(role="assistant", content=reply),
            finish_reason="stop",
        )],
        usage=schemas.Usage(prompt_tokens=0, completion_tokens=0, total_tokens=0),
    )


if __name__ == "__main__":
    app.start()

The request and response schemas used here come from apipod.common.schemas. They include ChatCompletionRequest, ChatCompletionResponse, EmbeddingRequest, and the streaming counterparts.

Run the service on your machine. The first start downloads the model weights (roughly 1 GB for Qwen2.5-0.5B-Instruct) and loads them into memory.
python service.py
# INFO: Uvicorn running on http://0.0.0.0:8000

In another terminal, send a chat request:

curl -s -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "max_tokens": 30
  }'

You should get a JSON response that looks like this:
{
  "id": "chatcmpl-b970afca",
  "object": "chat.completion",
  "created": 1779093208,
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
    },
    "finish_reason": "stop"
  }]
}

apipod --build, deploying to a serverless GPU provider, and calling the live endpoint via the OpenAI client are validated for image and audio models today. The LLM path is being tested for the next release, so the section below is a preview.

apipod --build will package your code, the pinned dependencies, and the Hugging Face cache into a container image ready for GPU deployment.
apipod --build service.py

The same apipod --build command will push directly to a serverless GPU provider (RunPod, Scaleway, or Azure), depending on your account.
apipod --build service.py --provider runpod --compute serverless

After a successful deploy you will get a public endpoint URL:
Image pushed to registry
Endpoint registered
Live at: https://api.socaity.ai/endpoints/qwen-chat/chat

The deployed endpoint will speak the OpenAI chat protocol, so you can point any OpenAI client at it. Just change the base_url and use your Socaity API key as the bearer token.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.socaity.ai/endpoints/qwen-chat",
    api_key=os.getenv("SOCAITY_API_KEY"),
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about deployment."}],
    max_tokens=60,
)
print(response.choices[0].message.content)

In this guide you:

- Pinned a Python environment that actually runs Hugging Face LLMs
- Wrote a single-file APIPod service with an OpenAI-compatible /chat endpoint
- Loaded the model once at startup using lifespan
- Verified the service locally with curl and Swagger UI
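The local curl check can also be scripted, which is handy for a repeatable smoke test. The following is a minimal sketch using only the standard library. It assumes the service is listening on http://localhost:8000/chat, as in the Uvicorn log line earlier. The helper names build_chat_request, extract_reply, and ask are illustrative and not part of APIPod.

```python
import json
import urllib.request

# Assumed local address, taken from the Uvicorn log line in this guide.
LOCAL_URL = "http://localhost:8000/chat"


def build_chat_request(prompt: str, max_tokens: int = 30, url: str = LOCAL_URL):
    """Build a POST request carrying an OpenAI-style chat payload."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of a chat.completion response."""
    return response_body["choices"][0]["message"]["content"]


def ask(prompt: str, max_tokens: int = 30) -> str:
    """Send the prompt to the running service and return the reply text."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=120) as response:
        return extract_reply(json.load(response))
```

With the service running, ask("Say hello in one short sentence.") should return the assistant text that curl shows in the content field. The generous timeout allows for slow CPU generation.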