Lip Sync a Portrait

Beginner

10 min

Turn a still portrait and a voice clip into a talking-head video. The face's mouth and expression follow the audio, frame by frame, and you get back a single MP4 you can drop straight into a video edit.

Hosted on Socaity. This tutorial uses lucataco/img-and-audio2video, a Replicate-backed model in the Socaity catalog. It runs on Socaity's cloud, so the only thing you need locally is the socaity SDK and a SOCAITY_API_KEY.

Uses: lucataco/img-and-audio2video (hosted, image + audio to video). To synthesise the voice line first, pair it with text-to-speech.

How It Works

You hand the model two files: a portrait image and an audio track. The model drives the mouth to match the spoken audio and hands back one finished video. Both inputs are required.

Good inputs: a clear, front-facing portrait where the face fills most of the frame, and a clean speech clip (a few seconds is plenty). The audio can be a recording or a synthesised voice line.
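To confirm a clip really is only a few seconds long before uploading it, you can inspect it locally. This is a minimal sketch using Python's standard wave module; the helper name describe_wav is hypothetical and not part of the socaity SDK. It assumes the clip is an uncompressed PCM WAV, since the wave module raises an error on other encodings.

```python
import wave


def describe_wav(path):
    """Return (duration_seconds, channels, sample_rate) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()      # total number of audio frames
        rate = wav.getframerate()      # frames per second
        channels = wav.getnchannels()  # 1 = mono, 2 = stereo
    return frames / rate, channels, rate
```

Calling describe_wav("vocals.wav") on your own clip gives you the duration in seconds, which is a quick way to catch a recording that is much longer than you intended.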

Prerequisites

Python 3.10 or newer.
A Socaity API key. Sign up to generate one.
A portrait image (face.png) and a speech clip (vocals.wav).
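The checklist above can be verified with a short pre-flight script before any cloud call is made. This is a sketch using only the standard library; missing_prerequisites is a hypothetical helper, and the default file names are the ones this tutorial uses.

```python
import sys
from pathlib import Path


def missing_prerequisites(files=("face.png", "vocals.wav"), min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # Compare only (major, minor) against the required minimum version.
    if sys.version_info[:2] < tuple(min_python):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(
            f"Python {min_python[0]}.{min_python[1]} or newer required, found {found}"
        )
    # Every input file must exist as a regular file.
    for name in files:
        if not Path(name).is_file():
            problems.append(f"missing input file: {name}")
    return problems
```

Running it from the folder that holds your inputs and printing the returned list shows exactly what is still missing. It does not check the API key, which is set in a later step.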

Step 1. Install the SDK

Install the socaity SDK, then install the model to pull down its typed client.

pip install socaity
socaity install lucataco/img-and-audio2video

Step 2. Set your API key

The SDK reads SOCAITY_API_KEY from the environment.

# macOS / Linux
export SOCAITY_API_KEY="sk_..."

# Windows (PowerShell)
$env:SOCAITY_API_KEY = "sk_..."
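If you would rather fail early with a clear message than discover a missing key at call time, a small guard can run at the top of your script. This is a sketch, not part of the socaity SDK: require_api_key is a hypothetical helper that only checks the variable is present and non-empty. It makes no assumption about the key's format.

```python
import os


def require_api_key(env=os.environ, name="SOCAITY_API_KEY"):
    """Return the API key from the environment, or raise if it is unset or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the tutorial")
    return key
```

The env parameter defaults to the real environment and exists only so the function can be exercised with a plain dictionary.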

Step 3. Animate the portrait

Import img_and_audio2video from its vendor path, wrap each input file in the matching SDK media type, and call the client. The call returns a job; .get_result() blocks until the GPU finishes and returns a single VideoFile you can save to disk.

Supply your own face.png: a front-facing portrait where the face fills most of the frame. Any clear headshot works.

import os
from socaity import ImageFile, AudioFile
from socaity.sdk.replicate.lucataco import img_and_audio2video

# Passing the key explicitly is optional: with api_key=None the SDK falls back to SOCAITY_API_KEY.
talk = img_and_audio2video(api_key=os.getenv("SOCAITY_API_KEY"))

# image: portrait to animate. audio: voice track that drives the lips. Both required.
video = talk(
    image=ImageFile().from_file("face.png"),
    audio=AudioFile().from_file("vocals.wav"),
).get_result()                  # single VideoFile, no [0] needed

video.save("talking.mp4")
print("Saved talking.mp4")

This model returns one video, so get_result() gives you the VideoFile directly. There is no list to index into.

Step 4. Generate the voice line first (optional)

No recording handy? Synthesise one on Socaity with a hosted text-to-speech model, save the WAV, then feed it into the step above. Both calls run on the cloud with the same API key.

import os
from socaity.sdk.replicate.jaaari import kokoro_82m

# Synthesise the voice line on Socaity, then save it as the audio input.
tts = kokoro_82m(api_key=os.getenv("SOCAITY_API_KEY"))
voice = tts(text="The vault is sealed for the night.", voice="af_bella").get_result()
voice.save("vocals.wav")        # feed this into img_and_audio2video above
print("Saved vocals.wav")

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `image` | `ImageFile \| str` | `required` | Portrait to animate. Wrap a local path with ImageFile().from_file(...). |
| `audio` | `AudioFile \| str` | `required` | Speech audio that drives the lip movement. Wrap a local path with AudioFile().from_file(...). |

Tips

Use a clear, front-facing portrait. The closer the face is to the camera, the cleaner the lip motion.
Keep the speech clean, with minimal background music. Dry recordings track best.
Shorter clips return faster. Start with a few seconds while you dial in the inputs.
Reuse one client across many calls instead of re-instantiating it per video.
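The last tip, reusing one client, can be sketched as a small batch helper. animate_all is a hypothetical function, not an SDK call. It assumes only the call shape shown in Step 3: the client is called with image= and audio=, the returned job exposes .get_result(), and the result exposes .save(). The loaders are passed in so the helper itself does not import the SDK.

```python
from pathlib import Path


def animate_all(client, jobs, load_image, load_audio, out_dir="."):
    """Run every (image_path, audio_path) pair through one shared client.

    client:     an already-constructed callable, e.g. img_and_audio2video(...)
    load_image: wraps an image path, e.g. lambda p: ImageFile().from_file(p)
    load_audio: wraps an audio path, e.g. lambda p: AudioFile().from_file(p)
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for image_path, audio_path in jobs:
        # Same call shape as Step 3; get_result() blocks until the video is ready.
        video = client(
            image=load_image(image_path),
            audio=load_audio(audio_path),
        ).get_result()
        # Name each output after its inputs, e.g. face_vocals.mp4.
        target = out_dir / f"{Path(image_path).stem}_{Path(audio_path).stem}.mp4"
        video.save(str(target))
        saved.append(target)
    return saved
```

With the objects from Step 3, a call would look like animate_all(talk, [("face.png", "vocals.wav")], lambda p: ImageFile().from_file(p), lambda p: AudioFile().from_file(p)). The jobs run one after another, so total time grows with the number of pairs.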

What You Built

Installed the hosted lucataco/img-and-audio2video model.
Loaded a portrait and a voice clip with ImageFile and AudioFile.
Generated a talking-head video in a single hosted call.
Optionally synthesised the voice line first with a hosted text-to-speech model.
