Lip Sync a Character
Take a voice recording and drive a 3D character's face from it — lip movement, full facial animation, and emotion blendshapes, exported as a USD animation file you can drop into Maya or Unreal Engine 5.
audio2face wraps NVIDIA's Audio2Face headless server and runs on your own NVIDIA hardware. It is not a hosted Socaity service: no API key or credits are required, but you need NVIDIA Audio2Face installed locally (or on a host you control) before this tutorial will run end to end.
Uses: audio2face (local NVIDIA tool). Outputs a USD animation; pairs naturally with voice-clone when you want to generate the input audio first.
You hand audio2face a WAV file and a destination path. It analyses the audio for phonemes and emotion, drives the blendshapes of a target face rig, and writes the result to disk as a USD animation. Import that USD into your DCC of choice and the character lip-syncs the audio.
Common pairings: a TTS clip from speechcraft, a real voice actor recording, or any clean speech audio at 16 kHz or higher.
- NVIDIA Audio2Face installed locally with the headless server running
- An NVIDIA GPU (the headless server requires CUDA)
- A WAV audio file with clean speech (5–60 seconds is a good starting range)
- Optional: Maya or Unreal Engine 5 to view the resulting USD animation
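Before sending a recording to the server, it can be worth confirming that it meets the audio guidance above (16 kHz or higher, roughly 5 to 60 seconds). The following is a minimal sketch using only Python's standard `wave` module; `inspect_wav` and `check_wav` are hypothetical helper names, not part of socaity, and the thresholds simply mirror the recommendations in this tutorial.

```python
import wave
from dataclasses import dataclass


@dataclass
class WavInfo:
    sample_rate: int   # samples per second (Hz)
    channels: int      # 1 = mono, 2 = stereo
    duration_s: float  # clip length in seconds


def inspect_wav(path: str) -> WavInfo:
    """Read basic properties from the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return WavInfo(
            sample_rate=rate,
            channels=wav.getnchannels(),
            duration_s=wav.getnframes() / rate,
        )


def check_wav(path: str, min_rate: int = 16_000,
              min_s: float = 5.0, max_s: float = 60.0) -> list[str]:
    """Return human-readable warnings; an empty list means no issues found."""
    info = inspect_wav(path)
    warnings = []
    if info.sample_rate < min_rate:
        warnings.append(
            f"sample rate {info.sample_rate} Hz is below {min_rate} Hz"
        )
    if not (min_s <= info.duration_s <= max_s):
        warnings.append(
            f"duration {info.duration_s:.1f} s is outside {min_s}-{max_s} s"
        )
    return warnings
```

Calling `check_wav("./voice_line.wav")` returns a list of warnings you can print or log. The duration range is only a starting recommendation, so treat a duration warning as a hint rather than a hard failure.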
audio2face ships inside the socaity package. For real-time streaming via gRPC add the [streaming] extra.
pip install socaity
# For real-time gRPC streaming:
pip install "socaity[streaming]"
Audio2Face talks to a local NVIDIA headless server over HTTP and gRPC. Adjust ROOT_DIR, DEFAULT_OUTPUT_DIR and DEFAULT_AUDIO_STREAM_GRPC_PORT in settings.py so they match your installation. No SOCAITY_API_KEY is required.
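Because everything depends on the local headless server being reachable, a quick connectivity check can save debugging time. The sketch below uses only the standard `socket` module to test whether a TCP port accepts connections; `port_is_open` is a hypothetical helper, and it only confirms that something is listening, not that it is actually Audio2Face.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_is_open("127.0.0.1", 50051)` would probe a gRPC port on the local machine; 50051 is only a placeholder here, so substitute the value you configured as DEFAULT_AUDIO_STREAM_GRPC_PORT.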
import os
from socaity import audio2face
# Audio2Face talks to a local NVIDIA headless server.
# Edit ROOT_DIR / DEFAULT_OUTPUT_DIR / DEFAULT_AUDIO_STREAM_GRPC_PORT
# in the package's settings.py before the first call.
a2f = audio2face()
audio2face_single reads the audio, analyses emotion automatically when emotion_auto_detect=True, and writes a USD animation file at the requested frame rate. 60 fps is a safe default for film and game pipelines; 30 fps is enough for web video.
# Drive a character face from a single voice recording
a2f.audio2face_single(
    audio_file_path="./voice_line.wav",
    output_path="./animation.usd",
    fps=60,
    emotion_auto_detect=True,
)
print("USD animation written to ./animation.usd")
Auto-detection picks up the broad emotional tone of the audio, but you can override the blendshape weights when you need a specific performance, for example a constant "concerned" delivery for a narration line. Pass update_settings=True to apply the values even when auto-detection is on.
# Bias the performance toward a specific emotional tone
a2f.set_emotion(
    anger=0.0,
    disgust=0.0,
    fear=0.4,
    sadness=0.7,
    update_settings=True,  # apply even with auto-detect on
)
a2f.audio2face_single(
    audio_file_path="./voice_line.wav",
    output_path="./animation_concerned.usd",
    fps=60,
    emotion_auto_detect=True,
)
For dialogue-heavy pipelines, drop every line into a folder and let audio2face_folder generate one USD per file. One call, no per-file orchestration.
# One call processes every WAV in the folder
a2f.audio2face_folder(
    input_folder="./dialogue_lines/",
    output_folder="./animations/",
    fps=60,
)
# Free the GPU when you finish
a2f.shutdown_a2f()
| Method | Input | Output | Description |
|---|---|---|---|
| audio2face_single | audio_file_path, output_path, fps, emotion_auto_detect | USD file | Generate a USD animation from a single audio file. |
| audio2face_folder | input_folder, output_folder, fps | USD files | Batch-process every audio file in a folder. |
| set_emotion | anger, disgust, fear, sadness, update_settings | — | Override blendshape emotion weights applied to the animation. |
| stream_audio | audio_data, output_path, fps | USD file | Stream audio chunks via gRPC for real-time animation. Requires socaity[streaming]. |
| shutdown_a2f | — | — | Shut down the headless server and free GPU memory. |
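Before running audio2face_folder on a large batch, it may help to preview which files would be picked up. The sketch below uses `pathlib` to pair each WAV in a folder with an expected USD path. Note the assumption: this tutorial only states that one USD is produced per file, so the `<stem>.usd` naming used here is a guess for planning purposes, and `plan_batch` is a hypothetical helper rather than part of socaity.

```python
from pathlib import Path


def plan_batch(input_folder: str, output_folder: str) -> list[tuple[Path, Path]]:
    """Pair each WAV in input_folder with an assumed USD path in output_folder.

    The "<stem>.usd" output name is an assumption, not documented behaviour.
    """
    src = Path(input_folder)
    dst = Path(output_folder)
    pairs = []
    for wav in sorted(src.iterdir()):
        # Only top-level files are considered; subfolders are ignored.
        if wav.is_file() and wav.suffix.lower() == ".wav":
            pairs.append((wav, dst / f"{wav.stem}.usd"))
    return pairs
```

Printing the result of `plan_batch("./dialogue_lines/", "./animations/")` gives a quick dry-run listing, which is useful for spotting stray non-WAV files before the GPU work starts.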
| Parameter | Type | Default | Description |
|---|---|---|---|
| audio_file_path | str | — | Path to the input audio file. WAV at 16 kHz or higher is recommended. |
| output_path | str | — | Destination path for the generated USD animation. |
| fps | int | — | Frame rate. 30 for web video, 60 for film and games. Higher values cost more processing time. |
| emotion_auto_detect | bool | — | When True, the model picks up emotional tone from the audio. When False, the values set via set_emotion are used instead. |
| input_folder | str | — | audio2face_folder only. Source folder containing audio files. |
| output_folder | str | — | audio2face_folder only. Destination folder for the generated USD files. |
| anger / disgust / fear / sadness | float | — | set_emotion. Blendshape weights in the 0.0–1.0 range. |
| update_settings | bool | — | set_emotion. When True, the supplied weights override auto-detection. |
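Since the emotion weights are expected to stay within 0.0 to 1.0, a small guard can catch out-of-range or misspelled values before they reach set_emotion. This is a minimal sketch under that assumption; `clamp_emotions` is a hypothetical helper, and whether the library itself validates or clamps these values is not stated in this tutorial.

```python
EMOTIONS = ("anger", "disgust", "fear", "sadness")


def clamp_emotions(**weights: float) -> dict[str, float]:
    """Return all four emotion weights, clamped into the 0.0-1.0 range.

    Emotions that are not supplied default to 0.0; unknown names raise.
    """
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotion(s): {sorted(unknown)}")
    return {
        name: min(1.0, max(0.0, float(weights.get(name, 0.0))))
        for name in EMOTIONS
    }
```

The result can then be unpacked into the call shown earlier, for example `a2f.set_emotion(**clamp_emotions(fear=0.4, sadness=0.7), update_settings=True)`.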
- Use clean speech audio with minimal background music; emotion detection is more accurate on clean input.
- WAV at 16 kHz or higher is the safest format. Convert MP3s first if needed.
- Higher fps means smoother motion but longer processing time. 60 fps is a good film default.
- Call shutdown_a2f() when you finish a batch so the headless server frees GPU memory.
- For real-time use cases (live avatars, streaming dubs) reach for stream_audio, which skips the intermediate WAV write.
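For the MP3 conversion mentioned in the tips above, one common route is ffmpeg. The sketch below builds (and optionally runs) an ffmpeg command that writes 16-bit PCM WAV at 16 kHz. It assumes ffmpeg is installed and on your PATH; downmixing to mono with `-ac 1` is my own assumption rather than a stated requirement, so drop that flag if you need to keep stereo. Both function names are hypothetical.

```python
import subprocess


def ffmpeg_to_wav_cmd(src: str, dst: str, sample_rate: int = 16_000) -> list[str]:
    """Build an ffmpeg command converting src to 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",           # overwrite dst if it already exists
        "-i", src,                # input file (MP3, M4A, ...)
        "-ac", "1",               # assumption: downmix to mono
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        dst,
    ]


def convert_to_wav(src: str, dst: str, sample_rate: int = 16_000) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_to_wav_cmd(src, dst, sample_rate), check=True)
```

Keeping the command builder separate from the runner makes it easy to log or inspect the exact command before anything is executed.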
- Generated a USD lip-sync animation from a single WAV file
- Overrode the auto-detected emotion with manual blendshape weights
- Batched an entire folder of dialogue lines into one USD per file
- Learned how to drop the resulting animation into a Maya or Unreal Engine 5 pipeline