Socaity Docs

Lip Sync a Character

Beginner
10 min

Take a voice recording and drive a 3D character's face from it — lip movement, full facial animation, and emotion blendshapes, exported as a USD animation file you can drop into Maya or Unreal Engine 5.

Uses: audio2face (local NVIDIA tool). Outputs a USD animation; pairs naturally with voice-clone when you want to generate the input audio first.

The Workflow

You hand audio2face a WAV file and a destination path. It analyses the audio for phonemes and emotion, drives the blendshapes of a target face rig, and writes the result to disk as a USD animation. Import that USD into your DCC of choice and the character lip-syncs the audio.

Common pairings: a TTS clip from speechcraft, a real voice actor recording, or any clean speech audio at 16 kHz or higher.

Prerequisites

  • NVIDIA Audio2Face installed locally with the headless server running
  • An NVIDIA GPU (the headless server requires CUDA)
  • A WAV audio file with clean speech (5–60 seconds is a good starting range)
  • Optional: Maya or Unreal Engine 5 to view the resulting USD animation
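
Before sending a recording to the server, it can help to confirm that it meets the audio guidelines above. The following is a minimal sketch using only Python's standard `wave` module. The helper name `check_wav` and its thresholds are illustrative (taken from the 16 kHz and 5–60 second guidance in this tutorial) and are not part of the socaity package.

```python
import wave


def check_wav(path, min_rate_hz=16_000, min_seconds=5.0, max_seconds=60.0):
    """Check a WAV file against this tutorial's guidelines (hypothetical helper).

    Returns (ok, info) where info holds the measured sample rate, the
    duration in seconds, and a list of human-readable problems.
    """
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate

    problems = []
    if rate < min_rate_hz:
        problems.append(f"sample rate {rate} Hz is below {min_rate_hz} Hz")
    if not (min_seconds <= seconds <= max_seconds):
        problems.append(
            f"duration {seconds:.1f} s is outside {min_seconds}-{max_seconds} s"
        )
    return (not problems), {"rate_hz": rate, "seconds": seconds, "problems": problems}
```

The duration range is only a suggested starting point, so a file outside it is a warning rather than a hard failure.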

Step 1 — Install

audio2face ships inside the socaity package. For real-time streaming via gRPC add the [streaming] extra.

terminal
pip install socaity
# For real-time gRPC streaming:
# Quote the extra so shells such as zsh do not expand the brackets
pip install "socaity[streaming]"

Step 2 — Point at Your Headless Server

Audio2Face talks to a local NVIDIA headless server over HTTP and gRPC. Adjust ROOT_DIR, DEFAULT_OUTPUT_DIR and DEFAULT_AUDIO_STREAM_GRPC_PORT in settings.py so they match your installation. No SOCAITY_API_KEY is required.

python
from socaity import audio2face

# Audio2Face talks to a local NVIDIA headless server.
# Edit ROOT_DIR / DEFAULT_OUTPUT_DIR / DEFAULT_AUDIO_STREAM_GRPC_PORT
# in the package's settings.py before the first call.
a2f = audio2face()
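
If the first call fails, a quick way to narrow the problem down is to check whether anything is listening on the configured port at all. The sketch below uses only the standard `socket` module. The helper name `server_reachable` is hypothetical, and the host and port you pass in are assumptions that should come from your own settings.py. A successful TCP connection only shows that something accepts connections there, not that it is the Audio2Face headless server.

```python
import socket


def server_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Call it with the same port you set as DEFAULT_AUDIO_STREAM_GRPC_PORT, for example `server_reachable("127.0.0.1", my_port)`, where `my_port` is your configured value.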

Step 3 — Drive a Face From a Single Audio File

audio2face_single reads the audio, analyses emotion automatically when emotion_auto_detect=True, and writes a USD animation file at the requested frame rate. 60 fps is a safe default for film and game pipelines; 30 fps is enough for web video.

python
# Drive a character face from a single voice recording
a2f.audio2face_single(
    audio_file_path="./voice_line.wav",
    output_path="./animation.usd",
    fps=60,
    emotion_auto_detect=True,
)
print("USD animation written to ./animation.usd")
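
To get a feel for how fps affects the size of the result, the rough arithmetic is clip duration multiplied by frame rate. The sketch below is an estimate only: the helper name `expected_frame_count` is hypothetical, and how the tool handles a trailing partial frame is an assumption, so the real frame count may differ by one.

```python
def expected_frame_count(duration_seconds, fps):
    """Estimate the number of animation frames for a clip (hypothetical helper)."""
    if duration_seconds < 0 or fps <= 0:
        raise ValueError("duration must be >= 0 and fps must be > 0")
    # Rounding to the nearest frame is an assumption about the exporter's behaviour
    return round(duration_seconds * fps)
```

A 10-second line comes to roughly 600 frames at 60 fps and 300 frames at 30 fps, which is why the higher rate takes longer to process.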

Step 4 — Override the Emotion

Auto-detection picks up the broad emotional tone of the audio, but you can override the blendshape weights when you need a specific performance — for example, a constant "concerned" delivery for a narration line. Pass update_settings=True to apply the values even when auto-detection is on.

python
# Bias the performance toward a specific emotional tone
a2f.set_emotion(
    anger=0.0,
    disgust=0.0,
    fear=0.4,
    sadness=0.7,
    update_settings=True,  # apply even with auto-detect on
)

a2f.audio2face_single(
    audio_file_path="./voice_line.wav",
    output_path="./animation_concerned.usd",
    fps=60,
    emotion_auto_detect=True,
)
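
Because the emotion weights are expected to sit in the 0.0–1.0 range, it can be worth normalising them before calling set_emotion. This is a minimal sketch: the helper `clamp_emotions` is hypothetical, and filling in unspecified emotions with 0.0 is this helper's own assumption, not documented library behaviour.

```python
EMOTIONS = ("anger", "disgust", "fear", "sadness")


def clamp_emotions(**weights):
    """Return a full emotion dict with every weight clamped to 0.0-1.0.

    Hypothetical helper: emotions that are not supplied default to 0.0,
    and unknown names are rejected so typos surface early.
    """
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotion(s): {sorted(unknown)}")
    return {
        name: min(1.0, max(0.0, float(weights.get(name, 0.0))))
        for name in EMOTIONS
    }
```

The result can then be unpacked into the call from this step, for example `a2f.set_emotion(**clamp_emotions(fear=0.4, sadness=0.7), update_settings=True)`.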

Step 5 — Batch a Whole Folder

For dialogue-heavy pipelines, drop every line into a folder and let audio2face_folder generate one USD per file. One call, no per-file orchestration.

python
# One call processes every WAV in the folder
a2f.audio2face_folder(
    input_folder="./dialogue_lines/",
    output_folder="./animations/",
    fps=60,
)

# Free the GPU when you finish
a2f.shutdown_a2f()
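
Before launching a long batch, it can be useful to preview which files will be picked up. The sketch below lists the WAV files in a folder and pairs each with a candidate output path. The helper `plan_batch` is hypothetical, and the naming rule (same file stem with a .usd suffix) is an assumption for illustration; audio2face_folder may name its outputs differently.

```python
from pathlib import Path


def plan_batch(input_folder, output_folder, pattern="*.wav"):
    """Pair each matching audio file with a candidate USD path (hypothetical helper).

    The '<stem>.usd' naming is an assumption, not documented library behaviour.
    """
    in_dir, out_dir = Path(input_folder), Path(output_folder)
    return [
        (wav, out_dir / f"{wav.stem}.usd")
        for wav in sorted(in_dir.glob(pattern))
    ]
```

Printing the pairs first makes it easy to spot stray or misnamed files. Note that the default pattern is case-sensitive on Linux, so files ending in `.WAV` would need their own pattern.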

audio2face Method Reference

| Method | Input | Output | Description |
| --- | --- | --- | --- |
| audio2face_single | audio_file_path, output_path, fps, emotion_auto_detect | USD file | Generate a USD animation from a single audio file. |
| audio2face_folder | input_folder, output_folder, fps | USD files | Batch-process every audio file in a folder. |
| set_emotion | anger, disgust, fear, sadness, update_settings | | Override blendshape emotion weights applied to the animation. |
| stream_audio | audio_data, output_path, fps | USD file | Stream audio chunks via gRPC for real-time animation. Requires socaity[streaming]. |
| shutdown_a2f | | | Shut down the headless server and free GPU memory. |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| audio_file_path | str | | Path to the input audio file. WAV at 16 kHz or higher is recommended. |
| output_path | str | | Destination path for the generated USD animation. |
| fps | int | | Frame rate. 30 for web video, 60 for film and games. Higher values cost more processing time. |
| emotion_auto_detect | bool | | When True, the model picks up emotional tone from the audio. When False, the values set via set_emotion are used instead. |
| input_folder | str | | audio2face_folder only. Source folder containing audio files. |
| output_folder | str | | audio2face_folder only. Destination folder for the generated USD files. |
| anger / disgust / fear / sadness | float | | set_emotion. Blendshape weights in the 0.0–1.0 range. |
| update_settings | bool | | set_emotion. When True, the supplied weights override auto-detection. |

Tips

  • Use clean speech audio with minimal background music — emotion detection is more accurate.
  • WAV at 16 kHz or higher is the safest format. Convert MP3s first if needed.
  • Higher fps means smoother motion but longer processing time. 60 fps is a good film default.
  • Call shutdown_a2f() when you finish a batch so the headless server frees GPU memory.
  • For real-time use cases (live avatars, streaming dubs) reach for stream_audio — it skips the intermediate WAV write.
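
One way to make sure the shutdown tip is always followed, even when a batch raises an exception, is to wrap the client in a context manager. This is a minimal sketch: `a2f_session` is a hypothetical helper, not part of the socaity package, and it assumes only that the client exposes the shutdown_a2f() method used in this tutorial.

```python
from contextlib import contextmanager


@contextmanager
def a2f_session(client):
    """Yield the client and always call client.shutdown_a2f() afterwards."""
    try:
        yield client
    finally:
        # Runs on normal exit and on exceptions, so the GPU is released either way
        client.shutdown_a2f()


# Intended usage (requires socaity and a running headless server):
#
#     with a2f_session(audio2face()) as a2f:
#         a2f.audio2face_folder(
#             input_folder="./dialogue_lines/",
#             output_folder="./animations/",
#             fps=60,
#         )
```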

What You Built

  • Generated a USD lip-sync animation from a single WAV file
  • Overrode the auto-detected emotion with manual blendshape weights
  • Batched an entire folder of dialogue lines into one USD per file
  • Produced USD animation files ready to import into a Maya or Unreal Engine 5 pipeline