Clone Any Voice

Beginner

5 min

Use the speechcraft module to convert text to speech, clone a voice from a sample clip, and transform one voice into another — all in a few lines of Python.

Alpha SDK — Code examples on this page are illustrative. The Socaity SDK is in alpha and APIs may change. Always check the Python SDK reference for current syntax.

Uses: speechcraft (Socaity-hosted, official). Bark-based TTS, voice cloning, and voice-to-voice conversion across 13 languages. Live playground at socaity.ai/APIs/service/speechcraft.

Step 1 — Install and Initialise

SpeechCraft is bundled inside the socaity package. Import and instantiate it exactly like any other module.

pip install socaity

import os
from socaity import speechcraft

sc = speechcraft(api_key=os.getenv("SOCAITY_API_KEY"))
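
os.getenv returns None when the variable is unset, so a missing key may only surface later as an authentication failure. A minimal sketch of a fail-fast check follows. The helper name require_api_key is hypothetical and not part of the Socaity SDK; only the SOCAITY_API_KEY variable name comes from the example above.

```python
import os


def require_api_key(var_name: str = "SOCAITY_API_KEY") -> str:
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for this tutorial; not part of the Socaity SDK.
    """
    # Treat an unset variable and a blank value the same way.
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it before running this script."
        )
    return key
```

The returned string can then be passed to the constructor in place of the os.getenv call, for example speechcraft(api_key=require_api_key()).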

Step 2 — Text to Speech

text2voice converts a text string to an audio file. You can choose from a library of built-in voices or pass a custom voice name.

audio = sc.text2voice(
    text="Welcome to Socaity — the AI cloud built for builders.",
    voice="en_speaker_3",        # built-in voice name
).get_result()

audio.save("welcome.mp3")
print("Saved welcome.mp3")

Step 3 — Clone a Voice from a Sample

Provide a short audio clip (at least 10 seconds of clean speech) and SpeechCraft will extract a voice embedding. Pass a voice_name to label it for reuse in subsequent calls.

# 1. Extract a voice embedding from a 15-second sample and name it
embedding = sc.voice2embedding(
    audio_file="./my_voice_sample.wav",
    voice_name="my_voice",
    save=True,             # persist server-side for reuse
).get_result()

# 2. Synthesise new speech using the cloned voice name
audio = sc.text2voice(
    text="This is my cloned voice speaking new text.",
    voice="my_voice",      # the name assigned above
).get_result()

audio.save("cloned_voice.mp3")

The voice_name you assign is the handle for reuse — pass it as the voice argument in any subsequent text2voice call.
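
Because Step 3 asks for at least 10 seconds of speech, it can help to check the clip length locally before uploading. The sketch below uses only Python's standard wave module and assumes the sample is an uncompressed PCM WAV file. The function names are illustrative and not part of the SDK, and the 10-second default simply mirrors the figure quoted above.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_sample_length(path: str, min_seconds: float = 10.0) -> float:
    """Return the clip duration, or raise ValueError if it is too short."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        raise ValueError(
            f"{path} is {duration:.1f}s long; need at least {min_seconds:.0f}s."
        )
    return duration
```

Calling check_sample_length("./my_voice_sample.wav") before voice2embedding surfaces a too-short clip without spending a request on it.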

Step 4 — Voice to Voice Conversion

voice2voice takes an existing audio recording and re-renders it in a different voice while preserving the original timing and emotion.

# Re-render an existing recording in a different voice
converted = sc.voice2voice(
    audio_file="./original_recording.wav",
    voice_name="en_speaker_3",     # built-in voice name or a previously saved embedding name
).get_result()

converted.save("converted.mp3")
print("Voice conversion complete!")

JavaScript Alternative

JavaScript SDK — Early Development — High-level model methods such as SpeechCraft are not yet available in the JS SDK. Use the Python SDK for full model access. See the JavaScript SDK reference for the current feature set.

// The JavaScript SDK is in early development.
// High-level model methods are coming soon.
// For now, use the Python SDK for full model access.

import { socaity } from "socaity"

socaity.setApiKey(process.env.SOCAITY_API_KEY)
const models = await socaity.getAvailableModels()
console.log("Available models:", models)

SpeechCraft Method Reference

Method	Key Parameters	Output	Description
`text2voice`	text, voice	Audio	Generate speech from text using a built-in or cloned voice name.
`voice2embedding`	audio_file, voice_name, save	Embedding	Extract a voice embedding from an audio sample and assign it a name for reuse.
`voice2voice`	audio_file, voice_name	Audio	Re-render audio in a target voice, preserving timing and emotion.

Parameters

Most calls work with the defaults. Reach for these when you need to tune output quality or variability.

Parameter	Type	Default	Description
`text`	`str`	`—`	Text to synthesise. Supports inline tokens like [laughs], [sighs], [music] and ALL-CAPS for emphasis.
`voice`	`str | MediaFile | bytes`	`"en_speaker_3"`	Built-in speaker (e.g. "de_speaker_1") or a voice embedding produced by voice2embedding.
`audio_file`	`AudioFile | str | bytes`	`—`	Source audio sample. 7–15 seconds of clean, single-speaker audio works best.
`voice_name`	`str`	`"new_speaker"`	In voice2embedding, the name to assign to the extracted embedding for later reuse. In voice2voice, the name of the target voice (built-in or previously saved).
`save`	`bool`	`False`	voice2embedding only. Persist the embedding server-side. Subject to server policy.
`temp`	`float`	`0.7`	voice2voice only. Higher values produce more variation; lower values stay closer to the input.
`fine_temp`	`float`	`0.5`	text2voice fine codec temperature. Raise it for more varied delivery.
`coarse_temp`	`float`	`0.7`	text2voice coarse codec temperature.

Built-in voices are named with a language code, the word speaker, and an index, for example en_speaker_3 or de_speaker_1. Supported languages: en, de, es, fr, hi, it, ja, ko, pl, pt, ru, tr, zh.
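
The naming pattern and language list above can be captured in a small helper that rejects unsupported language codes before a request is sent. This is a sketch under stated assumptions: builtin_voice is a hypothetical name, not an SDK function, and because this page does not say how many speakers exist per language, the index is only checked for being non-negative.

```python
# The 13 language codes listed on this page.
SUPPORTED_LANGUAGES = frozenset(
    {"en", "de", "es", "fr", "hi", "it", "ja", "ko", "pl", "pt", "ru", "tr", "zh"}
)


def builtin_voice(language: str, speaker: int) -> str:
    """Build a built-in voice name such as "de_speaker_1".

    Hypothetical helper; the upper bound of the speaker index is not
    documented here, so it is not validated.
    """
    lang = language.strip().lower()
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {language!r}")
    if speaker < 0:
        raise ValueError("speaker index must be non-negative")
    return f"{lang}_speaker_{speaker}"
```

The result is a plain string, so it can be passed wherever a built-in voice name is expected, such as the voice argument of text2voice.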

Tips for Best Results

Use a sample clip with minimal background noise — aim for SNR above 20 dB.
10–30 seconds of speech is the sweet spot; longer samples do not always improve quality.
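The model is language-aware. On this page the language code appears only as the voice-name prefix (for example de_speaker_1), so choose a voice whose prefix matches the language of the text to improve prosody.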
For real-time applications, poll the job status and stream the resulting audio file rather than waiting for full completion.
Cache extracted voice profiles as JSON; inference is billed per audio second, not per clone.
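
The 20 dB target in the first tip can be sanity-checked locally. The sketch below estimates SNR as 10 * log10(P_speech / P_noise) from two blocks of samples: one taken while the speaker is talking and one taken from a silent stretch of the same recording. How those segments are chosen is left to the reader, the function names are illustrative, and none of this is part of the SDK.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of a block of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Approximate SNR in dB from a speech segment and a noise-only segment.

    The speech segment still contains background noise, so the result
    slightly overestimates the true SNR (by about 0.04 dB at 20 dB).
    """
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if noise_power == 0:
        return math.inf  # no measurable noise floor
    if speech_power == 0:
        return -math.inf  # silent "speech" segment
    return 10.0 * math.log10(speech_power / noise_power)
```

A speech segment whose amplitude is ten times that of the noise segment gives a power ratio of 100, which corresponds to 20 dB.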

What You Built

Converted text to speech using a built-in voice
Extracted a voice embedding from a sample audio clip
Synthesised new speech in the cloned voice
Converted an existing recording to a different voice using voice2voice
