Clone Any Voice
Use the speechcraft module to convert text to speech, clone a voice from a sample clip, and transform one voice into another, all in a few lines of Python.
Uses: speechcraft (Socaity-hosted, official). Bark-based TTS, voice cloning, and voice-to-voice conversion across 13 languages. Live playground at socaity.ai/APIs/service/speechcraft.
SpeechCraft is bundled inside the socaity package. Import and instantiate it exactly like any other module.
pip install socaity

import os
from socaity import speechcraft

# Read the API key from the environment rather than hard-coding it
sc = speechcraft(api_key=os.getenv("SOCAITY_API_KEY"))

text2voice converts a text string to an audio file. You can choose from a library of built-in voices or pass a custom voice name.
audio = sc.text2voice(
    text="Welcome to Socaity - the AI cloud built for builders.",
    voice="en_speaker_3",  # built-in voice name (pattern: <language>_speaker_<n>)
).get_result()
audio.save("welcome.mp3")
print("Saved welcome.mp3")

Provide a short audio clip (at least 10 seconds of clean speech) and SpeechCraft will extract a voice embedding. Pass a voice_name to label it for reuse in subsequent calls.
# 1. Extract a voice embedding from a 15-second sample and name it
embedding = sc.voice2embedding(
    audio_file="./my_voice_sample.wav",
    voice_name="my_voice",
    save=True,  # persist server-side for reuse
).get_result()

# 2. Synthesise new speech using the cloned voice name
audio = sc.text2voice(
    text="This is my cloned voice speaking new text.",
    voice="my_voice",  # the name assigned above
).get_result()
audio.save("cloned_voice.mp3")

The voice_name you assign is the handle for reuse: pass it as the voice argument in any subsequent text2voice call.

voice2voice takes an existing audio recording and re-renders it in a different voice while preserving the original timing and emotion.

# Re-render an existing recording in a different voice
converted = sc.voice2voice(
    audio_file="./original_recording.wav",
    voice_name="en_speaker_3",  # built-in voice name or a previously saved embedding name
).get_result()
converted.save("converted.mp3")
print("Voice conversion complete!")

SpeechCraft methods are not yet available in the JS SDK. Use the Python SDK for full model access. See the JavaScript SDK reference for the current feature set.

// The JavaScript SDK is in early development.
// High-level model methods are coming soon.
// For now, use the Python SDK for full model access.
import { socaity } from "socaity"
socaity.setApiKey(process.env.SOCAITY_API_KEY)
const models = await socaity.getAvailableModels()
console.log("Available models:", models)

| Method | Key Parameters | Output | Description |
|---|---|---|---|
| text2voice | text, voice | Audio | Generate speech from text using a built-in or cloned voice name. |
| voice2embedding | audio_file, voice_name, save | Embedding | Extract a voice embedding from an audio sample and assign it a name for reuse. |
| voice2voice | audio_file, voice_name | Audio | Re-render audio in a target voice, preserving timing and emotion. |
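The clone-then-speak flow built from the first two methods in the table can be wrapped in one reusable function. This is a minimal sketch, not part of the SDK: `clone_and_speak` is a hypothetical helper, and it assumes only what the examples above show, namely that each method returns a job with `.get_result()` and that the audio result has `.save()`.

```python
from typing import Any


def clone_and_speak(client: Any, sample_path: str, voice_name: str,
                    text: str, out_path: str) -> str:
    """Clone a voice from sample_path, then speak `text` with it.

    `client` is assumed to expose voice2embedding and text2voice as used
    in the examples above (hypothetical wrapper, not an SDK function).
    """
    # Step 1: extract the embedding and persist it under voice_name.
    client.voice2embedding(
        audio_file=sample_path,
        voice_name=voice_name,
        save=True,
    ).get_result()

    # Step 2: synthesise speech, referring to the clone by its name.
    audio = client.text2voice(text=text, voice=voice_name).get_result()
    audio.save(out_path)
    return out_path
```

With the `sc` client created earlier, a call would look like `clone_and_speak(sc, "./my_voice_sample.wav", "my_voice", "Hello there.", "hello.mp3")`.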
Most calls work with the defaults. Reach for these when you need to tune output quality or variability.
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | (none) | Text to synthesise. Supports inline tokens like [laughs], [sighs], [music] and ALL-CAPS for emphasis. |
| voice | str \| MediaFile \| bytes | "en_speaker_3" | Built-in speaker (e.g. "de_speaker_1") or a voice embedding produced by voice2embedding. |
| audio_file | AudioFile \| str \| bytes | (none) | Source audio sample. 7–15 seconds of clean, single-speaker audio works best. |
| voice_name | str | "new_speaker" | In voice2embedding, the name to assign to the extracted embedding for later reuse. In voice2voice, the name of the target voice. |
| save | bool | False | voice2embedding only. Persist the embedding server-side. Subject to server policy. |
| temp | float | 0.7 | voice2voice only. Higher values produce more variation; lower values stay closer to the input. |
| fine_temp | float | 0.5 | text2voice fine codec temperature. Raise it for more varied delivery. |
| coarse_temp | float | 0.7 | text2voice coarse codec temperature. |
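To make the temperature settings concrete, the sketch below bundles them into keyword arguments for text2voice. `tts_kwargs` is a hypothetical helper, and the (0, 1] bound it enforces is a conservative assumption rather than a documented limit; the parameter names and defaults are taken from the table.

```python
def tts_kwargs(text: str, voice: str = "en_speaker_3",
               fine_temp: float = 0.5, coarse_temp: float = 0.7) -> dict:
    """Build keyword arguments for text2voice using the table's defaults."""
    for name, value in (("fine_temp", fine_temp), ("coarse_temp", coarse_temp)):
        # Assumed sanity bound, not a limit stated by the API.
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    return {"text": text, "voice": voice,
            "fine_temp": fine_temp, "coarse_temp": coarse_temp}


# Lower temperatures for a steadier read, higher ones for more variation.
steady = tts_kwargs("Your order has shipped.", fine_temp=0.3, coarse_temp=0.5)
lively = tts_kwargs("WOW, that is great news! [laughs]",
                    fine_temp=0.7, coarse_temp=0.9)
```

With a client `sc` as created earlier, either dictionary can be passed along as `sc.text2voice(**lively).get_result()`.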
Built-in voices follow the pattern en_speaker_3, de_speaker_1, etc. Supported languages: en, de, es, fr, hi, it, ja, ko, pl, pt, ru, tr, zh. Inline tokens like [laughs], [sighs], [music] and ALL-CAPS for emphasis are picked up by the model.
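A quick local check can tell a built-in voice name apart from a custom one before a request is sent. The sketch below assumes the `<language>_speaker_<n>` pattern described above; `builtin_voice_language` is a hypothetical helper, and because the valid range of speaker indices is not stated here, it only checks that the index is numeric.

```python
import re
from typing import Optional

# Language codes listed above.
SUPPORTED_LANGUAGES = ("en", "de", "es", "fr", "hi", "it", "ja", "ko",
                       "pl", "pt", "ru", "tr", "zh")
_BUILTIN_VOICE = re.compile(r"^([a-z]{2})_speaker_(\d+)$")


def builtin_voice_language(voice: str) -> Optional[str]:
    """Return the language code of a built-in voice name, else None."""
    match = _BUILTIN_VOICE.match(voice)
    if match and match.group(1) in SUPPORTED_LANGUAGES:
        return match.group(1)
    return None
```

A name such as "my_voice" returns None, which signals that it should be treated as a saved embedding name rather than a built-in speaker.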
- Use a sample clip with minimal background noise; aim for SNR above 20 dB.
- 10–30 seconds of speech is the sweet spot; longer samples do not always improve quality.
- The model is language-aware; pass the correct language code to improve prosody.
- For real-time applications, poll the job status and stream the resulting audio file rather than waiting for full completion.
- Cache extracted voice profiles as JSON; inference is billed per audio second, not per clone.
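The sample-length tip can be checked locally before uploading. The sketch below uses only Python's standard `wave` module, so it applies to uncompressed PCM WAV files; `wav_duration_seconds` and `sample_length_ok` are hypothetical helpers, and the 10–30 second window is the range suggested in the tips above.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def sample_length_ok(path: str, min_s: float = 10.0,
                     max_s: float = 30.0) -> bool:
    """Check that a sample falls inside the suggested length window."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

Calling `sample_length_ok("./my_voice_sample.wav")` before voice2embedding catches clips that are too short or too long without spending a request on them.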
- Converted text to speech using a built-in voice
- Extracted a voice profile from a sample audio clip
- Synthesised new speech in the cloned voice
- Converted an existing recording to a different voice using voice2voice