Socaity Docs

Clone Any Voice

Beginner
5 min

Use the speechcraft module to convert text to speech, clone a voice from a sample clip, and transform one voice into another, all in a few lines of Python.

Uses: speechcraft (Socaity-hosted, official). Bark-based TTS, voice cloning, and voice-to-voice conversion across 13 languages. Live playground at socaity.ai/APIs/service/speechcraft.

Step 1: Install and Initialise

SpeechCraft is bundled inside the socaity package. Import and instantiate it exactly like any other module.

terminal
pip install socaity

python
import os
from socaity import speechcraft

sc = speechcraft(api_key=os.getenv("SOCAITY_API_KEY"))
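
If SOCAITY_API_KEY is unset, os.getenv returns None and the problem may only surface on the first request. The sketch below fails early with a readable message instead. The helper name read_api_key is hypothetical and not part of the socaity package.

```python
import os


def read_api_key(env_var: str = "SOCAITY_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before creating the client."
        )
    return key


# Usage (commented out: needs the socaity package and a valid key):
# from socaity import speechcraft
# sc = speechcraft(api_key=read_api_key())
```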

Step 2: Text to Speech

text2voice converts a text string to an audio file. You can choose from a library of built-in voices or pass a custom voice name.

python
audio = sc.text2voice(
    text="Welcome to Socaity, the AI cloud built for builders.",
    voice="en_speaker_3",        # built-in voice name (pattern: <lang>_speaker_<n>)
).get_result()

audio.save("welcome.mp3")
print("Saved welcome.mp3")
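
For longer scripts you may prefer to synthesise sentence by sentence, for example to save one file per passage. The sketch below groups whole sentences into chunks. The helper split_into_chunks and the 200-character limit are illustrative assumptions, not SpeechCraft requirements; the service may handle long inputs on its own.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [p.strip() for p in parts if p.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Usage (commented out: needs an initialised client `sc`):
# for i, chunk in enumerate(split_into_chunks(long_text)):
#     sc.text2voice(text=chunk, voice="en_speaker_3").get_result().save(f"part_{i}.mp3")
```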

Step 3: Clone a Voice from a Sample

Provide a short audio clip (at least 10 seconds of clean speech) and SpeechCraft will extract a voice embedding. Pass a voice_name to label it for reuse in subsequent calls.

python
# 1. Extract a voice embedding from a 15-second sample and name it
embedding = sc.voice2embedding(
    audio_file="./my_voice_sample.wav",
    voice_name="my_voice",
    save=True,             # persist server-side for reuse
).get_result()

# 2. Synthesise new speech using the cloned voice name
audio = sc.text2voice(
    text="This is my cloned voice speaking new text.",
    voice="my_voice",      # the name assigned above
).get_result()

audio.save("cloned_voice.mp3")
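
Before uploading a sample, it can help to confirm that the clip meets the 10-second minimum mentioned above. This sketch uses only Python's standard wave module and assumes an uncompressed PCM WAV file. The helper names are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, min_seconds: float = 10.0) -> bool:
    """True if the clip lasts at least min_seconds."""
    return wav_duration_seconds(path) >= min_seconds


# Usage:
# if not is_long_enough("./my_voice_sample.wav"):
#     print("Sample is shorter than 10 seconds; record a longer clip.")
```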

Step 4: Voice to Voice Conversion

voice2voice takes an existing audio recording and re-renders it in a different voice while preserving the original timing and emotion.

python
# Re-render an existing recording in a different voice
converted = sc.voice2voice(
    audio_file="./original_recording.wav",
    voice_name="en_speaker_3",     # built-in voice name or a previously saved embedding name
).get_result()

converted.save("converted.mp3")
print("Voice conversion complete!")
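
To convert several recordings in one go, the same call chain can be wrapped in a loop. The helper convert_folder below is a hypothetical sketch. It assumes only that the client exposes voice2voice(...).get_result().save(path) as in the example above, and that results are saved as .mp3.

```python
from pathlib import Path


def convert_folder(client, input_dir: str, output_dir: str, voice_name: str) -> list[Path]:
    """Run voice2voice on every .wav file in input_dir and save the results.

    `client` is any object exposing voice2voice(...).get_result().save(path).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved: list[Path] = []
    for source in sorted(Path(input_dir).glob("*.wav")):
        target = out / f"{source.stem}_{voice_name}.mp3"
        job = client.voice2voice(audio_file=str(source), voice_name=voice_name)
        job.get_result().save(str(target))
        saved.append(target)
    return saved


# Usage (commented out: needs an initialised client `sc`):
# convert_folder(sc, "./recordings", "./converted", voice_name="my_voice")
```

Files are processed in sorted order so repeated runs produce the same output list.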

JavaScript Alternative

javascript
// The JavaScript SDK is in early development.
// High-level model methods are coming soon.
// For now, use the Python SDK for full model access.

import { socaity } from "socaity"

socaity.setApiKey(process.env.SOCAITY_API_KEY)
const models = await socaity.getAvailableModels()
console.log("Available models:", models)

SpeechCraft Method Reference

Method           Key Parameters                Output     Description
text2voice       text, voice                   Audio      Generate speech from text using a built-in or cloned voice name.
voice2embedding  audio_file, voice_name, save  Embedding  Extract a voice embedding from an audio sample and assign it a name for reuse.
voice2voice      audio_file, voice_name        Audio      Re-render audio in a target voice, preserving timing and emotion.

Parameters

Most calls work with the defaults. Reach for these when you need to tune output quality or variability.

Parameter    Type                     Default         Description
text         str                      -               Text to synthesise. Supports inline tokens like [laughs], [sighs], [music] and ALL-CAPS for emphasis.
voice        str | MediaFile | bytes  "en_speaker_3"  Built-in speaker (e.g. "de_speaker_1") or a voice embedding produced by voice2embedding.
audio_file   AudioFile | str | bytes  -               Source audio sample. 7–15 seconds of clean, single-speaker audio works best.
voice_name   str                      "new_speaker"   voice2embedding only. Name to assign to the extracted embedding for later reuse.
save         bool                     False           voice2embedding only. Persist the embedding server-side. Subject to server policy.
temp         float                    0.7             voice2voice only. Higher values produce more variation; lower values stay closer to the input.
fine_temp    float                    0.5             text2voice fine codec temperature. Raise it for more varied delivery.
coarse_temp  float                    0.7             text2voice coarse codec temperature.
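
As a sketch of how the tuning parameters might be passed, the helper below collects text2voice arguments using the defaults listed in the table. The helper name text2voice_kwargs and the rule that temperatures must be positive are assumptions made for illustration; the valid ranges are not stated here.

```python
def text2voice_kwargs(
    text: str,
    voice: str = "en_speaker_3",
    fine_temp: float = 0.5,
    coarse_temp: float = 0.7,
) -> dict:
    """Build keyword arguments for text2voice using the documented defaults."""
    for name, value in (("fine_temp", fine_temp), ("coarse_temp", coarse_temp)):
        # Assumption: a temperature of zero or below is not meaningful.
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    return {
        "text": text,
        "voice": voice,
        "fine_temp": fine_temp,
        "coarse_temp": coarse_temp,
    }


# Usage (commented out: needs an initialised client `sc`):
# audio = sc.text2voice(**text2voice_kwargs("Hello [laughs]", fine_temp=0.8)).get_result()
```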

Built-in voices follow the pattern en_speaker_3, de_speaker_1, etc. Supported languages: en, de, es, fr, hi, it, ja, ko, pl, pt, ru, tr, zh.
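
The naming pattern can be captured in a small helper that rejects unsupported language codes before a request is sent. The function builtin_voice is hypothetical. The upper bound on the speaker index is not given here, so the sketch only checks that it is non-negative.

```python
SUPPORTED_LANGUAGES = (
    "en", "de", "es", "fr", "hi", "it", "ja", "ko", "pl", "pt", "ru", "tr", "zh",
)


def builtin_voice(language: str, speaker: int) -> str:
    """Build a built-in voice name such as 'de_speaker_1'."""
    lang = language.lower()
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {language!r}")
    if speaker < 0:
        raise ValueError("speaker index must be non-negative")
    return f"{lang}_speaker_{speaker}"


# Usage (commented out: needs an initialised client `sc`):
# sc.text2voice(text="Guten Morgen!", voice=builtin_voice("de", 1))
```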

Tips for Best Results

  • Use a sample clip with minimal background noise; aim for a signal-to-noise ratio (SNR) above 20 dB.
  • 10–30 seconds of speech is the sweet spot; longer samples do not always improve quality.
  • The model is language-aware; use the correct language code (for example the de in de_speaker_1) to improve prosody.
  • For real-time applications, poll the job status and stream the resulting audio file rather than waiting for full completion.
  • Cache extracted voice profiles as JSON; inference is billed per audio second, not per clone.
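
To check the 20 dB guideline, SNR can be estimated from the RMS level of a speech segment and of a noise-only segment (for example a pause in the recording) as 20 * log10(rms_speech / rms_noise). The sketch below works on plain lists of float samples. How you load and segment the audio is left open, and the function names are illustrative.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of a non-empty sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Estimate SNR in decibels from a speech segment and a noise-only segment."""
    return 20.0 * math.log10(rms(speech) / rms(noise))


# Usage:
# if snr_db(speech_samples, silence_samples) < 20.0:
#     print("Recording may be too noisy for a clean clone.")
```

A speech segment whose RMS amplitude is ten times that of the noise floor corresponds to 20 dB.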

What You Built

  • Converted text to speech using a built-in voice
  • Extracted a voice profile from a sample audio clip
  • Synthesised new speech in the cloned voice
  • Converted an existing recording to a different voice using voice2voice