Open Sourcing the First LLM-Powered Tajik Voice
A New Chapter in Low-Resource Language AI Begins Here
June 27, 2025
TLDR:
We are open-sourcing an early checkpoint of the first LLM-powered Tajik text-to-speech model. Trained on only 35 hours of audio data, it already produces high-quality, natural-sounding speech.
You can try the demo here:
https://huggingface.co/spaces/re-skill/tajik-tts
Our model still learns to predict the next token, but we convert waveform audio into sequences of discrete tokens that can be treated similarly to text during training.
Background
Classical text-to-speech (TTS) models have long excelled at voice cloning and speech synthesis. They generally follow a two-stage process: first, a model like Tacotron converts text into an intermediate acoustic representation (such as a spectrogram), and then a vocoder transforms that representation into waveform audio. While these systems can produce lifelike voices, their primary focus has been on replicating a given speaker's sound, with limited capacity to engage in dynamic, context-aware conversations.
The advent of large language models (LLMs) offers a compelling opportunity to enhance these systems. By incorporating LLMs into TTS pipelines, we can leverage their sophisticated reasoning and contextual understanding to create truly conversational speech systems. Instead of merely cloning a voice, these enhanced systems can interpret context, adapt to dialogue flows, and generate responses that feel both natural and interactive. Essentially, LLMs open up a new dimension where synthesis isn't only about producing sound; it's about enabling intelligent, context-aware conversation.
One practical way to integrate these capabilities is through a cascaded approach in a speech-to-speech system, which typically involves three distinct modules (see the sketch after this list):
Speech-to-Text (STT): Converts the incoming speech into text.
Large Language Model (LLM): Processes and reasons on the transcribed text to understand context and generate a conversational response.
Text-to-Speech (TTS): Synthesizes the LLM-generated text back into natural-sounding speech.
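To make the wiring concrete, here is a minimal sketch of the cascade. The function names and signatures are purely illustrative placeholders, not an actual implementation:

```python
# Illustrative cascaded speech-to-speech pipeline; each stage is a placeholder
# standing in for a real STT, LLM, and TTS model.
import numpy as np

def transcribe(audio: np.ndarray, sample_rate: int) -> str:
    """STT: convert incoming speech to text."""
    raise NotImplementedError("plug in an STT model, e.g. a Whisper-style ASR")

def respond(transcript: str) -> str:
    """LLM: reason over the transcript and draft a conversational reply."""
    raise NotImplementedError("plug in a chat-tuned LLM")

def synthesize(text: str) -> np.ndarray:
    """TTS: render the reply back into waveform audio."""
    raise NotImplementedError("plug in a TTS model, e.g. a Tajik TTS checkpoint")

def speech_to_speech(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    transcript = transcribe(audio, sample_rate)  # speech -> text (prosody is lost here)
    reply = respond(transcript)                  # text -> text
    return synthesize(reply)                     # text -> speech
```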
This cascaded method combines the strengths of each specialized component. However, this approach has limitations.
One major challenge is that the LLM doesn't capture the full richness of the speech input.
Speech carries subtle cues (intonation, rhythm, emotion, and prosodic nuance) that are often lost in the conversion to text. As a result, when an LLM processes transcribed text, it receives a significantly distilled representation of the original audio. This loss of detail can limit the model's ability to produce responses that fully mirror the expressive qualities of the initial speech, potentially resulting in synthetic output that feels less dynamic or contextually aware.
Integrating speech directly with an LLM could solve this challenge but presents significant difficulties. Unlike text, speech is a continuous, high-dimensional signal. LLMs are designed to work with discrete tokens, so converting speech into a format these models can process requires additional steps. Existing methods address this gap in two main ways:
Audio Encoders: Some approaches use dedicated audio encoders to transform continuous speech into discrete tokens before feeding them into an LLM. This method aims to preserve critical acoustic features (like intonation and rhythm) while converting the signal into a format that LLMs can natively understand.
Neural Codecs: Neural codecs convert waveform audio into sequences of discrete tokens that can be treated similarly to text. These codecs, such as those used in models like DAC, Encodec, or XCodec, enable both TTS and speech-to-speech systems to leverage the token-based architecture of LLMs. By bridging the gap between continuous audio signals and the discrete token sequences required by LLMs, neural codecs help streamline the integration process and enable more efficient model training and inference.
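As a concrete illustration of codec tokenization, the sketch below uses Encodec (mentioned above) to turn a waveform into discrete token IDs and back. It assumes the `transformers` EncodecModel API and the public `facebook/encodec_24khz` checkpoint; the shapes and rates in the comments are indicative rather than guaranteed:

```python
# Sketch: turning a waveform into discrete codec tokens with Encodec.
import torch
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
codec = EncodecModel.from_pretrained("facebook/encodec_24khz").eval()

# One second of (placeholder) mono audio at the codec's sampling rate.
waveform = torch.zeros(processor.sampling_rate)

inputs = processor(
    raw_audio=waveform.numpy(),
    sampling_rate=processor.sampling_rate,
    return_tensors="pt",
)

with torch.no_grad():
    encoded = codec.encode(inputs["input_values"], inputs["padding_mask"])

# `audio_codes` holds integer token IDs: one stream per RVQ codebook,
# roughly 75 frames per second of audio for the 24 kHz model.
print(encoded.audio_codes.shape)

# The same codes can be decoded back to audio, which is exactly what an
# LLM-based TTS does after predicting the tokens.
with torch.no_grad():
    reconstructed = codec.decode(
        encoded.audio_codes, encoded.audio_scales, inputs["padding_mask"]
    )[0]
```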
Despite these innovations, each method comes with trade-offs. Audio encoders must balance preserving critical information with the need for compact, discrete representations. Neural codecs, meanwhile, face challenges related to token rate (speech typically generates far more tokens per second than text) and the potential loss of fine-grained acoustic details during quantization.
In summary, while classical TTS models provide a strong foundation for effective voice cloning and speech synthesis, integrating LLM reasoning significantly expands the potential use cases by enabling contextual, conversational interactions. The cascaded STT–LLM–TTS pipeline is a practical approach to achieve this integration, yet it carries inherent challenges such as error propagation between modules and difficulties in capturing the full richness of speech signals. Advances in audio encoders and neural codecs are crucial for overcoming these hurdles, paving the way for next-generation conversational speech systems that seamlessly combine natural language understanding with high-fidelity audio synthesis.
Technical Overview

Model Architecture
Standing on the shoulders of giants, we initialized the weights from our existing base model Orpheus-3b-0.1-pretrained. We then extended training of this base model on language-specific data in the form of text-speech pairs.
Built upon a Llama-3B backbone, Orpheus pairs the LLM with the multi-scale neural audio codec SNAC. SNAC is a Residual Vector Quantization (RVQ) codec that captures audio information across different temporal resolutions, aiming for efficient compression and detailed reconstruction. The model still learns to predict the next token; here, each token is an output of the SNAC codec, i.e. waveform audio converted into a sequence of discrete tokens that can be treated similarly to text.
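For reference, here is a minimal sketch of SNAC tokenization using the open-source `snac` package. The `hubertsiuzdak/snac_24khz` checkpoint is the publicly released 24 kHz model; whether it matches the exact codec configuration used in this pipeline is an assumption:

```python
# Sketch: encoding audio with SNAC (pip install snac).
# The 24 kHz checkpoint uses three RVQ codebooks at coarse-to-fine temporal rates.
import torch
from snac import SNAC

codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# Two seconds of (placeholder) mono audio: shape (batch, channels, samples).
audio = torch.zeros(1, 1, 2 * 24_000)

with torch.inference_mode():
    codes = codec.encode(audio)

# `codes` is a list of integer tensors, one per codebook. The finer codebooks
# run at higher temporal resolution, roughly a 1:2:4 ratio of frame counts.
for level, c in enumerate(codes):
    print(f"codebook {level}: shape {tuple(c.shape)}")

with torch.inference_mode():
    audio_hat = codec.decode(codes)  # back to a waveform
```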

SNAC samples tokens at different frequencies, which we flatten as shown.
To manage the multiple token streams from SNAC within a standard LLM framework, Orpheus generates a flattened sequence of tokens (7 tokens per audio frame, emitted sequentially). It achieves low-latency streaming (roughly 200 ms, down to about 25-50 ms when input text is streamed into the KV cache), suitable for real-time applications, by using an optimized decoding process that applies a sliding-window technique to the SNAC decoder, ensuring smooth audio output without pops.
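To illustrate the 7-tokens-per-frame layout, here is a minimal sketch that flattens SNAC's three code streams into one sequence. The interleaving order shown is one plausible layout only; the Orpheus reference implementation defines the exact order it expects:

```python
# Sketch: flattening SNAC's three code streams into one token sequence,
# 7 tokens per audio frame (1 coarse + 2 medium + 4 fine).
# The interleaving order below is illustrative, not the canonical Orpheus layout.
from typing import List
import torch

def flatten_codes(codes: List[torch.Tensor]) -> torch.Tensor:
    """codes = [coarse (B, n), medium (B, 2n), fine (B, 4n)] -> (B, 7n)."""
    coarse, medium, fine = codes
    n = coarse.shape[-1]
    flat = []
    for i in range(n):
        flat.extend([
            coarse[:, i],         # coarse token for this frame
            medium[:, 2 * i],     # first medium token
            fine[:, 4 * i],       # fine tokens under the first medium token
            fine[:, 4 * i + 1],
            medium[:, 2 * i + 1], # second medium token
            fine[:, 4 * i + 2],   # fine tokens under the second medium token
            fine[:, 4 * i + 3],
        ])
    return torch.stack(flat, dim=-1)

# Dummy codes with the expected 1:2:4 length ratio (batch of 1, 3 frames).
dummy = [torch.zeros(1, 3, dtype=torch.long),
         torch.zeros(1, 6, dtype=torch.long),
         torch.zeros(1, 12, dtype=torch.long)]
print(flatten_codes(dummy).shape)  # torch.Size([1, 21]) -> 7 tokens per frame
```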
Orpheus particularly emphasizes generating expressive, emotive speech and supports zero-shot voice cloning from short audio prompts.
Conclusion
We hope this will serve as a strong base model for the community to train even further. We will keep gathering more data to release even better Tajik speech synthesis models; if you want to donate data, contact us at join@re-skill.io.
https://huggingface.co/re-skill/orpheus-tj-early
You can also run it locally on an Apple Silicon (M-series) Mac:
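Here is a minimal sketch of one way to do this, not an official recipe: the checkpoint is Llama-based, so it should load with standard `transformers` classes on the MPS backend. Turning the generated audio tokens back into a waveform follows the Orpheus reference decoder and the SNAC flattening shown earlier, in reverse:

```python
# Sketch: loading the early checkpoint on Apple Silicon (MPS backend).
# Assumes a Llama-style causal-LM config, as Orpheus fine-tunes are; prompt
# formatting and audio-token post-processing follow the Orpheus reference code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "re-skill/orpheus-tj-early"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16)
model = model.to("mps").eval()

prompt = "Салом, ҷаҳон!"  # Tajik text to synthesize
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.7,
    )

# `generated` contains audio token IDs; strip the special/offset tokens and
# regroup them into SNAC's three code streams (the inverse of the flattening
# sketch above), then call SNAC's decode() to obtain the waveform.
```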