MOSS-TTS in ComfyUI 2026: Zero-Shot Voice Cloning From a 10-Second Clip on Your RTX or Mac
TL;DR: MOSS-TTS clones a voice from a clean 3–10 second clip with no reference transcript, runs locally under Apache 2.0, and slots into ComfyUI through a custom-node pack. The Local 1.7B model fits in roughly 5GB of VRAM and is the only variant fast enough for iterative work; the Delay 8B wants ~18GB and trades speed for slightly lower error rates and steadier long-form output across the same 31 languages.
| | MOSS-TTS Local 1.7B | MOSS-TTS Delay 8B | MOSS-TTS Nano 0.1B |
|---|---|---|---|
| Best for | Day-to-day cloning on one GPU | Maximum stability, long-form narration | CPU-only / no-GPU machines |
| Hardware | ~5 GB VRAM (RTX 3060 12GB and up) | ~18 GB VRAM (RTX 3090 / 4090) | Runs real-time on 4 CPU cores |
| The catch | Slightly higher error rates than 8B | Slow enough that iteration hurts | Lower fidelity, single-speaker focus |
Honest take: Start with the Local 1.7B. It clones a voice convincingly in 5GB of VRAM, and unless you’re producing hours of narration where the 8B’s marginally higher stability matters, you’ll never feel the difference.
What MOSS-TTS Actually Is
MOSS-TTS is an open-source speech generation family from MOSI.AI and the OpenMOSS team — the same group behind the MOSS large language models. The current release, MOSS-TTS v1.5, shipped May 26, 2026 under the Apache 2.0 license, which means you can use it commercially without the non-commercial restrictions that hobble a lot of “open” TTS models.
The headline feature is zero-shot voice cloning: you hand it a short audio clip of someone speaking, type the text you want, and it generates new speech in that voice. Critically — and unlike Qwen-TTS or many older cloning pipelines — MOSS-TTS does not require a transcript of the reference audio. You drop in a clip, you get the voice. That single difference removes the most error-prone step from the whole workflow.
v1.5 covers 31 languages (Chinese, English, French, German, Spanish, Japanese, Korean, Cantonese, Dutch, Hindi, Finnish, and more), follows punctuation-driven pauses more reliably than v1.0, and supports explicit inline pause markers for fine pacing control.
If you’ve already got ComfyUI running on Windows or in production on Linux, MOSS-TTS bolts on as a custom-node pack — no separate framework, no new server to babysit.
The Model Lineup (and Which One You Actually Want)
The MOSS-TTS family is wider than just two models, and the ComfyUI node pack exposes most of it. Here’s the practical breakdown with the VRAM figures the node author documents:
- MOSS-TTS Local 1.7B — ~5 GB VRAM. The fast lane. The node documentation flatly states it’s “the only model fast enough for practical iterative use on a single consumer GPU.” This is your default.
- MOSS-TTS Delay 8B — ~18 GB VRAM. The production-recommended model for long-form stability, with the lowest error rates of the two main models. Needs a 24GB card to be comfortable.
- MOSS-TTSD v1.0 (dialogue) — ~18 GB VRAM. Multi-speaker conversation generation — think two distinct voices in a podcast clip.
- MOSS-VoiceGenerator (1.7B) — ~18 GB VRAM as packed. Designs a voice from a text description rather than a reference clip.
- MOSS-SoundEffect v2.0 (1.3B DiT) — ~18 GB VRAM. Environmental sound effects, not speech.
- MOSS-TTS-Nano (0.1B) — a CPU-first variant that does real-time generation on just 4 CPU cores, for machines with no usable GPU at all.
There’s also a MOSS-TTS-Realtime (1.7B) streaming build aimed at voice agents, with a measured ~180ms time-to-first-byte after warm-up and a real-time factor (RTF) of 0.51 — meaning it generates audio roughly twice as fast as playback. That’s the one to watch if you’re wiring TTS into a live assistant rather than rendering files.
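To make the real-time factor concrete, here is a minimal sketch of the arithmetic. It assumes RTF means seconds of compute per second of audio produced; the helper names are illustrative and not part of MOSS-TTS.

```python
# Illustrative arithmetic only; these helpers are not part of MOSS-TTS.
# Assumption: RTF = seconds of compute per second of audio produced.


def speedup_over_playback(rtf: float) -> float:
    """How many times faster than playback the model generates audio."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Rough compute time needed to produce a clip of the given length."""
    return audio_seconds * rtf


if __name__ == "__main__":
    print(f"{speedup_over_playback(0.51):.2f}x playback speed")
    print(f"{generation_seconds(10.0, 0.51):.1f} s of compute for 10 s of audio")
```

Any RTF below 1.0 means generation stays ahead of playback, which is what makes streaming to a live assistant workable once the first chunk has arrived.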
How good is the cloning, in numbers?
On the standard Seed-TTS-eval benchmark, the two main models land here:
| Model | EN word error rate | EN speaker similarity | ZH char error rate | ZH speaker similarity |
|---|---|---|---|---|
| MOSS-TTS Local 1.7B | 1.93% | 73.28% | 1.44% | 79.62% |
| MOSS-TTS Delay 8B | 1.84% | 70.86% | 1.37% | 76.98% |
Read that carefully: the 8B has a marginally lower error rate, but the Local 1.7B actually scores higher on speaker similarity (73.28% vs 70.86% in English). For voice-cloning specifically, the small model is not the compromise the parameter count suggests — it’s arguably the better clone. That’s the data point that should settle the “which model” debate for most home labs.
Hardware: What You Need to Run It
The audio path is light compared to image generation. The codec runs at a 12.5-token-per-second frame rate (1 second of audio ≈ 12.5 tokens), so the model isn’t pushing the enormous token volumes a diffusion image model does. Output is 24 kHz mono.
For the Local 1.7B, a 12GB card like the RTX 3060 clears the ~5GB requirement with room to keep ComfyUI’s other nodes resident. A 16GB RTX 5060 Ti gives you headroom to also run an image or LLM workflow in the same session. The node pack uses FlashAttention 2 on CUDA GPUs with compute capability 8.0+ (Ampere and newer — RTX 30-series and up), which is most cards anyone is buying for AI in 2026.
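If you are not sure which side of the compute capability 8.0 line your card sits on, a quick check from ComfyUI's Python environment settles it. This is a sketch: the helper names are made up for illustration, and it assumes PyTorch is installed in that environment (it is for any working ComfyUI install).

```python
# Sketch: test a GPU against the compute capability 8.0+ bar cited above.
# Helper names are hypothetical; only torch.cuda calls come from PyTorch.


def meets_flash_attention_2_bar(major: int, minor: int) -> bool:
    """True for compute capability 8.0 and above (Ampere and newer)."""
    return (major, minor) >= (8, 0)


def local_gpu_capability():
    """Return (major, minor) for GPU 0, or None if PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = local_gpu_capability()
    if cap is None:
        print("No CUDA GPU visible from this Python environment")
    else:
        print(cap, "meets the 8.0+ bar:", meets_flash_attention_2_bar(*cap))
```

RTX 30-series cards report 8.6, while RTX 20-series (Turing) cards report 7.5 and fall below the bar.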
For the Delay 8B, you’re at ~18GB, which realistically means a 24GB RTX 3090 or RTX 4090. The used 3090 remains the value pick here — see our used RTX 3090 breakdown for current pricing.
No GPU at all? You have two real options: run the Nano 0.1B on CPU, or rent a GPU by the hour. A single TTS render job is short enough that spinning up a cloud box on RunPod for an afternoon of voice work costs less than a coffee — see our rent-vs-buy math before committing to hardware.
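The hardware guidance above boils down to two thresholds. Here is a small sketch of that decision, using only the approximate VRAM figures quoted in this article; the function and constant names are hypothetical, and real headroom depends on what else ComfyUI has loaded.

```python
# Hypothetical helper: maps free VRAM to a MOSS-TTS variant using the
# approximate figures quoted in this article (~5 GB and ~18 GB).

LOCAL_1_7B_GB = 5.0   # approximate, per the node pack documentation
DELAY_8B_GB = 18.0    # approximate, per the node pack documentation


def suggest_variant(free_vram_gb: float, need_long_form: bool = False) -> str:
    """Suggest a variant for the given free VRAM; falls back to the CPU model."""
    if need_long_form and free_vram_gb >= DELAY_8B_GB:
        return "MOSS-TTS Delay 8B"
    if free_vram_gb >= LOCAL_1_7B_GB:
        return "MOSS-TTS Local 1.7B"
    return "MOSS-TTS-Nano 0.1B (CPU)"


if __name__ == "__main__":
    for vram in (0, 12, 24):
        print(vram, "GB ->", suggest_variant(vram, need_long_form=True))
```

The sketch only picks the 8B when long-form output is explicitly requested, which mirrors the article's advice to default to the 1.7B.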
Apple Silicon
There’s a first-class MLX path. The mlx-community organization has published 8-bit MLX conversions (mlx-community/MOSS-TTS-8B-8bit and MOSS-TTS-Local-Transformer-MLX-8bit) that run through mlx-audio, a community-maintained audio toolkit built on Apple’s MLX framework. The default runtime uses W8A-bf16 mixed precision with global and local KV cache enabled. On unified-memory Macs the VRAM-vs-RAM distinction disappears, so a Mac Mini M4 Pro with 24GB+ handles even the 8B 8-bit conversion. If you’re already running Ollama with MLX, the toolchain will feel familiar.
Installing the ComfyUI Node Pack
The community node pack is richservo/comfyui-moss-tts. Installation is the standard custom-node routine:
cd ComfyUI/custom_nodes
git clone https://github.com/richservo/comfyui-moss-tts
cd comfyui-moss-tts
pip install -r requirements.txt
The hard dependency that catches everyone: transformers>=5.0.0. MOSS-TTS uses architecture code that landed in the Transformers 5.x line, and most existing ComfyUI installs are still on a 4.x release pinned by some other node. After installing, restart ComfyUI fully (not just a workflow refresh).
Models auto-download to ComfyUI/models/moss-tts/ the first time you queue a prompt with a given model selected. All variants share one audio codec, OpenMOSS-Team/MOSS-Audio-Tokenizer, which downloads alongside the first model.
The error that will bite you
If you skipped the Transformers upgrade, the first prompt fails immediately:
ImportError: cannot import name 'MossTTSForConditionalGeneration' from 'transformers'
Fix: force the upgrade inside ComfyUI’s own Python environment, not your system Python:
# Windows portable build: run from the folder that contains python_embeded
python_embeded\python.exe -m pip install --upgrade "transformers>=5.0.0"
# venv install: activate the environment first, then upgrade
source venv/bin/activate && pip install --upgrade "transformers>=5.0.0"
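To confirm the upgrade landed in the right interpreter, run a short check with the same Python that launches ComfyUI. This is a sketch using only the standard library; the helper names are illustrative, and the version parsing deliberately looks at the major number only.

```python
# Sketch: confirm the Transformers version visible to *this* interpreter.
# Run it with ComfyUI's own Python, not the system one.
from importlib import metadata


def major_version(version: str) -> int:
    """Extract the leading major number from a string like '5.0.1'."""
    return int(version.split(".")[0])


def transformers_is_new_enough(version: str, minimum_major: int = 5) -> bool:
    """True when the major version satisfies the >=5.0.0 requirement."""
    return major_version(version) >= minimum_major


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed in this environment")
    else:
        verdict = "OK" if transformers_is_new_enough(installed) else "needs upgrade"
        print(f"transformers {installed}: {verdict}")
```

If this prints a 4.x version after you upgraded, the upgrade almost certainly went into a different Python environment than the one ComfyUI uses.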
The second common failure is a CUDA out-of-memory on the Delay 8B when ComfyUI already has an image model resident. The fix is boring but reliable: switch the Model Loader to the Local 1.7B, or add --lowvram to your ComfyUI launch flags so it offloads the idle image model before TTS runs.
The Voice-Cloning Workflow, Node by Node
The pack adds five nodes: MOSS-TTS Model Loader, MOSS-TTS Generate, MOSS-TTS Voice Design, MOSS-TTS Sound Effect, and MOSS-TTS Dialogue. For zero-shot cloning you only need two of them, the Model Loader and Generate, plus ComfyUI’s built-in audio nodes.
- MOSS-TTS Model Loader — pick Local 1.7B. Leave precision on the default.
- Load Audio (ComfyUI’s built-in node) — point it at your reference clip.
- MOSS-TTS Generate — wire the model in, wire the reference audio into the reference_audio input, and type your target text. No transcript field to fill — that’s the whole point.
- Save Audio / Preview Audio — output is standard ComfyUI AUDIO, so any audio output node works.
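For readers who drive ComfyUI through its API rather than the canvas, the same four-node wiring can be sketched as a prompt graph. Treat this as a shape illustration only: the class_type strings and most input names for the two MOSS-TTS nodes are placeholders invented here (only reference_audio comes from the pack's description above), so check the node pack for the real identifiers before using it.

```python
# Sketch of the four-node graph in ComfyUI's API ("prompt") format.
# ASSUMPTION: class_type values and input names marked "hypothetical" are
# placeholders, not the node pack's confirmed identifiers.
import json

workflow = {
    "1": {"class_type": "MossTTSModelLoader",        # hypothetical name
          "inputs": {"model": "Local 1.7B"}},        # hypothetical input
    "2": {"class_type": "LoadAudio",                 # ComfyUI built-in
          "inputs": {"audio": "reference.wav"}},
    "3": {"class_type": "MossTTSGenerate",           # hypothetical name
          "inputs": {
              "model": ["1", 0],                     # link: node 1, output 0
              "reference_audio": ["2", 0],           # link: node 2, output 0
              "text": "Hello from a cloned voice.",  # hypothetical input name
          }},
    "4": {"class_type": "SaveAudio",                 # ComfyUI built-in
          "inputs": {"audio": ["3", 0], "filename_prefix": "moss_tts"}},
}


def dangling_links(graph: dict) -> list:
    """Return (node_id, input_name) pairs whose link targets a missing node."""
    missing = []
    for node_id, node in graph.items():
        for name, value in node["inputs"].items():
            if isinstance(value, list) and value[0] not in graph:
                missing.append((node_id, name))
    return missing


if __name__ == "__main__":
    print("dangling links:", dangling_links(workflow))
    print(json.dumps(workflow, indent=2))
```

Note that there is no transcript input anywhere in the graph: the Generate node takes the model, the reference audio, and the target text, and nothing else is required for a clone.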
Getting a clean clone
The reference clip is everything. The node documentation is specific: 3–10 seconds is the optimal length, ~15 seconds is the ceiling before artifacts creep in. Use a single speaker, a clean recording, and minimal background noise or reverb. A phone voice memo in a quiet room beats a podcast rip with music under it every time.
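A quick pre-flight check on the reference clip saves a wasted render. The sketch below applies the 3–10 second guidance and the ~15 second ceiling using only Python's standard library. It assumes an uncompressed PCM WAV file; the function names and verdict labels are invented for illustration, and treating anything under 3 seconds as too short is this sketch's reading of the guidance rather than a documented limit.

```python
# Sketch: check a reference clip against the 3-10 s guidance.
# Works for uncompressed PCM WAV files only; other formats need another reader.
import wave


def classify_duration(seconds: float) -> str:
    """Label a clip length using the thresholds quoted in the article."""
    if seconds < 3.0:
        return "shorter than recommended"
    if seconds <= 10.0:
        return "optimal"
    if seconds <= 15.0:
        return "usable, near the ceiling"
    return "too long, trim it"


def inspect_wav(path: str) -> dict:
    """Read basic properties of a WAV file and attach a duration verdict."""
    with wave.open(path, "rb") as wav:
        seconds = wav.getnframes() / wav.getframerate()
        return {
            "seconds": seconds,
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "verdict": classify_duration(seconds),
        }


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(inspect_wav(sys.argv[1]))
```

This only checks length and channel count. It cannot tell you whether the clip is noisy or has music under it, so listen to it as well.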
If your only reference is a noisy recording, run it through a denoiser first — or transcribe-and-clean a longer source with a self-hosted Whisper setup to pick the cleanest 8-second window. For pacing, v1.5 honors inline pause markers and punctuation, so write your text the way you want it spoken: commas for short beats, periods for full stops.
A note on speed expectations: because the codec runs at 12.5 tokens/second of audio, a 30-second clip is only ~375 audio tokens. On the Local 1.7B with an RTX 3090-class card, that renders in a handful of seconds — fast enough to iterate on phrasing without breaking flow. The 8B is noticeably slower per render, which is exactly why the docs steer you to the small model for anything iterative.
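The token arithmetic is simple enough to sanity-check yourself. This is an illustrative sketch based on the 12.5 tokens-per-second codec rate quoted above; the helper is not part of MOSS-TTS, and rounding up partial frames is an assumption.

```python
# Illustrative only: audio tokens implied by the 12.5 tokens/second codec rate.
import math

CODEC_TOKENS_PER_SECOND = 12.5


def audio_tokens(seconds: float) -> int:
    """Approximate codec frames for a clip, rounding partial frames up."""
    return math.ceil(seconds * CODEC_TOKENS_PER_SECOND)


if __name__ == "__main__":
    for length in (10, 30, 60):
        print(f"{length} s of audio is about {audio_tokens(length)} tokens")
```

Even a full minute of speech is well under a thousand audio tokens, which is why the small model stays responsive.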
What to Use It For (and the Ethics Line)
Practical home-lab uses: narrating your own blog posts in your own voice, generating consistent character voices for game or video projects, building a self-hosted voice assistant that sounds like something you chose rather than a stock TTS voice, and dubbing short clips across the 31 supported languages.
The obvious caution: clone voices you have the right to clone. Your own voice, a voice actor who consented, a public-domain or licensed recording — fine. Cloning a private individual or a public figure to put words in their mouth is the kind of thing that gets open models restricted. Apache 2.0 gives you the technical freedom; it doesn’t give you a pass on the ethics. Keep your reference library to voices you’re allowed to use.
For builders wiring TTS into larger agent stacks, the streaming Realtime model’s 180ms TTFB makes it viable as the speech layer of a local coding or chat agent — pair it with the tooling we cover over at aicoderscope.com for the LLM side.
FAQ
Does MOSS-TTS need an internet connection?
Only for the first model download. After the weights land in ComfyUI/models/moss-tts/, generation is fully offline.
Do I need a transcript of the reference audio?
No. That’s the key advantage over Qwen-TTS and most older pipelines — you supply audio and target text only.
Which model for the best clone quality?
Counterintuitively, the Local 1.7B scores higher speaker similarity on Seed-TTS-eval (73.28% EN) than the 8B (70.86% EN), while needing ~5GB instead of ~18GB. Use the 1.7B unless you specifically need the 8B’s long-form stability.
Can it run without a GPU?
Yes — the Nano 0.1B variant does real-time generation on 4 CPU cores. Quality is lower than the GPU models, but it works on a NAS or mini PC.
What audio quality does it output?
24 kHz mono, suitable for narration, dialogue, and voice-agent use. It’s not 48 kHz studio master quality, but it’s clean for spoken content.
Is it legal to use commercially?
The model is Apache 2.0, so commercial use is permitted. Whether a specific cloned voice is legal depends entirely on your rights to that voice — that’s on you, not the license.
Sources
- MOSS-TTS — OpenMOSS (GitHub) — release date (May 26, 2026), Apache 2.0 license, 31 languages, Realtime RTF 0.51 / 180ms TTFB, Nano 0.1B on 4 CPU cores
- MOSS-TTS model card — OpenMOSS (GitHub) — Seed-TTS-eval WER/SIM numbers, 12.5 tokens/sec codec rate, FlashAttention 2 / compute capability 8+
- comfyui-moss-tts — richservo (GitHub) — per-model VRAM figures, node names, transformers>=5.0.0 requirement, reference audio guidelines, 24 kHz mono output
- MOSS-TTS-8B-8bit — mlx-community (Hugging Face) — Apple Silicon MLX 8-bit conversion, W8A-bf16 mixed precision
- MOSS-TTS-Local-Transformer-MLX-8bit — mlx-community (Hugging Face) — MLX local-model port for Mac
- OpenMOSS ships TTS v1.5 with multi-speaker cloning — AI Weekly — v1.5 release coverage and feature summary
Last updated June 10, 2026. Model variants, VRAM figures, and benchmark numbers reflect the MOSS-TTS v1.5 release; verify current node compatibility before installing.
Recommended Gear
- RTX 3060 12GB — clears the Local 1.7B’s ~5GB requirement with room for other ComfyUI nodes
- RTX 5060 Ti 16GB — headroom to run TTS alongside an image or LLM workflow
- RTX 3090 — 24GB used-market pick for the Delay 8B at ~18GB
- RTX 4090 — fastest single-card option for the 8B model