Jun 10, 2026

MOSS-TTS in ComfyUI 2026: Zero-Shot Voice Cloning From a 10-Second Clip on Your RTX or Mac

By RunAIHome Team · 14 min read

Tags: comfyui, moss-tts, voice-cloning, tts, local-ai, apple-silicon, mlx, audio

TL;DR: MOSS-TTS clones a voice from a clean 3–10 second clip with no reference transcript, runs locally under Apache 2.0, and slots into ComfyUI through a custom-node pack. The Local 1.7B model fits in roughly 5GB of VRAM and, per the node pack’s documentation, is the only variant fast enough for iterative work; the Delay 8B wants ~18GB and trades speed for slightly lower error rates and better long-form stability. Both belong to the v1.5 release, which covers 31 languages.

	MOSS-TTS Local 1.7B	MOSS-TTS Delay 8B	MOSS-TTS Nano 0.1B
Best for	Day-to-day cloning on one GPU	Maximum stability, long-form narration	CPU-only / no-GPU machines
Hardware	~5 GB VRAM (RTX 3060 12GB and up)	~18 GB VRAM (RTX 3090 / 4090)	Runs real-time on 4 CPU cores
The catch	Slightly higher error rate than the 8B	Slow enough that iteration hurts	Lower fidelity, single-speaker focus

Honest take: Start with the Local 1.7B. It clones a voice convincingly in 5GB of VRAM, and unless you’re producing hours of narration where the 8B’s marginally higher stability matters, you’ll never feel the difference.

What MOSS-TTS Actually Is

MOSS-TTS is an open-source speech generation family from MOSI.AI and the OpenMOSS team — the same group behind the MOSS large language models. The current release, MOSS-TTS v1.5, shipped May 26, 2026 under the Apache 2.0 license, which means you can use it commercially without the non-commercial restrictions that hobble a lot of “open” TTS models.

The headline feature is zero-shot voice cloning: you hand it a short audio clip of someone speaking, type the text you want, and it generates new speech in that voice. Critically — and unlike Qwen-TTS or many older cloning pipelines — MOSS-TTS does not require a transcript of the reference audio. You drop in a clip, you get the voice. That single difference removes the most error-prone step from the whole workflow.

v1.5 covers 31 languages (Chinese, English, French, German, Spanish, Japanese, Korean, Cantonese, Dutch, Hindi, Finnish, and more), follows punctuation-driven pauses more reliably than v1.0, and supports explicit inline pause markers for fine pacing control.

If you’ve already got ComfyUI running on Windows or in production on Linux, MOSS-TTS bolts on as a custom-node pack — no separate framework, no new server to babysit.

The Model Lineup (and Which One You Actually Want)

The MOSS-TTS family is wider than just two models, and the ComfyUI node pack exposes most of it. Here’s the practical breakdown with the VRAM figures the node author documents:

MOSS-TTS Local 1.7B — ~5 GB VRAM. The fast lane. The node documentation flatly states it’s “the only model fast enough for practical iterative use on a single consumer GPU.” This is your default.
MOSS-TTS Delay 8B — ~18 GB VRAM. The production-recommended model for long-form stability, with the lowest error rates of the two main models. Needs a 24GB card to be comfortable.
MOSS-TTSD v1.0 (dialogue) — ~18 GB VRAM. Multi-speaker conversation generation — think two distinct voices in a podcast clip.
MOSS-VoiceGenerator (1.7B) — ~18 GB VRAM as packaged in the node pack. Designs a voice from a text description rather than a reference clip.
MOSS-SoundEffect v2.0 (1.3B DiT) — ~18 GB VRAM. Environmental sound effects, not speech.
MOSS-TTS-Nano (0.1B) — a CPU-first variant that does real-time generation on just 4 CPU cores, for machines with no usable GPU at all.

There’s also a MOSS-TTS-Realtime (1.7B) streaming build aimed at voice agents, with a measured ~180ms time-to-first-byte after warm-up and a real-time factor (RTF) of 0.51 — meaning it generates audio roughly twice as fast as playback. That’s the one to watch if you’re wiring TTS into a live assistant rather than rendering files.
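
To make the RTF figure concrete, here is a minimal sketch of the arithmetic. The function names are ours, not part of any MOSS-TTS API; the only input taken from the release notes is the 0.51 figure quoted above.

```python
def render_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock seconds needed to generate `audio_seconds` of speech."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds * rtf


def speedup_over_playback(rtf: float) -> float:
    """How many times faster than real-time playback generation runs."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


if __name__ == "__main__":
    rtf = 0.51  # figure quoted above for MOSS-TTS-Realtime
    print(f"10 s of audio takes about {render_seconds(10, rtf):.1f} s to generate")
    print(f"That is about {speedup_over_playback(rtf):.2f}x faster than playback")
```

An RTF below 1.0 is what lets a streaming agent stay ahead of playback; anything above 1.0 would fall behind the listener.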

How good is the cloning, in numbers?

On the standard Seed-TTS-eval benchmark, the two main models land here:

Model	EN word error rate	EN speaker similarity	ZH char error rate	ZH speaker similarity
MOSS-TTS Local 1.7B	1.93%	73.28%	1.44%	79.62%
MOSS-TTS Delay 8B	1.84%	70.86%	1.37%	76.98%

Read that carefully: the 8B has a marginally lower error rate, but the Local 1.7B actually scores higher on speaker similarity (73.28% vs 70.86% in English). For voice-cloning specifically, the small model is not the compromise the parameter count suggests — it’s arguably the better clone. That’s the data point that should settle the “which model” debate for most home labs.
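
If you want the gaps spelled out, this small sketch subtracts the 8B row from the 1.7B row. The numbers are copied from the table above; the dictionary keys and the gap helper are just illustrative names.

```python
# Seed-TTS-eval figures copied from the table above (all values in percent).
SCORES = {
    "Local 1.7B": {"en_wer": 1.93, "en_sim": 73.28, "zh_cer": 1.44, "zh_sim": 79.62},
    "Delay 8B": {"en_wer": 1.84, "en_sim": 70.86, "zh_cer": 1.37, "zh_sim": 76.98},
}


def gap(metric: str) -> float:
    """Local 1.7B minus Delay 8B, in percentage points."""
    return round(SCORES["Local 1.7B"][metric] - SCORES["Delay 8B"][metric], 2)


if __name__ == "__main__":
    for metric in ("en_wer", "en_sim", "zh_cer", "zh_sim"):
        print(f"{metric}: {gap(metric):+.2f} points")
```

A positive gap on the error metrics favors the 8B; a positive gap on the similarity metrics favors the 1.7B.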

Hardware: What You Need to Run It

The audio path is light compared to image generation. The codec runs at a 12.5-token-per-second frame rate (1 second of audio ≈ 12.5 tokens), so the model isn’t pushing the enormous token volumes a diffusion image model does. Output is 24 kHz mono.
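
A quick back-of-the-envelope sketch shows how small those volumes are. It counts one token per codec frame, as this article does, and the helper names are ours rather than anything from the MOSS-TTS codebase.

```python
CODEC_TOKENS_PER_SECOND = 12.5  # codec frame rate quoted above
OUTPUT_SAMPLE_RATE_HZ = 24_000  # 24 kHz mono output


def audio_tokens(seconds: float) -> float:
    """Codec tokens needed to represent `seconds` of audio."""
    return seconds * CODEC_TOKENS_PER_SECOND


def output_samples(seconds: float) -> int:
    """PCM samples in `seconds` of 24 kHz mono output."""
    return int(round(seconds * OUTPUT_SAMPLE_RATE_HZ))


def samples_per_token() -> float:
    """How many output samples each codec token stands for."""
    return OUTPUT_SAMPLE_RATE_HZ / CODEC_TOKENS_PER_SECOND


if __name__ == "__main__":
    print(audio_tokens(30), output_samples(30), samples_per_token())
```

Under these figures a 30-second render is 375 codec tokens standing in for 720,000 output samples, or 1,920 samples per token.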

For the Local 1.7B, a 12GB card like the RTX 3060 clears the ~5GB requirement with room to keep ComfyUI’s other nodes resident. A 16GB RTX 5060 Ti gives you headroom to also run an image or LLM workflow in the same session. The node pack uses FlashAttention 2 on CUDA GPUs with compute capability 8.0+ (Ampere and newer — RTX 30-series and up), which is most cards anyone is buying for AI in 2026.

For the Delay 8B, you’re at ~18GB, which realistically means a 24GB RTX 3090 or RTX 4090. The used 3090 remains the value pick here — see our used RTX 3090 breakdown for current pricing.

No GPU at all? You have two real options: run the Nano 0.1B on CPU, or rent a GPU by the hour. A single TTS render job is short enough that spinning up a cloud box on RunPod for an afternoon of voice work costs less than a coffee — see our rent-vs-buy math before committing to hardware.
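
The decision above reduces to a simple lookup. This is only a sketch built on the approximate VRAM figures in this article; the function name is ours, and real headroom depends on what else ComfyUI has loaded.

```python
# Approximate VRAM needs in GB, as listed in this article for the node pack.
VRAM_NEEDED_GB = {
    "MOSS-TTS Local 1.7B": 5,
    "MOSS-TTS Delay 8B": 18,
}
CPU_FALLBACK = "MOSS-TTS Nano 0.1B (CPU)"


def variants_that_fit(free_vram_gb: float) -> list[str]:
    """Return the main variants that fit in the given free VRAM.

    Falls back to the CPU-first Nano model when nothing fits.
    """
    fits = [name for name, need in VRAM_NEEDED_GB.items() if free_vram_gb >= need]
    return fits or [CPU_FALLBACK]


if __name__ == "__main__":
    for gb in (0, 12, 24):
        print(gb, "GB free ->", variants_that_fit(gb))
```

Fitting is not the same as being the right pick: even on a 24GB card, the 1.7B remains the default recommendation here for iterative work.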

Apple Silicon

There’s a first-class MLX path. The mlx-community organization on Hugging Face has published 8-bit MLX conversions (mlx-community/MOSS-TTS-8B-8bit and MOSS-TTS-Local-Transformer-MLX-8bit) that run through mlx-audio, a community audio toolkit built on Apple’s MLX framework. The default runtime uses W8A-bf16 mixed precision with global and local KV cache enabled. On unified-memory Macs the VRAM-vs-RAM distinction disappears, so a Mac Mini M4 Pro with 24GB+ handles even the 8B 8-bit conversion. If you’re already running Ollama with MLX, the toolchain will feel familiar.

Installing the ComfyUI Node Pack

The community node pack is richservo/comfyui-moss-tts. Installation is the standard custom-node routine:

cd ComfyUI/custom_nodes
git clone https://github.com/richservo/comfyui-moss-tts
cd comfyui-moss-tts
pip install -r requirements.txt

The hard dependency that catches everyone: transformers>=5.0.0. MOSS-TTS uses architecture code that landed in the Transformers 5.x line, and many existing ComfyUI installs are still on a 4.x release, often pinned by another custom node. After installing, restart ComfyUI fully (not just a workflow refresh).

Models auto-download to ComfyUI/models/moss-tts/ the first time you queue a prompt with a given model selected. All variants share one audio codec, OpenMOSS-Team/MOSS-Audio-Tokenizer, which downloads alongside the first model.

The error that will bite you

If you skipped the Transformers upgrade, the first prompt fails immediately:

ImportError: cannot import name 'MossTTSForConditionalGeneration' from 'transformers'

Fix: force the upgrade inside ComfyUI’s own Python environment, not your system Python:

# from the ComfyUI root, using its bundled python
python_embeded\python.exe -m pip install --upgrade "transformers>=5.0.0"   # Windows portable
# or, for a venv install:
source venv/bin/activate && pip install --upgrade "transformers>=5.0.0"
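
To confirm the upgrade landed in the right environment, you can run a short check with the same Python that launches ComfyUI. This is a sketch using only the standard library; the helper names are ours, and the version parsing is deliberately simplified (it treats a pre-release such as 5.0.0rc1 as 5.0.0, which a strict PEP 440 comparison would not).

```python
from importlib import metadata

MINIMUM = (5, 0, 0)  # from the node pack's transformers>=5.0.0 requirement


def parse_version(text: str) -> tuple:
    """Turn '5.1.0' or '5.0.0.dev0' into a tuple of its first three integers."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version_text: str, minimum: tuple = MINIMUM) -> bool:
    """True when the version string is at or above the minimum."""
    return parse_version(version_text) >= minimum


def installed_transformers_ok() -> bool:
    """Check the transformers package visible to this interpreter."""
    try:
        return meets_minimum(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("transformers OK" if installed_transformers_ok() else "upgrade transformers")
```

If this prints the upgrade message when run with ComfyUI’s interpreter, the pip command above went to a different Python.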

The second common failure is a CUDA out-of-memory on the Delay 8B when ComfyUI already has an image model resident. The fix is boring but reliable: switch the Model Loader to the Local 1.7B, or add --lowvram to your ComfyUI launch flags so it offloads the idle image model before TTS runs.

The Voice-Cloning Workflow, Node by Node

The pack adds five nodes: MOSS-TTS Model Loader, MOSS-TTS Generate, MOSS-TTS Voice Design, MOSS-TTS Sound Effect, and MOSS-TTS Dialogue. For zero-shot cloning you only need two of them, plus ComfyUI’s built-in audio nodes:

MOSS-TTS Model Loader — pick Local 1.7B. Leave precision on the default.
Load Audio (ComfyUI’s built-in node) — point it at your reference clip.
MOSS-TTS Generate — wire the model in, wire the reference audio into the reference_audio input, and type your target text. No transcript field to fill — that’s the whole point.
Save Audio / Preview Audio — output is standard ComfyUI AUDIO, so any audio output node works.
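
The four steps above can be sketched as a graph in ComfyUI’s API ("prompt") layout, where each link is a [node_id, output_index] pair. Treat this as an illustration of the wiring only: LoadAudio and SaveAudio are ComfyUI built-ins, but the two MOSS class_type strings and the model and text input names are hypothetical placeholders we made up. Only reference_audio comes from the description above, so export a real workflow from your install to get the actual names.

```python
# Sketch of the four-node cloning graph in ComfyUI's API ("prompt") layout.
# "MossTTSModelLoader" and "MossTTSGenerate" are HYPOTHETICAL class names,
# as are the "model" and "text" input names; check an exported workflow.
WORKFLOW = {
    "1": {"class_type": "MossTTSModelLoader",  # hypothetical
          "inputs": {"model": "MOSS-TTS Local 1.7B"}},
    "2": {"class_type": "LoadAudio",
          "inputs": {"audio": "reference.wav"}},
    "3": {"class_type": "MossTTSGenerate",  # hypothetical
          "inputs": {
              "model": ["1", 0],            # link: output 0 of node 1
              "reference_audio": ["2", 0],  # link: output 0 of node 2
              "text": "Hello from a cloned voice.",
          }},
    "4": {"class_type": "SaveAudio",
          "inputs": {"audio": ["3", 0], "filename_prefix": "moss_tts"}},
}


def dangling_links(workflow: dict) -> list:
    """Return (node_id, input_name) pairs whose link points at a missing node."""
    missing = []
    for node_id, node in workflow.items():
        for name, value in node["inputs"].items():
            is_link = isinstance(value, list) and len(value) == 2
            if is_link and value[0] not in workflow:
                missing.append((node_id, name))
    return missing


if __name__ == "__main__":
    print("dangling links:", dangling_links(WORKFLOW))
```

Note what is absent: no node carries a transcript of the reference clip, only the reference audio and the target text.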

Getting a clean clone

The reference clip is everything. The node documentation is specific: 3–10 seconds is the optimal length, ~15 seconds is the ceiling before artifacts creep in. Use a single speaker, a clean recording, and minimal background noise or reverb. A phone voice memo in a quiet room beats a podcast rip with music under it every time.
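
A small pre-flight check can catch clips that are outside those length guidelines before you queue a prompt. This sketch uses Python’s standard wave module, so it only reads uncompressed PCM WAV files; the thresholds are the ones quoted above, while the rating labels and function names are our own.

```python
import wave

OPTIMAL_RANGE_S = (3.0, 10.0)  # sweet spot quoted above
CEILING_S = 15.0               # approximate upper limit quoted above


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def rate_reference_length(seconds: float) -> str:
    """Classify a reference clip length against the guidelines."""
    low, high = OPTIMAL_RANGE_S
    if seconds < low:
        return "too short"
    if seconds <= high:
        return "optimal"
    if seconds <= CEILING_S:
        return "usable"
    return "too long"


if __name__ == "__main__":
    import sys
    for clip in sys.argv[1:]:
        secs = wav_duration_seconds(clip)
        print(f"{clip}: {secs:.1f} s ({rate_reference_length(secs)})")
```

Length is the only thing this checks. It cannot tell you whether the clip has a single speaker, background music, or reverb, so you still need to listen to it.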

If your only reference is a noisy recording, run it through a denoiser first — or transcribe-and-clean a longer source with a self-hosted Whisper setup to pick the cleanest 8-second window. For pacing, v1.5 honors inline pause markers and punctuation, so write your text the way you want it spoken: commas for short beats, periods for full stops.

A note on speed expectations: because the codec runs at 12.5 tokens/second of audio, a 30-second clip is only ~375 audio tokens. On the Local 1.7B with an RTX 3090-class card, that renders in a handful of seconds — fast enough to iterate on phrasing without breaking flow. The 8B is noticeably slower per render, which is exactly why the docs steer you to the small model for anything iterative.

What to Use It For (and the Ethics Line)

Practical home-lab uses: narrating your own blog posts in your own voice, generating consistent character voices for game or video projects, building a self-hosted voice assistant that sounds like something you chose rather than a stock TTS voice, and dubbing short clips across the 31 supported languages.

The obvious caution: clone voices you have the right to clone. Your own voice, a voice actor who consented, a public-domain or licensed recording — fine. Cloning a private individual or a public figure to put words in their mouth is the kind of thing that gets open models restricted. Apache 2.0 gives you the technical freedom; it doesn’t give you a pass on the ethics. Keep your reference library to voices you’re allowed to use.

For builders wiring TTS into larger agent stacks, the streaming Realtime model’s 180ms TTFB makes it viable as the speech layer of a local coding or chat agent — pair it with the tooling we cover over at aicoderscope.com for the LLM side.

FAQ

Does MOSS-TTS need an internet connection? Only for the first model download. After the weights land in ComfyUI/models/moss-tts/, generation is fully offline.

Do I need a transcript of the reference audio? No. That’s the key advantage over Qwen-TTS and most older pipelines — you supply audio and target text only.

Which model for the best clone quality? Counterintuitively, the Local 1.7B scores higher speaker similarity on Seed-TTS-eval (73.28% EN) than the 8B (70.86% EN), while needing ~5GB instead of ~18GB. Use the 1.7B unless you specifically need the 8B’s long-form stability.

Can it run without a GPU? Yes — the Nano 0.1B variant does real-time generation on 4 CPU cores. Quality is lower than the GPU models, but it works on a NAS or mini PC.

What audio quality does it output? 24 kHz mono, suitable for narration, dialogue, and voice-agent use. It’s not 48kHz studio master quality, but it’s clean for spoken content.

Is it legal to use commercially? The model is Apache 2.0, so commercial use is permitted. Whether a specific cloned voice is legal depends entirely on your rights to that voice — that’s on you, not the license.

MOSS-TTS vs XTTS-v2 vs F5-TTS — which should I pick? The deciding factor is usually the license, not the audio. XTTS-v2 ships under the non-commercial CPML license, and since Coqui shut down in January 2024 there’s no commercial license to buy. F5-TTS is CC-BY-NC-4.0 — also non-commercial. Both are excellent for personal projects: XTTS-v2 clones from roughly 6 seconds across 17 languages, and F5-TTS is the speed champion, cloning from about 3 seconds with a flow-matching architecture that hits several times real-time on a mid-range card. But if you’re narrating anything you might monetize — a YouTube channel, client work, a product — MOSS-TTS’s Apache 2.0 license is the one that lets you ship without a legal asterisk. Pick MOSS-TTS for commercial freedom and 31-language coverage, F5-TTS if raw speed on a small GPU is the priority, XTTS-v2 if you’re strictly personal and want a battle-tested clone.

Is MOSS-TTS cheaper than ElevenLabs? For any real volume, yes, because on hardware you already own, local generation has no per-use cost. ElevenLabs’ entry commercial tier (Starter) runs $5/month for 30,000 characters, the Creator plan is $22/month for 100,000 characters, and Professional Voice Cloning requires Creator or above. At Multilingual v2 rates, 1 character is 1 credit, so 100,000 characters works out to only a few minutes of narration per day before you’re topping up. MOSS-TTS on a GPU you own has no per-character meter; the tradeoff is that you supply the hardware and the setup time instead of a subscription. If you generate audio daily, the local model pays for itself fast; if you need a handful of clips a month, a cloud subscription may be less hassle.
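
The subscription math is easy to rerun for your own volume. In this sketch the plan prices and character allowances are the ones quoted above, while the speaking pace of 900 characters per minute is our own rough assumption, not a figure from ElevenLabs or the sources; adjust it to match your scripts.

```python
# plan name: (USD per month, characters included), as quoted above
PLANS = {
    "Starter": (5.00, 30_000),
    "Creator": (22.00, 100_000),
}
# ASSUMPTION: rough narration pace in characters per spoken minute.
CHARS_PER_SPOKEN_MINUTE = 900


def usd_per_1k_chars(plan: str) -> float:
    """Effective price per 1,000 characters on a plan."""
    price, chars = PLANS[plan]
    return price / (chars / 1000)


def audio_minutes_per_day(plan: str, days: int = 30) -> float:
    """Minutes of narration per day the monthly allowance covers."""
    _, chars = PLANS[plan]
    return chars / CHARS_PER_SPOKEN_MINUTE / days


if __name__ == "__main__":
    for name in PLANS:
        print(f"{name}: ${usd_per_1k_chars(name):.2f} per 1k chars, "
              f"{audio_minutes_per_day(name):.1f} min/day")
```

Under that pace assumption, the Creator allowance covers a little under four minutes of narration per day, which is the scale at which a local model starts to look attractive.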

Can I fine-tune MOSS-TTS on my own voice? You usually don’t need to. Zero-shot cloning already reproduces a speaker’s timbre from a short reference clip with no speaker-specific training, which covers most home-lab use. MOSS-TTS does support fine-tuning on custom datasets for cases where you want a dedicated model — a specific language, domain vocabulary, or a house voice you’ll reuse constantly — but that’s a heavier workflow than the ComfyUI node pack exposes. Start with zero-shot; only reach for fine-tuning if the clone consistently misses something specific about a voice.

Does it work on AMD GPUs with ROCm? The node pack’s documented fast path is CUDA-specific — it uses FlashAttention 2 on NVIDIA cards with compute capability 8.0+ (Ampere and newer). The upstream MOSS-TTS repo doesn’t publish official ROCm support, so on an AMD card you’d be relying on PyTorch’s ROCm build and losing the FlashAttention acceleration, which isn’t a path the maintainers document or test. If you’re on AMD today, the reliable options are the CPU-first Nano 0.1B or renting an NVIDIA box by the hour on RunPod for the render.

Sources

MOSS-TTS — OpenMOSS (GitHub) — release date (May 26, 2026), Apache 2.0 license, 31 languages, Realtime RTF 0.51 / 180ms TTFB, Nano 0.1B on 4 CPU cores
MOSS-TTS model card — OpenMOSS (GitHub) — Seed-TTS-eval WER/SIM numbers, 12.5 tokens/sec codec rate, FlashAttention 2 / compute capability 8+
comfyui-moss-tts — richservo (GitHub) — per-model VRAM figures, node names, transformers>=5.0.0 requirement, reference audio guidelines, 24 kHz mono output
MOSS-TTS-8B-8bit — mlx-community (Hugging Face) — Apple Silicon MLX 8-bit conversion, W8A-bf16 mixed precision
MOSS-TTS-Local-Transformer-MLX-8bit — mlx-community (Hugging Face) — MLX local-model port for Mac
OpenMOSS ships TTS v1.5 with multi-speaker cloning — AI Weekly — v1.5 release coverage and feature summary
Local TTS Voice Cloning 2026: Piper vs XTTS v2 vs F5-TTS — PromptQuorum — XTTS-v2 CPML/17-language details, F5-TTS CC-BY-NC-4.0 flow-matching / ~3s reference / speed figures
ElevenLabs Pricing Breakdown 2026 — Flexprice — Starter $5/30k chars, Creator $22/100k chars, PVC requires Creator+, 1 char = 1 credit on Multilingual v2

Last updated June 10, 2026. Model variants, VRAM figures, and benchmark numbers reflect the MOSS-TTS v1.5 release; verify current node compatibility before installing.

Recommended Gear

RTX 3060 12GB — clears the Local 1.7B’s ~5GB requirement with room for other ComfyUI nodes
RTX 5060 Ti 16GB — headroom to run TTS alongside an image or LLM workflow
RTX 3090 — 24GB used-market pick for the Delay 8B at ~18GB
RTX 4090 — fastest single-card option for the 8B model
