Qwen3-TTS on Mac Mini M4: The Ultimate Installation & Optimization Guide

The Mac Mini M4 is a powerhouse for local AI, but running Qwen3-TTS (Alibaba's latest high-quality text-to-speech model) requires a few "under-the-hood" tweaks to move from NVIDIA-centric defaults to Apple’s Metal Performance Shaders (MPS).

Follow this guide to avoid common pitfalls and get the best performance out of your M4 chip.


1. Prerequisites: System Dependencies

macOS lacks some low-level audio processing libraries required by TTS engines. Install them via Homebrew first:

brew install portaudio ffmpeg sox

Note: Skipping this will likely result in a /bin/sh: sox: command not found error during execution.
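To confirm the command-line tools are actually on your PATH before going further, here is a quick standard-library check (portaudio is a linked library rather than a CLI, so it can't be tested this way):

# Check that the audio CLIs installed by Homebrew are reachable
import shutil

for tool in ("sox", "ffmpeg"):
    path = shutil.which(tool)
    print(f"{tool}: {path or 'NOT FOUND (brew install ' + tool + ')'}")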


2. Environment Setup

We recommend using Python 3.12 with a clean Conda environment to keep things stable.

# Create and activate the environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# Install the core inference library
pip install -U qwen-tts

# Clone the repository for local modifications
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
pip install -e .
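Before editing any code, it is worth confirming that your PyTorch build actually sees the Apple GPU:

# Sanity check: was PyTorch built with MPS, and is the backend usable right now?
import torch

print("PyTorch:", torch.__version__)
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())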

3. The "M4 Special": Code Modifications

The default scripts are hardcoded for NVIDIA GPUs. To run on your M4, you must modify examples/test_model_12hz_base.py.

A. Fix Model Path & Acceleration (Approx. Line 50)

Find the Qwen3TTSModel.from_pretrained section and update it to use sdpa (PyTorch's scaled dot-product attention, which runs on Apple Silicon) and the mps device.

# --- BEFORE ---
# MODEL_PATH = "Qwen/Qwen3-TTS-12Hz-1.7B-Base/"
# tts = Qwen3TTSModel.from_pretrained(..., attn_implementation="flash_attention_2")

# --- AFTER (Modified for M4) ---
MODEL_PATH = "Qwen/Qwen3-TTS-12Hz-1.7B-Base" # Remove the trailing slash
tts = Qwen3TTSModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,   # M4 fully supports bfloat16
    attn_implementation="sdpa",    # Use SDPA instead of FlashAttention2
    device_map="mps",              # Force use of Apple GPU
)
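To double-check where the weights actually landed, you can inspect the model's parameters. This assumes Qwen3TTSModel behaves like a standard torch.nn.Module and exposes .parameters(); if it wraps the network differently, check its attributes instead:

# Should print "mps:0"; "cpu" means the weights silently fell back to the CPU
print(next(tts.parameters()).device)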

B. Fix Synchronization Logic (Crucial!)

The M4 chip uses torch.mps, not torch.cuda, and calling torch.cuda.synchronize() on a machine without CUDA will crash the script. Replace any synchronization calls with this hardware-aware block:

# Replace torch.cuda.synchronize() with:
if torch.cuda.is_available():
    torch.cuda.synchronize()
elif torch.backends.mps.is_available():
    torch.mps.synchronize() # The correct instruction for M4
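If you benchmark generation times, it helps to wrap this pattern in a small helper so every timing call stays hardware-agnostic. A minimal sketch:

import time
import torch

def sync():
    # Block until all queued GPU work has finished, whatever the backend
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elif torch.backends.mps.is_available():
        torch.mps.synchronize()

start = time.perf_counter()
# ... run generation here ...
sync()  # ensure the GPU is idle before reading the clock
print(f"Elapsed: {time.perf_counter() - start:.2f}s")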

4. Handling Large Downloads

The model weights are roughly 4 GB. If you face slow speeds or connection timeouts with Hugging Face, use a mirror (if one is available in your region) or ensure a stable connection.

To use a mirror in your terminal:

export HF_ENDPOINT=https://hf-mirror.com
python test_model_12hz_base.py
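You can also set the mirror from inside Python, as long as it happens before transformers or huggingface_hub are imported, since they read HF_ENDPOINT at import time:

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # must precede any Hugging Face imports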

Troubleshooting: "InvalidHeaderDeserialization"

If you see a safetensors_rust.SafetensorError, the download was most likely interrupted, leaving a corrupted weights file behind.

  • The Fix: Go to ~/.cache/huggingface/hub, delete the model's folder (named something like models--Qwen--Qwen3-TTS-12Hz-1.7B-Base), and run the script again to restart the download.
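Alternatively, you can pre-fetch the weights with huggingface_hub, which resumes interrupted transfers instead of leaving a half-written safetensors behind:

# Download (or resume downloading) the model into the local HF cache
from huggingface_hub import snapshot_download

local_path = snapshot_download("Qwen/Qwen3-TTS-12Hz-1.7B-Base")
print("Model cached at:", local_path)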

5. Running the Model

Once the edits are saved, navigate to the examples folder and run:

cd examples
python test_model_12hz_base.py

If everything is set up correctly, the M4 will generate a series of high-fidelity audio samples in a newly created directory.


Pro-Tips for Mac Users

  • Verification: To confirm your GPU is being used, run this in Python (it should print True):
    import torch; print(torch.backends.mps.is_available())
  • The "Reboot" Rule: Apple Silicon drivers can occasionally hang after heavy environment switching. If you get an inexplicable error, a system restart fixes 90% of driver-related issues.

Reference Code

import os
import torch
import soundfile as sf
import numpy as np
# Ensure 'qwen_tts' is installed or importable from the current directory
from qwen_tts import Qwen3TTSModel

# ================= 1. Initialization (Setup) =================

# Auto-detect the hardware:
# "mps" = Mac (Apple Silicon), "cuda" = NVIDIA GPU, "cpu" = standard processor
if torch.backends.mps.is_available():
    device = "mps"   # Mac M1/M2/M3/M4...
elif torch.cuda.is_available():
    device = "cuda"  # NVIDIA GPU / 英伟达显卡
else:
    device = "cpu"   # Standard CPU / 普通处理器

print(f"Using device / 当前使用设备: {device}")

# Define where to save the results
OUT_DIR = "qwen3_slow_output"
os.makedirs(OUT_DIR, exist_ok=True)

print("Loading model... (This might take a minute)")
print("正在加载模型... (可能需要一分钟)")

# Load the model from Hugging Face
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map=device,
)
print("Model loaded successfully! / 模型加载完成!")

# ================= 2. Reference Audio Settings =================
# This is the voice the model will mimic (clone).

# Option A: Use a URL (official Qwen example)
ref_audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav"

# Option B: Use a local file (uncomment the line below to use your own file)
# ref_audio_url = "./my_voice.wav"

# CRITICAL: This text MUST exactly match what is said in the reference audio;
# if it doesn't, generation quality degrades noticeably.
ref_text_content = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."

# ================= 3. Content to Generate =================
# Tip: To make the speech slower and clearer, add punctuation (like , . ...),
# which forces the model to pause between words.

segments = [
    {
        "lang": "Chinese",
        # Trick: Added commas to the original "大家好..." text to slow it down.
        "text": "大家好,这个视频是,分享如何在Mac Mini上,部署Qwen.3-TTS,运行官方例子程序,希望你们喜欢。", 
        "temp": 0.7, 
    },
    {
        "lang": "English",
        # Original: This video is about...
        # Trick: Added "..." and extra commas for a relaxed pace.
        # 技巧:增加 "..." 和额外的逗号,让节奏更舒缓。
        "text": "Hello everyone! In this video, I'll share how to deploy Qwen.3-TTS on a Mac Mini and run the official demos. I hope you enjoy it.", 
        "temp": 0.7,
    },
    {
        "lang": "Japanese",
        # Trick: Added extra Japanese commas (、)
        "text": "皆さん、こんにちは。この動画では、Mac MiniでQwen.3-TTSを導入し、公式デモを動かす方法をシェアします。気に入っていただけると嬉しいです。", 
        "temp": 0.7,
    },
    {
        "lang": "Korean",
        # Trick: Added breaks between concepts.
        "text": "안녕하세요 여러분. 이번 영상에서는 맥 미니(Mac Mini)에 Qwen.3-TTS를 구축하고, 공식 예제를 실행하는 방법을 공유해 드리겠습니다. 유익한 시간이 되시길 바랍니다.", 
        "temp": 0.7,
    },
    {
        "lang": "German",
        "text": "Hallo zusammen! In diesem Video zeige ich euch, wie man Qwen.3-TTS auf einem Mac Mini deployt und die offiziellen Demos ausführt. Ich hoffe, es gefällt euch.", 
        "temp": 0.6,
    },
    {
        "lang": "French",
        "text": "Bonjour à tous ! Dans cette vidéo, je vais partager comment déployer Qwen.3-TTS sur un Mac Mini et lancer les démos officielles. J'espère qu'elle vous plaira.", 
        "temp": 0.8,
    }
]

# ================= 4. Generation Loop =================
all_audio_parts = []
final_sr = None  # Sample rate

print("Starting audio generation... / 开始生成音频...")

for i, seg in enumerate(segments):
    print(f"[{i+1}/{len(segments)}] Generating {seg['lang']} segment... / 正在生成 {seg['lang']} 片段...")

    # Try the 'speed' parameter if the model supports it
    try:
        wavs, sr = model.generate_voice_clone(
            text=seg['text'],
            language=seg['lang'],
            ref_audio=ref_audio_url,
            ref_text=ref_text_content,
            temperature=seg['temp'],
            speed=0.85,  # 0.85 = 85% speed (slower)
        )
    except TypeError:
        # If 'speed' raises a TypeError, drop it and rely on the punctuation tricks
        print(f"  (Note: speed parameter not supported, using standard speed for {seg['lang']})")
        wavs, sr = model.generate_voice_clone(
            text=seg['text'],
            language=seg['lang'],
            ref_audio=ref_audio_url,
            ref_text=ref_text_content,
            temperature=seg['temp'],
        )

    # Process the audio data
    audio_data = wavs[0]
    if isinstance(audio_data, torch.Tensor):
        audio_data = audio_data.cpu().numpy()

    all_audio_parts.append(audio_data)
    if final_sr is None:
        final_sr = sr

# ================= 5. Merging Audio =================
print("Merging all segments...")

# Create a silence gap between languages.
# 0.3 s is used here; increase it (e.g., to 0.8 s) for a more relaxed listening pace.
silence_duration = 0.3
silence_samples = int(silence_duration * final_sr)
silence_data = np.zeros(silence_samples, dtype=np.float32)

final_sequence = []
for part in all_audio_parts:
    final_sequence.append(part)
    final_sequence.append(silence_data)  # Add silence after each part

# Remove the very last silence block
if final_sequence:
    final_sequence.pop()

full_audio = np.concatenate(final_sequence)

# ================= 6. Save Output =================
final_path = os.path.join(OUT_DIR, "Final_Slow_Mix.wav")
sf.write(final_path, full_audio, final_sr)

print("="*30)
print(f"Done! Audio saved to: / 完成!音频已保存至:\n{final_path}")
print("="*30)