Getting Started with Microsoft’s Latest Open-Source Long-Form Speech Model VibeVoice

What is VibeVoice? VibeVoice is a research framework released by Microsoft Research for long-form, multi-speaker, conversational speech synthesis. Target scenarios include entire podcast episodes, audio dramas, or interviews: it can maintain speaker consistency within a single generation and handle natural turn-taking. The model family includes multiple scales (e.g., 1.5B, 7B, etc.) and is available on Hugging Face as microsoft/VibeVoice-1.5B, along with model cards, weights, installation guides, and responsible use notes....

September 18, 2025 · 4 min