Optimizing Inference with Parameter/Data (P/D) Separation in the vLLM Framework

Large language models often hit GPU memory bottlenecks when deployed for inference. Model parameters (P) can reach hundreds of GB and must remain resident in GPU memory, while input/output data (D) changes with each request yet is often placed on the same device as the parameters, leading to imbalanced memory usage and limited scalability. To address this, we can use the vLLM framework to implement Parameter/Data (P/D) Separation, improving the flexibility and throughput of inference systems....

September 29, 2025 · 5 min

Getting Started with Microsoft’s Latest Open-Source Long-Form Speech Model VibeVoice

What is VibeVoice? VibeVoice is a research framework released by Microsoft Research for long-form, multi-speaker, conversational speech synthesis. Target scenarios include entire podcast episodes, audio dramas, and interviews: it can maintain speaker consistency within a single generation and handle natural turn-taking. The model family comes in multiple scales (e.g., 1.5B and 7B) and is available on Hugging Face as microsoft/VibeVoice-1.5B, along with model cards, weights, installation guides, and responsible use notes....

September 18, 2025 · 4 min

Azure 101 Series: Microsoft Azure Overview

Azure is a cloud computing platform and service provided by Microsoft. It offers a range of infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS) solutions for building, deploying, and managing various types of applications and services. Overview: Azure provides a wide range of features and services, including virtual machines, storage, databases, artificial intelligence, machine learning, blockchain, Internet of Things (IoT), containers, and serverless computing....

June 19, 2024 · 13 min