Comprehensive Analysis of LLM Inference Parallelism Strategies: TP / DP / PP / EP Principles and vLLM Performance Verification

When the parameter scale of Large Language Models (LLMs) crosses the multi-billion threshold, one fact becomes unavoidable: the era of single-card deployment is over. Whether for training or inference, the size of the model itself, the growth of the KV Cache, and the memory and compute pressure from concurrent requests in real-world workloads all dictate that large models must run across multiple GPUs, or even multiple nodes. However, running a model on multiple cards is not simply a matter of stacking hardware....

December 24, 2025 · 5 min

Getting Started with Microsoft’s Latest Open-Source Long-Form Speech Model VibeVoice

What is VibeVoice? VibeVoice is a research framework released by Microsoft Research for long-form, multi-speaker, conversational speech synthesis. Target scenarios include entire podcast episodes, audio dramas, and interviews: it can maintain speaker consistency within a single generation and handle natural turn-taking. The model family spans multiple scales (e.g., 1.5B and 7B) and is available on Hugging Face as microsoft/VibeVoice-1.5B, along with model cards, weights, installation guides, and responsible-use notes....

September 18, 2025 · 4 min

A Beginner’s Guide to Inference with the SGLang Framework

As large language models (LLMs) grow in popularity, the focus for enterprises and individuals alike has shifted from training to inference (in other words, from "reinventing the wheel" to practical usage). In the inference space, the two hottest frameworks are undoubtedly vLLM and SGLang, with SGLang the rising star. Today, we'll explore SGLang through a beginner-friendly tutorial to help more people understand both LLM inference and the SGLang framework....

July 10, 2025 · 5 min