Comprehensive Analysis of LLM Inference Parallelism Strategies: TP / DP / PP / EP Principles and vLLM Performance Verification

When the parameter count of Large Language Models (LLMs) crosses the multi-billion threshold, one fact becomes unavoidable: the single-GPU era is over. Whether for training or inference, the model's own parameter footprint, the growth of the KV Cache, and the memory and compute pressure from concurrent requests in real workloads all dictate that large models must run on multiple GPUs, or even multiple nodes. However, running a model across multiple cards is not simply a matter of stacking resources....

December 24, 2025 · 5 min
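
As a quick illustration of the strategies this post covers, the parallel degrees map directly onto options in vLLM's offline `LLM` API. A minimal sketch, assuming 4 GPUs; the model id and parallel sizes are illustrative, not taken from the post:

```python
# Minimal sketch of multi-GPU inference with vLLM's offline API.
# Model id and parallel sizes are illustrative; adjust to your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # any HF model id
    tensor_parallel_size=4,            # TP: shard every layer across 4 GPUs
    # pipeline_parallel_size=2,        # PP: split layers into stages (support is version-dependent)
)

params = SamplingParams(temperature=0.7, max_tokens=64)
out = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(out[0].outputs[0].text)
```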

Accelerating LLM Inference: Decoupling Prefill and Decode (PD Disaggregation)

As Large Language Models (LLMs) are widely deployed in dialogue systems, intelligent assistants, and Agent scenarios, the core challenge for inference systems has shifted from “can it run” to “how to run with lower latency, higher throughput, and greater stability”. Against this backdrop, PD Disaggregation (Prefill / Decode Decoupling) has gradually become a key architectural concept in large-scale online inference systems. This article systematically explains what PD Disaggregation is, why it is needed, and the core advantages it brings to LLM inference systems, from the perspective of the model's inference execution flow and without relying on any specific inference framework....

December 22, 2025 · 5 min
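
To make the prefill/decode split concrete before reading the post, here is a framework-agnostic toy sketch (the "model" computations are stand-ins, not a real transformer): prefill makes one compute-bound pass over the whole prompt to build the KV cache, decode then reuses that cache one token at a time, and PD disaggregation runs the two phases on separate workers with a cache handoff between them:

```python
# Toy illustration of the two inference phases behind PD disaggregation.
# The "KV cache" is just a list and the model math is a stand-in.
from typing import List, Tuple

def prefill(prompt_tokens: List[int]) -> Tuple[List[int], int]:
    """Compute-bound phase: one pass over ALL prompt tokens at once,
    producing the KV cache and the first generated token."""
    kv_cache = [hash(t) for t in prompt_tokens]  # stand-in per-token state
    first_token = sum(prompt_tokens) % 100       # stand-in "model output"
    return kv_cache, first_token

def decode_step(kv_cache: List[int], token: int) -> int:
    """Memory-bound phase: one token in, one token out, rereading the
    whole cache on every step."""
    kv_cache.append(hash(token))
    return (token + len(kv_cache)) % 100         # stand-in "model output"

# In a PD-disaggregated deployment, prefill() runs on one worker pool and
# the decode loop on another; the kv_cache is what gets transferred.
cache, tok = prefill([1, 2, 3, 4])
generated = [tok]
for _ in range(5):
    tok = decode_step(cache, tok)
    generated.append(tok)
print(generated)
```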

Fara-7B: Microsoft's Efficient Agentic Model for Computer Use

Microsoft Research recently released Fara-7B, an Agentic Small Language Model (SLM) designed specifically for Computer Use. Unlike traditional chat models, Fara-7B completes tasks by operating a mouse and keyboard just like a human. What is Fara-7B? Fara-7B is a 7-billion-parameter model built on Qwen2.5-VL-7B. Its key features include: Visual Perception: it operates by visually perceiving web pages (screenshots), without relying on Accessibility Trees or additional parsing models....

November 28, 2025 · 3 min

Optimizing Inference with Parameter/Data (P/D) Separation in vLLM Framework

Large language models often hit GPU memory bottlenecks during inference deployment: model parameters (P) can reach hundreds of GB and must remain resident in GPU memory, while input/output data (D) changes dynamically with each request yet is typically coupled with the parameters on the same device, leading to imbalanced memory usage and limited scalability. To address this, we can use the vLLM framework to implement Parameter/Data (P/D) Separation, improving the flexibility and throughput of the inference system....

September 29, 2025 · 5 min
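
The imbalance this post describes can be made concrete with back-of-the-envelope arithmetic using the standard per-token KV-cache size formula (2 × layers × kv_heads × head_dim × bytes). The figures below are illustrative, roughly Llama-2-70B-shaped assumptions, not numbers from the article:

```python
# Back-of-the-envelope memory budget: resident parameters (P) vs.
# per-request KV cache (D). All model figures are illustrative.
GB = 1024**3

num_params  = 70e9   # parameter count
param_bytes = 2      # fp16/bf16 weights
layers      = 80
kv_heads    = 8      # grouped-query attention
head_dim    = 128
kv_bytes    = 2      # fp16 cache

param_mem = num_params * param_bytes                        # resident "P"
kv_per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # dynamic "D"

total_mem = 8 * 80 * GB  # e.g. a node with 8 x 80 GB GPUs
free_for_kv = total_mem - param_mem
print(f"params:        {param_mem / GB:6.1f} GB")
print(f"KV per token:  {kv_per_token / 1024:6.1f} KB")
print(f"tokens cached: {free_for_kv / kv_per_token:,.0f}")
```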

A Beginner’s Guide to Inference with the SGLang Framework

As large language models (LLMs) grow in popularity, the focus for enterprises and individuals has shifted from training to inference (in other words, from “reinventing the wheel” to practical use). In the field of inference, the two hottest frameworks are undoubtedly vLLM and SGLang. Today, we’ll explore SGLang, the rising star of the pair, through a beginner-friendly tutorial to help more people understand both LLM inference and the SGLang framework....

July 10, 2025 · 5 min
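
For a taste of what the tutorial covers, a typical first step with SGLang is to launch its server and query the OpenAI-compatible endpoint. A minimal sketch; the model id and port below are illustrative:

```python
# Query a locally launched SGLang server via its OpenAI-compatible API.
# Assumes the server was started first, e.g.:
#   python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "What is SGLang in one sentence?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```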