Understanding KV-Cache - The Core Acceleration Technology for LLM Inference

As large language models (LLMs) continue to grow in scale, the cost of inference has skyrocketed. To let models respond to user requests faster and more economically, a variety of optimization techniques have emerged. Among them, KV-Cache (Key-Value Cache) stands out as one of the most critical and impactful inference acceleration mechanisms, adopted by virtually every major inference framework (e.g., vLLM, TensorRT-LLM, llama.cpp, llm-d, OpenAI Triton Transformer Engine). This article provides a comprehensive introduction to what KV-Cache is, how it works, why it significantly improves inference efficiency, its impact on the industry, and best practices for using it....

November 18, 2025 · 6 min

Easily Generate Videos with Sora 2 from Azure AI Foundry

With Azure AI Foundry now supporting Sora 2 (OpenAI’s generative video model), developers can access top-tier video generation capabilities in an enterprise-grade, compliant, and controllable environment. This tutorial takes you from zero to production, showing how to call Sora 2 via the Playground and the Python SDK to complete a “text-to-video” workflow. Prerequisites Before starting, you need an Azure subscription. If you’re unsure how to get one, refer to the subscription registration section in my earlier article....

November 10, 2025 · 5 min

Building Your Own ChatGPT on Azure Without Writing Any Code

Using ChatGPT to help solve problems in work and daily life has become a habit for many of us. However, heavy use of the official GPT-4o can run into temporary quota limits. Today, we will show you how to easily build your own personalized ChatGPT application using the Azure OpenAI service. Prerequisites Before we begin, make sure you have an Azure global subscription. If you don’t have one yet, you can easily start an Azure subscription through Pay-as-you-go:...

June 25, 2024 · 5 min