Accelerating LLM Inference: Decoupling Prefill and Decode (PD Disaggregation)
As Large Language Models (LLMs) are widely deployed in dialogue systems, intelligent assistants, and agent scenarios, the core challenge for inference systems has shifted from "can it run" to "how to run with lower latency, higher throughput, and greater stability". Against this backdrop, PD disaggregation (Prefill/Decode decoupling) has gradually become a key architectural concept in large-scale online inference systems. This article systematically explains what PD disaggregation is, why it is needed, and the core advantages it brings to LLM inference systems, from the perspective of the model inference execution flow and without relying on any specific inference framework.