Welcome!

Cloud Native Enterprise Solutions and AI Business Applications, enhanced by Open Source!

Clawdbot: Empower Your Own Powerful AI Assistant on Azure

With the rapid development of generative AI and intelligent agents, self-hosted intelligent systems are gaining increasing attention from developers and technical teams. Clawdbot is one such open-source, self-hostable personal AI assistant. It not only converses with users but also executes tasks, integrates with messaging platforms, provides automation capabilities, and can be deployed on cloud platforms like Azure. This article walks step by step through Clawdbot’s capabilities and architecture, and shows how to deploy and start using it on Microsoft Azure....

Microsoft TRELLIS: A Large Model for Production-Grade 3D Asset Generation and Guide to Deployment on Azure

At the end of 2025, Microsoft Research released an open-source large model project for 3D content creation called TRELLIS, accompanied by the academic paper “Structured 3D Latents for Scalable and Versatile 3D Generation”. This project significantly improves the quality and flexibility of text/image-to-3D asset generation through a unified structured latent space and advanced flow model technology. It also expands the multi-format output and editing capabilities of 3D models, making it one of the key technologies in the current 3D AI model ecosystem....

Comprehensive Analysis of LLM Inference Parallelism Strategies: TP / DP / PP / EP Principles and vLLM Performance Verification

When the parameter scale of Large Language Models (LLMs) crosses the multi-billion threshold, one fact becomes unavoidable: the single-GPU era is over. Whether for training or inference, the size of the model itself, the growth rate of the KV cache, and the memory and compute pressure from concurrent requests in real business scenarios all dictate that large models must run in multi-GPU or even multi-node environments. However, running a model across multiple GPUs is not simply a matter of stacking resources....

Accelerating LLM Inference: Decoupling Prefill and Decode (PD Disaggregation)

As Large Language Models (LLMs) are widely deployed in dialogue systems, intelligent assistants, and Agent scenarios, the core challenge for inference systems has shifted from “can it run” to “how to run with lower latency, higher throughput, and greater stability”. Against this backdrop, PD Disaggregation (Prefill / Decode Decoupling) has gradually become a key architectural concept in large-scale online inference systems. This article systematically explains what PD Disaggregation is, why it is needed, and the core advantages it brings to LLM inference systems, from the perspective of the model inference execution flow and without relying on any specific inference framework....

Image Generation Enters the Platform Era: GPT-Image-1.5 in Microsoft Foundry

In recent years, generative AI technology has evolved rapidly. Beyond natural language processing, image generation and editing capabilities have become a key frontier of AI innovation. As part of this trend, OpenAI launched the GPT Image series of models, which are also available within the Azure OpenAI Service. The newly released GPT-Image-1.5 can be seen as the new flagship in the field of image generation, offering significant improvements in performance, efficiency, and controllability....

Building a Smart Address Parsing Chrome / Edge Extension with Azure OpenAI

In the e-commerce and logistics sectors, there is an overlooked but extremely time-consuming pain point: entering complex address formats correctly into fixed fields in logistics systems. Addresses sent by customers come in all sorts of strange formats: sometimes comma-separated, sometimes all on one line, with fields in arbitrary order, and often missing key information like province or state. In this article, I want to share how I built Auto Address, a Chrome and Edge browser extension that leverages the power of Azure OpenAI to solve this problem....

A Beginner's Guide to LLM Architectures

Over the past five years, the development of Large Language Models (LLMs) has almost completely reshaped the technological landscape of artificial intelligence. From GPT to LLaMA, from Transformer to Mixture-of-Experts (MoE), and from monolithic models to large-scale distributed parameter server systems, architectural evolution has directly driven leaps in capability. This article will systematically review the mainstream technical paths of LLMs from an architectural perspective, and analyze their pros, cons, and suitable scenarios from an application perspective, providing a reference for technical selection for R&D and business teams....

Kubernetes Ingress NGINX Retirement: Comprehensive Migration Plan and Practice Guide to Gateway API

On November 11, 2025, the official Kubernetes blog formally announced that the Ingress NGINX project has entered the Retirement phase and will cease maintenance entirely in March 2026. This move marks the official entry of Kubernetes cluster ingress and traffic management into the Gateway API era. For teams currently using Ingress NGINX, this is not just a technical upgrade, but a risk management task that needs to be planned as soon as possible....

Fara-7B: Microsoft's Efficient Agentic Model for Computer Use

Microsoft Research recently released Fara-7B, an Agentic Small Language Model (SLM) designed specifically for Computer Use. Unlike traditional chat models, Fara-7B is designed to complete tasks by using a mouse and keyboard just like a human. What is Fara-7B? Fara-7B is a 7-billion-parameter model built on Qwen2.5-VL-7B. Its key features include: Visual Perception: It operates by visually perceiving web pages (screenshots), without relying on Accessibility Trees or additional parsing models....

Quickly Configure the Latest Gemini 3 Pro Model for GitHub Copilot to Accelerate Development Experience

As AI coding assistants continue to evolve, GitHub Copilot keeps introducing more powerful models for developers to choose from. Recently, GitHub Copilot announced support for Google’s latest Gemini 3 Pro model (Preview). As a developer who uses Copilot daily, I tried it out immediately and found it surprisingly good at logical reasoning and long context understanding. In this post, I’ll guide you through switching to Gemini 3 Pro in VS Code and share some of my insights....

Understanding KV-Cache - The Core Acceleration Technology for LLM Inference

As large language models (LLMs) continue to grow in scale, the cost of inference has skyrocketed. To enable models to respond to user requests faster and more economically, various optimization techniques have emerged. Among them, KV-Cache (Key-Value Cache) stands out as one of the most critical and impactful inference acceleration mechanisms, widely adopted by all major inference frameworks (e.g., vLLM, TensorRT-LLM, llama.cpp, llm-d, OpenAI Triton Transformer Engine). This article provides a comprehensive introduction to what KV-Cache is, how it works, why it significantly improves inference efficiency, its impact on the industry, and best practices for its use....

Easily Generate Videos with Sora 2 from Azure AI Foundry

With Azure AI Foundry opening support for Sora 2 (OpenAI’s generative video model), developers can now access top-tier video generation capabilities in an enterprise-grade, compliant, and controllable environment. This tutorial will take you from zero to production, showing how to call Sora 2 via the Playground and the Python SDK to complete a “text-to-video” workflow. Prerequisites: before starting, you need an Azure subscription. If you’re unsure how to get one, refer to the subscription registration section in my earlier article....

A Comprehensive Guide to LLM Fine-Tuning: Methods, Comparisons, and Best-Fit Scenarios

As enterprises build AI-powered applications, fine-tuning large language models (LLMs) has become essential for delivering customized capabilities. Over the years, fine-tuning techniques have evolved from traditional full-parameter training to efficient, low-cost approaches such as LoRA, QLoRA, Adapters, supervised fine-tuning (SFT), reward modeling (RM), and RLHF. This article provides a systematic overview of major fine-tuning methods, compares their strengths and weaknesses, and offers guidance on when each method is most suitable....

Optimizing Inference with Parameter/Data (P/D) Separation in vLLM Framework

Large language models often encounter GPU memory bottlenecks during inference deployment: Model parameters (P) can reach hundreds of GB and must remain resident in GPU memory. Input/output data (D) changes dynamically with each request but is often coupled with parameters on the same device, leading to imbalanced memory usage and limited scalability. To solve this problem, we can leverage the vLLM framework to implement Parameter/Data (P/D) Separation, improving the flexibility and throughput of inference systems....

Getting Started with Microsoft’s Latest Open-Source Long-Form Speech Model VibeVoice

What is VibeVoice? VibeVoice is a research framework released by Microsoft Research for long-form, multi-speaker, conversational speech synthesis. Target scenarios include entire podcast episodes, audio dramas, or interviews: it can maintain speaker consistency within a single generation and handle natural turn-taking. The model family includes multiple scales (e.g., 1.5B and 7B) and is available on Hugging Face as microsoft/VibeVoice-1.5B, along with model cards, weights, installation guides, and responsible use notes....