In an AI architecture where multiple models coexist, such as GPT-4, GPT-4o, lightweight models, and vertical-domain models, one core question is:

How can the system automatically select the most suitable model without explicitly specifying a model ID?

This article introduces an engineering-friendly approach:

Use an embedding model to calculate user intent, perform semantic matching at the gateway layer, and dynamically route the request to the most suitable upstream model service.

We will use:

  • Embedding model: Azure OpenAI text-embedding-3-small
  • Gateway: Envoy
  • Core mechanism: vector similarity scoring + policy-based routing

Background and Motivation

In real production environments, large model calls usually face several challenges.

1. Cost and Performance Trade-Offs

Model type                        | Strengths          | Weaknesses
----------------------------------+--------------------+-------------------------
Large models, such as GPT-4       | Strong capability  | High cost / high latency
Small models, such as GPT-4o-mini | Fast / inexpensive | Limited capability
Specialized models                | Accurate           | Weak generalization

Different requests should go to different models instead of using a one-size-fits-all approach.

2. Limitations of Traditional Approaches

Common strategies include:

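  • Manually specifying model_id: the caller has to know which model fits each request
  • Rule-based routing with if/else: the rule set grows with every new intent and becomes hard to maintain
  • Keyword matching: breaks as soon as the same intent is phrased with different words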

3. Ideal Goal

What we want is:

The user only enters a prompt, and the system automatically understands the intent and selects the optimal model.

This is exactly where embeddings can help.

Core Principle: Embedding-Driven Semantic Routing

1. What Embeddings Are

An embedding model, such as text-embedding-3-small, maps text into a high-dimensional vector:

"Write a Python sorting function"
-> [0.012, -0.98, ..., 0.33]  (1536-dimensional vector)

Texts with similar meanings will have vectors that are closer to each other.

2. Routing Idea

The core logic can be summarized as:

User request -> Embedding -> Compare with intent prototype library -> Select best match -> Route to model

3. Intent Prototypes

We define a group of intent vectors in advance:

Intent type       | Example texts                     | Target model
------------------+-----------------------------------+------------------
Code generation   | “write code”, “generate function” | GPT-4
Simple Q&A        | “what is”, “define”               | GPT-4o-mini
Translation       | “translate this”                  | Lightweight model
Long-form writing | “write blog”                      | GPT-4

These texts are embedded and cached ahead of time.
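
One way to hold the prototype library in memory is sketched below. Averaging the example embeddings into a single centroid per intent is our assumption (the table only says the texts are embedded and cached), and `IntentPrototype`, `build_prototypes`, and `fake_embed` are hypothetical names. `fake_embed` is a stand-in with no semantic meaning; a real deployment would call the embedding model there.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Vector = List[float]


@dataclass
class IntentPrototype:
    name: str
    model: str                      # target model for this intent
    examples: List[str]             # example texts from the table above
    embedding: Vector = field(default_factory=list)  # cached centroid


def centroid(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_prototypes(intents: List[IntentPrototype],
                     embed: Callable[[str], Vector]) -> List[IntentPrototype]:
    """Embed every example once and cache the centroid on the prototype."""
    for intent in intents:
        intent.embedding = centroid([embed(text) for text in intent.examples])
    return intents


def fake_embed(text: str) -> Vector:
    """Stand-in for the real embedding call: two toy features, not semantic."""
    return [float(len(text)), float(text.count(" "))]


intents = build_prototypes(
    [
        IntentPrototype("code_generation", "gpt-4",
                        ["write code", "generate function"]),
        IntentPrototype("simple_qa", "gpt-4o-mini", ["what is", "define"]),
    ],
    fake_embed,
)
```

Because the prototypes are computed once at startup, each user request costs only one embedding call.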

4. Similarity Calculation

A commonly used method is:

  • Cosine similarity (recommended)

similarity = cosine(query_embedding, intent_embedding)

Then choose the highest score:

argmax(similarity_i)
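
Written out, with q for the query embedding and p_i for the i-th intent prototype (our notation, not part of any API), the score and the selection rule are as follows. The worked example uses hypothetical 3-dimensional vectors purely to keep the arithmetic readable.

```latex
\[
\mathrm{similarity}(q, p_i)
  = \frac{q \cdot p_i}{\lVert q \rVert \, \lVert p_i \rVert},
\qquad
i^{*} = \arg\max_i \, \mathrm{similarity}(q, p_i)
\]

% Worked example with hypothetical 3-dimensional vectors
\[
q = (0.8,\ 0.2,\ 0.1), \qquad p_1 = (0.9,\ 0.1,\ 0.0)
\]
\[
\mathrm{similarity}(q, p_1)
  = \frac{0.72 + 0.02 + 0.00}{\sqrt{0.69}\,\sqrt{0.82}}
  \approx \frac{0.74}{0.752}
  \approx 0.984
\]
```

Cosine similarity depends only on the angle between the vectors, so it is unaffected by their lengths.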

System Architecture

The overall architecture looks like this:

                +--------------+
                |   Client     |
                +------+-------+
                       |
               +-------v--------+
               |  Envoy Gateway |
               | Smart Routing  |
               +-------+--------+
                       |
        +--------------+--------------+
        |              |              |
 +------v-----+ +------v-----+ +------v-----+
 | GPT-4      | | GPT-4o-mini| | Specialized|
 |            | |            | | Model      |
 +------------+ +------------+ +------------+

Implementation Breakdown

1. Request Flow

Step 1: Get the User Request

{
  "input": "Help me write an example Go HTTP server"
}

Step 2: Call the Embedding API

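Using Azure OpenAI, where the path segment after /deployments/ is the name of your embedding deployment (assumed here to match the model name) and the request is authenticated with an api-key header: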

POST https://{endpoint}/openai/deployments/text-embedding-3-small/embeddings?api-version=2024-02-15-preview

Request body:

{
  "input": "Help me write an example Go HTTP server"
}
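
The call above could be issued from Python roughly as follows, using only the standard library. This is a sketch: the function names are hypothetical, the endpoint, deployment name, and key are supplied by the caller, and error handling and retries are left out. The response shape assumed in `parse_embedding` is a `data` array whose first item carries the `embedding` list.

```python
import json
import urllib.request
from typing import List

API_VERSION = "2024-02-15-preview"  # copied from the URL above


def build_embedding_request(endpoint: str, deployment: str, api_key: str,
                            text: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    url = (f"{endpoint.rstrip('/')}/openai/deployments/{deployment}"
           f"/embeddings?api-version={API_VERSION}")
    body = json.dumps({"input": text}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers,
                                  method="POST")


def parse_embedding(response_body: bytes) -> List[float]:
    """Extract the first embedding vector from the JSON response."""
    payload = json.loads(response_body)
    return payload["data"][0]["embedding"]


def embed(endpoint: str, deployment: str, api_key: str, text: str,
          timeout: float = 10.0) -> List[float]:
    """Send the request and return the embedding (performs network I/O)."""
    request = build_embedding_request(endpoint, deployment, api_key, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embedding(response.read())
```

Keeping request construction and response parsing separate from the network call makes both easy to unit-test without credentials.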

Step 3: Calculate Similarity

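A minimal Python version:

import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(query_embedding, intents):
    """Return the target model of the intent prototype closest to the query."""
    best_score = float("-inf")
    best_model = None

    for intent in intents:
        score = cosine_similarity(query_embedding, intent.embedding)
        if score > best_score:
            best_score = score
            best_model = intent.model

    return best_model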

Step 4: Dynamic Routing in Envoy

Write the result into a header:

x-llm-route: gpt-4

Envoy routes based on the header:

routes:
- match:
    prefix: "/v1/chat"
    headers:
      - name: x-llm-route
        string_match:
          exact: gpt-4
  route:
    cluster: gpt4_service

Envoy Gateway Implementation Options

Envoy does not directly support vector computation, so it needs to be extended. There are three options.

Option 1: External Processing (ext_proc)

Envoy calls an external service through the ext_proc filter:

Envoy -> ext_proc service -> Return routing decision -> Envoy forwards

The ext_proc service is responsible for:

  • Calling the embedding model
  • Calculating similarity
  • Returning the header
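
The decision logic inside such a service can be sketched independently of the transport. The function below covers only the three responsibilities listed above. The gRPC wiring against Envoy's ext_proc protocol is deliberately omitted, and `decide_route_headers`, `stub_embed`, and `stub_select` are hypothetical names. In a real deployment the service would likely also need to receive the request body (the prompt lives there) and ask Envoy to re-evaluate the route after the header is set.

```python
from typing import Callable, Dict, List

ROUTE_HEADER = "x-llm-route"  # header name used in the routing example


def decide_route_headers(
    prompt: str,
    embed: Callable[[str], List[float]],
    select_model: Callable[[List[float]], str],
) -> Dict[str, str]:
    """Return the header mutation the service would send back to Envoy."""
    query_embedding = embed(prompt)        # 1. call the embedding model
    model = select_model(query_embedding)  # 2. score against the prototypes
    return {ROUTE_HEADER: model}           # 3. return the header


# Stubs standing in for the real embedding call and similarity search.
def stub_embed(prompt: str) -> List[float]:
    return [1.0, 0.0] if "write" in prompt.lower() else [0.0, 1.0]


def stub_select(embedding: List[float]) -> str:
    return "gpt-4" if embedding[0] > embedding[1] else "gpt-4o-mini"


headers = decide_route_headers(
    "Help me write an example Go HTTP server", stub_embed, stub_select
)
```

Injecting `embed` and `select_model` as callables keeps the routing decision testable without Envoy or network access.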

Option 2: WASM Plugin

Embed the logic inside Envoy:

  • Advantage: low latency
  • Disadvantage: more complex development, usually Rust or C++

Option 3: Sidecar Router (Simple)

Client -> Router Service -> Envoy -> Model

The router service handles all smart routing logic.

Optimization Strategies

1. Vector Cache

Avoid repeated embedding calls:

cache[input_hash] = embedding
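
A small in-memory version of this cache might look like the sketch below. `EmbeddingCache` is a hypothetical name, the whitespace normalization before hashing is our assumption, and the cache is unbounded, so a production version would probably add a size limit or TTL.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoize embeddings by a hash of the (whitespace-normalized) input."""

    def __init__(self, embed: Callable[[str], List[float]]) -> None:
        self._embed = embed
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of real embedding calls made

    @staticmethod
    def _key(text: str) -> str:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]


# Toy embedding function so the example runs without network access.
cache = EmbeddingCache(lambda text: [float(len(text))])
first = cache.get("what is Envoy")
second = cache.get("  what is   Envoy ")  # same key after normalization
```

Only exact (normalized) repeats hit the cache, which tends to help most with templated or retried requests.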

2. Hierarchical Intents

Avoid overly coarse single-layer classification:

Level 1: code / Q&A / writing
Level 2: language / length / difficulty
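
A two-level lookup could be sketched as follows: first pick the coarse category, then compare only against the prototypes inside that category. The category names, sub-intent names, and 3-dimensional vectors are hypothetical placeholders for real cached embeddings.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest(query: Vector, candidates: Dict[str, Vector]) -> str:
    """Name of the candidate with the highest cosine similarity."""
    return max(candidates,
               key=lambda name: cosine_similarity(query, candidates[name]))


# Level 1: coarse categories. Level 2: finer intents within each category.
LEVEL_1: Dict[str, Vector] = {
    "code": [1.0, 0.0, 0.0],
    "writing": [0.0, 1.0, 0.0],
}
LEVEL_2: Dict[str, Dict[str, Vector]] = {
    "code": {"code/simple": [0.9, 0.0, 0.1], "code/complex": [0.7, 0.0, 0.7]},
    "writing": {"writing/short": [0.0, 0.9, 0.1],
                "writing/long": [0.0, 0.7, 0.7]},
}


def route_hierarchical(query: Vector) -> str:
    category = closest(query, LEVEL_1)          # coarse decision first
    return closest(query, LEVEL_2[category])    # then refine within it


intent = route_hierarchical([0.6, 0.1, 0.6])
```

Each level-2 comparison runs against a small candidate set, so adding sub-intents to one category does not disturb the others.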

3. Threshold Control

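if best_score < threshold:
    best_model = "gpt-4"  # fall back to the most capable model

When no prototype is a confident match, falling back to a capable default model limits the damage of an incorrect routing decision.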
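
As a self-contained function, the threshold check might look like this. The default of 0.75 is a hypothetical value, not a recommendation. A workable threshold depends on the embedding model and the prototypes, and is best tuned on logged traffic.

```python
from typing import Dict


def select_with_threshold(scores: Dict[str, float],
                          threshold: float = 0.75,
                          fallback_model: str = "gpt-4") -> str:
    """Pick the best-scoring model, or the fallback if nothing is confident.

    `scores` maps a target model to its best cosine similarity.
    """
    if not scores:
        return fallback_model
    best_model = max(scores, key=lambda model: scores[model])
    if scores[best_model] < threshold:
        return fallback_model
    return best_model


confident = select_with_threshold({"gpt-4o-mini": 0.91, "gpt-4": 0.40})
uncertain = select_with_threshold({"gpt-4o-mini": 0.52, "lightweight": 0.49})
```

Here `confident` resolves to the small model, while `uncertain` falls back to the default because no score reaches the threshold.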

4. Multi-Model Fusion, or Ensemble Routing

This can be extended into:

  • Top-K model candidates
  • A/B testing
  • Weighted routing
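
Top-K selection and weighted routing can be combined in a few lines, as sketched below. Using the similarity scores directly as sampling weights is our assumption, and the model names and scores are illustrative. A seeded `random.Random` keeps the choice reproducible in tests.

```python
import random
from typing import Dict, List, Tuple


def top_k(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """The k highest-scoring (model, score) pairs, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def weighted_pick(candidates: List[Tuple[str, float]],
                  rng: random.Random) -> str:
    """Sample one model with probability proportional to its score."""
    models = [model for model, _ in candidates]
    weights = [max(score, 0.0) for _, score in candidates]  # no negatives
    if sum(weights) == 0.0:
        return models[0]  # nothing to weight by: keep the top candidate
    return rng.choices(models, weights=weights, k=1)[0]


scores = {"gpt-4": 0.82, "gpt-4o-mini": 0.78, "lightweight": 0.31}
candidates = top_k(scores, k=2)
chosen = weighted_pick(candidates, random.Random(42))
```

The same structure supports A/B testing: replace the score-based weights with fixed traffic shares per model.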

5. Cost-Aware Routing

Combine routing with:

  • Token budget
  • User tier
  • SLA

Then enhance the routing policy:

if user == "free":
    prefer_small_model()
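
A slightly fuller version of this policy is sketched below. It runs after semantic routing and may override the result. `RequestContext`, its fields, and the choice of GPT-4o-mini as the small model are assumptions for illustration. SLA handling is left out.

```python
from dataclasses import dataclass

SMALL_MODEL = "gpt-4o-mini"  # assumed low-cost default


@dataclass
class RequestContext:
    user_tier: str               # e.g. "free" or "paid"
    remaining_token_budget: int  # tokens the user may still spend
    estimated_tokens: int        # rough size of this request


def apply_cost_policy(semantic_choice: str, ctx: RequestContext) -> str:
    """Override the semantically chosen model when cost rules require it."""
    if ctx.user_tier == "free":
        return SMALL_MODEL
    if ctx.estimated_tokens > ctx.remaining_token_budget:
        return SMALL_MODEL  # over budget: downgrade instead of rejecting
    return semantic_choice


free_user = apply_cost_policy("gpt-4", RequestContext("free", 10_000, 500))
paid_user = apply_cost_policy("gpt-4", RequestContext("paid", 10_000, 500))
```

Keeping the policy as a separate step means the intent prototypes do not have to change when pricing or tiers change.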

Advantages

Compared with Rule-Based Routing

Dimension              | Embedding routing
-----------------------+------------------
Semantic understanding | Strong
Scalability            | High
Maintenance cost       | Low

Compared with Manually Specifying Models

  • Better user experience, because users do not need to care about models
  • Automatically optimized cost
  • Support for dynamic policies

Practical Recommendations

1. Initial Stage

  • Start with 5 to 10 intent prototypes
  • Use manual labeling and tuning

2. Data-Driven Optimization

Record:

  • User requests
  • Routing results
  • User feedback

Continuously optimize the intent embeddings.
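
One simple way to fold logged feedback back into a prototype is to nudge it toward the mean embedding of requests that users confirmed were routed correctly. This moving-average update is our illustrative choice, not the only option, and `update_prototype` and `learning_rate` are hypothetical names.

```python
from typing import List

Vector = List[float]


def update_prototype(prototype: Vector,
                     confirmed_embeddings: List[Vector],
                     learning_rate: float = 0.1) -> Vector:
    """Move the prototype toward the mean of positively rated requests."""
    if not confirmed_embeddings:
        return prototype  # no feedback yet: leave the prototype unchanged
    count = len(confirmed_embeddings)
    mean = [sum(column) / count for column in zip(*confirmed_embeddings)]
    return [
        (1.0 - learning_rate) * p + learning_rate * m
        for p, m in zip(prototype, mean)
    ]


updated = update_prototype(
    [1.0, 0.0],
    [[0.0, 1.0], [0.0, 1.0]],
    learning_rate=0.5,
)
```

A small learning rate keeps a single noisy batch of feedback from dragging a prototype far from its hand-labeled starting point.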

3. Combine with an LLM Router

In the future, this can evolve into:

Hybrid decision-making with a small model plus embeddings, also known as a Hybrid Router.

Summary

The essence of this approach is:

Replace rules with embeddings, and use vector similarity to drive model selection.

Key benefits:

  • Lower cost, by intelligently choosing lightweight models
  • Better quality, by sending complex tasks to stronger models
  • Architectural decoupling, by separating routing logic from models

In the multi-model era, this kind of semantics-driven gateway is likely to become a common architecture component.