Let Models Choose Models: Embedding-Driven Smart Routing for LLMs

In an AI architecture where multiple models coexist, such as GPT-4, GPT-4o, lightweight models, and vertical-domain models, one core question is:

How can the system automatically select the most suitable model without explicitly specifying a model ID?

This article introduces an engineering-friendly approach:

Use an embedding model to calculate user intent, perform semantic matching at the gateway layer, and dynamically route the request to the most suitable upstream model service.

We will use:

Embedding model: Azure OpenAI text-embedding-3-small
Gateway: Envoy
Core mechanism: vector similarity scoring + policy-based routing

Background and Motivation

In production, calls to large language models usually face several challenges.

1. Cost and Performance Trade-Offs

Model type	Strengths	Weaknesses
Large models, such as GPT-4	Strong capability	High cost / high latency
Small models, such as GPT-4o-mini	Fast / inexpensive	Limited capability
Specialized models	Accurate	Weak generalization

Different requests should go to different models instead of using a one-size-fits-all approach.

2. Limitations of Traditional Approaches

Common strategies include:

Manually specifying model_id: the caller has to know which model fits each request
Rule-based routing with if/else: hard to scale as the number of intents grows
Keyword matching: not semantically robust, because paraphrases slip through

3. Ideal Goal

What we want is:

The user only enters a prompt, and the system automatically understands the intent and selects the optimal model.

This is exactly where embeddings can help.

Core Principle: Embedding-Driven Semantic Routing

1. What Embeddings Are

An embedding model, such as text-embedding-3-small, maps text into a high-dimensional vector:

"Write a Python sorting function"
-> [0.012, -0.98, ..., 0.33]  (1536-dimensional vector)

Texts with similar meanings will have vectors that are closer to each other.

2. Routing Idea

The core logic can be summarized as:

User request -> Embedding -> Compare with intent prototype library -> Select best match -> Route to model

3. Intent Prototypes

We define a group of intent vectors in advance:

Intent type	Example texts	Target model
Code generation	“write code”, “generate function”	GPT-4
Simple Q&A	“what is”, “define”	GPT-4o-mini
Translation	“translate this”	Lightweight model
Long-form writing	“write blog”	GPT-4

These texts are embedded and cached ahead of time.
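One possible in-memory shape for this prototype library is sketched below. The names IntentPrototype, INTENT_TABLE, build_prototypes, and the lowercase model identifiers are illustrative assumptions, not part of any library. The embed argument stands in for a call to the embedding model; a toy stand-in is used here so the snippet runs offline.

```python
from dataclasses import dataclass
from typing import Callable

Vector = list[float]


@dataclass(frozen=True)
class IntentPrototype:
    intent: str        # e.g. "code_generation"
    example: str       # the example text that was embedded
    model: str         # target model for this intent
    embedding: Vector  # cached embedding of `example`


# Intent table from the article: (intent, example texts, target model).
INTENT_TABLE = [
    ("code_generation", ["write code", "generate function"], "gpt-4"),
    ("simple_qa", ["what is", "define"], "gpt-4o-mini"),
    ("translation", ["translate this"], "lightweight-model"),
    ("long_form_writing", ["write blog"], "gpt-4"),
]


def build_prototypes(embed: Callable[[str], Vector]) -> list[IntentPrototype]:
    """Embed every example text once, at startup, and keep the results."""
    return [
        IntentPrototype(intent, text, model, embed(text))
        for intent, texts, model in INTENT_TABLE
        for text in texts
    ]


def toy_embed(text: str) -> Vector:
    """Offline stand-in for the embedding API (NOT semantically meaningful)."""
    return [float(len(text)), float(text.count(" "))]


prototypes = build_prototypes(toy_embed)
```

Keeping one entry per example text, rather than one per intent, lets an intent be matched by any of its phrasings.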

4. Similarity Calculation

Cosine similarity is a commonly used metric, and the one recommended here:

similarity = cosine(query_embedding, intent_embedding)

Then choose the highest score:

argmax(similarity_i)
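The two formulas above can be sketched in plain Python as follows. The helper names cosine_similarity and best_intent are illustrative, and the 3-dimensional vectors are toy values chosen only to keep the example readable.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_intent(query: list[float],
                prototypes: dict[str, list[float]]) -> tuple[str, float]:
    """Return (intent name, score) for the highest-scoring prototype."""
    scores = {name: cosine_similarity(query, vec) for name, vec in prototypes.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Toy 3-dimensional vectors; real embeddings here have 1536 dimensions.
prototypes = {
    "code_generation": [1.0, 0.0, 0.0],
    "translation": [0.0, 1.0, 0.0],
}
intent, score = best_intent([0.9, 0.1, 0.0], prototypes)
```

Because cosine similarity depends only on the angle between vectors, it is insensitive to vector length, which is one reason it is a common default for comparing embeddings.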

System Architecture

The overall architecture looks like this:

                +--------------+
                |   Client     |
                +------+-------+
                       |
               +-------v--------+
               |  Envoy Gateway |
               | Smart Routing  |
               +-------+--------+
                       |
        +--------------+--------------+
        |              |              |
 +------v-----+ +------v-----+ +------v-----+
 | GPT-4      | | GPT-4o-mini| | Specialized|
 |            | |            | | Model      |
 +------------+ +------------+ +------------+

Implementation Breakdown

1. Request Flow

Step 1: Get the User Request

{
  "input": "Help me write an example Go HTTP server"
}

Step 2: Call the Embedding API

Using Azure OpenAI:

POST https://{endpoint}/openai/deployments/text-embedding-3-small/embeddings?api-version=2024-02-15-preview

Request body:

{
  "input": "Help me write an example Go HTTP server"
}
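A minimal sketch of this call using only the Python standard library is shown below. It assumes key-based authentication through the api-key header and a response of the form {"data": [{"embedding": [...]}]}. The function names, the example-resource endpoint, and the placeholder key are hypothetical, and the deployment name is whatever the embedding deployment was called when it was created.

```python
import json
import urllib.request


def build_embedding_request(endpoint: str, deployment: str, api_key: str,
                            text: str) -> urllib.request.Request:
    """Build the POST request shown above; nothing is sent yet."""
    url = (
        f"{endpoint}/openai/deployments/{deployment}/embeddings"
        "?api-version=2024-02-15-preview"
    )
    body = json.dumps({"input": text}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_embedding(response_body: bytes) -> list[float]:
    """Extract the vector from {"data": [{"embedding": [...]}]}."""
    return json.loads(response_body)["data"][0]["embedding"]


def fetch_embedding(request: urllib.request.Request,
                    timeout: float = 5.0) -> list[float]:
    """Send the request and return the embedding (requires network access)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embedding(response.read())


req = build_embedding_request(
    "https://example-resource.openai.azure.com",  # hypothetical endpoint
    "text-embedding-3-small",                     # deployment name (assumed)
    "YOUR_API_KEY",                               # placeholder, not a real key
    "Help me write an example Go HTTP server",
)
```

Splitting request construction from sending keeps the network call in one place, which makes it easier to add timeouts, retries, or a cache around it.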

Step 3: Calculate Similarity

Pseudocode:

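def route(query_embedding, intents):
    """Return the target model of the intent closest to the query."""
    # Cosine similarity can be as low as -1, so start below that.
    best_score = float("-inf")
    best_model = None

    for intent in intents:
        # intent.embedding is the cached prototype vector for this intent
        score = cosine_similarity(query_embedding, intent.embedding)
        if score > best_score:
            best_score = score
            best_model = intent.model

    # None is returned only when the intent list is empty
    return best_model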

Step 4: Dynamic Routing in Envoy

Write the result into a header:

x-llm-route: gpt-4

Envoy routes based on the header:

routes:
- match:
    prefix: "/v1/chat"
    headers:
    - name: x-llm-route
      string_match:
        exact: gpt-4
  route:
    cluster: gpt4_service

Envoy Gateway Implementation Options

Envoy does not directly support vector computation, so it needs to be extended.

Option 1: External Processing (Recommended)

Envoy calls an external service through ext_proc:

Envoy -> ext_proc service -> Return routing decision -> Envoy forwards

The ext_proc service is responsible for:

Calling the embedding model
Calculating similarity
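Returning the x-llm-route header to Envoy as a header mutation

Because Envoy normally selects the route before the filter responds, the ext_proc response should also ask Envoy to clear its route cache (the clear_route_cache flag) so that the new header takes effect.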

Option 2: WASM Plugin

Embed the logic inside Envoy:

Advantage: low latency
Disadvantage: more complex development, usually Rust or C++

Option 3: Sidecar Router (Simple)

Client -> Router Service -> Envoy -> Model

The router service handles all smart routing logic.
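As a rough sketch of what such a router service does for each request, the snippet below attaches the routing decision as the x-llm-route header and targets Envoy. The Envoy address, the build_routed_request helper, and the choose_model callable are assumptions made for illustration; choose_model stands in for the embedding-based decision.

```python
import json
import urllib.request
from typing import Callable

ENVOY_URL = "http://localhost:10000/v1/chat"  # assumed Envoy listener address


def build_routed_request(payload: dict, choose_model: Callable[[str], str],
                         envoy_url: str = ENVOY_URL) -> urllib.request.Request:
    """Attach the routing decision as x-llm-route and target Envoy."""
    route = choose_model(payload["input"])
    return urllib.request.Request(
        envoy_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-llm-route": route},
        method="POST",
    )


# The lambda is a stub for the real embedding-based decision.
req = build_routed_request(
    {"input": "Help me write an example Go HTTP server"},
    choose_model=lambda text: "gpt-4",
)
```

With this layout Envoy needs no extension at all: it only matches on the header, at the price of one extra network hop.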

Optimization Strategies

1. Vector Cache

Avoid repeated embedding calls:

cache[input_hash] = embedding
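One way to flesh this out is a small LRU cache keyed by a hash of the input, sketched below. The EmbeddingCache class, its size limit, and the whitespace and case normalization are assumptions; normalization in particular trades a little precision for a higher hit rate and may not suit every workload.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class EmbeddingCache:
    """Small LRU cache keyed by a hash of the normalized input text."""

    def __init__(self, embed: Callable[[str], list[float]],
                 max_entries: int = 10_000):
        self._embed = embed
        self._max_entries = max_entries
        self._store = OrderedDict()

    @staticmethod
    def _key(text: str) -> str:
        # Collapse whitespace and lowercase before hashing (an assumption).
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self._store:
            self._store.move_to_end(key)    # mark as most recently used
            return self._store[key]
        embedding = self._embed(text)       # cache miss: call the model
        self._store[key] = embedding
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return embedding


calls = []


def fake_embed(text: str) -> list[float]:
    """Stand-in for the embedding API that records each call."""
    calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache(fake_embed, max_entries=2)
first = cache.get("What is Envoy?")
second = cache.get("what is  envoy?")  # same key after normalization
```

In this run the second lookup is served from the cache, so the embedding function is called only once.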

2. Hierarchical Intents

Avoid overly coarse single-layer classification:

Level 1: code / Q&A / writing
Level 2: language / length / difficulty
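A two-level lookup could be sketched as follows. The categories follow the list above, while the sub-intents, prototype texts, and model assignments are illustrative assumptions. The token_overlap scorer is a toy stand-in so the sketch runs offline; in practice it would be replaced by cosine similarity over embeddings.

```python
from typing import Callable

# Level 1 category -> level 2 sub-intent -> (prototype text, target model).
HIERARCHY = {
    "code": {
        "short_snippet": ("write a short function", "gpt-4o-mini"),
        "full_program": ("design and implement a complete service", "gpt-4"),
    },
    "writing": {
        "short_text": ("write a short paragraph", "gpt-4o-mini"),
        "long_form": ("write a long blog article", "gpt-4"),
    },
}
LEVEL1_PROTOTYPES = {
    "code": "write code function program",
    "writing": "write blog article text",
}

Scorer = Callable[[str, str], float]


def route_hierarchical(query: str, score: Scorer) -> tuple[str, str, str]:
    """Pick the best category first, then the best sub-intent inside it."""
    category = max(LEVEL1_PROTOTYPES,
                   key=lambda c: score(query, LEVEL1_PROTOTYPES[c]))
    children = HIERARCHY[category]
    sub_intent = max(children, key=lambda s: score(query, children[s][0]))
    return category, sub_intent, children[sub_intent][1]


def token_overlap(a: str, b: str) -> float:
    """Toy Jaccard scorer; swap in embedding cosine similarity in practice."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


result = route_hierarchical("implement a complete program in code", token_overlap)
```

Scoring only the children of the winning category keeps the second comparison small and lets each category define its own sub-intents.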

3. Threshold Control

if best_score < threshold:
    best_model = "gpt-4"  # no intent matched well enough: fall back to the strongest model

This helps prevent incorrect routing decisions.

4. Multi-Model Fusion, or Ensemble Routing

This can be extended into:

Top-K model candidates
A/B testing
Weighted routing
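Top-K selection and weighted routing might be combined along these lines. The softmax-style weighting, the temperature parameter, and the example scores are assumptions for illustration, not a prescribed policy.

```python
import math
import random


def top_k(scores: dict[str, float], k: int) -> list[tuple[str, float]]:
    """Return the k highest-scoring (model, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


def weighted_choice(candidates: list[tuple[str, float]], temperature: float,
                    rng: random.Random) -> str:
    """Sample one model with probability proportional to exp(score / temperature)."""
    weights = [math.exp(score / temperature) for _, score in candidates]
    models = [model for model, _ in candidates]
    return rng.choices(models, weights=weights, k=1)[0]


# Hypothetical best similarity score per target model.
scores = {"gpt-4": 0.82, "gpt-4o-mini": 0.79, "lightweight-model": 0.31}
candidates = top_k(scores, k=2)
chosen = weighted_choice(candidates, temperature=0.05, rng=random.Random(0))
```

A lower temperature concentrates traffic on the top candidate, while a higher one spreads it more evenly, which can be useful for A/B testing.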

5. Cost-Aware Routing

Combine routing with:

Token budget
User tier
SLA

Then enhance the routing policy:

if user.tier == "free":
    prefer_small_model()
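A slightly fuller policy that combines user tier and token budget with the semantic decision is sketched below. The RequestContext fields, the tier names, and the choice of small model are assumptions; SLA handling is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class RequestContext:
    user_tier: str         # e.g. "free" or "paid" (tier names are assumptions)
    remaining_tokens: int  # tokens left in the caller's budget
    estimated_tokens: int  # rough size of this request plus its response


SMALL_MODEL = "gpt-4o-mini"


def apply_cost_policy(semantic_choice: str, ctx: RequestContext) -> str:
    """Override the semantic choice when tier or budget rules require it."""
    if ctx.user_tier == "free":
        return SMALL_MODEL  # free tier is downgraded to the small model
    if ctx.estimated_tokens > ctx.remaining_tokens:
        return SMALL_MODEL  # the request would exceed the token budget
    return semantic_choice


paid = RequestContext(user_tier="paid", remaining_tokens=50_000, estimated_tokens=2_000)
free = RequestContext(user_tier="free", remaining_tokens=50_000, estimated_tokens=2_000)
```

Applying the policy after the semantic decision keeps the two concerns separate: the embedding step answers which model fits the task, and the policy step answers which model the caller is allowed to use.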

Advantages

Compared with Rule-Based Routing

Dimension	Embedding routing
Semantic understanding	Strong
Scalability	High
Maintenance cost	Low

Compared with Manually Specifying Models

Better user experience, because users do not need to care about models
Automatically optimized cost
Support for dynamic policies

Practical Recommendations

1. Initial Stage

Start with 5 to 10 intent prototypes
Use manual labeling and tuning

2. Data-Driven Optimization

Record:

User requests
Routing results
User feedback

Continuously optimize the intent embeddings.
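One simple way to use these logs, offered here as an assumption rather than the only option, is to recompute each intent's prototype as the average embedding of the requests that received positive feedback. The RoutingRecord fields and the refresh_centroids helper are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RoutingRecord:
    embedding: list[float]   # embedding of the logged user request
    intent: str              # intent the router selected
    positive_feedback: bool  # True if the user accepted the answer


def refresh_centroids(records: list[RoutingRecord]) -> dict[str, list[float]]:
    """Average the embeddings of positively rated requests, per intent."""
    grouped = defaultdict(list)
    for record in records:
        if record.positive_feedback:
            grouped[record.intent].append(record.embedding)
    return {
        intent: [sum(column) / len(vectors) for column in zip(*vectors)]
        for intent, vectors in grouped.items()
    }


records = [
    RoutingRecord([1.0, 0.0], "code_generation", True),
    RoutingRecord([0.0, 1.0], "code_generation", True),
    RoutingRecord([5.0, 5.0], "code_generation", False),  # ignored: negative feedback
]
centroids = refresh_centroids(records)
```

Intents with no positively rated requests are simply absent from the result, so their existing prototypes can be kept unchanged.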

3. Combine with an LLM Router

In the future, this can evolve into:

Hybrid decision-making with a small model plus embeddings, also known as a Hybrid Router.
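A hybrid router of this kind might look like the sketch below: the embedding match is trusted when its score is high enough, and a small LLM is asked to classify the request otherwise. The hybrid_route function, the 0.75 threshold, and both stub components are assumptions for illustration.

```python
from typing import Callable


def hybrid_route(
    prompt: str,
    embedding_router: Callable[[str], tuple[str, float]],
    llm_classifier: Callable[[str], str],
    threshold: float = 0.75,
) -> tuple[str, str]:
    """Return (model, decided_by), preferring the cheaper embedding path."""
    model, score = embedding_router(prompt)
    if score >= threshold:
        return model, "embedding"
    # Low-confidence match: defer to the small LLM classifier.
    return llm_classifier(prompt), "llm"


# Stubs standing in for the real components.
def stub_embedding_router(prompt: str) -> tuple[str, float]:
    if prompt.startswith("What is"):
        return "gpt-4o-mini", 0.91
    return "gpt-4", 0.40


def stub_llm_classifier(prompt: str) -> str:
    return "gpt-4"


confident = hybrid_route("What is Envoy?", stub_embedding_router, stub_llm_classifier)
uncertain = hybrid_route("Plan a migration", stub_embedding_router, stub_llm_classifier)
```

Returning which path made the decision makes it easy to log how often the fallback classifier is needed, which in turn helps tune the threshold.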

Summary

The essence of this approach is:

Replace rules with embeddings, and use vector similarity to drive model selection.

Key benefits:

Lower cost, by intelligently choosing lightweight models
Better quality, by sending complex tasks to stronger models
Architectural decoupling, by separating routing logic from models

In the multi-model era, a semantic gateway of this kind is likely to become a common architectural component.
