Performance Benchmarks

Comprehensive benchmarks comparing semantic routing against LLM-based routing across 56 test cases.

70-90x faster than LLM routing
Local embeddings deliver ~5ms latency vs ~450ms for LLM routing.

🎯 89% accuracy with OpenAI
Best overall accuracy across all query types, beating LLM routing at 81%.

💰 23x cheaper than LLM routing
OpenAI embeddings cost $0.002/1K requests vs $0.048/1K for LLM routing.

Results Summary

| Method | Avg Latency | P95 Latency | Accuracy | Cost/1000 req |
| --- | --- | --- | --- | --- |
| Local (Transformers.js) (fastest) | ~5ms | ~10ms | 77% | Free |
| OpenAI Embeddings (most accurate) | ~320ms | ~565ms | 89% | $0.002/1K |
| LLM Routing | ~450ms | ~620ms | 81% | $0.048/1K |
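To put the cost column in perspective, a quick back-of-the-envelope calculation (assuming the per-1K prices above scale linearly with volume; the function name is illustrative, not part of any library):

```typescript
// Estimated monthly cost at a given request volume, using the
// per-1K-request prices from the table above.
const COST_PER_1K = {
  local: 0,      // Transformers.js runs locally, no API cost
  openai: 0.002, // text-embedding-3-small
  llm: 0.048,    // gpt-4o-mini routing
};

function monthlyCost(
  requestsPerMonth: number,
  method: keyof typeof COST_PER_1K,
): number {
  return (requestsPerMonth / 1000) * COST_PER_1K[method];
}

// At 1M requests/month: local stays free, OpenAI embeddings cost
// roughly $2, and LLM routing roughly $48.
```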

Latency Comparison

[Chart: average and P95 latency (ms) for Local, OpenAI, and LLM routing]

Lower is better. Local embeddings are 70x faster than OpenAI and 90x faster than LLM routing.

Accuracy by Query Complexity

[Chart: accuracy (%) on Easy, Medium, and Hard queries for LLM Routing, Local (Transformers.js), and OpenAI Embeddings]

OpenAI embeddings perform best on medium complexity queries, while all methods excel on straightforward queries.

Cost Comparison

[Chart: cost ($) per 1000 requests for Local, OpenAI, and LLM routing]

Local embeddings are completely free, while OpenAI embeddings are 23x cheaper than LLM routing.

Key Findings

💻 Local Embeddings

  • ~70x faster than OpenAI embeddings
  • ~90x faster than LLM routing
  • Zero API costs - completely free
  • 100% accuracy on straightforward queries
  • Lower accuracy on nuanced queries (77% overall)

☁️ OpenAI Embeddings

  • Best overall accuracy (89%)
  • ~30% faster than LLM routing
  • 23x cheaper than LLM routing
  • Excellent on medium complexity queries (96%)
  • Requires API key and internet connection

Choosing the Right Approach

| Use Case | Recommended | Why |
| --- | --- | --- |
| High-volume, cost-sensitive | Local embeddings | Free, <5ms latency |
| Production with clear intents | Local embeddings | Speed + accuracy on typical queries |
| Complex/ambiguous routing | OpenAI embeddings | Best accuracy |
| Maximum accuracy on edge cases | LLM routing | Reasoning capability |
| Offline/edge deployment | Local embeddings | No network required |

Hybrid Approach

For optimal results, consider a hybrid strategy that uses local embeddings for fast first-pass routing, then falls back to OpenAI for uncertain cases:

hybrid.ts
// localRouter and openaiRouter are pre-configured router instances
async function smartRoute(query: string) {
  // Fast first-pass with local embeddings
  const localResult = await localRouter.route(query);

  // If confidence is high, use it (the 0.85 threshold is tunable)
  if (localResult.score > 0.85) {
    return localResult;
  }

  // Fall back to OpenAI for uncertain cases
  return await openaiRouter.route(query);
}

// This gives you <5ms latency for ~80% of queries
// while maintaining high accuracy

Best of Both Worlds

This approach gives you <5ms latency for ~80% of queries while maintaining high accuracy by using OpenAI for edge cases, striking a practical balance of speed, cost, and accuracy.
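As a rough sanity check on that claim, the hybrid router's expected average latency can be estimated from the benchmark averages above (this is a sketch assuming ~5ms local, ~320ms OpenAI, and that fall-through queries pay both costs; the function name is illustrative):

```typescript
// Expected average latency of the hybrid strategy: every query pays
// the local first pass (~5ms), and only the fraction that falls
// below the confidence threshold also pays the OpenAI call (~320ms).
const LOCAL_MS = 5;
const OPENAI_MS = 320;

function expectedHybridLatency(localHitRate: number): number {
  const fallThrough = 1 - localHitRate;
  return LOCAL_MS + fallThrough * OPENAI_MS;
}

// At an 80% local hit rate the average is ~69ms, still several times
// faster than pure OpenAI routing, at roughly 20% of its API cost.
```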

Benchmark Methodology

Dataset

56 test cases across 6 customer service routes (billing, technical support, account, shipping, returns, general inquiry)

Test Categories

  • Easy: Exact matches or very close paraphrases
  • Medium: Different wording and sentence structures
  • Hard: Ambiguous, multi-intent, or out-of-domain queries

Models Used

  • Local: Xenova/all-MiniLM-L6-v2
  • OpenAI: text-embedding-3-small
  • LLM: gpt-4o-mini
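For reference, the embedding-based routing both methods rely on can be sketched as follows: embed the query and each route description, then pick the route with the highest cosine similarity. This is a minimal illustration, not the benchmark harness itself; `routeBySimilarity` is a hypothetical name, and `embed` stands in for a Transformers.js feature-extraction pipeline over Xenova/all-MiniLM-L6-v2 (local) or a text-embedding-3-small API call (OpenAI):

```typescript
// Any function mapping text to a vector works as the embedder:
// locally a Transformers.js pipeline, remotely an OpenAI API call.
type Embedder = (text: string) => Promise<number[]>;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

async function routeBySimilarity(
  query: string,
  routes: Record<string, string>, // route name -> description
  embed: Embedder,
): Promise<{ name: string; score: number }> {
  const queryVec = await embed(query);
  let best = { name: '', score: -Infinity };
  for (const [name, description] of Object.entries(routes)) {
    const score = cosine(queryVec, await embed(description));
    if (score > best.score) best = { name, score };
  }
  return best;
}
```

In practice the route-description embeddings would be computed once up front and cached, so each query costs a single embedding plus a handful of dot products.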

Environment

Benchmarks performed on Node.js 18+. Results may vary based on hardware, network conditions, and query distribution.