Performance Benchmarks
Comprehensive benchmarks comparing semantic routing against LLM-based routing across 56 test cases.

- Local embeddings deliver ~5ms latency vs ~450ms for LLM routing
- OpenAI embeddings achieve the best overall accuracy (89%) across all query types, beating LLM routing at 81%
- OpenAI embeddings cost $0.002/1K requests vs $0.048/1K for LLM routing
Results Summary
| Method | Avg Latency | P95 Latency | Accuracy | Cost per 1K requests |
|---|---|---|---|---|
| Local (Transformers.js) *(fastest)* | ~5ms | ~10ms | 77% | Free |
| OpenAI Embeddings *(most accurate)* | ~320ms | ~565ms | 89% | $0.002 |
| LLM Routing | ~450ms | ~620ms | 81% | $0.048 |
Latency Comparison
[Chart: average and P95 latency for each method]
Lower is better. Local embeddings are ~70x faster than OpenAI embeddings and ~90x faster than LLM routing.
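To reproduce latency numbers like these, a minimal harness such as the one below can time repeated `route()` calls and report the average and P95. The warm-up pass and percentile math are our assumptions about a fair measurement, not the exact harness used for these figures; `router` stands in for whichever implementation is under test.

```typescript
// Minimal latency harness: times repeated route() calls and reports avg/P95.
async function measureLatency(
  router: { route: (query: string) => Promise<unknown> },
  queries: string[],
  runs = 100
): Promise<{ avgMs: number; p95Ms: number }> {
  // Warm-up pass so model loading / connection setup doesn't skew the samples
  await router.route(queries[0]);

  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await router.route(queries[i % queries.length]);
    samples.push(performance.now() - start);
  }

  samples.sort((a, b) => a - b);
  const avgMs = samples.reduce((sum, ms) => sum + ms, 0) / samples.length;
  const p95Ms = samples[Math.floor(samples.length * 0.95)];
  return { avgMs, p95Ms };
}
```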
Accuracy by Query Complexity
[Chart: accuracy by query difficulty for LLM routing, local embeddings (Transformers.js), and OpenAI embeddings]
OpenAI embeddings perform best on medium complexity queries, while all methods excel on straightforward queries.
Cost Comparison
Local embeddings are completely free, while OpenAI embeddings are 24x cheaper than LLM routing ($0.048 / $0.002 = 24).
Key Findings
💻 Local Embeddings
- ✓ ~70x faster than OpenAI embeddings
- ✓ ~90x faster than LLM routing
- ✓ Zero API costs, completely free
- ✓ 100% accuracy on straightforward queries
- ⚠ Lower accuracy on nuanced queries (77% overall)
☁️ OpenAI Embeddings
- ✓ Best overall accuracy (89%)
- ✓ ~30% faster than LLM routing
- ✓ 24x cheaper than LLM routing
- ✓ Excellent on medium-complexity queries (96%)
- ℹ Requires an API key and internet connection
Choosing the Right Approach
| Use Case | Recommended | Why |
|---|---|---|
| High-volume, cost-sensitive | Local embeddings | Free, <5ms latency |
| Production with clear intents | Local embeddings | Speed + accuracy on typical queries |
| Complex/ambiguous routing | OpenAI embeddings | Best accuracy |
| Maximum accuracy on edge cases | LLM routing | Reasoning capability |
| Offline/edge deployment | Local embeddings | No network required |
Hybrid Approach
For optimal results, consider a hybrid strategy that uses local embeddings for fast first-pass routing, then falls back to OpenAI for uncertain cases:
```typescript
// localRouter and openaiRouter are assumed to be initialized elsewhere
// (e.g. a local Transformers.js router and an OpenAI-backed router).
async function smartRoute(query: string) {
  // Fast first-pass with local embeddings
  const localResult = await localRouter.route(query);

  // If confidence is high, use it
  if (localResult.score > 0.85) {
    return localResult;
  }

  // Fall back to OpenAI for uncertain cases
  return await openaiRouter.route(query);
}
```

Best of Both Worlds
This approach gives you <5ms latency for ~80% of queries while maintaining high accuracy by using OpenAI for edge cases. Perfect balance of speed, cost, and accuracy.
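As a usage sketch, calling it looks like calling either router directly; the query and result shape below are illustrative, since the exact shape depends on your router implementation:

```typescript
const result = await smartRoute("I was charged twice for my subscription");
console.log(result); // e.g. { route: "billing", score: 0.91 } (shape depends on your router)
```

The 0.85 threshold is a tunable knob: raising it sends more traffic to OpenAI (higher accuracy, higher cost), while lowering it keeps more traffic on the free local path.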
Benchmark Methodology
Dataset
56 test cases across 6 customer service routes (billing, technical support, account, shipping, returns, general inquiry)
Test Categories
- Easy: Exact matches or very close paraphrases
- Medium: Different wording and sentence structures
- Hard: Ambiguous, multi-intent, or out-of-domain queries
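The dataset itself isn't published here, but a plausible shape for a test case, and the per-difficulty accuracy computation, might look like the sketch below. The field names (`query`, `expectedRoute`, `difficulty`) are our own, not the actual schema.

```typescript
// Hypothetical test-case shape; field names are illustrative.
interface TestCase {
  query: string;
  expectedRoute: string; // one of the 6 routes, e.g. "billing"
  difficulty: "easy" | "medium" | "hard";
}

// Accuracy per difficulty bucket: fraction of cases where the
// router's top route matches the expected one.
async function accuracyByDifficulty(
  router: { route: (query: string) => Promise<{ route: string }> },
  cases: TestCase[]
): Promise<Record<string, number>> {
  const totals: Record<string, { correct: number; total: number }> = {};
  for (const tc of cases) {
    const bucket = (totals[tc.difficulty] ??= { correct: 0, total: 0 });
    bucket.total++;
    const result = await router.route(tc.query);
    if (result.route === tc.expectedRoute) bucket.correct++;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([d, { correct, total }]) => [d, correct / total])
  );
}
```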
Models Used
- Local: Xenova/all-MiniLM-L6-v2
- OpenAI: text-embedding-3-small
- LLM: gpt-4o-mini
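For reference, a stripped-down sketch of embedding-based routing with these two embedding models is shown below. The `@xenova/transformers` and `openai` package calls are real APIs, but the cosine-similarity routing around them is a generic reconstruction, not this benchmark's exact code.

```typescript
import { pipeline } from "@xenova/transformers";
import OpenAI from "openai";

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Local embeddings: Xenova/all-MiniLM-L6-v2 runs in-process via Transformers.js.
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
async function localEmbedding(text: string): Promise<number[]> {
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data as Float32Array);
}

// OpenAI embeddings: one network round trip per query (needs OPENAI_API_KEY).
const openai = new OpenAI();
async function openaiEmbedding(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

// Pick the route whose precomputed example embedding is most similar to the query.
async function routeBySimilarity(
  query: string,
  routes: { name: string; embedding: number[] }[],
  embed: (text: string) => Promise<number[]> = localEmbedding
) {
  const q = await embed(query);
  let best = { route: "", score: -1 };
  for (const r of routes) {
    const score = cosine(q, r.embedding);
    if (score > best.score) best = { route: r.name, score };
  }
  return best;
}
```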
Environment
Benchmarks performed on Node.js 18+. Results may vary based on hardware, network conditions, and query distribution.