Performance Benchmarks
Comprehensive benchmarks comparing semantic routing against LLM-based routing across 56 test cases.

- Local embeddings deliver ~5ms latency vs ~450ms for LLM routing
- OpenAI embeddings achieve the best overall accuracy (89%) across all query types, beating LLM routing at 81%
- OpenAI embeddings cost $0.002/1K requests vs $0.048/1K for LLM routing
Results Summary
| Method | Avg Latency | P95 Latency | Accuracy | Cost per 1K requests |
|---|---|---|---|---|
| Local (Transformers.js) *(fastest)* | ~5ms | ~10ms | 77% | Free |
| OpenAI Embeddings *(most accurate)* | ~320ms | ~565ms | 89% | $0.002 |
| LLM Routing | ~450ms | ~620ms | 81% | $0.048 |
Latency Comparison
[Chart: average and P95 latency for each method]
Lower is better. Local embeddings are ~70x faster than OpenAI embeddings and ~90x faster than LLM routing.
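To reproduce latency numbers like these, a minimal harness such as the one below can time repeated `route()` calls and report the average and P95. The warm-up pass and percentile math are our assumptions about a fair measurement, not the exact harness used for these figures; `router` stands in for whichever implementation is under test.

```typescript
// Minimal latency harness: times repeated route() calls and reports avg/P95.
async function measureLatency(
  router: { route: (query: string) => Promise<unknown> },
  queries: string[],
  runs = 100
): Promise<{ avgMs: number; p95Ms: number }> {
  // Warm-up pass so model loading / connection setup doesn't skew the samples
  await router.route(queries[0]);

  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await router.route(queries[i % queries.length]);
    samples.push(performance.now() - start);
  }

  samples.sort((a, b) => a - b);
  const avgMs = samples.reduce((sum, ms) => sum + ms, 0) / samples.length;
  const p95Ms = samples[Math.floor(samples.length * 0.95)];
  return { avgMs, p95Ms };
}
```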
Accuracy by Query Complexity
[Chart: accuracy by query difficulty for LLM routing, local embeddings (Transformers.js), and OpenAI embeddings]
OpenAI embeddings perform best on medium complexity queries, while all methods excel on straightforward queries.
Cost Comparison
Local embeddings are completely free, while OpenAI embeddings are 24x cheaper than LLM routing ($0.048 / $0.002 = 24).
Key Findings
💻 Local Embeddings
- ✓ ~70x faster than OpenAI embeddings
- ✓ ~90x faster than LLM routing
- ✓ Zero API costs, completely free
- ✓ 100% accuracy on straightforward queries
- ⚠ Lower accuracy on nuanced queries (77% overall)
☁️ OpenAI Embeddings
- ✓ Best overall accuracy (89%)
- ✓ ~30% faster than LLM routing
- ✓ 24x cheaper than LLM routing
- ✓ Excellent on medium-complexity queries (96%)
- ℹ Requires an API key and internet connection
Choosing the Right Approach
| Use Case | Recommended | Why |
|---|---|---|
| High-volume, cost-sensitive | Local embeddings | Free, <5ms latency |
| Production with clear intents | Local embeddings | Speed + accuracy on typical queries |
| Complex/ambiguous routing | OpenAI embeddings | Best accuracy |
| Maximum accuracy on edge cases | LLM routing | Reasoning capability |
| Offline/edge deployment | Local embeddings | No network required |
Hybrid Approach
For optimal results, consider a hybrid strategy that uses local embeddings for fast first-pass routing, then falls back to OpenAI for uncertain cases:
```typescript
// localRouter and openaiRouter are assumed to be initialized elsewhere
// (e.g. a local Transformers.js router and an OpenAI-backed router).
async function smartRoute(query: string) {
  // Fast first-pass with local embeddings
  const localResult = await localRouter.route(query);

  // If confidence is high, use it
  if (localResult.score > 0.85) {
    return localResult;
  }

  // Fall back to OpenAI for uncertain cases
  return await openaiRouter.route(query);
}
```

Best of Both Worlds
This approach gives you <5ms latency for ~80% of queries while maintaining high accuracy by using OpenAI for edge cases. Perfect balance of speed, cost, and accuracy.
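As a usage sketch, calling it looks like calling either router directly; the query and result shape below are illustrative, since the exact shape depends on your router implementation:

```typescript
const result = await smartRoute("I was charged twice for my subscription");
console.log(result); // e.g. { route: "billing", score: 0.91 } (shape depends on your router)
```

The 0.85 threshold is a tunable knob: raising it sends more traffic to OpenAI (higher accuracy, higher cost), while lowering it keeps more traffic on the free local path.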
Benchmark Methodology
Dataset
56 test cases across 6 customer service routes (billing, technical support, account, shipping, returns, general inquiry)
Test Categories
- Easy: Exact matches or very close paraphrases
- Medium: Different wording and sentence structures
- Hard: Ambiguous, multi-intent, or out-of-domain queries
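The dataset itself isn't published here, but a plausible shape for a test case, and the per-difficulty accuracy computation, might look like the sketch below. The field names (`query`, `expectedRoute`, `difficulty`) are our own, not the actual schema.

```typescript
// Hypothetical test-case shape; field names are illustrative.
interface TestCase {
  query: string;
  expectedRoute: string; // one of the 6 routes, e.g. "billing"
  difficulty: "easy" | "medium" | "hard";
}

// Accuracy per difficulty bucket: fraction of cases where the
// router's top route matches the expected one.
async function accuracyByDifficulty(
  router: { route: (query: string) => Promise<{ route: string }> },
  cases: TestCase[]
): Promise<Record<string, number>> {
  const totals: Record<string, { correct: number; total: number }> = {};
  for (const tc of cases) {
    const bucket = (totals[tc.difficulty] ??= { correct: 0, total: 0 });
    bucket.total++;
    const result = await router.route(tc.query);
    if (result.route === tc.expectedRoute) bucket.correct++;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([d, { correct, total }]) => [d, correct / total])
  );
}
```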
Models Used
- Local: Xenova/all-MiniLM-L6-v2
- OpenAI: text-embedding-3-small
- LLM: gpt-4o-mini
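For reference, a stripped-down sketch of embedding-based routing with these two embedding models is shown below. The `@xenova/transformers` and `openai` package calls are real APIs, but the cosine-similarity routing around them is a generic reconstruction, not this benchmark's exact code.

```typescript
import { pipeline } from "@xenova/transformers";
import OpenAI from "openai";

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Local embeddings: Xenova/all-MiniLM-L6-v2 runs in-process via Transformers.js.
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
async function localEmbedding(text: string): Promise<number[]> {
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data as Float32Array);
}

// OpenAI embeddings: one network round trip per query (needs OPENAI_API_KEY).
const openai = new OpenAI();
async function openaiEmbedding(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

// Pick the route whose precomputed example embedding is most similar to the query.
async function routeBySimilarity(
  query: string,
  routes: { name: string; embedding: number[] }[],
  embed: (text: string) => Promise<number[]> = localEmbedding
) {
  const q = await embed(query);
  let best = { route: "", score: -1 };
  for (const r of routes) {
    const score = cosine(q, r.embedding);
    if (score > best.score) best = { route: r.name, score };
  }
  return best;
}
```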
Environment
Benchmarks performed on Node.js 18+. Results may vary based on hardware, network conditions, and query distribution.