Advanced LLM applications chain together dozens of LLM calls to accomplish complex tasks. Structuring the work as an explicit workflow not only helps systems achieve their goals reliably, it also gives engineers precise control over each step rather than merely tweaking vague prompts.
However, one of the drawbacks of LLM workflows is their high cost and latency. For instance, an AI trip planner that I built recently cost about $0.26 per run (excluding the Perplexity API) and took approximately 289 seconds to generate a 2-day trip schedule, requiring 22 LLM calls across 4 different models. (Check this link to see the actual trace of LLM calls.)
To reduce cost and latency while improving performance, I manually selected the appropriate model for each LLM call. For simple tasks like summarization, I used small and fast models such as Claude Haiku. For reflection and validation tasks, I employed o3-mini—a reasoning model—to achieve better accuracy. My decisions were based on understanding the difficulty of each task and the specific capabilities of each model. Additionally, I made the final choice between similar-tier models, such as GPT-4o and Claude Sonnet, by carefully examining their responses and comparative performance on each task.
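In code, this manual approach boils down to a static, hand-maintained lookup table. Below is a minimal sketch of that idea; the task labels and model names are illustrative placeholders, not the exact ones from my workflow.

```python
# Static, hand-picked mapping from task type to model (names are illustrative).
TASK_MODEL_MAP = {
    "summarization": "claude-3-haiku",   # simple task -> small, fast model
    "validation": "o3-mini",             # reflection/validation -> reasoning model
    "itinerary_generation": "gpt-4o",    # open-ended generation -> strong general model
}

def pick_model(task_type: str, default: str = "gpt-4o-mini") -> str:
    """Return the hand-picked model for a task, falling back to a default model."""
    return TASK_MODEL_MAP.get(task_type, default)
```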
However, I questioned the scalability of this approach. What if I need to manage hundreds of constantly evolving models? What if my workflow becomes extremely lengthy and difficult to examine at each step?
This is where automatic model routing comes in. Instead of engineers manually assigning a model to each call, a model router automatically determines which LLM is best suited to respond to a given prompt. In this short write-up, I'll explain seven kinds of model routers. I've grouped them into "Brute search" and "Classifier" categories to make them easier to understand. Let's start with Brute search.
Methods in this category call multiple models in sequence or in parallel to get the best results, focusing solely on answer quality rather than improving latency or reducing cost.
Strictly speaking, these methods don't truly “route” models but rather explore or search through multiple options. Nevertheless, they could be an excellent choice if your primary goal is maximizing workflow performance.
https://arxiv.org/pdf/2310.12963
This method (AutoMix) first uses a small model to generate a response, then prompts that same model to self-verify its own answer.
Based on the outcome of that self-verification, it decides whether the small model's response is good enough or whether to escalate to a bigger model, in this case GPT-4.
It's important to note that this method does not optimize latency, and it can actually cost more if most of your tasks end up requiring the strong model, since you pay for both the small model's attempt and the escalation.
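A rough sketch of this cascade logic looks like the following. The helper `call_llm` stands in for whatever LLM client you use, the verification prompt is a simplified stand-in for the self-verification prompt used in the paper, and the model names are placeholders.

```python
def call_llm(model: str, prompt: str) -> str:
    """Placeholder for an actual LLM API call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def automix_style_answer(question: str,
                         small_model: str = "small-llm",
                         large_model: str = "gpt-4") -> str:
    # 1. Let the small model draft an answer.
    draft = call_llm(small_model, question)

    # 2. Ask the same small model to verify its own draft (simplified prompt).
    verdict = call_llm(
        small_model,
        f"Question: {question}\nProposed answer: {draft}\n"
        "Is the proposed answer correct? Answer 'yes' or 'no'.",
    )

    # 3. Keep the cheap answer if verification passes; otherwise escalate.
    if verdict.strip().lower().startswith("yes"):
        return draft
    return call_llm(large_model, question)
```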
https://arxiv.org/pdf/2305.05176
Similar to AutoMix, FrugalGPT chains models in order of capability: the smallest model tries to answer the question first, and if its answer is judged inadequate, the task is passed to the next model in the sequence.
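Sketched in the same style, a FrugalGPT-like cascade walks an ordered list of models and stops at the first answer that clears a quality threshold. The scoring function is the key piece: FrugalGPT trains a small scoring model for it, while the stub below only marks where it would plug in. The cascade order and threshold here are illustrative.

```python
MODEL_CASCADE = ["small-llm", "mid-llm", "gpt-4"]  # cheapest -> strongest (placeholders)

def score_answer(question: str, answer: str) -> float:
    """Return a quality score in [0, 1]; FrugalGPT learns this scorer from data."""
    raise NotImplementedError

def cascade_answer(question: str, threshold: float = 0.8) -> str:
    answer = ""
    for model in MODEL_CASCADE:
        answer = call_llm(model, question)  # call_llm as sketched above
        if score_answer(question, answer) >= threshold:
            return answer                   # good enough -> stop escalating
    return answer                           # otherwise return the strongest model's answer
```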