In our previous post, we discussed several ways to enhance the performance of your AI applications. One of those approaches is the evaluation and choice of a GenAI model. And with the proliferation of modern language models, you could be forgiven for feeling somewhat lost. If you’re creating a GenAI-forward project using LLMs, you have a huge number of options, and choosing the right one can be daunting. It’s a landscape teeming with commercial offerings, cloud endpoints with ever-expanding catalogs, and a plethora of free "BYO-hardware" solutions, many supporting fine-tuning, function calling, and whatever else is newly possible. Perhaps most difficult of all is navigating “performance data”: each model claims to be tested on the ultimate benchmark set, and none of those benchmarks fully reflects real-world performance. How do you begin to sift through it all?
While everyone needs a good mix of cost and performance, how those performance metrics correspond to real use cases can be open to interpretation. For example, the Mistral/Mixtral model family reports several benchmarks that look very useful for comparing models. Likewise, OpenHermes 2.5 (a derivative of the Mistral family) also publishes performance metrics. But comparing these directly is tough, because each set of metrics is just one “family” within a much larger collection of benchmark suites. In particular, the suite used for Mixtral is called GPT4All, which seems fairly comprehensive on its own. But TruthfulQA, AGIEval, BigBench, and others are also “performance benchmarks”, and each comprises additional datasets and measurements that will cast a model’s performance in a slightly different light.
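To make the apples-to-oranges problem concrete, here is a minimal sketch that records which benchmark families each candidate reports and computes the overlap you could actually compare head-to-head. The model-to-suite mapping below is illustrative only, not taken from any model card:

```python
# Sketch: find which benchmark families two model cards actually share.
# The mappings below are illustrative placeholders, not real published results.

reported_suites = {
    "mixtral-8x7b": {"GPT4All"},                                       # hypothetical coverage
    "openhermes-2.5": {"GPT4All", "TruthfulQA", "AGIEval", "BigBench"},  # hypothetical coverage
    "starling-lm-7b": {"MT-Bench"},                                     # hypothetical coverage
}

def comparable_suites(model_a: str, model_b: str) -> set[str]:
    """Return the benchmark families both models report, i.e. the only
    numbers you can compare directly without re-running evaluations."""
    return reported_suites[model_a] & reported_suites[model_b]

if __name__ == "__main__":
    for a, b in [("mixtral-8x7b", "openhermes-2.5"),
                 ("mixtral-8x7b", "starling-lm-7b")]:
        shared = comparable_suites(a, b)
        print(f"{a} vs {b}: {sorted(shared) or 'no shared benchmarks'}")
```

Even in this toy version, some pairs of models share only one suite, or none at all - which is exactly the problem with comparing published numbers directly.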
And lest you think testing all of these things would be sufficient, consider the Starling-LM model. It reports results on the “MT-Bench” evaluation set, which comes from the same team behind the “Chatbot Arena” - a setup where users submit prompts of their choice to two anonymized LLMs and then pick a winner, producing a kind of Elo ranking for LLMs. For MT-Bench itself, the Starling team used GPT-4 as a judge over a broad set of queries. While creative and impressive, this is yet another set of benchmarks we would have to consider.
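For reference, the underlying Elo mechanics are simple. Here is a minimal sketch of an Elo update from a single pairwise preference, assuming a conventional K-factor of 32 (the arena’s actual parameters may differ):

```python
# Sketch of a standard Elo update from one pairwise comparison.
# K=32 is a conventional choice; the actual arena parameters may differ.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: both models start at 1000; a user prefers model A's answer.
print(elo_update(1000.0, 1000.0, a_won=True))  # -> (1016.0, 984.0)
```

Aggregated over many such votes, these per-comparison updates are what produce a leaderboard-style ranking.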
These published results show that researchers deeply value complete, exhaustive, and carefully constructed tests that speak to the differences between one model and another. But instead of focusing on globally defined model performance, the practical developer tends to have a specific use case, drawn from a well-defined set of tasks:
This isn’t strictly every possible task, but the point is that your actual needs are fairly well-defined: first, you need the simplest high-accuracy AI structure that does your specific task every time, and second, the cheapest and fastest resource and model available that still meets that “every-time” requirement. Crucially, you need these things in that order - low-accuracy and failed prompt results are a non-starter no matter how inexpensive or quick they are. Trying to engineer prompts or other solutions while simultaneously exploring the model space can lead to a lot of lost time and effort. But by leaning on the confidence you have probably already built with popular commodity models, you can take a prompt-first approach that avoids much of this effort:
This isn’t the only implementation doctrine for GenAI. Some prefer to begin with the most powerful models available, and then try to reduce time and cost once that approach is functional. But this can require restructuring parts of the pipeline again - offloading prompt functionality to smaller models is generally harder than developing on lighter-weight models in the first place and adding more power only where needed. This “intermediate model prototyping” approach, followed by optimization against a known benchmark, is a useful way to avoid getting lost in the “model performance” woods. And if you have a set of business needs like the ones discussed here, hopefully it can save you some time as well.
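As a rough illustration of that optimization step, here is a minimal sketch that gates candidate models on accuracy over your own task examples before cost ever enters the picture. The `Candidate` structure, the `run_task` callable, and the 0.99 accuracy bar are all hypothetical stand-ins, not a prescribed implementation:

```python
# Sketch: pick the cheapest model that clears an accuracy bar on YOUR task.
# `Candidate`, `run_task`, and the accuracy bar are hypothetical stand-ins
# for your own model clients and task-specific evaluation logic.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    name: str
    cost_per_1k_tokens: float            # e.g. published API pricing
    run_task: Callable[[str], str]       # your prompt + model call

def accuracy(candidate: Candidate, examples: list[tuple[str, str]]) -> float:
    """Fraction of hand-labeled (input, expected_output) pairs the model gets right."""
    hits = sum(candidate.run_task(x).strip() == y for x, y in examples)
    return hits / len(examples)

def pick_model(candidates: list[Candidate],
               examples: list[tuple[str, str]],
               accuracy_bar: float = 0.99) -> Candidate | None:
    """Accuracy gate first, cost second: never trade correctness for price."""
    passing = [c for c in candidates if accuracy(c, examples) >= accuracy_bar]
    return min(passing, key=lambda c: c.cost_per_1k_tokens) if passing else None
```

If nothing passes the bar, that is a signal to improve the prompt or the task decomposition before reaching for a heavier model - which is exactly the ordering the prompt-first approach argues for.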