Which LLM or architecture is right for your use case? The Colibri Digital team recently joined AWS for a live OnAir demo to show how we’ve built a solution that answers exactly that question, using real-time model evaluation, cost-performance metrics, and automated orchestration to help teams stop guessing and start optimising.
Here’s what we shared and why it matters.
As Gelareh Taghizadeh, Colibri’s Head of AI and Data Science puts it:
“At the heart of every GenAI build, there’s one critical task: evaluation.”
Choosing the best LLM or architecture for a use case isn’t as simple as testing a few prompts. Foundation models differ widely in cost, latency, accuracy, and adaptability, and the choice becomes even more complex when multiple tools or agents are involved.
Colibri’s engineering team used to spend 5–7 days manually testing and benchmarking combinations of models, approaches, and architectures.
It was slow, repetitive, and not scalable, especially in a world where new models drop monthly.
So we automated it.
Colibri built a GenAI evaluation framework — internally referred to as a "switchboard" — to compare multiple LLMs, approaches, and architectures at once.
What it does: the switchboard runs real-time model evaluations, tracks cost-performance metrics, and automates orchestration, so multiple candidates can be compared side by side.
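As a hedged illustration of the idea, not Colibri’s actual implementation, a minimal switchboard can be modelled as a registry of candidate pipelines that are all run against the same test cases, with latency and a task score recorded for each. The pipeline functions and keyword-based scoring below are hypothetical stand-ins; a real deployment would call hosted models (for example via Amazon Bedrock) instead.

```python
import time

# Hypothetical stand-ins for real LLM/agent pipelines.
def single_llm(prompt: str) -> str:
    return f"Itinerary (single LLM) for: {prompt}"

def multi_agent(prompt: str) -> str:
    # A real multi-agent pipeline would also call weather/web-search tools.
    return f"Itinerary (multi-agent, weather + search) for: {prompt}"

def keyword_score(output: str, keywords) -> float:
    # Toy accuracy proxy: fraction of expected keywords present in the output.
    return sum(k in output.lower() for k in keywords) / len(keywords)

def evaluate(candidates, test_cases, score_fn):
    """Run every candidate on every test case; record mean latency and score."""
    results = {}
    for name, pipeline in candidates.items():
        latencies, scores = [], []
        for prompt, expected_keywords in test_cases:
            start = time.perf_counter()
            output = pipeline(prompt)
            latencies.append(time.perf_counter() - start)
            scores.append(score_fn(output, expected_keywords))
        results[name] = {
            "avg_latency_s": sum(latencies) / len(latencies),
            "avg_score": sum(scores) / len(scores),
        }
    return results

candidates = {"single_llm": single_llm, "multi_agent": multi_agent}
test_cases = [("3-day trip to Lisbon", ["itinerary", "lisbon"])]
report = evaluate(candidates, test_cases, keyword_score)
best = max(report, key=lambda name: report[name]["avg_score"])
```

The key design point is that candidates are interchangeable callables, so adding a newly released model to the comparison is a one-line registry change rather than a fresh manual benchmark.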
In the demo, we tested a simple GenAI use case: creating a travel itinerary based on current weather and local events. One version used a basic single LLM; the other used a multi-agent architecture with tools for live weather and web search.
Instead of choosing based on gut feel or brand preference, the framework gave us side-by-side accuracy, latency, and cost metrics for each option.
That means product teams, engineers and business stakeholders can make data-informed decisions about which models, tools, and architectures to deploy.
As Sergio Ghisler, a Colibri data scientist, shared during the demo:
“Before, it took 5–7 days to do a full model evaluation. Now it takes less than 24 hours and gives us clearer, faster answers.”
If you're building GenAI-powered tools, internally or for customers, this framework changes the game:
✅ No more guesswork. Choose models based on real performance, not hype.
✅ Faster time to deploy. Evaluate new models in under 24 hours.
✅ Business-value lens. Balance accuracy, latency, and cost, and prove ROI.
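One way to make a "business-value lens" computable, offered here as an illustration rather than the framework's actual formula, is a weighted score over normalised accuracy, latency, and cost, where the weights and ceilings encode what the business cares about:

```python
def business_value_score(accuracy, latency_s, cost_usd,
                         max_latency_s=10.0, max_cost_usd=0.10,
                         weights=(0.5, 0.25, 0.25)):
    """Weighted score in [0, 1]; higher is better.

    accuracy is assumed to already lie in [0, 1]; latency and cost are
    normalised against illustrative ceilings and inverted, so faster and
    cheaper candidates score higher. Weights and ceilings are assumptions
    to be tuned per use case.
    """
    w_acc, w_lat, w_cost = weights
    latency_term = 1.0 - min(latency_s / max_latency_s, 1.0)
    cost_term = 1.0 - min(cost_usd / max_cost_usd, 1.0)
    return w_acc * accuracy + w_lat * latency_term + w_cost * cost_term

# Example: an accurate-but-slow candidate vs. a fast, cheap one.
slow_accurate = business_value_score(accuracy=0.95, latency_s=8.0, cost_usd=0.08)
fast_cheap = business_value_score(accuracy=0.80, latency_s=1.0, cost_usd=0.01)
```

With these particular weights the fast, cheap candidate wins despite lower accuracy; shifting weight toward accuracy flips the decision, which is exactly the trade-off stakeholders need to see in numbers.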
And thanks to AWS, it’s all deployable across SageMaker and Bedrock with access to top-tier foundation models like Anthropic Claude, Meta Llama, and Amazon Titan.
The GenAI landscape moves fast. What’s “best” today may be obsolete in six weeks. That’s why Colibri built this as a flexible, evolving layer that can take on new models and re-run evaluations as the landscape shifts.
“There’s no one-size-fits-all model. But there is a better way to choose.”
— Gelareh Taghizadeh, Head of AI & Data Science, Colibri