
Beyond the Hype: How Colibri Built a Smarter Way to Evaluate Generative AI Models

As organisations explore how to operationalise generative AI, many are asking the same question: How do we choose the right model, and how do we know it’s delivering value?

The Colibri Digital team recently joined AWS for a live OnAir demo to show how we’ve built a solution that answers exactly that, using real-time model evaluation, cost-performance metrics, and automated orchestration to help teams stop guessing and start optimising.

Here’s what we shared and why it matters.

🎯 The Problem: Model Choice Isn’t Obvious or Easy

As Gelareh Taghizadeh, Colibri’s Head of AI and Data Science, puts it:

“At the heart of every GenAI build, there’s one critical task: evaluation.”

Choosing the best LLM or architecture for a use case isn’t as simple as testing a few prompts. Foundation models differ widely in cost, latency, accuracy, and adaptability, and the choice becomes even more complex when multiple tools or agents are involved.

Colibri’s engineering team used to spend 5–7 days manually testing and benchmarking combinations of:

  • Embedding + generative models

  • Latency and token costs

  • Output quality vs. use case needs



It was slow, repetitive, and not scalable, especially in a world where new models drop monthly.
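
To see why, consider the size of the test matrix: every candidate embedding model has to be paired with every candidate generative model and run against every test prompt before anyone can even start scoring outputs. A quick back-of-the-envelope sketch (the model names and counts here are purely illustrative):

    from itertools import product

    # Illustrative candidate lists -- not the actual models we benchmarked.
    embedding_models = ["embed-a", "embed-b", "embed-c"]
    generative_models = ["llm-1", "llm-2", "llm-3", "llm-4"]
    test_prompts = [f"prompt-{i}" for i in range(25)]

    # Every pairing has to be run against every prompt, then reviewed by hand.
    runs = list(product(embedding_models, generative_models, test_prompts))
    print(f"{len(runs)} evaluation runs to execute and review")  # 300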

Watch the full demo below:

🛠 The Solution: Colibri’s GenAI Evaluation Framework

So we automated it.

Colibri built a GenAI evaluation framework — internally referred to as a "switchboard" — to compare multiple LLMs, approaches, and architectures at once.

What it does:

  • Runs structured side-by-side evaluations of multiple models (sketched in code after this list)

  • Measures output quality across five criteria: correctness, relevance, readability, coherence, helpfulness

  • Logs latency and cost metrics for every interaction

  • Supports both single-call LLMs and multi-agent architectures

  • Deploys across AWS infrastructure with support for SageMaker, Bedrock, and third-party APIs
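
Here is roughly what that evaluation loop can look like. Everything below is an illustrative sketch rather than our production code: call_model, judge, the 1–5 scoring scale, and the per-1K-token prices are hypothetical placeholders.

    import time
    from dataclasses import dataclass, field

    # Hypothetical per-1K-token prices (USD); real prices come from the provider.
    PRICES = {
        "model-a": {"in": 0.003, "out": 0.015},
        "model-b": {"in": 0.0005, "out": 0.0015},
    }

    CRITERIA = ["correctness", "relevance", "readability", "coherence", "helpfulness"]

    @dataclass
    class EvalRecord:
        model_id: str
        latency_s: float
        cost_usd: float
        scores: dict = field(default_factory=dict)

    def call_model(model_id: str, prompt: str) -> tuple[str, int, int]:
        """Stand-in for the real call to SageMaker, Bedrock, or a third-party API.
        Returns (output_text, input_tokens, output_tokens)."""
        return f"[{model_id} answer to: {prompt[:30]}...]", 800, 400

    def judge(output: str, task: str) -> dict:
        """Stand-in for an LLM-as-judge call that rates the output 1-5 per criterion."""
        return {criterion: 4 for criterion in CRITERIA}

    def evaluate(model_ids: list[str], prompt: str) -> list[EvalRecord]:
        records = []
        for model_id in model_ids:
            start = time.perf_counter()
            output, tokens_in, tokens_out = call_model(model_id, prompt)
            latency = time.perf_counter() - start

            price = PRICES[model_id]
            cost = tokens_in / 1000 * price["in"] + tokens_out / 1000 * price["out"]

            records.append(EvalRecord(model_id, latency, cost, judge(output, prompt)))
        return records

    for record in evaluate(["model-a", "model-b"], "Plan a one-day itinerary."):
        print(record)

Because every run produces the same record shape, widening the comparison from two models to ten is a data change, not a code change.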



In the demo, we tested a simple GenAI use case: creating a travel itinerary based on current weather and local events. One version used a basic single LLM; the other used a multi-agent architecture with tools for live weather and web search.
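
As a rough illustration of how those two candidates might be expressed as inputs to the evaluation loop, here is a sketch; llm_complete, get_weather, and web_search are hypothetical stand-ins for the model call and tools used in the demo.

    # Hypothetical stand-ins for the real calls made in the demo.
    def llm_complete(prompt: str) -> str: return f"[itinerary based on: {prompt[:60]}...]"
    def get_weather(city: str) -> str:    return "18°C, light rain until mid-afternoon"
    def web_search(query: str) -> str:    return "open-air food market; evening jazz festival"

    def single_llm_candidate(city: str) -> str:
        # One direct call: the model relies on its own, possibly stale, knowledge.
        return llm_complete(f"Plan a one-day itinerary for {city}.")

    def multi_agent_candidate(city: str) -> str:
        # Tool-using variant: gather live context first, then ask the model to plan.
        weather = get_weather(city)
        events = web_search(f"events this weekend in {city}")
        return llm_complete(
            f"Plan a one-day itinerary for {city}.\n"
            f"Current weather: {weather}\nLocal events: {events}"
        )

Running both candidates through the same harness is what turns "the multi-agent version feels better" into comparable quality, latency, and cost numbers.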

Colibri team at AWS OnAir

📊 The Result: Real Evaluation, Real Numbers

Instead of choosing based on gut feel or brand preference, the framework gave us:

  • A quality score breakdown for both approaches

  • Latency comparison (multi-agent = slower, but more relevant)

  • Cost estimates at scale (e.g. price per 100K daily requests; a worked example follows below)
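
As a worked example of how a price-per-100K-requests figure can be derived (the token counts and per-1K-token prices below are assumptions for illustration, not the numbers from the demo):

    # Assumed averages per request -- illustrative only.
    tokens_in, tokens_out = 800, 400          # prompt and completion tokens
    price_in, price_out = 0.003, 0.015        # assumed USD per 1K tokens
    requests_per_day = 100_000

    cost_per_request = tokens_in / 1000 * price_in + tokens_out / 1000 * price_out
    daily_cost = cost_per_request * requests_per_day
    print(f"${cost_per_request:.4f} per request -> ${daily_cost:,.0f} per day")
    # $0.0084 per request -> $840 per day

A multi-agent version that makes, say, three model calls per request would roughly triple that daily figure, which is exactly the trade-off the framework surfaces.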



That means product teams, engineers and business stakeholders can make data-informed decisions about:

  • When multi-agent is worth the complexity

  • Which model delivers “good enough” accuracy

  • Where cost/performance trade-offs land



As Sergio Ghisler, a Colibri data scientist, shared during the demo:

“Before, it took 5–7 days to do a full model evaluation. Now it takes less than 24 hours and gives us clearer, faster answers.”

🔄 Why This Matters for Your Business

If you’re building GenAI-powered tools, internally or for customers, this framework changes the game:

  • No more guesswork. Choose models based on real performance, not hype.

  • Faster time to deploy. Evaluate new models in under 24 hours.

  • Business-value lens. Balance accuracy, latency, and cost, and prove ROI.

And thanks to AWS, it’s all deployable across SageMaker and Bedrock with access to top-tier foundation models like Anthropic Claude, Meta Llama, and Amazon Titan.
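
Because Bedrock puts those models behind a single runtime API, switching candidates can be as simple as changing a model ID. A minimal sketch using boto3's Converse API (model IDs vary by region and over time, so check the Bedrock console for what your account can access):

    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    def ask(model_id: str, question: str) -> str:
        response = bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": question}]}],
        )
        return response["output"]["message"]["content"][0]["text"]

    # Same prompt, three different foundation models -- the IDs are examples only.
    for model_id in [
        "anthropic.claude-3-haiku-20240307-v1:0",
        "meta.llama3-8b-instruct-v1:0",
        "amazon.titan-text-express-v1",
    ]:
        print(model_id, "->", ask(model_id, "Suggest a one-day rainy-weather itinerary."))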

🧠 A Final Thought: Build for Change

The Gen AI landscape moves fast. What’s “best” today may be obsolete in six weeks. That’s why Colibri built this as a flexible, evolving layer that:

  • Integrates new models with ease (see the configuration sketch below)

  • Supports ongoing model evaluation ("GenAI Ops")

  • Empowers clients to adapt fast, without rebuilding from scratch
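
One way to keep that layer flexible is to treat models as configuration rather than code. The registry below is a hypothetical illustration of the idea (the model IDs and prices are examples, not our actual schema); adding next month's model becomes a one-entry change followed by a rerun of the evaluation.

    # Hypothetical model registry: evaluation targets live in config, not code.
    MODEL_REGISTRY = {
        "claude-3-haiku": {
            "provider": "bedrock",
            "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
            "price_per_1k": {"in": 0.00025, "out": 0.00125},  # illustrative prices
        },
        "llama3-8b": {
            "provider": "bedrock",
            "model_id": "meta.llama3-8b-instruct-v1:0",
            "price_per_1k": {"in": 0.0003, "out": 0.0006},  # illustrative prices
        },
        # Next month's model: add an entry here and rerun the evaluation.
    }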

“There’s no one-size-fits-all model. But there is a better way to choose.”
— Gelareh Taghizadeh, Head of AI & Data Science, Colibri
