
Beyond the Hype: How Colibri Built a Smarter Way to Evaluate Generative AI Models

As organisations explore how to operationalise generative AI, many are asking the same question: How do we choose the right model, and how do we know it’s delivering value?

The Colibri Digital team recently joined AWS for a live OnAir demo to show how we’ve built a solution that answers exactly that, using real-time model evaluation, cost-performance metrics, and automated orchestration to help teams stop guessing and start optimising.

Here’s what we shared and why it matters.

🎯 The Problem: Model Choice Isn’t Obvious or Easy

As Gelareh Taghizadeh, Colibri’s Head of AI and Data Science, puts it:

“At the heart of every GenAI build, there’s one critical task: evaluation.”

Choosing the best LLM or architecture for a use case isn’t as simple as testing a few prompts. Foundation models differ widely in cost, latency, accuracy, and adaptability, and the choice becomes even more complex when multiple tools or agents are involved.

Colibri’s engineering team used to spend 5–7 days manually testing and benchmarking combinations of:

  • Embedding + generative models

  • Latency and token costs

  • Output quality vs. use case needs



It was slow, repetitive, and not scalable, especially in a world where new models drop monthly.
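
To see why, consider the size of the test matrix: every candidate embedding model has to be paired with every candidate generative model and run against every test prompt before anyone can even start scoring outputs. A quick back-of-the-envelope sketch (the model names and counts here are purely illustrative):

    from itertools import product

    # Illustrative candidate lists -- not the actual models we benchmarked.
    embedding_models = ["embed-a", "embed-b", "embed-c"]
    generative_models = ["llm-1", "llm-2", "llm-3", "llm-4"]
    test_prompts = [f"prompt-{i}" for i in range(25)]

    # Every pairing has to be run against every prompt, then reviewed by hand.
    runs = list(product(embedding_models, generative_models, test_prompts))
    print(f"{len(runs)} evaluation runs to execute and review")  # 300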

Watch the full demo below:

🛠 The Solution: Colibri’s GenAI Evaluation Framework

So we automated it.

Colibri built a GenAI evaluation framework — internally referred to as a "switchboard" — to compare multiple LLMs, approaches, and architectures at once.

What it does:

  • Runs structured side-by-side evaluations of multiple models (sketched in code after this list)

  • Measures output quality across five criteria: correctness, relevance, readability, coherence, helpfulness

  • Logs latency and cost metrics for every interaction

  • Supports both single-call LLMs and multi-agent architectures

  • Deploys across AWS infrastructure with support for SageMaker, Bedrock, and third-party APIs
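
Here is roughly what that evaluation loop can look like. Everything below is an illustrative sketch rather than our production code: call_model, judge, the 1–5 scoring scale, and the per-1K-token prices are hypothetical placeholders.

    import time
    from dataclasses import dataclass, field

    # Hypothetical per-1K-token prices (USD); real prices come from the provider.
    PRICES = {
        "model-a": {"in": 0.003, "out": 0.015},
        "model-b": {"in": 0.0005, "out": 0.0015},
    }

    CRITERIA = ["correctness", "relevance", "readability", "coherence", "helpfulness"]

    @dataclass
    class EvalRecord:
        model_id: str
        latency_s: float
        cost_usd: float
        scores: dict = field(default_factory=dict)

    def call_model(model_id: str, prompt: str) -> tuple[str, int, int]:
        """Stand-in for the real call to SageMaker, Bedrock, or a third-party API.
        Returns (output_text, input_tokens, output_tokens)."""
        return f"[{model_id} answer to: {prompt[:30]}...]", 800, 400

    def judge(output: str, task: str) -> dict:
        """Stand-in for an LLM-as-judge call that rates the output 1-5 per criterion."""
        return {criterion: 4 for criterion in CRITERIA}

    def evaluate(model_ids: list[str], prompt: str) -> list[EvalRecord]:
        records = []
        for model_id in model_ids:
            start = time.perf_counter()
            output, tokens_in, tokens_out = call_model(model_id, prompt)
            latency = time.perf_counter() - start

            price = PRICES[model_id]
            cost = tokens_in / 1000 * price["in"] + tokens_out / 1000 * price["out"]

            records.append(EvalRecord(model_id, latency, cost, judge(output, prompt)))
        return records

    for record in evaluate(["model-a", "model-b"], "Plan a one-day itinerary."):
        print(record)

Because every run produces the same record shape, widening the comparison from two models to ten is a data change, not a code change.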



In the demo, we tested a simple GenAI use case: creating a travel itinerary based on current weather and local events. One version used a basic single LLM; the other used a multi-agent architecture with tools for live weather and web search.
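
As a rough illustration of how those two candidates might be expressed as inputs to the evaluation loop, here is a sketch; llm_complete, get_weather, and web_search are hypothetical stand-ins for the model call and tools used in the demo.

    # Hypothetical stand-ins for the real calls made in the demo.
    def llm_complete(prompt: str) -> str: return f"[itinerary based on: {prompt[:60]}...]"
    def get_weather(city: str) -> str:    return "18°C, light rain until mid-afternoon"
    def web_search(query: str) -> str:    return "open-air food market; evening jazz festival"

    def single_llm_candidate(city: str) -> str:
        # One direct call: the model relies on its own, possibly stale, knowledge.
        return llm_complete(f"Plan a one-day itinerary for {city}.")

    def multi_agent_candidate(city: str) -> str:
        # Tool-using variant: gather live context first, then ask the model to plan.
        weather = get_weather(city)
        events = web_search(f"events this weekend in {city}")
        return llm_complete(
            f"Plan a one-day itinerary for {city}.\n"
            f"Current weather: {weather}\nLocal events: {events}"
        )

Running both candidates through the same harness is what turns "the multi-agent version feels better" into comparable quality, latency, and cost numbers.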

Colibri team at AWS OnAir

📊 The Result: Real Evaluation, Real Numbers

Instead of choosing based on gut feel or brand preference, the framework gave us:

  • A quality score breakdown for both approaches

  • Latency comparison (multi-agent = slower, but more relevant)

  • Cost estimates at scale (e.g. price per 100K daily requests; a worked example follows below)
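
As a worked example of how a price-per-100K-requests figure can be derived (the token counts and per-1K-token prices below are assumptions for illustration, not the numbers from the demo):

    # Assumed averages per request -- illustrative only.
    tokens_in, tokens_out = 800, 400          # prompt and completion tokens
    price_in, price_out = 0.003, 0.015        # assumed USD per 1K tokens
    requests_per_day = 100_000

    cost_per_request = tokens_in / 1000 * price_in + tokens_out / 1000 * price_out
    daily_cost = cost_per_request * requests_per_day
    print(f"${cost_per_request:.4f} per request -> ${daily_cost:,.0f} per day")
    # $0.0084 per request -> $840 per day

A multi-agent version that makes, say, three model calls per request would roughly triple that daily figure, which is exactly the trade-off the framework surfaces.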



That means product teams, engineers and business stakeholders can make data-informed decisions about:

  • When multi-agent is worth the complexity

  • Which model delivers “good enough” accuracy

  • Where cost/performance trade-offs land



As Sergio Ghisler, a Colibri data scientist, shared during the demo:

“Before, it took 5–7 days to do a full model evaluation. Now it takes less than 24 hours and gives us clearer, faster answers.”

🔄 Why This Matters for Your Business

If you’re building GenAI-powered tools, internally or for customers, this framework changes the game:

  • No more guesswork. Choose models based on real performance, not hype.

  • Faster time to deploy. Evaluate new models in under 24 hours.

  • Business-value lens. Balance accuracy, latency, and cost, and prove ROI.

And thanks to AWS, it’s all deployable across SageMaker and Bedrock with access to top-tier foundation models like Anthropic Claude, Meta Llama, and Amazon Titan.
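
Because Bedrock puts those models behind a single runtime API, switching candidates can be as simple as changing a model ID. A minimal sketch using boto3's Converse API (model IDs vary by region and over time, so check the Bedrock console for what your account can access):

    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    def ask(model_id: str, question: str) -> str:
        response = bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": question}]}],
        )
        return response["output"]["message"]["content"][0]["text"]

    # Same prompt, three different foundation models -- the IDs are examples only.
    for model_id in [
        "anthropic.claude-3-haiku-20240307-v1:0",
        "meta.llama3-8b-instruct-v1:0",
        "amazon.titan-text-express-v1",
    ]:
        print(model_id, "->", ask(model_id, "Suggest a one-day rainy-weather itinerary."))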

🧠 A Final Thought: Build for Change

The Gen AI landscape moves fast. What’s “best” today may be obsolete in six weeks. That’s why Colibri built this as a flexible, evolving layer that:

  • Integrates new models with ease (see the configuration sketch below)

  • Supports ongoing model evaluation ("GenAI Ops")

  • Empowers clients to adapt fast, without rebuilding from scratch
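
One way to keep that layer flexible is to treat models as configuration rather than code. The registry below is a hypothetical illustration of the idea (the model IDs and prices are examples, not our actual schema); adding next month's model becomes a one-entry change followed by a rerun of the evaluation.

    # Hypothetical model registry: evaluation targets live in config, not code.
    MODEL_REGISTRY = {
        "claude-3-haiku": {
            "provider": "bedrock",
            "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
            "price_per_1k": {"in": 0.00025, "out": 0.00125},  # illustrative prices
        },
        "llama3-8b": {
            "provider": "bedrock",
            "model_id": "meta.llama3-8b-instruct-v1:0",
            "price_per_1k": {"in": 0.0003, "out": 0.0006},  # illustrative prices
        },
        # Next month's model: add an entry here and rerun the evaluation.
    }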

“There’s no one-size-fits-all model. But there is a better way to choose.”
— Gelareh Taghizadeh, Head of AI & Data Science, Colibri
