Escaping the Notebook Trap: Taming 30-Year-Old Legacy Data with Config, Not Code

If you work in data engineering long enough, you eventually run into the “notebook trap.”

Written by

Ben Masih

Published on

May 15, 2026

It usually happens when dealing with messy, legacy data sources. Every time the business needs to connect a new data source to the warehouse, a developer has to build a bespoke pipeline for it. You write the code, test it, deploy it, and maintain it. Eventually, you look up and realise your architecture is just a massive pile of near-identical notebooks that all do roughly the same thing, but in slightly different ways.

We recently faced this exact scenario with a client in the pension fund administration space. We weren’t working with tidy, modern data feeds. We were dealing with two decades-old legacy systems spanning hundreds, if not thousands, of files. We had proprietary flat files with complex multi-value structures and 30-year-old COBOL files relying on packed decimal encoding.
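
To give a sense of what packed decimal encoding involves, here is a minimal Python sketch of a decoder. It assumes the common COMP-3 layout (two digits per byte, with the sign in the final half-byte); the function name and the scale parameter are illustrative, and a real file's field lengths and decimal places would come from its COBOL copybook.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed decimal (COMP-3 style) field into a Decimal.

    Assumed layout: every half-byte (nibble) is one decimal digit, except
    the last nibble, which carries the sign. 0xB and 0xD mean negative;
    0xA, 0xC, 0xE and 0xF are treated as positive or unsigned.
    `scale` is the number of implied decimal places.
    """
    if not raw:
        raise ValueError("empty packed decimal field")

    # Split each byte into its high and low nibble.
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)

    sign_nibble = nibbles.pop()  # the last nibble is the sign, not a digit
    if sign_nibble < 0x0A:
        raise ValueError(f"invalid sign nibble in {raw.hex()}")
    if any(digit > 9 for digit in nibbles):
        raise ValueError(f"invalid digit nibble in {raw.hex()}")

    sign = 1 if sign_nibble in (0x0B, 0x0D) else 0
    # Build the Decimal from (sign, digits, exponent) so no precision is lost.
    return Decimal((sign, tuple(nibbles), -scale))


# Hypothetical three-byte field holding the digits 12345 with two decimal places.
amount = unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2)
```

Returning a Decimal rather than a float keeps monetary amounts exact, which tends to matter for pension data.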

Traditionally, this would mean building a completely custom pipeline for every single file. That is a slow, expensive, and deeply frustrating way to scale an enterprise data platform.

So, we decided to fundamentally flip the problem: instead of writing code per source, what if you write config? Adding a new file becomes a metadata exercise, not an engineering one.

The Shift from Code to Configuration

To build the backbone of our client's Bronze and Silver ingestion layers, we implemented a custom, internal metadata-driven pipeline.

Built on top of Databricks’ Lakeflow Declarative Pipelines, this framework keeps the core concept beautifully simple: instead of writing separate code for every source, you define your source and target metadata once in a standard configuration file. From there, a single generic pipeline engine handles the rest.
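
The pattern can be sketched in plain Python. This is an illustration of the idea only, not our framework and not the Lakeflow Declarative Pipelines API: the SourceConfig fields, the parser registry, and the table and column names below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


@dataclass(frozen=True)
class SourceConfig:
    """Metadata for one source file (all field names are hypothetical)."""
    name: str
    file_format: str   # key into the PARSERS registry below
    target_table: str  # where the parsed rows should land
    options: dict = field(default_factory=dict)


def parse_delimited(lines: Iterable[str], options: dict) -> List[Row]:
    # Split each line on the configured delimiter and name the columns.
    delimiter = options.get("delimiter", ",")
    columns = options["columns"]
    return [dict(zip(columns, line.rstrip("\n").split(delimiter))) for line in lines]


def parse_fixed_width(lines: Iterable[str], options: dict) -> List[Row]:
    # options["fields"] maps a column name to its (start, end) character offsets.
    return [
        {name: line[start:end].strip() for name, (start, end) in options["fields"].items()}
        for line in lines
    ]


# One generic engine: the format named in the config picks the parser.
PARSERS: Dict[str, Callable[[Iterable[str], dict], List[Row]]] = {
    "delimited": parse_delimited,
    "fixed_width": parse_fixed_width,
}


def run_source(config: SourceConfig, lines: Iterable[str]):
    """Parse one source using only its metadata; return (target table, rows)."""
    parser = PARSERS[config.file_format]
    return config.target_table, parser(lines, config.options)


# Onboarding a new file means adding an entry here, not writing a new pipeline.
SOURCES = [
    SourceConfig(
        name="members",
        file_format="delimited",
        target_table="bronze.members",
        options={"delimiter": "|", "columns": ["member_id", "surname"]},
    ),
    SourceConfig(
        name="contributions",
        file_format="fixed_width",
        target_table="bronze.contributions",
        options={"fields": {"scheme_id": (0, 4), "amount": (4, 12)}},
    ),
]
```

In this sketch, new parsing code is only needed when a genuinely new file format appears. Every further file in an existing format is one more entry in the list.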

For a project dealing with thousands of legacy files, this distinction matters a lot. A metadata-driven approach means onboarding a new file is largely a configuration task, not a development one.

For data leaders, the commercial impact of this shift is direct: it determines how quickly the business can get new data into the hands of analysts, and how much it costs to keep the whole thing running as the data landscape grows.

The Reality Check: Navigating the Open-Source Edge

Is this ready for the enterprise? In short: yes, it is genuinely ready, but you need to go in with your eyes open.

Custom frameworks built on top of experimental or open-source data layers often don’t come with the formal SLAs of a fully commercialised, out-of-the-box product. If an internal team tries to deploy this alone and something breaks, they may find themselves relying on community forums or GitHub repositories for fixes.

This is exactly why enterprise adoption requires an experienced engineering partner. You need a team that can not only implement the framework but provide the ongoing managed support to keep it robust. We are using this approach in a real production-bound environment with complex legacy sources, and with the right architectural guardrails, the framework holds up brilliantly. If you are already on Databricks, it is worth adopting now.

But the biggest hurdle to getting this working in a messy corporate environment isn’t the software. It’s source system knowledge.

A metadata-driven pipeline handles the mechanics brilliantly, but it cannot tell you what your data actually means. On this project, we're dealing with 30-year-old legacy systems where the routing logic for some files is buried in code that only one person fully understands. If you don't have that domain knowledge captured somewhere, the framework is waiting on you, not the other way around. The tech is the easy part.

The Next Strategic Frontier: Autonomous Onboarding

So, what does the next evolution of this architecture look like?

Our legacy source systems come with a native configuration and metadata file. This file describes the structure of every file at runtime: field names, multi-value groupings, the lot.

The next step in this transformation isn't about heavy engineering; it is about pushing automation to its absolute limit. The goal is to build a fully metadata-driven ingestion layer that makes onboarding completely self-service. The framework reads the configuration file, derives the metadata it needs automatically, and you are done. No manual configuration step, no manual intervention.
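
As a rough sketch of that derivation step, the Python below turns a source system's dictionary file into pipeline metadata. The pipe-delimited "name|position|group" layout is invented purely for illustration, since real legacy dictionary formats vary by system, and the function name and output keys are equally hypothetical.

```python
from collections import defaultdict


def derive_metadata(source_name: str, dictionary_lines) -> dict:
    """Derive ingestion metadata from a source-system dictionary file.

    Assumed (hypothetical) line format: FIELD_NAME|position|group, where
    `group` is empty for single-valued fields and names the multi-value
    grouping otherwise. Blank lines and lines starting with '#' are skipped.
    """
    columns = []
    groups = defaultdict(list)
    for line in dictionary_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, position, group = (part.strip() for part in line.split("|"))
        columns.append((int(position), name))
        if group:
            groups[group].append(name)

    columns.sort()  # order the columns by their declared position
    return {
        "name": source_name,
        "target_table": f"bronze.{source_name}",
        "columns": [name for _, name in columns],
        "multi_value_groups": dict(groups),
    }


# Hypothetical dictionary content for a member file.
example_dictionary = [
    "# name|position|group",
    "MEMBER_ID|1|",
    "FUND_CODE|3|HOLDINGS",
    "SURNAME|2|",
    "UNITS|4|HOLDINGS",
]
metadata = derive_metadata("members", example_dictionary)
```

The output takes the place of a hand-written config entry, which is what removes the manual step from onboarding.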

The underlying framework is already there. The final step is simply partnering with the right experts to fully leverage what your source systems are already telling you.

If your team is spending more time maintaining one-off pipelines than actually delivering value from the data, it is time to change your approach.
