How we think about Data Pipelines is changing | by Hugo Lu | Nov, 2023

Photo by Ali Kazal on Unsplash

The goal is to reliably and efficiently release data into production


Data Pipelines are series of tasks organised in a directed acyclic graph, or "DAG". Historically, these have run on orchestration packages like Airflow or Prefect, and require infrastructure managed by data engineers or platform teams. These data pipelines typically run on a schedule, and allow data engineers to update data in locations such as data warehouses or data lakes.
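To make this concrete, here is a minimal sketch of such a scheduled pipeline expressed as a DAG, using Airflow's TaskFlow API (assuming a recent Airflow 2.x); the task names and bodies are placeholders for the example rather than any real pipeline:

    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
    def daily_orders_pipeline():

        @task
        def extract() -> list[dict]:
            # Pull raw records from a source system (stubbed out here).
            return [{"order_id": 1, "amount": 9.5}]

        @task
        def load(rows: list[dict]) -> None:
            # Write the records to the warehouse or data lake (stubbed out here).
            print(f"Loaded {len(rows)} rows")

        load(extract())  # declaring load downstream of extract forms the DAG

    daily_orders_pipeline()

On each scheduled run the orchestrator executes extract and then load, and the infrastructure that runs these tasks is what data engineers or platform teams have historically had to manage.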

This is now changing. There is a great shift in mentality happening. As the data engineering industry matures, mindsets are shifting from a "move data to serve the business at all costs" mindset to a "reliably and efficiently release data into production" / "software engineering" mindset.

Continuous Data Integration and Delivery

I’ve written before about how Data Teams ship data whereas software teams ship code.

This process is called "Continuous Data Integration and Delivery": the process of reliably and efficiently releasing data into production. There are subtle differences from the definition of "CI/CD" as used in Software Engineering, illustrated below.

Image by the author

In software engineering, Continuous Delivery is non-trivial because code needs a near-exact replica of the production environment, a staging environment, to operate in.

Within Data Engineering, this is not necessary because the good we ship is data. If there is a table of data, and we know that as long as a few conditions are satisfied the data is of sufficient quality to be used, then that is enough for it to be "released" into production, so to speak.
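As an illustration of what those conditions might look like, here is a small sketch in Python using pandas; the specific checks (non-empty table, no null or duplicate keys) and the column names are assumptions made for the example, not a definitive list:

    import pandas as pd

    def checks_pass(df: pd.DataFrame) -> bool:
        # Illustrative release conditions: the table is non-empty,
        # and the primary key column has no nulls and no duplicates.
        return (
            len(df) > 0
            and df["order_id"].notna().all()
            and not df["order_id"].duplicated().any()
        )

    staging = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.5, 3.0, 12.25]})
    if checks_pass(staging):
        print("Conditions satisfied; the data can be released into production")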

The process of releasing data into production, the analog of Continuous Delivery, is very simple: it amounts to copying or cloning a dataset.
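Here is a sketch of what that release step could look like, assuming a warehouse that supports cloning (Snowflake's zero-copy clone is used purely as an example; the table names and connection details are hypothetical):

    import snowflake.connector

    # Hypothetical connection details; substitute your own account and credentials.
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="...",
        warehouse="transforming",
        database="analytics",
    )

    # "Releasing" the data: promote the validated staging table into the
    # production schema with a zero-copy clone, so nothing is physically moved.
    conn.cursor().execute("CREATE OR REPLACE TABLE prod.orders CLONE staging.orders")
    conn.close()

On engines without cloning, the same step is an equally simple copy, for example a CREATE TABLE AS SELECT from the staging table into the production schema.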

Furthermore, a key pillar of data engineering is reacting to new data as it arrives, or checking to see whether new data exists. There is no analog for this in software engineering: software does not need to…
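A minimal sketch of that "checking to see whether new data exists" step, assuming new data lands as files in an S3 bucket (the bucket, prefix, and last-seen key below are made up for the example):

    import boto3

    s3 = boto3.client("s3")

    def new_files_exist(bucket: str, prefix: str, last_seen_key: str) -> bool:
        # List objects that sort after the last key we processed;
        # any result means new data has arrived since the previous run.
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, StartAfter=last_seen_key)
        return resp.get("KeyCount", 0) > 0

    # A scheduler or orchestrator sensor would call this on an interval and
    # only trigger the downstream pipeline when it returns True.
    if new_files_exist("my-data-lake", "raw/orders/", "raw/orders/2023-11-01.csv"):
        print("New data detected; trigger the pipeline")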
