Introduction to Data Version Control | by David Farrugia | Aug, 2023


PYTHON | DATA | PROGRAMMING

A step-by-step guide to implementing your own DVC in Python using Hangar

David Farrugia
Towards Data Science

Any production-level system requires some kind of versioning.

A single source of current truth.

Any resources that are continuously updated, especially simultaneously by multiple users, require some kind of an audit trail to keep track of all changes.

In software engineering, the solution to this is Git.

If you have ever written code, then you are probably familiar with the beauty that is Git.

Git allows us to commit changes, create branches from a source, and merge those branches back into the original, to name a few.

DVC (data version control) applies the same paradigm to datasets. See, live data systems continuously ingest new data points while different users run different experiments on the same datasets.

This leads to multiple versions of the same dataset, which is definitely not a single source of truth.

Additionally, in a machine learning environment, we would also have several versions of the same ‘model’ trained on different versions of the same dataset (for instance, model re-training to include newer data points).

If not properly audited and versioned, this would create a tangled web of datasets and experiments. We definitely do not want that!

DVC is, therefore, a system for tracking our datasets by registering every change made to a particular dataset. There are multiple DVC solutions, both free and paid.
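To make the idea of "registering changes" concrete, here is a minimal toy sketch in plain Python. It is not Hangar's API and not how any real DVC tool stores data; the `ToyDatasetRepo` class and all of its method names are hypothetical, made up purely for illustration. It only assumes that a dataset is a small list of rows that can be serialised to JSON.

```python
import copy
import hashlib
import json


class ToyDatasetRepo:
    """Hypothetical in-memory dataset history: hashed commits plus named branches."""

    def __init__(self):
        self.commits = {}                 # commit id -> commit record
        self.branches = {"master": None}  # branch name -> latest commit id

    def commit(self, branch, rows, message):
        """Register a new version of the dataset on the given branch."""
        parent = self.branches[branch]
        # The id is derived from the parent id and the data itself,
        # so the same data on the same parent always gets the same id.
        payload = json.dumps({"parent": parent, "rows": rows}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
        self.commits[commit_id] = {
            "parent": parent,
            "rows": copy.deepcopy(rows),
            "message": message,
        }
        self.branches[branch] = commit_id
        return commit_id

    def create_branch(self, name, source="master"):
        """Start a new branch pointing at the source branch's latest commit."""
        self.branches[name] = self.branches[source]

    def checkout(self, ref):
        """Return a copy of the dataset at a branch name or a commit id."""
        commit_id = self.branches.get(ref, ref)
        return copy.deepcopy(self.commits[commit_id]["rows"])

    def log(self, branch):
        """Walk the parent links from the branch head back to the first commit."""
        history = []
        commit_id = self.branches[branch]
        while commit_id is not None:
            history.append((commit_id, self.commits[commit_id]["message"]))
            commit_id = self.commits[commit_id]["parent"]
        return history


repo = ToyDatasetRepo()
v1 = repo.commit("master", [[1.0, 2.0], [3.0, 4.0]], "initial ingest")

# An experiment gets its own branch, so master stays untouched.
repo.create_branch("experiment")
v2 = repo.commit(
    "experiment", [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], "add new data point"
)
```

With this in place, a model can be tagged with the commit id (`v1` or `v2`) of the exact dataset version it was trained on, which is the audit trail we are after. A real tool would also handle merging, storage on disk, and large arrays, all of which this sketch deliberately leaves out.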

I recently discovered Hangar, a fully open-source Python DVC package. Let’s have a look at what it can do, shall we?

The hangar package is a pure Python implementation and is available through pip.

Its core functionality is also closely modeled on Git, which greatly eases the learning curve.




This post originally appeared on TechToday.