A Guide to Real-World Data Collection for Machine Learning | by Leah Berg and Ray McLendon | Sep, 2023


5 Actionable Strategies to Optimize Your Data Collection Process

Leah Berg and Ray McLendon
Towards Data Science

Whether you’re brand new to data science or the Chief Data Scientist at a large organization, you’ve probably played with perfectly crafted data sets to solve toy machine learning problems. Maybe you’ve used K-Means clustering to group flowers by species in the Iris data set. Or maybe you’ve tried out a logistic regression model to predict which passengers survived the Titanic voyage.

While these data sets are great for practicing the basics of machine learning, they don’t mirror the real-world data you’ll come across on the job. In reality, your data can have quality issues, might not be perfect for the task at hand, or may not exist yet. This means Data Scientists often need to roll up their sleeves and gather data themselves, a challenge that today’s data science curricula rarely cover.

For new Data Scientists, collecting extensive amounts of data before diving into the problem at hand can feel extremely daunting since this stage lays the foundation for the entire machine learning project. However, with the right strategies, this process can become much more manageable.

Throughout my 10+ years as a Data Scientist, I’ve encountered a wide variety of data collection strategies, and in this article, I’ll share five of my favorite tips to optimize your data collection process and set you on the path to creating a successful machine learning product.

A powerful starting point lies in offering tangible value right from the beginning. Let’s borrow an example from a major player in the automotive industry, Tesla. Their quest for a fully autonomous vehicle is a substantial goal that’s taken years to develop and has required a massive amount of data collection.

So, what did they do while amassing all of this data?





This post originally appeared on TechToday.