Data Cleaning: Another Fancy Way to Call Ourselves Data Janitors
A major part of my first data project was making sure the messy data I received was properly cleaned. I had 5 days for the project, and I spent 4 of them cleaning. They weren’t kidding when they said data scientists spend 60-80% of their time cleaning and organizing data. That’s why many blog posts (such as this one) call us data janitors.
Why Data Cleaning is SOOO Time-Consuming
Raw data are messy. A big problem when it comes to cleaning, or fixing, data up for the exploratory phase is that data gathered over the years is usually collected in different ways. Anything from something as simple as semicolons versus tabs as column separators, to zip codes being stored as integers (they are not integers; the leading zero in “02134” matters), will require time to clean, convert, and sort for a project.
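For instance, here’s a minimal sketch of handling both problems in pandas (the inline data, file contents, and column names are all made up for illustration):

```python
import io

import pandas as pd

# Two small inline samples standing in for real files: one uses
# semicolons as separators, the other uses tabs.
semicolon_file = io.StringIO("name;zip_code\nAda;02134\nGrace;10001")
tab_file = io.StringIO("name\tzip_code\nAlan\t94103\nEdsger\t07302")

# dtype={"zip_code": str} keeps the leading zero: "02134" stays
# "02134" instead of silently becoming the integer 2134.
df_a = pd.read_csv(semicolon_file, sep=";", dtype={"zip_code": str})
df_b = pd.read_csv(tab_file, sep="\t", dtype={"zip_code": str})

print(df_a["zip_code"].tolist())  # ['02134', '10001']
```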
In addition, it’s very hard to anticipate all the ways things can go wrong while cleaning. Although I spent 4 days cleaning my data, I still found problems with my newly cleaned data during the exploratory phase, once I started plotting charts. So there I was, cleaning the data again on Day 5.
As painful as it was, however, I completely understand why data cleaning is SOOO important: when you understand the material better, you get better insights.
I believe that spending time working with data, transforming and exploring it, is exactly what data scientists should be doing. This is the task we sign up for. We get paid to understand the numbers in the boxes and what they mean. Imagine a high school janitor and the stuff he or she finds in the classrooms and garbage cans after school. They probably have a better sense of what the students are like than the teachers or the principal do, right? They just don’t go around telling their bosses and the parents about the kids’ secrets.
Planning Ahead
If data cleaning isn’t done right, it has the potential to wreck everything and possibly force you to restart (geez, think about that life…). Worse, there’s a very real possibility that some issues go undetected and somehow end up in the final analysis phase, and then we give the wrong suggestions or predictions, which cause our bosses and shareholders to make the wrong decisions. I feel like we’ve seen this drama play out in every other financial Hollywood movie already (*cough The Big Short cough*).
Hence, it’s best to plan out your data cleaning before you start the cleaning process. My suggestions on how best to do that are the following (maybe take them with a grain of salt, because I’m only on my second week as a data scientist newbie):
- Open up all of your data and take a look at the columns, rows, separators, etc.
- Set up your data science problem (aka the problem you’d like to tackle, solve, or reveal).
- Open up all of your data again, but this time import it into Python, and get rid of the nulls (see the first sketch after this list).
- Double check the nulls.
- Double check the nulls.
- Double check the nulls.
- Convert strings to integers (or whatever type each column should be).
- Make new dataframes and then merge them (see the second sketch after this list).
- Check for outliers, and then the nulls again.
- Get your 5th cup of black coffee.
- Exploratory stage.
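To make the null-handling and type-conversion steps concrete, here’s the first sketch, a rough pandas version of those steps (the dataframe and its columns are made up for illustration):

```python
import pandas as pd

# Made-up example data with the usual problems: nulls, and numbers
# stored as strings.
df = pd.DataFrame({
    "price": ["100", "250", None, "75"],
    "city": ["Boston", None, "Austin", "Denver"],
})

# See how many nulls you have, column by column.
print(df.isnull().sum())

# Decide what to do with them: drop the rows, or fill them with a
# sensible default. Here we drop rows that are missing a city.
df = df.dropna(subset=["city"])

# Convert strings to numbers. errors="coerce" turns anything
# unparseable into NaN instead of raising an error.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Double check the nulls. (And again. And again.)
print(df.isnull().sum())
```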
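And here’s the second sketch, covering the merge and outlier steps. The frames and keys are hypothetical, and the outlier rule shown is just the standard 1.5×IQR box-plot rule:

```python
import pandas as pd

# Hypothetical frames to merge; your keys and columns will differ.
orders = pd.DataFrame({"customer_id": [1, 2, 3, 4],
                       "total": [120.0, 85.0, 20000.0, 95.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 4],
                          "city": ["Boston", "Austin", "Denver"]})

# Merge on the shared key. how="left" keeps every order even when
# the customer record is missing, and those gaps show up as new
# nulls, so check the nulls again after merging!
merged = orders.merge(customers, on="customer_id", how="left")
print(merged.isnull().sum())

# A simple outlier check: flag values more than 1.5 IQRs outside
# the quartiles (the classic box-plot rule).
q1, q3 = merged["total"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = merged[(merged["total"] < q1 - 1.5 * iqr) |
                  (merged["total"] > q3 + 1.5 * iqr)]
print(outliers)  # the 20000.0 order stands out
```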
With a structured, well-planned process in place, you can now go into the exploratory stage with peace of mind. And when you’re at your 938485th job interview, you can show off how efficient you are at removing nulls and identifying outliers.