Guides: Love Data Week: Clean Data

Love Data Week Day Two: Clean Data

Messy Data? No Worries

Once you've found your dataset, you'll need to make some decisions about data quality. Data cleaning involves modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. It's one step in the longer process of preparing your data, but it often takes up most of your time. Cleaning your data can help you understand the content you have and the questions you can ask going forward. (If you’re interested in reading more about the practice of cleaning data as a methodology, check out Against Cleaning by Katie Rawson and Trevor Muñoz.)

You Clean Up Nicely!

Learn how to clean a newfound dataset using a new tool or language:

Excel: An Introduction to Data Cleaning with Excel (NUS Libraries)
OpenRefine: Data Wrangling with OpenRefine (Penn Libraries)
Python: Python for Data Science – Part 1: Cleaning Data (Cornell University Center for Advanced Computing)
R: R Basics: Prepare Data for Modeling and Analysis (Penn Libraries)

Shiny, Happy Data 101

Implement tidy data. Think about the structure of your dataset. Does each column represent one attribute of your data? Does each row represent a particular observation and all its attributes?
Remove duplicate or irrelevant observations. Sometimes we collect more data than we need to analyze. Are there duplicate rows or columns in your dataset? Are there obvious outliers, or data that aren't helpful for your analysis?
Fix structural errors. Spelling errors or multiple spellings are common in text fields. Some words or abbreviations may be spelled in multiple ways. Do you want uniformity, or is that difference important to understanding your data?
Review missing data. How do you want to handle empty or null values? Do these values make sense in the context of your dataset?
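
Want to see what those steps might look like in practice? Here is a minimal Python sketch that uses only the standard library. Everything in it is an assumption made for illustration: the sample records, the field names (id, state, visits), the spelling map, and the list of strings treated as "missing" are all hypothetical, and your own dataset will call for different choices.

```python
# Hypothetical sample records; field names and values are invented for illustration.
raw_rows = [
    {"id": "1", "state": "PA", "visits": "12"},
    {"id": "2", "state": "Penn.", "visits": ""},
    {"id": "2", "state": "Penn.", "visits": ""},  # exact duplicate of the row above
    {"id": "3", "state": " pennsylvania ", "visits": "7"},
    {"id": "4", "state": "NJ", "visits": "N/A"},
]

# Assumed mapping of spelling variants to one canonical form.
STATE_SPELLINGS = {"pa": "PA", "penn.": "PA", "pennsylvania": "PA", "nj": "NJ"}

# Assumed set of strings that mean "no value" in this dataset.
MISSING_MARKERS = {"", "n/a", "na", "null"}


def remove_duplicates(rows):
    """Drop rows that exactly repeat an earlier row, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def standardize_state(row):
    """Return a copy of the row with the state mapped to its canonical spelling."""
    cleaned = dict(row)
    key = row["state"].strip().lower()
    # Unknown spellings are kept (trimmed) rather than guessed at.
    cleaned["state"] = STATE_SPELLINGS.get(key, row["state"].strip())
    return cleaned


def parse_visits(row):
    """Return a copy of the row with visits as an int, or None when missing."""
    cleaned = dict(row)
    value = row["visits"].strip()
    cleaned["visits"] = None if value.lower() in MISSING_MARKERS else int(value)
    return cleaned


def clean(rows):
    """Remove duplicates, then fix spellings and missing values row by row."""
    return [parse_visits(standardize_state(r)) for r in remove_duplicates(rows)]


cleaned_rows = clean(raw_rows)

# Count the missing values so you can decide how to handle them.
missing_visits = sum(1 for r in cleaned_rows if r["visits"] is None)
```

Notice that the sketch only flags missing values as None instead of filling them in or dropping the rows. Whether to keep, remove, or estimate those values depends on what they mean in the context of your dataset.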
