
Love Data Week

An annual celebration of all things data


Love Data Week Day Two: Clean Data

Messy Data? No Worries

Once you've found your dataset, you'll need to make some decisions about data quality. Data cleaning involves modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. It's one step in the longer process of preparing your data, but it often takes up most of your time. Cleaning your data can help you understand the content you have and the questions you can ask going forward. (If you're interested in reading more about the practice of cleaning data as a methodology, check out "Against Cleaning" by Katie Rawson and Trevor Muñoz.)


 

You Clean Up Nicely!

Learn how to clean a newfound dataset using a new tool or language:


 

Shiny, Happy Data 101


Implement tidy data. Think about the structure of your dataset. Does each column represent one attribute of your data? Does each row represent a particular observation and all its attributes?
Remove duplicate or irrelevant observations. Sometimes we collect more data than we need to analyze. Are there duplicate rows or columns in your dataset? Are there obvious outliers, or data that aren't helpful for your analysis?
Fix structural errors. Spelling errors and multiple spellings are common in text fields. Some words or abbreviations may be spelled in several ways: do you want uniformity, or is that difference important to understanding your data?
Review missing data. How do you want to handle empty or null values? Do these values make sense in the context of your dataset? 
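
The checklist above can be sketched in a few lines of plain Python. This is only an illustration, not a recommended workflow: the sample rows, the column names (city, state, visitors), and the STATE_SPELLINGS mapping are all made up for the example, and it assumes missing values appear as empty strings.

```python
from collections import Counter

# Hypothetical sample data: one dict per observation, one key per attribute.
rows = [
    {"city": "Philadelphia", "state": "PA", "visitors": "120"},
    {"city": "philadelphia ", "state": "Penn.", "visitors": "120"},
    {"city": "Pittsburgh", "state": "PA", "visitors": ""},
    {"city": "Camden", "state": "NJ", "visitors": "45"},
]

# Assumed mapping from variant spellings to a single canonical form.
STATE_SPELLINGS = {"penn.": "PA", "pa": "PA", "nj": "NJ"}


def fix_structural_errors(row):
    """Trim stray whitespace and standardize spellings in text fields."""
    cleaned = {key: value.strip() for key, value in row.items()}
    cleaned["city"] = cleaned["city"].title()
    cleaned["state"] = STATE_SPELLINGS.get(cleaned["state"].lower(), cleaned["state"])
    return cleaned


def remove_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def count_missing(rows):
    """Count empty values per column so you can decide how to handle them."""
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value == "":
                missing[key] += 1
    return dict(missing)


# Standardize first, so rows that differ only in spelling are caught as duplicates.
cleaned_rows = remove_duplicates([fix_structural_errors(r) for r in rows])
missing_report = count_missing(cleaned_rows)
print(len(cleaned_rows), missing_report)
```

Note that the order matters here: the two Philadelphia rows only register as duplicates once their spellings have been standardized. Whether to standardize at all is still your call, since a spelling difference may be meaningful in your data.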
