
Data Management Best Practices: Documentation

You may have heard people tell you to create metadata to go along with your data. You may have blocked this out. The reason for this recommendation is so that your data will be understandable and usable in the future, either for you and your lab members or for a wider audience should you share your data outside the lab. There are many ways to document your data beyond using metadata, though, and more information on all of them is here. If you have questions, please ask!

Keep a file with information about your project in the same folder as your other files. A rule of thumb is to write as much information as necessary to understand your data.

Suggested Information to Note (care of the DMPTool)

Project Level

  • Title: Name of the dataset or research project that produced it
  • Creator: Names and addresses of the organizations or people who created the data; preferred format for personal names is surname first (e.g., Smith, Jane).
  • Identifier: Unique number used to identify the data, even if it is just an internal project reference number
  • Date: Key dates associated with the data, including: project start and end date; release date; time period covered by the data; and other dates associated with the data lifespan, such as maintenance cycle, update schedule; preferred format is yyyy-mm-dd, or yyyy.mm.dd-yyyy.mm.dd for a range
  • Method: How the data were generated, listing equipment and software used (including model and version numbers), formulae, algorithms, experimental protocols, and other things one might include in a lab notebook
  • Processing: How the data have been altered or processed (e.g., normalized)
  • Source: Citations to data derived from other sources, including details of where the source data is held and how it was accessed
  • Funder: Organizations or agencies who funded the research
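
The project-level fields above can be kept in a plain-text file next to the data. Here is a minimal Python sketch of one way to do that; the function names and the field order are assumptions made for illustration, not part of the DMPTool guidance. It formats dates and personal names in the preferred styles noted above.

```python
from datetime import date

# Field order follows the project-level list above.
PROJECT_FIELDS = ["Title", "Creator", "Identifier", "Date",
                  "Method", "Processing", "Source", "Funder"]


def format_date(day: date) -> str:
    """Format a single date as yyyy-mm-dd."""
    return day.isoformat()


def format_date_range(start: date, end: date) -> str:
    """Format a date range as yyyy.mm.dd-yyyy.mm.dd."""
    return f"{start:%Y.%m.%d}-{end:%Y.%m.%d}"


def format_creator(given_name: str, surname: str) -> str:
    """Put the surname first, e.g. 'Smith, Jane'."""
    return f"{surname}, {given_name}"


def project_readme(fields: dict) -> str:
    """Return 'Field: value' lines for whichever project-level fields are supplied."""
    lines = [f"{name}: {fields[name]}" for name in PROJECT_FIELDS if name in fields]
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file such as README.txt in the project folder keeps the documentation alongside the data it describes.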

File Level

  • Subject: Keywords or phrases describing the subject or content of the data
  • Place: All applicable physical locations
  • Language: All languages used in the dataset
  • Variable list: All variables in the data files, where applicable
  • Code list: Explanation of codes or abbreviations used in either the file names or the variables in the data files (e.g. '999 indicates a missing value in the data')
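
A code list is most useful when your analysis scripts can apply it directly. The sketch below assumes a CSV file and a hypothetical code list that is keyed by variable name, so that 999 is treated as missing only in the columns where it really is a missing-value code. The column names are made up for illustration.

```python
import csv
import io

# Hypothetical code list: for each variable, the codes that mean "missing".
CODE_LIST = {
    "temp_c": {"999": "missing value"},
}


def read_with_code_list(csv_text: str, code_list: dict) -> list:
    """Read CSV text and replace documented missing-value codes with None."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cleaned = {}
        for name, value in row.items():
            missing_codes = code_list.get(name, {})
            cleaned[name] = None if value in missing_codes else value
        rows.append(cleaned)
    return rows
```

Keying the code list by variable avoids recoding a legitimate value, such as a site numbered 999, in a column where that code has no special meaning.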

Technical Description

  • File inventory: All files associated with the project, including extensions (e.g. 'NWPalaceTR.WRL', 'stone.mov')
  • File Formats: Formats of the data, e.g., FITS, SPSS, HTML, JPEG, etc.
  • File structure: Organization of the data file(s) and layout of the variables, where applicable
  • Version: Unique date/time stamp and identifier for each version
  • Checksum: A digest value computed for each file that can be used to detect changes; if a recomputed digest differs from the stored digest, the file must have changed
  • Necessary software: Names of any special-purpose software packages required to create, view, analyze, or otherwise use the data
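
The file inventory and checksum items above can be produced together. This is a minimal sketch using Python's standard hashlib module. SHA-256 is an assumed choice of digest (the guidance above does not name one), and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 digest of one file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_inventory(folder: Path) -> dict:
    """Map each file's path (relative to folder) to its digest."""
    return {
        path.relative_to(folder).as_posix(): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(stored: dict, current: dict) -> list:
    """List files whose digest differs, or that appear in only one inventory."""
    names = stored.keys() | current.keys()
    return sorted(n for n in names if stored.get(n) != current.get(n))
```

Store the first inventory with your documentation, then recompute it later and compare the two with changed_files to see which files were altered, added, or removed.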

Access

  • Rights: Any known intellectual property rights, statutory rights, licenses, or restrictions on use of the data
  • Access information: Where and how your data can be accessed by other researchers

There are long lists of metadata schemas for many disciplines, and each of those schemas has lots and lots of documentation that someone expects you're going to read. My guess is you don't want to read that documentation. Please ask for help if you need to write some serious metadata and are overwhelmed.

A Few Tools for Creating Metadata
Discipline | Standard | Tools
General Research Data | Dublin Core | Dublin Core Generator
Social Sciences | Data Documentation Initiative (DDI) | see all DDI tools; Nesstar Publisher: standalone tool to create DDI metadata (Windows only); Colectica: Excel add-on that creates DDI metadata (Windows only)
Ecology, Geosciences, & Biology | Ecological Metadata Language (EML); note that Darwin Core is also common, but no tool seems to exist to create metadata in this standard | see all EML tools; Morpho: standalone tool (all platforms) that creates metadata, edits data, and uploads both to the Knowledge Network for Biocomplexity
Geographic | ISO 19115-2014 | Federal Geographic Data Committee (FGDC) tools for metadata creation

This table was inspired by Oregon State University Libraries' guide on Metadata/Documentation.
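
If you'd rather script a simple Dublin Core record than use a generator tool, the sketch below shows one way with Python's standard library. The 15 element names and the namespace come from the Dublin Core Metadata Element Set. The wrapping "metadata" element and the function name are assumptions for illustration, and a repository may expect a different wrapper.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields: list) -> str:
    """Build an XML string from (element, value) pairs; elements may repeat."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields:
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")
```

Taking a list of pairs rather than a dictionary lets you repeat an element, for example one creator entry per person.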

ReadMe files should be used to describe your project and your data. When depositing data into repositories, you'll likely include a ReadMe file that just explains the files you've deposited. When you're keeping ReadMe files for your own records, it's good to have a top-folder ReadMe that explains all the subfolders and files that are part of the project, plus separate ReadMe files for lower-level folders and files.

These two resources give great overviews of ReadMe files and guidance on how to create them:

Here's some guidance from two popular repositories that recommend and use ReadMe files:

Codebooks are documents that explain the variables in your dataset. ICPSR suggests that these documents should note:

  • Variable name: The name or number assigned to each variable in the data collection. Some researchers prefer to use mnemonic abbreviations (e.g., EMPLOY1), while others use alphanumeric patterns (e.g., VAR001). For survey data, try to name variables after the question numbers - e.g., Q1, Q2b, etc. [In above example, H40-SF12-2]
  • Variable label: A brief description to identify the variable for the user. Where possible, use the exact question or research wording. ["SF12 - ASSESSMENT OF R'S GENERAL HEALTH"]
  • Question text: Where applicable, the exact wording from survey questions. ["In general, would you say your health is . . ."]
  • Values: The actual coded values in the data for this variable. [1, 2, 3, 4, 5]
  • Value labels: The textual descriptions of the codes. [Excellent, Very Good, Good, Fair, Poor]
  • Summary statistics: Where appropriate and depending on the type of variable, provide unweighted summary statistics for quick reference. For categorical variables, for instance, frequency counts showing the number of times a value occurs and the percentage of cases that value represents for the variable are appropriate. For continuous variables, minimum, maximum, and median values are relevant.
  • Missing data: Where applicable, the values and labels of missing data. Missing data can bias an analysis and is important to convey in study documentation. Remember to describe all missing codes, including "system missing" and blank. [e.g., Refusal (-1)]
  • Universe skip patterns: Where applicable, information about the population to which the variable refers, as well as the preceding and following variables. [e.g., Default Next Question: H00035.00]
  • Notes: Additional notes, remarks, or comments that contextualize the information conveyed in the variable or relay special instructions. For measures or questions from copyrighted instruments, the notes field is the appropriate location to cite the source.
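
The values, value labels, and summary statistics for a categorical variable can be generated from the data itself. The sketch below is one possible approach, with illustrative function names. It counts how often each documented code occurs and flags any values in the data that the codebook does not explain.

```python
from collections import Counter


def frequency_table(labels: dict, responses: list) -> list:
    """Return (code, label, count, percent) for every documented code."""
    counts = Counter(responses)
    total = len(responses)
    table = []
    for code, label in labels.items():
        count = counts.get(code, 0)
        percent = round(100 * count / total, 1) if total else 0.0
        table.append((code, label, count, percent))
    return table


def undocumented_values(labels: dict, responses: list) -> list:
    """List values that occur in the data but have no label in the codebook."""
    return sorted(set(responses) - set(labels))


def codebook_entry(name: str, label: str, labels: dict, responses: list) -> str:
    """Format one variable's codebook entry as plain text."""
    lines = [f"Variable name: {name}", f"Variable label: {label}", "Values:"]
    for code, text, count, percent in frequency_table(labels, responses):
        lines.append(f"  {code} = {text}: {count} ({percent}%)")
    return "\n".join(lines)
```

Include the missing-data codes, such as -1 for Refusal, in the labels so they appear in the counts along with the substantive values.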

See Also:


Data Dictionaries are very similar to (and arguably the same as) codebooks. DataQ has a great entry on data dictionaries written by Yasmeen Shorish:

This video from Kristin Briney is one of the most descriptive, yet concise, resources explaining data dictionaries: https://www.youtube.com/watch?v=Fe3i9qyqPjo . The video details what should go into the dictionary (variable or field names, units, relationships to other variables, data types, what people need to make sense of a researcher's work) and explains the reasons why a researcher might want one. There are also examples given of what a data dictionary looks like. There is also a blog post on the topic from the same author, in case you prefer text to video: http://dataabinitio.com/?p=454 
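
A data dictionary can be as simple as a small CSV file kept beside the data. The sketch below uses column names that are assumptions loosely based on the items mentioned above (variable names, units, data types, relationships to other variables). It also checks whether any columns in a data file are missing from the dictionary.

```python
import csv
import io

# Assumed column set for the dictionary; adjust to suit your project.
FIELDS = ["variable", "description", "units", "data_type", "related_to"]


def write_data_dictionary(entries: list) -> str:
    """Return the data dictionary as CSV text, one row per variable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(data_header: list, entries: list) -> list:
    """List data columns that have no entry in the dictionary."""
    documented = {entry["variable"] for entry in entries}
    return [column for column in data_header if column not in documented]
```

Running undocumented_columns whenever the data file changes is a cheap way to keep the dictionary from drifting out of date.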

For those looking at data dictionaries from a relational database perspective, this video tutorial provides stepwise instruction: https://www.youtube.com/embed/QRMUReSENjU

A robust and technical definition of a data dictionary from a LIS encyclopedia may be useful for some researchers and librarians: "Data Dictionary (Metadata Dictionary): A subsystem of a database that records the definitions (semantics) for all the metadata elements used in a database. A data dictionary may also include detailed documentation about the relationships among metadata elements, as well as syntax and schema application rules. The term data dictionary comes from the relational database community and may be viewed as a type of metadata specification." Drake, M. A. (2003). Metadata in the World Wide Web, in Encyclopedia of library and information science. 2nd ed. New York: Marcel Dekker.

See this entry at DataQ 

 
