Data dictionaries and codebooks are essential documentation of the variables, structure, content, and layout of your datasets. A good dictionary/codebook has enough information about each variable for it to be self explanatory and interpreted properly by someone outside of your research group. The terms are often used interchangeably, but codebooks tend to for survey data and allow the reader to follow the structured format of the survey and possible response value.
Data dictionaries and codebooks should include:
- Variable name: The name or number assigned to each variable in the data collection. These are often complex and mean nothing except to the researcher at the time of data collection. Some researchers prefer to use mnemonic abbreviations (e.g., EMPLOY1), while others use alphanumeric patterns (e.g., VAR001). For survey data, try to name variables after the question numbers - e.g., Q1, Q2b, etc.
- Examples: H40-SF12-2, FLJ36031Y, DOB
- Variable label: A brief description to identify the variable for the user. Where possible, use the exact question or research wording.
- Example: SF12 - ASSESSMENT OF R'S GENERAL HEALTH
- Variable meaning: outline the exact definition of the variable and if possible align it with existing vocabularies to increase interoperability amongst research data.
- Question text: Where applicable, the exact wording from survey questions.
- Example: In general, would you say your health is . . .
- Level of Measurement: the method the value was measured with
- Examples: Nominal, ordinary, scale, ratio, interval, none (such as for qualitative variables)
- Values: The actual coded values in the data for this variable. Non-coded values, such as temperature readings, can just be input as they are and the researcher can skip the value labels (shown in next bullet point).
- Example: Likert scale - 1, 2, 3, 4, 5, temperature reading - 100.4
- Value labels: The textual descriptions of the codes.
- Example: Excellent, Very Good, Good, Fair, Poor
- Summary statistics: Where appropriate and depending on the type of variable, provide unweighted summary statistics for quick reference. For categorical variables, for instance, frequency counts showing the number of times a value occurs and the percentage of cases that value represents for the variable are appropriate. For continuous variables, minimum, maximum, and median values are relevant.
- Missing data: Where applicable, the values and labels of missing data. Missing data can bias an analysis and is important to convey in study documentation. Remember to describe all missing codes and differentiate between the types of missing data, including system missing, data instrument error, and participant skip question.
- Example: Refusal (-1), Missing due to instrument calibration issue (-9)
- Universe skip patterns: Where applicable, information about the population to which the variable refers, as well as the preceding and following variables.
- Example: Default Next Question: H00035.00
- Dates and Times: the dates and times of the data collection. Be sure to indicate time zone. Standardize this by using the ISO 8601 international standard for time and date communication.
- Example: 2007-04-05T14:30-04:00
- Notes: Additional notes, remarks, or comments that contextualize the information conveyed in the variable or relay special instructions. For measures or questions from copyrighted instruments, the notes field is the appropriate location to cite the source.
Developed out of ICPSR, What is a Codebook? and SAMHDA, What is a Codebook?.