Guides: Creating Inclusive Surveys: Collecting Sensitive Data

General information

When you're collecting information about people, or any other type of sensitive information, it's important to consider how secure the data is at every stage of its life cycle - from collection to archiving. Penn has a number of resources available to help ensure your survey participants are protected as you do your research.

Software and Tools

Collection

Recommended:

Qualtrics: available to Penn researchers
REDCap: available to most researchers in the School of Medicine

Google Forms, SurveyMonkey, and Microsoft Forms are all user-friendly ways to set up quick surveys, but their privacy policies and terms of use allow the parent company to access the data you collect. While they're unlikely to take advantage of this, it's safest for your participants if you use a tool that doesn't allow it, such as Qualtrics or REDCap.

Storage

How you store your data depends on how sensitive it is. Consult with the IRB to determine whether the storage option you've chosen is appropriate, and to confirm what data can be stored in Penn+Box and Amazon Web Services.

Recommended Options:

Penn+Box: refer to this chart from ISC for information about acceptable data types
Amazon Web Services: refer to this chart from ISC for information about acceptable data types
Microsoft OneDrive
University network drives: contact your local IT department to find out how to access these drives remotely
Personal/work devices
Un-networked computers: for very highly sensitive data, you may need to work on a computer that is disconnected from the internet and other networks.

For more information, see the Data Storage Best Practices guide and ISC's Information Security Policies and Procedures page.

Google Drive and non-Penn-affiliated Dropbox accounts are not recommended for storing sensitive or human subjects data, as these are not secure storage options.

Collaboration

Many of the storage options above also support collaboration. Additionally, a relatively secure option for sharing data is LabArchives, when used through your Penn account. LabArchives is an online lab notebook, but it can be used as a storage and sharing space for data and information collected in all disciplines.

For sharing more sensitive files and information, Penn also offers a Secure Share service that allows you to send encrypted files to other Penn researchers.

Publishing

Some data may not be publishable due to its highly sensitive nature. However, there are options for even very sensitive data.

Datasets may be published in a repository designed for sensitive data. Because we are a member institution of ICPSR, Penn researchers can publish data in this social science repository. If the data should not be fully public, it can be limited to researchers at other member institutions or even kept as onsite only, where researchers would need to travel to Ann Arbor to view the data files in person.

If your data can't be published, you can still publish a metadata record (a record about your data) so that people know your dataset exists and roughly what information it contains. This record may also include information about how to access the dataset if it is available privately somewhere.

There may be other publishing options for your sensitive dataset. Contact us to discuss the specifics of your data.

Passwords & Encryption

Passwords

Password protecting your computer and/or files is a great way to control access. Of course, if your password is too simple, it doesn't work so well.

Choosing Strong Passwords from University of Edinburgh

Password Strength from xkcd (or from Explain xkcd)
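
As a rough illustration of the idea behind the xkcd comic, the sketch below builds a passphrase from randomly chosen words and estimates its strength in bits of entropy. The word list and the function names are made up for this example; a real word list should contain thousands of words.

```python
import math
import secrets

# Hypothetical tiny word list for illustration only. A real list should
# contain thousands of words.
WORDS = ["correct", "horse", "battery", "staple", "orbit", "maple", "river", "candle"]


def make_passphrase(words, count=4, sep="-"):
    """Join `count` words picked with a cryptographically secure random source."""
    return sep.join(secrets.choice(words) for _ in range(count))


def entropy_bits(list_size, count):
    """Bits of entropy when each of `count` words is drawn uniformly at random."""
    return count * math.log2(list_size)


passphrase = make_passphrase(WORDS)
print(passphrase)                  # different on every run
print(entropy_bits(len(WORDS), 4)) # 12.0 bits: an 8-word list is far too small
print(entropy_bits(2048, 4))       # 44.0 bits with a 2,048-word list
```

The strength comes from the size of the word list and the number of words, not from how unusual any single word looks, which is why the tiny list above is only suitable as a demonstration.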

Encryption

Encryption converts your data into an unreadable code that requires a password or key to be read. You can encrypt data while it is stored on your hard drive or other storage medium, or you can encrypt the data while transferring it from one location to another.

SAS's Information Technology for Research has a very good explanation of these encryption processes.

Anonymization & Re-identification Considerations

A lot of researchers believe they can't share their data if it contains personally identifying information (PII). Certainly PII and other sensitive information should not be shared, but de-identification can reduce the risk that human subjects are re-identified. If that's not possible, there are other ways to share your data while minimizing risks for your subjects.

De-identification & Re-identification Considerations

Direct Identifiers

Direct identifiers are things like names, addresses, phone numbers, PennKeys, pictures, or anything else that could, on its own, identify a research participant. Some variables that would not be direct identifiers in a large dataset can act as direct identifiers in a small dataset if the response is rare.

Indirect Identifiers

Indirect identifiers are values that could be combined with other values to identify a participant. These identifiers could also be combined with other datasets or information to re-identify a participant. It's understood that a person can be identified with minimal information. In most cases, age, gender, and ZIP code are enough to identify a participant.
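
One way to see this risk in your own data is to count how many records share each combination of indirect identifiers. The sketch below is only an illustration: the records are made up, the column names are assumptions, and the threshold of 3 is an arbitrary choice, not a Penn or IRB standard.

```python
from collections import Counter

# Made-up records for illustration; not real data.
records = [
    {"age": 34, "gender": "F", "zip": "19104"},
    {"age": 34, "gender": "F", "zip": "19104"},
    {"age": 34, "gender": "M", "zip": "19104"},
    {"age": 71, "gender": "F", "zip": "19103"},
    {"age": 34, "gender": "F", "zip": "19104"},
]

# Assumed column names for the indirect identifiers being checked.
INDIRECT_IDENTIFIERS = ("age", "gender", "zip")


def risky_combinations(rows, keys, k=3):
    """Return each combination of `keys` values that fewer than k rows share."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}


risky = risky_combinations(records, INDIRECT_IDENTIFIERS)
for combo, n in risky.items():
    print(combo, n)
# (34, 'M', '19104') 1
# (71, 'F', '19103') 1
```

Any combination that appears only once or twice points to a participant who could stand out, and is a candidate for aggregating or generalizing before the data is shared.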

Methods for De-identifying Quantitative Data

Remove direct identifiers whenever possible. Most often direct identifiers such as name or phone number are not variables you need to analyze your data. Include a non-identifying number to differentiate between records if needed.
Aggregate. Instead of listing values as exact numbers, use scales such as Age: 21-30, 31-40, etc.
Reduce precision
- Generalize text responses
- Use "and above" and "and below" to help remove outliers
- Generalize georeferenced data to broader areas such as state or region
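
Several of the methods above can be sketched in a few lines of code. This is a minimal example under stated assumptions, not a complete de-identification procedure: the records are made up, and the column names, age bands, and cut-offs are hypothetical choices you would adapt to your own data.

```python
# Made-up records for illustration; not real data.
raw = [
    {"name": "A. Example", "phone": "555-0100", "age": 23, "state": "PA"},
    {"name": "B. Example", "phone": "555-0101", "age": 38, "state": "NJ"},
    {"name": "C. Example", "phone": "555-0102", "age": 92, "state": "PA"},
]

DIRECT_IDENTIFIERS = {"name", "phone"}  # assumed column names


def age_band(age):
    """Aggregate an exact age into a ten-year band with open-ended extremes."""
    if age <= 20:
        return "20 and below"
    if age >= 81:
        return "81 and above"
    low = (age - 1) // 10 * 10 + 1
    return f"{low}-{low + 9}"


def deidentify(rows):
    """Drop direct identifiers, add a record number, and aggregate age."""
    cleaned = []
    for record_id, row in enumerate(rows, start=1):
        kept = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        kept["record_id"] = record_id  # non-identifying number
        kept["age"] = age_band(kept["age"])
        cleaned.append(kept)
    return cleaned


for row in deidentify(raw):
    print(row)
# {'age': '21-30', 'state': 'PA', 'record_id': 1}
# {'age': '31-40', 'state': 'NJ', 'record_id': 2}
# {'age': '81 and above', 'state': 'PA', 'record_id': 3}
```

If your records are stored in an order that could itself identify people, such as alphabetically by name, consider shuffling them before assigning record numbers.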

More information

The Qualitative Data Repository at Syracuse University has some excellent additional advice on de-identification available here.
