
Research Impact and Citation Analysis

General Limitations

No one database contains all publications.

From Dimensions to Lens, from Scopus to Web of Science, from Google Scholar to Semantic Scholar, there simply isn't one location from which we can harvest all scholarly and popular output. 

Databases prioritize publication types differently.

Historically, the "big" databases like Scopus and Web of Science focused on STEM fields, which tend to publish in journals. As such, monographs and humanities scholarship typically have poor coverage in those databases. However, any work that has a Digital Object Identifier (DOI) is indexed in Crossref, where it will potentially be picked up by Dimensions and OpenAlex.
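
If you have a work's DOI, you can check its Crossref record directly. Here is a minimal sketch using Python and the public Crossref REST API; the DOI and contact email are placeholders to replace with your own, and the printed fields are worth verifying against Crossref's API documentation.

```python
import requests

# Look up a work's Crossref record by DOI (the DOI below is a placeholder).
doi = "10.1000/example-doi"
resp = requests.get(
    f"https://api.crossref.org/works/{doi}",
    params={"mailto": "you@example.edu"},  # contact address for Crossref's polite pool
)
resp.raise_for_status()
record = resp.json()["message"]
print(record.get("title"))                   # list of title strings
print(record.get("is-referenced-by-count"))  # citations Crossref knows about
```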

Expect to find researchers with multiple entries.

Authors can publish under many variations of their name (for example, K Gilje vs. K E Gilje). It can take work to ensure that each of those entries represents the correct person and not a different person with the same name.

Curating an author identity takes maintenance -- but the process can be simplified

Any bibliometric tool relies on accurately identifying an author and their associated works. However, assigning credit to an author for their work is not a trivial process. Some bibliometric sources scrape author names from documents; others require researchers to claim credit via their platform. Researchers change their names and their institutions, potentially leading to disconnects in their online presence among the different bibliometric sources. No tool will be entirely accurate, but researchers can do their part to ensure that the information entering the tool is as complete as possible.

One straightforward way of curating your author identity is to create an ORCID (Open Researcher and Contributor ID) iD, a persistent digital identifier across all platforms that follows a researcher through their entire career. 
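
ORCID also offers a public API, so once a record is curated, the works attached to an iD can be retrieved programmatically. Below is a minimal sketch in Python; the iD shown is ORCID's documented example record, and the field names should be verified against ORCID's API documentation before relying on them.

```python
import requests

# Fetch the public works attached to an ORCID iD.
orcid_id = "0000-0002-1825-0097"  # ORCID's documented example iD
resp = requests.get(
    f"https://pub.orcid.org/v3.0/{orcid_id}/works",
    headers={"Accept": "application/json"},
)
resp.raise_for_status()
for group in resp.json().get("group", []):
    summary = group["work-summary"][0]         # first summary in each work group
    print(summary["title"]["title"]["value"])  # the work's title
```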

No one database indexes all researchers.

Just as no database contains all scholarly and popular publications, the same goes for networks of researchers. Some researchers primarily publish outside of traditional scholarly journals -- say, emerging scholars, or researchers experimenting with new media and forms of digital scholarship. For those looking for other Penn collaborators, keep an eye on local news sources, such as Penn Today, the Daily Pennsylvanian, and newsletters and websites from departments and research centers.

Impact analysis may not capture all of a work's impact

Impact analysis when applied to a single document is particularly tricky. Imagine how many different forms of impact a journal article might make:

  • Serving as the foundational text on a graduate-level syllabus for hundreds of students
  • Helping to form the context and argument within the literature reviews of dozens of other scholarly works
  • Inspiring an analysis piece in a national newspaper, introducing thousands of readers to a new realm of research
  • Becoming the topic of interest in a heated social media thread in an engaged community of scholars
  • Being included in a report from the Congressional Research Service that informs a Congressional committee on an important piece of legislation
  • Being cited within the development process for a new patent

These are just a few examples, and not all of them are quantifiable. Despite tools like the Open Syllabus Project, there is no comprehensive source of university syllabi. Even if we can see how many works have cited a piece, we don’t necessarily know if those citations substantially contribute to the piece’s argument. We encourage you to consider a single publication’s impact as a complex and ongoing story.

Different sources provide different h-index counts.

Different databases report different h-index values because each indexes a different set of publications and citations, as the sketch below illustrates.
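
For context, the h-index is the largest number h such that a researcher has h papers with at least h citations each. The sketch below uses hypothetical citation counts for the same ten papers, as two different databases might report them, to show how differing citation coverage shifts the resulting value.

```python
def h_index(citation_counts):
    """Return the largest h such that at least h papers have h or more citations."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for the same ten papers in two databases:
database_a = [45, 30, 22, 18, 9, 7, 5, 3, 1, 0]
database_b = [40, 25, 15, 10, 6, 4, 4, 2, 1, 0]
print(h_index(database_a))  # 6
print(h_index(database_b))  # 5
```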

H-index focuses exclusively on citation practices.

Another limitation of the h-index is that it only counts cited publications. It doesn’t represent scholarly productivity or the impact of other kinds of output. What if an author has published multiple newspaper editorials? Or has a popular Substack or blog presence? Or created an innovative digital exhibition? None of this productivity or impact would be captured by the h-index, even if those works were widely read and experienced but never formally cited.

English Language Biases

Most of these indexing services focus on English-language journals. If a researcher studies a region such as South America, it is possible their impact would appear more in Spanish- or Portuguese-language journals.
 

OpenAlex

OpenAlex is a fully open index of scholarly entities and their connections operated by the nonprofit organization OurResearch. It was originally based on the now-discontinued Microsoft Academic Graph (MAG) and is further enhanced with information from Crossref. See OpenAlex's site for details on additional sources.

Although OpenAlex was designed with API access in mind, the team at OurResearch created an online interface to help people navigate the database.  You can build a search on the website and then save the URL to pull data in an automated way by API.  You can also create an account to save searches and create alerts.

The only difference between a free account and a premium account is the number of API calls per day you can make.  The technical documentation is available on their website.
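
As a sketch of what automated access looks like, the following Python snippet queries the OpenAlex works endpoint; the search string and the mailto address (which identifies you for OpenAlex's polite pool) are examples to replace with your own.

```python
import requests

# Search the OpenAlex works endpoint and print titles with citation counts.
resp = requests.get(
    "https://api.openalex.org/works",
    params={
        "search": "citation analysis",  # example search string
        "per-page": 5,
        "mailto": "you@example.edu",    # identifies you for the polite pool
    },
)
resp.raise_for_status()
for work in resp.json()["results"]:
    print(work["display_name"], "-", work["cited_by_count"], "citations")
```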

Dimensions

Dimensions is a project of Digital Science, which also owns Altmetric, ReadCube, Figshare, Symplectic, and more. They have created a database that includes grants, publications, patents, datasets, clinical trials, and policy documents. For more information about the scope of the data available, check out their website. Dimensions combines data from multiple other sources; for a more complete list, read this blog post.

There is a free version of Dimensions that provides non-commercial search access to publications and datasets. However, grants, patents, and clinical trials require the paid premium version. For more information on the differences between the free and premium versions, check out their blog. An API is available with the premium version for an additional fee. The University of Pennsylvania does not have an institutional subscription to this resource.

Scopus

Scopus is owned by Elsevier and delivers "source-neutral" data. It draws citations from a long list of journals, book series, preprints, and more, but it does not index monographs. Historically, Scopus has focused on indexing STEM disciplines.

The University of Pennsylvania has an institutional membership to Scopus, and an API is available through that membership.
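
As an illustration only: querying the Scopus Search API generally requires an API key from the Elsevier Developer Portal and access from a campus network or an institutional token. The sketch below assumes those prerequisites are in place; the key and query string are placeholders.

```python
import requests

API_KEY = "your-elsevier-api-key"  # placeholder; obtain a key from the Elsevier Developer Portal

# Run a simple Scopus search and print titles with citation counts.
resp = requests.get(
    "https://api.elsevier.com/content/search/scopus",
    params={"query": "TITLE-ABS-KEY(citation analysis)", "count": 5},
    headers={"X-ELS-APIKey": API_KEY, "Accept": "application/json"},
)
resp.raise_for_status()
for entry in resp.json()["search-results"]["entry"]:
    print(entry.get("dc:title"), "-", entry.get("citedby-count"), "citations")
```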

Web of Science

Web of Science is owned by Clarivate. Their data encompasses more than 34,000 journals and 217 million works. Although they have one of the smaller corpora available, they claim to have the most accurate metadata due to indexing by subject matter experts. Historically, Web of Science has focused on STEM fields. A list of their subject classifications is available at this blog.

Clarivate also owns the process for calculating the Journal Impact Factor and produces the Journal Citation Reports; look at this essay for more details.

The University of Pennsylvania has an institutional account with Web of Science, and you can access its features as long as you make an account using your Penn email address.

Google Scholar

Google Scholar is a free search engine for articles, theses, books, abstracts, and court opinions. It uses web crawlers to extract data from web pages, and it only pulls metadata from readable PDFs or HTML pages. It can also extract metadata from university repositories and journal publishers.

Since Google Scholar indexing is based solely on web crawlers, there is no verification of the data. There is no official API, but some third-party tools have been created.
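
As one example of an unofficial route, the third-party Python package scholarly scrapes Google Scholar profile pages. Because it relies on scraping, it can break or be rate-limited at any time; the author name below is a placeholder, and the field names are assumptions to check against the package's documentation.

```python
# pip install scholarly  (unofficial, scraping-based; may break without notice)
from scholarly import scholarly

# Look up the first matching author profile and fill in its details.
author = next(scholarly.search_author("Jane Researcher"))  # placeholder name
author = scholarly.fill(author)
print(author.get("name"), "h-index:", author.get("hindex"))
```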
