When Scientific Citations Point to Papers That Don't Exist

Hook

An audit of 2.5 million biomedical papers found thousands of citations pointing to papers that don’t exist. Not citations to retracted studies. Not misquoted sources. Citations to papers that were never published—fake DOIs, phantom journals, references fabricated from scratch.

This isn’t one researcher inventing data to save their career. It’s systematic gaming of the citation network itself. How does fraud scale like this—and why didn’t anyone notice until now?

How Citation Networks Work As Trust Systems

A scientific paper’s reference list is supposed to be a trail of real work. You cite the studies you built on. Readers trust those citations are accurate because that’s how knowledge compounds—each paper stands on the ones before it.

Peer review checks whether your methods are sound and your conclusions follow from your data. It doesn’t verify your references. Reviewers don’t look up every DOI to confirm the cited paper exists. That would take hours per paper. Journals don’t have the budget. The system assumes good faith.

That assumption creates the gap. When a citation looks formatted correctly—author names, journal title, year, DOI—reviewers move on. The structure signals legitimacy. But structure is easy to fake. A fraudulent reference can sit in a bibliography for years before anyone notices the DOI leads nowhere.
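To see how little a well-formed identifier proves, here is a minimal sketch of a format-only check. The pattern is a commonly used shape for modern DOIs; the function name and the sample identifier are invented for illustration and do not come from the audit.

```python
import re

# Commonly used shape for modern DOIs: "10.", a 4-9 digit prefix, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def looks_like_doi(text: str) -> bool:
    """Return True if the string is shaped like a DOI.

    This checks structure only. It says nothing about whether the DOI
    was ever registered.
    """
    return DOI_PATTERN.match(text.strip()) is not None


# An invented identifier passes the format check just as a real one would.
print(looks_like_doi("10.12345/jfake.2021.00417"))  # True
print(looks_like_doi("not-a-doi"))                  # False
```

A check like this is roughly what a quick glance at a bibliography amounts to: it confirms the shape of the identifier, not that anything sits behind it.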

What Paper Mills Are And Why They Exist

Paper mills are commercial operations that fabricate entire studies for sale. A researcher pays a fee—often $1,000 to $5,000—and receives authorship on a paper they didn’t write, reporting experiments they didn’t conduct, with data they didn’t collect.

The business model works because academic systems reward publication volume. Hiring committees count papers. Promotion reviews count citations. Grant agencies count output. In some countries, researchers receive direct cash bonuses for publications in certain journals. Paper mills exploit that pressure.

They fabricate everything: fake authors with plausible names and email addresses, fake institutional affiliations, fake experimental data with realistic-looking figures and tables, and fake reference lists citing other fake papers to build credibility. Some mills even create fake journals with real-looking websites and DOI prefixes obtained through legitimate registration agencies. The result is a manufactured citation network, fake papers citing fake papers, that looks legitimate until someone audits the underlying trail.

How The Audit Found Fake Citations

The Lancet audit, led by Columbia University’s Maxim Topaz, used two approaches to identify fake citations across 2.5 million biomedical papers. First: DOI and PubMed ID validation. A cited reference typically carries a DOI (digital object identifier) or PubMed ID that should point to a real published paper. Researchers checked whether those identifiers actually resolved to papers in official registries. Thousands didn’t. The DOIs were formatted correctly but pointed to nothing.
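The identifier check can be sketched in a few lines. This is not the audit’s actual code. It assumes the public doi.org handle proxy, which is expected to answer with a JSON payload whose responseCode is 1 for a registered DOI and with HTTP 404 for an unregistered one. The function names are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Iterable, List


def doi_is_registered(doi: str, timeout: float = 10.0) -> bool:
    """Ask the doi.org handle proxy whether a DOI is registered (network call).

    Assumption: the proxy returns responseCode 1 on success and HTTP 404
    when the handle is not found.
    """
    url = "https://doi.org/api/handles/" + urllib.parse.quote(doi, safe="/")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # Other failures (rate limits, outages) are not evidence of fraud.
    return payload.get("responseCode") == 1


def unresolved_dois(
    dois: Iterable[str],
    lookup: Callable[[str], bool] = doi_is_registered,
) -> List[str]:
    """Return the DOIs for which the lookup reports no registration."""
    return [doi for doi in dois if not lookup(doi)]
```

Passing the lookup in as a parameter keeps the check testable offline and makes it easy to swap in a different registry, such as a PubMed ID lookup.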

Second: database searches. For citations without valid identifiers, researchers searched publication databases using author names, titles, and journal names from the reference list. Many citations claimed to be from journals that don’t exist. Others cited papers in real journals, but no such paper appeared in those journals’ archives.
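A database search of this kind comes down to matching a cited title against what a journal actually published. The sketch below assumes the journal’s archive is already available as a local list of titles and uses fuzzy matching from Python’s standard library. The 0.9 threshold and the sample titles are illustrative assumptions, not values from the audit.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable, Optional


def normalize(title: str) -> str:
    """Lowercase a title and collapse punctuation and spacing."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def best_match(
    cited_title: str,
    archive_titles: Iterable[str],
    threshold: float = 0.9,  # assumed cutoff; would need tuning on real data
) -> Optional[str]:
    """Return the closest archive title, or None if nothing is close enough."""
    target = normalize(cited_title)
    best_title, best_score = None, 0.0
    for candidate in archive_titles:
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best_title, best_score = candidate, score
    return best_title if best_score >= threshold else None


# Invented titles, for illustration only.
archive = [
    "Deep Learning for Sepsis Prediction: A Cohort Study",
    "Nurse Staffing and Patient Outcomes",
]
print(best_match("deep learning for sepsis prediction - a cohort study", archive))
print(best_match("Quantum Effects in Wound Healing", archive))  # None
```

A result of None means only that no close title was found in the archive that was searched. It is a reason to look closer, not proof of fabrication.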

Both methods require scale. A single researcher can’t manually verify millions of references. Automated systems can flag anomalies—a DOI that doesn’t resolve, a journal name not in any database—but each flag requires human judgment to confirm whether it’s fraud or a formatting error.
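The division of labor described above, where machines flag and humans judge, might look like the following sketch. The Reference fields, the flag names, and the idea of a known-journals set are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set, Tuple


@dataclass
class Reference:
    title: str
    journal: str
    doi: Optional[str] = None


def flag_reference(
    ref: Reference,
    doi_resolves: Callable[[str], bool],
    known_journals: Set[str],
) -> List[str]:
    """Collect anomaly flags for one reference. Flags are not verdicts."""
    flags = []
    if ref.doi is None:
        flags.append("no_identifier")
    elif not doi_resolves(ref.doi):
        flags.append("doi_unresolved")
    if ref.journal.lower() not in known_journals:
        flags.append("journal_unknown")
    return flags


def review_queue(
    refs: Iterable[Reference],
    doi_resolves: Callable[[str], bool],
    known_journals: Set[str],
) -> List[Tuple[Reference, List[str]]]:
    """Return only the references a human reviewer needs to look at."""
    queue = []
    for ref in refs:
        flags = flag_reference(ref, doi_resolves, known_journals)
        if flags:
            queue.append((ref, flags))
    return queue
```

The output is a queue for reviewers rather than a list of fraudulent papers, which reflects the limit of what automation can establish.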

Why Verification Is Structurally Hard

Manual verification is impossibly slow. A single paper might cite 30 to 50 references. Checking each one—looking up the DOI, confirming the title matches, verifying the authors and journal—takes minutes per citation. For one paper, that’s hours. For thousands of papers published daily, it’s unworkable.
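A back-of-the-envelope calculation makes the scale concrete. The 30 to 50 references per paper come from the paragraph above; the three minutes per citation and the 5,000 papers per day are illustrative assumptions, not measured figures.

```python
def verification_hours(references: int, minutes_per_reference: float) -> float:
    """Hours of manual checking needed for one paper's reference list."""
    return references * minutes_per_reference / 60


# Assumed: 3 minutes to look up and confirm a single citation.
print(verification_hours(30, 3))  # 1.5
print(verification_hours(50, 3))  # 2.5

# Assumed: 5,000 papers per day with 40 references each.
daily_hours = 5000 * verification_hours(40, 3)
print(daily_hours)  # 10000.0
```

Under those assumptions, a single day of publishing would take about 10,000 hours of checking, before any other part of peer review.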

Automated systems can scan faster, but they flag anomalies, not fraud. A DOI that doesn’t resolve might be a typo. A missing PubMed ID might mean the paper is too recent to be indexed. A cluster of papers citing each other might be a legitimate research group, not a citation ring. Each flag requires human review.
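The citation-cluster case shows why a flag is only a starting point. The sketch below finds pairs of papers that cite each other in a small citation map. The data structure and paper labels are hypothetical, and a mutual citation is an anomaly to review, not evidence of a ring.

```python
from typing import Dict, List, Set, Tuple


def mutual_citations(cites: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return sorted pairs of papers that cite each other.

    `cites` maps each paper to the set of papers it cites.
    """
    pairs = set()
    for paper, refs in cites.items():
        for cited in refs:
            if paper != cited and paper in cites.get(cited, set()):
                pairs.add(tuple(sorted((paper, cited))))
    return sorted(pairs)


# Hypothetical citation map: A and B cite each other, D cites A one way.
cites = {"A": {"B", "C"}, "B": {"A"}, "C": set(), "D": {"A"}}
print(mutual_citations(cites))  # [('A', 'B')]
```

Two labs working on the same problem would produce exactly this pattern, which is why the result needs a human to interpret it.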

The structural tension: science needs to move fast. Peer review already takes months. If every citation required manual verification before publication, the system would freeze. But speed enables fraud. When checking costs more than faking, fraud scales. Journals are now adding automated DOI checks, but fraudsters adapt: they use real DOIs from predatory journals, or cite retracted papers that still have valid identifiers.

The Cost Of Fake Citations

Fake citations carry real costs. Other researchers build studies on fabricated work, wasting time, money, and credibility. A clinical trial might cite a fake safety study. A policy recommendation might rest on fabricated data. The literature fills with pollution: papers that look legitimate but teach nothing.

Trust erodes. When readers can’t assume a reference list is real, every paper becomes suspect. The whole system depends on citations being honest. A citation is a claim: “This work exists, and it supports what I’m saying.” When that claim is false, the knowledge network breaks. You can’t build on work that was never done.

The broader damage is invisible. How many researchers wasted months chasing a result that came from a fake paper? How many literature reviews included fabricated studies without knowing? Once fake citations enter the network, they propagate. One fake paper gets cited by a real one. That real paper gets cited by another. The fraud becomes embedded in legitimate work, and cleaning it out requires auditing the entire chain.
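Auditing the chain amounts to walking the citation graph outward from each known fake paper. The sketch below does this with a breadth-first search over a small citation map. The map and paper labels are invented for illustration.

```python
from collections import deque
from typing import Dict, Set


def downstream_papers(cites: Dict[str, Set[str]], fake: Set[str]) -> Set[str]:
    """Return every paper that cites a fake paper, directly or through a chain.

    `cites` maps each paper to the set of papers it cites.
    """
    # Invert the map so each paper points to the papers that cite it.
    cited_by: Dict[str, Set[str]] = {}
    for paper, refs in cites.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(paper)

    reached: Set[str] = set()
    queue = deque(fake)
    while queue:
        current = queue.popleft()
        for citing in cited_by.get(current, ()):
            if citing not in reached and citing not in fake:
                reached.add(citing)
                queue.append(citing)
    return reached


# Hypothetical map: real1 cites a fake paper, real2 cites real1.
cites = {"real1": {"fake1"}, "real2": {"real1"}, "real3": {"other"}}
print(sorted(downstream_papers(cites, {"fake1"})))  # ['real1', 'real2']
```

Being reachable in this graph does not mean a paper relied on the fabricated result. It only defines the set of papers an audit would need to read.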

Close

Any trust system is vulnerable when verification costs more than fraud. Science assumed no one would bother faking citations because the payoff seemed low—but paper mills found a business model in it. The question isn’t who cheats. It’s how to design a system that doesn’t depend on everyone being honest.