Science and PDFs

Research careers in science are judged by papers. Building teams, collecting data, and going to conferences all matter, but a publication record trumps them all.

This means that researchers have stacks of papers. They might be in magazines in a library. They might be printed and heavily annotated. But today, they’re probably stored on a computer. And if they are, they’re almost certainly stored in PDF format.

PDFs and papers

Inevitably, researchers have hundreds, even thousands, of PDFs on their computers. Thankfully, there are software packages to help manage them. Some that I’ve tried are:

Mendeley
Zotero
ReadCube
Papers
Docear
ColWiz

And there is a long list of other options on Wikipedia’s List of Reference Managers.

The core feature of a reference manager is to assist with inserting citations into new documents and to manage the creation and formatting of the references section at the end of the paper. Most reference managers also perform the following four additional tasks:

1. Extract and improve metadata, such as authors, keywords, publication year, and institution so that they are searchable.
2. Allow annotations and notes to be made on a PDF.
3. Allow files, such as models, raw data, and appendices to be associated with a PDF.
4. Organise PDFs so that they are consistently named and the information in points 1, 2, and 3 can be shared among a research group and worked on collaboratively.

In this blog post I'll examine the first three in turn. Then I'll explain how implementing the same functionality within the PDF format could make the fourth task simpler, more open, and thus more future-proof.

Metadata in PDFs

Almost all scientific papers have a unique ID. This might be an ISSN or ISBN, a DOI, a PubMed ID, an arXiv ID, or something else. Most papers have several such IDs, and databases exist that link the IDs with each other and with metadata.

These IDs let us do interesting things. For this example, I’ll use a paper that I published in 2010 in the journal Bioinformatics. Its DOI is 10.1093/bioinformatics/btq425. This identifier isn’t embedded in the PDF’s metadata, but it is printed at the top right of the page following “DOI:”, and every reference manager I tested was able to extract it.
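
As a rough illustration of that extraction step, here is a minimal Python sketch. It assumes the text of the first page has already been pulled out of the PDF by some other tool, and it uses a common heuristic pattern for DOIs rather than an official grammar, so it may miss unusual identifiers. The name find_doi and the sample string are made up for this example, and this is not how any particular reference manager does it.

```python
import re

# Heuristic for modern DOIs: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an assumption for illustration, not an official DOI grammar.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_doi(page_text):
    """Return the first DOI-like string in page_text, or None if there is none."""
    match = DOI_PATTERN.search(page_text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not to the DOI.
    return match.group(0).rstrip(".,;)")


# Made-up page text; a real tool would get this from a PDF text extractor.
sample = "DOI: 10.1093/bioinformatics/btq425"
print(find_doi(sample))
```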

From the DOI, online databases can supply the PubMed Identifier (PMID), and from there the metadata can be downloaded using Entrez in a JSON-like format or in XML format. There is a huge amount of data here.
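
One way to sketch this lookup in Python is with NCBI's E-utilities, the web interface to Entrez. The choice of the esearch and efetch endpoints, the [AID] (article identifier) search tag, and the function names are my assumptions for illustration; other routes from a DOI to a PMID exist. Only lookup_pmid touches the network, and it assumes the usual esearch JSON layout.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(doi):
    """URL asking PubMed which records carry this DOI as an article identifier."""
    query = {"db": "pubmed", "term": f"{doi}[AID]", "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(query)}"


def efetch_url(pmid):
    """URL returning the PubMed record for pmid as XML."""
    query = {"db": "pubmed", "id": pmid, "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(query)}"


def lookup_pmid(doi):
    """Query the network and return the first matching PMID, or None."""
    with urlopen(esearch_url(doi)) as response:
        result = json.load(response)
    # Assumed response layout: {"esearchresult": {"idlist": ["..."]}}
    ids = result["esearchresult"]["idlist"]
    return ids[0] if ids else None
```

Calling lookup_pmid("10.1093/bioinformatics/btq425") and passing the result to efetch_url would give the address of the XML record for the example paper.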

This data could be embedded in the original PDF in a format such as XMP (the Extensible Metadata Platform). But in all of the reference managers I have looked at, it is not. Instead, it is stored in a database, either local or synced across devices and the cloud, that is specific to the reference manager in question. Embedding the metadata in XMP format in the PDF would make it transferable between applications, in a single file, using a standard metadata format instead of an application-specific database.
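
To give a feel for what such embedded metadata might look like, here is a minimal Python sketch that builds an XMP document using Dublin Core fields. The function name build_xmp and the choice of fields are assumptions for illustration, and the title and authors in the usage line are placeholders, not the real metadata of any paper. A real packet is normally wrapped in xpacket processing instructions and written into the PDF's metadata stream by a PDF library, which this sketch leaves out.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Use the conventional prefixes when serialising.
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def tag(prefix, name):
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{NS[prefix]}}}{name}"


def build_xmp(title, authors, doi):
    """Return an XMP document (as a string) holding title, authors, and DOI."""
    root = ET.Element(tag("x", "xmpmeta"))
    rdf = ET.SubElement(root, tag("rdf", "RDF"))
    desc = ET.SubElement(rdf, tag("rdf", "Description"), {tag("rdf", "about"): ""})

    # dc:title is a language alternative: rdf:Alt with one rdf:li per language.
    alt = ET.SubElement(ET.SubElement(desc, tag("dc", "title")), tag("rdf", "Alt"))
    ET.SubElement(alt, tag("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:creator is an ordered list of names: rdf:Seq.
    seq = ET.SubElement(ET.SubElement(desc, tag("dc", "creator")), tag("rdf", "Seq"))
    for author in authors:
        ET.SubElement(seq, tag("rdf", "li")).text = author

    ET.SubElement(desc, tag("dc", "identifier")).text = f"doi:{doi}"
    return ET.tostring(root, encoding="unicode")


# Placeholder title and authors; only the DOI comes from the example above.
packet = build_xmp("Placeholder title", ["A. Author", "B. Author"],
                   "10.1093/bioinformatics/btq425")
```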

Annotations and notes in PDFs

Adding annotations and notes to PDFs offers a number of advantages over writing on paper. A big part of that advantage is that they are searchable and shareable. For this reason, PDF has powerful capabilities for adding annotations, highlights, and notes, and these are an almost universally supported feature of the format.
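
Under the hood, a PDF annotation is just a small dictionary attached to a page. As a rough sketch, the Python below writes out a sticky-note ("Text") annotation in PDF syntax. The function names are made up for this example, the sketch only handles plain ASCII text, and a real tool would use a PDF library to add the dictionary to a page's annotation list.

```python
def pdf_string(text):
    """Escape ASCII text as a PDF literal string, e.g. (like this)."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def text_annotation(rect, note, author):
    """Return a sticky-note annotation dictionary in PDF syntax.

    rect is (left, bottom, right, top) in page units.
    """
    box = " ".join(str(value) for value in rect)
    return (
        "<< /Type /Annot /Subtype /Text "
        f"/Rect [{box}] "
        f"/Contents {pdf_string(note)} "  # the note text itself
        f"/T {pdf_string(author)} >>"     # conventionally the author's name
    )


# Made-up note, for illustration only.
print(text_annotation((50, 700, 70, 720), "See (Fig. 2)", "Reader"))
```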

It is frustrating that very few reference managers use these features, instead opting to create their own systems for making and sharing notes and annotations within their application-specific databases.

A notable exception to this is Docear, which uses PDF’s built-in annotation system. But even here, results are not always as expected.


But this does show a possibility. What if scientific reference managers used the PDF standard for annotations and notes instead of their own workarounds?

Attachments in PDFs

Scientific papers increasingly do not stand on their own. In a supplementary information section, they now frequently refer to other documents, high-resolution images, computer models, and large raw datasets.

These files are usually stored in their own file formats and made available for separate download from publishers’ websites. In some reference managers they can be manually linked to the paper, but this is often unreliable. Links break when files are moved or shared to another computer.

As I have shown in detail, PDFs have the ability to contain attached files. What if journals made their papers available as PDFs with supplementary information attached? Instead of custom document-management systems, both on publishers’ servers and within reference management software, people could keep using the file metaphor that they are used to, powered by PDF.
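
To make the idea concrete, here is a minimal Python sketch that writes a blank one-page PDF from scratch with a single file attached. It is a toy under stated assumptions: the function name is made up, the file name must be plain ASCII without parentheses or backslashes, and the attachment is stored uncompressed. It is not how any publisher packages papers, and a real tool would use a PDF library to attach files to an existing paper.

```python
def pdf_with_attachment(filename, data):
    """Return the bytes of a blank one-page PDF with data attached as filename."""
    name = filename.encode("ascii")
    objects = [
        # 1: catalog, whose name tree lists the embedded files
        b"<< /Type /Catalog /Pages 2 0 R "
        b"/Names << /EmbeddedFiles << /Names [(" + name + b") 4 0 R] >> >> >>",
        # 2 and 3: page tree with a single blank page
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
        # 4: file specification pointing at the embedded stream
        b"<< /Type /Filespec /F (" + name + b") /EF << /F 5 0 R >> >>",
        # 5: the attached file itself
        b"<< /Type /EmbeddedFile /Length " + str(len(data)).encode() + b" >>\n"
        b"stream\n" + data + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


# Made-up supplementary data, for illustration only.
pdf_bytes = pdf_with_attachment("data.csv", b"x,y\n1,2\n")
```

Writing pdf_bytes to a file and opening it in a viewer that supports attachments should show data.csv in the attachments panel.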

Using PDFs to retain control

PDF is an open ISO standard. That includes its support for XMP metadata embedding, support for embedded annotation and notes, and support for file attachments.

Almost all of the features that reference managers are adding to PDFs, but currently implementing in their own way and hiding in their own database structures, could be implemented within the existing PDF standard. We could keep one paper, along with its annotations and notes, its metadata, and any related files, in a single PDF file. That file would be searchable, shareable, and syncable using any of the file-syncing services that almost all of us already use.

I think this is a really interesting possibility for science. Instead of moving past the PDF, why not embrace the format and get organised? Adobe are already looking at it in other areas. Could we do the same for scientific publishing?

Oh, and one last thing. If this kind of discussion was interesting to you, the team at Docear have been thinking about this for a while.