Last modified: 19 September 2017

Science and PDFs

Research careers in science are judged by papers. Building teams, collecting data, and going to conferences all matter, but a publication record trumps them all.

This means that researchers have stacks of papers. They might be in magazines in a library. They might be printed and heavily annotated. But today, they’re probably stored on a computer. And if they are, they’re almost certainly stored in PDF format.

PDFs and papers

Inevitably researchers have hundreds, even thousands of PDFs on their computers. Thankfully, there are software packages to help manage them. Some that I’ve tried are,

And there is a long list of other options on Wikipedia’s List of Reference Managers.

The core feature of a reference manager is to help insert citations into new documents and to create and format the references section at the end of the paper. Most reference managers also perform the following four additional tasks:

  1. Extract and improve metadata, such as authors, keywords, publication year, and institution so that they are searchable.
  2. Allow annotations and notes to be made on a PDF.
  3. Allow files, such as models, raw data, and appendices to be associated with a PDF.
  4. Organise PDFs so that they are consistently named and the information in points 1, 2, and 3 can be shared among a research group and worked on collaboratively.


In this blog post I'll examine the first three in turn. Then I'll explain how implementing the same functionality within the PDF format could make the fourth task simpler, more open, and thus more future-proof.

Metadata in PDFs

Almost all scientific papers have a unique ID. This might be an ISSN or ISBN, a DOI, a PubMed ID, an ArXiv ID, or something else. Most papers have many such IDs, and databases exist that link the IDs with each other and with metadata.

These IDs let us do interesting things. For this example, I’ll use a paper that I published in 2010 in the journal Bioinformatics. Its DOI is 10.1093/bioinformatics/btq425. This code isn’t embedded in the PDF’s metadata, but it is printed at the top-right of the page following “DOI:” and every reference manager I tested was able to extract it.
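As a rough illustration of how a tool might pick a DOI out of the text of a page, here is a minimal Python sketch. The pattern is a widely used heuristic rather than a full grammar for DOIs, and the function name `find_doi` is my own invention, not something taken from any of the reference managers.

```python
import re

# Heuristic for modern DOIs: "10.", a 4-9 digit registrant code, "/", then a suffix.
# Assumption: the suffix runs until whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_doi(page_text):
    """Return the first DOI-like string in page_text, or None if there is none."""
    match = DOI_PATTERN.search(page_text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    # (A few older DOIs really do end in ")", so this is a simplification.)
    return match.group(0).rstrip(".,;)")


print(find_doi("DOI: 10.1093/bioinformatics/btq425"))
```

A real tool would first need to extract the text of the first page from the PDF. That step is left out here.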

From the DOI, online databases can give the PubMed Identifier (PMID), and from there the metadata can be downloaded using Entrez in a JSON-like format or in XML format. There is a huge amount of data here.
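The two lookups can be sketched as a pair of URLs against NCBI's E-utilities, the web API behind Entrez. This is only a sketch: it builds the request URLs without sending them, it assumes the `[aid]` (article identifier) search field is the right way to look up a DOI in PubMed, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pmid_search_url(doi):
    """URL asking PubMed for records whose article identifier matches doi."""
    query = {"db": "pubmed", "term": f"{doi}[aid]", "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(query)}"


def metadata_url(pmid):
    """URL returning the PubMed record for pmid as XML."""
    query = {"db": "pubmed", "id": pmid, "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(query)}"


print(pmid_search_url("10.1093/bioinformatics/btq425"))
```

Fetching the first URL should return a short JSON document containing the matching PMID, which can then be passed to `metadata_url` to get the full record.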


This data could be embedded in the original PDF in a format such as XMP (the Extensible Metadata Platform). But in all of the reference managers I have looked at, it is not. Instead it is stored in a database, either local or synced across devices and the cloud, that is specific to the reference manager in question. Embedding the metadata in XMP format in the PDF would make it transferable between applications, in a single file, using a standard metadata format instead of an application-specific database.
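To give a feel for what embedded metadata might look like, here is a minimal sketch that builds an XMP packet using Dublin Core fields and Python's standard library. The choice of fields (title, creators, and the DOI stored as `dc:identifier`) is my assumption about a sensible minimum, and `build_xmp` is a hypothetical helper. Writing the packet into a PDF's metadata stream would need a PDF library and is not shown.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(prefix, tag):
    """Qualified tag name in ElementTree's {namespace}tag form."""
    return f"{{{NS[prefix]}}}{tag}"


def build_xmp(title, authors, doi):
    """Return an XMP packet (as text) describing one paper."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative: an rdf:Alt with a default entry.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:creator is an ordered list of names: an rdf:Seq.
    seq = ET.SubElement(ET.SubElement(desc, q("dc", "creator")), q("rdf", "Seq"))
    for name in authors:
        ET.SubElement(seq, q("rdf", "li")).text = name

    # Assumption: the DOI is stored as a plain dc:identifier with a "doi:" prefix.
    ET.SubElement(desc, q("dc", "identifier")).text = f"doi:{doi}"

    body = ET.tostring(xmpmeta, encoding="unicode")
    # Standard xpacket wrapper that lets tools find the metadata in a file.
    header = '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    return f'{header}\n{body}\n<?xpacket end="w"?>'


# Placeholder title and authors, not the real paper's details.
print(build_xmp("Example paper title", ["A. Author"], "10.1093/bioinformatics/btq425"))
```

Because the packet is plain XML inside the file, any application that understands XMP could read these fields without access to a particular reference manager's database.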

Annotations and notes in PDFs

Adding annotations and notes to PDFs offers a number of advantages over writing on paper. A big part of that advantage is that they are searchable and shareable. For this reason, PDF has powerful capabilities for adding annotations, highlights, and notes and these are an almost universally-supported feature of the format.
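For the curious, a note in a PDF is just a small dictionary object attached to a page. The sketch below writes out the text of such a dictionary for a simple "sticky note" annotation. It is illustrative only: it assumes ASCII text, the helper names are made up, and a real tool would use a PDF library to add the object to the page's annotation list rather than building strings by hand.

```python
def pdf_string(text):
    """Escape ASCII text as a PDF literal string, written in parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def text_annotation(rect, note, author):
    """Return PDF syntax for a text ("sticky note") annotation dictionary.

    rect is (x1, y1, x2, y2) in page coordinates; /Contents holds the note
    and /T holds the name of whoever wrote it.
    """
    x1, y1, x2, y2 = rect
    return (
        "<< /Type /Annot /Subtype /Text "
        f"/Rect [{x1} {y1} {x2} {y2}] "
        f"/Contents {pdf_string(note)} /T {pdf_string(author)} >>"
    )


print(text_annotation((100, 700, 120, 720), "Check this result", "Reader"))
```

Highlights work in a similar way, with a different subtype and extra coordinates marking the highlighted text.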

It is frustrating that very few reference managers use these features, instead opting to create their own systems for making and sharing notes and annotations within their application-specific databases.

A notable exception to this is Docear which uses PDF’s built-in annotation system. But even here, results are not always as expected.


But this does show a possibility. What if scientific reference managers used the PDF standard for annotations and notes instead of their own workarounds?

Attachments in PDFs

Scientific papers increasingly do not stand on their own. In a supplementary information section, they now frequently refer to other documents, high-resolution images, computer models, and large raw datasets.

These files are usually stored in their own file formats and made available for separate download from publishers’ websites. In some reference managers they can be manually linked to the paper, but this is often unreliable. Links break when files are moved or shared to another computer.

As I have shown in detail, PDFs have the ability to contain attached files. What if journals made their papers available as PDFs with supplementary information attached? Instead of custom document-management systems, both on publishers’ servers and within reference management software, people could keep using the file metaphor that they are used to, powered by PDF.

Using PDFs to retain control

PDF is an open ISO standard. That includes its support for XMP metadata embedding, support for embedded annotation and notes, and support for file attachments.

Almost all of the features that reference managers are adding to PDFs, but currently implementing in their own way and hiding in their own database structures, could be implemented within the existing PDF standard. We could keep one paper along with its annotations and notes, its metadata, and any related files, in a single PDF file. That file would be searchable, shareable, and syncable using any of the file-syncing services that almost all of us already use.

I think this is a really interesting possibility for science. Instead of moving past the PDF, why not embrace the format and get organised? Adobe are already looking at it in other areas. Could we do the same for scientific publishing?


Oh, and one last thing. If this kind of discussion was interesting to you, the team at Docear have been thinking about this for a while.
