Extracting information and data from PDFs

Most of us will have been frustrated by a PDF document at some point. We’ve retyped data from an embedded table, spent ages removing line breaks from copied paragraphs of text, traced points from an embedded graph, or struggled to follow links or get back to the original source of a document.

But are there good ways of extracting information and data from PDFs?

I decided to do a little test using the Aireborough Background plan. It’s a document about planning homes, parks, and communities in West Leeds. Exciting, right? I picked it because it’s the kind of document that makes the world we live in work. It’s got text, headings, tables, some colours, and even a landscape page.

The PDF was created with Acrobat Elements/Acrobat PDFMaker 7.0 for Word in 2013. I suspect that matters.
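If you want to check what produced a PDF yourself, the creating application and the PDF producer are normally recorded in the file's Info dictionary. Below is a rough sketch in Python using only the standard library. It assumes the Info dictionary is stored uncompressed and uses simple literal strings, which is not guaranteed for every file. The function name `sniff_pdf_tools` and the sample bytes are made up for illustration, and a proper PDF library would be far more robust.

```python
import re

# Matches "/Creator (...)" or "/Producer (...)" written as a simple literal string.
# Assumption: no nested or escaped parentheses inside the string.
_INFO_RE = re.compile(rb"/(Creator|Producer)\s*\(([^()\\]*)\)")


def sniff_pdf_tools(data: bytes) -> dict:
    """Return whichever of Creator/Producer can be found in raw PDF bytes."""
    found = {}
    for key, value in _INFO_RE.findall(data):
        # Keep only the first match for each key.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found


# Made-up bytes standing in for a real file; not taken from the actual plan.
sample = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Creator (Acrobat PDFMaker 7.0 for Word) /Producer (Acrobat Elements) >>\n"
    b"endobj\n"
)
print(sniff_pdf_tools(sample))
```

For a real file you would pass in the bytes read from disk, for example `sniff_pdf_tools(open("plan.pdf", "rb").read())`, where `plan.pdf` is a placeholder name. An empty result only means this crude search found nothing, not that the metadata is missing.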

So what happened when I tried to extract the file’s contents?

Just the scores

I tried extracting the contents of the PDF with Adobe Acrobat (both free and pro), Microsoft Word (online and offline), PDF Tables, Corel PDF Fusion, LibreOffice, and Google Docs. If you just want the scores, they’re below. Or read on for more.

Software package                      Paid/Free                                       Tom’s PDF extraction score
Adobe Acrobat                         Paid (subscription from £1/month to £25/month)  9
Microsoft Word online                 Free                                            8
Microsoft Word                        Paid (one-off, or subscription)                 8
PDF Tables (if you just want tables)  Paid (first 50 pages free, $15 for 500 pages)   4 (8)
Corel PDF Fusion                      Paid (£35)                                      5
LibreOffice                           Free                                            2
Google Docs                           Free                                            0

Offline options

Export to Word in Acrobat DC Pro

This option wins. It’s really good. Tables are good, headings are the best of all the options, and page rotation works.

Opening the PDF with Microsoft Word 2016

This was really impressive. It’s not quite as good as Acrobat, but it seems to retain everything, including most heading levels and all tables. Even page orientation works, with the section break retained where the document switches to landscape mode.

The coolest thing is that you can do exactly the same thing with Microsoft Word online, which is free. If you want to get an editable document out of a PDF, maybe just to copy information out of a table into Excel, and you don’t want to pay, then Word is the way to go.

Export to HTML and the Edit button in Acrobat DC Pro

This was quite interesting. The HTML output from Acrobat DC Pro is similar to the Word output, but with no attempt to style it for a page.

Pressing the Edit button in Acrobat DC Pro does some cool things. It's not really extracting the whole file, but you can extract the bits that you like, or change them.

Export to Word in Corel PDF Fusion

Decent, but not great. Worse than Word import or Acrobat Export.

Paste to Word from Corel PDF Fusion

Each line is separate, tables are broken, formatting is gone.
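When a paste leaves every line separate like this, rejoining the lines is easy to script. Here is a minimal sketch in Python, assuming that a blank line still marks each paragraph break in the pasted text (that may not hold for every tool). The function name `unwrap_lines` is my own, and the sketch does nothing to repair broken tables or lost formatting.

```python
import re


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: a blank line marks a paragraph break in the pasted text.
    """
    # Split on blank lines (which may contain stray spaces or tabs).
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Within each paragraph, collapse every run of whitespace to one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


pasted = (
    "It's a document about\n"
    "planning homes, parks,\n"
    "and communities.\n"
    "\n"
    "Exciting, right?\n"
)
print(unwrap_lines(pasted))
```

Words that were hyphenated across a line break are left alone, because the code cannot tell a split word from a genuinely hyphenated one.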

Paste to Microsoft Word 2016 from Acrobat Reader DC

Only extracted the first two pages, no document outline, tables work, formatting is okay.

Opening the PDF with LibreOffice Writer

This is not a good experience. The whole file opens, but in LibreOffice Draw as separate slides. Tables are converted to images, lines are separated.

Paste to LibreOffice Writer from Acrobat Reader DC

An image of the first page only is pasted into the document.

Online options

Using the Export option in Acrobat Reader DC

Exactly the same as exporting from Acrobat DC Pro, but it happens online and the subscriptions are much cheaper, at less than £2/month.

PDF Tables

PDF Tables only extracts tables, so the rest is a mess. But the tables are great and it’s incredibly fast. There’s a lesson here about doing one thing really well and turning it into a business.

Google Docs

Doesn’t open.

So what?

Today you can get editable text and tables out of PDFs pretty cheaply (files here). You may well already have software that can do it, and if you’re willing to spend a bit of time with Microsoft Word online you can even do it for free.

But what about something more? LibreOffice (and alternatives like OpenOffice) have long had an option, when exporting a document as PDF, to create a Hybrid PDF that contains the original document in an editable form.

It’s not often used, and few people even know it exists, but if you open one of these PDFs in LibreOffice, you get back exactly what you put in.

In my next blog post I’ll be exploring this further and looking at whether this is a useful idea. I'll also be looking at how other software, software that a lot more people use, could do similar things. The technology already exists: there are articles about PDF File Attachments from back in 2010.
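As a rough illustration of the file attachment idea, the PDF specification uses the names `/EmbeddedFiles` and `/Filespec` for attached files, so a crude first check is to look for those names in the raw bytes. The sketch below is Python with only the standard library. The function name `attachment_markers` and the sample fragment are made up, and the check will miss anything stored inside compressed object streams, so treat a hit as a hint rather than proof.

```python
# Names the PDF specification uses for file attachments.
ATTACHMENT_MARKERS = (b"/EmbeddedFiles", b"/Filespec")


def attachment_markers(data: bytes) -> list:
    """List the attachment-related names that appear in raw PDF bytes."""
    return [m.decode("ascii") for m in ATTACHMENT_MARKERS if m in data]


# Made-up fragment standing in for part of a real file.
sample = b"<< /Type /Catalog /Names << /EmbeddedFiles 5 0 R >> >>"
print(attachment_markers(sample))
```

Actually pulling the attached file out means parsing the PDF's object structure, which is a job for a real PDF library rather than a byte search.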