Last modified: 12 May 2017
Opening the Aireborough Background plan
Most of us will have been frustrated by a PDF document at some point. We’ve retyped data from an embedded table, spent ages removing line breaks from copied paragraphs of text, traced points from an embedded graph, or struggled to follow links or get back to the original source of a document.
But are there good ways of extracting information and data from PDFs?
I decided to do a little test using the Aireborough Background plan. It’s a document about planning homes, parks, and communities in West Leeds. Exciting, right? I picked it because it’s the kind of document that makes the world we live in work. It’s got text, headings, tables, some colours, and even a landscape page.
The PDF was created with Acrobat Elements/Acrobat PDFMaker 7.0 for Word in 2013. I suspect that matters.
So what happened when I tried to extract the file’s contents?
I tried extracting the contents of the PDF with Adobe Acrobat (both free and pro), Microsoft Word (online and offline), PDF Tables, Corel PDF Fusion, LibreOffice, and Google Docs. If you just want the scores, they’re below. Or read on for more.
| Software Package | Paid/Free | Tom’s PDF extraction score |
|---|---|---|
| Adobe Acrobat | Paid (subscription from £1/month to £25/month) | 9 |
| Microsoft Word online | Free | 8 |
| Microsoft Word (paid) | Paid (one-off, or subscription) | 8 |
| PDF Tables (if you just want tables) | Paid (first 50 pages free, $15 for 500 pages) | 4 (8) |
| Corel PDF Fusion | Paid (£35) | 5 |
This option wins. It’s really good. Tables are good, headings are the best of all the options, page rotation works.
This was really impressive. It’s not quite as good as Acrobat, but it seems to retain everything, including most heading levels and all tables. Even page orientation works, with the section break where the document switches to landscape mode retained.
The coolest thing is that you can do exactly the same thing with Microsoft Word online, which is free. If you want to get an editable document out of a PDF, maybe just to copy information out of a table into Excel, and you don’t want to pay, then Word is the way to go.
This was quite interesting. The HTML output from Acrobat Reader Pro is similar to the Word output, but with no attempt to style it for a page.
Pressing the Edit button in Acrobat DC Pro does some cool things. It's not really extracting the whole file, but you can extract the bits that you like, or change them.
Decent, but not great. Worse than Word import or Acrobat Export.
Each line is separate, tables are broken, formatting is gone.
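If you do end up with output like this, where every line is separate, a few lines of Python can tidy up the plain text at least. This is only a minimal sketch and not part of my test: the function name `unwrap_lines` is my own invention, and it assumes that blank lines separate paragraphs and that a hyphen at the end of a line means a word was split across the break. It won't bring back tables or formatting.

```python
import re


def unwrap_lines(text: str) -> str:
    """Rejoin hard-wrapped lines into paragraphs.

    Assumptions (heuristics, not guarantees): blank lines separate
    paragraphs, and a hyphen at the end of a line marks a word that
    was split across the line break.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for para in paragraphs:
        # Join words hyphenated across a line break: "docu-\nment" -> "document"
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Collapse remaining line breaks and runs of whitespace into single spaces
        para = re.sub(r"\s+", " ", para).strip()
        if para:
            cleaned.append(para)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "It's a docu-\nment about planning\nhomes and parks.\n\nSecond para-\ngraph here."
    print(unwrap_lines(sample))
```

The hyphen rule will occasionally glue together a term that really was hyphenated, so it's worth skimming the result rather than trusting it blindly.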
Only extracted the first two pages, no document outline, tables work, formatting is okay.
This is not a good experience. The whole file opens, but in LibreOffice Draw as separate slides. Tables are converted to images, lines are separated.
An image of the first page only is pasted into the document.
Exactly the same as exporting from Acrobat Reader Pro, but it happens online and the subscriptions are much cheaper, less than £2/month.
PDF Tables only extracts tables, so the rest is a mess. But the tables are great and it’s incredibly fast. There’s a lesson here about doing one thing really well and turning it into a business.
Today you can get editable text and tables out of PDFs pretty cheaply (files here). You may well already have software that can do it, and if you’re willing to spend a bit of time with Microsoft Word online you can even do it for free.
But what about something more? LibreOffice (and alternatives like OpenOffice) have long had an option, when exporting a document as PDF, to create a Hybrid PDF, which contains the original document in an editable form.
It’s not often used, and few people even know it exists, but if you open one of these PDFs in LibreOffice, you get back exactly what you put in.
In my next blog post I’ll be exploring this further and looking at whether this is a useful idea. I'll also be looking at how other software, software that a lot more people use, could do similar things. The technology already exists: there are articles about PDF File Attachments from back in 2010.