Last modified: 12 May 2017

Extracting information and data from PDFs

Opening the Aireborough Background plan

Most of us will have been frustrated by a PDF document at some point. We’ve retyped data from an embedded table, spent ages removing line breaks from copied paragraphs of text, traced points from an embedded graph, or struggled to follow links or get back to the original source of a document.

But are there good ways of extracting information and data from PDFs?

I decided to do a little test using the Aireborough Background plan. It’s a document about planning homes, parks, and communities in West Leeds. Exciting, right? I picked it because it’s the kind of document that makes the world we live in work. It’s got text, headings, tables, some colours, and even a landscape page.

The PDF was created with Acrobat Elements/Acrobat PDFMaker 7.0 for Word in 2013. I suspect that matters.
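If you want to check which tool made a PDF without opening a viewer, here is a minimal sketch in Python. It assumes the file's Info dictionary is stored uncompressed and that the Producer and Creator values are simple literal strings with no nested brackets, which will not be true of every PDF. The function name `sniff_pdf_tools` and the sample bytes are made up for illustration; the sample is not taken from the Aireborough file.

```python
import re

# Matches "/Producer (...)" or "/Creator (...)" with a simple literal string.
# Assumption: the value contains no nested or escaped parentheses.
_INFO_FIELD = re.compile(rb"/(Producer|Creator)\s*\(([^()]*)\)")


def sniff_pdf_tools(data: bytes) -> dict:
    """Return the first Producer/Creator strings found in raw PDF bytes."""
    found = {}
    for key, value in _INFO_FIELD.findall(data):
        # Keep only the first occurrence of each field.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found


# Hand-written fragment for illustration only (hypothetical values).
sample = (
    b"1 0 obj\n"
    b"<< /Creator (Example Word Add-in 7.0) /Producer (Example PDF Writer 7.0) >>\n"
    b"endobj\n"
)
print(sniff_pdf_tools(sample))
```

For a real file you would pass in `open(path, "rb").read()`. A proper PDF library is a safer choice for anything beyond a quick look, because many PDFs compress or encode these fields.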

So what happened when I tried to extract the file’s contents?

Just the scores

I tried extracting the contents of the PDF with Adobe Acrobat (both free and pro), Microsoft Word (online and offline), PDF Tables, Corel PDF Fusion, LibreOffice, and Google Docs. If you just want the scores, they’re below. Or read on for more.

| Software package | Paid/Free | Tom’s PDF extraction score |
| --- | --- | --- |
| Adobe Acrobat | Paid (subscription from £1/month to £25/month) | 9 |
| Microsoft Word online | Free | 8 |
| Microsoft Word | Paid (one-off, or subscription) | 8 |
| PDF Tables (if you just want tables) | Paid (first 50 pages free, $15 for 500 pages) | 4 (8) |
| Corel PDF Fusion | Paid (£35) | 5 |
| LibreOffice | Free | 2 |
| Google Docs | Free | 0 |

Offline options

Export to Word in Acrobat Reader Pro

This option wins. It’s really good. Tables are good, headings are the best of all the options, and page rotation works.

Opening the PDF with Microsoft Word 2016

This was really impressive. It’s not quite as good as Acrobat, but it seems to retain everything, including most heading levels and all tables. Even page orientation works, with the section break retained where the document switches to landscape mode.

The coolest thing is that you can do exactly the same thing with Microsoft Word online, which is free. If you want to get an editable document out of a PDF, maybe just to copy information out of a table into Excel, and you don’t want to pay then Word is the way to go.

Export to HTML and the Edit button in Acrobat DC Pro

This was quite interesting. The HTML output from Acrobat Reader Pro is similar to the Word output, but with no attempt to style it for a page.

Pressing the Edit button in Acrobat DC Pro does some cool things. It's not really extracting the whole file, but you can extract the bits that you like, or change them.

Export to Word in Corel PDF Fusion

Decent, but not great. Worse than Word import or Acrobat Export.

Paste to Word from Corel PDF Fusion

Each line is separate, tables are broken, formatting is gone.
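When a paste leaves every line separate like this, a few lines of Python can stitch paragraphs back together. This is a rough sketch, not something any of the tools above do: it assumes blank lines mark paragraph breaks and that a trailing hyphen means a word was split across lines, so it will wrongly join genuinely hyphenated words. The function name `unwrap_paragraphs` is made up for illustration.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; blank lines are treated as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for para in paragraphs:
        lines = [line.strip() for line in para.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-"):
                # Assumption: a trailing hyphen marks a word split across lines.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        cleaned.append(joined)
    return "\n\n".join(cleaned)


pasted = "The plan covers\nhomes and parks.\n\nIt also covers trans-\nport."
print(unwrap_paragraphs(pasted))
```

It does nothing for broken tables or lost formatting, which is why the export options above are still the better route.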

Paste to Microsoft Word 2016 from Acrobat Reader DC

Only extracted the first two pages, no document outline, tables work, formatting is okay.

Opening the PDF with LibreOffice Writer

This is not a good experience. The whole file opens, but in LibreOffice Draw as separate slides. Tables are converted to images, lines are separated.

Paste to LibreOffice Writer from Acrobat Reader DC

An image of the first page only is pasted into the document.

Online options

Using the Export option in Acrobat Reader DC

Exactly the same as exporting from Acrobat Reader Pro, but it happens online and the subscriptions are much cheaper: less than £2/month.

PDF Tables

PDF Tables only extracts tables, so the rest is a mess. But the tables are great and it’s incredibly fast. There’s a lesson here about doing one thing really well and turning it into a business.

Google Docs

Doesn’t open.

So what?

Today you can get editable text and tables out of PDFs pretty cheaply (files here). You may well already have software that can do it, and if you’re willing to spend a bit of time with Microsoft Word online you can even do it for free.

But what about something more? LibreOffice (and alternatives like OpenOffice) have long had a feature when exporting a document as PDF to create a Hybrid PDF, containing the original document in an editable form.

It’s not often used, and few people even know it exists, but if you open one of these PDFs in LibreOffice, you get back exactly what you put in.

In my next blog post I’ll be exploring this further and looking at whether this is a useful idea. I'll also be looking at how other software, software that a lot more people use, could do similar things. The technology already exists: there are articles about PDF File Attachments from back in 2010.
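As a quick way to see whether a PDF might carry file attachments, here is a crude Python sketch. It counts the names the PDF format uses for attachments (`/EmbeddedFiles`, `/Filespec`, `/EmbeddedFile`) in the raw bytes. It assumes those objects are stored uncompressed, so a zero count proves nothing, and it makes no assumption about how LibreOffice stores the editable copy in a Hybrid PDF. The function name `attachment_hints` and the sample bytes are made up for illustration.

```python
import re

# PDF names associated with file attachments.
ATTACHMENT_MARKERS = (b"/EmbeddedFiles", b"/Filespec", b"/EmbeddedFile")


def attachment_hints(data: bytes) -> dict:
    """Count attachment-related names in raw PDF bytes (a hint, not proof)."""
    hints = {}
    for marker in ATTACHMENT_MARKERS:
        # The lookahead stops "/EmbeddedFile" from also matching "/EmbeddedFiles".
        pattern = re.compile(re.escape(marker) + rb"(?![A-Za-z0-9])")
        hints[marker.decode("ascii")] = len(pattern.findall(data))
    return hints


# Hand-written fragment for illustration only (hypothetical file name).
sample = (
    b"<< /Type /Catalog /Names << /EmbeddedFiles 5 0 R >> >>\n"
    b"<< /Type /Filespec /F (plan.docx) /EF << /F 7 0 R >> >>\n"
    b"<< /Type /EmbeddedFile /Length 0 >>\n"
)
print(attachment_hints(sample))
```

To actually pull an attachment out you would want a real PDF library; this only tells you whether it is worth looking.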
