
Building a really simple page-scraping Chrome extension.

and understanding how it works.

Want to parse the content of a website? More comfortable coding in JavaScript and displaying your results in HTML than using Scrapy at a Python command prompt? A Google Chrome extension might be perfect for you.

Sadly, the best guide to building a simple but functional page-scraping Chrome extension is quite complicated. So I’ve learned from it and written a much simpler Hello World Chrome extension for page scraping.

Download the source code and the packed extension and have a look: it's less than 40 lines of code. If you need help installing it, follow Google's instructions. For the important part, understanding how it works, I've drawn some pictures.

Get content from a page.

My example gets content from the currently loaded page and displays it in the Chrome extension's popup. Here the active tab is on Nokia's homepage, and that page's title is displayed in my extension's popup.

Bundle an extension.

There are five important parts to the extension. The first four are the logo, the popup page's HTML file, the popup page's JavaScript file, and the manifest.json file, which tells Chrome how to bundle these files together into an extension.
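As a rough sketch of how a manifest could tie those files together, here it is written as a JavaScript object that serialises to the contents of manifest.json. This assumes the older Manifest V2 format; the extension name, the icon file name and the permission list are my assumptions for illustration, not the contents of the downloadable source, and newer versions of Chrome expect Manifest V3, where several of these keys differ.

```javascript
// Hypothetical Manifest V2 manifest, written as a JavaScript object so it can
// be inspected before being saved as manifest.json.
// The name, icon file name and permissions are assumptions, not the author's files.
const manifest = {
  manifest_version: 2,
  name: "Hello World page scraper",
  version: "1.0",
  // The toolbar button: which logo to show and which page to open as the popup.
  browser_action: {
    default_icon: "logo.png",
    default_popup: "popup.html"
  },
  // Lets the extension inject a script into the tab the user is looking at.
  permissions: ["activeTab"]
};

// The text that would be saved as manifest.json.
const manifestJson = JSON.stringify(manifest, null, 2);
console.log(manifestJson);
```

Writing it as an object first is only a convenience for this post; in a real extension the file is plain JSON sitting next to the other files.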

Inject the payload.

The fifth important part of the extension solves the cross-site scripting problem. An extension is effectively a little website, and for sensible security reasons scripts from one website can't easily access the content on another website. popup.js can access the content of popup.html and change it, but it's blocked from accessing the content of the currently loaded web page unless that page specifically allows it, which it almost never will.

Chrome has access to both pages, and you can tell it to inject and run the payload.js script in the current web page. Once injected, the payload.js script can access and change the content of the currently active tab and send messages back to the popup.js script using the Chrome runtime messaging service. Since we've set popup.js as a persistent background script in the extension manifest, it will keep listening for messages from payload.js until Chrome closes.
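Here is a minimal sketch of that round trip, assuming the Manifest V2 calls chrome.tabs.executeScript, chrome.runtime.sendMessage and chrome.runtime.onMessage. The function names, the "pagetitle" element id and the choice to pass the Chrome API and the document in as parameters are my own assumptions, made so the logic can be tried outside a browser; they are not taken from the downloadable source.

```javascript
// payload.js side: runs inside the web page once Chrome has injected it.
// In the extension, `doc` would be `document` and `runtime` would be `chrome.runtime`.
function sendPageTitle(doc, runtime) {
  // Send the page's title back to whoever is listening in the extension.
  runtime.sendMessage(doc.title);
}

// popup.js side: runs in the extension's own page.
// In the extension, `chromeApi` would be `chrome` and `popupDoc` would be `document`.
function startPopup(chromeApi, popupDoc) {
  // Register the listener first so the payload's message is not missed.
  chromeApi.runtime.onMessage.addListener(function (message) {
    // "pagetitle" is a hypothetical element id in popup.html.
    popupDoc.getElementById("pagetitle").textContent = message;
  });

  // Ask Chrome to inject and run payload.js in the active tab.
  chromeApi.tabs.executeScript({ file: "payload.js" });
}
```

In the real extension, payload.js would end with something like `sendPageTitle(document, chrome.runtime);` and popup.js with `startPopup(chrome, document);`.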

Add more features.

If it all works properly, your extension should display the current tab's title. Once you've seen how it works, you can extend this Hello World extension however you like. The payload.js script can do anything it likes with the current web page, including navigating somewhere else or clicking a link. The Chrome runtime messaging service supports JSON objects, so you can easily pass formatted data between your extension and the current page.
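To illustrate passing a JSON object rather than a single string, here is a small sketch. The function names and the fields collected (title, link count and link addresses) are hypothetical examples of my own, not part of the downloadable source.

```javascript
// Hypothetical payload-side helper: gather several facts about the page
// into one plain object that can be sent as a single message.
function collectPageData(doc) {
  // Copy each link's address out of the page into an ordinary array.
  const links = Array.from(doc.querySelectorAll("a"), function (a) {
    return a.href;
  });
  return { title: doc.title, linkCount: links.length, links: links };
}

// Hypothetical popup-side helper: turn the received object into display text.
function describePage(data) {
  return data.title + " (" + data.linkCount + " links)";
}
```

Inside the extension, payload.js could send the object with `chrome.runtime.sendMessage(collectPageData(document));` and the popup's message listener could pass what it receives to `describePage`.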

Thanks for reading, and in case you missed the first download link,

Download the source code and the packed extension.
