To load news items from a given source into OpenBlock, you need to write a scraper script.

A scraper script is a Python script that does several things:

  1. Download raw data from some URL on the web.
  2. Extract information we can use for creating news items.
  3. Insert a particular kind of news item into the database (e.g. a police report, a restaurant inspection, or a news article).

Typically, one scraper script is intended to fetch news from a single URL and create a single kind of news item.
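
The three steps listed above can be sketched using only the Python standard library. This is an illustrative outline, not OpenBlock's actual scraper API: the function names (download, extract, save_items, scrape), the choice of an RSS feed as input, and the SQLite table news_items are all assumptions standing in for OpenBlock's own models and helpers.

```python
import sqlite3
import urllib.request
import xml.etree.ElementTree as ET


def download(url):
    """Step 1: fetch the raw feed from the web (returned as bytes)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def extract(rss_data):
    """Step 2: pull the fields we need out of each RSS <item>."""
    root = ET.fromstring(rss_data)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "url": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
        })
    return items


def save_items(conn, items):
    """Step 3: store one kind of news item.

    The table is a hypothetical stand-in, not OpenBlock's real schema.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news_items "
        "(url TEXT PRIMARY KEY, title TEXT, description TEXT)"
    )
    # INSERT OR IGNORE keeps repeated runs from duplicating known items.
    conn.executemany(
        "INSERT OR IGNORE INTO news_items (url, title, description) "
        "VALUES (:url, :title, :description)",
        items,
    )
    conn.commit()


def scrape(url, conn):
    """One script, one URL, one kind of news item."""
    save_items(conn, extract(download(url)))
```

Keeping the three steps in separate functions means the extraction logic can be tried against a saved copy of the feed without touching the network or a real database.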

OpenBlock already includes a fair amount of infrastructure for writing these scripts, but it still needs to become easier to use and to handle more of the common cases automatically.

What if I can't write Python scripts?

It is not currently possible to import news without writing a script. We plan to gradually reduce the difficulty of importing news in a number of ways.

For one thing, we plan to provide some out-of-the-box scripts that work with a number of popular feed formats (RSS, Atom, GeoJSON) and/or web service APIs (e.g. Flickr), along with appropriate out-of-the-box news item types (news article, Flickr photo, restaurant inspection, etc.). It should be possible to use these with only a bit of configuration.
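
To make "a bit of configuration" concrete, here is a purely hypothetical sketch of a generic GeoJSON reader driven by a small settings dictionary. Nothing here exists in OpenBlock: FEED_CONFIG, its keys, the property names "name" and "inspected_on", and items_from_geojson are invented for illustration only.

```python
import json

# Hypothetical configuration: which GeoJSON properties feed which item fields.
FEED_CONFIG = {
    "item_type": "restaurant inspection",
    "title_property": "name",
    "date_property": "inspected_on",
}


def items_from_geojson(geojson_text, config):
    """Turn GeoJSON Point features into simple item dicts, guided by config."""
    collection = json.loads(geojson_text)
    items = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") != "Point":
            continue  # this sketch only handles point locations
        # GeoJSON coordinate order is longitude first, then latitude.
        lon, lat = geometry["coordinates"][:2]
        props = feature.get("properties") or {}
        items.append({
            "item_type": config["item_type"],
            "title": props.get(config["title_property"], ""),
            "date": props.get(config["date_property"]),
            "longitude": lon,
            "latitude": lat,
        })
    return items
```

With this shape, supporting a new feed would mean writing a new dictionary rather than a new script.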

And once we've reduced the problem to one of configuration, we can make it possible to do that configuration through the OpenBlock admin UI, without using the command line at all.

It might also be possible to use something like Yahoo Pipes to massage other source data into a format that one of our standard scripts could handle.

Can I see some examples?

Our demo site loads Boston news via the scripts in http://github.com/openplans/openblock/tree/master/obdemo/obdemo/scrapers/. There are examples for numerous other cities in http://github.com/openplans/openblock/tree/master/everyblock/everyblock/cities/.

What kind of data can I parse?

Theoretically anything, but see Ideal Feed Formats.

Is there a scraper-writing tutorial?

Yes! It's here: http://openblockproject.org/docs/scraper_tutorial.html

Scraper Infrastructure Notes

A Wiki of (non-openblock) Scrapers

  ScraperWiki might be a good resource for finding scraper scripts. Many are in Python. They'd have to be modified to convert the data into OpenBlock NewsItems, but this can be a great source of ideas and even code.