wiki:Scraper Infrastructure Notes

Comparison of ebdata.retrieval vs. ebdata.blobs

                                                      BLOBS                RETRIEVAL
Uses templatemaker to strip extraneous HTML           YES                  NO
Crawling dumb sites (e.g. incremental page IDs)       YES                  NO
Stores records of crawling in database                YES #note1           YES #note2 #note3
Handles arbitrary attributes                          NO                   YES
Really short scraper scripts                          YES #note4           NO #note4
Double-checks location when both location_name        NO                   YES, see safe_location(),
  and geom are provided                                                    but not automatic
Auto-geocodes location_name if needed                 YES (geotagging.py)  YES, via create_newsitem()
Parses multiple locations per crawled page and        YES (geotagging.py)  NO; you must call
  auto-creates multiple NewsItems                                          create_newsitem() manually
Supports multiple schemas in one scraper              NO?                  YES
Can fetch & parse without saving to db (for testing)  NO?                  YES, display_data()
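
For concreteness, here is a minimal sketch of a retrieval-style scraper. The module path, create_newsitem(), and display_data() come from this page; the class name NewsItemListDetailScraper, the hook methods (list_pages, parse_list, clean_list_record, save), get_html(), and the class attributes are assumptions about the framework's conventions, so verify them against the ebdata source:

    import re

    from ebdata.retrieval.scrapers.newsitem_list_detail import NewsItemListDetailScraper

    class ExampleReportScraper(NewsItemListDetailScraper):
        schema_slugs = ('example-reports',)  # hypothetical schema slug
        has_detail = False  # assumes all data lives on the list pages

        def list_pages(self):
            # Yield raw HTML for each listing page to crawl.
            yield self.get_html('http://example.com/reports/')  # hypothetical URL

        def parse_list(self, page):
            # One dict per record; a real scraper would use lxml or
            # ebdata.parsing rather than a regex.
            for m in re.finditer(r'<li>(?P<title>.+?) @ (?P<location_name>.+?)</li>', page):
                yield m.groupdict()

        def clean_list_record(self, record):
            record['title'] = record['title'].strip()
            return record

        def save(self, old_record, list_record, detail_record):
            if old_record is not None:
                return  # seen on a previous run; nothing to do
            # Per the table: create_newsitem() geocodes location_name
            # when no geometry is supplied.
            self.create_newsitem(
                attributes={},
                title=list_record['title'],
                location_name=list_record['location_name'],
            )

    if __name__ == '__main__':
        # Fetch and parse without touching the database -- the testing
        # workflow from the last row of the table.
        ExampleReportScraper().display_data()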

notes:

#=note1 blobs stores crawl history as Page objects, each of which holds the text of the crawled page, a .when_crawled timestamp, and a fair amount of other metadata.
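
Because every crawled page is kept, crawl history can be inspected with ordinary Django ORM queries. A small sketch, assuming only the .when_crawled field named above:

    from datetime import datetime, timedelta

    from ebdata.blobs.models import Page

    # Pages crawled in the last week, newest first. Only when_crawled
    # is documented above; the rest is stock Django ORM.
    cutoff = datetime.now() - timedelta(days=7)
    recent = Page.objects.filter(when_crawled__gte=cutoff).order_by('-when_crawled')
    print('%d pages crawled in the last week' % recent.count())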

#=note2 retrieval.scrapers.newsitem_list_detail stores only a timestamp of when each schema was last scraped, by creating an ebpub.db.models.DataUpdate instance, which holds just some basic statistics. Scraped content is not saved.
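
In other words, "when was this schema last scraped?" is answered from the DataUpdate table rather than from stored pages. A sketch; the field names update_finish and num_added are assumptions, since only the DataUpdate model itself is named above:

    from ebpub.db.models import DataUpdate, Schema

    schema = Schema.objects.get(slug='example-reports')  # hypothetical slug
    try:
        # 'update_finish' and 'num_added' are assumed field names.
        latest = DataUpdate.objects.filter(schema=schema).latest('update_finish')
        print('last scraped %s, %s added' % (latest.update_finish, latest.num_added))
    except DataUpdate.DoesNotExist:
        print('schema has never been scraped')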

#=note3 retrieval.scrapers.new_newsitem_list_detail creates instances of ebdata.retrieval.models.ScrapedPage (the content of a crawled page plus a bit of metadata; much simpler than blobs.models.Page) and NewsItemHistory (just a many-to-many mapping of ScrapedPages to NewsItems).
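
That mapping makes it possible to walk from a NewsItem back to the raw page(s) it was scraped from. A sketch with guessed field names, since note 3 only names the two models:

    from ebdata.retrieval.models import NewsItemHistory

    def source_pages(news_item):
        # 'news_item' and 'page' are hypothetical field names; note 3
        # only establishes that the mapping exists.
        return [h.page for h in NewsItemHistory.objects.filter(news_item=news_item)]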

#=note4 Anecdotally, scrapers written against ebdata.blobs tend to be shorter than those written against ebdata.retrieval.