wiki:Scraper Infrastructure Notes

Comparison of ebdata.retrieval vs. ebdata.blobs

                                                      BLOBS                RETRIEVAL
Uses templatemaker to strip extraneous HTML           YES                  NO
Crawling dumb sites (e.g. incremental page IDs)       YES                  NO
Stores records of crawling in database                YES #note1           YES #note2 #note3
Handles arbitrary attributes                          NO                   YES
Really short scraper scripts                          YES #note4           NO #note4
Double-checks location when both location_name        NO                   YES, see safe_location(),
  and geom are provided                                                    but not automatic
Auto-geocodes location_name if needed                 YES (geotagging.py)  YES, via create_newsitem()
Parses multiple locations per crawled page and        YES (geotagging.py)  NO; you must call
  auto-creates multiple NewsItems                                          create_newsitem() manually
Supports multiple schemas in one scraper              NO?                  YES
Can fetch & parse without saving to db (for testing)  NO?                  YES, display_data()
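
For concreteness, here is a minimal sketch of a retrieval-style scraper. The module path, create_newsitem(), and display_data() come from this page; the class name NewsItemListDetailScraper, the hook methods (list_pages, parse_list, clean_list_record, save), get_html(), and the class attributes are assumptions about the framework's conventions, so verify them against the ebdata source:

    import re

    from ebdata.retrieval.scrapers.newsitem_list_detail import NewsItemListDetailScraper

    class ExampleReportScraper(NewsItemListDetailScraper):
        schema_slugs = ('example-reports',)  # hypothetical schema slug
        has_detail = False  # assumes all data lives on the list pages

        def list_pages(self):
            # Yield raw HTML for each listing page to crawl.
            yield self.get_html('http://example.com/reports/')  # hypothetical URL

        def parse_list(self, page):
            # One dict per record; a real scraper would use lxml or
            # ebdata.parsing rather than a regex.
            for m in re.finditer(r'<li>(?P<title>.+?) @ (?P<location_name>.+?)</li>', page):
                yield m.groupdict()

        def clean_list_record(self, record):
            record['title'] = record['title'].strip()
            return record

        def save(self, old_record, list_record, detail_record):
            if old_record is not None:
                return  # seen on a previous run; nothing to do
            # Per the table: create_newsitem() geocodes location_name
            # when no geometry is supplied.
            self.create_newsitem(
                attributes={},
                title=list_record['title'],
                location_name=list_record['location_name'],
            )

    if __name__ == '__main__':
        # Fetch and parse without touching the database -- the testing
        # workflow from the last row of the table.
        ExampleReportScraper().display_data()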

notes:

#=note1 blobs stores crawl history as Page objects, each of which holds the text of the crawled page, a .when_crawled timestamp, and a fair amount of other metadata.
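
Because every crawled page is kept, crawl history can be inspected with ordinary Django ORM queries. A small sketch, assuming only the .when_crawled field named above:

    from datetime import datetime, timedelta

    from ebdata.blobs.models import Page

    # Pages crawled in the last week, newest first. Only when_crawled
    # is documented above; the rest is stock Django ORM.
    cutoff = datetime.now() - timedelta(days=7)
    recent = Page.objects.filter(when_crawled__gte=cutoff).order_by('-when_crawled')
    print('%d pages crawled in the last week' % recent.count())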

#=note2 retrieval.scrapers.newsitem_list_detail stores only a timestamp of when each schema was last scraped, by creating an ebpub.db.models.DataUpdate instance, which holds just some basic statistics. Scraped content is not saved.
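
In other words, "when was this schema last scraped?" is answered from the DataUpdate table rather than from stored pages. A sketch; the field names update_finish and num_added are assumptions, since only the DataUpdate model itself is named above:

    from ebpub.db.models import DataUpdate, Schema

    schema = Schema.objects.get(slug='example-reports')  # hypothetical slug
    try:
        # 'update_finish' and 'num_added' are assumed field names.
        latest = DataUpdate.objects.filter(schema=schema).latest('update_finish')
        print('last scraped %s, %s added' % (latest.update_finish, latest.num_added))
    except DataUpdate.DoesNotExist:
        print('schema has never been scraped')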

#=note3 retrieval.scrapers.new_newsitem_list_detail creates instances of ebdata.retrieval.models.ScrapedPage (the content of a crawled page plus a bit of metadata; much simpler than blobs.models.Page) and NewsItemHistory (just a many-to-many mapping of ScrapedPages to NewsItems).
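
That mapping makes it possible to walk from a NewsItem back to the raw page(s) it was scraped from. A sketch with guessed field names, since note 3 only names the two models:

    from ebdata.retrieval.models import NewsItemHistory

    def source_pages(news_item):
        # 'news_item' and 'page' are hypothetical field names; note 3
        # only establishes that the mapping exists.
        return [h.page for h in NewsItemHistory.objects.filter(news_item=news_item)]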

#=note4 Anecdotally, scrapers written against ebdata.blobs tend to be shorter than those written against ebdata.retrieval.