Comparison of ebdata.retrieval vs. ebdata.blobs
Feature | BLOBS | RETRIEVAL |
---|---|---|
Uses templatemaker to strip extraneous HTML | YES | NO |
Crawling "dumb" sites (e.g., incremental page IDs) | YES | NO |
Stores records of crawling in database | YES #note1 | YES #note2 #note3 |
Handles arbitrary attributes | NO | YES |
Really short scraper scripts | YES #note4 | NO #note4 |
Double-checks location when both location_name and geom are provided | NO | YES, see safe_location(), but not automatic |
Auto-geocodes location_name if needed | YES (geotagging.py) | YES, create_newsitem() |
Parses multiple locations per crawled page and auto-creates multiple NewsItems? | YES (geotagging.py) | NO, you have to call create_newsitem() manually. |
Supports multiple schemas in one scraper | NO? | YES |
Can fetch & parse without saving to db (for testing) | NO? | YES, display_data() |
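The retrieval side of the table revolves around a "list/detail" control flow: walk one or more list pages, parse each record out of them, clean it, and save it. The sketch below is a self-contained illustration of that flow, with HTTP fetching, geocoding, and NewsItem persistence stubbed out. The class and method names (list_pages, parse_list, clean_list_record, save) mirror the pattern used by ebdata.retrieval's scraper base classes, but this is a standalone mock under assumed names, not the real API.

```python
import re

class ListDetailScraper:
    """Minimal skeleton of the list/detail scraper pattern:
    walk list pages, parse records, clean each one, save it."""
    def update(self):
        for page in self.list_pages():
            for record in self.parse_list(page):
                self.save(self.clean_list_record(record))

class PoliceReportScraper(ListDetailScraper):
    """Hypothetical example scraper with hard-coded 'pages'."""
    def list_pages(self):
        # A real scraper would fetch URLs here; this yields canned HTML.
        yield ('<li date="2009-07-01">Burglary at Main St</li>'
               '<li date="2009-07-02">Vandalism at Oak Ave</li>')

    def parse_list(self, page):
        # Turn each <li> into a dict of raw field values.
        for m in re.finditer(r'<li date="([^"]+)">([^<]+)</li>', page):
            yield {"date": m.group(1), "title": m.group(2)}

    def clean_list_record(self, record):
        record["title"] = record["title"].strip()
        return record

    def save(self, record):
        # The real code would create a NewsItem here (via something
        # like create_newsitem()); we just collect the records.
        self.saved = getattr(self, "saved", [])
        self.saved.append(record)
```

Because save() is the only method that touches the database, overriding it (or calling the parse methods directly) is what makes the fetch-and-parse-without-saving testing style in the last table row possible.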
Notes:
#=note1 blobs stores crawl history as Page objects, which contain the text of the crawled page, a .when_crawled timestamp, and a fair amount of other metadata.
#=note2 retrieval.scrapers.newsitem_list_detail stores only a timestamp of when each schema was last scraped, by creating an ebpub.db.models.DataUpdate instance, which holds just some basic statistics. Scraped content is not saved.
#=note3 retrieval.scrapers.new_newsitem_list_detail creates instances of ebdata.retrieval.models.ScrapedPage (the content and a bit of metadata about a crawled page; much simpler than blobs.models.Page) and NewsItemHistory (just an m2m mapping of ScrapedPages to NewsItems).
#=note4 anecdotally, scrapers written against ebdata.blobs tend to be shorter.
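The templatemaker row in the table is the core of what makes blobs scrapers short: given several pages rendered from the same layout, templatemaker infers the shared template and extracts only the parts that vary. The real library does this in C at the character level; the following is a rough, self-contained token-level sketch of the same idea using difflib, with all names (make_template, extract, HOLE) invented for illustration.

```python
import re
from difflib import SequenceMatcher

# Tokenize HTML into tags, words, and whitespace runs so the diff
# operates on meaningful units rather than single characters.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+|\s+")
HOLE = object()  # sentinel marking a variable slot in the template

def make_template(page_a, page_b):
    """Derive a template from two pages with the same layout: tokens
    common to both are kept, differing spans become HOLEs."""
    a, b = TOKEN.findall(page_a), TOKEN.findall(page_b)
    sm = SequenceMatcher(None, a, b, autojunk=False)
    template, a_end, b_end = [], 0, 0
    for i, j, size in sm.get_matching_blocks():
        if i > a_end or j > b_end:  # a differing span in either page
            template.append(HOLE)
        template.extend(a[i:i + size])
        a_end, b_end = i + size, j + size
    return template

def extract(template, page):
    """Return the text filling each HOLE in page, or None if the
    page does not fit the template."""
    segments, current = [], []
    for tok in template:
        if tok is HOLE:
            segments.append("".join(current))
            current = []
        else:
            current.append(tok)
    segments.append("".join(current))
    pattern = "(.*?)".join(re.escape(seg) for seg in segments)
    m = re.fullmatch(pattern, page, re.DOTALL)
    return list(m.groups()) if m else None
```

Trained on two pages, the template then strips the boilerplate from any further page with the same layout, leaving only the record data; this is why a blobs scraper often needs little more than a URL pattern.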