diff --git a/Scrape-Stage.md b/Scrape-Stage.md
new file mode 100644
index 0000000..7286fd1
--- /dev/null
+++ b/Scrape-Stage.md
@@ -0,0 +1,23 @@
+The Scrape pipeline stage replaces the content that an RSS item carries in the feed with content scraped from the item's web page.
+
+### `extractor`
+A string, either `builtin` or the module name of a site-specific extractor (see below).
+
+### `convert_to_data_uris`
+A boolean that controls whether images in posts are fetched from the web, converted to data URIs, and injected into RSS items.
+
+Enabling this option significantly increases the database size, because images are stored directly in the DB.
+
+**Note:** This option may be disabled by server administrators and is restricted to certain MIME types (PNG, JPG, TIFF, HEIF, and HEIC).
+
+## Extractors
+Extractors define how the contents of a web page are isolated from the rest of the page. The `builtin` extractor uses a general-purpose algorithm to isolate and extract the contents of a web page, but it may be unreliable for some websites. For this reason, a number of site-specific extractors are also provided:
+
+- beckyhansmeyer.com: `Frenzy.Pipeline.Extractor.BeckyHansmeyer`
+- daringfireball.net: `Frenzy.Pipeline.Extractor.DaringFireball`
+- ericasadun.com: `Frenzy.Pipeline.Extractor.EricaSadun`
+- finertech.com: `Frenzy.Pipeline.Extractor.FinerTech`
+- 512pixels.net: `Frenzy.Pipeline.Extractor.FiveTwelvePixels`
+- macstories.net: `Frenzy.Pipeline.Extractor.MacStories`
+- om.co: `Frenzy.Pipeline.Extractor.OmMalik`
+- whatever.scalzi.com: `Frenzy.Pipeline.Extractor.WhateverScalzi`
\ No newline at end of file
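
For reference, the `convert_to_data_uris` behaviour documented in the page above can be illustrated with a minimal Python sketch. This is not Frenzy's implementation: the function name `to_data_uri`, the `ALLOWED_MIME_TYPES` set, and the exact MIME strings are assumptions chosen to mirror the formats listed in the note (PNG, JPG, TIFF, HEIF, HEIC).

```python
import base64

# Hypothetical allow-list mirroring the formats named in the note.
# The exact MIME strings the server checks are an assumption.
ALLOWED_MIME_TYPES = {
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/heif",
    "image/heic",
}


def to_data_uri(image_bytes, mime_type):
    """Return a base64 data URI, or None if the MIME type is not allowed."""
    # Drop any parameters after ";" and normalise case before checking.
    normalized = mime_type.split(";")[0].strip().lower()
    if normalized not in ALLOWED_MIME_TYPES:
        return None
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{normalized};base64,{encoded}"
```

Base64 encodes every 3 bytes as 4 characters, so each embedded image takes roughly a third more space than the original file, which is consistent with the page's warning about database size.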