Update page 'Scrape Stage'

2019-11-02 03:19:13 +00:00 · 2019-11-02 03:19:13 +00:00 · 7023569ef8
parent 4651e08553
commit 7023569ef8
1 changed file with 23 additions and 0 deletions
--- a/Scrape-Stage.md
+++ b/Scrape-Stage.md
@@ -0,0 +1,23 @@
+The Scrape pipeline stage replaces the content that an RSS item carries in the feed with content scraped from the item's web page. It accepts the options below.
+
+### `extractor`
+A string: either `builtin` or the module name of a site-specific extractor (see Extractors below).
+
+### `convert_to_data_uris`
+A boolean that controls whether images in posts should be fetched from the web, converted to data URIs and injected into RSS items.
+
+Enabling this option will significantly increase the database size, because the image data is stored directly in the database.
+
+**Note:** Server administrators may disable this option, and it only applies to images whose MIME type corresponds to PNG, JPEG, TIFF, HEIF, or HEIC.
+
+## Extractors
+Extractors define how the content of a web page is isolated from the rest of the page. The `builtin` extractor uses a general-purpose algorithm to isolate and extract the content, but it may be unreliable for some websites. For this reason, a number of site-specific extractors are also included:
+
+- beckyhansmeyer.com: `Frenzy.Pipeline.Extractor.BeckyHansmeyer`
+- daringfireball.net: `Frenzy.Pipeline.Extractor.DaringFireball`
+- ericasadun.com: `Frenzy.Pipeline.Extractor.EricaSadun`
+- finertech.com: `Frenzy.Pipeline.Extractor.FinerTech`
+- 512pixels.net: `Frenzy.Pipeline.Extractor.FiveTwelvePixels`
+- macstories.net: `Frenzy.Pipeline.Extractor.MacStories`
+- om.co: `Frenzy.Pipeline.Extractor.OmMalik`
+- whatever.scalzi.com: `Frenzy.Pipeline.Extractor.WhateverScalzi`
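
The `convert_to_data_uris` behaviour described in this page can be illustrated with a minimal Python sketch. This is not Frenzy's implementation (Frenzy is written in Elixir). The function name `to_data_uri` and the `ALLOWED_MIME_TYPES` set are hypothetical, and the allowlist is an assumption derived from the formats named in the note.

```python
import base64
from typing import Optional

# Assumed allowlist: the standard MIME types for the formats the note names.
ALLOWED_MIME_TYPES = {
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/heif",
    "image/heic",
}


def to_data_uri(image_bytes: bytes, mime_type: str) -> Optional[str]:
    """Return a base64 data URI, or None if the MIME type is not allowed."""
    # Drop any parameters (e.g. "; charset=...") and normalize case.
    essence = mime_type.split(";", 1)[0].strip().lower()
    if essence not in ALLOWED_MIME_TYPES:
        return None
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{essence};base64,{encoded}"


if __name__ == "__main__":
    # An allowed type yields a data URI; a disallowed type yields None.
    print(to_data_uri(b"\x89PNG", "image/png"))
    print(to_data_uri(b"GIF89a", "image/gif"))
```

Base64 encodes every 3 bytes as 4 characters, so an embedded image takes roughly a third more space than the original file, which contributes to the database growth mentioned for this option.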