Update page 'Scrape Stage'

Shadowfacts 2019-11-02 03:19:13 +00:00
parent 4651e08553
commit 7023569ef8
1 changed files with 23 additions and 0 deletions

Scrape-Stage.md Normal file

@@ -0,0 +1,23 @@
The Scrape pipeline stage replaces the content that an RSS item carries in the feed itself with content scraped from the item's web page.
### `extractor`
A string, either `builtin` or the module name of a specific extractor (see below).
### `convert_to_data_uris`
A boolean that controls whether images in posts should be fetched from the web, converted to data URIs and injected into RSS items.
Enabling this option significantly increases the size of the database, because the image data is stored directly in the database.
**Note:** This option may be disabled by server administrators and is restricted to certain MIME types (PNG, JPG, TIFF, HEIF, and HEIC).
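To make the conversion concrete, here is a minimal sketch of turning image bytes into a data URI while honoring a MIME allowlist. It illustrates the general technique only and is not Frenzy's implementation: the function name `to_data_uri`, the `ALLOWED_MIME_TYPES` set, and the exact MIME strings chosen for the formats listed above are all assumptions.

```python
import base64
from typing import Optional

# Assumed MIME strings for the formats named in the note above
# (PNG, JPG, TIFF, HEIF, HEIC); the real allowlist may differ.
ALLOWED_MIME_TYPES = {
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/heif",
    "image/heic",
}


def to_data_uri(image_bytes: bytes, mime_type: str) -> Optional[str]:
    """Return a base64 data URI, or None if the MIME type is not allowed."""
    normalized = mime_type.strip().lower()
    if normalized not in ALLOWED_MIME_TYPES:
        # Hypothetical behavior: leave the image untouched when not allowed.
        return None
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{normalized};base64,{encoded}"
```

Base64 encoding emits four characters for every three bytes of input, so each embedded image takes roughly a third more space than the original file, which adds to the database growth described above.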
## Extractors
Extractors define how the content of a web page is isolated from the rest of the page. The `builtin` extractor uses a general-purpose algorithm to isolate and extract the content, but it may be unreliable for some websites. For this reason, there are also a number of bundled extractors for specific websites:
- beckyhansmeyer.com: `Frenzy.Pipeline.Extractor.BeckyHansmeyer`
- daringfireball.net: `Frenzy.Pipeline.Extractor.DaringFireball`
- ericasadun.com: `Frenzy.Pipeline.Extractor.EricaSadun`
- finertech.com: `Frenzy.Pipeline.Extractor.FinerTech`
- 512pixels.net: `Frenzy.Pipeline.Extractor.FiveTwelvePixels`
- macstories.net: `Frenzy.Pipeline.Extractor.MacStories`
- om.co: `Frenzy.Pipeline.Extractor.OmMalik`
- whatever.scalzi.com: `Frenzy.Pipeline.Extractor.WhateverScalzi`
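
The list above can be read as a lookup table from hostname to the value to put in the `extractor` option. The sketch below shows one way to pick that value for a given item URL, falling back to `builtin` for unlisted sites. The helper `extractor_for` and the `SITE_EXTRACTORS` table are hypothetical and only mirror the list; nothing here implies that Frenzy selects extractors automatically.

```python
from urllib.parse import urlparse

# Hostname -> extractor module name, copied from the list above.
SITE_EXTRACTORS = {
    "beckyhansmeyer.com": "Frenzy.Pipeline.Extractor.BeckyHansmeyer",
    "daringfireball.net": "Frenzy.Pipeline.Extractor.DaringFireball",
    "ericasadun.com": "Frenzy.Pipeline.Extractor.EricaSadun",
    "finertech.com": "Frenzy.Pipeline.Extractor.FinerTech",
    "512pixels.net": "Frenzy.Pipeline.Extractor.FiveTwelvePixels",
    "macstories.net": "Frenzy.Pipeline.Extractor.MacStories",
    "om.co": "Frenzy.Pipeline.Extractor.OmMalik",
    "whatever.scalzi.com": "Frenzy.Pipeline.Extractor.WhateverScalzi",
}


def extractor_for(url: str) -> str:
    """Suggest a value for the `extractor` option for the given URL."""
    host = (urlparse(url).hostname or "").lower()
    # Treat "www.example.com" the same as "example.com" (an assumption).
    if host.startswith("www."):
        host = host[len("www."):]
    return SITE_EXTRACTORS.get(host, "builtin")
```

For example, `extractor_for("https://www.macstories.net/news/")` yields the MacStories module name, while a URL on an unlisted host yields `builtin`.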