Jaehyun Shin cc5e07271a Update README.md 2016-04-25 10:03:16 +09:00

Readability

Readability is a tool for extracting and curating the primary readable content of a webpage.
Check out the documentation for full, detailed guides.

Installation

If available in Hex, the package can be installed as:

  1. Add readability to your list of dependencies in mix.exs:
```elixir
def deps do
  [{:readability, "~> 0.3"}]
end
```
  2. Ensure readability is started before your application:
```elixir
def application do
  [applications: [:readability]]
end
```

Usage

In the examples below, the html variable holds the HTML source of the blog post "Elixir Design Goals".

Examples


```elixir
### Extract the title
Readability.title(html)
#=> Elixir Design Goals

### Extract the primary content with transformed html.
html
|> Readability.article()
|> Readability.raw_html()
#=>
# <div><div class=\"entry-content\"><p>During the last year,
# ...
# ... out our sidebar for other learning resources.</p></div></div>

### Extract only text from the primary content.
html
|> Readability.article()
|> Readability.readable_text()
#=>
# During the last year, we have spoken at many conferences spreading the word about Elixir. We usually s.....
# ...
# ... started guide, or check out our sidebar for other learning resources.
```
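
The calls above can be combined into one helper that returns the title and the readable text together. The following is only a sketch: the ArticleSummary module, the summarize function, and the extractor parameter are hypothetical names that are not part of this package. It assumes only the Readability.title, Readability.article, and Readability.readable_text functions shown above.

```elixir
defmodule ArticleSummary do
  # Hypothetical helper, not part of the readability package.
  # `extractor` defaults to the Readability module; any module that exposes
  # title/1, article/1 and readable_text/1 can be passed instead (e.g. in tests).
  def summarize(html, extractor \\ Readability) do
    article = extractor.article(html)

    %{
      title: extractor.title(html),
      text: extractor.readable_text(article)
    }
  end
end

# With the package installed and `html` bound to a page source:
# ArticleSummary.summarize(html)
```

Passing the extractor as an argument keeps the helper usable without a network fetch or the real dependency, which can make unit tests simpler.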

Options

You may pass options (a keyword list) to Readability.article. The supported keys and their defaults are:

  • retry_length \\ 250
  • min_text_length \\ 25
  • remove_unlikely_candidates \\ true
  • weight_classes \\ true
  • clean_conditionally \\ true
  • remove_empty_nodes \\ true
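
As an illustration of how caller-supplied options typically combine with defaults like these, here is a minimal sketch using Keyword.merge from the standard library. The OptionsSketch module and its resolve function are hypothetical and only mirror the defaults listed above; this is not necessarily how the library resolves options internally.

```elixir
defmodule OptionsSketch do
  # Defaults copied from the option list in this README.
  @defaults [
    retry_length: 250,
    min_text_length: 25,
    remove_unlikely_candidates: true,
    weight_classes: true,
    clean_conditionally: true,
    remove_empty_nodes: true
  ]

  # Values supplied by the caller override the defaults key by key.
  def resolve(opts \\ []), do: Keyword.merge(@defaults, opts)
end

opts = OptionsSketch.resolve(min_text_length: 50, weight_classes: false)
IO.inspect(opts[:min_text_length]) # prints 50
IO.inspect(opts[:retry_length])    # prints 250

# With the package installed, the same keyword list is passed as the
# second argument to Readability.article:
# article = Readability.article(html, min_text_length: 50, weight_classes: false)
```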

Test

To run the test suite:

$ mix test

Todo

  • Extract authors
  • Extract images
  • Extract videos
  • Convert relative paths into absolute paths of img#src and a#href
  • More configurable
  • Command line interface

Related projects

  • readability.js is a standalone version of the readability library used for Firefox Reader View.
  • newspaper is an advanced news extraction, article extraction, and content curation library for Python.
  • ruby-readability is a tool for extracting the primary readable content of a webpage.

LICENSE

This code is released under the Apache License 2.0. See http://www.apache.org/licenses/LICENSE-2.0.