Readability

Readability is a library for extracting and curating articles.
Check out the documentation for full and detailed guides.

Installation

If available in Hex, the package can be installed as:

  1. Add readability to your list of dependencies in mix.exs:
```elixir
def deps do
  [{:readability, "~> 0.3"}]
end
```
  2. Ensure readability is started before your application:
```elixir
def application do
  [applications: [:readability]]
end
```

Usage

To parse a document, you must first prepare an HTML string. In the examples below, the html variable holds the HTML of the Elixir Design Goals page.
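
One way to obtain such a string is to read a saved copy of the page from disk. The sketch below is only an illustration: the file name page.html is hypothetical, and the snippet writes a small stand-in document first so that it runs on its own. Any source that yields a binary (a file, or the response body from an HTTP client) should work the same way.

```elixir
# Hypothetical path; substitute the location of your saved page.
path = Path.join(System.tmp_dir!(), "page.html")

# Stand-in document so the sketch is self-contained.
File.write!(path, """
<html>
  <head><title>Elixir Design Goals</title></head>
  <body><p>During the last year, ...</p></body>
</html>
""")

# `html` is a plain binary (an Elixir string) holding the page markup.
html = File.read!(path)
IO.puts(is_binary(html))
```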

Examples


### Extract the title
Readability.title(html)
#=> Elixir Design Goals

### Extract the content with transformed html.
content = Readability.content(html)
Readability.raw_html(content)
#=>
# <div><div class=\"entry-content\"><p>During the last year,
# ...
# ...
# or check out our sidebar for other learning resources.</p></div></div>

### Extract the text only content.
Readability.readable_text(content)
#=>
# During the last year, we have spoken at many conferences spreading the word about Elixir. We usually s.....
# ...
# ...
# started guide, or check out our sidebar for other learning resources.

Options

You may pass options (as a keyword list) to Readability.content, including:

  • retry_length: 250 (default)
  • min_text_length: 25 (default)
  • remove_unlikely_candidates: true (default)
  • weight_classes: true (default)
  • clean_conditionally: true (default)
  • remove_empty_nodes: true (default)
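
As a sketch of how these options might be combined, the snippet below keeps the defaults listed above in a keyword list and overrides a subset with Keyword.merge/2. The variable names are illustrative, and passing the list as the second argument to Readability.content is an assumption based on the description above, so that call is left as a comment.

```elixir
# Defaults as listed above.
defaults = [
  retry_length: 250,
  min_text_length: 25,
  remove_unlikely_candidates: true,
  weight_classes: true,
  clean_conditionally: true,
  remove_empty_nodes: true
]

# Override only what you need; values from the second list win in Keyword.merge/2.
opts = Keyword.merge(defaults, min_text_length: 50, weight_classes: false)

IO.inspect(opts[:min_text_length])
IO.inspect(opts[:retry_length])

# Assumed call shape (not verified here):
# content = Readability.content(html, opts)
```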

Test

To run the test suite:

$ mix test

TODO

  • Extract an author
  • Extract images
  • Convert relative paths in img#src and a#href into absolute paths
  • More configurable
  • Command line interface

Related projects

  • readability.js is a standalone version of the readability library used for Firefox Reader View.
  • newspaper is an advanced news extraction, article extraction, and content curation library for Python.
  • ruby-readability is a tool for extracting the primary readable content of a webpage.

LICENSE

This code is under the Apache License 2.0. See http://www.apache.org/licenses/LICENSE-2.0.