Readability

Readability is a tool for extracting and curating the primary readable content of a webpage.
Check out the documentation for full and detailed guides.

Installation

If available in Hex, the package can be installed as:

Add readability to your list of dependencies in mix.exs:

def deps do
  [{:readability, "~> 0.9"}]
end

Ensure readability is started before your application:

def application do
  [applications: [:readability]]
end

Note: Readability requires Elixir 1.3 or higher.

Usage

Examples

Just pass a URL

url = "https://medium.com/@kenmazaika/why-im-betting-on-elixir-7c8f847b58"
summary = Readability.summarize(url)

summary.title
#=> "Why I’m betting on Elixir"

summary.authors
#=> ["Ken Mazaika"]

summary.article_html
#=>
# <div><div><p id=\"3476\"><strong><em>Background: </em></strong><em>I’ve spent...
# ...
# ...button!</em></h3></div></div>

summary.article_text
#=>
# Background: I’ve spent the past 6 years building web applications in Ruby and.....
# ...
# ... value in this article, it would mean a lot to me if you hit the recommend button!

From raw HTML

# Extract the title.
Readability.title(html)

# Extract authors.
Readability.authors(html)

# Extract the primary content as transformed HTML.
html
|> Readability.article()
|> Readability.readable_html()

# Extract only the text of the primary content.
html
|> Readability.article()
|> Readability.readable_text()

# Extract the primary images with Floki.
html
|> Readability.article()
|> Floki.find("img")
|> Floki.attribute("src")

Options

If the result differs from what you expect, you can pass options.

Example

url = "https://medium.com/@kenmazaika/why-im-betting-on-elixir-7c8f847b58"
summary = Readability.summarize(url, [clean_conditionally: false])

min_text_length \\ 25
remove_unlikely_candidates \\ true
weight_classes \\ true
clean_conditionally \\ true
retry_length \\ 250

You can find the other algorithm and regex options in readability.ex.
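The defaults listed above can be thought of as a keyword list that caller-supplied options override. The following is a minimal sketch of that idea using only the Elixir standard library; the `OptionsSketch` module and its `resolve/1` function are hypothetical names for illustration and are not part of Readability, whose internal handling of options may differ.

```elixir
defmodule OptionsSketch do
  # Default values copied from the option list in this README.
  @defaults [
    min_text_length: 25,
    remove_unlikely_candidates: true,
    weight_classes: true,
    clean_conditionally: true,
    retry_length: 250
  ]

  # Hypothetical helper: overlay the caller's options on the defaults.
  # Keys given by the caller win; every other key keeps its default.
  def resolve(opts \\ []) do
    Keyword.merge(@defaults, opts)
  end
end

opts = OptionsSketch.resolve(clean_conditionally: false, min_text_length: 50)

IO.inspect(opts[:clean_conditionally], label: "clean_conditionally")
IO.inspect(opts[:min_text_length], label: "min_text_length")
IO.inspect(opts[:retry_length], label: "retry_length")
```

Under this model, only the options you name change, so `Readability.summarize(url, [clean_conditionally: false])` would leave the remaining defaults as listed.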

Test

To run the test suite:

$ mix test

Todo

Extract authors
More configurable
Summarize function
Convert relative paths into absolute paths of img#src and a#href

Contributions are welcome!

Check out the main features milestone and the features of the related projects below.

Related and Inspired Projects

readability.js is a standalone version of the readability library used for Firefox Reader View.
newspaper is an advanced news extraction, article extraction, and content curation library for Python.
ruby-readability is a tool for extracting the primary readable content of a webpage.

LICENSE

This code is under the Apache License 2.0. See http://www.apache.org/licenses/LICENSE-2.0.
