# Readability

[![Build Status](https://travis-ci.org/keepcosmos/readability.svg?branch=master)](https://travis-ci.org/keepcosmos/readability)
[![Readability version](https://img.shields.io/hexpm/v/readability.svg)](https://hex.pm/packages/readability)
[![Deps Status](https://beta.hexfaktor.org/badge/all/github/keepcosmos/readability.svg)](https://beta.hexfaktor.org/github/keepcosmos/readability)

Readability is a tool for extracting and curating the primary readable content of a webpage.  
See the [documentation](https://hexdocs.pm/readability/Readability.html) for detailed guides.

## Installation

The package is [available in Hex](https://hex.pm/packages/readability) and can be installed as follows:

  1. Add readability to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [{:readability, "~> 0.9"}]
end
```

  2. Ensure readability is started before your application:

```elixir
def application do
  [applications: [:readability]]
end
```

Note: Readability requires Elixir 1.3 or higher.

## Usage

### Examples

#### Pass a URL
```elixir
url = "https://medium.com/@kenmazaika/why-im-betting-on-elixir-7c8f847b58"
summary = Readability.summarize(url)

summary.title
#=> "Why I’m betting on Elixir"

summary.authors
#=> ["Ken Mazaika"]

summary.article_html
#=>
# <div><div><p id=\"3476\"><strong><em>Background: </em></strong><em>I’ve spent...
# ...
# ...button!</em></h3></div></div>

summary.article_text
#=>
# Background: I’ve spent the past 6 years building web applications in Ruby and.....
# ...
# ... value in this article, it would mean a lot to me if you hit the recommend button!
```

#### From raw HTML

```elixir
# `html` holds the page's raw HTML as a string.

# Extract the title.
Readability.title(html)

# Extract the authors.
Readability.authors(html)

# Extract the primary content as transformed HTML.
html
|> Readability.article()
|> Readability.readable_html()

# Extract only the text of the primary content.
html
|> Readability.article()
|> Readability.readable_text()

# Extract the image URLs of the primary content with Floki.
html
|> Readability.article()
|> Floki.find("img")
|> Floki.attribute("src")
```

### Options

If the result differs from what you expect, you can pass options.

#### Example
```elixir
url = "https://medium.com/@kenmazaika/why-im-betting-on-elixir-7c8f847b58"
summary = Readability.summarize(url, [clean_conditionally: false])
```

* min_text_length \\\\ 25
* remove_unlikely_candidates \\\\ true
* weight_classes \\\\ true
* clean_conditionally \\\\ true
* retry_length \\\\ 250

**You can find other algorithm and regex options in `readability.ex`**
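
When the default options drop too much content, one approach is to try progressively looser option sets until the extracted text is long enough. The following is a minimal sketch of that idea, not part of Readability: the `OptionFallback` module, its `first_long_enough/3` function, and the length threshold are all hypothetical names chosen for illustration.

```elixir
defmodule OptionFallback do
  @moduledoc """
  Hypothetical helper, not part of Readability: tries option sets in
  order until the extracted text is long enough.
  """

  # `extract` is any 1-arity function that takes an options keyword list
  # and returns the extracted text as a string.
  # Returns `{opts, text}` for the first option set that yields at least
  # `min_chars` characters, or `nil` if none does.
  def first_long_enough(extract, option_sets, min_chars) do
    Enum.find_value(option_sets, fn opts ->
      text = extract.(opts)
      if String.length(text) >= min_chars, do: {opts, text}
    end)
  end
end
```

With Readability, `extract` could be `fn opts -> Readability.summarize(url, opts).article_text end`, and `option_sets` a list such as `[[], [clean_conditionally: false]]`. Passing the extractor as a function keeps the helper independent of the network, so it can be exercised with a stub.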

## Test

To run the test suite:

    $ mix test

## Todo

* [x] Extract authors
* [x] More configurable
* [x] Summarize function
* [ ] Convert relative paths in `img#src` and `a#href` into absolute paths
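
Until the unchecked item above is implemented, relative links can be resolved by the caller. The sketch below uses `URI.merge/2` from the Elixir standard library; the `AbsoluteLinks` module and its function names are hypothetical and not part of Readability.

```elixir
defmodule AbsoluteLinks do
  @moduledoc """
  Hypothetical helper, not part of Readability: resolves relative
  `src`/`href` values against the URL the page was fetched from.
  """

  # Resolve one possibly-relative reference against the page URL.
  # References that are already absolute are returned unchanged.
  def absolutize(ref, base_url) do
    base_url
    |> URI.merge(ref)
    |> URI.to_string()
  end

  # Resolve a list of references, such as the output of Floki.attribute/2.
  def absolutize_all(refs, base_url) do
    Enum.map(refs, &absolutize(&1, base_url))
  end
end
```

For example, `AbsoluteLinks.absolutize("images/a.png", "https://example.com/blog/post.html")` returns `"https://example.com/blog/images/a.png"`. The list produced by the Floki image example can be piped into `AbsoluteLinks.absolutize_all(url)`. Note that `URI.merge/2` raises an `ArgumentError` if the base URL is not absolute.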

**Contributions are welcome!**

Check out [the main features milestone](https://github.com/keepcosmos/readability/milestones) and the features of the related projects below.

## Related and Inspired Projects

* [readability.js](https://github.com/mozilla/readability) is a standalone version of the readability library used for Firefox Reader View.
* [newspaper](https://github.com/codelucas/newspaper) is an advanced news extraction, article extraction, and content curation library for Python.
* [ruby-readability](https://github.com/cantino/ruby-readability) is a tool for extracting the primary readable content of a webpage.

## LICENSE

This code is under the Apache License 2.0. See <http://www.apache.org/licenses/LICENSE-2.0>.