# Readability

[![Build Status](https://travis-ci.org/keepcosmos/readability.svg?branch=master)](https://travis-ci.org/keepcosmos/readability)
[![Readability version](https://img.shields.io/hexpm/v/readability.svg)](https://hex.pm/packages/readability)

Readability is a tool for extracting and curating the primary readable content of a webpage.

Check out the [documentation](https://hexdocs.pm/readability/Readability.html) for full and detailed guides.

## Installation

The package is available in [Hex](https://hex.pm/packages/readability) and can be installed as follows:

1. Add `readability` to your list of dependencies in `mix.exs`:

   ```elixir
   def deps do
     [{:readability, "~> 0.5.0"}]
   end
   ```

2. Ensure `readability` is started before your application:

   ```elixir
   def application do
     [applications: [:readability]]
   end
   ```

## Usage

### Examples

#### Just pass a URL

```elixir
url = "https://medium.com/@kenmazaika/why-im-betting-on-elixir-7c8f847b58"
summary = Readability.summarize(url)

summary.title
#=> "Why I’m betting on Elixir"

summary.authors
#=> ["Ken Mazaika"]

summary.article_html
#=>
# <div><div><p id=\"3476\"><strong><em>Background: </em></strong><em>I’ve spent...
# ...
# ...button!</em></h3></div></div>

summary.article_text
#=>
# Background: I’ve spent the past 6 years building web applications in Ruby and.....
# ...
# ... value in this article, it would mean a lot to me if you hit the recommend button!
```

#### From raw HTML

```elixir
# `html` is a string containing the raw HTML of the page.

### Extract the title.
Readability.title(html)

### Extract authors.
Readability.authors(html)

### Extract the primary content with transformed html.
html
|> Readability.article()
|> Readability.readable_html()

### Extract only text from the primary content.
html
|> Readability.article()
|> Readability.readable_text()

### You can extract the primary images with Floki.
html
|> Readability.article()
|> Floki.find("img")
|> Floki.attribute("src")
```

### Options

If the result differs from what you expect, you can pass options.

#### Example

```elixir
url = "https://medium.com/@kenmazaika/why-im-betting-on-elixir-7c8f847b58"
summary = Readability.summarize(url, [clean_conditionally: false])
```

Options and their default values:

* `min_text_length` (default: `25`)
* `remove_unlikely_candidates` (default: `true`)
* `weight_classes` (default: `true`)
* `clean_conditionally` (default: `true`)
* `retry_length` (default: `250`)

**You can find other algorithm and regex options in `readability.ex`.**
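
As a rough illustration of how caller-supplied options can override the defaults listed above, here is a minimal sketch using `Keyword.merge/2`. The `OptionsSketch` module is hypothetical and written only for this example; the library's actual option handling lives in `readability.ex` and may differ.

```elixir
defmodule OptionsSketch do
  # Default values copied from the options list above.
  # This module is hypothetical and not part of Readability.
  @defaults [
    min_text_length: 25,
    remove_unlikely_candidates: true,
    weight_classes: true,
    clean_conditionally: true,
    retry_length: 250
  ]

  # Options passed by the caller take precedence over the defaults.
  def merge(opts \\ []), do: Keyword.merge(@defaults, opts)
end

opts = OptionsSketch.merge(clean_conditionally: false, min_text_length: 50)

IO.inspect(opts[:clean_conditionally]) # overridden by the caller
IO.inspect(opts[:min_text_length])     # overridden by the caller
IO.inspect(opts[:retry_length])        # falls back to the default
```

Only the keys you pass are changed, so `summarize(url, clean_conditionally: false)` in the example above leaves every other setting at its default.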

## Test

To run the test suite:

    $ mix test

## Todo

* [x] Extract authors
* [x] More configurable
* [x] Summarize function
* [ ] Convert relative paths into absolute paths of `img#src` and `a#href`
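
For anyone interested in the open item above, one possible starting point is the standard library's `URI.merge/2`, which resolves a relative reference against a base URL. The sketch below is not part of Readability; the `AbsoluteUrlSketch` module and the `example.com` URLs are made up for illustration.

```elixir
defmodule AbsoluteUrlSketch do
  # Resolve `path` against `base_url`.
  # Already-absolute URLs are returned unchanged by URI.merge/2.
  def absolutize(base_url, path) do
    base_url
    |> URI.merge(path)
    |> URI.to_string()
  end
end

base = "https://example.com/articles/post"

IO.puts(AbsoluteUrlSketch.absolutize(base, "../images/a.png"))
#=> https://example.com/images/a.png

IO.puts(AbsoluteUrlSketch.absolutize(base, "/img/b.png"))
#=> https://example.com/img/b.png
```

A real implementation would also need to know the URL the article was fetched from, and to apply this to every `src` and `href` attribute in the extracted content.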

**Contributions are welcome!**

Check out [the main features milestone](https://github.com/keepcosmos/readability/milestones) and the features of the related projects below.

## Related and Inspired Projects

* [readability.js](https://github.com/mozilla/readability) is a standalone version of the readability library used for Firefox Reader View.
* [newspaper](https://github.com/codelucas/newspaper) is an advanced news extraction, article extraction, and content curation library for Python.
* [ruby-readability](https://github.com/cantino/ruby-readability) is a tool for extracting the primary readable content of a webpage.

## LICENSE

This code is under the Apache License 2.0. See <http://www.apache.org/licenses/LICENSE-2.0>.