# Readability
[![Build Status](https://travis-ci.org/keepcosmos/readability.svg?branch=master)](https://travis-ci.org/keepcosmos/readability)
[![Readability version](https://img.shields.io/hexpm/v/readability.svg)](https://hex.pm/packages/readability)

Readability is a tool for extracting and curating the primary readable content of a webpage.

Check out the [documentation](https://hexdocs.pm/readability/Readability.html) for full and detailed guides.

## Installation
If [available in Hex](https://hex.pm/docs/publish), the package can be installed as:

1. Add `readability` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [{:readability, "~> 0.4"}]
end
```

2. Ensure `readability` is started before your application:

```elixir
def application do
  [applications: [:readability]]
end
```

## Usage
### Examples
```elixir
### Get example page (requires the HTTPoison package).
%{status_code: 200, body: html} = HTTPoison.get!("https://medium.com/@kenmazaika/why-im-betting-on-elixir-7c8f847b58")

### Extract the title.
Readability.title(html)
#=> "Why I'm betting on Elixir"

### Extract authors.
Readability.authors(html)
#=> ["Ken Mazaika"]

### Extract the primary content with transformed HTML.
html
|> Readability.article()
|> Readability.readable_html()
#=>
# <div><div><p id=\"3476\"><strong><em>Background: </em></strong><em>I've spent...
# ...
# ...button!</em></h3></div></div>

### Extract only text from the primary content.
html
|> Readability.article()
|> Readability.readable_text()
#=>
# Background: I've spent the past 6 years building web applications in Ruby and.....
# ...
# ... value in this article, it would mean a lot to me if you hit the recommend button!
```

### Options
You may provide options (a keyword list) to `Readability.article`, including the following. The value after `\\` is the default:

* `retry_length \\ 250`
* `min_text_length \\ 25`
* `remove_unlikely_candidates \\ true`
* `weight_classes \\ true`
* `clean_conditionally \\ true`
* `remove_empty_nodes \\ true`
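
As a rough illustration of how keyword options with defaults usually behave in Elixir, the sketch below overlays caller-supplied options on the defaults listed above. `OptionsSketch` and `resolve/1` are hypothetical names made up for this example and are not part of Readability. The call shape in the final comment (options passed as the second argument to `Readability.article`) is an assumption based on the description above.

```elixir
defmodule OptionsSketch do
  # Defaults copied from the option list above.
  @defaults [
    retry_length: 250,
    min_text_length: 25,
    remove_unlikely_candidates: true,
    weight_classes: true,
    clean_conditionally: true,
    remove_empty_nodes: true
  ]

  # Hypothetical helper (not part of Readability): overlays the caller's
  # keyword options on the defaults, so unspecified keys keep their defaults.
  def resolve(opts \\ []), do: Keyword.merge(@defaults, opts)
end

# Override two options and keep the remaining defaults.
opts = OptionsSketch.resolve(min_text_length: 50, clean_conditionally: false)

# Assumed call shape, with `html` holding the page source:
#   Readability.article(html, min_text_length: 50, clean_conditionally: false)
```

Only the keys you pass are overridden. Everything else keeps the default shown in the list.
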
## Test
To run the test suite:

    $ mix test

## Todo
* [x] Extract authors
* [ ] Extract Images
* [ ] Extract Videos
* [ ] Convert relative paths into absolute paths of `img#src` and `a#href`
* [ ] More configurable
* [ ] Command line interface
## Related and Inspired Projects
* [readability.js](https://github.com/mozilla/readability) is a standalone version of the readability library used for Firefox Reader View.
* [newspaper](https://github.com/codelucas/newspaper) is an advanced news extraction, article extraction, and content curation library for Python.
* [ruby-readability](https://github.com/cantino/ruby-readability) is a tool for extracting the primary readable content of a webpage.
## LICENSE
This code is under the Apache License 2.0. See <http://www.apache.org/licenses/LICENSE-2.0>.