Tristan Havelick

Articles

A Library for Making List Webpages "Readable"

Posted a day ago - 2 minute read

This started as a README for a potential open source project. I decided to make it a blog post to maybe get some feedback or suggestions.

Status

Just an idea at this point.

Description

There are "readability" libraries such as:

These take a webpage containing an article and strip out all the navigation and ads. Most of these work great for pages that are articles, and some, like Mozilla's, do a pretty good job of identifying when a page is NOT an article. However, there isn't much these libraries can currently do with pages that are lists of articles, like:

Here, I'm considering making a library that strips out extraneous side navigation, ads, and other junk from web pages like these, and returns either a list of article URLs or a very simple page with article links in a bulleted list.
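To make the second output option concrete, here is a minimal sketch of what rendering "a very simple page with article links in a bulleted list" might look like. The function name `render_link_page` and its `(text, url)` input shape are assumptions made up for illustration; nothing like this exists yet.

```python
from html import escape


def render_link_page(title, links):
    """Render (text, url) pairs as a bare-bones HTML page with one bulleted list.

    Hypothetical helper: the name and the input shape are illustrative only.
    """
    items = "\n".join(
        # Escape both the URL and the link text so odd characters can't break the markup.
        f'    <li><a href="{escape(url, quote=True)}">{escape(text)}</a></li>'
        for text, url in links
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )
```

A page this plain should display well in a text browser, and the same list of pairs could just as easily be returned directly for callers that only want the URLs.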

By combining this with a standard readability algorithm/library, one could create a simpler and/or text-only view of a website than what is typically rendered by text browsers like w3m and lynx.

A browser built using this could fill a gap in the continuum of browser complexity:

Offpunk < THIS THING < w3m/lynx/elinks < visurf/netsurf/surf < qutebrowser/chrome/firefox

Possible Algorithm

  1. Start with a given HTTP/S URL.
  2. Retrieve that page.
  3. Parse the page and get a list of all the links on it.
  4. Remove any links to domains other than that of the original URL.
  5. Retrieve a few (say 10, maybe chosen at random) of the remaining links.
  6. Get a similar list of links for each of those pages.
  7. Find the links that are common across all of the retrieved pages. It is a reasonable assumption that these are navigational links.
  8. Finally, return to the list of links from the original page and remove the links we've determined are navigational.
  9. At this point, we're left with links only to unique content pages!
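The steps above could be sketched roughly as follows, using only the Python standard library. This is an untested-in-the-wild sketch, not a finished design: the names `LinkCollector`, `fetch_html`, `same_site_links`, and `content_links` are all made up for illustration, "same domain" is assumed to mean an exact host match, and the `fetch` parameter is an assumption added so the logic can be tried against canned pages instead of a live site.

```python
import random
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_html(url):
    """Steps 1-2: retrieve a page over HTTP/S. No JavaScript is executed."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def same_site_links(page_url, html):
    """Steps 3-4: absolute, de-duplicated links on the same host as page_url."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute = urldefrag(urljoin(page_url, href)).url
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def content_links(start_url, fetch=fetch_html, sample_size=10, rng=random):
    """Return the links on start_url that are not shared by the sampled pages."""
    candidates = same_site_links(start_url, fetch(start_url))
    if not candidates:
        return []
    # Step 5: pick a few of the remaining links at random.
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    # Steps 6-7: links present on every sampled page are assumed navigational.
    common = None
    for url in sample:
        links = set(same_site_links(url, fetch(url)))
        common = links if common is None else common & links
    # Steps 8-9: keep only the non-navigational links, in page order.
    return [link for link in candidates if link not in common]
```

Calling something like `content_links("https://example.com/")` would fetch the start page plus up to ten sampled pages. A real version would also need error handling for failed requests, some politeness (rate limiting, robots.txt), and probably a looser rule than "common to all pages", since a single odd page in the sample shrinks the navigational set.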

Flaws

  • If a website makes heavy use of cross-linking in articles, those articles may be unfairly excluded from the final list of articles.
  • For some sites, this would be redundant with RSS/Atom feeds and be of lower quality.
  • This probably wouldn't work (out of the box) for sites that rely on client-side JavaScript to render content.