Monday, September 21, 2020

Reader Experience With 404s

There are many days when I don't feel like working on my project. I use this feeling to "productively procrastinate" on things that I've been wanting to do but haven't done yet. Earlier this week I decided to tackle two related problems:

  1. I want to know which pages are reachable from the home page. I can then review the ones that aren't reachable and consider adding them if they're finished.
  2. I want to make suggestions on the 404 page, but only to pages that are reachable from the home page. There are a whole bunch of random pages I have that aren't finished or useful, and I don't want to use those for suggestions.

To implement this, I parsed each page and found the links using a regular expression pattern and some quick-and-dirty code:

    import re
    import urllib.parse

    # Capture the href of each <a> tag, dropping any "#fragment" suffix
    RE_anchor = re.compile(r'<a[^>]* href="([^"#]+)[^"]*"')

    def url_links_to(relativeurl, html_contents):
        "Return the list of site-relative links found on this page"
        site = "https://www.redblobgames.com"
        urls = []
        for url in RE_anchor.findall(html_contents):
            if url.startswith("mailto:"): continue
            # Resolve relative links against this page's full URL
            url = urllib.parse.urljoin(site + relativeurl, url)
            if not url.startswith(site): continue  # skip external links
            url = url.replace(site, "")
            if not (url.endswith(".html") or url.endswith("/")): continue
            # Normalize directory index pages, except the home page
            if url.endswith("/index.html"): url = url.replace("/index.html", "/")
            if url == "/": url = "/index.html"
            if url in urls: continue
            urls.append(url)
        return urls
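To see what the pattern actually captures, here is a small sketch run against made-up HTML. The page content and the `example.com` URL are hypothetical, chosen only to show the cases: a fragment-only link is skipped, a `#fragment` suffix is dropped, and `urljoin` resolves relative paths but leaves `mailto:` links alone (which is why the function above filters them separately).

```python
import re
import urllib.parse

# Same pattern as in the post: capture the href up to (not including) any "#fragment".
RE_anchor = re.compile(r'<a[^>]* href="([^"#]+)[^"]*"')

# Hypothetical page content, made up for this example.
sample_html = '''
<a href="#top">Back to top</a>
<a class="nav" href="guide.html#setup">Setup</a>
<a href="../about/">About</a>
<a href="mailto:someone@example.com">Email</a>
'''

# The "#top" link yields no match because the capture group needs
# at least one character before the "#".
raw_links = RE_anchor.findall(sample_html)

# Resolve each raw link against the (hypothetical) URL of the page it appeared on.
page_url = "https://example.com/docs/intro/"
resolved = [urllib.parse.urljoin(page_url, link) for link in raw_links]

for raw, full in zip(raw_links, resolved):
    print(raw, "->", full)
```

A regular expression like this only handles double-quoted `href` attributes, which is fine for quick-and-dirty use on pages I control.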

I then used depth-first search to find all the pages reachable from the home page:

    # link_map[url] = url_links_to(url, contents of page)
    def all_reachable_pages(link_map):
        "Return the set of all pages reachable from the home page"
        frontier = ["/index.html"]
        reached = set(frontier)
        while frontier:
            url = frontier.pop()
            if url not in link_map:
                # Linked from somewhere, but no such page was parsed
                print("WARNING: possible 404", url)
                continue
            for child in link_map[url]:
                if child not in reached:
                    frontier.append(child)
                    reached.add(child)
        return reached

For part 1, I made a list of the reachable pages and I plan to review it periodically.
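The review list is the complement of the reachable set. Here is a minimal sketch of that step, assuming a `link_map` whose keys are the pages that exist. The helper name `reachable_from` and all the page paths are made up for illustration.

```python
def reachable_from(link_map, start="/index.html"):
    """Return the set of pages reachable from `start` by following links."""
    frontier = [start]
    reached = {start}
    while frontier:
        url = frontier.pop()
        # Pages missing from link_map have no known outgoing links.
        for child in link_map.get(url, []):
            if child not in reached:
                reached.add(child)
                frontier.append(child)
    return reached

# Hypothetical site: keys are pages that exist, values are the links on each page.
link_map = {
    "/index.html": ["/articles/", "/about.html"],
    "/articles/": ["/articles/one.html", "/index.html"],
    "/articles/one.html": [],
    "/about.html": ["/index.html", "/missing.html"],
    "/drafts/unfinished.html": ["/index.html"],
}

reached = reachable_from(link_map)

# Pages that exist but that no reader can find by clicking from the home page.
unreachable = sorted(set(link_map) - reached)

# Links that were followed but point at pages that do not exist.
broken = sorted(reached - set(link_map))

print("review these:", unreachable)
print("possible 404s:", broken)
```

Note that a page linking to the home page does not make that page reachable. Only links leading away from the home page count.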

For part 2, I want to help readers who encounter a 404 on my site. I looked through the 404 server logs to see what I might be able to help with. I found lots of bogus requests such as wpAdmin and other admin URLs (people trying to break into my server), and also lots of what seemed to be buggy crawlers. But I also found many URLs that seem to come from real humans. These seem to be either from copy/paste or forums automatically linkifying URLs:

The last one looks like a Markdown typo. There are also some that look like escaping/quoting errors:

All of these seem to have an unwanted suffix. I decided to implement a suggestion on the 404 page. I looked for a prefix of the non-matching URL that matched a valid URL. I picked the longest match:

    const request = window.location.pathname;
    let bestUrl = "";
    for (const url of urlsReachableFromHomePage) {
        if (url.length > bestUrl.length
            && request.slice(0, url.length) === url) {
            bestUrl = url;
        }
    }
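The same longest-prefix idea can be sketched in Python, which makes it easy to test offline against requests pulled from the logs. The function name `suggest_url`, the list of valid URLs, and the mangled requests below are all hypothetical.

```python
def suggest_url(request_path, valid_urls):
    """Return the longest valid URL that is a prefix of request_path, or "" if none."""
    best = ""
    for url in valid_urls:
        if len(url) > len(best) and request_path.startswith(url):
            best = url
    return best

# Hypothetical reachable pages.
valid_urls = ["/index.html", "/articles/", "/articles/one.html"]

# Hypothetical mangled requests: trailing punctuation, escaped quoting, a page
# that does not exist, and a path that shares no prefix with any valid URL.
examples = [
    "/articles/one.html).",
    "/articles/one.html%22%3E",
    "/articles/two.html",
    "/nothing",
]

for request_path in examples:
    print(repr(request_path), "->", repr(suggest_url(request_path, valid_urls)))
```

An empty result means there is nothing to suggest. A request for a nonexistent page inside a known directory falls back to that directory, which seems like a reasonable place to send the reader.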

You can try it out by clicking on the broken links above.

This was a relatively low-priority project, but so satisfying.
