There are many days when I don't feel like working on my project. I use this feeling to "productively procrastinate" on things that I've been wanting to do but haven't done yet. Earlier this week I decided to tackle two related problems:
- I want to know which pages are reachable from the home page. I can then review the ones that aren't reachable and consider adding them if they're finished.
- I want to make suggestions on the 404 page, but only to pages that are reachable from the home page. There are a whole bunch of random pages I have that aren't finished or useful, and I don't want to use those for suggestions.
To implement this, I parsed each page and found the links using a regular expression pattern and some quick-and-dirty code:
```python
import re
import urllib.parse

RE_anchor = re.compile(r'<a[^>]* href="([^"#]+)[^"]*"')

def url_links_to(relativeurl, html_contents):
    "Return the list of unique relative links from this page"
    site = "https://www.redblobgames.com"
    urls = []
    for url in RE_anchor.findall(html_contents):
        if url.find("mailto:") == 0: continue
        # resolve relative links against the page they appear on
        url = urllib.parse.urljoin(site + relativeurl, url)
        if not url.startswith(site): continue
        url = url.replace(site, "")
        if not (url.endswith(".html") or url.endswith("/")): continue
        # normalize: directories get a trailing slash, except the home page
        if url.endswith("/index.html"): url = url.replace("/index.html", "/")
        if url == "/": url = "/index.html"
        if url in urls: continue
        urls.append(url)
    return urls
```
I then used depth-first search to find all the pages reachable from the home page:
```python
# link_map[url] = url_links_to(url, contents of page)
def all_reachable_pages(link_map):
    "Return the set of all pages reachable from the home page"
    frontier = ["/index.html"]
    reached = set(frontier)
    while frontier:
        url = frontier.pop()
        if url not in link_map:
            print("WARNING: possible 404", url)
            continue
        for child in link_map[url]:
            if child not in reached:
                frontier.append(child)
                reached.add(child)
    return reached
```
For part 1, I made a list of the reachable pages and I plan to review it periodically.
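The glue between these two functions isn't shown above. Roughly, it's a sketch like this, assuming the site is a local directory of .html files whose paths mirror the URLs (that layout is an assumption on my part):

```python
import pathlib

site_dir = pathlib.Path(".")
link_map = {}
for path in site_dir.rglob("*.html"):
    url = "/" + path.relative_to(site_dir).as_posix()
    # match the normalization in url_links_to: directories get a
    # trailing slash, except the home page stays /index.html
    if url != "/index.html" and url.endswith("/index.html"):
        url = url[: -len("index.html")]
    link_map[url] = url_links_to(url, path.read_text(errors="ignore"))

reachable = all_reachable_pages(link_map)
for url in sorted(set(link_map) - reachable):
    print("not reachable from home page:", url)
```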
For part 2, I want to help readers who encounter a 404 on my site. I looked through the 404 server logs to see what I might be able to help with. I found lots of bogus requests such as wp-admin and other admin URLs (people trying to break into my server), and also lots of what seemed to be buggy crawlers. But I also found many URLs that seem to come from real humans. These seem to be either from copy/paste or from forums automatically linkifying URLs:
- https://www.redblobgames.com/grids/hexagons/).
- https://www.redblobgames.com/pathfinding/a-star/introduction.html%C2%A0%E2%80%A6
- https://www.redblobgames.com/articles/visibility/)and
- https://www.redblobgames.com/grids/hexagons/,
- https://www.redblobgames.com/pathfinding/a-star/implementation.html%5D
- https://www.redblobgames.com/pathfinding/tower-defense/}(https://www.redblobgames.com/pathfinding/tower-defense/)
The last one looks like a Markdown typo. There are also some that look like escaping/quoting errors:
- https://www.redblobgames.com/grids/hexagons/%23map-storage
- https://www.redblobgames.com/pathfinding/a-star/implementation.html1003Introduction%20to%20the%20A*%20Algorithm
- https://www.redblobgames.com/grids/hexagons/"
All of these seem to have an unwanted suffix. I decided to implement a suggestion on the 404 page: look for prefixes of the requested URL that match a valid URL, and pick the longest match:
```js
const request = window.location.pathname;
let bestUrl = "";
for (let url of urlsReachableFromHomePage) {
    if (url.length > bestUrl.length && request.slice(0, url.length) == url) {
        bestUrl = url;
    }
}
```
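The urlsReachableFromHomePage array is the output of part 1. One way to get it into the 404 page (a sketch; the output filename is an assumption) is to serialize the reachable set into a small script that the 404 page loads:

```python
# Export the reachable set for the 404 page's JavaScript.
# The filename "urls-reachable.js" is an assumption; the variable name
# matches the snippet above.
import json

with open("urls-reachable.js", "w") as f:
    f.write("const urlsReachableFromHomePage = ")
    json.dump(sorted(reachable), f)
    f.write(";\n")
```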
You can try it out by clicking on the broken links above.
This was a relatively low-priority project, but so satisfying.