How many sites are reachable via HTTPS and how many are powered by open source?
Ben Balter is a government evangelist at GitHub and a former presidential innovation fellow. This column originally appeared May 11 on BenBalter.com.
In 2011 and then again in 2014, I used a small tool I wrote to crawl every site on the publicly available list of federal executive dot-gov domains to get a better sense of the state of federal IT, at least when it comes to agencies’ public-facing Web presence.
This weekend, I decided to resurrect that effort, with the recently updated list of dot-gov domains and a more finely tuned version of the open source Site Inspector tool, thanks to some contributions from Eric Mill.
You can always compare the results to the original 2011 or 2014 crawls, or browse the entire data set for yourself, but here are some highlights of what I found:
- 1,177 of those domains are live (about 86 percent, up from 83 percent last year, and 73 percent originally)
- Of those live domains, only 75 percent are reachable without the www. prefix, down from 83 percent last year
- 722 sites return an AAAA record, the first step toward IPv6 compliance (up from 64 last year, and 10 before that, more than a tenfold increase)
- 344 sites are reachable via HTTPS (stagnant at one in four from last year), and like last year, only one in 10 enforce it.
- 87 percent of sites have no detectable CMS (the same as last year), with Drupal leading the pack with 123 sites, WordPress with 29 sites (double from last year), and Joomla powering eight (up one from last year)
- Just shy of 40 percent of sites advertise that they are powered by open source server software (e.g., Apache, Nginx), up from about a third last year, with about one in five sites responding that they are powered by closed source software (e.g., Microsoft, Oracle, Sun)
- 61 sites are still somehow running IIS 6.0 (down from 74 last year), a 10-plus-year-old server
- HHS is still the biggest perpetrator of domain sprawl with 117 domains (up from 110 last year), followed by GSA (104, down from 105), Treasury (95, up from 92), and Interior (86, down from 89)
- Only 67 domains have a /developer page, 99 have a /data page, and 74 have a /data.json file, all significantly down from past years, due to more accurate means of calculation, which brings us to
- 255, or just shy of 20 percent of domains, don’t properly return “page not found” or 404 errors, meaning if you programmatically request their /data.json file (or any other nonexistent URL), the server will tell you it’s found the requested file, but really respond with a human-readable “page not found” error, making machine readability especially challenging
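For the curious, the "page not found" check in that last bullet can be sketched in a few lines of Python. This is only an illustration of the idea, not the code Site Inspector actually runs: the function names (random_probe_path, fetch_status, returns_proper_404) are made up for this example, and it assumes that plain HTTP and a single randomly named URL are enough to judge a server.

```python
import uuid
import urllib.error
import urllib.request


def random_probe_path() -> str:
    """Return a path that almost certainly does not exist on any server."""
    return "/" + uuid.uuid4().hex + ".json"


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the final HTTP status code for a GET request to `url`.

    Redirects are followed, so a site that bounces unknown URLs to its
    home page reports 200 here. Network failures (DNS errors, timeouts)
    are not handled and will raise.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still the server's answer.
        return err.code


def returns_proper_404(domain: str, fetch=fetch_status) -> bool:
    """True if the server answers 404 for a made-up URL on `domain`.

    `fetch` is injectable (hypothetical design choice) so the logic can be
    exercised without touching the network.
    """
    status = fetch("http://" + domain + random_probe_path())
    return status == 404
```

A crawler built this way would only trust a 200 response for /data.json on domains where returns_proper_404 comes back True; everywhere else, a "found" answer says nothing about whether the file exists.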
As Eric Mill properly points out, the list now includes legislative and judicial dot-gov domains, and thus isn’t limited to just federal executive dot-govs.
As I’ve said in past years, math’s never been my strong point, so I highly encourage you to check my work. You can browse the full results at dotgov-browser.herokuapp.com or check an individual site (dot-gov or otherwise) at site-inspector.herokuapp.com. The source code for all tools used is available on GitHub. If you find an error, I encourage you to open an issue or submit a pull request.