Wednesday, June 24, 2020

preventing viewers from seeing stale static pages

Until recently, the front page of this site was generated by a CGI script.  There was a program, in a file named index.cgi, that ran every time you visited the front page.  The output of this program was sent to your browser, and this was the page that you saw.  While this setup worked fine, it was a little slow, so I started thinking about ways to speed it up.

Soon, I realized that the front page didn't need any of the dynamic behavior that a CGI script allows.  I decided to just run the script and save the output in a file named index.html.  In this way, I was able to speed up the site without doing a lot of work.  To make changes in the future, I can edit the CGI script, test it in the browser to make sure it works, re-run it on the server, and save the output to index.html.
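Here is a rough sketch of what the "re-run it and save the output" step could look like if you wanted to script it.  It is not the exact procedure I use; the function names are made up for illustration, and it assumes the CGI script prints a header block (such as Content-Type), then a blank line, then the page itself, so the headers need to be stripped before the output is saved as index.html.

```python
import subprocess
from pathlib import Path


def strip_cgi_headers(output: bytes) -> bytes:
    """Drop the CGI header block and return only the document body.

    Assumes the headers end at the first blank line, written either
    as CRLF CRLF or as LF LF.  If no blank line is found, the output
    is returned unchanged.
    """
    hits = [(output.find(sep), sep) for sep in (b"\r\n\r\n", b"\n\n")]
    hits = [(index, sep) for index, sep in hits if index != -1]
    if not hits:
        return output
    index, sep = min(hits)  # earliest blank line wins
    return output[index + len(sep):]


def render_static(script: Path, target: Path) -> None:
    """Run the CGI script once and save its body to the target file."""
    result = subprocess.run([str(script)], capture_output=True, check=True)
    # Write to a temporary file first, then rename it into place, so
    # visitors never receive a half-written page.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(strip_cgi_headers(result.stdout))
    tmp.replace(target)
```

With something like this in place, updating the front page would be a single call such as render_static(Path("index.cgi"), Path("index.html")), run from the directory that holds both files.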

Now, if you visit the site, the front page loads lightning-fast.  If you reload the page by clicking the logo in the top-left, the reload will be even faster: the server returns 304 Not Modified, and your browser can use the version of the page that is sitting in your browser's cache.
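To show where the 304 comes from, here is a simplified sketch of the decision a server makes when a browser sends an If-Modified-Since header along with its request.  The function name is hypothetical, and a real server also handles ETag and If-None-Match, which this sketch ignores.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: datetime,
                       if_modified_since: Optional[str]) -> int:
    """Return 304 if the browser's copy is still current, else 200.

    last_modified must be a timezone-aware datetime (for example, the
    file's modification time in UTC).
    """
    if if_modified_since is None:
        return 200  # nothing cached on the browser's side; send the page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date; fall back to sending the page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates only have one-second resolution, so drop microseconds
    # before comparing.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200
```

A 304 response carries no body, which is why the reload is so quick: the browser just reuses the copy it already has.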

This is all straightforward, right?  Well, not exactly.  Some browsers -- Firefox and Chrome, at least -- will serve a static page directly from the browser cache, without checking with the server first.  If you update the page in between a viewer's visits to the site, the viewer may end up seeing the old version of the page, not the new one.  To circumvent this behavior, the server must explicitly tell the browser to check with the server before showing the cached version of the page.  This can be accomplished by sending the following header with the response:

        Cache-Control: must-revalidate, max-age=0

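must-revalidate tells the browser that it may not show a stale cached copy without first checking with the server.  On its own, though, it does not say when the cached copy becomes stale; the browsers that I tested would often decide that the copy was still fresh and show it straight from the browser cache, with no request to the server at all.  To force the browser to check against the server, I had to set an expiry on the cached page.  That is what max-age=0 is doing there: it marks the cached copy as stale immediately, so every visit triggers a check.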
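How you send the header depends on your web server, so check its documentation for the right setting.  Purely as an illustration, here is a minimal sketch that uses Python's built-in http.server as a stand-in; the class and function names are my own, and this is not the server that runs this site.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class RevalidatingHandler(SimpleHTTPRequestHandler):
    """Serve static files and attach Cache-Control to every response."""

    cache_control = "must-revalidate, max-age=0"

    def end_headers(self):
        # end_headers() runs for 200, 304, and error responses alike,
        # so the header is added to all of them.
        self.send_header("Cache-Control", self.cache_control)
        super().end_headers()


def make_server(directory: str, port: int = 8000) -> ThreadingHTTPServer:
    """Build a local server that serves files from the given directory."""
    handler = partial(RevalidatingHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling make_server(".").serve_forever() from the directory that contains index.html is enough to try it out locally and watch the header in your browser's developer tools.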

There is also a middle ground between "always check with the server" and "never check with the server."  If you are fine with letting someone see a page that is slightly out of date -- let's say an hour -- you can send the header

        Cache-Control: must-revalidate, max-age=3600

This way, someone viewing your site will see a stale page for at most an hour after you update it.  The upside is that this will be faster, both for the browser (it can often show the page without having to use the network) and for your server (you won't have browsers connecting to your site as often).  The benefit is especially significant if the page in question is a "hub" page that you expect people to visit often while navigating your site.
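The difference between max-age=0 and max-age=3600 comes down to a simple comparison between the age of the cached copy and the lifetime in the header.  Here is a simplified sketch of that comparison, with hypothetical function names; real browsers do more than this, including guessing a lifetime when the header gives none.

```python
from typing import Optional


def max_age(cache_control: str) -> Optional[int]:
    """Extract the max-age value, in seconds, from a Cache-Control header."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip('"')
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def needs_server_check(cache_control: str, age_seconds: int) -> bool:
    """Return True if the cached copy is stale and must be revalidated."""
    limit = max_age(cache_control)
    if limit is None:
        # Without max-age, real browsers estimate a lifetime themselves.
        # This sketch does not model that and simply asks the server.
        return True
    # The copy is fresh only while its age is below the lifetime.
    return age_seconds >= limit
```

With max-age=3600, a copy fetched two minutes ago gives needs_server_check(..., 120) == False, so the browser shows it without touching the network.  With max-age=0, the result is always True.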

If you have been on the Internet for a while, you have probably encountered this problem once or twice, when visiting a static site repeatedly.  You will see the latest version of the site the first time you visit, but on later visits, you may have to refresh to see updates. Many web pages tell you to refresh them; this is another manifestation of the problem.  The problem can be avoided, as I just explained, but it is not obvious to everyone that there is a problem (unless you go looking for it!).

That's about it.  If everything in this post was already obvious to you, then good... I am happy for you!  But when I started seeing unexpected behavior around static pages, I had to piece the solution (and the explanation) together from a few different places around the web.  I have written this post in the hope that it will help someone who runs into the same problem that I did.