While developing the updated Gillius.org site, I noticed that pages sometimes would not update when I visited them until I hit refresh, even after I exited the browser and restarted it. I was using Firefox, which I learned makes the problem more apparent than IE does, but the real culprit turned out to be my site's cache settings. I've fixed the problem, so if you see content that doesn't look right, is missing comments, or is out of date, you might have to refresh manually once, but hopefully not again after this. Read on for the details.
I know a bit about HTTP and its headers from general knowledge and from some programming of RESTful web services. What I thought was that when the server returns a Last-Modified header in the HTTP response, clients will always issue later HTTP GETs for that page with the If-Modified-Since header. That means that if the page hasn't been updated since the date in the browser's cache, the server returns a "304 Not Modified" message and not the whole page. Sounds great, right?
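To make that exchange concrete, here is a minimal Python sketch of the decision a server makes when it sees If-Modified-Since. The function name `respond` and its return shape are made up for illustration; a real web server handles more cases than this.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET, honoring If-Modified-Since.

    last_modified must be a timezone-aware UTC datetime.
    """
    # HTTP dates have one-second resolution, so drop microseconds first.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable date: act as if no header was sent.
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)  # Assume UTC.
        if since is not None and last_modified <= since:
            return 304, headers  # Not Modified: no body is sent.
    return 200, headers  # Full response: the page body would follow.
```

The important detail is that the 304 only happens if the client actually sends the request. Nothing in this exchange forces the client to ask.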
Except that I then learned about the Expires and Cache-Control headers. I knew about Expires, but I didn't realize what it REALLY means: if you go to a page before its Expires date, the HTTP cache (which can be your HTTP proxy, if you have one, or the browser itself) is allowed to return the cached version of the content without even contacting the origin website. That was what I didn't understand.
However, that didn't explain it all. My server was sending Last-Modified but not the Expires header, and when I updated the page I checked on the server to make sure the file date was updated and correct. So what was wrong? Well, I pulled out a very handy Firefox extension called Firebug, which allows you to see all of the HTTP traffic (or non-traffic, in my case). What I learned is that even when I restarted Firefox and went to my website, there was absolutely no traffic to it. I looked at the cache and saw that some of the pages I was viewing were set to expire months from now. Based on what I was seeing, Firefox would wait 4 months before trying to hit the page again unless I hit refresh?
Well, I learned from an HttpWatch blog posting that RFC 2616 suggests a heuristic for when Expires isn't set: treat the page as fresh for a fraction, typically 10%, of the time that has passed since Last-Modified. So if the page was last updated 10 weeks ago, the cache treats it as expiring about 1 week after fetching it. Some of the pages on my site haven't been updated since 2006, or 5 years ago, which means Firefox's computed expiration for them is about 6 months out.
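The arithmetic is simple enough to sketch in a few lines of Python. This only illustrates the 10% rule of thumb. The function name is made up, and real browsers may cap or adjust the result in ways this does not model.

```python
from datetime import datetime, timedelta, timezone


def heuristic_lifetime(last_modified, fetched_at, fraction=0.10):
    """Estimate how long a cache may treat a page as fresh.

    Used only when the response carries no Expires or max-age:
    the lifetime is a fraction of the page's age when fetched.
    """
    age = fetched_at - last_modified
    if age < timedelta(0):
        return timedelta(0)  # Last-Modified in the future: no basis for a guess.
    return age * fraction


# A page untouched since mid-2006, fetched in mid-2011.
modified = datetime(2006, 7, 1, tzinfo=timezone.utc)
fetched = datetime(2011, 7, 1, tzinfo=timezone.utc)
print(heuristic_lifetime(modified, fetched).days, "days before revalidating")
```

For a page that is five years old this comes out to roughly half a year, which matches the far-off expiration dates I saw in the Firefox cache.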
I also learned that reaching the Expires date doesn't mean the cached copy goes away. It just means the browser will "revalidate" the page: that is, contact the origin webserver with an If-Modified-Since header, and the server only returns content if it has changed. Also, apparently IE's default settings work more like how I thought browsers always work, which is to revalidate pages in the cache at least once each time you start the browser. Firefox won't do this, even across a restart.
So what is the solution? Instead of setting Expires, you can set Cache-Control's max-age directive to an offset in seconds, which has the same effect as setting Expires to that far in the future from when the browser/proxy fetched the file. You can also set no-cache, which at first I thought meant "don't store it on disk at all", but that is not true. All it means is that (at least for Firefox 5, according to Firebug) the cached page is revalidated against the server each time it is used. I could be wrong for IE; maybe max-age=0 is better. Please comment if you have better suggestions.
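Here is a rough Python sketch of how I now understand a cache to read these two directives. Both function names are made up, and the parser is deliberately simplified: it splits on commas and would mishandle quoted values that contain commas. It also ignores Expires and the heuristic entirely.

```python
def parse_cache_control(value):
    """Split a Cache-Control header into a dict of lowercase directives."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg.strip('"') or None
    return directives


def may_reuse_without_contact(cache_control, age_seconds):
    """True if the stored copy may be served without asking the server."""
    directives = parse_cache_control(cache_control)
    if "no-cache" in directives:
        return False  # Copy is kept, but must be revalidated before each use.
    max_age = directives.get("max-age")
    if max_age is not None and max_age.isdigit():
        return age_seconds < int(max_age)  # Fresh only while younger than max-age.
    return False  # No explicit lifetime given; this sketch plays it safe.
```

With this reading, `max-age=0` and `no-cache` both lead to a revalidation on every visit, which is why I suspect either would work for HTML pages.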
Another post on HttpWatch suggests setting no-cache for HTML/dynamic pages, and "expire forever" for all other resources (images, JS, CSS, etc.). What if you need to change one of these resources? The suggestion is to change the name. This would be annoying, except that adding a query parameter is sufficient. I checked, and I have seen this in practice. If you have site.js and you update it, change your HTML to link to site.js?2. Your HTTP server won't care, but the proxy/browser will load it as a new file. The content of the query string is irrelevant except to make the URL unique.
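If bumping the number by hand gets tedious, one variation is to derive the query string from the file's contents so it changes only when the file does. This is my own twist on the idea, not something from the HttpWatch post, and both the `versioned_url` helper and the `v` parameter name are made up for this sketch.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def versioned_url(url, content):
    """Append a short hash of the file's bytes as a query parameter."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    parts = urlsplit(url)
    # Keep any existing query string and add the version after it.
    query = f"{parts.query}&v={digest}" if parts.query else f"v={digest}"
    return urlunsplit(parts._replace(query=query))


# The link changes only when the bytes of site.js change.
print(versioned_url("/js/site.js", b"function hello() {}"))
```

The HTML generator would call this when writing out script and stylesheet links, so an edited file automatically gets a URL that no cache has seen before.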
I needed a fast fix, and I'm scared I can't follow that pattern yet since I'm still tweaking my site heavily, so for now I just set all resources to expire in 2 hours and HTML pages to no-cache. That doesn't mean resources are totally redownloaded every 2 hours; it just means an If-Modified-Since request is made rather than no request at all, and the server responds "Not Modified" if there has been no change. It's just a quick ping. I'm not concerned, since I get thousands of hits a month and not millions right now. The following .htaccess is what I've got for now:
#Force clients to revalidate cache for "resources" every 2 hours
#This is because I'm still commonly changing resources. If I start to
#add "version query parameters" to js files and such, I can bump this
#up for those types
<FilesMatch "\.(ico|pdf|flv|jpg|jpeg|png|gif|js|css|swf)$">
    Header merge Cache-Control max-age=7200
</FilesMatch>

#Except for HTML files, we want to revalidate each time...
#This handles comments/new posts/page edits as well as forums
<FilesMatch "\.(html|htm|xml|txt|xsl)$">
    Header merge Cache-Control no-cache
</FilesMatch>
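As a quick sanity check of which files fall under which rule, the two patterns can be tried out in Python. This is only a sketch: it assumes Python's `re` treats these simple patterns the same way Apache's regex engine does, and the `cache_policy` function is a made-up stand-in, not anything Apache provides.

```python
import re

# Same patterns as the two FilesMatch blocks in the .htaccess.
RESOURCE = re.compile(r"\.(ico|pdf|flv|jpg|jpeg|png|gif|js|css|swf)$")
DOCUMENT = re.compile(r"\.(html|htm|xml|txt|xsl)$")


def cache_policy(filename):
    """Return the Cache-Control value the rules would add, or None."""
    if RESOURCE.search(filename):
        return "max-age=7200"
    if DOCUMENT.search(filename):
        return "no-cache"
    return None  # Neither rule applies; the server's defaults are used.


for name in ("site.js", "index.html", "PHOTO.JPG", "index.php"):
    print(name, "->", cache_policy(name))
```

Two things stand out when running it: the patterns are case-sensitive, so an uppercase extension like .JPG is not covered, and any file type missing from both lists gets no Cache-Control header from these rules at all.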