Speed up a Drupal web site by preloading the page cache

Topics: Drupal
Technologies: Drupal 5+

Drupal's page cache stores fully-built pages for quick delivery to anonymous visitors to your site. However, any change to your site's content clears the cache soon afterwards. The next visitor to each page must wait for the page to be rebuilt fresh. For some pages, such as long view lists, this may take a few seconds. This article shows how to completely rebuild the page cache automatically so that visitors are always delivered cached pages quickly.

How Drupal's page cache works

Drupal's page cache is stored as a database table containing fully-built HTML pages tagged with the page's URL. When an anonymous visitor asks for a page, Drupal looks up the URL in the table. If an entry is found, the HTML in that entry is returned to the visitor immediately. This takes much less than a second, even on a slow web server.

If a page isn't in the page cache, Drupal goes about the slower process of building the page from scratch by assembling node content, laying out the page, and theming everything. When the page is built, it's saved to the page cache and returned to the visitor. The next visitor to ask for that page gets it quickly from the page cache.

Gradually, as a site is used, the page cache fills in and more and more page requests are satisfied from it. For the best performance, every page should be in the page cache.
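
If you're curious how far along the cache is at any moment, you can count the entries in the page cache table directly. Here's a quick check that assumes a MySQL database named drupal with no table prefix; substitute your own database name and user:

mysql -u dbuser -p drupal -e "SELECT COUNT(*) FROM cache_page;"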

Caching pages for anonymous visitors only

Drupal's page cache is only used for anonymous visitors. Drupal treats all such visitors identically – they all have the same editing permissions (usually none), the same language and time zone settings, and the same theme. This makes it possible to build pages generically, save them in the cache, and deliver the same cached page over and over to any anonymous visitor.

In contrast, logged-in visitors can each have different editing permissions and different language, time zone, and theme settings that make their pages look and act differently from those that other visitors get. It isn't practical for Drupal to cache a unique version of every page for every visitor, so logged-in visitors always get pages built fresh every time.

For most sites, most visitors are anonymous, and for these visitors Drupal's page cache is an essential performance feature. The fuller it is, the better.

Clearing the cache on site changes

If you edit a node, the page displaying that node changes. If that page is in the page cache, it should be removed from the cache so that visitors won't be sent an out-of-date version of the page. If there are other pages that refer to the edited node, such as view lists, then those pages should be removed from the cache too.

Ideally, Drupal would keep track of every page that depends upon every node so that it can remove the right pages from the cache on a node change. Unfortunately, this isn't practical. There are way too many ways in which a node can affect pages throughout a site. A node could be listed in one or more views, shown in a "recent posts" block on every page, listed in an RSS feed, used by a CCK node reference field in another node, or queried by custom theme or page PHP code. There is no way for Drupal to know exactly what pages depend upon what nodes.

There are similar issues with any change to your site's content. Posting a comment to a node, for instance, changes that node's page, but it may also add the comment to a "recent comments" block shown on every page. Changing user information could affect how the user's name is listed on all pages, comments, and forum posts that they've authored. Changing the block layout, editing a menu, configuring a view, adding a poll, adding a new module, altering theme features, changing taxonomy terms, setting site permissions, or updating aggregated news feeds also all change one or more site pages.

All of these changes must remove old pages from the page cache. Because it isn't practical to keep track of exactly what pages need to be removed for a particular change, Drupal always clears the whole page cache on every site change.

Setting the minimum cache lifetime

For a busy site, something is always changing. Every few seconds somebody is posting a comment, adding to a forum, or editing a wiki page. If every one of these changes cleared the page cache, the page cache wouldn't be much use. It would be empty more than it's full and most site visitors would have to wait for Drupal to build pages fresh every time.

To slow down page cache clearing, you can set Drupal's Minimum cache lifetime on the Administer > Site configuration > Performance page. If you set it to the maximum value of 1 day, the page cache is cleared no more often than once a day. This gives the cache a chance to fill in and actually speed up the site.

However, with a cache lifetime of 1 day, site changes won't be visible to anonymous visitors for up to 1 day. Old pages in the cache will continue to be shown even after you've made changes. If you shorten this time so that updates are seen sooner, the cache is cleared more often and visitors may have to wait for a page to be rebuilt. This reduces the overall performance of the site. It's a trade-off you must consider: rapid updates or better performance.

How to preload the page cache

Once the page cache is cleared, the next visit to each site page will build the page fresh and take longer. To avoid making visitors wait longer, you can preload the cache by building and caching every page before any visitor asks for one. Then, when a visitor does ask for the page, the page is already in Drupal's page cache and it's returned quickly.

Crawling the entire site

To fill the page cache with the site's pages you'll need to issue a page request to Drupal for every page at the site. You can do this manually by visiting your own site as an anonymous visitor and clicking through to every page. For any site with more than a few pages, this isn't practical.

To automatically visit every page, you can set up a web spider to crawl your own site. There are many open source spiders available. In this article, I'll use wget. This is an open source command-line program that runs on Linux, Mac, and Windows platforms.

Wget gets a web page from a web server, but it can also process the page to find embedded URLs and then get those pages too. As wget crawls from page to page, it keeps track of where it is so that it doesn't visit the same page twice. It can also be configured to stop at the boundaries of your site so that it doesn't crawl onwards into the web at large.

Wget has a lot of command-line arguments available and it can do much more than crawl a web site. For our purposes, there are eight important arguments which I'll discuss below. While you can type them all in manually, it's convenient to build a script to run the program for you. Here's a bash script for the Mac and Linux. You can create a similar batch command script for Windows. Fill in your own site's domain, of course (the site variable at the top of the listing below). See the Downloads section to download this script.

#!/bin/bash
#
# Preload a web site's cache
#
site="example.com"
tmp="downloads"
log="log.txt"

echo "Crawling $site."

# Remove any prior downloaded files.
rm -rf $tmp

# Crawl the site
time wget \
        --recursive \
        --domains=$site \
        --level=inf \
        --directory-prefix=$tmp \
        --force-directories \
        --delete-after \
        --output-file=$log \
        --no-verbose \
        http://$site/

# When the crawl is done, the download files are removed.
# Now remove the leftover directories too
rm -rf $tmp

echo
echo "Done.  A log of the crawl is in '$log'."

The first few wget arguments control its crawling:

--recursive
Process page URLs and crawl those pages too.
--level=inf
Keep going no matter how deep it gets into your site. If you want to stop wget after a few steps, replace inf (infinite) with a number, as in --level=3.
--domains=SITE
Only crawl pages at your site. Set SITE to your site's domain name, such as --domains=Example.com. If you have more than one domain for the crawl, you can list them with commas between them.

The next few arguments control how wget stores temporary files during a crawl, and deletes them afterwards:

--directory-prefix=DIR
Temporarily store downloaded pages into a directory DIR. For instance, --directory-prefix=tmp puts the pages into a directory named tmp.
--force-directories
Create subdirectories for the downloaded pages.
--delete-after
Delete the downloaded pages after the crawl is done.

And the last few arguments configure its logging of results:

--output-file=FILE
Save a log to FILE. The log lists every page downloaded, and any errors in getting them.
--no-verbose
Keep that log file to a minimum.

The last command-line argument is the page to start the crawl on. This is probably your site's home page.

Once you start this running, walk away for a while. The bigger the site, the longer it'll take to crawl through all of its pages. Expect this to take from minutes to an hour.

Crawling part of a site

The script above crawls the entire site, starting at the home page. To speed up the crawl, you can restrict it to visit only the most important parts of your site. There are several ways to do this. Here are a few arguments to try, followed by an example:

--no-parent
Restrict the crawl to traveling down through a site, and not up. So if you start the crawl at http://Example.com/products it will never leave the products portion of the site.
--include-directories=LIST
Restrict the crawl to a comma-separated list of directories at your site. For example, --include-directories=books,forums would only crawl the books and forums sections of your site.
--exclude-directories=LIST
Restrict the crawl to avoid a comma-separated list of directories at your site. For example, --exclude-directories=shop would skip the shopping pages at your site.
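
For example, to warm only a hypothetical products section of a site, you could add --no-parent to the earlier arguments and start the crawl at that section's front page:

wget \
        --recursive \
        --domains=Example.com \
        --level=inf \
        --no-parent \
        --directory-prefix=downloads \
        --force-directories \
        --delete-after \
        --output-file=log.txt \
        --no-verbose \
        http://Example.com/products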

How to clear the page cache

When your site's minimum cache lifetime is set, any change to the site starts a timer. The first page request that arrives after the timer expires triggers Drupal to clear the page cache automatically.

If you're trying to keep your page cache filled in, Drupal's automatic cache clearing is a problem because it's hard to predict exactly when it will occur. And if you don't know when the cache is cleared, you can't schedule a wget crawl to rebuild the cache afterwards. If the timer expires after you crawl your site, the cache will be cleared and the crawl will have been a waste of time.

To ensure that a crawl is useful, you need to clear the page cache first. There are several manual ways to do this:

  • Disable the page cache and then enable it again on the Administer > Site configuration > Performance page.
  • Change the page cache minimum lifetime to "None" and back again on that same page.
  • Install the Devel module and enable its "Devel" block. Click on the block's "Empty cache" menu item to clear the cache.

To clear the page cache automatically, you need to call Drupal's cache_clear_all function before the crawl. To do this, create a PHP script that clears the cache, such as "clear.php" below. Put this in the top directory for Drupal, next to "cron.php". See the Downloads section to download this script.

<?php
/**
 * Clears the page cache.
 */
// Load Drupal and bootstrap it fully so that the cache API is available.
include_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// Remove every entry from the page cache table.
cache_clear_all('*', 'cache_page', TRUE);

Then trigger a page cache clear by using wget to get the page:

wget --quiet --delete-after http://Example.com/clear.php

Letting just anybody visit this page and clear your cache won't break your site, but it's better if they can't. It's unlikely anyone will guess that the page is there, but you may still want to protect it. You could give it an obscure name, or you could block access to it in your Apache configuration like this:

<FilesMatch "^clear\.php$">
    Order allow,deny
    Allow from 192.168.
</FilesMatch>

Add these lines to the virtual host section of your Apache configuration files to deny access to the page except from hosts with an IP address that starts with 192.168. This is a typical IP address prefix for local networks, but you should set this to something appropriate for your situation.
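
To confirm that the restriction works, request the page from a machine outside the allowed address range and check the HTTP status that wget reports. A blocked request should show a 403 response; the --spider option checks the page without downloading it:

wget --spider --server-response http://Example.com/clear.php

From an allowed address the same command should report a 200 response instead (and, being a successful request, it will also clear the cache).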

How to automatically clear and preload the page cache

Now you can adjust the earlier crawler script to clear the page cache first by requesting the "clear.php" page before it does the crawl (the added wget command below). See the Downloads section to download this script.

#!/bin/bash
#
# Preload a web site's cache
#
site="example.com"
tmp="downloads"
log="log.txt"

echo "Crawling $site."

# Remove any prior downloaded files.
rm -rf $tmp

# Clear the page cache first.
wget --quiet --delete-after http://$site/clear.php

# Crawl the site
time wget \
	--recursive \
	--domains=$site \
	--level=inf \
	--directory-prefix=$tmp \
	--force-directories \
	--delete-after \
	--output-file=$log \
	--no-verbose \
	http://$site/

# When the crawl is done, the download files are removed.
# Now remove the leftover directories too
rm -rf $tmp

echo
echo "Done.  A log of the crawl is in '$log'."

If you run this every midnight, your page cache will always be preloaded for the next day's visitor traffic. This script can be run from anywhere. It doesn't have to be run on your web server. Just make sure your Apache configuration allows access to "clear.php" from whatever IP address(es) you run the crawl from.

To run the script automatically you'll need to use your platform's own mechanisms for running programs periodically. On a Mac, create a launchd property list (plist) manually or use the open source Lingon tool. On Linux, use cron. On Windows, there are several options discussed on Drupal's Configuring cron jobs on Windows page.
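
On Linux, for example, a crontab entry along these lines runs the script every midnight. The paths are placeholders for wherever you keep the script and want its log written:

0 0 * * * /home/me/preload_cache.sh > /home/me/preload_cache_cron.log 2>&1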

Benchmarking page builds

All of the above may be unnecessary if your site runs on a fast dedicated web server or you don't have complex pages to build. In those cases, building a page fresh may not take long enough that you need to worry about it. To find out, benchmark your site by timing page builds for some or all of your site's pages.

There are many ways to measure the time it takes to get a page. Here are a few, but if you want to skip ahead to my recommended method, jump to Benchmarking with curl.

Benchmarking with Safari or Firefox

The Safari browser for the Mac and Windows has built-in features to benchmark page load times. To enable web developer features:

  • Show the preferences window by selecting "Preferences" from the application menu.
  • Click on the window's "Advanced" tab.
  • Check the "Show Develop menu in menu bar" item.
  • Close the window.

You only have to do this the first time. Thereafter the "Develop" menu for web developers is available on the menu bar. To use the developer features, select "Show Web Inspector" from the "Develop" menu. This opens an area in the lower half of the browser that shows you information about the current web page. By default, benchmarking is disabled because it slows down web browsing. To enable it, go to the "Resources" tab and click on the "Enable resource tracking" button.

Once enabled, Safari's "Resources" tab shows a list of files loaded for the current page (HTML, CSS, images, etc.) and a timeline that shows when those files were loaded. Each file has a bar on the timeline and the longer the bar, the longer it took to load. Move your cursor over the bar to see its load time. The top file on the list is always the HTML page.

The Firefox browser for Mac, Linux, and Windows has lots of developer add-ons. I use Firebug. Once installed, click on the bug icon on the status bar at the bottom of the browser. This opens an area in the lower half of the browser to show you information about the current web page. The "Net" tab shows a list of files loaded for the current page (HTML, CSS, images, etc.) and a timeline to show when those files were loaded. Each file has a bar on the timeline, and beside each bar is the number of milliseconds (1/1000th of a second) it took to load the file. The top file is always the HTML page.

To use either browser to benchmark your site, clear your page cache and load pages one by one while jotting down the load times for the HTML page. You don't have to check every page, just the ones most likely to be slow, such as pages with views.

Remember to load pages while logged out of your site since Drupal's page cache is only used by anonymous visitors.

Benchmarking with PHP and curl

Loading pages one by one manually is slow work. To do it automatically, you could write a PHP script to get a web page using PHP's cURL functions, then time each page get. You could even write your own spider in PHP, though it would probably run much slower than wget.

I have several relevant articles at my site if you want to take the PHP route. Writing a PHP script will work fine, but it requires programming. I'd rather use existing tools if I can.

Benchmarking with wget

Unfortunately wget does not report page load times itself, so you can't time pages while you crawl your site. Instead you can use the Mac or Linux time command and run wget for one page at a time:

time wget --quiet --delete-after http://Example.com/

The output looks like this:

real	0m0.155s
user	0m0.002s
sys	0m0.005s

The "real" line gives the time it took to get the page (technically, this is the wall clock time from start to finish, including any delays from a slow network or your own computer pausing to do something else, but it will be close enough to the server's page build time that it can be treated as such).

To get a list of pages at your site, crawl your site first and then process the wget log file like this on Linux or the Mac:

awk '/URL/{ print $3 }' log.txt | sed 's/URL://' > files.txt
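
Each line of a --no-verbose wget log looks roughly like the following (the exact fields vary a little between wget versions), which is why the awk pattern prints the third field and sed then strips off the URL: prefix:

2009-06-01 12:00:00 URL:http://Example.com/node/1 [5263/5263] -> "downloads/Example.com/node/1" [1]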

You could then write a bash script to build up a list of pages and these times, but it's awkward.
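
For completeness, a rough sketch of that loop might look like the following; capturing the output of the shell's time keyword is what makes it clumsy:

for url in `cat files.txt`; do
    tm=`( time wget --quiet --delete-after $url ) 2>&1 | awk '/^real/{ print $2 }'`
    echo $tm $url
done > tmp.txt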

Benchmarking with ab

The open source ab command-line program also gets a web page. Ab stands for "ApacheBench" and it's included with the Apache server. Macs come with ab pre-installed, and you can get it easily for Linux and Windows platforms.

ab is not a spider, so it won't crawl from page to page. However, it does report the time it takes to get a page. Ab has a lot of command-line arguments, but you don't need any of them to simply time a single page request. Just give ab the URL of the page to get:

ab http://Example.com/

The output looks like this:

This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking Example.com (be patient).....done


Server Software:        Apache/2.2.3
Server Hostname:        Example.com
Server Port:            80

Document Path:          /
Document Length:        438 bytes

Concurrency Level:      1
Time taken for tests:   0.34371 seconds
Complete requests:      1
Failed requests:        0
Write errors:           0
Total transferred:      708 bytes
HTML transferred:       438 bytes
Requests per second:    29.09 [#/sec] (mean)
Time per request:       34.371 [ms] (mean)
Time per request:       34.371 [ms] (mean, across all concurrent requests)
Transfer rate:          0.00 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       15   15   0.0     15      15
Processing:    19   19   0.0     19      19
Waiting:       18   18   0.0     18      18
Total:         34   34   0.0     34      34

And the part we need for benchmarking is the "Time taken for tests" line. (Technically, the "Waiting" line near the bottom of the output is the better measure of the server's page build time. You can use this if you like, but doing so won't change the overall picture of which pages are slow.)

To extract that time from the output of ab use awk:

ab http://Example.com/ | awk '/Time taken for tests/{ print $5 }'
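
If you'd rather use the "Waiting" measure mentioned above, a similar pattern pulls its mean value (in milliseconds) out of the connection times table shown earlier:

ab http://Example.com/ | awk '/^Waiting:/{ print $3 }'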

To get a list of pages at your site, crawl your site first and process the wget log file using awk and sed on Linux or the Mac:

awk '/URL/{ print $3 }' log.txt | sed 's/URL://' > files.txt

Now use a bash script to loop through all of those files and run ab to time each one. Clear your page cache first:

for url in `cat files.txt`; do
    tm=`ab $url | awk '/Time taken for tests/{ print $5 }'`
    echo $tm $url
done > tmp.txt

Finally, use sort to sort the output from high (slow) to low (fast). Look at the top few entries of the sorted list to see if your site is slow enough to make page cache preloading worth the effort.

sort -rn tmp.txt > files_times.txt

Benchmarking with curl

My recommended method uses the open source curl command-line program for Mac, Linux, and Windows. The curl command gets a web page and reports the time it took. Unfortunately it can't crawl the site, so you still need wget for that.

While there are a lot of arguments available for curl, the ones you need for benchmarking are these:

--output tmp.txt
Output the page to a temporary file. Without this argument, curl sends the page to the screen, which is inconvenient.
--silent
Don't output any other status information while getting the page. Without this argument, curl prints a table showing the number of bytes and the time it took to get them. That's more information than we need.
--write-out "%{time_total} %{url_effective}\n"
This is the important argument. It asks curl to print one line of output that includes the total time to get the page and the URL of that page.

To time getting a single web page, use:

curl --output tmp.txt --silent --write-out "%{time_total} %{url_effective}\n" http://Example.com/

And the output looks like this, with the time in seconds (here 0.030 seconds, or 30 milliseconds) followed by the page's URL:

0.030 http://Example.com/

To get a list of pages at your site, crawl your site first and process the wget log file using awk and sed on Linux or the Mac:

awk '/URL/{ print $3 }' log.txt | sed 's/URL://' > files.txt

Then loop through those files and time getting each one using curl. Clear your page cache first and use the sort command after the loop to sort the timing results from slowest to fastest.

Here's a bash script for the Mac and Linux. Again, set the site variable at the top of the listing to your own site's domain name. See the Downloads section to download this script.

#!/bin/bash
#
# Benchmark getting each page at the site
#
site='example.com'
log='log.txt'
result='benchmark.txt'

tmp='tmp.txt'
tmp_files='tmp_files.txt'
tmp_times='tmp_times.txt'

echo "Benchmarking $site."

# Check that we have a log file to process to build
# a list of files to get
if [ ! -e $log ]; then
	echo "The crawl log file '$log' is missing.  Please crawl the"
	echo "site first and save the crawl log to '$log'."
	exit 1
fi
rm -f $tmp_files $tmp_times $result $tmp

# Create a list of files
awk '/URL/{ print $3 }' $log | sed 's/URL://' | sort > $tmp_files

# Clear the page cache
curl --output $tmp \
	--silent \
	http://$site/clear.php
rm -f $tmp

# Time getting each file
touch $tmp_times
for url in `cat $tmp_files`; do
	curl --output $tmp \
		--silent \
		--write-out "%{time_total} %{url_effective}\n" \
		$url >> $tmp_times
	rm -f $tmp
done

# Sort the results from highest to lowest
sort -rn $tmp_times > $result
rm -f $tmp_files $tmp_times

echo
echo "Done.  The benchmark results are in '$result'."

Understanding what preloading the page cache will and won't do

Using Drupal's page cache always improves site performance, but the improvement may not be noticeable. When a visitor's web browser requests a page, it also requests all of the CSS, JavaScript, images, and Flash used by that page. The time it takes to get all of that and then render the page for the visitor may swamp the time Drupal took to build the page.

To see where a browser is spending its time, use the Safari or Firefox benchmarking features discussed earlier. Clear your page cache and the browser's cache, then load a page and look at the loading timeline shown by Safari's Web Inspector or Firefox's Firebug. A rule of thumb is that the total page load and render time should be under 3 seconds. Under 1 second would be even better.

For a simple node page, getting the HTML page may be the slowest part of the page load. Using Drupal's page cache will speed this up. Still, much of the time spent getting the page is spent waiting on the network as the browser's request works its way to the server and back. This network latency is a function of the speed of light through network optical fiber and the distance between the visitor's browser and the server. Times of a few tenths of a second are normal. Using Drupal's page cache won't speed up this part.

For a more complex page, such as a view list, Drupal takes longer to build the page, the page is probably bigger, it takes longer to send it to the browser, the HTML is more complex, and the browser spends more time rendering it. If a view list has a lot of images, such as an image gallery or a product list with thumbnails, the browser has to get all of those images too and wait for them to arrive before it can render the page. Getting the images will probably take much longer than getting the HTML page. Speeding up that HTML page by using the page cache won't hurt, but the image load time is the real problem. And the only fix for that is to load fewer images.

Downloads

  • preload_cache.zip
    • Contains the "preload_cache.sh" and "benchmark.sh" bash scripts to crawl a site and benchmark page loads, respectively. The zip file also includes "clear.php" for a Drupal site to clear the page cache on demand. The code is covered by the OSI BSD license so you can use, modify, redistribute, and sell as you see fit.
