mod_perl caching

"All programming is an exercise in caching." - Terje Mathisen

Summary

This document covers different details of web-application-level caching: that is, caching data within a mod_perl process or between mod_perl processes. It does not cover proxy/caching server details, browser caching, etc.

Who caches

Just about everything to do with getting performance out of computers has to do with caching. Even below your application layer, we have the basic hardware layers of:

  * CPU registers
  * on-chip CPU caches (L1, L2)
  * main memory (RAM)
  * disk

When we consider the web, there are many, many layers:

  * the browser's own cache
  * proxy/caching servers between the browser and the web server
  * the web server and application layer (e.g. mod_perl)
  * the database and its internal caches

What can you cache

What we're dealing with here is data caching within a mod_perl process or between several mod_perl processes. Even then, we can consider several types of caching:

  * caching the results of slow database queries
  * caching the results of complex calculations
  * caching generated output, such as rendered HTML

In each case, the general aim is to reduce the amount of time required to perform an operation by using more memory or disk space to store a previously calculated result.

A quick process versus threaded web-server summary

There are two main ways to develop a web server (though hybrids are possible): multi-process and multi-threaded.

In a multi-process system, a number of individual processes are forked, and any new incoming request is handled by one of them. In a multi-threaded system, there is one process, but it has an internal set of threads, each of which handles a new incoming request. Basically you can summarise the differences as follows:

  * memory: each process carries its own copy of the interpreter and its data, so a multi-process server uses considerably more memory
  * data sharing: threads implicitly share all data; processes share nothing unless you explicitly arrange it (e.g. via shared memory)
  * robustness: a crashed process takes down only the request it was handling; a crashed thread can take down the whole server
  * code: a threaded server requires all the code it runs to be thread-safe

Apache server model

Given the above discussion, Apache 1.x uses a traditional multi-process model to handle web requests (except on Windows). Apache 2.0 will have a configurable hybrid model that allows multi-threaded, multi-process, or some in-between combination.

For the moment, this discussion concerns multi-process caching.

How can you cache

As discussed above, caching involves saving the result of a complex calculation/slow query/etc based on the assumption that another web request will soon require the results of that query. Here's a really basic example of what we might do:

use vars qw(%Cache);

sub GetSlowQuery {
  # $dbh is assumed to be an already-connected DBI database handle
  return $Cache{SlowQuery} ||= $dbh->selectall_arrayref('select slow_query_view');
}

This runs the query once and saves the result in the %Cache hash. If we call GetSlowQuery again, it retrieves the result from the hash rather than running the query again.

A couple of important points to note:

  1. If the result of 'select slow_query_view' actually changes, this won't work, because GetSlowQuery will always return the result saved the first time the query was run.
  2. GetSlowQuery is called for the first time separately in each sub-process, so each sub-process ends up with its own copy of the same data rather than the data being shared between processes.
  3. There is no limit to the amount of data that is cached, or the time it is cached for. Each entry in the %Cache hash lives for the life of the process.
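Points 1 and 3 above can be addressed by expiring cache entries. Here's a minimal sketch (the RunSlowQuery name and the $TTL value are illustrative, not part of any real module) that records when each result was cached and recomputes it once it is older than a time-to-live:

```perl
use strict;
use warnings;

my %Cache;
my $TTL = 60;    # illustrative: seconds a cached result stays valid

sub GetSlowQuery {
    my $entry = $Cache{SlowQuery};
    # Recompute if there is no cached entry, or it has expired
    if (!$entry || time() - $entry->{cached_at} > $TTL) {
        $entry = $Cache{SlowQuery} = {
            cached_at => time(),
            value     => RunSlowQuery(),
        };
    }
    return $entry->{value};
}

# Stand-in for the real $dbh->selectall_arrayref(...) call
sub RunSlowQuery {
    return [ ['row1'], ['row2'] ];
}
```

This still caches per-process (point 2), and memory use is only bounded if you also delete expired entries, but it shows the basic shape of time-limited caching.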

Let's say now that we have some module that allows us to store data in shared memory. We could rewrite the above code as:

use Cache::SharedMemory;
use vars qw(%Cache);

tie %Cache, 'Cache::SharedMemory';
sub GetSlowQuery {
  return $Cache{SlowQuery} ||= $dbh->selectall_arrayref('select slow_query_view');
}

In this case, Cache::SharedMemory does some tie magic so that %Cache still behaves like an ordinary hash, but internally the data is stored in shared memory (locked as appropriate).

A couple of important points to note about this:

  1. Again, if the result of 'select slow_query_view' actually changes, this won't work, because GetSlowQuery will always return the result saved the first time the query was run.
  2. Once GetSlowQuery has been called for the first time in any sub-process, every other sub-process will use the same data stored in the shared cache.
  3. There is still no limit to the amount of data that is shared, or the time it is shared for. Each entry in the %Cache hash now lives for the life of the shared memory segment, not just one process.
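Cache::SharedMemory above is just a stand-in name; real CPAN options in this space include IPC::Shareable and the Cache::Cache family. As a dependency-free sketch of the same idea, here's a tiny cross-process cache built on the core Storable module, whose lock_store/lock_retrieve functions take an advisory file lock so concurrent Apache children don't corrupt the data (the cache_set/cache_get names are made up for this example):

```perl
use strict;
use warnings;
use Storable qw(lock_store lock_retrieve);
use File::Temp qw(tempdir);

# In a real server this would be a fixed, configured directory that
# all Apache children can see; a temp dir keeps the example self-contained.
my $cache_dir = tempdir(CLEANUP => 1);

# Store a reference (Storable serialises references) under a
# filesystem-safe key
sub cache_set {
    my ($key, $ref) = @_;
    lock_store($ref, "$cache_dir/$key.cache");
}

# Fetch a previously stored reference, or undef if never cached
sub cache_get {
    my ($key) = @_;
    my $file = "$cache_dir/$key.cache";
    return -e $file ? lock_retrieve($file) : undef;
}
```

Because the data lives in the filesystem rather than in any one process, every child that calls cache_get sees whatever the first child to run the query stored, which is exactly the shared behaviour described in point 2.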

Thus there are two main ways you can cache data:

  * within each process, where each process keeps its own private copy of the cached data
  * between processes, using some form of shared storage (shared memory, files on disk, etc.)

Within each process

Given that the multi-process model doesn't allow any implicit data sharing, why cache any data at all? Caching within each process is the equivalent of keeping some sort of global variable, where each process keeps a separate cached copy of the data. This works reasonably well when the data is basically static and not modified by other Apache processes. Note though that you end up with a copy of the same data within each process: if you have 50 forked Apache servers, and each cache ends up holding 1M of data, you'll really be using 50M of memory.

There is a way around this problem that exploits the fact that most modern operating systems use a 'copy-on-write' technique when forking processes. If you load data into the Apache process at startup, before it forks its children, and that data is only ever read and never written to, then the memory pages holding it will automatically be shared between the processes. Remember though that for this to work the load has to happen before the child processes are forked, which means you need to know in advance the common read-only data that you want to cache.
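The pre-fork trick can be demonstrated outside Apache with a plain fork; in mod_perl the preload step below is what you would do in a startup.pl, which Apache runs before forking its children (preload and child_lookup are names invented for this sketch):

```perl
use strict;
use warnings;

our %StaticData;

# Imagine an expensive load here: parsing config files, building
# lookup tables, etc. Because it runs before fork, copy-on-write
# lets every child share these memory pages for free.
sub preload {
    %StaticData = ( au => 'Australia', de => 'Germany' );
}

# Fork a child that only READS %StaticData (so no pages are copied)
# and report what it saw back through a pipe.
sub child_lookup {
    my ($code) = @_;
    pipe(my $reader, my $writer) or die "pipe: $!";
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {
        close $reader;
        print {$writer} $StaticData{$code};
        close $writer;
        exit 0;
    }
    close $writer;
    my $answer = do { local $/; <$reader> };
    waitpid($pid, 0);
    return $answer;
}
```

The moment any child writes to the shared data, the operating system copies the affected pages for that child alone, so the technique only pays off for genuinely read-only data.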