My sites are hosted on a VPS (with a very sparse memory allocation), so I'm looking to minimise memory usage wherever possible. I also had the joy of watching my Django FCGIs for http://the-hive-mind.com get completely annihilated by metafilter.org a few weeks ago – I think they stood up to the first 200-300 requests.. then completely died.
Anyway, I have just launched a website for a friend in Nepal who runs a trekking business (he comes highly recommended if you ever want to go hiking in that region). I decided to load test the Django instances on the site to see how much traffic it could handle, so I downloaded and installed siege.. which (as the name suggests) lays siege to your web server.
My out-of-the-box Django FCGI, fronted by an Nginx instance, could only reliably handle about 10 concurrent connections before it started dropping requests.. not really acceptable (caching, at this point, was completely turned off).
So I set about sorting out caching. I read the Django documentation and quickly decided that the built-in caching wasn't quite to my liking. I already knew a little about memcached and wanted to use it to cache my generated responses, so the fact that Django supports it was nice. However, the Django cache middleware doesn't really cut it for me: Nginx supports memcached natively, so why would I want to fire the request off to my (inherently bulky and inefficient) Python FCGI instance just to use Python's undoubtedly slower memcached library to return the cached content? I wouldn't.
The solution I've come up with is somewhat simplistic, but it DOES solve my immediate problem… and it's done wonders for the server's load capacity.
THE SOLUTION : DJANGO CREATES THE CACHE OBJECT, BUT NGINX RETRIEVES IT.
The Django caching middleware is halfway to the perfect model: it correctly creates cached objects, but it builds its keys from a strange combination of dot-separated Python words, the URI and an MD5 hexdigest. That seems a little much to expect Nginx to replicate (remembering that my goal here is to avoid hitting the FCGI at all for cached content).
So after some digging and background reading, I decided it would be fun to write a simple middleware for Django that caches my rendered responses under nice, Nginx-friendly keys, so I could then implement the cache lookup directly in my Nginx config.
STEP 1: CREATING THE CACHE OBJECTS: MIDDLEWARE
This is really very simple. I'd never written Django middleware before, but it turned out to be surprisingly straightforward: a middleware is just a Python object that implements any of a handful of hook methods. My NginxMemCacheMiddleWare looks like this:
from django.core.cache import cache
from django.conf import settings
import re

class NginxMemCacheMiddleWare:
    def process_response(self, request, response):
        cacheIt = True
        theUrl = request.get_full_path()
        # only cache GET requests:
        if request.method != 'GET':
            cacheIt = False
        # loop over CACHE_IGNORE_REGEXPS and skip any URL
        # that matches one of them (e.g. the admin).
        for exp in settings.CACHE_IGNORE_REGEXPS:
            if re.match(exp, theUrl):
                cacheIt = False
        if cacheIt:
            key = '%s-%s' % (settings.CACHE_KEY_PREFIX, theUrl)
            cache.set(key, response.content)
        return response
We also need to install our new middleware into the site. I saved the above class definition into a file called NginxMiddleWare.py and installed it into my site-packages (I intend to implement this caching scheme on my other Django sites, including this blog). Then in my settings.py I add 'NginxMiddleWare.NginxMemCacheMiddleWare' to MIDDLEWARE_CLASSES.
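For reference, the relevant bit of settings.py ends up looking something like the sketch below – the other middleware entries here are just placeholders for whatever your project already uses:

MIDDLEWARE_CLASSES = (
    'django.middleware.common.CommonMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    # our response-caching middleware:
    'NginxMiddleWare.NginxMemCacheMiddleWare',
)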
Also in the Django project's settings file we add the following:
CACHE_BACKEND = 'memcached://127.0.0.1:11211/'
CACHE_KEY_PREFIX = '/your-site-name'
CACHE_IGNORE_REGEXPS = (
    r'/admin.*',
)
Firstly we are telling Django to use memcached (I’m assuming you already have memcached set up – go here if you don’t).
I have defined two new settings variables, CACHE_KEY_PREFIX and CACHE_IGNORE_REGEXPS, which let me control the caching. CACHE_KEY_PREFIX lets me store multiple sites in the same memcached instance by creating a unique string key per site, and CACHE_IGNORE_REGEXPS defines a set of URL regular expressions that I do NOT want to cache – like the admin site.
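To make the key format concrete, here is a tiny illustration (the trek path is made up) of the string both sides have to agree on – the middleware builds it with '%s-%s', and the Nginx config below rebuilds the same thing as "/your-site-name-$uri". One caveat worth flagging: request.get_full_path() includes any query string, while Nginx's $uri does not, so requests carrying a query string will simply miss the Nginx cache and fall through to Django.

CACHE_KEY_PREFIX = '/your-site-name'

def cache_key(path):
    # must match Nginx's: set $memcached_key "/your-site-name-$uri";
    return '%s-%s' % (CACHE_KEY_PREFIX, path)

print(cache_key('/'))                  # /your-site-name-/
print(cache_key('/treks/annapurna/'))  # /your-site-name-/treks/annapurna/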
STEP 2: NGINX CONFIGURATION
The configuration of Nginx was a bit fiddly. I really wanted my Django project to continue to run exactly as it did previously – i.e. no silly fake URL prefixes or other cruft to confuse my urls.py.
So I first needed to get Nginx to serve my Django pages off a fake, internal server, like so:
server {
    listen 9004;

    location / {
        # host and port of the fastcgi server
        # @fastcgi_pass unix:/var/www/trekkingnepaltours.com/django.sock;
        fastcgi_pass 127.0.0.1:8004;
        fastcgi_param PATH_INFO $fastcgi_script_name;
        fastcgi_param REQUEST_METHOD $request_method;
        fastcgi_param QUERY_STRING $query_string;
        fastcgi_param CONTENT_TYPE $content_type;
        fastcgi_param CONTENT_LENGTH $content_length;
        fastcgi_pass_header Authorization;
        fastcgi_intercept_errors off;
        include /etc/nginx/fastcgi_params;
    }
}
Pick a high port, then configure your FCGI however you like; the above is what I use, but I'm assuming that if you've read this far you know enough to configure your FCGI under Nginx.
Basically this gives us another internal server that we can talk to, allowing us to use Nginx like a proxy to serve our Django pages. The logic is something like:
check if url is cached
if url IS cached then
    return the cached response
else
    proxy the connection to our django server
So the guts of the logic for the Nginx config are as follows:
location / {
    if ($request_method = POST) {
        proxy_pass http://localhost:9004;
        break;
    }

    default_type "text/html; charset=utf-8";
    set $memcached_key "/your-site-name-$uri";
    memcached_pass localhost:11211;
    error_page 404 502 = /django;
}

location = /django {
    proxy_pass http://localhost:9004;
    break;
}
Before this definition I have a whole bunch of locations set up to handle my static content, so it never reaches this stage. The / location first checks whether it's a POST request; if so, it proxies the request straight to Django – we never cache POSTs. If we get past that point, we set the default type of our response to HTML and UTF-8, then set $memcached_key to the string we used in our Django settings.py, plus a dash, plus Nginx's $uri. Next we pass off to the local memcached to look up the cached object: if it exists, memcached returns it, otherwise we get an error. Errors are handled by our error_page directive, which farms them off to the /django location, which in turn sends the request to the internal Django instance.
So, if there is no cached object, Nginx will get Django to render the page, and the middleware we defined above will save the response into memcached under the correct key.
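If you want to sanity-check that Django and Nginx agree on the keys, you can poke memcached through Django's cache API from ./manage.py shell – something like the snippet below, assuming you've already requested the home page once through Nginx so it has been cached:

from django.conf import settings
from django.core.cache import cache

# Should print the cached home page HTML rather than None:
print(cache.get('%s-/' % settings.CACHE_KEY_PREFIX))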
BUT WAIT – WHAT IF I CHANGE DATA IN THE DB?
Of course, Django is a content management framework and I make extensive use of the provided admin system, so when I add a new trek to http://trekkingnepaltours.com I want the site to update itself. To achieve this we override the save methods on the relevant DB models and get them to clear the cache.
I assume here that you use the standard Django convention of defining a get_absolute_url method on your models; we just override the save method and call Django's cache delete function with the correct cache key to remove it from memcached. Below is the save method off my Photo model:

def save(self):
    # (cache and settings are imported at the top of models.py,
    # just as in the middleware above)
    theUrl = self.trek.get_absolute_url()
    # drop the cached page for the Trek this photo belongs to:
    key = '%s-%s' % (settings.CACHE_KEY_PREFIX, theUrl)
    cache.delete(key)
    # and the cached home page, which also shows the photos:
    key = '%s-/' % (settings.CACHE_KEY_PREFIX)
    cache.delete(key)
    super(Photo, self).save()
As you can see, I'm removing the cached page for the Trek model this photo is associated with (since that's where the actual page lives), and also the cache entry for the home page, where the photos often get rendered too. Obviously we can be as granular as we like in this save method, removing whatever we need to in order to keep the site up to date. We could probably even write some code to wipe the entire cache..
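For what it's worth, here's a minimal sketch of that "wipe the entire cache" idea, using nothing but Django's cache API – note that cache.clear() flushes the whole memcached instance, so it's a very blunt instrument if several sites share the same daemon:

from django.core.cache import cache

def wipe_site_cache():
    # Blunt approach: flushes everything in memcached,
    # not just this site's keys.
    cache.clear()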
THE END RESULT
Siege now shows the site successfully handling 2000-odd concurrent connections under constant load for 1 minute… while hardly registering any memory usage at all on my VPS – problem solved.
DISCLAIMER..
I know this is probably not the most elegant solution, but it has solved my problem. Having said that, if you have ANY ideas about how this could be made better, please leave a comment… and if Django can already do this out of the box, someone smarter than me needs to tell me how..