Hi guys,
I'm having a difficult time diagnosing an intermittent issue where our site's users occasionally get a 408 Request Time-out page. Our setup is a cluster of 3 nginx+gunicorn+django nodes behind a Linode NodeBalancer. (More details below.)
My boss, who uses the site for several hours a day, sees it happen about once a day, but with no discernible consistency or pattern - it doesn't occur on the same page, at the same time, or after the same user action.
Our site isn't live yet, so our traffic is pretty much just a few part-time testers - at most around 6 concurrent users. My load tests show that a single server can handle up to 30 very active users, and Linode's graphs show that all servers were lightly loaded at the time of the most recent incident (<10% CPU, <100 blocks/sec IO) and that the NodeBalancer considered all nodes UP. I didn't see anything interesting in the nginx or app logs from around that time.
Reading up on 408 errors: they're returned when a client opens a connection to the web server but takes too long to finish sending its request. Apparently it can also happen if the socket closes early (but then I don't see how the error page would be delivered at all).
The strange thing is that the 408 error page apparently comes up immediately, without any waiting on the user's part. The relevant nginx settings, client_body_timeout and client_header_timeout, both default to 60s; since the user isn't forced to wait anywhere near that long, I guess that suggests the socket is being disconnected prematurely?
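For reference, here's how I understand a "normal" 408 would be triggered - a rough slow-client sketch in Python that I could point at one web node directly, taking the NodeBalancer out of the picture (the IP below is a placeholder):
Code:
import socket

HOST = "192.0.2.10"  # placeholder - substitute one web node's address

# Open a connection, send an incomplete request, then stall;
# nginx's client_header_timeout (60s by default) should kick in.
sock = socket.create_connection((HOST, 80))
sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n")  # no terminating blank line
sock.settimeout(90)  # wait longer than the 60s timeout

try:
    # Expect "HTTP/1.1 408 Request Time-out" (or a plain close) after ~60s
    print(sock.recv(4096))
except socket.timeout:
    print("no response within 90s")
sock.close()

If that behaves as documented - a 408 only after roughly 60s - then the immediate 408s our users see presumably aren't coming from nginx's own timeouts.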
Until 2 weeks ago, we'd hosted the whole site from a single server for almost a year and never saw a 408 - the error only appeared after we migrated to the cluster we have now.
Our architecture:
NodeBalancer:
- Port: 80
- Protocol: HTTP
- Algorithm: Round Robin
- Session Stickiness: None
- Health Check Type: HTTP Body Regex
- Check Interval: 15s
- Check Timeout: 5s
- Check Attempts: 1
- Check HTTP Path: /heartbeat (trivial view - sketched below the architecture list)
- Expected HTTP Body: Server OK
- Backend configuration for the 3 web nodes: private IPs, Weight: 100, Mode: Accept
3 web servers - Linode 512MB (Fremont), Ubuntu 10.04, all configured identically:
- Nginx (config below), 4 workers
- Gunicorn (a Python app server, similar to Unicorn or Mongrel), 4 workers
- Django 1.3 app
1 DB server - Linode 512MB (Fremont), Ubuntu 10.04:
- MySQL
- Runs Celery (task queue) for our app
And we're using a CDN for (most) static media files.
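For context, the /heartbeat endpoint that the health check hits is essentially a trivial Django view, roughly like this (simplified sketch - the actual wiring differs):
Code:
# urls.py (Django 1.3 style)
from django.conf.urls.defaults import patterns, url
from django.http import HttpResponse

def heartbeat(request):
    # The NodeBalancer's "HTTP Body Regex" check matches this exact text.
    return HttpResponse("Server OK")

urlpatterns = patterns('',
    url(r'^heartbeat$', heartbeat),
)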
So, this problem has me stumped. Since the NodeBalancer is a managed service, I can't check logs or run diagnostics on it (and my network-fu/unix-fu is weak enough that I wouldn't know what to look for anyway), so I was hoping someone might have a clue or suggestion to get me started in the right direction.
* Is there some extra logging I can turn on to give me more clues when it occurs again? (I've sketched an expanded nginx log format just below this list.)
* I could try building another Linode with a manually configured nginx setup as a balancer, to act as a control (rough config below the list), but this would be my first time doing so - I'd also rather hoped to leverage the fact that Linode is better at setting up a balancer than I am.
* I have load tests that I'm in the middle of updating, so my next step is probably to check whether they can trigger the issue. I'm concerned that they can't simulate the full range of interactions real users have with the system, and so may well not trigger it, but it's all I have right now.
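On the logging idea, I'm thinking of adding something like this to the http block - $request_time and $request_length should at least show whether nginx ever sees slow or truncated requests (untested sketch; the format name is arbitrary):
Code:
log_format timing '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent $request_length $request_time '
                  '"$http_referer" "$http_user_agent"';
access_log /var/log/nginx/access.log timing;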
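And for the control balancer, my understanding is it would only need something along these lines (sketch only; the last two IPs are placeholders for the other nodes' private addresses):
Code:
upstream webnodes {
    server 192.168.166.41;   # web node private IPs
    server 192.168.166.42;   # placeholder
    server 192.168.166.43;   # placeholder
}

server {
    listen 80;

    location / {
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://webnodes;
    }
}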
Thanks all, I really appreciate any advice you can give me.
Cheers,
-asavoy
/etc/nginx/nginx.conf
Code:
user www-data;
worker_processes 4;
error_log /var/log/nginx/error.log;
events {
    worker_connections 1024;
    # multi_accept on;
}

http {
    include /etc/nginx/mime.types;
    access_log /var/log/nginx/access.log;

    sendfile on;
    #tcp_nopush on;

    #keepalive_timeout 0;
    keepalive_timeout 65;
    tcp_nodelay on;

    gzip on;
    gzip_http_version 1.1;
    gzip_vary on;
    gzip_comp_level 6;
    gzip_types text/plain text/css application/json text/javascript application/x-javascript application/xml;
    gzip_disable "MSIE [1-6]\.(?!.*SV1)";

    include /etc/nginx/sites-enabled/*;
}
/etc/nginx/sites-enabled/example.com
Code:
server {
    listen 80;
    server_name example.com 192.168.166.41 "";
    root /var/www/example.com;

    access_log /var/www/example.com/log/nginx_access.log;
    error_log /var/www/example.com/log/nginx_error.log;

    client_max_body_size 4G;
    keepalive_timeout 5;

    # Prevents proxied content from being written to temp files on disk;
    # should improve nginx speed and may help resolve 502 Bad Gateway
    # errors in some cases.
    proxy_buffers 32 16k;

    if ($host = 'www.example.com') {
        rewrite ^/(.*)$ http://example.com/$1 permanent;
    }

    location / {
        if (-f /var/www/example.com/maintenance/maintenance.html) {
            return 503;
        }

        location /favicon.ico {
            access_log off;
            empty_gif;
        }
        location /media/ {
            access_log off;
            alias /var/www/example.com/application/public/media/;
        }
        location /admin-media/ {
            access_log off;
            alias /var/www/example.com/application/public/admin-media/;
        }
        location /static/ {
            access_log off;
            alias /var/www/example.com/application/public/static/;
        }

        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_redirect off;

        if (!-f $request_filename) {
            proxy_pass http://127.0.0.1:8000;
            break;
        }

        error_page 503 @maintenance;
        error_page 502 @error502;
        error_page 504 @error504;
    }

    # 503 Service Unavailable
    location @maintenance {
        rewrite ^(.*)$ /maintenance/maintenance.html break;
    }
    # 502 Bad Gateway error
    location @error502 {
        rewrite ^(.*)$ /maintenance/error502.html break;
    }
    # 504 Gateway Timeout error
    location @error504 {
        rewrite ^(.*)$ /maintenance/error504.html break;
    }
}