Fixing a CPU-consuming logrotate
Last week I noticed one CI server was taking a huge amount of time to finish a test suite. It was supposed to run in a few minutes, but was taking hours instead. A few months before, when the server had just been set up, it ran in a few minutes as expected. Here’s how it was diagnosed and fixed.
So I ssh-ed into the server to see what was going on, and saw this:
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6479 root 20 0 413m 409m 860 R 99 40.9 927:36.64 /usr/sbin/logrotate /etc/logrotate.conf
Logrotate was using 99% CPU and 40% memory! Why? Isn’t logrotate a well-known and tested piece of software? It is, but certain circumstances and a bad configuration file can make it do this kind of thing.
Let’s start by tailing the logrotate status file:
$ tail /var/lib/logrotate/status
...
"/usr/local/nginx/logs/access.log.1.1.1.1.1.1.2.gz.1.1.1.1.1.1.1.2.gz.1.1.1.1.1.1.2.gz.1.1.1.1.1.1" 2013-8-7
"/usr/local/nginx/logs/access.log.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1.1.1.2.gz.1.1.1.1" 2013-8-7
"/usr/local/nginx/logs/access.log.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1" 2013-8-7
Aha! A lot of entries for nginx logs, all with strange suffixes. That’s a good lead. Just to get an idea of how many lines that file has, here’s its size:
$ ls /var/lib/logrotate/status -lh
-rw-r--r-- 1 root root 66M Aug 24 08:18 /var/lib/logrotate/status
66 megabytes for a logrotate status file just a few months old. Most of its lines were similar to the ones above. That was the place to look.
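To get a rough idea of how many of those entries refer to the nginx access log (the pattern is just a guess based on the lines above), a quick grep can count them without opening the whole file:
$ grep -c 'nginx/logs/access.log' /var/lib/logrotate/status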
But before doing anything else, stop logrotate:
$ sudo kill -9 6479 # got the PID from the previous top command
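On Ubuntu, logrotate is normally started from /etc/cron.daily/logrotate, so it will come back on the next daily run; until the configuration is fixed, it’s worth double-checking that the runaway process is really gone:
$ pgrep logrotate || echo 'logrotate is not running'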
Now, let’s check nginx logs directory and see what’s in there:
$ ls /usr/local/nginx/logs
...
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.1.2.gz.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.1.2.gz.1.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.2.gz
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.2.gz.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.2.gz.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.2.gz.1.1.1
...
Lots and lots of rotated log files. Seems like the logrotate config for nginx was doing something wrong. Let’s take a look:
$ cat /etc/logrotate.d/nginx
/usr/local/nginx/logs/* {
daily
missingok
rotate 52
compress
delaycompress
notifempty
create 640 nobody root
sharedscripts
postrotate
[ ! -f /usr/local/nginx/logs/nginx.pid ] || kill -USR1 `cat /usr/local/nginx/logs/nginx.pid`
endscript
}
Exactly! See that first line? It’s matching all files in the directory, even the already rotated ones. That means that on the first run access.log was moved to access.log.1. On the next run, both files matched the pattern and both were rotated: access.log again to access.log.1, and the old access.log.1 to access.log.1.1. The same thing happened on every run, each time with even more files to rotate, so the number of files grew exponentially.
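You can actually see the growth in the status file itself: each entry ends with the date of its last rotation, so counting entries per date gives a rough picture of how many files logrotate was tracking on each run (rough because the file’s header line adds one stray count):
$ awk '{print $NF}' /var/lib/logrotate/status | sort | uniq -c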
It had created almost half a million files to date:
$ ls /usr/local/nginx/logs/ | wc -l
504970
Fortunately, it’s easy to fix. Just change that first line to the following:
/usr/local/nginx/logs/*.log {
That way, it only rotates the real .log files.
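Before trusting the change, logrotate’s debug mode is handy: with -d it prints what it would rotate without touching the logs or the status file. Pointing it at just the nginx snippet keeps the output short:
$ sudo logrotate -d /etc/logrotate.d/nginx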
But in order to run logrotate correctly again, it’s a good idea to remove all the previously rotated files. We’d like to keep the current access and error logs, though, so nginx doesn’t complain.
$ rm /usr/local/nginx/logs/{error,access}.log.*
-bash: /bin/rm: Argument list too long
Wow, too many files! Well, it makes sense. We had half a million files, remember?
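One workaround is to let find delete them one by one, which sidesteps the shell’s argument list limit (the patterns below assume the file names shown earlier), although walking half a million entries still takes a while:
$ sudo find /usr/local/nginx/logs -name 'access.log.*' -delete
$ sudo find /usr/local/nginx/logs -name 'error.log.*' -delete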
Here it’s simpler to just remove the whole directory and recreate it. But for that we should stop nginx first (the default nginx.pid is located there too), and start it again afterwards.
$ sudo /etc/init.d/nginx stop
* Stopping Nginx Server... [OK]
$ sudo rm -rf /usr/local/nginx/logs # this could take quite some time, and use a lot of CPU too
$ sudo mkdir /usr/local/nginx/logs # we must be sure to create it again
$ sudo /etc/init.d/nginx start
* Starting Nginx Server... [OK]
Also delete the 66MB logrotate status file. A status file that big can make logrotate use 100% CPU on its own.
$ sudo rm -rf /var/lib/logrotate/status
And that’s it! It should be working fine now. To test it and be completely sure it’s working OK, we can force a run of logrotate:
$ sudo logrotate -f /etc/logrotate.conf
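After the forced run, a quick look at the same places as before should show only the real logs plus their fresh rotations, and a status file back down to a tiny size (that’s the expected outcome, not captured output):
$ ls /usr/local/nginx/logs
$ ls -lh /var/lib/logrotate/status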
And it worked! We can get back to having our CI results in just minutes!
This was happening on an Ubuntu 12.04.1 server with nginx installed via apt-get. It seems to be fixed now, but I hope this helped if you were suffering from the same problem.