Fixing a CPU-consuming logrotate
Last week I noticed one CI server was taking a huge amount of time to finish a test suite. It was supposed to run in a few minutes, but was taking hours instead. A few months before, when the server had just been set up, it ran in a few minutes as expected. Here’s how it was diagnosed and fixed.
So I ssh-ed into the server to see what was going on, and saw this:
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6479 root 20 0 413m 409m 860 R 99 40.9 927:36.64 /usr/sbin/logrotate /etc/logrotate.conf
Logrotate was using 99% CPU and 40% memory! Why? Isn’t logrotate a well-known and tested piece of software? It is, but certain circumstances and a bad configuration file can make it do this kind of thing.
Let’s start by tailing the logrotate status file:
$ tail /var/lib/logrotate/status
...
"/usr/local/nginx/logs/access.log.1.1.1.1.1.1.2.gz.1.1.1.1.1.1.1.2.gz.1.1.1.1.1.1.2.gz.1.1.1.1.1.1" 2013-8-7
"/usr/local/nginx/logs/access.log.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1.1.1.2.gz.1.1.1.1" 2013-8-7
"/usr/local/nginx/logs/access.log.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1" 2013-8-7
Aha! A lot of entries for nginx logs, all with strange suffixes. That’s a good lead. Just to get an idea of how many lines that file has, here’s its size:
$ ls /var/lib/logrotate/status -lh
-rw-r--r-- 1 root root 66M Aug 24 08:18 /var/lib/logrotate/status
66 megabytes for a logrotate status file just a few months old. Most of its lines were similar to the ones above. That was the place to look.
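To get a rough idea of how many of those entries refer to the nginx access log (the pattern is just a guess based on the lines above), a quick grep can count them without opening the whole file:
$ grep -c 'nginx/logs/access.log' /var/lib/logrotate/status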
But before doing anything else, stop logrotate:
$ sudo kill -9 6479 # got the PID from the previous top command
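On Ubuntu, logrotate is normally started from /etc/cron.daily/logrotate, so it will come back on the next daily run; until the configuration is fixed, it’s worth double-checking that the runaway process is really gone:
$ pgrep logrotate || echo 'logrotate is not running'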
Now, let’s check nginx logs directory and see what’s in there:
$ ls /usr/local/nginx/logs
...
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.1.2.gz.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.1.2.gz.1.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.1.2.gz.1.1.1.1.1.1.1.1.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.2.gz
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.2.gz.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.2.gz.1.1
access.log.1.1.1.1.2.gz.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.gz.1.2.gz.1.1.1
...
Lots and lots of rotated log files. Seems like the logrotate config for nginx was doing something wrong. Let’s take a look:
$ cat /etc/logrotate.d/nginx
/usr/local/nginx/logs/* {
daily
missingok
rotate 52
compress
delaycompress
notifempty
create 640 nobody root
sharedscripts
postrotate
[ ! -f /usr/local/nginx/logs/nginx.pid ] || kill -USR1 `cat /usr/local/nginx/logs/nginx.pid`
endscript
}
Exactly! See that first line? It’s matching all files in the directory, even the already rotated ones. That means that on the first run access.log was moved to access.log.1. On the next run, both files matched the pattern and both were rotated: access.log again to access.log.1, and the old access.log.1 to access.log.1.1. The same thing happened on every run, each time with even more files to rotate, so the number of files grew exponentially.
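You can actually see the growth in the status file itself: each entry ends with the date of its last rotation, so counting entries per date gives a rough picture of how many files logrotate was tracking on each run (rough because the file’s header line adds one stray count):
$ awk '{print $NF}' /var/lib/logrotate/status | sort | uniq -c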
It had created almost half a million files to date:
$ ls /usr/local/nginx/logs/ | wc -l
504970
Fortunately, it’s easy to fix. Just change that first line to the following:
/usr/local/nginx/logs/*.log {
That way, it only rotates the real .log files.
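Before trusting the change, logrotate’s debug mode is handy: with -d it prints what it would rotate without touching the logs or the status file. Pointing it at just the nginx snippet keeps the output short:
$ sudo logrotate -d /etc/logrotate.d/nginx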
But in order to run logrotate correctly again, it’s a good idea to remove all the previously rotated files. We’d like to keep the current access and error logs, though, so nginx doesn’t complain.
$ rm /usr/local/nginx/logs/{error,access}.log.*
-bash: /bin/rm: Argument list too long
Wow, too many files! Well, it makes sense. We had half a million files, remember?
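One workaround is to let find delete them one by one, which sidesteps the shell’s argument list limit (the patterns below assume the file names shown earlier), although walking half a million entries still takes a while:
$ sudo find /usr/local/nginx/logs -name 'access.log.*' -delete
$ sudo find /usr/local/nginx/logs -name 'error.log.*' -delete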
Here it’s simpler to just remove the whole directory and recreate it. But for that we should stop nginx first (the default nginx.pid is located there too), and start it again afterwards.
$ sudo /etc/init.d/nginx stop
* Stopping Nginx Server... [OK]
$ sudo rm -rf /usr/local/nginx/logs # this could take quite some time, and use a lot of CPU too
$ sudo mkdir /usr/local/nginx/logs # we must be sure to create it again
$ sudo /etc/init.d/nginx start
* Starting Nginx Server... [OK]
Also delete the 66MB logrotate status file. A status file that big can make logrotate use 100% CPU on its own.
$ sudo rm -rf /var/lib/logrotate/status
And that’s it! It should be working fine now. To test it and be completely sure it’s working OK, we can force a run of logrotate:
$ sudo logrotate -f /etc/logrotate.conf
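After the forced run, a quick look at the same places as before should show only the real logs plus their fresh rotations, and a status file back down to a tiny size (that’s the expected outcome, not captured output):
$ ls /usr/local/nginx/logs
$ ls -lh /var/lib/logrotate/status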
And it worked! We can get back to having our CI results in just minutes!
This was happening on an Ubuntu 12.04.1 server with nginx installed via apt-get. It seems to be fixed now, but I hope this helped if you were suffering from the same problem.