We are a software consultancy based in Berlin, Germany. We deliver
high quality web apps in short timespans.

Upstream Agile GmbH

Monitoring the internals of your Rails application with Nagios

April 10, 2008 by alex

autoki has recently grown into a more and more complex application. Besides two clusters of mongrels and the mysql database we have a memcached server, ferret and starling plus clients for asynchronous processing. WIth so many services running (and sometimes not running) the need to monitor all these grew. We decided to set up nagios on one of the servers - it’s ugly but it lets you monitor all sorts of stuff pretty easily via remote agents that run on each monitored server.

With nagios we have access to quite a number of monitoring plugins, e.g. for monitoring TCP ports (e.g. for checking memcached is still alive), HTTP, server load, free disk space etc. Yesterday I came to a point where I wanted to monitor something nagios couldn’t: When a user uploads a bunch of photos, the task of creating copies of the photos in different sizes is put in a starling queue for asycnhronous processing. If something with that processing goes wrong and the queue gets too big I want nagios to pick this up and notify me. Time for my own nagios plugin.

Nagios plugins are actually very simple. All you have to provide is something that can be executed in a shell and that returns either 0, 1 or 2 for an OK, Warning or Critical state of the monitored service. So here’s the source code for monitoring the number of photo uploads in the queue (RAILS_ROOT/lib/check_photo_uploads.rb):

#!/usr/bin/env ruby


# load rails

RAILS_ENV = 'production'
require File.dirname(__FILE__) + '/../config/environment'

# print out a warning if the queue has 10, critical when 20 entries

warn_count = 10
fatal_count = 20
actual_count = PhotosUpload.count
error = 0

print "photo_uploads "

if actual_count < warn_count
  print "OK"
elsif actual_count >= fatal_count
  print "CRITICAL"
  error = 2
elsif actual_count >= warn_count
  print 'WARNING'
  error = 1
end

puts " #{actual_count} uploads in queue"
exit error # exit with the error code that is then interpreted by nagios

At autoki the server that runs the asynchronous processes is called jobs1 and this is also where the queue should be checked, so I added this to the /etc/nagios/nrpe.cfg file (the config file for the remote nagios agent):

command[check_photo_uploads]=/usr/bin/ruby /var/www/production/current/lib/check_photo_uploads.rb

Then I had to add the service to the nagios configuration on the monitoring server:

define service {
        host jobs1
        service_description photo_uploads
        check_command check_nrpe_1arg!check_photo_uploads
        use generic-service
}

Well, that was it, and this is how it looks - ugly but it works :)

nagios_photo_uploads.png

(I recently signup with scout - looks much prettier and the setup is much easier than nagios, plugins are written as ruby classes and it comes as a ruby gem - sweet concept so far, could have been my idea, more on that later)