Nagios: Monitor Remote Disk Free Space

I was working with Nagios 3.3.1, which I love because it shows me everything I want to know and is easier to set up than Zabbix.

Yes, I had to write a shell script to make config-file generation faster and less troublesome, but once that part was done, I really started to see a great system. Nagios emails me when it hits a problem, such as being unable to reach a given server for a test, or a web server being down. All of this was relatively simple to set up: not as easy as Pandora FMS, but still pretty simple, if you consider command-line configuration files simple to edit.

A little while ago I got back from a weekend during which Nagios had not sent any alerts or notices, and discovered that two web servers in my test network had locked up. Both were running a particular application that filled their hard drives with 4GB files. Nagios should have caught this.

What Happened?!?

What I discovered is that Nagios thought they both had 15GB free, and neither was in an alert state. It turned out that the service check I had set up for the drives was checking only the Nagios server's own drive. That drive did have about 15GB free, thanks. Digging into the details, I found that the check I had chosen was never intended to read a remote drive. My first clue came when a colleague asked whether I was launching the checker with Samba (the Windows file-sharing protocol) or with SSH (Secure Shell, the workhorse network communications protocol of the UNIX world and, by extension, the Internet).

My Solution

I had resisted loading anything Nagios-related on the remote servers, as I had the fond illusion that Nagios was a clientless network-management application. I knew this wasn't entirely true: I already had to supply a username and password for reading the health of the PostgreSQL service. In this case, there needed to be a small shell script on the client that a Nagios plugin could run.
The plugin was called check_diskfree.
check_diskfree needed another service plugin, check_by_ssh, to work: check_by_ssh runs on the Nagios server and uses SSH to execute the script on the remote machine.
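The post does not reproduce check_diskfree.sh itself, so here is a hypothetical stand-in (check_diskfree_sketch is an illustrative name, not the real script) showing the general shape such a client-side check can take. It assumes the script is called with a mount point or device plus warning and critical thresholds in percent used (like the 75 and 90 in the service definition), and it follows the Nagios plugin convention of printing one status line and exiting 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

```shell
# check_diskfree_sketch: HYPOTHETICAL stand-in for check_diskfree.sh,
# not the script this post actually uses.
# Usage: check_diskfree_sketch MOUNT_OR_DEVICE WARN_PCT CRIT_PCT
# Thresholds are percent of space *used*. Exit codes follow the Nagios
# plugin convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
check_diskfree_sketch() {
  target=${1:-} warn=${2:-} crit=${3:-}
  usage="DISK UNKNOWN - usage: check_diskfree_sketch MOUNT_OR_DEVICE WARN_PCT CRIT_PCT"
  [ -n "$target" ] || { echo "$usage"; return 3; }
  for n in "$warn" "$crit"; do
    case $n in ''|*[!0-9]*) echo "$usage"; return 3 ;; esac
  done
  # POSIX-format df (-P) in 1024-byte blocks (-k). Line 2 holds the data:
  # Filesystem 1024-blocks Used Available Capacity Mounted-on
  set -- $(df -Pk "$target" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $4, $5 }')
  avail_kb=${1:-} used_pct=${2:-}
  # Anything that is not a plain number means df could not read the target.
  case $avail_kb in ''|*[!0-9]*) echo "DISK UNKNOWN - cannot read $target"; return 3 ;; esac
  case $used_pct in ''|*[!0-9]*) echo "DISK UNKNOWN - cannot read $target"; return 3 ;; esac
  free_mb=$((avail_kb / 1024))
  if [ "$used_pct" -ge "$crit" ]; then
    echo "DISK CRITICAL - $target ${used_pct}% used, ${free_mb}MB free"
    return 2
  elif [ "$used_pct" -ge "$warn" ]; then
    echo "DISK WARNING - $target ${used_pct}% used, ${free_mb}MB free"
    return 1
  fi
  echo "DISK OK - $target ${used_pct}% used, ${free_mb}MB free"
  return 0
}
```

Saved as a script on the remote machine, it would end with a line such as `check_diskfree_sketch "$@"` so that check_by_ssh can pass its arguments through. Unlike the bare sda2 argument used in the service definition, this sketch expects a mount point such as / or a full device path such as /dev/sda2.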

Configuration Steps

Open the file /usr/local/nagios/etc/objects/commands.cfg and add the following block of code:

define command{
command_name check_by_ssh
command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -p $ARG1$ -C "$ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$"
}
# The example at Nagios extensions used $HOSTNAME$, but that passes the host's name, which check_by_ssh then has to resolve on the network.
# That works with Fully Qualified Domain Names (FQDNs); on my network it just produced the error "Invalid HOSTNAME/IP ADDRESS".
# $HOSTADDRESS$ passes the address from the host definition, so it works on any network.
# Below is the original command definition, which reads only the Nagios server's own disk.
define command{
command_name check_disk
command_line $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
}
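The check_by_ssh definition above gets its $ARG1$, $ARG2$ and so on from the !-separated fields that follow the command name in a service's check_command, and any $ARGn$ with no matching field expands to an empty string. The helper below is a hypothetical sketch (expand_check_by_ssh is not part of Nagios) that mimics that substitution, which can help when eyeballing what will actually run.

```shell
# expand_check_by_ssh: HYPOTHETICAL helper, not part of Nagios.
# Mimics how Nagios fills $ARG1$..$ARG6$ in the check_by_ssh command_line
# from the !-separated fields of a service's check_command.
# $USER1$ is left unexpanded.
# Usage: expand_check_by_ssh CHECK_COMMAND HOSTADDRESS
expand_check_by_ssh() {
  spec=${1:-} host=${2:-}
  old_ifs=$IFS
  IFS='!'
  set -f                 # no filename globbing while splitting
  set -- $spec           # $1 = command name, $2.. = ARG1..ARGn
  set +f
  IFS=$old_ifs
  [ $# -gt 0 ] && shift  # drop the command name
  # Missing fields expand to empty strings, like unset $ARGn$ macros.
  printf '%s\n' "\$USER1\$/check_by_ssh -H $host -p ${1:-} -C \"${2:-} ${3:-} ${4:-} ${5:-} ${6:-}\""
}
```

Called as `expand_check_by_ssh 'check_by_ssh!22!/nagios/check_diskfree.sh!sda2!75!90' 192.168.0.230`, it prints `$USER1$/check_by_ssh -H 192.168.0.230 -p 22 -C "/nagios/check_diskfree.sh sda2 75 90 "`. The trailing space inside the quotes comes from the empty $ARG6$ and is harmless to the remote shell.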

Service Definition Madness

As I mentioned, I have a script that makes putting together a server-monitoring config much easier. It creates a separate file for each server, named after the server's HOSTNAME (machine name) with the extension .cfg. I found this far simpler than working through a 4000-line config file, which is what I had before I wrote the script. For now I will assume that you have some way of associating services with machines. If anybody asks, I might share that script someday.
So I open the 80-line HOSTNAME.cfg file and, under the services header, add the following service definition.


define service{
use local-service
host_name HOSTNAME
service_description Remote Root Partition
check_command check_by_ssh!22!/nagios/check_diskfree.sh!sda2!75!90
notifications_enabled 1
max_check_attempts 3
check_interval 5
retry_interval 3
check_period 24x7
notification_interval 15
notification_period 24x7
notification_options w,c,r
contact_groups admins
register 1
}

# The following is the service definition that was not working. register 0 turns it into a template, so Nagios no longer runs it as a check.

define service{
use local-service
host_name HOSTNAME
service_description Root Partition
check_command check_disk!20%!5%!/
notifications_enabled 1
max_check_attempts 3
check_interval 5
retry_interval 3
check_period 24x7
notification_interval 15
notification_period 24x7
notification_options w,c,r
contact_groups admins
register 0
}
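The per-host file generator mentioned at the start of this section is linked from the comments below and is not reproduced here. As a rough idea of the approach only, here is a minimal hypothetical sketch (make_host_cfgs is an illustrative name) that assumes a template file in which the literal placeholder TEMPLATE-HOSTNAME, the same placeholder used in the template quoted in the comments, marks every spot where a host name belongs.

```shell
# make_host_cfgs: HYPOTHETICAL sketch, not the author's generator script.
# Writes OUTDIR/<host>.cfg for each host by replacing the placeholder
# TEMPLATE-HOSTNAME in TEMPLATE with the host name.
# Usage: make_host_cfgs TEMPLATE OUTDIR HOST...
make_host_cfgs() {
  [ $# -ge 2 ] || { echo "usage: make_host_cfgs TEMPLATE OUTDIR HOST..." >&2; return 1; }
  template=$1 outdir=$2
  shift 2
  [ -r "$template" ] || { echo "cannot read $template" >&2; return 1; }
  mkdir -p "$outdir" || return 1
  for host in "$@"; do
    # Host names are assumed to contain only letters, digits, dots and
    # dashes, so they are safe inside the sed replacement text.
    sed "s/TEMPLATE-HOSTNAME/$host/g" "$template" > "$outdir/$host.cfg" || return 1
  done
}
```

Nagios only reads the generated files if nagios.cfg points at them, for example through a cfg_dir= line naming OUTDIR.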

Sanity-Checking your Configuration

To check your configuration without starting the nagios service, run the following helpful command.

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

This checks all of your config files and even tells you which line in which file is wrong. It is far quicker than restarting, or attempting to restart, the nagios service:

# /etc/init.d/nagios restart

When a restart fails, its messages are extremely terse and unhelpful, while the sanity-checking command is extremely helpful.
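The two commands above can be chained so that a broken config never reaches a restart. This is a minimal hypothetical sketch (safe_restart is an illustrative name); it uses the paths from this post as defaults and assumes that nagios -v exits with a non-zero status when it finds configuration errors.

```shell
# safe_restart: HYPOTHETICAL wrapper. Runs the sanity check first and only
# restarts Nagios if it passes. Paths default to the ones used in this post
# and can be overridden through the environment.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}
NAGIOS_INIT=${NAGIOS_INIT:-/etc/init.d/nagios}

safe_restart() {
  # Assumes "nagios -v" exits non-zero when the configuration has errors.
  if ! report=$("$NAGIOS_BIN" -v "$NAGIOS_CFG" 2>&1); then
    printf '%s\n' "$report" >&2
    echo "Config check failed; Nagios was NOT restarted." >&2
    return 1
  fi
  "$NAGIOS_INIT" restart
}
```

On failure, the full verification report is printed, so you still see which line in which file is wrong.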

Setting up Communications with the Remote server

You will need to take the public key of your Nagios server's nagios user and somehow put it in the authorized SSH keys file on the remote machine. There are a couple of ways to do this. I usually have one or two terminal windows open on my desktop, so I use the copy buffer. I also use Debian servers, so the commands might not work exactly like this on your flavour of Linux.
In a terminal, shell into the remote server as its root user.

ssh root@192.168.0.230

Add a user called ‘nagios’. This is the user name Nagios will use when it contacts the remote server, so rather than pushing the river, just create that user to receive the requests.

adduser nagios

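Add a directory called /nagios and give it to the nagios user, so that the nagios user can copy the plugin into it later.

mkdir /nagios
chown nagios:nagios /nagios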

Switch to the nagios user and make a .ssh directory in its home folder.

su - nagios
mkdir .ssh

From the remote server, shell into your Nagios server as its nagios user.

ssh nagios@NAGIOSSERVER # Change NAGIOSSERVER to its actual IP address or domain name

Make a key pair (if you don’t already have one). Leave the passphrase blank, because Nagios has to use the key without anyone typing it in.

ssh-keygen

Copy your Nagios server’s nagios user’s public key

less .ssh/id_rsa.pub

Here I just highlight and copy the entire line, from “ssh-rsa” to the end. (It may have the user name at the end, e.g. nagios@nagiosserver; you can include that too.)
Press “q” to quit less, and type “exit” to close the SSH session.
Your prompt now shows you as “nagios@remoteserver:~$”, so you need to paste the key in.

vi .ssh/authorized_keys

Type “i” for “insert” and paste the key into the file with the mouse or a keyboard shortcut. Make sure the key stays on a single line and is followed by a line break. Because you just created this user, the file did not exist before you opened it with vi.
Save the file and quit vi:

:wq
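If pasting through vi is awkward, the same step can be scripted. This is a minimal sketch, not part of the procedure above: install_pubkey is a hypothetical helper you would run as the nagios user on the remote server, after saving the copied key to a file. It also sets ~/.ssh to 700 and authorized_keys to 600, because sshd's StrictModes check refuses keys when these are writable by other users. OpenSSH's ssh-copy-id tool, run from the Nagios server, automates the whole job if it is installed.

```shell
# install_pubkey: HYPOTHETICAL helper, run as the nagios user on the remote
# server. Appends the public key in KEYFILE to HOME_DIR/.ssh/authorized_keys
# (only once) and sets the permissions sshd expects.
# Usage: install_pubkey KEYFILE [HOME_DIR]
install_pubkey() {
  keyfile=${1:-} home_dir=${2:-$HOME}
  key=$(head -n 1 "$keyfile") || return 1
  # Rough sanity check that this looks like an OpenSSH public key line.
  case $key in
    ssh-*|ecdsa-*) ;;
    *) echo "$keyfile does not look like an OpenSSH public key" >&2; return 1 ;;
  esac
  mkdir -p "$home_dir/.ssh" && chmod 700 "$home_dir/.ssh" || return 1
  auth=$home_dir/.ssh/authorized_keys
  touch "$auth" && chmod 600 "$auth" || return 1
  # Append only if this exact key line is not already present.
  grep -qxF "$key" "$auth" || printf '%s\n' "$key" >> "$auth"
}
```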

Now type “exit” until you are again looking at a prompt showing you to be nagios@nagiosserver:~$.
To check the set-up, shell back into the remote server. You should not need a password this time. Type “exit” to return to the Nagios server.
Now, to finish the set-up on the remote server, copy the check_diskfree.sh file from the Nagios server into the /nagios directory you created at the root of the remote filesystem.

scp /usr/local/nagios/libexec/check_diskfree.sh 192.168.0.230:/nagios/

This drops the file where nagios will be looking for it.
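One of the commenters below ran into “Permission denied” from check_by_ssh, which is what the remote shell reports when the script is not executable for the nagios user. A hedged sketch of a quick check to run on the remote server as the nagios user (plugin_ready is a hypothetical helper; the default path is the one used in this post):

```shell
# plugin_ready: HYPOTHETICAL helper. Makes sure a client-side plugin exists
# and is executable, fixing the mode if needed.
# Run it as the nagios user: [ -x ] answers for whoever runs the test,
# and chmod needs the caller to own the file (nagios owns the scp'd copy).
# Usage: plugin_ready [PATH]   (defaults to /nagios/check_diskfree.sh)
plugin_ready() {
  plugin=${1:-/nagios/check_diskfree.sh}
  if [ ! -f "$plugin" ]; then
    echo "missing: $plugin" >&2
    return 1
  fi
  if [ ! -x "$plugin" ]; then
    # 755: the owner can write it; everyone can read and run it.
    chmod 755 "$plugin" || return 1
    echo "fixed mode on $plugin"
  fi
  echo "ok: $plugin is executable"
}
```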

Looking at the Nagios Front-End

Go to your web browser and look at the services page. Your remote server should have a new service called “Remote Root Partition”.
If there is an error or no useful information, click the service name and then the link to reschedule the next check of that service. When you go back to the services page, you should see something like this:

--------------------------------------------------------------------------------------------------------
| Remote Root Partition | OK | 03-19-2012 17:52:18 | 0d 1h 26m 33s | 1/3 | OK. | Free Space: 24GB, 95% |
--------------------------------------------------------------------------------------------------------
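When the services page shows UNKNOWN or “(null)”, as in the last comment below, it helps to run the same command by hand as the nagios user and look at both its output and its exit code. A minimal sketch: nagios_state is a hypothetical wrapper that runs any command and names the Nagios state its exit code maps to (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN; this sketch reports any other code as UNKNOWN too).

```shell
# nagios_state: HYPOTHETICAL helper. Runs a check command, lets its output
# through, then names the Nagios state its exit code maps to.
# Usage: nagios_state COMMAND [ARG...]
nagios_state() {
  rc=0
  "$@" || rc=$?
  case $rc in
    0) state=OK ;;
    1) state=WARNING ;;
    2) state=CRITICAL ;;
    3) state=UNKNOWN ;;
    *) state="UNKNOWN (exit code $rc is outside 0-3)" ;;
  esac
  echo "exit $rc -> $state"
}
```

For example, on the Nagios server as the nagios user: `nagios_state /usr/local/nagios/libexec/check_by_ssh -H 192.168.0.230 -p 22 -C "/nagios/check_diskfree.sh sda2 75 90"`.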

Comments

  • HI

    Could you please post the script which you have mentioned in “Service Definition Madness”. Like I am trying to monitor remote server disk free space.

    The thing is I have installed Nagwin in windows 2008 box and trying to configure it.

    Please help.

    Thanks
    Anand

  • Can you tell me where to get hold of check_by_ssh? It does not appear in my nagios installation.

  • Anand –
    The script for filling the Nagios templates is at GitHub as of a minute ago.
    https://github.com/wolf29/Nagios-Template-filler

    Read it before you use it, as it was developed for library use and some of the pieces may be unnecessary in your network. If you want to add to the idea, there is a lot of space for creative additions. Let me know what you think.

  • I keep getting Permission denied:

    Remote Root Partition
    UNKNOWN 06-26-2012 15:51:18 0d 0h 24m 22s 4/4 Remote command execution failed: bash: /home/nagios/check_diskfree.sh: Permission denied

    I changed the command a bit to use the nagios home (~) directory so I wouldn't have a directory in /. I can manually run the script with these arguments and it works: sh check_diskfree.sh /dev/mapper/i5--2500k--01-root 75 90

    I also wrote a small script on my Nagios server: ssh nagios@NAGIOSSERVER-INFO sh check_diskfree.sh /dev/mapper/i5--2500k--01-root 75 90

    and that echoes the correct info, so the SSH keys are working and the directories have the correct permissions. So who is the $USER1$? Is it not the nagios user?

    THANKS!!!

  • The check_diskfree.sh program has to be on the remote machine, and the nagios user from the nagios server has to be able to run it remotely. You already wrote a script for that; however, there is a plugin you can get and add to the libexec folder that lets you connect with ssh. I am working on my script for nagios templates https://github.com/wolf29/Nagios-Template-filler and I think I committed the change I am using on my production servers.

    Some examples from the template I have up there. (This has changed in production: I used to put the agent plugins in the /nagios/ folder, but that is not POSIX compliant and causes issues for some Linux installs. Now I am putting my plugins in /usr/local/nagios/libexec/.) I changed the second one to look in the /usr/local/nagios/libexec/ directory for plugins.


    define service{
    use local-service
    host_name TEMPLATE-HOSTNAME
    service_description Remote Root Partition
    check_command check_by_ssh!22!/nagios/check_diskfree2.sh!/!75!90
    notifications_enabled 1
    max_check_attempts 3
    check_interval 5
    retry_interval 3
    check_period 24x7
    notification_interval 15
    notification_period 24x7
    notification_options w,c,r
    contact_groups admins
    register 1
    }

    # Check the number of users currently logged onto the local machine.
    # Warning if > 20 users, critical if > 50 users.
    define service{
    use local-service
    host_name TEMPLATE-HOSTNAME
    service_description Current Users
    check_command check_by_ssh!22!/usr/local/nagios/libexec/check_local_users!20!50
    notifications_enabled 1
    max_check_attempts 3
    check_interval 5
    retry_interval 3
    check_period 24x7
    notification_interval 15
    notification_period 24x7
    notification_options w,c,r
    contact_groups admins
    register 0
    }

  • Hey Guys,

    I keep receiving the following errors:
    [1391434121] SERVICE ALERT: test;Disk Usage;UNKNOWN;SOFT;1;(null)
    [1391434241] SERVICE ALERT: test;Disk Usage;UNKNOWN;HARD;2;(null)

    Also, in the GUI it shows, Disk Usage Unknown
    Can someone help?

    Regards,
