Yes, I had to write a shell script to make generating the config files faster and less troublesome, but once I got that part done, I really started to see a great system. Nagios sends me emails when it has an issue, such as being unable to reach a given server for a test, or the web server being down. All of this went up in a relatively simple way. Not as easy as Pandora FMS, but still pretty simple, if you consider command-line configuration files simple to edit.
A little while ago, I came back from a weekend during which Nagios had sent no alerts or notices, and discovered that two web servers in my test network were locked up. Both were running a particular application that had filled their hard drives with 4GB files. Nagios should have caught this.
What I discovered is that Nagios thought they both had 15GB free, and neither was in alert status. It appeared that the service-checker I had set to check the drives was checking only the Nagios server’s drive. It had about 15GB free, thanks. I dug into the details of the issue and discovered that the service-checker I had chosen was never intended to read a remote drive. I got a slight clue when one of my colleagues asked if I was launching the checker with Samba (the Windows sharing protocol) or with SSH (Secure Shell, the workhorse network communications protocol of the UNIX world and, by extension, the internet).
I had been resistant to loading anything related to Nagios on the remote servers, as I had the fond illusion that Nagios was a clientless network-management application. I knew this wasn’t entirely true: a username and password had to be supplied for reading the health of the PostgreSQL service. In this case, there needed to be a small shell script on the client that Nagios’s plugin could access.
The plugin was called check_diskfree.
To work, check_diskfree needed another service plug-in, check_by_ssh.
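The script itself is not reproduced here. As a rough illustration only, and not the real check_diskfree.sh, a client-side check of this kind generally reads the free space with df and reports through the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the threshold arguments below are hypothetical:

```shell
#!/bin/sh
# Illustrative sketch only: NOT the real check_diskfree.sh.
# Function names and threshold arguments are hypothetical.

# Print the percentage of free space on the filesystem holding $1.
free_percent() {
    # df -P prints one POSIX-format line per filesystem;
    # field 5 is the used-space percentage ("Capacity").
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Map a free-space percentage to a status line and a Nagios exit code.
# $1 = free percent, $2 = warning threshold, $3 = critical threshold
disk_status() {
    if [ "$1" -le "$3" ]; then
        echo "CRITICAL. Free Space: $1%"
        return 2
    elif [ "$1" -le "$2" ]; then
        echo "WARNING. Free Space: $1%"
        return 1
    fi
    echo "OK. Free Space: $1%"
    return 0
}
```

Called as `disk_status "$(free_percent /)" 20 10`, this would warn at 20% free and go critical at 10% free on the root partition.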
Open the file /usr/local/nagios/etc/objects/commands.cfg and add the following block of code:
command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -p $ARG1$ -C "$ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$"
# The example at Nagios extensions showed it as $HOSTNAME$, but that looked for the machine name in my network. Apparently it is designed to work with fully qualified domain names (FQDNs). On my network, it just showed the error "Invalid HOSTNAME/IP ADDRESS." $HOSTADDRESS$ will work in any network.
# Below is the command definition that read only the local Nagios-Server disk.
command_line $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
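Before relying on the new definition, it can help to run the same plugin by hand as the nagios user, filling in the macros yourself. The wrapper below is only a sketch: the function name `try_remote_check` and the `PLUGIN_DIR` variable are hypothetical, and the default path is the source-install location that `$USER1$` usually points at, so adjust it to your layout.

```shell
# Sketch: run check_by_ssh manually, mirroring the check_by_ssh command_line.
# PLUGIN_DIR and try_remote_check are hypothetical names for illustration.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/local/nagios/libexec}"

# $1 = remote address ($HOSTADDRESS$), $2 = ssh port ($ARG1$),
# remaining arguments = the remote command and its options ($ARG2$ onward)
try_remote_check() {
    host=$1
    port=$2
    shift 2
    status=0
    # "$*" joins the remaining arguments into the single string that -C expects
    "$PLUGIN_DIR/check_by_ssh" -H "$host" -p "$port" -C "$*" || status=$?
    echo "exit code: $status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN)"
    return "$status"
}
```

For example, `try_remote_check 192.168.0.230 22 /nagios/check_diskfree.sh` (port 22 being an assumption) should print the same status line that Nagios would later show for the service.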
Service Definition Madness
I have a script, as I mentioned, that makes putting together a server-monitoring config much easier. It makes a separate file for each server, named with the server’s HOSTNAME (machine name) and the extension .cfg. I found this far simpler than running through a 4000-line config file, which was what it looked like before I wrote the script. I am going to presume for now that you have some way of associating services with machines. If anybody asks, I might show that script to you, someday.
Thus I open an 80-line HOSTNAME.cfg file and, under the services header, add the following service definition.
service_description Remote Root Partition
# The following is the service-definition that was not working
service_description Root Partition
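Only the service_description lines are shown above. For orientation, a complete stanza might look like what the sketch below prints. It is not the generator script mentioned earlier: the function name, the `generic-service` template (from the Nagios sample configuration), the command name `check_remote_disk`, and port 22 are all assumptions, so substitute whatever names your own commands.cfg uses.

```shell
# Sketch: print a remote disk-check service stanza for one host.
# Assumed names (not from the original config): generic-service,
# check_remote_disk, port 22.
emit_remote_disk_service() {
    # $1 = host_name as defined in that host's .cfg file
    cat <<EOF
define service{
        use                     generic-service
        host_name               $1
        service_description     Remote Root Partition
        check_command           check_remote_disk!22!/nagios/check_diskfree.sh
        }
EOF
}
```

Something like `emit_remote_disk_service web01 >> web01.cfg` (with `web01` standing in for a real machine name) would append the stanza to that host's file. Any options the remote script expects would follow as further `!`-separated arguments.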
Sanity-Checking your Configuration
To check your configurations without starting up the nagios service, you want to run the following helpful command.
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
This will check all of your config files and even tell you which line in which file is wrong. This is far quicker than restarting, or attempting to restart, the nagios service:
# /etc/init.d/nagios restart
Failures from a restart are extremely terse and unhelpful, while the sanity-checking command is extremely helpful.
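The two steps can be chained so that a restart is attempted only when the sanity check passes. This is a minimal sketch; the function and variable names are hypothetical, and the default paths are the ones used in this article:

```shell
# Sketch: restart Nagios only if the configuration passes the sanity check.
# Variable and function names are hypothetical; override the paths as needed.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
NAGIOS_INIT="${NAGIOS_INIT:-/etc/init.d/nagios}"

safe_restart() {
    # nagios -v exits non-zero when it finds a configuration error
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        "$NAGIOS_INIT" restart
    else
        echo "Config check failed; Nagios was not restarted." >&2
        return 1
    fi
}
```

Run `safe_restart` as root after editing any .cfg file; the verbose output of the check still appears, so the offending file and line are visible when it refuses to restart.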
Setting up Communications with the Remote Server
You will need to get the public key of the nagios user on your Nagios server into the file of authorized SSH keys on the remote machine. There are a couple of ways to do it, but this is how I usually go about it, because I usually have one or two terminal windows open on my desktop, which lets me use the copy buffer for this configuration. I also use Debian servers, so the commands might not work exactly like this on your flavour of Linux.
In the terminal (or one of the many), shell into the remote server as its root user.
Add a user called ‘nagios’. This is the user name under which Nagios will attempt to contact the remote server, so rather than pushing the river, just create that user to receive the request.
Add a directory called /nagios.
Switch users to that nagios user and make another directory – this directory will be in your nagios user’s home folder.
su - nagios
Shell back into your Nagios server as its nagios user; you will land in that user’s home folder.
ssh nagios@NAGIOSSERVER # Change NAGIOSSERVER to its actual IP address or domain name
Make a public key (if you don’t already have one created). Leave the pass-phrases blank.
Copy your Nagios server’s nagios user’s public key.
Here I just highlight and copy the entire block from “ssh-rsa” to the end. (It may have the user name at the end, i.e., nagios@nagiosserver, and you can include that too.)
Press “q” to quit the less application and type “exit” to close the secure shell session.
Now your prompt will show you to be “nagios@remoteserver:~$”, so you need to paste the key in.
Type “i” for “insert” and either with the mouse or keyboard shortcuts paste the key into the file. Make sure there is a blank line at the end of the file. Because you just made this user, this file did not exist before you invoked it with vi.
Save the file and quit vi.
Now type “exit” until you are again looking at a prompt showing you to be nagios@nagiosserver:~$
To check the set-up, shell back into the remote server. You should not need to have a password this time.
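The pasting step can also be scripted. The sketch below is one possible equivalent of the vi session, meant to run on the remote server as the nagios user. The function name is hypothetical, and it assumes the standard OpenSSH location for the key file (`~/.ssh/authorized_keys`):

```shell
# Sketch: append a public key to an authorized_keys file, but only once.
# install_pubkey is a hypothetical helper name.
# $1 = the full public key line, $2 = the .ssh directory (defaults to ~/.ssh)
install_pubkey() {
    key=$1
    ssh_dir=${2:-$HOME/.ssh}
    auth="$ssh_dir/authorized_keys"

    mkdir -p "$ssh_dir"
    # With sshd's default StrictModes, keys are rejected if the directory
    # or file is writable by other users, so tighten both.
    chmod 700 "$ssh_dir"
    touch "$auth"
    # -x matches the whole line, -F treats the key as a fixed string;
    # the append is skipped when the key is already present.
    grep -qxF "$key" "$auth" || printf '%s\n' "$key" >> "$auth"
    chmod 600 "$auth"
}
```

You would call it with the text copied from the Nagios server, for example `install_pubkey "ssh-rsa AAAA... nagios@nagiosserver"`, where the key body is whatever your own public key file contains.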
Now, to finish the set-up on the remote server, you need to copy the check_diskfree.sh file into the /nagios directory you created at root level.
scp /usr/local/nagios/libexec/check_diskfree.sh 192.168.0.230:/nagios/
This drops the file where nagios will be looking for it.
Looking at the Nagios Front-End
Go to your web browser and look at the services page. Your remote server should have a new service called “Remote Root Partition”.
If there is an error or no useful info, click on that service name and then the link to reschedule the next check of that service. When you go back to the services page, you should see something like this:
| Remote Root Partition | OK | 03-19-2012 17:52:18 | 0d 1h 26m 33s | 1/3 | OK. | Free Space: 24GB, 95% |