(Ping! Zine Issue 12) – Downtime is evil.
Whether you’re hosting two customers or two thousand on your network, the services you offer them need to be up and running as close to 100% of the time as possible. As soon as customers realize people can’t get to their sites or they can’t send or receive email, they’re going to run to your competitors.
This is what plagued me when I took over my current position. There was no 24×7 monitoring of the network, with the biggest gap between shifts spanning 2pm Saturday afternoon to 9am Monday morning. Because technicians were not using our own network for connectivity, email, etc., they didn’t know anything was wrong until they checked the voice mail Monday morning. Using my work email address as a personal address and using my own DNS servers while surfing helped, but it was less than desirable as I don’t live at the computer (contrary to my wife’s opinion). With our limited budget, hiring NOC staff — or even a monitoring service — was not an option.
The Nagios network monitor is a free, Open Source package that operates under Linux and Unix. Once configured, it will monitor given hosts and services on a specified basis and notify the appropriate technicians of problems and/or outages via email or pager. Its plugin system makes it very configurable, and the web interface makes it easy to check system status remotely. There’s even a WAP page in the web interface to allow admins to check things out via an Internet-enabled cell phone or handheld device.
Nagios has made my life a lot easier. When we were having troubles with our mail gateway, I knew to SSH in and make the quick fix before traffic got passed along to the main server and bogged it down. When a customer’s router went down (because a telco tech unplugged the Ethernet cable), we were already troubleshooting the connection when the phone rang.
In fact, since implementing Nagios, we have yet to have a customer notify us of an outage (read: complain that “the Internet is down”) before we knew about it.
Installing and configuring Nagios is somewhat tedious, but it’s not difficult. Like most Linux apps, it mostly involves editing a handful of text files and starting up the service. Once the initial setup is completed, Nagios just runs contentedly until you need to make a change in who it notifies or what it monitors.
The nitty gritty of the installation is outside the scope of this article, and would require quite a bit more space than I have available. The Nagios documentation at http://www.nagios.org/docs/ is very thorough and well written, and I strongly recommend you check it out. For the most part, the initial installation involves little more than creating a user for the Nagios daemon, compiling and installing the software, and then configuring Apache.
There are two components you will need to download: the Nagios software itself and then the plugins, which include the software to perform the monitoring functions. As of this writing, version 1.2 is the latest stable release and is what I’m using. Version 2 is still in the beta development stage. Both RPMs and source tarballs are available for each version.
For a base Nagios install, the only requirements are a Linux or similar system and a C compiler. If you want the web interface (which also includes documentation), you’ll need Apache. Some web features, such as the graphical maps, are optional and require extra packages. The installation creates the necessary files to start the Nagios server at boot; on Slackware, a startup script was automatically generated in /etc/rc.d.
The configuration process will let you know if there is anything broken or missing from your system. The Nagios FAQ database at http://www.nagios.org/faqs/ covered every problem I ran into during installation and configuration. For example, when Nagios wouldn’t compile at first, I found out through the FAQ that there was a conflict with the RADIUS client included with Slackware 10.0. I removed the offending software (a ppp client) and Nagios compiled just fine.
Next comes the tedious part: configuring the contacts and services. I spent about two hours setting up six technicians with 17 hosts and 26 services. Hosts are defined as physical equipment, such as servers, routers, or switches. Services are defined as the daemon services running on that equipment, such as HTTP, SMTP, and so forth. This configuration time included consulting docs, some trial and error, and fixing typos.
By default, the Nagios home directory is /usr/local/nagios with the configuration files in ~/etc. The config files, again, are simple plain text files to be edited. A sample is included for each file to show the syntax, and these make a great base to start with. The complexity of these files is daunting at first, but if you think of them as a sort of tree you’ll realize they offer a tremendous amount of power and flexibility.
For example, a contact definition looks like this:
Each of your contacts, or the techs who will be notified of service problems, will get their own definition. The definition includes: a time period, such as 24×7 or a shift schedule; the instances for which the contact is notified, such as host warning or service critical states; and the notification method and address to reach the contact.
Each contact is then placed in a contact group, which allows for a breakdown in service tiers; for example, Level 1, 2 and 3 Helpdesk notifications. A contact group contains the contact_names for each contact, and looks like this:
The contact group is then assigned to a service or host. So, if HTTP goes down on a Unix box, Nagios triggers a notice to the contact group. Within the contact group, only the technicians whose time period matches up (i.e., the techs on the current shift) will be notified.
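The shift-matching behavior described above can be sketched in a few lines of Python. This is only a model of the idea, not how Nagios implements it: the Contact class, the contact names, and the shift hours are all hypothetical, and a real time period can vary by day of the week.

```python
from dataclasses import dataclass
from datetime import time
from typing import List


@dataclass
class Contact:
    name: str
    shift_start: time  # start of the contact's notification time period
    shift_end: time    # end of the period (exclusive)


def on_shift(contact: Contact, now: time) -> bool:
    """Return True if `now` falls inside the contact's time period."""
    if contact.shift_start == contact.shift_end:
        return True  # this sketch treats equal start and end as 24x7
    if contact.shift_start < contact.shift_end:
        return contact.shift_start <= now < contact.shift_end
    # Period wraps past midnight, e.g. 17:00 to 08:00.
    return now >= contact.shift_start or now < contact.shift_end


def contacts_to_notify(group: List[Contact], now: time) -> List[str]:
    """Names of the group members whose time period covers `now`."""
    return [c.name for c in group if on_shift(c, now)]


# Hypothetical contact group: a day tech, a night tech, and a 24x7 admin.
unix_admins = [
    Contact("alice", time(8, 0), time(17, 0)),
    Contact("bob", time(17, 0), time(8, 0)),
    Contact("carol", time(0, 0), time(0, 0)),
]

if __name__ == "__main__":
    print(contacts_to_notify(unix_admins, time(3, 30)))
```

At 3:30am only the night tech and the 24x7 admin are selected, which is the filtering the contact group relies on.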
Similarly, the developers had the forethought to allow for dependency trees. If a remote router goes down, there’s no sense burying you in outage notifications for the three servers and their web, email and FTP services on the other side of the router. The host and service definitions look very similar to the contact definitions above, only they include time periods and intervals for checks and which commands to execute to perform checks.
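The dependency idea can be illustrated with a short Python sketch. It assumes each host has at most one parent and that the parent map contains no loops; the host names are made up, and Nagios’s own reachability logic is more involved than this.

```python
from typing import Dict, Optional, Set


def root_causes(parents: Dict[str, Optional[str]], down: Set[str]) -> Set[str]:
    """Return the failed hosts that have no failed ancestor.

    Hosts sitting behind a failed parent are unreachable rather than
    independently broken, so they are left out of the result.
    """
    causes = set()
    for host in down:
        ancestor = parents.get(host)
        blocked = False
        while ancestor is not None:
            if ancestor in down:
                blocked = True
                break
            ancestor = parents.get(ancestor)
        if not blocked:
            causes.add(host)
    return causes


# Hypothetical topology: three servers behind one remote router.
PARENTS = {
    "core-router": None,
    "remote-router": "core-router",
    "web1": "remote-router",
    "mail1": "remote-router",
    "ftp1": "remote-router",
}

if __name__ == "__main__":
    failed = {"remote-router", "web1", "mail1", "ftp1"}
    print(sorted(root_causes(PARENTS, failed)))
```

With the router and all three servers failing their checks, only the router is reported as the problem worth a notification.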
Ideally you will have a full list of all of your hosts and services that require monitoring prior to sitting down to create the above definitions. I pulled it off with a lot of copying and pasting, but if you have a very large organization you may want to look into scripting the output.
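As a rough idea of what scripting the output might look like, here is a minimal Python sketch that turns a CSV inventory into host definitions. The inventory columns, the host names, the addresses, and the generic-host template name are assumptions for illustration, and the directives shown are only a subset; check the sample config files for the directives your version requires.

```python
import csv
import io

# Template-style host definition; doubled braces are literal braces.
HOST_TEMPLATE = """define host{{
        use             {template}
        host_name       {host_name}
        alias           {alias}
        address         {address}
        }}
"""


def render_hosts(csv_text: str, template: str = "generic-host") -> str:
    """Render one host definition per row of the CSV inventory."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blocks = [
        HOST_TEMPLATE.format(
            template=template,
            host_name=row["host_name"].strip(),
            alias=row["alias"].strip(),
            address=row["address"].strip(),
        )
        for row in reader
    ]
    return "\n".join(blocks)


# Hypothetical inventory using documentation-range addresses.
INVENTORY = """host_name,alias,address
web1,Web Server 1,192.0.2.10
mail1,Mail Gateway,192.0.2.20
"""

if __name__ == "__main__":
    print(render_hosts(INVENTORY))
```

The output can be redirected into a config file and then run through the daemon’s own configuration check before going live.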
Nagios comes with a full complement of utilities to check services such as FTP, HTTP, SMTP, and so forth. A wide variety of these plugins is included in the ~/libexec directory, and it’s also possible to customize them or create your own. To monitor disk space and other local events on remote servers, there are two add-ons that will run on the remote hosts and communicate this information to the main server.
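For anyone curious about creating a plugin, the convention is simple: print one line of status text and exit with 0 for OK, 1 for WARNING, 2 for CRITICAL, or 3 for UNKNOWN. The Python sketch below follows that convention for a local disk check. The thresholds and message wording are my own assumptions, not those of the bundled disk plugin.

```python
import shutil

# Exit codes that Nagios plugins use to report state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct, warn_pct, crit_pct, path):
    """Map a usage percentage onto a plugin state and a one-line message."""
    if used_pct >= crit_pct:
        return CRITICAL, f"DISK CRITICAL - {path} is {used_pct:.1f}% full"
    if used_pct >= warn_pct:
        return WARNING, f"DISK WARNING - {path} is {used_pct:.1f}% full"
    return OK, f"DISK OK - {path} is {used_pct:.1f}% full"


def check_disk(path, warn_pct=80.0, crit_pct=90.0):
    """Check one filesystem; anything that cannot be measured is UNKNOWN."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    return classify(100.0 * usage.used / usage.total, warn_pct, crit_pct, path)


def main(argv):
    """Print the status line and return the exit code for Nagios."""
    path = argv[1] if len(argv) > 1 else "/"
    code, message = check_disk(path)
    print(message)
    return code
```

Saved as a standalone script, it would finish by passing the result of main(sys.argv) to sys.exit so that Nagios sees the state as the exit code.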
All of these configurations leave room for error, perhaps overscheduling certain checks or leaving a typo somewhere. When the Nagios daemon is started, it performs a “pre-flight check” and outputs any errors it finds. This makes it easy to troubleshoot configuration problems quickly and without a lot of hair pulling.
So far, so good, right? Not bad for free software! However, were this an infomercial, this would be the part where I proclaim “But wait! There’s more!” because all of the above can be accomplished without the web server portion.
By tying Nagios in to Apache, a tech or administrator can observe the entire system at a glance. The current status of each monitored item is color-coded in red, yellow or green depending on state. Icons can be assigned to each host or service for quick identification. Downtime can be scheduled, comments can be placed by techs, and, with some extra configuration and software, full 3D maps can be drawn up of the network. And let’s not forget the WAP interface for mobile phone users.
If you want to get really fancy, you’ll want to look into event handlers, database integration, and more. An event handler will conceivably resolve a problem before contacting a tech by clearing a queue or restarting a service. If you want to track all of Nagios’s uptime data over long periods of time, you could tie it into a MySQL database. Just another example of the flexibility Nagios offers.
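To give a feel for an event handler, here is a hedged Python sketch of the decision such a script might make. It assumes Nagios passes the service state, the state type (SOFT or HARD), and the attempt number as arguments. The restart command, the attempt threshold, and the function names are hypothetical and would need adapting to your own system.

```python
import subprocess

# Hypothetical restart command; substitute whatever your system uses.
RESTART_CMD = ["/etc/rc.d/rc.httpd", "restart"]


def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether the handler should try to fix the service itself."""
    if state != "CRITICAL":
        return False  # OK, WARNING and UNKNOWN are left alone
    if state_type == "HARD":
        return True   # retries exhausted; restart as a last resort
    # On a late SOFT retry, try a restart before anyone gets notified.
    return state_type == "SOFT" and attempt == soft_attempt


def handle(state, state_type, attempt, runner=subprocess.run):
    """Run the restart command if warranted; return True if it was run."""
    if should_restart(state, state_type, attempt):
        runner(RESTART_CMD, check=False)
        return True
    return False
```

The runner argument exists only so the logic can be exercised without actually restarting anything; in production the default subprocess.run would be used.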
One last add-on I’d like to mention is environmental monitoring. ESensors has a device that will connect to your Ethernet network and provide monitoring for temperature, relative humidity, and illumination. This hardware isn’t free, but it does work with Nagios and will send notifications if environment conditions change. Very handy if your A/C breaks down in the middle of the weekend when you have no staff.
Running without the web component, Nagios is fairly secure. The server itself isn’t really accepting external connections as all of the system and host checks are run as if by a user. Administration is as simple as accessing the host box via console or SSH, editing the config files, and restarting the service.
The first concern comes in running the remote monitor add-ons, nrpe and nsca. These open the system to further connections and could result in sensitive information being sniffed on the wire. The two work a little differently, so which one you use depends on your needs, but nrpe can be secured by tcpwrappers or internal configuration files, thus limiting connections, and nsca traffic can be encrypted, thus hampering sniffing attempts.
The real concern is the web interface, which is about as secure as any other web server, except that you have the extra CGI scripts (if enabled) to worry about. Conceivably, this can give an assailant a convenient map of your entire network (including what hosts run what services!) or allow them to manipulate your monitoring to prevent early detection of whatever they may be doing.
One way around this is to run the Nagios box on a private LAN. Without having it exposed to the Internet, the risk is minimal (assuming you’re not running a wireless connection on the same internal network, of course). However, if you or your techs need to access the server from home or other remote locations, this isn’t going to be an option.
Read the security sections of the documentation thoroughly, as they cover a lot of the lockdown capabilities. There are several options in the main Nagios configuration file which can be tweaked, including limitations on who is allowed to run the various CGI scripts from the web interface. The developers also recommend the use of Apache’s .htaccess files to help limit access.
Nagios includes a nice set of reporting features. Accessed through the web, uptime and status statistics can be found for each host or service. If the displayed tables aren’t what you’re looking for, Nagios can output the data to CSV, allowing Excel-happy admins to have their way with it. The rest of us will be content to know whether or not we’re hitting the much-hallowed five nines, or if there’s a chronic problem that needs to be examined more closely.
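For a sense of how strict five nines is, the arithmetic is easy to check. The Python sketch below is plain arithmetic on a 365-day year and does not read Nagios’s report data; the function names are my own.

```python
def allowed_downtime_minutes(availability_pct, days=365.0):
    """Minutes of downtime permitted over `days` at a given availability."""
    return days * 24 * 60 * (1 - availability_pct / 100.0)


def availability_pct(downtime_minutes, days=365.0):
    """Availability percentage achieved given total downtime over `days`."""
    total_minutes = days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(99.999), 2))   # five nines, per year
    print(round(availability_pct(60, days=30), 3))      # one bad hour in a month
```

Five nines works out to roughly five and a quarter minutes of downtime per year, so a single hour-long outage in a month already drops that month below three nines.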
While Nagios will send messages to a pager, it’s essentially doing so with email. If you look at the pager entry in contact definitions, you’ll notice it’s just another email address. If you have pagers using other methods of communication or you would like to trigger some kind of audible alarm or visual cue, you may have to rig up some tricks with Procmail and scripting. External speakers and the right wav (http://www.moviewavs.com/0056218974/WAVS/Movies/Lost_In_Space/danger1.wav) can really get your attention.
If you’re monitoring a small network with a single pipe, Nagios isn’t going to be able to warn you if that connection goes down. This isn’t a concern for me, but had it been, I would have looked into an event handler to trigger a modem dial-up over a POTS line. This would require an account with a dial-up ISP, but it would get the job done.
Finally, keep an eye on your bandwidth, especially in larger networks. Frequent checks to multiple hosts and services can quickly generate a lot of network traffic, so consult apps like MRTG or ntop. It’s also a good idea to test Nagios’s checks during off hours if you’re running non-standard server applications; we have a custom server written in Java that used to lock up whenever Nagios queried its status.
As if you couldn’t already tell, I’ve been very impressed with Nagios and it’s been a great addition to our network. Given we’re a small ISP with a negligible budget for this kind of thing, it has repaid the minimal time investment needed to get it going a hundred times over. Used in combination with utilities like LogWatch and MRTG, it’s really helped us get a handle on what’s happening on our network. Nagios’s scalability also ensures it will grow with our company as we expand our service offerings.
But hey, why not let the product speak for itself? There’s a Nagios demo available online at http://www.nagios.org/demo.php. There are a few features disabled for the guest user, but system monitoring and reporting are available.