Nagios - harbinger of doom

A. Nagios eye view

Big Minus - System administration is political. A good Nagios setup will corner poor planners.

Big Plus - Anything you can script with a return value and text output, you can check.
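
As a minimal sketch of what "a return value and text output" means in practice, here is a disk check in Python. The plugin convention is one line of text plus an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names, the path, and the thresholds are illustrative assumptions, not part of any shipped plugin.

```python
import shutil

# Exit codes Nagios plugins use to report state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct, warn_pct=80.0, crit_pct=90.0):
    """Map a usage percentage to (exit_code, one_line_message)."""
    if used_pct >= crit_pct:
        return CRITICAL, f"DISK CRITICAL - {used_pct:.1f}% used"
    if used_pct >= warn_pct:
        return WARNING, f"DISK WARNING - {used_pct:.1f}% used"
    return OK, f"DISK OK - {used_pct:.1f}% used"


def check_disk(path="/", warn_pct=80.0, crit_pct=90.0):
    """Check disk usage on path (hypothetical defaults)."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything we cannot measure is UNKNOWN, not OK.
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    return classify(100.0 * usage.used / usage.total, warn_pct, crit_pct)
```

To use it as a plugin, print the message and pass the code to sys.exit(). Nagios shows the text and acts on the code.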

B. Meaning of Life

Tell your customers about problems before they tell you.

C. There are three things you can do with Nagios

1. Watch systems - Nagios is good at this. It has a web interface with user-definable views and many ways to drill down and sort data.

2. Remote alerts - Nagios is good at this. You can beep/email/run scripts for any event with any frequency.

3. Trends/Statistics - due to event logging there is some ability here, but this should not be your prime objective. Consider using something like Cacti for network/system stats.

Don't be discouraged. Nagios will improve security, but it is only the breakfast cereal of an IDS solution, which should also include Snort, iptables, syslog-ng, Cacti, and so forth.

D. Parental hierarchies, Custom host checks, and Dependencies.

Downside.

The financial derivatives of the Nagios network. Nagios can make complex decisions with a little help. Just be cautious. Diagnosis can become complex. Always refer back to checkcommands.cfg and run the source commands manually.

Upside.

1. Hierarchies mean that meaningful physical maps may be a click away in the web interface.

2. Fewer Beeps

3. Elaborate Conditional Structures
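
Why hierarchies mean fewer beeps can be sketched in a few lines of Python. This is a simplified model of the idea, not Nagios's actual code: a failed host whose parents have all failed too is treated as UNREACHABLE rather than DOWN, and only DOWN hosts need to page anyone. The host names and the function are hypothetical.

```python
def classify_hosts(parents, failed):
    """Classify each host as UP, DOWN or UNREACHABLE.

    parents: host -> list of parent hosts (empty list = attached
             directly to the monitoring server). Assumed acyclic.
    failed:  set of hosts that do not answer their check.
    """
    states = {}

    def state_of(host):
        if host in states:
            return states[host]
        if host not in failed:
            states[host] = "UP"
        else:
            ups = parents.get(host, [])
            # Blocked: the host has parents and none of them is UP,
            # so we cannot tell whether the host itself is broken.
            if ups and all(state_of(p) != "UP" for p in ups):
                states[host] = "UNREACHABLE"
            else:
                states[host] = "DOWN"
        return states[host]

    for host in parents:
        state_of(host)
    return states


# Hypothetical network: router -> switch -> two web servers.
network = {
    "router": [],
    "switch": ["router"],
    "web1": ["switch"],
    "web2": ["switch"],
}
states = classify_hosts(network, failed={"switch", "web1", "web2"})
to_page = [h for h, s in states.items() if s == "DOWN"]
```

With the switch dead, three hosts fail their checks but only the switch ends up in to_page. Without the parent map, all three would beep.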

E. Avoiding Parental Hierarchies

Upside.

Simpler setup. Simpler diagnosis.

Downside.

Your monitoring server pounds the mail server into oblivion because there are so many beeps.

F. Structural Design

Function: finite, limited or no meaning, reflects implementation
Purpose: implementation sorted into meaningful groups

G. Functionality driven examples

Windows vs Linux checks
drive c: vs drive d: vs drive e:
port 5556 on host A vs port 5556 on host B

H. Purpose driven examples

Fred vs nixgroup vs storageadmins vs deptXYZadmins
port 25 on server vs port 25 on many mail servers vs 25-on-MX-hosts
localdrives vs systemdrives vs datadrives

I. Beep frequency and escalation

1. Who - contact group or contact
2. When - timeperiods
3. How - email? beep? opening all cdroms on earth?
4. How often? - How many tries, how often to retry, when to rebeep?
5. Acceleration (escalation) - Speed up, slow down, beep someone new.
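
How tries, rebeeps, and escalation interact is easier to see in a toy model. The sketch below is a deliberate simplification, not Nagios's scheduler: it stays quiet until a check has failed max_attempts times in a row, rebeeps every renotify_every failed checks after that, and switches to a second contact from the escalate_after-th notification on. All parameter and contact names are hypothetical; they correspond only loosely to settings such as max_check_attempts and notification_interval.

```python
def plan_notifications(results, max_attempts=3, renotify_every=2,
                       escalate_after=3, first_contact="oncall",
                       escalation_contact="manager"):
    """Return a list of (check_index, contact) notifications.

    results: list of booleans, one per check run, True = check passed.
    """
    sent = []
    failures = 0    # consecutive failed checks
    notices = 0     # notifications sent for the current problem
    since_last = 0  # failed checks since the last notification
    for i, ok in enumerate(results):
        if ok:
            # Recovery resets everything.
            failures = notices = since_last = 0
            continue
        failures += 1
        if failures < max_attempts:
            continue  # still retrying, stay quiet
        if notices == 0 or since_last >= renotify_every:
            notices += 1
            since_last = 0
            if notices >= escalate_after:
                contact = escalation_contact
            else:
                contact = first_contact
            sent.append((i, contact))
        since_last += 1
    return sent


# One pass, eight failures in a row, then recovery.
timeline = [True] + [False] * 8 + [True]
plan = plan_notifications(timeline)
```

A single failed check followed by a pass produces no notifications at all, which is the point of retries: transient blips never reach a human.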

J. Check methods

1. Network-based scripts - the script opens a network connection directly to the service from the monitoring server
Ex: snmp, ssh, http

2. NRPE - Linux remote client runs commands locally on watched servers (*locally configurable)

3. check_nt - Older Windows client to run commands locally on watched servers

4. winnrpe - Newer Windows remote client runs commands locally on watched servers (*locally configurable)

5. SNMP Trap server - contacts Nagios about SNMP trap events.
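
The first method, a network-based check, is the simplest to sketch: the monitoring server opens a connection to the service and reports what happened. The Python below is a minimal illustration under that assumption, not a replacement for the stock plugins; the function name and timeout are hypothetical.

```python
import socket

# Plugin exit codes used here.
OK, CRITICAL = 0, 2


def check_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection; return (exit_code, message)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"TCP OK - {host}:{port} accepts connections"
    except OSError as exc:
        # Covers refused connections, timeouts and name lookup failures.
        return CRITICAL, f"TCP CRITICAL - {host}:{port}: {exc}"
```

A check run through NRPE or a Windows agent has the same shape. The only difference is that the script runs on the watched server and the agent relays the result.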

K. Big Picture conclusion

Downsides.

It takes brains

It's long hours of implementation and debugging

Any alert system will require many tough decisions and much manpower.

A coworker may throw you in the bay in a homicidal fit.

Upsides.

You actually know what is happening no matter where you are.

L. Installation

Basic install

Hardware
CPU - decently modern CPU, 1+ GHz. Excellent target for SMP.
RAM - to be safe, 1 GB of RAM for every thousand service checks.
DISK - negligible
NET - low latency (server-quality card). Bandwidth is a non-issue. It may end up on the backbone for resiliency and connectivity reasons.
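
The RAM rule of thumb is simple arithmetic. A small sketch, assuming you round up to the next whole thousand checks (the function name is hypothetical):

```python
import math


def ram_gb_needed(service_checks, gb_per_thousand=1):
    """Apply the 1 GB per thousand service checks rule, rounding up."""
    thousands = max(1, math.ceil(service_checks / 1000))
    return thousands * gb_per_thousand
```

So 2,500 service checks would call for 3 GB under this rule.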

Software
nagios RPM
nagios-plugins RPM
Web server
Mail server

/etc/nagios is the config directory

M. Configuring your first machine

You will need entries in /etc/nagios for

cgi.cfg
nagios.cfg
checkcommands.cfg
contacts.cfg
contactgroups.cfg
timeperiods.cfg
hostgroups*.cfg
services*.cfg
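
Before the first start it can help to confirm the files are actually there. The Python sketch below checks a config directory against a list of expected names and wildcard patterns taken from the list above. The CGI configuration file is left out on purpose; add it under whatever name your install uses. The function and variable names are hypothetical.

```python
import glob
import os

# File names and patterns from the list above (CGI config omitted).
EXPECTED = [
    "nagios.cfg",
    "checkcommands.cfg",
    "contacts.cfg",
    "contactgroups.cfg",
    "timeperiods.cfg",
    "hostgroups*.cfg",
    "services*.cfg",
]


def missing_configs(config_dir="/etc/nagios", patterns=EXPECTED):
    """Return the patterns that match no file in config_dir."""
    return [p for p in patterns
            if not glob.glob(os.path.join(config_dir, p))]
```

An empty list means every expected file or pattern matched at least once. It says nothing about whether the contents are valid.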

Don't forget to configure Apache to look at the Nagios install.
Don't forget to configure your mail server.

N. Customization

Custom scripts may be desirable. Here are a few of mine.

cluster-ssh.pl
rsync scripts
check-snmp-cisco-port-pool.sh

O. Gotchas

nagios.cmd disappeared.

nagios.status.log permissions.