MRTG Solaris/Linux/BSD Performance Monitoring Extensions

I. Introduction/FAQ

MRTG Solaris/Linux Performance Monitoring Extensions (PME v 1.0.2)

by Bill Lynch Oct. 10, 2000

MRTG is already awesome. What makes these extensions any better?

MRTG is awesome, but by default it’s limited to SNMP metrics. These extension packages allow for much finer detail for performance monitoring, and they’re easy to set up, configure and maintain. Plus, since you don’t need to have an SNMP daemon running, your system is more secure.

How do I install PME?

Just untar the file and copy the appropriate .pl files to your run directory, then create your own config file from the templates provided.
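As a rough illustration of what such a config file might contain, here is a minimal sketch for the Solaris CPU monitor. The directives are standard MRTG ones, but the paths, the target name "web1.cpu", and the hostname "web1" are placeholders invented for this example. The templates shipped with PME remain the authoritative starting point.

```
# Hypothetical paths and names; adjust them to your installation.
WorkDir: /var/www/mrtg

# Backticks tell MRTG to run an external script instead of polling SNMP.
Target[web1.cpu]: `/usr/local/mrtg/cpu-solaris.pl web1`
MaxBytes[web1.cpu]: 100
Options[web1.cpu]: gauge, nopercent, growright
Title[web1.cpu]: CPU Utilization for web1
PageTop[web1.cpu]: <H1>CPU Utilization for web1</H1>
YLegend[web1.cpu]: Percent
ShortLegend[web1.cpu]: %
LegendI[web1.cpu]: User:
LegendO[web1.cpu]: System:
```

The "gauge" option matters here because vmstat percentages are point-in-time readings rather than ever-increasing counters.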

Where can I get the newest version of PME?

The newest version of PME can always be found here. Mirrors are forthcoming.

What kind of metrics can I monitor with these extensions?

Just about anything that comes out of vmstat, plus a few extras. For Solaris you can monitor CPU and memory utilization, paging and swapping, the run queue, the number of blocked processes, the scan rate, as well as disk (partition) capacity, and disk performance. Linux monitors include CPU and memory utilization, swapping, the run queue, the number of blocked processes, and disk (partition) capacity. BSD monitors include CPU, memory, the run queue, the number of blocked processes, disk (partition) capacity, and some disk performance characteristics.

Are there any screenshots available?

Right now, all I have available are some static screenshots of a Linux box, but the Solaris and xBSD monitors look almost identical. You can see these static (they don’t update) Linux screenshots of these metrics: CPU, Memory, Swap Memory, Processes, Swapping, root (/). If someone would like to contribute live links, or some Solaris or xBSD static screenshots of a box that sees heavy use, that would be awesome.

What kind of warranty/support can I get?

This software is GPL and comes with ABSOLUTELY NO WARRANTY. Use at your own risk, but it’s stable and I’ve been using it in a production environment for a couple of months. As far as support goes, I do this in my spare time. Feel free to send me (or even Mark) an e-mail if you have questions or concerns, but please read the docs first! I went to a lot of trouble to make these extensions as easy as possible to set up and use.

Also, I don’t claim to be a performance-monitoring guru. If any of my explanations of what a performance characteristic means are incorrect, then let me know and we can discuss it. This is also my first open source contribution.

What’s coming in the future?

I’d like to port these monitors to other operating systems when I get the chance. This release includes the port to xBSD, and in the future I may develop ports to AIX and possibly HP-UX. If you have a suggestion or request let me know.

I’d like to create a disk performance monitor for Linux, but that requires a kernel patch that I can’t seem to find (see the details in section IV). Please let me know if you have any information in this area.

The next release will probably include some network monitors generated from netstat of some sort.

Who do you want to thank?

First, I have to thank my partner Mark, who got me started in this whole mess anyway. I also want to thank my employer, Crave Technology, who supported all of my development efforts. Of course, I also want to thank Tobias Oetiker and Dave Rand, without whom there would be no MRTG project for me to write extensions for. Finally, I want to thank my Mom because without her I wouldn’t be here making cool software for the rest of you! If I left anyone out, just whine and I will include you.

II. Installation and Configuration of Secure Shell (ssh)
Overview

Perl scripts have been developed to be used in conjunction with MRTG to monitor Solaris, Linux, and xBSD systems remotely. This remote monitoring uses ssh to transfer performance statistics to the performance monitoring station. This section discusses the details of how to install and configure the ssh package.

Monitoring Explained

While performance data is gathered from remote machines, all scripts are executed on the machine that started the MRTG process. Localhost monitoring runs a local command such as "vmstat 1 2" and then parses the output of this command for data. Remote hosts are monitored by initiating a secure shell (ssh) connection to the remote host and then executing an identical command on that host. The output of this command is returned to the MRTG host which then parses the output in an identical fashion to the localhost version.
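The local-versus-remote split described above can be sketched as follows. This is an illustration in Python, not the Perl that PME actually ships. The function names are hypothetical, and treating "localhost" and "127.0.0.1" as the local machine is an assumption about how the distinction might be made.

```python
import subprocess


def build_command(hostname, command):
    """Return the argv list used to collect statistics for one host.

    On the monitoring station itself the command runs directly; for any
    other host the identical command is wrapped in an ssh invocation.
    The set of names treated as "local" is an assumption of this sketch.
    """
    if hostname in ("localhost", "127.0.0.1"):
        return list(command)
    return ["ssh", hostname] + list(command)


def collect(hostname, command=("vmstat", "1", "2")):
    """Run the command locally or over ssh and return its output as text."""
    result = subprocess.run(
        build_command(hostname, command),
        capture_output=True,
        text=True,
        check=True,
        timeout=60,
    )
    return result.stdout
```

Because the same text comes back either way, one parser can serve both cases. For a remote host, collect("web1") only works once the password-less ssh trust described in this section is in place.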

Background on ssh

Prior to the development of ssh, traffic between hosts was usually unencrypted or trivially encrypted. The ability to "sniff" the transactions of a telnet or remote shell (rsh) session meant that unscrupulous personnel could acquire passwords from wire traffic with relative ease. Secure shell (ssh) non-trivially encrypts traffic between hosts and prevents the acquisition of passwords through the use of "sniffers".

The use of ssh in the MRTG scripts mimics the functionality of rsh, in that a trusted host (as specified in an .rhosts file for rsh) is allowed to execute commands on a system remotely. For ssh, not only can access be restricted to specific hosts, it is restricted to specific users at specific hosts.

Each host with ssh installed has both a public and a private key for each user which are used for encryption. During the establishment of a trusted connection, the trusted host sends its public key for the user initiating the connection to the trusting host. The trusting host compares this key with its list of authorized keys for that user. The username on both the trusted and the trusting hosts must be identical. If the key exists in the trusting host’s authorized key list for the specified user, access is granted (without a password). If the key does not exist in the authorized key list, the user is prompted to present a password for authentication. If the password fails, no connection is established.

MRTG User Account

As described above, ssh trusts require that an identical login exists both on the monitoring station and the systems to be monitored.

It should also be noted that from a security standpoint if the MRTG user account on the monitoring station were compromised, then an attacker would be able to run remote shells as the MRTG user on any machine being monitored. Thus, the monitoring host should be hardened as much as possible. However, the reverse is not true because the monitoring station does not trust the monitored hosts by default.

Installing the ssh Package on Monitored Hosts

Version 1.2.27 is the recommended version of ssh to install, as these scripts have not been tested with other versions (although they are likely to be compatible).

For specific installation instructions, please refer to the ssh documentation. However, the following are the general steps necessary to set up ssh for MRTG.

1. Download and install ssh on the monitoring host and all the machines that you want to monitor.
2. Create a user account for MRTG to use that is identical on the monitoring station and all of the monitored systems. Do NOT use the root account. If you do and your monitoring station gets hacked, the intruder will automatically have root access on all the boxes you are monitoring.
3. Create a ~/.ssh directory for the MRTG user account on each of the monitored hosts and set the permissions on the directory to 700.
4. On the monitoring station, log in as the MRTG user account and create a public key using the ssh-keygen command. This should create (among others) an ~/.ssh/identity.pub public key file.
5. Use scp to push the file ~/.ssh/identity.pub to all of the monitored hosts. On each of the monitored hosts this file needs to become the file ~/.ssh/authorized_keys, which is used to authenticate the trust. Your command might look something like "scp ~/.ssh/identity.pub mrtguser@host:/home/mrtguser/.ssh/authorized_keys".
6. Test that the trust works by issuing the command "ssh -v remotehost". If you are dropped to a shell on the remote host without being asked for a password, then your trust is working properly. If you are asked for a password, the debug messages should help you determine what the problem is.

Now that monitoring is ready to go, the next section discusses how monitoring actually works as well as what the results mean. It's also always a good idea to test each script outside of MRTG first just to make sure it works on your system. You may need to modify the shebang line depending on where you have perl installed on your system.

III. MRTG Performance Monitoring Scripts for Solaris
Overview

This document discusses the details of how each of the perl scripts for gathering Solaris performance metrics functions. All scripts are provided as-is with no warranty.
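Whatever a script measures, MRTG expects an external program to print four lines: the first value, the second value, an uptime string, and a name for the target. The Python sketch below shows that final formatting step. The function name and sample values are hypothetical, and the PME scripts themselves are Perl and may build their output differently.

```python
def mrtg_report(value1, value2, uptime, name):
    """Format two readings the way MRTG expects from an external script.

    Line 1: first value (drawn as the "in" series)
    Line 2: second value (drawn as the "out" series)
    Line 3: uptime of the target, as free-form text
    Line 4: name of the target
    """
    return "\n".join([str(value1), str(value2), uptime, name]) + "\n"
```

For example, mrtg_report(37, 12, "5 days, 2:10", "web1") yields the four lines "37", "12", "5 days, 2:10" and "web1". Running a script by hand and checking for exactly this shape is a quick way to test it outside of MRTG.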

CPU Monitoring
Script Name: cpu-solaris.pl hostname
Generic Config. File: solaris-cpu.cfg

The CPU monitoring script monitors both user and system modes of CPU utilization. This CPU utilization is an average of all of the processors in the host. The values come from the output of "vmstat 1 2". User CPU utilization refers to the "us" column and System CPU utilization refers to the "sy" column. If User and System CPU utilization do not total 100%, then the balance is attributed to idle time, which is not graphed. As a rule of thumb, total CPU usage should not exceed a 3:1 user:system ratio for extended periods of time. If Total CPU Utilization reaches 100%, then check the run queue for any waiting processes. Chronically high CPU utilization coupled with waiting processes indicates a CPU bottleneck.
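One way the "us" and "sy" values might be pulled out of "vmstat 1 2" is sketched below in Python. The sample text is made-up data in the Solaris layout, and parse_cpu is a hypothetical name. The sketch assumes the CPU group is the last three columns (us, sy, id), which holds for Solaris but not for every vmstat.

```python
# Made-up sample of "vmstat 1 2" output in the Solaris layout.
SAMPLE_VMSTAT = """\
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 1481792 98312  3  14  2  1  1  0  0  1  0  0  0  412  530  198  4  2 94
 0 1 0 1469304 12120  9  31 24  8 11  0 57  6  0  0  0  405 1210  233 37 12 51
"""


def parse_cpu(vmstat_output):
    """Return (user, system) CPU percentages from the last sample line.

    The first data line of "vmstat 1 2" is an average since boot, so only
    the final line reflects the most recent one-second interval.
    """
    lines = [ln for ln in vmstat_output.splitlines() if ln.strip()]
    fields = lines[-1].split()
    # Assumption: the row ends with "us sy id". Counting from the end also
    # sidesteps the second "sy" column (system calls) under "faults".
    return int(fields[-3]), int(fields[-2])
```

Counting from the right means extra per-disk columns in the middle of the row do not shift the result, although a vmstat that appends further CPU columns would need different offsets.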

Memory Monitoring
Script Name: mem-solaris.pl hostname
Generic Config. File: solaris-mem.cfg

Memory monitoring tracks the available physical and swap memory in the host, based on the output of "vmstat 1 2". Total available virtual memory is listed in the column "swap" under "memory" while free physical memory is listed in the "free" column. The MRTG graph of this data may look confusing initially, as the graph appears nearly full all the time. Actually, a nearly full graph really means that there is an abundance of free virtual memory, based on the sample configuration file. A troubled system would be low on physical memory, which shows up as a very low blue line on the MRTG graph, and this would almost always be coupled with low virtual memory, signified by a drastic drop-off in the green shading. However, depending on the amount of physical memory in the system and the type of applications running on the system (especially databases), it's possible that a low blue line for physical free memory signifies normal operation.

Paging Statistics
Script Name: page-solaris.pl hostname
Generic Config. File: solaris-page.cfg

Paging statistics monitor the number of memory pages written out to or read in from disk. Large values may represent disk thrashing and chronically large values could indicate that the system either has a memory leak in its software, or more physical memory may be required, especially when these statistics occur simultaneously with low values for free physical memory. Page-In and Page-Out statistics are generated from the "pi" and "po" columns of "vmstat 1 2", respectively.
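Columns with unique names, such as "pi" and "po", can be located by matching them against the vmstat header row. The Python sketch below illustrates that idea. The sample is made-up data in the Solaris layout, vmstat_columns is a hypothetical helper, and the real Perl scripts may locate columns differently.

```python
# Made-up sample of "vmstat 1 2" output in the Solaris layout.
SAMPLE_VMSTAT = """\
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 1481792 98312  3  14  2  1  1  0  0  1  0  0  0  412  530  198  4  2 94
 0 1 0 1469304 12120  9  31 24  8 11  0 57  6  0  0  0  405 1210  233 37 12 51
"""


def vmstat_columns(vmstat_output, names):
    """Return {column name: value} taken from the last sample line."""
    lines = [ln for ln in vmstat_output.splitlines() if ln.strip()]
    # The header row is the line that contains every requested column name.
    header = next(
        (ln.split() for ln in lines if set(names) <= set(ln.split())), None
    )
    if header is None:
        raise ValueError("no header row contains all of: %s" % ", ".join(names))
    values = lines[-1].split()
    # index() finds the first match, so this is only safe for names that
    # occur once in the header ("sy" occurs twice on Solaris).
    return {name: int(values[header.index(name)]) for name in names}
```

With the sample above, vmstat_columns(SAMPLE_VMSTAT, ["pi", "po"]) gives 24 pages in and 8 pages out for the latest interval. The same lookup works for other uniquely named columns such as "swap", "free" or "sr".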

Swapping Statistics
Script Usage: swap-solaris.pl hostname
Generic Config. File: solaris-swap.cfg

Swapping statistics monitor the number of processes swapped in from or swapped out to disk. Swapping differs from paging in that when paging occurs, only unused portions of a process are sent to disk. When swapping occurs, the entire process is written to disk and upon re-activation must be read back from disk. Swapping almost always degrades performance and thus a healthy system should usually show "0" for both the number of swapped in and swapped out processes. Statistics for Swap-In and Swap-Out are generated from the "si" and "so" columns of "vmstat -S 1 2", respectively.

Process Queuing Statistics
Script Usage: procs-solaris.pl hostname
Generic Config. File: solaris-procs.cfg

Through the use of this script it is possible to monitor both the number of blocked processes and the number of queued processes. Blocked processes are processes that are waiting for some sort of I/O, such as disk reads/writes or network reads/writes. It is desirable for both of these counters to remain zero because positive numbers indicate performance degradation. A large number of blocked processes indicates memory, disk or network slowness, and queued processes indicate that the CPU has more work than it can currently handle.

Scan Rate Statistics
Script Usage: scan-solaris.pl hostname
Generic Config. File: solaris-scan.cfg

To make more room in physical memory for currently needed pages, Solaris moves seldom-used pages to disk. When free memory runs low, the page scanner walks through the pages in physical memory looking for ones that have not been used recently and can therefore be freed. The rate at which this scan occurs is called the scan rate, which vmstat reports in the "sr" column. Since a sustained scan rate is an indicator of memory pressure and heavy paging, it is desirable for the scan rate to have low or even zero values.

Disk Capacity Statistics
Script Usage: disk-solaris.pl hostname mount stat1 stat2
Generic Config. File: solaris-root.cfg (varies per disk partition)

Disk space is at a premium on most systems and monitoring is important because full slices can crash applications and possibly the OS itself. The purpose of the disk capacity script is to return the used and total disk space on a given file system, through the use of the "df -k" command. If the sample config file settings are used, the file system monitor will show total disk space as a thin blue line and the actual used space as a shaded green area. The difference between the blue line and the top of the green shaded area represents the free space on the partition.
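Extracting used and total space for one mount point from "df -k" might look like the Python sketch below. The sample is made-up output in the Solaris layout and disk_usage is a hypothetical name. The sketch assumes the row ends with total, used, available, capacity and mount point, and that mount points contain no spaces.

```python
# Made-up sample of "df -k" output in the Solaris layout.
SAMPLE_DF = """\
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0t0d0s0    2052750  812345 1178823    41%    /
/dev/dsk/c0t0d0s6    4129290 3509896  578102    86%    /usr
"""


def disk_usage(df_output, mount):
    """Return (used_kb, total_kb) for the file system mounted at `mount`."""
    for line in df_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Index from the right so that a long device name wrapped onto its
        # own line (leaving only the five numeric/mount fields) still works.
        if len(fields) >= 5 and fields[-1] == mount:
            return int(fields[-4]), int(fields[-5])
    raise KeyError(mount)
```

With the sample above, disk_usage(SAMPLE_DF, "/usr") reports 3509896 kB used out of 4129290 kB. These two numbers correspond to the green area and the blue line described above.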

Disk Performance Statistics
Script Usage: dstat-solaris.pl hostname disk stat1 stat2
Generic Config. File: solaris-c0t0d0wb.cfg, solaris-c0t0d0ks.cfg (varies per disk drive)

The disk performance data used by this script comes from the "iostat" command. These statistics monitor both read and write performance in kB per second. Disk activity is also monitored by the percentage of time the disk is busy, which indicates when the disk is in use, versus the percentage of time the disk is waiting with transactions in the disk queue. It is important to note that wait is not idle time, but rather the amount of time the disk is overutilized and causing disk reads or writes to queue. The disk can be busy nearly 100% of the time with no performance degradation so long as the wait time is acceptably small.

IV. MRTG Performance Monitoring Scripts for Linux
Overview

Perl scripts have been developed for use in conjunction with MRTG to monitor performance on Linux systems. This document discusses the details of how each of these scripts functions. All scripts are provided as-is with no warranty.

CPU Monitoring
Script Name: cpu-linux.pl hostname
Generic Config. File: linuxtemplate-cpu.cfg

The CPU monitoring script monitors both user and system modes of CPU utilization. This CPU utilization is an average of all of the processors in the host. The values come from the output of "vmstat 1 2". User CPU utilization refers to the "us" column and System CPU utilization refers to the "sy" column. If User and System CPU utilization do not total 100%, then the balance is attributed to idle time, which is not graphed. As a rule of thumb, total CPU usage should not exceed a 3:1 user:system ratio for extended periods of time. If Total CPU Utilization reaches 100%, then check the run queue for any waiting processes. Chronically high CPU utilization coupled with waiting processes indicates a CPU bottleneck.

Memory Monitoring
Script Name: mem-linux.pl hostname stat1 stat2
Generic Config. File: linuxtemplate-mem.cfg

The Memory monitoring script monitors Linux’s built-in memory metrics located in /proc/meminfo. Total available memory is listed as "total" while free memory is listed as "free". The MRTG graph of this data may look confusing initially, as the graph is nearly full all the time. Because of its I/O caching, Linux can grow to fill almost all physical memory in the system, so this graph may not be particularly useful. On a Linux system, memory performance can best be gauged by watching the amount of swap space actually in use. If a system is using all of its physical memory and most of its swap, it probably needs more physical memory.
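As a sketch of the "swap actually in use" measurement, the Python below reads the "Name: value kB" lines of /proc/meminfo. The field names MemTotal, MemFree, SwapTotal and SwapFree are the ones current kernels use. Kernels from around 2000 also printed a table with total and free columns, which appears to be what the text above refers to, so this is not necessarily how mem-linux.pl reads the file. The sample values and function names are made up.

```python
# Made-up excerpt in the "Name: value kB" layout of /proc/meminfo.
SAMPLE_MEMINFO = """\
MemTotal:        2048000 kB
MemFree:          102400 kB
SwapTotal:       1048576 kB
SwapFree:         786432 kB
"""


def parse_meminfo(text):
    """Return {field name: value in kB} for every "Name: number" line."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip anything that is not "Name: <integer> ...", such as the
        # column-header table that older kernels placed at the top.
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def swap_in_use_kb(text):
    """Swap space currently in use, in kB."""
    info = parse_meminfo(text)
    return info["SwapTotal"] - info["SwapFree"]
```

On a live system the input would be the contents of /proc/meminfo, read locally or fetched over ssh. Watching swap_in_use_kb over time is one way to follow the advice above that swap usage, rather than free physical memory, is the better indicator on Linux.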

Swapping Statistics
Script Usage: swap-linux.pl hostname
Generic Config. File: linuxtemplate-swap.cfg

I haven’t been able to get anyone to satisfactorily explain to me how Linux swap is different from Solaris swap, aside from the fact that Linux never panic swaps. Thus, this script reads the swap statistics from vmstat, but I leave it up to you to decide what it means. I’d be very happy if someone decides to enlighten me.

Process Queuing Statistics
Remote Script Usage: procs-linux.pl hostname
Generic Config. File: linuxtemplate-procs.cfg

Through the use of this script it is possible to monitor both the number of blocked processes and the number of queued processes. Blocked processes are processes that are waiting for some sort of I/O, such as disk reads/writes or network reads/writes. It is desirable for both of these counters to remain zero because positive numbers indicate performance degradation. A large number of blocked processes indicates memory, disk or network slowness, and queued processes indicate that the CPU has more work than it can currently handle.

Disk Capacity Statistics
Script Usage: disk-linux.pl hostname mount stat1 stat2
Generic Config. File: linuxtemplate-root.cfg (varies per disk partition)

Disk space is at a premium on most systems and monitoring is important because full slices can crash applications and possibly the OS itself. The purpose of the disk capacity script is to return the used and total disk space on a given file system, through the use of the "df -k" command. If the sample config file settings are used, the file system monitor will show total disk space as a thin blue line and the actual used space as a shaded green area. The difference between the blue line and the top of the green shaded area represents the free space on the partition.

POSSIBLE BUG NOTE: While this script works great on its own from the command-line, I have been having some difficulty getting it to interface correctly with MRTG. As a result I sometimes get very odd looking graphs similar to this one, ONLY when monitoring LINUX hosts. My suspicion is that this is a result of improper clock settings on my monitoring server, but I only see the problem with this particular monitor, and only on Linux. If you experience similar difficulties, please let me know.

Disk Performance Statistics

I do have a script written for pulling in metrics from iostat in Linux, but as it is practically useless, I have opted not to release it. If you read the man page for iostat, you will see that the command is functionally broken because of the way that the kernel handles disk I/O. The man page states that this can be fixed with a kernel patch, but I have been unable to locate such a patch. If you have information that would be useful for implementing this metric, please let me know.

V. MRTG Performance Monitoring Scripts for xBSD
Overview

Perl scripts have been developed for use in conjunction with MRTG to monitor performance on xBSD systems. This document discusses the details of how each of these scripts functions. All scripts are provided as-is with no warranty. At this point, the scripts have only been tested on FreeBSD 4.1.1-RELEASE #0 and OpenBSD Generic #25. I haven't used these scripts in production yet (mainly because I don't have any xBSD boxes in production) but they seem to be working just fine. If you find any bugs, please let me know.

CPU Monitoring
Script Name: cpu-bsd.pl hostname
Generic Config. File: bsdtemplate-cpu.cfg

The CPU monitoring script monitors both user and system modes of CPU utilization. This CPU utilization is an average of all of the processors in the host. The values come from the output of "vmstat 1 2". User CPU utilization refers to the "us" column and System CPU utilization refers to the "sy" column. If User and System CPU utilization do not total 100%, then the balance is attributed to idle time, which is not graphed. As a rule of thumb, total CPU usage should not exceed a 3:1 user:system ratio for extended periods of time. If Total CPU Utilization reaches 100%, then check the run queue for any waiting processes. Chronically high CPU utilization coupled with waiting processes indicates a CPU bottleneck.

WARNING! I don't have access to any xBSD machines that have more than one disk, so I don't know how the vmstat output changes. If extra disks mean extra columns in vmstat, then the CPU monitor may be affected. If you have multiple disks in your xBSD boxes, and the CPU monitor works for you, then please let me know!

Memory Monitoring
Script Name: mem-bsd.pl hostname
Generic Config. File: bsdtemplate-mem.cfg

Memory monitoring tracks the available physical and swap memory in the host, based on the output of "vmstat 1 2". I'm sure this isn't the best way to monitor memory under BSD, so I would be happy to entertain any suggestions for a different system. As it stands, the active virtual pages (avm) and the size of the free list (fre) can be used for monitoring. AVM is the number of pages that have been accessed within the last 20 seconds, and the number reported has the units of kBytes. The size of the free list is given in 512 kB blocks, I believe (that would seem to reconcile with the "top" output).

Process Queuing Statistics
Remote Script Usage: procs-bsd.pl hostname
Generic Config. File: bsdtemplate-procs.cfg

Through the use of this script it is possible to monitor both the number of blocked processes and the number of queued processes. Blocked processes are processes that are waiting for some sort of I/O, such as disk reads/writes or network reads/writes. It is desirable for both of these counters to remain zero because positive numbers indicate performance degradation. A large number of blocked processes indicates memory, disk or network slowness, and queued processes indicate that the CPU has more work than it can currently handle.

Disk Capacity Statistics
Script Usage: disk-bsd.pl hostname mount stat1 stat2
Generic Config. File: bsdtemplate-root.cfg (varies per disk partition)

Disk space is at a premium on most systems and monitoring is important because full slices can crash applications and possibly the OS itself. The purpose of the disk capacity script is to return the used and total disk space on a given file system, through the use of the "df -k" command. If the sample config file settings are used, the file system monitor will show total disk space as a thin blue line and the actual used space as a shaded green area. The difference between the blue line and the top of the green shaded area represents the free space on the partition.

Disk Performance Statistics
Script Usage: dstat-bsd.pl hostname disk stat1 stat2
Generic Config. File: bsd-dstat.cfg

The disk performance data used by this script comes from the "iostat" command. The xBSD implementation of this monitor differs drastically from the Solaris version. Under xBSD, monitors can be set up to watch kilobytes per transfer (kbt), transfers per second (tps), and megabytes per second (mbs). Each of these metrics gives an idea of how often a drive is accessed, although an empirical baseline may need to be determined on a per disk basis.

Change Log

Version 1.0.2 - First official public release
- Added support for xBSD
- Rewrote all scripts to be more portable, especially for parsing the uptime
Version 1.0.1 - Unofficial release, never publicly available
Version 1.0.0 - Unofficial release, publicly available for < 10 mins.

Copyright

All copyrights are properties of their respective owners, and none of them endorse this site. Please don't sue me.