This note describes a heartbeat technique for validating the integrity of a NUT installation.

Introduction
------------

A NUT configuration may run for months with little or no output to assure the system administrator that the combined processes are running correctly. The technique described in this note verifies that the UPS driver, upsd, upsmon, upssched and upssched-cmd components are operational and that data flows correctly between them. The system administrator is warned if any link in this chain breaks.

Overview of the technique
-------------------------

An 11 minute upssched timer runs permanently, and if it ever completes, upssched-cmd sends a warning message to the sysadmin. During normal operation the timer is prevented from completing by a timed process with a shorter 10 minute period running in a dummy UPS known as "heartbeat". The dummy UPS "heartbeat" cycles through the statuses OL and OB every 10 minutes, and the status changes are communicated to upsd and then to upsmon and upssched. Thus every 10 minutes upssched stops and restarts the 11 minute timer. During normal operation the 11 minute timer never completes, but if the driver -> upsd -> upsmon -> upssched chain is broken, it completes and the sysadmin is warned.

The technique requires a working NUT installation and an understanding of upssched timers and the upssched-cmd script.
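
The timing relationship described above can be checked mechanically. What follows is a minimal shell sketch, not part of NUT, assuming the dummy UPS is driven by a .dev file containing TIMER lines such as the heartbeat.dev file given later in this note. The function names cycle_seconds and timer_is_safe are hypothetical.

```sh
#!/bin/sh
# Sum the TIMER values in a dummy-ups .dev file.
# The result is the length of one full OL/OB cycle, in seconds.
cycle_seconds() {
    awk '$1 == "TIMER" { total += $2 } END { print total + 0 }' "$1"
}

# Succeed only if the upssched timer (in seconds) is longer than one
# heartbeat cycle, so that each ONBATT event arrives in time to restart it.
timer_is_safe() {
    dev_file=$1
    timer=$2
    [ "$timer" -gt "$(cycle_seconds "$dev_file")" ]
}
```

With the values used in this note (two TIMER lines of 300 seconds and a 660 second upssched timer), timer_is_safe heartbeat.dev 660 should succeed, while a value of 540 should fail.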

Changes to configuration files
------------------------------

1. In ups.conf, add

[heartbeat]
        driver = dummy-ups
        port = heartbeat.dev
        desc = "Heart beat validation of NUT"

2. Create heartbeat.dev in the same directory as ups.conf with the contents

ups.status: OL
TIMER 300
ups.status: OB
TIMER 300

Remember that there are no comments in NUT .dev files.

3. In upsmon.conf, add

MONITOR heartbeat@localhost 1 upsmaster s3cr3t master

and make sure that you have specified

NOTIFYCMD /usr/sbin/upssched
NOTIFYFLAG ONBATT   SYSLOG+WALL+EXEC
NOTIFYFLAG ONLINE   SYSLOG+WALL+EXEC

Your upssched executable may be elsewhere. You may want to remove the WALL.

4. In upssched.conf, add

# Heart beat validation that NUT is operational.
# Restart timer which completes only if the dummy-ups heart beat has stopped.
# See timer values in heartbeat.dev
AT ONBATT heartbeat@localhost CANCEL-TIMER heartbeat-failure-timer
AT ONBATT heartbeat@localhost START-TIMER  heartbeat-failure-timer 660

and make sure that there are no entries such as

AT ONLINE * ...
AT ONBATT * ...

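Replace the "*" with the full address of the UPS unit, e.g. myups@localhost. A "*" would also match heartbeat@localhost, so the heartbeat status changes would trigger the actions intended for the real UPS.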

Make sure that you have specified

CMDSCRIPT /usr/sbin/upssched-cmd

Your upssched-cmd may be elsewhere.

5. In upssched-cmd, test for completion of the heartbeat-failure-timer and, when it completes, send a warning to the sysadmin by e-mail, SMS, pigeon, ...
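
As an illustration of step 5, here is a minimal sketch of an upssched-cmd script. It relies on upssched passing the timer name as the first argument. The helper names notify_sysadmin and handle_event are hypothetical, and the use of logger (syslog) as the warning channel is an assumption; substitute your own mail or SMS command.

```sh
#!/bin/sh
# Sketch of an upssched-cmd script. upssched calls it with the
# name of the completed timer as the first argument.

# Hypothetical helper: replace the body with mail, SMS, etc.
notify_sysadmin() {
    # logger writes to syslog; fall back to stderr if it is missing.
    if command -v logger >/dev/null 2>&1; then
        logger -t upssched-cmd "$1"
    else
        printf '%s\n' "$1" >&2
    fi
}

# Choose a message according to the timer name.
handle_event() {
    case "$1" in
        heartbeat-failure-timer)
            notify_sysadmin "NUT heartbeat failure: heartbeat@localhost has stopped reporting status changes"
            ;;
        *)
            notify_sysadmin "upssched-cmd: unrecognized event: $1"
            ;;
    esac
}

# Act only when called with an argument, as upssched does.
if [ "$#" -gt 0 ]; then
    handle_event "$1"
fi
```

Any existing cases for your real UPS (on battery, shutdown timers and so on) would sit alongside the heartbeat-failure-timer case.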

Testing the heartbeat setup
---------------------------

1. Test that you can send a warning to the sysadmin with the command

   upssched-cmd heartbeat-failure-timer

2. When you start NUT, check that "heartbeat" is running. The command ps aux | grep ups should show something like

upsd     14785  0.0  0.0  13228   652 ?        Ss   22:48   0:00 /usr/lib/ups/driver/usbhid-ups -a myups
upsd     14787  0.0  0.0  19624   704 ?        Ss   22:48   0:00 /usr/lib/ups/driver/dummy-ups -a heartbeat
upsd     14791  0.0  0.0  17560   744 ?        Ss   22:48   0:00 /usr/sbin/upsd -u upsd
root     14794  0.0  0.0  19432   664 ?        Ss   22:48   0:00 /usr/sbin/upsmon
upsd     14795  0.0  0.0  19856  1616 ?        S    22:48   0:00 /usr/sbin/upsmon
upsd     14845  0.0  0.0   6408   448 ?        S    22:53   0:00 /usr/sbin/upssched UPS heartbeat@localhost: On battery

3. Shorten the heartbeat-failure-timer in upssched.conf to 540 seconds, and you should receive a warning every 10 minutes.

4. If you leave the WALL in the NOTIFYFLAG ONBATT and NOTIFYFLAG ONLINE declarations in upsmon.conf you will see the action of the dummy-ups displayed in an xterm or equivalent console.
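
In addition to the checks above, the dummy UPS can be queried directly with upsc, the standard NUT client. This is a small sketch under the assumption that upsc is on your PATH and that upsd is listening on localhost; the function names heartbeat_status and heartbeat_alive are hypothetical.

```sh
#!/bin/sh
# Print the current status of the dummy UPS.
# Defaults to heartbeat@localhost if no UPS name is given.
heartbeat_status() {
    upsc "${1:-heartbeat@localhost}" ups.status 2>/dev/null
}

# Succeed if the reported status is one of the two values
# written in heartbeat.dev.
heartbeat_alive() {
    case "$(heartbeat_status "$1")" in
        OL|OB) return 0 ;;
        *)     return 1 ;;
    esac
}
```

Calling heartbeat_status a few times over ten minutes should show the status alternating between OL and OB.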

I have tested this setup with NUT 2.7.4 on openSUSE 13.2 and 42.2.
Comments and suggestions welcome.

Roger

