I wrote this little check script for NRPE/Nagios to get the status of the RAID volumes in a box and output the failed volumes, if any exist.
SYNTAX:
$path/check_smartarray.sh [email] [email ...]
If no arguments are specified, the script assumes it is being run by NRPE.
If one or more email addresses are specified, the script will send an email in case an array reports an error.
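The two modes can be pictured as one final reporting step. This is a minimal sketch, not the script's actual code: the function name report and its argument order are hypothetical, and it assumes the status string and Nagios exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) have already been worked out. It uses the standard mail(1) command with -s for the subject.

```sh
#!/bin/sh
# Hypothetical sketch of the two modes; "report" is an illustrative name.
#   $1    status string, e.g. "da1: DEGRADED / da0: ok"
#   $2    Nagios exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN)
#   $3... optional email addresses
report() {
  status=$1
  code=$2
  shift 2
  if [ $# -eq 0 ]; then
    # NRPE mode: print a single line; NRPE hands it and the exit code
    # back to Nagios.
    printf '%s\n' "$status"
  elif [ "$code" -ne 0 ]; then
    # Email mode: mail each address, but only when an array reports a problem.
    for addr in "$@"; do
      printf '%s\n' "$status" | mail -s "RAID status on $(hostname)" "$addr"
    done
  fi
  return "$code"
}
```

At the end of a script this could be called as `report "$STATUS" "$CODE" "$@"` followed by `exit $?`, where STATUS and CODE are hypothetical variables holding the check results.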
Example output:
da1: DEGRADED / da2: rebuilding / da0: ok / da3: ok
Failed/rebuilding volumes will always be first in the output string, to help diagnose the problem when receiving the output via pager/SMS.
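The "problems first" ordering can be sketched in plain sh. This assumes each volume has already been reduced to a "device: state" string; the sample values are hard-coded rather than parsed from camcontrol, and build_status is a hypothetical name. The " / " separator is the one the script uses since 1.4.2.

```sh
#!/bin/sh
# Hypothetical sketch: join "device: state" pairs with " / ", putting every
# volume whose state is not "ok" before the healthy ones.
build_status() {
  bad=""
  good=""
  for pair in "$@"; do
    # Strip everything up to and including ": " to get the state.
    case "${pair#*: }" in
      ok) good="${good:+$good / }$pair" ;;
      *)  bad="${bad:+$bad / }$pair" ;;
    esac
  done
  if [ -n "$bad" ] && [ -n "$good" ]; then
    printf '%s / %s\n' "$bad" "$good"
  else
    printf '%s\n' "$bad$good"
  fi
}

build_status "da0: ok" "da1: DEGRADED" "da2: rebuilding" "da3: ok"
# prints: da1: DEGRADED / da2: rebuilding / da0: ok / da3: ok
```

Keeping two accumulators preserves the original device order within each group, so the output stays predictable.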
Various outputs explained:
State | Meaning |
ok | The device is reported as OK by the SmartArray controller. |
DEGRADED | The RAID volume is degraded. It still works, but without the redundancy of RAID and, in some cases, with severe performance loss. |
rebuilding | The RAID volume is rebuilding; it will return to ok when done. |
expanding | The RAID volume is expanding; it will return to ok when done. |
ready for recovery | The RAID volume is ready for recovery but is not recovering. This can happen if automatic recovery is disabled, and on some smaller SmartArray controllers that can only rebuild one RAID volume at a time. |
unknown state | The volume is in an unknown state. Please report this to me (soren at klintrup.dk) so I can update the script. |
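The table does not say which Nagios exit code each state produces. The changelog only says that an unknown state exits with code 3 (1.4) and that a critical result is not downgraded to unknown (1.4.1). The sketch below assumes DEGRADED maps to CRITICAL (2) and the transitional states (rebuilding, expanding, ready for recovery) map to WARNING (1), and it also assumes UNKNOWN ranks above WARNING when combining volumes. The function names are hypothetical.

```sh
#!/bin/sh
# Hypothetical sketch. Nagios plugin exit codes: 0=OK, 1=WARNING,
# 2=CRITICAL, 3=UNKNOWN. The mapping below is an assumption.
state_code() {
  case "$1" in
    ok) echo 0 ;;
    rebuilding|expanding|"ready for recovery") echo 1 ;;  # assumed WARNING
    DEGRADED) echo 2 ;;                                   # assumed CRITICAL
    *) echo 3 ;;                                          # unknown state
  esac
}

# Severity rank used when combining codes: OK < WARNING < UNKNOWN < CRITICAL,
# so an unknown volume never overrides a critical one (cf. changelog 1.4.1).
rank() {
  case "$1" in
    0) echo 0 ;;
    1) echo 1 ;;
    3) echo 2 ;;
    2) echo 3 ;;
  esac
}

# Print the most severe exit code among all given volume states.
overall() {
  result=0
  for state in "$@"; do
    code=$(state_code "$state")
    if [ "$(rank "$code")" -gt "$(rank "$result")" ]; then
      result=$code
    fi
  done
  echo "$result"
}

overall ok DEGRADED rebuilding   # prints 2
```

Comparing ranks instead of raw codes is what keeps 3 (UNKNOWN) from overwriting 2 (CRITICAL), even though 3 is numerically larger.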
Tested on the following controllers:
HP SmartArray 6i
HP SmartArray 5i
HP SmartArray P400
HP SmartArray P410
HP SmartArray P800
It should work on all SmartArray controllers, though. If you test it on another controller (working or not), I would like to know; please mail me at soren at klintrup.dk.
Latest version 1.5 check_smartarray.sh
Changelog:
1.5:
o Can now email an address of choice; just pass the email address(es) as arguments to the shell script
o Check whether the camcontrol binary exists on the system before running the script
1.4.5:
o Fixed problems with the status of ADG (Advanced Data Guarding) volumes.
Thanks to Peter Larsen for reporting this
1.4.4:
o Added the "expanding" state (online expansion)
Thanks to Mikael Antonsen for reporting this :)
1.4.3:
o Changed tr A-Z a-z to tr [:upper:] [:lower:] to prevent problems with various locales.
Thanks to Oliver Fromme for reporting this :)
1.4.2:
o The Nagios web interface would only show one RAID volume: Nagios treats "|" in plugin output as the start of performance data, so everything after it was not displayed.
Changed the "|" to a "/"
Thanks to Kai Gallasch for reporting this :)
1.4.1:
o Patch by Christoph Schug applied to replace two cut calls with one sed call when getting DEVICESTRING.
o Added quotes in various places for consistency
o Don't set state to unknown if state is already critical (for code added in 1.4)
o unset $ERR before doing anything to avoid problems if the variable is already set
1.4:
o If a volume didn't have a known state, that volume simply wasn't shown; the script now exits with error code 3 and reports it as "unknown state"
1.3.1:
o Using tr to translate the string output from camcontrol, for a more human-readable script; no change in functionality.
1.3:
o Initial public release
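Two of the changelog items correspond to common shell idioms. This is a sketch with hypothetical helper names, not the script's own code: have_binary uses the POSIX command -v lookup for the 1.5 camcontrol check, and to_lower uses character classes for the 1.4.3 locale fix. Ranges such as A-Z can match unexpected characters in some locales, while [:upper:] and [:lower:] follow the locale's own definitions.

```sh
#!/bin/sh
# Hypothetical helpers illustrating changelog items 1.5 and 1.4.3.

# 1.5: succeed only if the named command can be found in $PATH.
have_binary() {
  command -v "$1" >/dev/null 2>&1
}

# 1.4.3: lowercase with character classes instead of A-Z/a-z ranges.
# The classes are quoted so the shell cannot expand them as filename globs.
to_lower() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}
```

Near the top of a script this might be used as `have_binary camcontrol || { echo "camcontrol not found"; exit 3; }`. Exiting with 3 (UNKNOWN) here is an assumption, since the changelog does not say what the script reports when camcontrol is missing.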