Zabbix triggers with "flap-detection" and a grace period.

Monitoring an environment gives you control over it, so it's pretty important. But setting up a monitoring system can be a challenge: it should not alert too fast, but also not too slow.

Nagios uses "flap detection" to prevent many ERRORs and OKs being sent right after each other. Zabbix calls this "hysteresis". Zabbix's hysteresis is rather difficult to understand, so I'd like to share some triggers that I have set up for Zabbix that implement both flap detection/hysteresis and grace.

Grace can be defined like this: "When a value has gone above (or below) the threshold that caused a trigger to alert, make sure it is a little lower (or higher) than that threshold again before recovering the trigger." I know; it's not easy to understand... Let's look at some examples.
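Before getting to the Zabbix syntax, the idea of grace can be sketched in a few lines of Python. This is only an illustration, not something Zabbix runs: the function name hysteresis and the sample values are made up, and the numbers 5 and 4 are simply the ALERTVALUE and RECOVERYVALUE used in the CPU load example further on.

```python
def hysteresis(values, alert_value, recovery_value):
    """Yield (value, in_alert) pairs for a value that should stay low."""
    in_alert = False
    for value in values:
        if not in_alert and value > alert_value:
            in_alert = True       # crossed the alert threshold
        elif in_alert and value < recovery_value:
            in_alert = False      # dropped clearly below the threshold again
        yield value, in_alert


# Hypothetical load samples hovering around the alert threshold of 5.
samples = [3.0, 5.2, 4.8, 5.1, 4.6, 3.9, 4.5]
states = [state for _, state in hysteresis(samples, alert_value=5, recovery_value=4)]
print(states)
```

With a single threshold of 5 these samples would alert, recover and alert again. With the separate recovery value the alert stays on until the load drops below 4.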

Alerting when a value rises above a threshold

For values that need to stay below a threshold, like CPU load, the number of users logged in or the number of processes running:

({TRIGGER.VALUE}=0&{TEMPLATE:CHECK[ITEM].min(300)}>ALERTVALUE)|({TRIGGER.VALUE}=1&{TEMPLATE:CHECK[ITEM].max(300)}>RECOVERYVALUE)

Just to clarify the different parts of the trigger:

  1. {TRIGGER.VALUE} is 0 while the trigger is OK and 1 while it is in alert, so the first part (before the |) only matters when there is no alert and the part after the | only matters when the trigger is already in alert.
  2. .min(300) returns the lowest value of the last 300 seconds, so the alert only fires when every value in that period was above ALERTVALUE.
  3. The last part (after the |) uses .max(300), the highest value of the last 300 seconds, so the trigger only recovers once no value in that period was above RECOVERYVALUE.
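The three points above can be played through with a small Python simulation. It is a rough sketch under a few assumptions: the function simulate_trigger is hypothetical, the 300 seconds are replaced by a window of the last 3 samples to keep the trace short, and the load values are invented.

```python
from collections import deque


def simulate_trigger(samples, alert_value, recovery_value, window=3):
    """Return the trigger state (0 = OK, 1 = alert) after each sample.

    `window` is a number of samples standing in for the 300 seconds.
    """
    recent = deque(maxlen=window)
    state = 0
    states = []
    for sample in samples:
        recent.append(sample)
        if state == 0:
            # Alert only when even the lowest recent value is too high.
            if min(recent) > alert_value:
                state = 1
        else:
            # Stay in alert while any recent value is above the recovery value.
            if not max(recent) > recovery_value:
                state = 0
        states.append(state)
    return states


# Invented CPU load: two short spikes, a sustained peak, then a slow decline.
load = [3, 6, 3, 6, 6, 6, 4.5, 4.5, 4.5, 3, 3, 3]
states = simulate_trigger(load, alert_value=5, recovery_value=4)
print(states)
```

The single spikes at the start never raise an alert, the sustained load does, and the alert is only cleared after the whole window has dropped to 4 or lower.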

For example, CPU load with an ALERTVALUE of 5 and a RECOVERYVALUE of 4:

({TRIGGER.VALUE}=0&{Template_Linux:system.cpu.load[,avg1].min(300)}>5)|({TRIGGER.VALUE}=1&{Template_Linux:system.cpu.load[,avg1].max(300)}>4)

Alerting when a value drops below a threshold

For values that need to stay above a threshold, like the percentage of disk space free, the number of inodes free or the number of httpd processes running:

({TRIGGER.VALUE}=0&{TEMPLATE:CHECK[ITEM].max(300)}<ALERTVALUE)|({TRIGGER.VALUE}=1&{TEMPLATE:CHECK[ITEM].min(300)}<RECOVERYVALUE)

For example, the free disk space of /var in percent with an ALERTVALUE of 10 and a RECOVERYVALUE of 11:

({TRIGGER.VALUE}=0&{Template_Linux:vfs.fs.size[/var,pfree].max(300)}<10)|({TRIGGER.VALUE}=1&{Template_Linux:vfs.fs.size[/var,pfree].min(300)}<11)
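The mirrored case, where low values are the problem, can be sketched the same way. Again this is only an illustration with assumptions: low_value_events is a hypothetical helper, the window of 3 samples stands in for the 300 seconds, and the free-space percentages are invented.

```python
def low_value_events(samples, alert_value, recovery_value, window=3):
    """Return (index, "ALERT" or "RECOVER") events for a value that should stay high."""
    in_alert = False
    events = []
    for i in range(len(samples)):
        # The most recent `window` samples, fewer at the very start.
        recent = samples[max(0, i - window + 1): i + 1]
        if not in_alert and max(recent) < alert_value:
            # Even the highest recent value is too low.
            in_alert = True
            events.append((i, "ALERT"))
        elif in_alert and not min(recent) < recovery_value:
            # No recent value is below the recovery value any more.
            in_alert = False
            events.append((i, "RECOVER"))
    return events


# Invented percentages of free space on /var.
pfree = [12, 9.5, 9.5, 9.5, 10.5, 10.5, 10.5, 12, 12, 12]
events = low_value_events(pfree, alert_value=10, recovery_value=11)
print(events)
```

The stretch at 10.5 percent is above the alert value but below the recovery value, so the alert stays on until the free space has been at 11 percent or more for the whole window.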

These rather complex triggers will prevent spikes in load or disk usage from causing an alert, but the drawback is that you might miss certain interesting spikes too. Overall, my opinion is that a monitoring system should not drive people crazy, because alerts will be ignored when too many are received.
