Zabbix triggers with "flap-detection" and a grace period.

Monitoring an environment gives you control over it, so it's pretty important. But setting up a monitoring system can be a challenge: it should not alert too fast, but also not too slow.

Nagios uses "flap detection" to prevent many ERRORs and OKs being sent right after each other. Zabbix calls this "hysteresis". Zabbix's hysteresis is rather difficult to understand, so I'd like to share some triggers that I have set up for Zabbix that implement both flap detection/hysteresis and grace.

Grace can be defined like this: "When a value has gone above (or below) a threshold and caused the trigger to alert, make sure it is back a little below (or above) that threshold before the trigger recovers." I know; it's not easy to understand... Let's look at some examples.

Values that should stay below a threshold

For values that need to stay below a threshold, like CPU load, the number of users logged in or the number of processes running:

({TRIGGER.VALUE}=0&{TEMPLATE:CHECK[ITEM].min(300)}>ALERTVALUE)|({TRIGGER.VALUE}=1&{TEMPLATE:CHECK[ITEM].max(300)}>RECOVERYVALUE)

Just to clarify the different parts of the trigger:

  1. {TRIGGER.VALUE} is 0 while the trigger is OK and 1 while it is in alert, so the part before the | is only evaluated when there is no alert and the part after the | only when the trigger is already in alert.
  2. .min(300) is the lowest value of the last 300 seconds, so the alert only starts when every value in that period is higher than ALERTVALUE.
  3. The last part (after the |) keeps the trigger in alert as long as any value of the last 300 seconds is higher than RECOVERYVALUE, so the trigger only recovers when the measured value has not been above RECOVERYVALUE for 300 seconds.

For example CPU load with an ALERTVALUE of 5 and a RECOVERYVALUE of 4:

({TRIGGER.VALUE}=0&{Template_Linux:system.cpu.load[,avg1].min(300)}>5)|({TRIGGER.VALUE}=1&{Template_Linux:system.cpu.load[,avg1].max(300)}>4)
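
The comparison used in the part after the | decides whether the trigger really holds its alert, and that is easy to try out outside Zabbix. What follows is a minimal Python sketch, not Zabbix code. It assumes one measurement per minute, that an expression evaluating to true means the trigger is in alert, and a made-up series of load values; the names window and simulate are only there for illustration. It runs the CPU load trigger twice, once with max(300)>4 and once with max(300)<4 as the second clause.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, measured value)


def window(samples: Sequence[Sample], now: int, seconds: int) -> List[float]:
    """Values measured in the last `seconds` seconds, up to and including `now`."""
    return [value for t, value in samples if now - seconds < t <= now]


def simulate(
    samples: Sequence[Sample],
    alert: Callable[[List[float]], bool],
    hold: Callable[[List[float]], bool],
    seconds: int = 300,
) -> List[int]:
    """Trigger state (0 = OK, 1 = alert) after each sample.

    `alert` plays the role of the clause guarded by {TRIGGER.VALUE}=0,
    `hold` the role of the clause guarded by {TRIGGER.VALUE}=1.
    """
    state = 0
    states = []
    for now, _ in samples:
        values = window(samples, now, seconds)
        expression = (state == 0 and alert(values)) or (state == 1 and hold(values))
        state = 1 if expression else 0
        states.append(state)
    return states


# Made-up load values, one per minute: calm, busy, cooling down, calm again.
loads = [1.0] * 5 + [6.0] * 8 + [4.5] * 3 + [3.0] * 6
samples = [(60 * i, load) for i, load in enumerate(loads)]

with_greater = simulate(samples, lambda v: min(v) > 5, lambda v: max(v) > 4)
with_less = simulate(samples, lambda v: min(v) > 5, lambda v: max(v) < 4)

print("max(300)>4:", with_greater)
print("max(300)<4:", with_less)
```

With max(300)>4 the trigger goes into alert once and stays there until no value of the last five minutes is above 4. With max(300)<4 the second clause is false right after the alert, so the trigger drops back to OK while the load is still 6 and then fires again.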

Values that should stay above a threshold

For values that need to stay above a threshold, like the percentage of disk space free, the number of inodes free or the number of httpd processes running:

({TRIGGER.VALUE}=0&{TEMPLATE:CHECK[ITEM].max(300)}<ALERTVALUE)|({TRIGGER.VALUE}=1&{TEMPLATE:CHECK[ITEM].min(300)}<RECOVERYVALUE)

For example disk space of /var free in percent with an ALERTVALUE of 10 and a RECOVERYVALUE of 11:

({TRIGGER.VALUE}=0&{Template_Linux:vfs.fs.size[/var,pfree].max(300)}<10)|({TRIGGER.VALUE}=1&{Template_Linux:vfs.fs.size[/var,pfree].min(300)}<11)
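
The same idea for a value that must stay high can be written as a small state machine that is fed one measurement at a time. This is again only a Python sketch under assumptions: one sample per minute, made-up percentages, a hypothetical class name LowValueTrigger, and the trigger staying in alert for as long as any value in the window is still below the recovery value.

```python
from collections import deque
from typing import Deque, Tuple


class LowValueTrigger:
    """Alert when a value stays too low, recover only after it stayed high enough."""

    def __init__(self, alert_value: float, recovery_value: float, seconds: int = 300):
        self.alert_value = alert_value
        self.recovery_value = recovery_value
        self.seconds = seconds
        self.samples: Deque[Tuple[int, float]] = deque()
        self.state = 0  # 0 = OK, 1 = alert

    def update(self, timestamp: int, value: float) -> int:
        """Store a new sample and return the resulting trigger state."""
        self.samples.append((timestamp, value))
        # Drop samples that are 300 seconds old or older.
        while self.samples[0][0] <= timestamp - self.seconds:
            self.samples.popleft()
        values = [v for _, v in self.samples]
        if self.state == 0:
            # No alert yet: every recent value must be below the alert value.
            problem = max(values) < self.alert_value
        else:
            # In alert: stay there while any recent value is below the recovery value.
            problem = min(values) < self.recovery_value
        self.state = 1 if problem else 0
        return self.state


# Made-up percentages of free space on /var, one per minute.
free = [15.0] * 5 + [9.0] * 6 + [10.5] * 4 + [12.0] * 6
trigger = LowValueTrigger(alert_value=10, recovery_value=11)
states = [trigger.update(60 * i, pfree) for i, pfree in enumerate(free)]
print(states)
```

In this run the alert starts after five minutes at 9 percent free, survives the stretch at 10.5 percent (above the alert value, but not yet above the recovery value) and only clears after five minutes at 12 percent.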

These rather complex triggers will prevent spikes in load or disk usage from causing an alert, but the drawback is that you might miss certain interesting spikes too. Overall my opinion is that a monitoring system should not drive people crazy, because alerts will be ignored when too many are received.
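
To make that trade-off concrete, here is a small Python sketch that compares a trigger looking only at the latest value with one that requires the minimum of the last five samples to be above the alert value. The load values are made up, one sample per minute is assumed, both function names are hypothetical, and only the alerting half of the trigger is modelled.

```python
from typing import List, Sequence


def latest_value_states(values: Sequence[float], alert_value: float) -> List[int]:
    """1 whenever the latest value alone is above the alert value."""
    return [1 if value > alert_value else 0 for value in values]


def windowed_min_states(
    values: Sequence[float], alert_value: float, samples_in_window: int = 5
) -> List[int]:
    """1 only when every value in the recent window is above the alert value."""
    states = []
    for i in range(len(values)):
        recent = values[max(0, i - samples_in_window + 1): i + 1]
        states.append(1 if min(recent) > alert_value else 0)
    return states


# A two-minute spike in an otherwise quiet load series.
loads = [1.0, 1.0, 1.0, 9.0, 9.0, 1.0, 1.0, 1.0, 1.0, 1.0]

print("latest value :", latest_value_states(loads, 5))
print("min over 300s:", windowed_min_states(loads, 5))
```

The first variant reports the spike for two minutes, the second one never does. Whether that is noise filtered out or an interesting event missed depends on what you are monitoring.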

Comments

Hi, trigger expression should

Hi,

trigger expression should be like this:

({TRIGGER.VALUE}=0&{Template_Linux:system.cpu.load[,avg1].min(300)}>5)|({TRIGGER.VALUE}=1&{Template_Linux:system.cpu.load[,avg1].max(300)}>4)

Correct me if I'm wrong, but with the statement you posted you will receive two alerts.

The first one is a PROBLEM, because the trigger value was 0 and the minimum load in the 5m period is more than 5. The second one (OK status) follows because neither condition is fulfilled any more (the trigger is 1 while the first condition needs it to be 0, and the max value in the 5m period is not lower than 4).

Regards,
p.

Shouldn't it be >

Shouldn't it be > RECOVERYVALUE? The trigger condition should evaluate to false to turn it off. So if for the last five minutes some values are above the recovery value, the trigger should remain on.

I don't understand your

I don't understand your comment. What should be replaced by RECOVERYVALUE exactly?

Thanks for the blog post, it

Thanks for the blog post, it was very helpful to me.

However please correct it to reflect the other comments.
When you trigger on a '>' event, the recovery part should also remain '>': the error condition is for something 'too big'. If it's not too big any more, the trigger clears.

I think he's right. Let's

I think he's right. Let's assume that the trigger fires on min(300)>5. Now say the value drops below 5 (actually below 4 in this example), so that the last two values are 5 and 3. The "|({TRIGGER.VALUE}=1" part is now the one being evaluated, and it checks max(300)<4. With these two values the outcome is false instead of true (because there is a higher value in the last 300 sec), while it needs to be true to keep the trigger active for a while. If you make that max(300)>4, it stays true until no value in the last 300 sec is above 4.

Hopefully this makes sense, it took me a while to figure it out. But please let us know if you think otherwise. Thanks.
