Monitoring an environment with a monitoring system gives you control, so it's pretty important. But it can be a challenge to set up a monitoring system; it should not alert too fast, but also not too slow.
Nagios uses "flap detection" to prevent many ERRORs and OKs being sent right after each other. Zabbix calls this "hysteresis". Zabbix's hysteresis is rather difficult to understand, so I'd like to share some triggers that I have set up for Zabbix that implement both flap detection/hysteresis and grace.
Grace can be defined like this: "When a value is higher (or lower) than a threshold, make sure it's a little lower (or higher) than the threshold that caused the trigger to alert before recovering the trigger." I know; it's not easy to understand... Let's look at some examples.
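To make that definition a bit more concrete, here is a minimal Python sketch of the idea outside of Zabbix. The function names and the sample load values are made up for illustration; the thresholds 5 and 4 are the ones used in the CPU load example in this article.

```python
def next_state(in_problem, value, alert, recovery):
    """Return the new state for a value that should stay low.

    While OK, only a value above `alert` raises a problem.
    While in problem, the state only recovers once the value
    has dropped below the (lower) `recovery` threshold.
    """
    if not in_problem:
        return value > alert
    return value >= recovery


def run(values, alert=5.0, recovery=4.0):
    """Feed a series of values through next_state and collect the states."""
    state = False
    states = []
    for value in values:
        state = next_state(state, value, alert, recovery)
        states.append(state)
    return states


# Hypothetical CPU load samples.
loads = [3.0, 5.5, 4.8, 4.2, 3.9, 4.5]
print(run(loads))
```

The problem state starts at 5.5 and survives 4.8 and 4.2, because those values are below the alert threshold but not yet below the recovery threshold. It only clears at 3.9, and 4.5 afterwards does not raise it again.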
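With values that need to be below a threshold, like CPU load, number of users logged in or number of processes running:
({TRIGGER.VALUE}=0&{TEMPLATE:CHECK[ITEM].min(300)}>ALERTVALUE)|({TRIGGER.VALUE}=1&{TEMPLATE:CHECK[ITEM].max(300)}<RECOVERYVALUE)
Just to clarify the different parts of the trigger: {TRIGGER.VALUE} is 0 while the trigger is OK and 1 while it is in PROBLEM state, min(300) and max(300) are the lowest and highest values received in the last 300 seconds, & means "and", and | means "or".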
For example CPU load with an ALERTVALUE of 5 and a RECOVERYVALUE of 4:
({TRIGGER.VALUE}=0&{Template_Linux:system.cpu.load[,avg1].min(300)}>5)|({TRIGGER.VALUE}=1&{Template_Linux:system.cpu.load[,avg1].max(300)}<4)
With values that need to be above a threshold, like percentage of disk space free, number of inodes free or number of httpd processes running:
({TRIGGER.VALUE}=0&{TEMPLATE:CHECK[ITEM].max(300)}<ALERTVALUE)|({TRIGGER.VALUE}=1&{TEMPLATE:CHECK[ITEM].min(300)}>RECOVERYVALUE)
For example disk space of /var free in percent with an ALERTVALUE of 10 and a RECOVERYVALUE of 11:
({TRIGGER.VALUE}=0&{Template_Linux:vfs.fs.size[/var,pfree].max(300)}<10)|({TRIGGER.VALUE}=1&{Template_Linux:vfs.fs.size[/var,pfree].min(300)}>11)
These rather complex triggers will prevent spikes of load or disk usage from causing an alert, but the drawback is that you might miss certain interesting spikes too. Overall my opinion is that a monitoring system should not drive people crazy, because alerts will be ignored when too many are received.
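The effect of min(300) on spikes can be sketched in a few lines of Python. This is only an illustration, not how Zabbix works internally: it assumes one sample per minute, so a window of 5 samples stands in for 300 seconds, and the function names and load values are invented.

```python
from collections import deque


def alerts_on_last_value(values, alert):
    """Alert whenever the latest sample alone exceeds the threshold."""
    return [value > alert for value in values]


def alerts_on_window_min(values, alert, window):
    """Alert only when the lowest of the last `window` samples exceeds
    the threshold, roughly what min(300) > ALERTVALUE asks for."""
    recent = deque(maxlen=window)
    result = []
    for value in values:
        recent.append(value)
        result.append(min(recent) > alert)
    return result


# Hypothetical load: a single spike of 9, later a sustained load of 6.
loads = [2, 2, 2, 2, 2, 9, 2, 2, 6, 6, 6, 6, 6]
naive = alerts_on_last_value(loads, alert=5)
windowed = alerts_on_window_min(loads, alert=5, window=5)
print(naive)
print(windowed)
```

With these numbers the last-value check alerts on the single spike and again as soon as the load reaches 6. The windowed check ignores the spike completely (the drawback mentioned above) and only alerts once all five samples in the window are above 5.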
Comments
Hi, trigger expression should
Hi,
trigger expression should be like this:
({TRIGGER.VALUE}=0&{Template_Linux:system.cpu.load[,avg1].min(300)}>5)|({TRIGGER.VALUE}=1&{Template_Linux:system.cpu.load[,avg1].max(300)}>4)
Correct me if I'm wrong, but with the statement you posted you will receive two alerts.
The first with PROBLEM, because the trigger value was 0 and the minimum load in the 5m period is more than 5. And a second one (OK status), because neither condition is fulfilled (the trigger is 1 while the first condition needs it to be 0; the max value in the 5m period is not lower than 4).
Regards,
p.
Shouldn't it be >
Shouldn't it be > RECOVERYVALUE? The trigger condition should evaluate to false to turn it off. So if for the last five minutes some values are above the recovery value, the trigger should remain on.
I don't understand your
I don't understand your comment. What should be replaced by RECOVERYVALUE exactly?
I think he's right. Let's
I think he's right. Let's assume that a trigger happens on min(300)>5. Now say the value gets below 5 (actually 4 in this example), resulting in the last two values being 5 and 3. Then the "| ({TRIGGER.VALUE}=1" part will become true, which will then evaluate whether max(300)<4. With these 2 values its outcome is actually false (because there is a higher value in the last 300 sec), whereas it needs to be true to keep the trigger active for a while. So when you make that max(300)>4, it will stay true until no value in the last 300 sec is above 4.
Hopefully this makes sense, it took me a while to figure it out.. But please, let us know if you think otherwise. Thanks.
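The reasoning in these comments can be checked with a small Python simulation. It is a rough sketch under stated assumptions, not a model of Zabbix itself: the expression is re-evaluated once per sample, one sample arrives per minute (so 5 samples stand in for 300 seconds), a true expression means PROBLEM, and the function name and load values are hypothetical.

```python
from collections import deque


def simulate(values, recovery_clause, alert=5, window=5):
    """Evaluate (VALUE=0 & min > alert) | (VALUE=1 & recovery_clause(max))
    for every sample and return the resulting trigger values."""
    state = 0  # stands in for {TRIGGER.VALUE}: 0 = OK, 1 = PROBLEM
    recent = deque(maxlen=window)
    history = []
    for value in values:
        recent.append(value)
        fire = state == 0 and min(recent) > alert
        hold = state == 1 and recovery_clause(max(recent))
        state = 1 if (fire or hold) else 0
        history.append(state)
    return history


# Hypothetical load: quiet, then a sustained load of 6, then quiet again.
loads = [3] * 5 + [6] * 7 + [3] * 6

posted = simulate(loads, lambda highest: highest < 4)     # max(300)<4
suggested = simulate(loads, lambda highest: highest > 4)  # max(300)>4
print(posted)
print(suggested)
```

Under these assumptions the posted expression goes to PROBLEM and straight back to OK twice while the load is still 6, which matches the "two alerts" described in the first comment. The version with max(300)>4 goes to PROBLEM once and stays there until no sample in the window is above 4.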