It means that the alert runner/thread on the agent waits for 1.5 seconds before 
raising a WARNING and 5 seconds before raising a CRITICAL message.


The configured interval for how often this check runs is independent of the thresholds 
consumed by the alert instance.
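Roughly, the agent-side logic looks something like this (an illustrative sketch only, not 
the actual agent code; the host, port, and interval below are placeholders):

import socket
import time

CHECK_INTERVAL = 60   # how often the alert runs (seconds); configured separately
WARNING_SECS = 1.5    # "warning" value from the alert definition
CRITICAL_SECS = 5.0   # "critical" value, also used here as the connect timeout

def check_port(host, port):
    # Time a TCP connect and map the result to an alert state and text.
    start = time.time()
    try:
        socket.create_connection((host, port), timeout=CRITICAL_SECS).close()
    except OSError as err:
        return "CRITICAL", "Connection failed: {0} to {1}:{2}".format(err, host, port)
    elapsed = time.time() - start
    state = "WARNING" if elapsed >= WARNING_SECS else "OK"
    return state, "TCP OK - {0:.3f}s response on port {1}".format(elapsed, port)

while True:
    state, text = check_port("metrics-collector.example.com", 6188)
    print(state, text)
    time.sleep(CHECK_INTERVAL)   # the interval is independent of the thresholds above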


I am not sure what you mean by "point-in-time"; the alert_history table does log 
execution results, although we bubble up only the last execution status. @Jonathan 
Hurley might be able to shed more light on the finer details.

BR,
Sid


________________________________
From: Jonathan Hurley <[email protected]>
Sent: Friday, October 28, 2016 1:44 PM
To: Ganesh Viswanathan
Cc: [email protected]
Subject: Re: Ambari Metrics Collector Process alert - CRITICAL threshold rule

In your version of Ambari, the alert will trigger right away. In Ambari 2.4, we 
have the notion of "soft" and "hard" alerts. You can configure it so that it 
doesn't trigger alert notifications until n CRITICAL alerts have been received 
in a row.
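Conceptually, that repeat tolerance behaves something like this (a rough sketch of the 
idea only, not Ambari code; the class name and tolerance value are made up):

class RepeatTolerance(object):
    # Only escalate to a "hard" CRITICAL after N consecutive CRITICAL results.
    def __init__(self, tolerance=3):
        self.tolerance = tolerance
        self.consecutive = 0

    def on_result(self, state):
        if state != "CRITICAL":
            self.consecutive = 0
            return "no notification ({0})".format(state)
        self.consecutive += 1
        if self.consecutive >= self.tolerance:
            return "HARD CRITICAL - send notification"
        return "SOFT CRITICAL {0}/{1} - hold notification".format(
            self.consecutive, self.tolerance)

tracker = RepeatTolerance(tolerance=3)
for state in ["CRITICAL", "CRITICAL", "OK", "CRITICAL", "CRITICAL", "CRITICAL"]:
    print(tracker.on_result(state))   # only the final result escalates to HARD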

On Oct 28, 2016, at 4:07 PM, Ganesh Viswanathan <[email protected]> wrote:

Thanks Jonathan, that explains some of the behavior I'm seeing.

Two additional questions:
1)  How do I make sure the Ambari "Metrics Collector Process" alert does not fire 
immediately when the process is down? I am using Ambari 2.2.1.0, which has a 
bug [1] that can trigger restarts of the process. The fix for 
AMBARI-15492 <http://issues.apache.org/jira/browse/AMBARI-15492> has been 
documented in that wiki as "comment out auto-recovery", but that would mean the 
process would not restart when the bug hits, taking down visibility into the 
cluster metrics. We have disabled the auto-restart count alert because of the 
bug, but what is a good way to say "if the metrics collector process has been 
down for 15 minutes, then alert"?

2) Will restarting the "Metrics Collector Process" impact the other HBase or HDFS 
health alerts? Or is this process only for the Ambari Metrics system 
(collecting usage and internal Ambari metrics)? I am trying to see if the 
Ambari Metrics Collector Process can be disabled while still keeping the other 
HBase and HDFS alerts.

[1] https://cwiki.apache.org/confluence/display/AMBARI/Known+Issues


-Ganesh


On Fri, Oct 28, 2016 at 12:36 PM, Jonathan Hurley <[email protected]> wrote:
It sounds like you're asking two different questions here. Let me see if I can 
address them:

Most "CRITICAL" thresholds do contain different text than their OK/WARNING 
counterparts. This is because different information needs to be 
conveyed when an alert has gone CRITICAL. In the case of this alert, it's a 
port connection problem. When that happens, administrators are mostly 
interested in the error message and the attempted host:port combination. I'm 
not sure what you mean by "CRITICAL is a point in time alert". All alerts of 
the PORT/WEB variety are point-in-time alerts. They represent the connection 
state of a socket and the data returned over that socket at a specific point in 
time. The alert which gets recorded in Ambari's database maintains the time of 
the alert. This value is available via a tooltip hover in the UI.

The second part of your question asks why increasing the timeout value to 
something large, like 600, doesn't prevent the alert from triggering. I believe 
this comes down to how the Python sockets are being used: a refused connection is 
not subject to the same timeout restriction as a socket that simply won't respond. 
If the machine is available and refuses the connection outright, then the 
timeout never takes effect.
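You can see that behavior with a bare Python socket (a standalone illustration, not the 
alert script itself; the port here is just an example with nothing listening on it):

import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(600)   # generous timeout, similar to raising the alert value

start = time.time()
try:
    # With nothing listening, the OS answers the SYN with a RST, so the
    # connect fails immediately instead of waiting out the 600s timeout.
    sock.connect(("127.0.0.1", 6188))
except socket.timeout:
    print("timed out after {0:.1f}s".format(time.time() - start))
except OSError as err:
    print("refused after {0:.3f}s: {1}".format(time.time() - start, err))
finally:
    sock.close()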



On Oct 28, 2016, at 1:37 PM, Ganesh Viswanathan <[email protected]> wrote:

Hello,

The Ambari "Metrics Collector Process" alert has a different definition for the 
CRITICAL threshold vs. the OK and WARNING thresholds. What is the reason for this?

In my tests, CRITICAL seems like a "point-in-time" alert and the value of that 
field is not being used. When the metrics collector process is killed or 
restarted, the alert fires in 1 minute or less even when I set the threshold value 
to 600s. This means the alert description, "This alert is triggered if the 
Metrics Collector cannot be confirmed to be up and listening on the configured 
port for number of seconds equal to threshold.", is NOT VALID for the CRITICAL 
threshold. Is that true, and what is the reason for this discrepancy? Has anyone 
else gotten false pages because of this, and what is the fix?

"ok": {
"text": "TCP OK - {0:.3f}s response on port {1}"
},
"warning": {
"text": "TCP OK - {0:.3f}s response on port {1}",
"value": 1.5
},
"critical": {
"text": "Connection failed: {0} to {1}:{2}",
"value": 5.0
}

Ref:
https://github.com/apache/ambari/blob/2ad42074f1633c5c6f56cf979bdaa49440457566/ambari-server/src/main/resources/common-services/AMBARI_METRICS/0.1.0/alerts.json#L102

Thanks,
Ganesh


