RE: QueueMetrics.AppsKilled/Failed metrics and failure reasons

Rohith Sharma K S Tue, 03 Feb 2015 22:42:05 -0800

There are several ways to confirm from YARN the total number of Killed/Failed 
applications in the cluster:
1. Check the application lists in the RM web UI, OR
2. As admin, use this command to list the failed and killed applications: 
./yarn application -list -appStates FAILED,KILLED
3. Use the client APIs.
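
As a further cross-check alongside the options above, the same counts can be 
taken from the RM REST API (/ws/v1/cluster/apps with the states filter). The 
following is only a minimal sketch, not an official client: the address 
rm-host:8088 is a placeholder, and the helper names count_by_state and 
fetch_failed_and_killed are made up for illustration.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Placeholder address (assumption); replace with the real RM web address.
RM_URL = "http://rm-host:8088"


def count_by_state(payload):
    """Count applications per state in a parsed /ws/v1/cluster/apps response."""
    # The RM returns {"apps": null} when no application matches the filter.
    apps = (payload.get("apps") or {}).get("app") or []
    return Counter(app["state"] for app in apps)


def fetch_failed_and_killed(rm_url=RM_URL):
    """Query the RM REST API for applications in FAILED or KILLED state."""
    with urlopen(rm_url + "/ws/v1/cluster/apps?states=FAILED,KILLED") as resp:
        return json.load(resp)


# Offline example with a hand-written payload (application ids are made up).
sample = {"apps": {"app": [
    {"id": "application_0000000000000_0001", "state": "FAILED"},
    {"id": "application_0000000000000_0002", "state": "KILLED"},
    {"id": "application_0000000000000_0003", "state": "KILLED"},
]}}
counts = count_by_state(sample)
print(counts["FAILED"], counts["KILLED"])  # prints: 1 2
```

Against a live cluster, count_by_state(fetch_failed_and_killed()) would give 
the numbers to compare with what Ganglia reports.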

Since the metric values displayed in Ganglia are incorrect, I have a few doubts:
1. Is Ganglia pointing to the correct RM cluster? Or 
2. What method does Ganglia use to retrieve QueueMetrics? 
3. Have you written any client program that retrieves the apps and calculates 
the numbers?

Thanks & Regards
Rohith Sharma K S

-----Original Message-----
From: Suma Shivaprasad [mailto:sumasai.shivapra...@gmail.com] 
Sent: 04 February 2015 11:03
To: u...@hadoop.apache.org
Cc: yarn-dev@hadoop.apache.org
Subject: Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons

Using hadoop 2.4.0. The number of applications running on average is small, 
~40-60. The metrics in Ganglia show around 10-30 apps killed every 5 mins, which 
is very high relative to the apps running at any given time (40-60). The RM 
audit logs, though, show 0 failed apps during that hour.
The RM UI also doesn't show any apps in the Applications->Failed tab. The logs 
are getting rolled over at a slower rate, every 1-2 hours. I am searching for 
"Application Finished - Failed" to find the failed apps. Please let me know if 
I am missing something here.
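
The log search described above can be sketched as a small script. This is only 
an illustration under assumptions: it takes for granted that audit entries 
contain the text "Application Finished - <Outcome>", where "Failed" comes from 
the search string above and other outcomes such as "Killed" are assumed to 
follow the same pattern. The sample lines are simplified stand-ins, not real 
RM log output, and the function names are made up.

```python
import re
from collections import Counter

# Captures the outcome word after "Application Finished - " (assumed pattern).
FINISHED = re.compile(r"Application Finished - (\w+)")


def count_finished(lines):
    """Count log lines per final outcome (e.g. Failed, Killed)."""
    counts = Counter()
    for line in lines:
        match = FINISHED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_in_files(paths):
    """Sum the counts over several (possibly rolled-over) log files."""
    total = Counter()
    for path in paths:
        with open(path, errors="replace") as handle:
            total += count_finished(handle)
    return total


# Offline example with simplified stand-in lines (not real log output).
sample_lines = [
    "... Application Finished - Failed ...",
    "... Application Finished - Killed ...",
    "... some unrelated entry ...",
]
result = count_finished(sample_lines)
print(result["Failed"], result["Killed"])
```

Running count_in_files over every rolled-over log file for the hour in 
question, rather than only the current one, avoids undercounting when the 
logs roll.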

Thanks
Suma

On Wed, Feb 4, 2015 at 10:03 AM, Rohith Sharma K S < rohithsharm...@huawei.com> 
wrote:

>  Hi
>
>
>
> Could you give more information, which version of hadoop are you using?
>
>
>
> >> QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100.
> However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs.
>
> I suspect the logs might have been rolled over. Are many applications 
> running?
>
>
>
> The history of all applications is displayed on the RM web UI (provided 
> the RM has not been restarted, or RM recovery is enabled). Maybe you can 
> check these application lists.
>
>
>
> To find the reasons an application was killed/failed, one way is to 
> check the NodeManager logs as well. There you need to search by the 
> container_id of the corresponding application.
>
>
>
> Thanks & Regards
>
> Rohith Sharma K S
>
>
>
> *From:* Suma Shivaprasad [mailto:sumasai.shivapra...@gmail.com]
> *Sent:* 03 February 2015 21:35
> *To:* u...@hadoop.apache.org; yarn-dev@hadoop.apache.org
> *Subject:* QueueMetrics.AppsKilled/Failed metrics and failure reasons
>
>
>
> Hello,
>
>
> Was trying to debug reasons for Killed/Failed apps and was checking 
> for the applications that were killed/failed in RM logs - from RMAuditLogger.
>
>  QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100.
> However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs. 
> Is it possible that some logs are missed by AuditLogger or is it the 
> other way round and metrics are being reported higher ?
>
> Thanks
>
> Suma
>
