There are several ways to confirm the total number of Killed/Failed applications in the cluster from YARN:

1. Check the application lists in the RM web UI, or
2. As admin, run this to list the failed and killed applications:
   ./yarn application -list -appStates FAILED,KILLED
3. Use the client APIs.
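For option 3, here is a minimal sketch, assuming the RM REST API (/ws/v1/cluster/apps with a "states" filter) is reachable from where you run it. The RM address and the helper names below are placeholders chosen for illustration, not anything taken from this thread, so adjust them to your cluster.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Assumption: address of the ResourceManager web app (8088 is the usual default port).
RM_URL = "http://localhost:8088"


def count_by_state(payload):
    """Count applications per state in a parsed /ws/v1/cluster/apps response."""
    # The RM may return {"apps": null} when nothing matches, so guard against None.
    apps = (payload.get("apps") or {}).get("app") or []
    return Counter(app["state"] for app in apps)


def fetch_counts(rm_url=RM_URL, states=("FAILED", "KILLED")):
    """Ask the RM REST API for apps in the given states and count them per state."""
    url = "{}/ws/v1/cluster/apps?states={}".format(rm_url, ",".join(states))
    with urlopen(url, timeout=10) as resp:
        return count_by_state(json.loads(resp.read().decode("utf-8")))
```

Calling fetch_counts() against a live RM returns a Counter keyed by application state, which can then be compared with the numbers reported in Ganglia. Note that it only sees the applications the RM still remembers.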
Since the metric values displayed in Ganglia are incorrect, I have a few doubts:

1. Is Ganglia pointing to the correct RM cluster?
2. What method does Ganglia use to retrieve QueueMetrics?
3. Have you written any client program that retrieves the apps and calculates the numbers?

Thanks & Regards
Rohith Sharma K S

-----Original Message-----
From: Suma Shivaprasad [mailto:sumasai.shivapra...@gmail.com]
Sent: 04 February 2015 11:03
To: u...@hadoop.apache.org
Cc: yarn-dev@hadoop.apache.org
Subject: Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons

Using hadoop 2.4.0. The number of applications running on average is small, ~40-60. The metrics in Ganglia show around 10-30 apps killed every 5 mins, which is very high with respect to the apps running at any given time (40-60). The RM logs, though, show 0 failed apps in the audit logs during that hour. The RM UI also doesn't show any apps in the Applications->Failed tab. The logs are getting rolled over at a slower rate, every 1-2 hours. I am searching for "Application Finished - Failed" to find the failed apps. Please let me know if I am missing something here.

Thanks
Suma

On Wed, Feb 4, 2015 at 10:03 AM, Rohith Sharma K S <rohithsharm...@huawei.com> wrote:

> Hi
>
> Could you give more information: which version of hadoop are you using?
>
> >> QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100.
> >> However, RMAuditLogger shows 1 or 2 apps as Killed/Failed in the logs.
>
> I suspect the logs might have been rolled over. Are more applications
> running?
>
> The history of all applications will be displayed on the RM web UI (provided
> the RM is not restarted, or RM recovery is enabled). Maybe you can check
> these application lists.
>
> To find the reasons an application was killed/failed, one way is to
> check the NodeManager logs as well. There you need to search using
> the container_id of the corresponding application.
>
> Thanks & Regards
> Rohith Sharma K S
>
> *From:* Suma Shivaprasad [mailto:sumasai.shivapra...@gmail.com]
> *Sent:* 03 February 2015 21:35
> *To:* u...@hadoop.apache.org; yarn-dev@hadoop.apache.org
> *Subject:* QueueMetrics.AppsKilled/Failed metrics and failure reasons
>
> Hello,
>
> I was trying to debug the reasons for Killed/Failed apps and was checking
> the RM logs (from RMAuditLogger) for the applications that were killed/failed.
>
> QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100.
> However, RMAuditLogger shows 1 or 2 apps as Killed/Failed in the logs.
> Is it possible that some logs are missed by AuditLogger, or is it the
> other way round and the metrics are being reported higher?
>
> Thanks
> Suma