Thanks for your input. The Cluster Metrics API reports correct numbers for failed/killed apps, and they match the RM audit logs, so we are planning to use that instead.
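
For reference, a minimal sketch of reading those counters from the ResourceManager REST endpoint /ws/v1/cluster/metrics. The host name, the port (8088 is the usual default for the RM web address) and the function names are illustrative assumptions; adjust them for the actual cluster.

```python
import json
from urllib.request import urlopen

# Hypothetical RM web address; replace with the real ResourceManager host and port.
RM_URL = "http://rm-host.example.com:8088"


def parse_app_counts(payload: str) -> dict:
    """Extract the failed/killed counters from a cluster metrics response body."""
    metrics = json.loads(payload)["clusterMetrics"]
    return {"failed": metrics["appsFailed"], "killed": metrics["appsKilled"]}


def fetch_app_counts(rm_url: str = RM_URL) -> dict:
    """Fetch /ws/v1/cluster/metrics over HTTP and return the failed/killed counters."""
    with urlopen(f"{rm_url}/ws/v1/cluster/metrics", timeout=10) as resp:
        return parse_app_counts(resp.read().decode("utf-8"))
```

Calling fetch_app_counts() periodically and comparing the result with the RMAuditLogger entries gives a quick cross-check that does not depend on what Ganglia displays.
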
Suma

On Wed, Feb 4, 2015 at 12:04 PM, Rohith Sharma K S <[email protected]> wrote:

> There are several ways to confirm from YARN the total number of
> Killed/Failed applications in the cluster:
> 1. Get it from the RM web UI lists, OR
> 2. As admin, use this to get the numbers of failed and killed
>    applications: ./yarn application -list -appStates FAILED,KILLED
> 3. Use the client APIs
>
> Since the metrics values displayed in Ganglia are incorrect, I wonder:
> 1. Is Ganglia pointing to the correct RM cluster? Or
> 2. What method does Ganglia use to retrieve QueueMetrics?
> 3. Have you written any client program that retrieves the apps and
>    calculates it?
>
> Thanks & Regards
> Rohith Sharma K S
>
> -----Original Message-----
> From: Suma Shivaprasad [mailto:[email protected]]
> Sent: 04 February 2015 11:03
> To: [email protected]
> Cc: [email protected]
> Subject: Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons
>
> Using Hadoop 2.4.0. The number of applications running on average is
> small, ~40-60. The metrics in Ganglia show around 10-30 apps killed
> every 5 mins, which is very high relative to the apps running at any
> given time (40-60). The RM audit logs, though, show 0 failed apps
> during that hour. The RM UI also doesn't show any apps in the
> Applications->Failed tab. The logs are getting rolled over at a slower
> rate, every 1-2 hours. I am searching for "Application Finished -
> Failed" to find the failed apps. Please let me know if I am missing
> something here.
>
> Thanks
> Suma
>
> On Wed, Feb 4, 2015 at 10:03 AM, Rohith Sharma K S <[email protected]> wrote:
>
> > Hi
> >
> > Could you give more information: which version of Hadoop are you using?
> >
> > >> QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100.
> > >> However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs.
> >
> > I suspect that the logs might have been rolled over. Are more
> > applications running?
> >
> > The history of all applications is displayed on the RM web UI (provided
> > the RM has not been restarted, or RM recovery is enabled). Maybe you can
> > check these application lists.
> >
> > To find the reasons an application was killed/failed, one way is to
> > check the NodeManager logs as well. There you need to search by the
> > container_id for the corresponding application.
> >
> > Thanks & Regards
> > Rohith Sharma K S
> >
> > *From:* Suma Shivaprasad [mailto:[email protected]]
> > *Sent:* 03 February 2015 21:35
> > *To:* [email protected]; [email protected]
> > *Subject:* QueueMetrics.AppsKilled/Failed metrics and failure reasons
> >
> > Hello,
> >
> > I was trying to debug the reasons for Killed/Failed apps and was
> > checking the RM logs (from RMAuditLogger) for the applications that
> > were killed/failed.
> >
> > The QueueMetrics.AppsKilled/Failed metrics show much higher numbers,
> > i.e. ~100. However, RMAuditLogger shows only 1 or 2 apps as
> > Killed/Failed in the logs. Is it possible that some logs are missed by
> > AuditLogger, or is it the other way round and the metrics are being
> > reported higher?
> >
> > Thanks
> > Suma
