Prashant, please file a JIRA.
One thing to be aware of: AMs are restarted up to a configurable number of times for fault tolerance, which means we can't just assume that the failure of a single AM is equivalent to the failure of the job. Only the ResourceManager is in the appropriate position to judge failure of the AM vs. failure of the job.

hth,
Arun

On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi wrote:

> Thanks Ravi.
>
> Well, in this case it's a no-effort :) Shouldn't a failure of AM init be
> considered a failure of the job? I looked at the code, and best-effort makes
> sense with respect to the retry logic etc. You make a good point that there
> would be no notification if the AM OOMs, but I do feel AM init failure should
> send a notification by other means.
>
> On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash wrote:
>
> Hi Prashant,
>
> I would tend to agree with you. Although job-end notification is only a
> "best-effort" mechanism (i.e. we cannot always guarantee notification, for
> example when the AM OOMs), I agree that we can do more. If you feel
> strongly about this, please create a JIRA and possibly upload a patch.
>
> Thanks
> Ravi
>
> From: Prashant Kommireddi
> Sent: Thursday, June 20, 2013 9:45 PM
> Subject: Job end notification does not always work (Hadoop 2.x)
>
> Hello,
>
> I came across an issue with the job-end notification callbacks in MR2.
> They work fine once the ApplicationMaster has started, but no callback is
> sent if AM initialization fails.
>
> Here is the code from MRAppMaster.java:
>
> .....
> .......
>       // set job classloader if configured
>       MRApps.setJobClassLoader(conf);
>       initAndStartAppMaster(appMaster, conf, jobUserName);
>     } catch (Throwable t) {
>       LOG.fatal("Error starting MRAppMaster", t);
>       System.exit(1);
>     }
>   }
>
>   protected static void initAndStartAppMaster(final MRAppMaster appMaster,
>       final YarnConfiguration conf, String jobUserName) throws IOException,
>       InterruptedException {
>     UserGroupInformation.setConfiguration(conf);
>     UserGroupInformation appMasterUgi = UserGroupInformation
>         .createRemoteUser(jobUserName);
>     appMasterUgi.doAs(new PrivilegedExceptionAction<Object>() {
>       @Override
>       public Object run() throws Exception {
>         appMaster.init(conf);
>         appMaster.start();
>         if(appMaster.errorHappenedShutDown) {
>           throw new IOException("Was asked to shut down.");
>         }
>         return null;
>       }
>     });
>   }
>
> appMaster.init(conf) does not go through JobFinishEventHandler, which is
> responsible for sending the HTTP callback (via shutDownJob()). If an
> exception is thrown at this point, the process simply terminates (via
> System.exit(1)).
>
> appMaster.start(), however, rightly uses JobFinishEventHandler, and things
> work fine.
>
> Shouldn't a failure in init(..) also send a callback indicating that the job
> failed?
>
> Thanks,
> Prashant

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
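To make the tension between Prashant's proposal and Arun's caveat concrete, here is a minimal Java sketch, not Hadoop code: an init failure is reported as a job failure only on the last AM attempt, because earlier attempts are expected to be retried by the ResourceManager. `AppMaster`, `JobEndNotifier`, `runAttempt`, and the `attempt`/`maxAttempts` parameters are all hypothetical stand-ins; in a real cluster the limit would come from the AM max-attempts configuration, and the caller would pass the returned code to `System.exit`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Hadoop's MRAppMaster: report an AM init failure
// as a job failure only when no further AM attempt is expected.
final class InitFailureNotification {

    // Hypothetical stand-in for the AM lifecycle.
    interface AppMaster {
        void init() throws Exception;
        void start() throws Exception;
    }

    // Hypothetical stand-in for the job-end (HTTP callback) notifier.
    interface JobEndNotifier {
        void notifyJobEnd(String jobId, String status);
    }

    // Runs one AM attempt and returns the exit code the caller would pass
    // to System.exit().
    static int runAttempt(AppMaster am, JobEndNotifier notifier,
                          String jobId, int attempt, int maxAttempts) {
        try {
            am.init();
        } catch (Exception e) {
            // Earlier attempts stay silent: another AM should be launched,
            // so this is not (yet) a job failure.
            if (attempt >= maxAttempts) {
                try {
                    notifier.notifyJobEnd(jobId, "FAILED");
                } catch (RuntimeException ignored) {
                    // Best effort: a failed callback must not change the outcome.
                }
            }
            return 1;
        }
        try {
            am.start();
            return 0;
        } catch (Exception e) {
            // Post-init failures are outside the scope of this sketch
            // (per the thread, start() already reaches the job-finish path).
            return 1;
        }
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        JobEndNotifier recorder = (id, status) -> sent.add(id + ":" + status);
        AppMaster failingInit = new AppMaster() {
            @Override public void init() throws Exception { throw new Exception("bad config"); }
            @Override public void start() { }
        };
        // Attempt 1 of 2: no callback, another AM attempt is expected.
        System.out.println(runAttempt(failingInit, recorder, "job_1", 1, 2) + " " + sent); // 1 []
        // Attempt 2 of 2: the job has really failed, so notify.
        System.out.println(runAttempt(failingInit, recorder, "job_1", 2, 2) + " " + sent); // 1 [job_1:FAILED]
    }
}
```

The attempt check is exactly where Arun's point bites: the AM can only infer from its attempt number whether it is the last one, while the ResourceManager remains the authority on whether another attempt will actually be launched.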

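Ravi's "best-effort" point can be sketched the same way. In Hadoop 2.x the callback URL is configured via `mapreduce.job.end-notification.url`, whose `$jobId` and `$jobStatus` placeholders are filled in before the request is made. The sketch below, again hypothetical and JDK-only, expands such a template and tries delivery a bounded number of times, swallowing errors. `Transport` stands in for the real HTTP call so the retry logic can be exercised without a network, and treating only HTTP 200 as success is an assumption of this sketch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a best-effort job-end callback (JDK only, not Hadoop's notifier).
final class BestEffortCallback {

    // Hypothetical stand-in for an HTTP GET; returns the response code.
    interface Transport {
        int get(String url) throws Exception;
    }

    // Fills in the $jobId and $jobStatus placeholders of a URL template.
    static String expand(String template, String jobId, String status) {
        return template.replace("$jobId", jobId).replace("$jobStatus", status);
    }

    // Tries up to maxAttempts times; returns true on the first HTTP 200,
    // false if every attempt failed. Errors are swallowed, never rethrown.
    static boolean send(Transport transport, String url, int maxAttempts, long retryIntervalMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (transport.get(url) == 200) {
                    return true;
                }
            } catch (Exception e) {
                // Best effort: a real notifier would log here; this sketch just retries.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(retryIntervalMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = expand("http://example.com/notify?id=$jobId&status=$jobStatus",
                "job_123_0001", "FAILED");
        List<String> calls = new ArrayList<>();
        // Fake transport: refuses the first two calls, then answers 200.
        Transport flaky = u -> {
            calls.add(u);
            if (calls.size() < 3) {
                throw new IOException("connection refused");
            }
            return 200;
        };
        System.out.println(url); // http://example.com/notify?id=job_123_0001&status=FAILED
        System.out.println(send(flaky, url, 5, 10) + " after " + calls.size() + " calls"); // true after 3 calls
    }
}
```

Note that a bounded retry loop like this only helps if the process is still alive to run it: it cannot cover the OOM case Ravi mentions, and in the init path shown above it would only run if something calls it before System.exit(1).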