Hi

1.       For failed jobs, you can directly check the MRAppMaster logs; they 
give the reason for the failure.
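
As a sketch, assuming log aggregation is enabled, you can pull and filter the AM logs from the command line. The application id is a placeholder, and the log excerpt below is illustrative, not real cluster output:

```shell
# List recently failed applications to find the id you need:
# yarn application -list -appStates FAILED
# Pull the aggregated container logs (includes the MRAppMaster log):
# yarn logs -applicationId application_1425270000000_0001 > am.log

# Illustrative AM log excerpt, stands in for the real file here:
cat > am.log <<'EOF'
2015-03-02 11:09:01 INFO  [main] MRAppMaster: Created MRAppMaster
2015-03-02 11:09:45 INFO  TaskAttemptImpl: Diagnostics report from attempt_1425270000000_0001_m_000003_0: Container killed on request. Exit code is 143
2015-03-02 11:09:46 ERROR JobImpl: Task failed task_1425270000000_0001_m_000003
EOF

# "Diagnostics" lines usually carry the error reported back to the RM,
# so grepping for them narrows a large AM log down to the failure reason:
grep -E 'Diagnostics|ERROR' am.log
```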

2.       For a stuck job, you need to do some groundwork to identify what is 
going wrong. It can be either a YARN issue or a MapReduce issue.

2.1   In recent times, I have seen jobs get stuck many times when the headroom 
calculation goes wrong. Headroom is sent by the RM to the ApplicationMaster, 
and the AM uses it as a deciding factor ( 
https://issues.apache.org/jira/browse/YARN-1680 ). The corresponding parent 
JIRA is https://issues.apache.org/jira/browse/YARN-1198
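
The headroom the RM reported to the AM shows up in the MRAppMaster log (the exact wording varies across Hadoop versions, so grep loosely; the excerpt below is illustrative, not real output):

```shell
# Illustrative AM log line; in a real case, grep the file produced by
# "yarn logs -applicationId <id>":
cat > am-headroom.log <<'EOF'
2015-03-02 11:10:02 INFO RMContainerAllocator: Recalculating schedule, headroom=<memory:0, vCores:0>
EOF

# A headroom stuck at 0 while the job still has pending tasks is the
# symptom described above (YARN-1680):
grep -i 'headroom' am-headroom.log
```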

2.2   When the job is stuck, check both layers:
YARN – get the cluster memory used, cluster memory reserved, total cluster 
memory, the number of NodeManagers, and the headroom sent to the AM.
MapReduce – check whether any NMs are blacklisted, and whether the reducer 
tasks are using all of the cluster memory. By default, reducers start before 
mapper completion, so if a mapper fails because of an unstable node, the 
reducers can take over the cluster. In that case the reducers are expected to 
be preempted; you need to identify whether they are actually being preempted.
The MRAppMaster log helps to some extent in analyzing the issue.
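
The cluster-wide numbers above (used/reserved/total memory, NodeManager counts) are exposed by the RM's REST metrics endpoint. A minimal sketch, assuming a stock 2.x RM; the host/port in the curl line is a placeholder, and the JSON shown is an illustrative response rather than real output:

```shell
# On a live cluster:
# curl -s http://<rm-host>:8088/ws/v1/cluster/metrics > metrics.json

# Illustrative response, trimmed to the fields relevant here:
cat > metrics.json <<'EOF'
{"clusterMetrics":{"allocatedMB":81920,"reservedMB":4096,"availableMB":16384,"totalMB":102400,"activeNodes":24,"unhealthyNodes":1,"totalNodes":25}}
EOF

# Pull out used/reserved/total memory and the NodeManager counts:
for f in allocatedMB reservedMB availableMB totalMB \
         activeNodes unhealthyNodes totalNodes; do
  printf '%s=%s\n' "$f" "$(sed -n "s/.*\"$f\":\([0-9]*\).*/\1/p" metrics.json)"
done
```

If allocatedMB plus reservedMB is close to totalMB while the job makes no progress, that points at the memory-exhaustion scenario described above (e.g. reducers holding the cluster).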

Thanks & Regards
Rohith Sharma K S

From: Krish Donald [mailto:[email protected]]
Sent: 02 March 2015 11:09
To: [email protected]
Subject: Re: How to troubleshoot failed or stuck jobs

Thanks for the link, Ted.

However, I wanted to understand the approach that should be taken when 
troubleshooting failed or stuck jobs.


On Sun, Mar 1, 2015 at 8:52 PM, Ted Yu 
<[email protected]<mailto:[email protected]>> wrote:
Here are some related discussions and JIRA:

http://search-hadoop.com/m/LgpTk2gxrGx
http://search-hadoop.com/m/LgpTk2YLArE

https://issues.apache.org/jira/browse/MAPREDUCE-6190

Cheers

On Sun, Mar 1, 2015 at 8:41 PM, Krish Donald 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

I wanted to understand: how do you troubleshoot failed or stuck jobs?

Thanks
Krish

