[jira] [Commented] (MAPREDUCE-5817) mappers get rescheduled on node transition even after all reducers are completed

2015-08-11 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692298#comment-14692298
 ] 

Chris Douglas commented on MAPREDUCE-5817:
--

Does this work if the reducer fails subsequently? Presumably reexecution will 
be triggered by fetch failures?

 mappers get rescheduled on node transition even after all reducers are 
 completed
 

 Key: MAPREDUCE-5817
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5817
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster
Affects Versions: 2.3.0
Reporter: Sangjin Lee
Assignee: Sangjin Lee
  Labels: BB2015-05-TBR
 Attachments: MAPREDUCE-5817.001.patch, mapreduce-5817.patch


 We're seeing a behavior where a job runs long after all reducers were already 
 finished. We found that the job was rescheduling and running a number of 
 mappers beyond the point of reducer completion. In one situation, the job ran 
 for some 9 more hours after all reducers completed!
 This happens because whenever a node transition (to an unusable state) comes 
 into the app master, it just reschedules all mappers that already ran on the 
 node in all cases.
 Therefore, any node transition has the potential to extend the job's duration. 
 Once this window opens, another node transition can prolong it, and this can 
 happen indefinitely in theory.
 If there is some instability in the pool (unhealthy, etc.) for a duration, 
 then any big job is severely vulnerable to this problem.
 If all reducers have completed, JobImpl.actOnUnusableNode() should not 
 reschedule mapper tasks: the mapper outputs are no longer needed and would 
 not be consumed anyway.
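
 As a rough illustration of the guard being proposed, here is a minimal 
 sketch of what the check at the top of JobImpl.actOnUnusableNode() could 
 look like. The accessor and helper names (getCompletedReduceCount, 
 getTotalReduceTasks, rescheduleCompletedMapsOnNode) are hypothetical 
 placeholders for illustration, not the actual patch:

{code:java}
// Sketch only: everything except actOnUnusableNode() is a hypothetical name.
private void actOnUnusableNode(NodeId nodeId, NodeState nodeState) {
  // Once every reducer has finished, map outputs can no longer be fetched,
  // so rescheduling completed mappers on a lost node is pure waste.
  if (getCompletedReduceCount() == getTotalReduceTasks()) {
    LOG.info("All reducers complete; not rescheduling mappers for node "
        + nodeId);
    return;
  }
  // Existing behavior: mark succeeded map attempts on this node for rerun.
  rescheduleCompletedMapsOnNode(nodeId);
}
{code}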



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6439) AM may fail instead of retrying if RM is restarting/shutting down during the allocate call

2015-08-11 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692484#comment-14692484
 ] 

Anubhav Dhoot commented on MAPREDUCE-6439:
--

Filed a follow-up JIRA to fix the MR code where we throw and catch 
YarnRuntimeException.

 AM may fail instead of retrying if RM is restarting/shutting down during the 
 allocate call 
 ---

 Key: MAPREDUCE-6439
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6439
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
 Attachments: MAPREDUCE-6439.001.patch, MAPREDUCE-6439.002.patch


 We are seeing cases where the MR AM gets a YarnRuntimeException that is 
 thrown in the RM and sent back to the AM, causing it to think that it has 
 exhausted its number of retries. Copying the error that causes the heartbeat 
 thread to quit:
 {noformat}
 2015-07-25 20:07:27,346 ERROR [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error 
 communicating with RM: java.lang.InterruptedException
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
   at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 Caused by: java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
   ... 11 more
 org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
 java.lang.InterruptedException
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
   at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 Caused by: java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
   ... 11 more
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at 

[jira] [Created] (MAPREDUCE-6449) MR Code should not throw and catch YarnRuntimeException to communicate internal exceptions

2015-08-11 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created MAPREDUCE-6449:


 Summary: MR Code should not throw and catch YarnRuntimeException 
to communicate internal exceptions
 Key: MAPREDUCE-6449
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6449
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot


In the discussion of MAPREDUCE-6439 we noted that throwing and catching 
YarnRuntimeException in MR code is incorrect; we should instead use an 
MR-specific exception.
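
As a sketch of that direction, an MR-specific unchecked exception could be as 
simple as the following; the class and package names here are assumptions for 
illustration, not the actual change:

{code:java}
package org.apache.hadoop.mapreduce.v2.app;

// Hypothetical MR-specific runtime exception, so MR internals no longer
// throw and catch YARN's YarnRuntimeException for their own error paths.
public class MRAppRuntimeException extends RuntimeException {
  public MRAppRuntimeException(String message) {
    super(message);
  }

  public MRAppRuntimeException(String message, Throwable cause) {
    super(message, cause);
  }
}
{code}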



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5817) mappers get rescheduled on node transition even after all reducers are completed

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692599#comment-14692599
 ] 

Sangjin Lee commented on MAPREDUCE-5817:


The test failures appear unrelated. The checkstyle warning is about the length 
of the file {{JobImpl.java}}, which is largely a pre-existing issue.

 mappers get rescheduled on node transition even after all reducers are 
 completed
 

 Key: MAPREDUCE-5817
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5817
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster
Affects Versions: 2.3.0
Reporter: Sangjin Lee
Assignee: Sangjin Lee
 Attachments: MAPREDUCE-5817.001.patch, mapreduce-5817.patch


 We're seeing a behavior where a job runs long after all reducers were already 
 finished. We found that the job was rescheduling and running a number of 
 mappers beyond the point of reducer completion. In one situation, the job ran 
 for some 9 more hours after all reducers completed!
 This happens because whenever a node transition (to an unusable state) comes 
 into the app master, it just reschedules all mappers that already ran on the 
 node in all cases.
 Therefore, any node transition has the potential to extend the job's duration. 
 Once this window opens, another node transition can prolong it, and this can 
 happen indefinitely in theory.
 If there is some instability in the pool (unhealthy, etc.) for a duration, 
 then any big job is severely vulnerable to this problem.
 If all reducers have completed, JobImpl.actOnUnusableNode() should not 
 reschedule mapper tasks: the mapper outputs are no longer needed and would 
 not be consumed anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6447) reduce shuffle throws java.lang.OutOfMemoryError: Java heap space

2015-08-11 Thread shuzhangyao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681538#comment-14681538
 ] 

shuzhangyao commented on MAPREDUCE-6447:


https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManagerImpl.java#L254
  
   return (requestedSize < maxSingleShuffleLimit); 
   => return (requestedSize < maxSingleShuffleLimit) && ((usedMemory + requestedSize) < memoryLimit);
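
In other words, the suggestion is to extend the single-shuffle check in 
MergeManagerImpl#canShuffleToMemory with a total-usage check. A simplified 
sketch of the proposed condition (the field names follow trunk, but this is 
the proposal above, not committed code):

{code:java}
// Simplified sketch of the proposed check in MergeManagerImpl.
private boolean canShuffleToMemory(long requestedSize) {
  // Original condition: a single map output must fit under the
  // per-shuffle cap (maxSingleShuffleLimit).
  // Proposed addition: total reserved memory must also stay below the
  // overall in-memory merge budget (memoryLimit).
  return (requestedSize < maxSingleShuffleLimit)
      && ((usedMemory + requestedSize) < memoryLimit);
}
{code}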

 reduce shuffle throws java.lang.OutOfMemoryError: Java heap space
 ---

 Key: MAPREDUCE-6447
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6447
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.7.1
Reporter: shuzhangyao
Assignee: shuzhangyao
Priority: Minor

 2015-08-11 14:03:54,550 WARN [main] org.apache.hadoop.mapred.YarnChild: 
 Exception running child : 
 org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
 shuffle in fetcher#10
   at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
 Caused by: java.lang.OutOfMemoryError: Java heap space
   at 
 org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:56)
   at 
 org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:46)
   at 
 org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.init(InMemoryMapOutput.java:63)
   at 
 org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:303)
   at 
 org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:293)
   at 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:511)
   at 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
   at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6447) reduce shuffle throws java.lang.OutOfMemoryError: Java heap space

2015-08-11 Thread shuzhangyao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shuzhangyao updated MAPREDUCE-6447:
---
Affects Version/s: 2.5.0
   2.6.0
   2.5.1
   2.7.1

 reduce shuffle throws java.lang.OutOfMemoryError: Java heap space
 ---

 Key: MAPREDUCE-6447
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6447
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.7.1
Reporter: shuzhangyao
Assignee: shuzhangyao
Priority: Minor

 2015-08-11 14:03:54,550 WARN [main] org.apache.hadoop.mapred.YarnChild: 
 Exception running child : 
 org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
 shuffle in fetcher#10
   at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
 Caused by: java.lang.OutOfMemoryError: Java heap space
   at 
 org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:56)
   at 
 org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:46)
   at 
 org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.init(InMemoryMapOutput.java:63)
   at 
 org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:303)
   at 
 org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:293)
   at 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:511)
   at 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
   at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAPREDUCE-6447) reduce shuffle throws java.lang.OutOfMemoryError: Java heap space

2015-08-11 Thread shuzhangyao (JIRA)
shuzhangyao created MAPREDUCE-6447:
--

 Summary: reduce shuffle throws java.lang.OutOfMemoryError: Java 
heap space
 Key: MAPREDUCE-6447
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6447
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: shuzhangyao
Assignee: shuzhangyao
Priority: Minor


2015-08-11 14:03:54,550 WARN [main] org.apache.hadoop.mapred.YarnChild: 
Exception running child : 
org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle 
in fetcher#10
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.OutOfMemoryError: Java heap space
at 
org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:56)
at 
org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:46)
at 
org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.init(InMemoryMapOutput.java:63)
at 
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:303)
at 
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:293)
at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:511)
at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAPREDUCE-6448) interrupted waiting to send rpc request to server

2015-08-11 Thread duyanlong (JIRA)
duyanlong created MAPREDUCE-6448:


 Summary: interrupted waiting to send rpc request to server
 Key: MAPREDUCE-6448
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6448
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.5.2
Reporter: duyanlong
Assignee: duyanlong
Priority: Critical


A Python script executing HQL gets stuck, with CPU staying above 100%. Hive 
reports the error below in the background. Can you tell me how to solve this? 
Thank you very much.
2015-08-05 06:30:51,986 WARN  [Thread-854]: ipc.Client (Client.java:call(1389)) 
- interrupted waiting to send rpc request to server
java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
at java.util.concurrent.FutureTask.get(FutureTask.java:187)
at 
org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1030)
at org.apache.hadoop.ipc.Client.call(Client.java:1384)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy33.getTaskReports(Unknown Source)
at 
org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskReports(MRClientProtocolPBClientImpl.java:188)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:320)
at 
org.apache.hadoop.mapred.ClientServiceDelegate.getTaskReports(ClientServiceDelegate.java:444)
at 
org.apache.hadoop.mapred.YARNRunner.getTaskReports(YARNRunner.java:572)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:543)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:541)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapreduce.Job.getTaskReports(Job.java:541)
at org.apache.hadoop.mapred.JobClient.getTaskReports(JobClient.java:639)
at 
org.apache.hadoop.mapred.JobClient.getMapTaskReports(JobClient.java:629)
at 
org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:259)
at 
org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:547)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:426)
at 
org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
2015-08-05 06:30:51,988 WARN  [Thread-854]: mapred.ClientServiceDelegate 
(ClientServiceDelegate.java:invoke(338)) - ClientServiceDelegate invoke call 
interrupted
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at 
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:336)
at 
org.apache.hadoop.mapred.ClientServiceDelegate.getTaskReports(ClientServiceDelegate.java:444)
at 
org.apache.hadoop.mapred.YARNRunner.getTaskReports(YARNRunner.java:572)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:543)
at org.apache.hadoop.mapreduce.Job$3.run(Job.java:541)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapreduce.Job.getTaskReports(Job.java:541)
at org.apache.hadoop.mapred.JobClient.getTaskReports(JobClient.java:639)
at 
org.apache.hadoop.mapred.JobClient.getMapTaskReports(JobClient.java:629)
at 
org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:259)
at 
org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:547)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:426)
at 
org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at 

[jira] [Updated] (MAPREDUCE-5817) mappers get rescheduled on node transition even after all reducers are completed

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated MAPREDUCE-5817:
---
Attachment: MAPREDUCE-5817.001.patch

v.1 patch posted.

This implements option (1).

 mappers get rescheduled on node transition even after all reducers are 
 completed
 

 Key: MAPREDUCE-5817
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5817
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster
Affects Versions: 2.3.0
Reporter: Sangjin Lee
Assignee: Sangjin Lee
  Labels: BB2015-05-TBR
 Attachments: MAPREDUCE-5817.001.patch, mapreduce-5817.patch


 We're seeing a behavior where a job runs long after all reducers were already 
 finished. We found that the job was rescheduling and running a number of 
 mappers beyond the point of reducer completion. In one situation, the job ran 
 for some 9 more hours after all reducers completed!
 This happens because whenever a node transition (to an unusable state) comes 
 into the app master, it just reschedules all mappers that already ran on the 
 node in all cases.
 Therefore, any node transition has the potential to extend the job's duration. 
 Once this window opens, another node transition can prolong it, and this can 
 happen indefinitely in theory.
 If there is some instability in the pool (unhealthy, etc.) for a duration, 
 then any big job is severely vulnerable to this problem.
 If all reducers have completed, JobImpl.actOnUnusableNode() should not 
 reschedule mapper tasks: the mapper outputs are no longer needed and would 
 not be consumed anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6439) AM may fail instead of retrying if RM is restarting/shutting down during the allocate call

2015-08-11 Thread Anubhav Dhoot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692479#comment-14692479
 ] 

Anubhav Dhoot commented on MAPREDUCE-6439:
--

Filed a JIRA for not throwing YarnRuntimeException to the client.

 AM may fail instead of retrying if RM is restarting/shutting down during the 
 allocate call 
 ---

 Key: MAPREDUCE-6439
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6439
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
 Attachments: MAPREDUCE-6439.001.patch, MAPREDUCE-6439.002.patch


 We are seeing cases where the MR AM gets a YarnRuntimeException that is 
 thrown in the RM and sent back to the AM, causing it to think that it has 
 exhausted its number of retries. Copying the error that causes the heartbeat 
 thread to quit:
 {noformat}
 2015-07-25 20:07:27,346 ERROR [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error 
 communicating with RM: java.lang.InterruptedException
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
   at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 Caused by: java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
   ... 11 more
 org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
 java.lang.InterruptedException
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
   at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 Caused by: java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
   ... 11 more
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   

[jira] [Commented] (MAPREDUCE-6447) reduce shuffle throws java.lang.OutOfMemoryError: Java heap space

2015-08-11 Thread XiaopengLi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692856#comment-14692856
 ] 

XiaopengLi commented on MAPREDUCE-6447:
---

Hi shuzhangyao, I have had the same experience, and I had considered your 
method as well. It forcibly limits reservations to within memoryLimit, and it 
may be effective. But I think the original code is reasonable. Take the 
default parameters: maxSingleShuffleLimit = memoryLimit * 0.25, the number of 
fetchers is 5, and memoryLimit = Runtime.getRuntime().maxMemory() * 0.7. Even 
while all fetchers are working, 5 * 0.25 * 0.7 = 0.875 < 1, so in theory the 
Java heap should not run out of memory. Even if we do not add the 
(usedMemory + requestedSize) < memoryLimit check, this OutOfMemory phenomenon 
should not occur in theory. We can talk about this problem: could it be 
caused by the JVM and an unreasonable allocation of memory for some special 
data input?



 reduce shuffle throws java.lang.OutOfMemoryError: Java heap space
 ---

 Key: MAPREDUCE-6447
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6447
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.7.1
Reporter: shuzhangyao
Assignee: shuzhangyao
Priority: Minor

 2015-08-11 14:03:54,550 WARN [main] org.apache.hadoop.mapred.YarnChild: 
 Exception running child : 
 org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
 shuffle in fetcher#10
   at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
 Caused by: java.lang.OutOfMemoryError: Java heap space
   at 
 org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:56)
   at 
 org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:46)
   at 
 org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.init(InMemoryMapOutput.java:63)
   at 
 org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:303)
   at 
 org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:293)
   at 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:511)
   at 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
   at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-5817) mappers get rescheduled on node transition even after all reducers are completed

2015-08-11 Thread Sangjin Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sangjin Lee updated MAPREDUCE-5817:
---
Labels:   (was: BB2015-05-TBR)

 mappers get rescheduled on node transition even after all reducers are 
 completed
 

 Key: MAPREDUCE-5817
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5817
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster
Affects Versions: 2.3.0
Reporter: Sangjin Lee
Assignee: Sangjin Lee
 Attachments: MAPREDUCE-5817.001.patch, mapreduce-5817.patch


 We're seeing a behavior where a job runs long after all reducers were already 
 finished. We found that the job was rescheduling and running a number of 
 mappers beyond the point of reducer completion. In one situation, the job ran 
 for some 9 more hours after all reducers completed!
 This happens because whenever a node transition (to an unusable state) comes 
 into the app master, it just reschedules all mappers that already ran on the 
 node in all cases.
 Therefore, any node transition has the potential to extend the job's duration. 
 Once this window opens, another node transition can prolong it, and this can 
 happen indefinitely in theory.
 If there is some instability in the pool (unhealthy, etc.) for a duration, 
 then any big job is severely vulnerable to this problem.
 If all reducers have completed, JobImpl.actOnUnusableNode() should not 
 reschedule mapper tasks: the mapper outputs are no longer needed and would 
 not be consumed anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5817) mappers get rescheduled on node transition even after all reducers are completed

2015-08-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692507#comment-14692507
 ] 

Hadoop QA commented on MAPREDUCE-5817:
--

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 50s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   8m 11s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  4s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 35s | The applied patch generated  1 
new checkstyle issues (total was 108, now 107). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 26s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 35s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 10s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | mapreduce tests |   9m  6s | Tests failed in 
hadoop-mapreduce-client-app. |
| | |  48m 26s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.mapreduce.v2.app.TestJobEndNotifier |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12749938/MAPREDUCE-5817.001.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 7c796fd |
| checkstyle |  
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5934/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-app.txt
 |
| hadoop-mapreduce-client-app test log | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5934/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5934/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5934/console |


This message was automatically generated.

 mappers get rescheduled on node transition even after all reducers are 
 completed
 

 Key: MAPREDUCE-5817
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5817
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster
Affects Versions: 2.3.0
Reporter: Sangjin Lee
Assignee: Sangjin Lee
 Attachments: MAPREDUCE-5817.001.patch, mapreduce-5817.patch


 We're seeing a behavior where a job runs long after all reducers were already 
 finished. We found that the job was rescheduling and running a number of 
 mappers beyond the point of reducer completion. In one situation, the job ran 
 for some 9 more hours after all reducers completed!
 This happens because whenever a node transition (to an unusable state) comes 
 into the app master, it just reschedules all mappers that already ran on the 
 node in all cases.
 Therefore, any node transition has the potential to extend the job's duration. 
 Once this window opens, another node transition can prolong it, and this can 
 happen indefinitely in theory.
 If there is some instability in the pool (unhealthy, etc.) for a duration, 
 then any big job is severely vulnerable to this problem.
 If all reducers have completed, JobImpl.actOnUnusableNode() should not 
 reschedule mapper tasks: the mapper outputs are no longer needed and would 
 not be consumed anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-5817) mappers get rescheduled on node transition even after all reducers are completed

2015-08-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692455#comment-14692455
 ] 

Sangjin Lee commented on MAPREDUCE-5817:


The current patch skips re-running mappers only if all reducers are complete. 
So I don't think reducers will fail beyond that point? Did I understand your 
question right?

 mappers get rescheduled on node transition even after all reducers are 
 completed
 

 Key: MAPREDUCE-5817
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5817
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster
Affects Versions: 2.3.0
Reporter: Sangjin Lee
Assignee: Sangjin Lee
  Labels: BB2015-05-TBR
 Attachments: MAPREDUCE-5817.001.patch, mapreduce-5817.patch


 We're seeing a behavior where a job runs long after all reducers were already 
 finished. We found that the job was rescheduling and running a number of 
 mappers beyond the point of reducer completion. In one situation, the job ran 
 for some 9 more hours after all reducers completed!
 This happens because whenever a node transition (to an unusable state) comes 
 into the app master, it just reschedules all mappers that already ran on the 
 node in all cases.
 Therefore, any node transition has the potential to extend the job's duration. 
 Once this window opens, another node transition can prolong it, and this can 
 happen indefinitely in theory.
 If there is some instability in the pool (unhealthy, etc.) for a duration, 
 then any big job is severely vulnerable to this problem.
 If all reducers have completed, JobImpl.actOnUnusableNode() should not 
 reschedule mapper tasks: the mapper outputs are no longer needed and would 
 not be consumed anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6447) reduce shuffle throws java.lang.OutOfMemoryError: Java heap space

2015-08-11 Thread shuzhangyao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692802#comment-14692802
 ] 

shuzhangyao commented on MAPREDUCE-6447:


The default mapreduce.reduce.shuffle.input.buffer.percent is 0.9, and the 
default mapreduce.reduce.shuffle.memory.limit.percent is 0.25. If we want to 
avoid the issue, we need to decrease the former, 
e.g. mapreduce.reduce.shuffle.input.buffer.percent = 0.6
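
For example, a minimal sketch of applying this tuning in job setup code, 
assuming these values suit the workload (the property keys are the standard 
Hadoop ones; the values are only the suggestion above):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Lower the shuffle input buffer fraction as suggested above.
Configuration conf = new Configuration();
conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.6f);
conf.setFloat("mapreduce.reduce.shuffle.memory.limit.percent", 0.25f);
Job job = Job.getInstance(conf, "shuffle-tuned-job");
{code}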

 reduce shuffle throws java.lang.OutOfMemoryError: Java heap space
 ---

 Key: MAPREDUCE-6447
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6447
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.7.1
Reporter: shuzhangyao
Assignee: shuzhangyao
Priority: Minor

 2015-08-11 14:03:54,550 WARN [main] org.apache.hadoop.mapred.YarnChild: 
 Exception running child : 
 org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
 shuffle in fetcher#10
   at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
 Caused by: java.lang.OutOfMemoryError: Java heap space
   at 
 org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:56)
   at 
 org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:46)
   at 
 org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.init(InMemoryMapOutput.java:63)
   at 
 org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:303)
   at 
 org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:293)
   at 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:511)
   at 
 org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
   at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6439) AM may fail instead of retrying if RM is restarting/shutting down during the allocate call

2015-08-11 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated MAPREDUCE-6439:
-
Attachment: MAPREDUCE-6439.002.patch

Added unit tests to verify the changes in exception handling.

 AM may fail instead of retrying if RM is restarting/shutting down during the 
 allocate call 
 ---

 Key: MAPREDUCE-6439
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6439
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
 Attachments: MAPREDUCE-6439.001.patch, MAPREDUCE-6439.002.patch


 We are seeing cases where the MR AM gets a YarnRuntimeException that is 
 thrown in the RM and sent back to the AM, causing it to think that it has 
 exhausted its number of retries. Copying the error that causes the heartbeat 
 thread to quit:
 {noformat}
 2015-07-25 20:07:27,346 ERROR [RMCommunicator Allocator] 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error 
 communicating with RM: java.lang.InterruptedException
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
   at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 Caused by: java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
   ... 11 more
 org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
 java.lang.InterruptedException
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
   at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 Caused by: java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at 
 java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at 
 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
   ... 11 more
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at