[jira] [Commented] (MAPREDUCE-6439) AM may fail instead of retrying if RM is restarting/shutting down during the allocate call
[ https://issues.apache.org/jira/browse/MAPREDUCE-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653302#comment-14653302 ]

zhihai xu commented on MAPREDUCE-6439:

Thanks for working on this issue [~adhoot]!
Based on the stack trace, it looks like this issue occurs because the AM (YARN IPC client) can't differentiate between a local YarnRuntimeException and a remote YarnRuntimeException.
We can fix it either on the client side (AM) or on the server side (RM).
Your attached patch is a client-side fix, which replaces the local YarnRuntimeException with a new exception, RMContainerAllocationException. I think the patch will work, but it looks more like a workaround.
There are several ways to fix it on the server side (RM). I can think of the following two server-side fixes.
# Similar to Anubhav's first option: translate YarnRuntimeException to YarnException by calling RPCUtil#getRemoteException. Maybe we can do the translation in ApplicationMasterProtocolPBServiceImpl#allocate:
{code}
  public AllocateResponseProto allocate(RpcController arg0,
      AllocateRequestProto proto) throws ServiceException {
    AllocateRequestPBImpl request = new AllocateRequestPBImpl(proto);
    try {
      AllocateResponse response = real.allocate(request);
      return ((AllocateResponsePBImpl)response).getProto();
    } catch (YarnException e) {
      throw new ServiceException(e);
    } catch (YarnRuntimeException e) {
      throw new ServiceException(RPCUtil.getRemoteException(e));
    } catch (IOException e) {
      throw new ServiceException(e);
    }
  }
{code}
YARN-635 renamed YarnRemoteException to YarnException and YarnException to YarnRuntimeException, so this change looks more backward compatible and more general.
# Don't translate InterruptedException to YarnRuntimeException in AsyncDispatcher. Currently the InterruptedException is hidden in AsyncDispatcher. The stack trace shows that the YarnRuntimeException from AsyncDispatcher is sent to the AM.
Because a lot of code uses AsyncDispatcher, this change may not be easy or safe.
{code}
      } catch (InterruptedException e) {
        if (!stopped) {
          LOG.warn("AsyncDispatcher thread interrupted", e);
        }
        // Need to reset drained flag to true if event queue is empty,
        // otherwise dispatcher will hang on stop.
        drained = eventQueue.isEmpty();
        throw new YarnRuntimeException(e);
      }
{code}
[~jlowe] - do you think any of my earlier suggestions are reasonable?

AM may fail instead of retrying if RM is restarting/shutting down during the allocate call

Key: MAPREDUCE-6439
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6439
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Attachments: MAPREDUCE-6439.001.patch

We are seeing cases where the MR AM gets a YarnRuntimeException that is thrown in the RM and sent back to the AM, causing the AM to think that it has exhausted the number of retries. Copying the error which causes the heartbeat thread to quit.
{noformat}
2015-07-25 20:07:27,346 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: java.lang.InterruptedException
    at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
    at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
    at
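
The first server-side option discussed in the comment above, translating an unchecked exception into a checked one at the RPC boundary so the caller can tell a remote failure from a local one, can be illustrated without any Hadoop classes. The following is a minimal sketch under stated assumptions: `RemoteCheckedException` stands in for the role YarnException plays in the thread, `ServiceRuntimeException` for the role of YarnRuntimeException, and `AllocateService`, `translating`, and `allocateWithRetry` are hypothetical names invented for illustration. It is not the attached patch and does not reproduce Hadoop's actual retry logic.

```java
import java.io.IOException;

final class ExceptionTranslationSketch {

    /** Hypothetical checked exception; plays the role of YarnException. */
    static class RemoteCheckedException extends Exception {
        RemoteCheckedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical unchecked exception; plays the role of YarnRuntimeException. */
    static class ServiceRuntimeException extends RuntimeException {
        ServiceRuntimeException(Throwable cause) {
            super(cause);
        }
    }

    /** Hypothetical service interface standing in for the allocate call. */
    interface AllocateService {
        String allocate(String request) throws RemoteCheckedException, IOException;
    }

    /** Wraps a service so that unchecked failures cross the boundary as checked ones. */
    static AllocateService translating(AllocateService real) {
        return request -> {
            try {
                return real.allocate(request);
            } catch (ServiceRuntimeException e) {
                throw new RemoteCheckedException("server-side failure: " + e.getCause(), e);
            }
        };
    }

    /** Caller side: a checked remote failure is retried instead of ending the loop. */
    static String allocateWithRetry(AllocateService service, String request, int maxAttempts)
            throws IOException {
        RemoteCheckedException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return service.allocate(request);
            } catch (RemoteCheckedException e) {
                last = e; // remote problem, e.g. the server is restarting: try again
            }
        }
        throw new IOException("gave up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) throws IOException {
        final int[] calls = {0};
        // Fails twice with an unchecked exception, then succeeds.
        AllocateService flaky = request -> {
            if (calls[0]++ < 2) {
                throw new ServiceRuntimeException(new InterruptedException("server shutting down"));
            }
            return "allocated:" + request;
        };
        System.out.println(allocateWithRetry(translating(flaky), "req-1", 5));
    }
}
```

Without the `translating` wrapper, the unchecked exception would propagate straight through `allocateWithRetry` and end the loop on the first failure, which is loosely analogous to the behaviour reported in this issue.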
[jira] [Commented] (MAPREDUCE-6240) Hadoop client displays confusing error message
[ https://issues.apache.org/jira/browse/MAPREDUCE-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653262#comment-14653262 ]

Ajith S commented on MAPREDUCE-6240:

Hi [~jira.shegalov]
Is there any reason why we didn't just throw an IOException with all these exceptions added as suppressed exceptions?
{code}
catch (IOException e) {
  if (generalException == null) {
    generalException = new IOException("General exception");
  }
  generalException.addSuppressed(e);
}
{code}
This is just like how Java does it in a try-with-resources block.

Hadoop client displays confusing error message

Key: MAPREDUCE-6240
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6240
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: client
Affects Versions: 2.7.0
Reporter: Mohammad Kamrul Islam
Assignee: Gera Shegalov
Attachments: MAPREDUCE-6240-gera.001.patch, MAPREDUCE-6240-gera.001.patch, MAPREDUCE-6240-gera.002.patch, MAPREDUCE-6240.003.patch, MAPREDUCE-6240.1.patch

The Hadoop client often throws an exception with "java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses." This is a misleading and generic message for any cluster initialization problem. It takes many hours of debugging to identify the root cause. A correct error message could resolve this problem quickly.
In one such instance, the Oozie log showed the following exception, while the root cause was a CNF that the Hadoop client didn't return in the exception.
{noformat}
JA009: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.oozie.action.ActionExecutor.convertExceptionHelper(ActionExecutor.java:412)
    at org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:392)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:979)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1134)
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:228)
    at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
    at org.apache.oozie.command.XCommand.call(XCommand.java:281)
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:323)
    at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:252)
    at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:174)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
    at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
    at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:82)
    at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:75)
    at org.apache.hadoop.mapred.JobClient.init(JobClient.java:470)
    at org.apache.hadoop.mapred.JobClient.init(JobClient.java:449)
    at org.apache.oozie.service.HadoopAccessorService$1.run(HadoopAccessorService.java:372)
    at org.apache.oozie.service.HadoopAccessorService$1.run(HadoopAccessorService.java:370)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:379)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1185)
    at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:927)
    ... 10 more
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
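
Ajith S's suggestion in the comment above, collecting each failure as a suppressed exception on a single IOException, could look roughly like the following. This is a minimal sketch under stated assumptions: the `Provider` interface, the `initialize` method, and the messages are hypothetical and are not taken from the Hadoop `Cluster` code or from any attached patch. Only `Throwable#addSuppressed` and `Throwable#getSuppressed` are real JDK APIs (available since Java 7).

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class SuppressedExceptionsSketch {

    /** Hypothetical stand-in for one way of creating a cluster client. */
    interface Provider {
        String create() throws IOException;
    }

    /**
     * Tries each provider in order and returns the first success.
     * If all fail, throws one IOException that carries every individual
     * failure as a suppressed exception, so no root cause is lost.
     */
    static String initialize(List<Provider> providers) throws IOException {
        IOException generalException = null;
        for (Provider provider : providers) {
            try {
                return provider.create();
            } catch (IOException e) {
                if (generalException == null) {
                    generalException = new IOException("Cannot initialize: all providers failed");
                }
                generalException.addSuppressed(e);
            }
        }
        if (generalException == null) {
            generalException = new IOException("Cannot initialize: no providers configured");
        }
        throw generalException;
    }

    public static void main(String[] args) {
        Provider local = () -> { throw new IOException("local provider: class not found"); };
        Provider yarn = () -> { throw new IOException("yarn provider: connection refused"); };
        try {
            initialize(Arrays.asList(local, yarn));
        } catch (IOException e) {
            System.out.println(e.getMessage());
            for (Throwable suppressed : e.getSuppressed()) {
                System.out.println("  suppressed: " + suppressed.getMessage());
            }
        }
    }
}
```

A caller that logs the full stack trace of the thrown exception would then see each provider's failure listed under "Suppressed:", rather than only the generic top-level message.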
[jira] [Commented] (MAPREDUCE-6439) AM may fail instead of retrying if RM is restarting/shutting down during the allocate call
[ https://issues.apache.org/jira/browse/MAPREDUCE-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653796#comment-14653796 ]

Jason Lowe commented on MAPREDUCE-6439:

I think a MapReduce AM client-side fix is the safest from a backwards-compatibility point of view. However, it would be nice if clients could know that only IOExceptions and YarnExceptions, and no escaping runtime exceptions, are going to come from remote calls. If that's the case, then the fix you propose for the allocate call would need to be replicated for the other calls and, arguably, for the PBServiceImpls of other YARN protocols.

AM may fail instead of retrying if RM is restarting/shutting down during the allocate call

Key: MAPREDUCE-6439
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6439
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
Attachments: MAPREDUCE-6439.001.patch

We are seeing cases where the MR AM gets a YarnRuntimeException that is thrown in the RM and sent back to the AM, causing the AM to think that it has exhausted the number of retries. Copying the error which causes the heartbeat thread to quit.
{noformat}
2015-07-25 20:07:27,346 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: java.lang.InterruptedException
    at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
    at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
    at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
    ... 11 more
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException
    at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
    at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
    at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
    ... 11 more
    at
[jira] [Created] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server
Robert Kanter created MAPREDUCE-6443:

Summary: Add JvmPauseMonitor to Job History Server
Key: MAPREDUCE-6443
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6443
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: jobhistoryserver
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter

We should add the {{JvmPauseMonitor}} from HADOOP-9618 to the Job History Server.
[jira] [Updated] (MAPREDUCE-6415) Create a tool to combine aggregated logs into HAR files
[ https://issues.apache.org/jira/browse/MAPREDUCE-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated MAPREDUCE-6415:

Attachment: MAPREDUCE-6415_branch-2_prelim_002.patch
            MAPREDUCE-6415_prelim_002.patch

The prelim_002 patch:
- Uses {{YARN_SHELL_ID}} from YARN-3950 instead of parsing {{CONTAINER_ID}}
- Runs 'hadoop archive' and the FileSystem commands from a Java program, so we can limit the JVM startup cost

Create a tool to combine aggregated logs into HAR files

Key: MAPREDUCE-6415
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6415
Project: Hadoop Map/Reduce
Issue Type: New Feature
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter
Attachments: HAR-ableAggregatedLogs_v1.pdf, MAPREDUCE-6415_branch-2_prelim_001.patch, MAPREDUCE-6415_branch-2_prelim_002.patch, MAPREDUCE-6415_prelim_001.patch, MAPREDUCE-6415_prelim_002.patch

While we wait for YARN-2942 to become viable, it would still be great to improve the aggregated logs problem. We can write a tool that combines aggregated log files into a single HAR file per application, which should solve the "too many files" and "too many blocks" problems. See the design document for details.
See YARN-2942 for more context.
[jira] [Commented] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server
[ https://issues.apache.org/jira/browse/MAPREDUCE-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654655#comment-14654655 ]

Hadoop QA commented on MAPREDUCE-6443:

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 1s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 40s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 29s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 23s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 0m 51s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests | 5m 47s | Tests passed in hadoop-mapreduce-client-hs. |
| | | 42m 49s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12748751/MAPREDUCE-6443.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c95993c |
| hadoop-mapreduce-client-hs test log | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5928/artifact/patchprocess/testrun_hadoop-mapreduce-client-hs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5928/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5928/console |

This message was automatically generated.

Add JvmPauseMonitor to Job History Server

Key: MAPREDUCE-6443
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6443
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: jobhistoryserver
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter
Attachments: MAPREDUCE-6443.001.patch

We should add the {{JvmPauseMonitor}} from HADOOP-9618 to the Job History Server.
[jira] [Updated] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server
[ https://issues.apache.org/jira/browse/MAPREDUCE-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated MAPREDUCE-6443:

Attachment: MAPREDUCE-6443.001.patch

Add JvmPauseMonitor to Job History Server

Key: MAPREDUCE-6443
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6443
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: jobhistoryserver
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter
Attachments: MAPREDUCE-6443.001.patch

We should add the {{JvmPauseMonitor}} from HADOOP-9618 to the Job History Server.
[jira] [Updated] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server
[ https://issues.apache.org/jira/browse/MAPREDUCE-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated MAPREDUCE-6443:

Status: Patch Available (was: Open)

Add JvmPauseMonitor to Job History Server

Key: MAPREDUCE-6443
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6443
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: jobhistoryserver
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter
Attachments: MAPREDUCE-6443.001.patch

We should add the {{JvmPauseMonitor}} from HADOOP-9618 to the Job History Server.