[jira] [Commented] (MAPREDUCE-6439) AM may fail instead of retrying if RM is restarting/shutting down during the allocate call

2015-08-04 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653302#comment-14653302
 ] 

zhihai xu commented on MAPREDUCE-6439:
--

Thanks for working on this issue, [~adhoot]! Based on the stack trace, it looks 
like this issue happens because the AM (YARN IPC client) can't differentiate 
between a local YarnRuntimeException and a remote YarnRuntimeException.
It looks like we can fix it either on the client side (AM) or on the server 
side (RM).
Your attached patch is a client-side fix, which replaces the local 
YarnRuntimeException with a new exception, RMContainerAllocationException. I 
think the patch will work, but it feels more like a workaround.
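
For illustration, here is a rough sketch of how I read the client-side idea 
(this is not the actual patch; the catch sites and the exception's exact shape 
are my assumptions):
{code}
// Sketch only: if the allocator raises the new checked exception for its
// deliberate, fatal local failures, then any YarnRuntimeException reaching
// the heartbeat thread must have come over the wire and can be retried.
try {
  heartbeat();
} catch (RMContainerAllocationException e) {
  // intentional local failure: give up, as before
  LOG.error("Error communicating with RM: " + e.getMessage(), e);
  return;
} catch (YarnRuntimeException e) {
  // now known to be remote (e.g. RM restarting): keep retrying
  LOG.warn("Heartbeat failed, will retry", e);
}
{code}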

It looks like there are several ways to fix it at the server side (RM).
I can think of the following two server-side fixes.
# Similar to Anubhav's first option, translate YarnRuntimeException to 
YarnException by calling RPCUtil#getRemoteException. Maybe
we can do the translation at ApplicationMasterProtocolPBServiceImpl#allocate:
{code}
  public AllocateResponseProto allocate(RpcController arg0,
      AllocateRequestProto proto) throws ServiceException {
    AllocateRequestPBImpl request = new AllocateRequestPBImpl(proto);
    try {
      AllocateResponse response = real.allocate(request);
      return ((AllocateResponsePBImpl) response).getProto();
    } catch (YarnException e) {
      throw new ServiceException(e);
    } catch (YarnRuntimeException e) {
      throw new ServiceException(RPCUtil.getRemoteException(e));
    } catch (IOException e) {
      throw new ServiceException(e);
    }
  }
{code}
YARN-635 renamed YarnRemoteException to YarnException and YarnException to 
YarnRuntimeException, so this change looks more backward compatible and more 
generalized.

# Don't translate InterruptedException to YarnRuntimeException in 
AsyncDispatcher. Currently the InterruptedException is hidden inside 
AsyncDispatcher, and the stack trace shows that the YarnRuntimeException from 
AsyncDispatcher is sent to the AM. But because lots of code uses 
AsyncDispatcher, this change may not be easy or safe (a sketch of the 
alternative follows the current code below).
{code}
    } catch (InterruptedException e) {
      if (!stopped) {
        LOG.warn("AsyncDispatcher thread interrupted", e);
      }
      // Need to reset drained flag to true if event queue is empty,
      // otherwise dispatcher will hang on stop.
      drained = eventQueue.isEmpty();
      throw new YarnRuntimeException(e);
    }
{code}
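
For concreteness, here is a sketch of what the non-wrapping variant might look 
like (untested and illustrative only; that a caller would no longer see any 
exception for a dropped event is exactly why this may not be safe):
{code}
    } catch (InterruptedException e) {
      if (!stopped) {
        LOG.warn("AsyncDispatcher thread interrupted", e);
      }
      // Need to reset drained flag to true if event queue is empty,
      // otherwise dispatcher will hang on stop.
      drained = eventQueue.isEmpty();
      // Sketch: preserve the interrupt for the IPC handler thread instead of
      // wrapping it in a YarnRuntimeException that leaks into the RPC response.
      Thread.currentThread().interrupt();
      return;
    }
{code}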

[~jlowe] - do you think any of the above suggestions are reasonable?

 AM may fail instead of retrying if RM is restarting/shutting down during the 
 allocate call 
 ---

 Key: MAPREDUCE-6439
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6439
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
 Attachments: MAPREDUCE-6439.001.patch


 We are seeing cases where the MR AM gets a YarnRuntimeException that's thrown 
 in the RM and gets sent back to the AM, causing it to think that it has 
 exhausted the number of retries. Copying the error which causes the heartbeat 
 thread to quit.
 {noformat}
 2015-07-25 20:07:27,346 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: java.lang.InterruptedException
   at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
   at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
   at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
   at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 Caused by: java.lang.InterruptedException
   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
   at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
   at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
   at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
   ... 11 more
 {noformat}

[jira] [Commented] (MAPREDUCE-6240) Hadoop client displays confusing error message

2015-08-04 Thread Ajith S (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653262#comment-14653262
 ] 

Ajith S commented on MAPREDUCE-6240:


Hi [~jira.shegalov]

Any reason why we didn't just throw an IOException with all these exceptions 
added as suppressed exceptions?
{code}
catch (IOException e) {
  if (generalException == null) {
    generalException = new IOException("General exception");
  }
  generalException.addSuppressed(e);
}
{code}

just like Java does it in a try-with-resources block.
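
To make that concrete, here is a minimal, self-contained sketch of the pattern 
(the provider type and names here are made up for illustration; the real loop 
lives in {{Cluster#initialize}}):
{code}
import java.io.IOException;
import java.util.List;

// Illustrative only: try each provider, and if all of them fail, throw one
// IOException that carries every individual failure as a suppressed
// exception, just as try-with-resources does for close() failures.
public class SuppressedInitExample {
  interface Provider {
    void init() throws IOException;
  }

  static void initialize(List<Provider> providers) throws IOException {
    IOException generalException = null;
    for (Provider p : providers) {
      try {
        p.init();
        return; // first provider that initializes wins
      } catch (IOException e) {
        if (generalException == null) {
          generalException = new IOException("Cannot initialize Cluster");
        }
        generalException.addSuppressed(e);
      }
    }
    if (generalException != null) {
      throw generalException; // getSuppressed() now lists every root cause
    }
  }
}
{code}
The stack trace would then show each provider's root cause (e.g. the CNF) as a 
suppressed exception instead of only the generic message.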

 Hadoop client displays confusing error message
 --

 Key: MAPREDUCE-6240
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6240
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: client
Affects Versions: 2.7.0
Reporter: Mohammad Kamrul Islam
Assignee: Gera Shegalov
 Attachments: MAPREDUCE-6240-gera.001.patch, 
 MAPREDUCE-6240-gera.001.patch, MAPREDUCE-6240-gera.002.patch, 
 MAPREDUCE-6240.003.patch, MAPREDUCE-6240.1.patch


 The Hadoop client often throws an exception with java.io.IOException: Cannot 
 initialize Cluster. Please check your configuration for 
 mapreduce.framework.name and the correspond server addresses.
 This is a misleading, generic message for any cluster initialization problem, 
 and it takes a lot of debugging hours to identify the root cause. A correct 
 error message could resolve this problem quickly.
 In one such instance, the Oozie log showed the following exception, while the 
 root cause was a ClassNotFoundException (CNF) that the Hadoop client didn't 
 return in the exception.
 {noformat}
 JA009: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
 at org.apache.oozie.action.ActionExecutor.convertExceptionHelper(ActionExecutor.java:412)
 at org.apache.oozie.action.ActionExecutor.convertException(ActionExecutor.java:392)
 at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:979)
 at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1134)
 at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:228)
 at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
 at org.apache.oozie.command.XCommand.call(XCommand.java:281)
 at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:323)
 at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:252)
 at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:174)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
 at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
 at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
 at org.apache.hadoop.mapred.JobClient.init(JobClient.java:470)
 at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:449)
 at org.apache.oozie.service.HadoopAccessorService$1.run(HadoopAccessorService.java:372)
 at org.apache.oozie.service.HadoopAccessorService$1.run(HadoopAccessorService.java:370)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
 at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:379)
 at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1185)
 at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:927)
  ... 10 more
 {noformat}





[jira] [Commented] (MAPREDUCE-6439) AM may fail instead of retrying if RM is restarting/shutting down during the allocate call

2015-08-04 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653796#comment-14653796
 ] 

Jason Lowe commented on MAPREDUCE-6439:
---

I think a MapReduce AM client-side fix is the safest from a 
backwards-compatibility point of view. However, it would be nice if clients 
could know that only IOExceptions and YarnExceptions, and no escapable runtime 
exceptions, are going to come from remote calls. If that's the case, then the 
fix you propose for the allocate call would need to be replicated for the 
other calls, and arguably for the other PBServiceImpls for the other YARN 
protocols.
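
If the translation does get replicated, one way to avoid hand-copying the 
catch blocks everywhere is a small shared helper; the following is only a 
sketch (the helper and its names are made up, not an existing API):
{code}
import java.io.IOException;

import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.ipc.RPCUtil;

import com.google.protobuf.ServiceException;

// Sketch only: centralize exception translation so every PBServiceImpl
// method gives the same guarantee that only YarnException/IOException
// (never a raw runtime exception) crosses the RPC boundary.
public final class ServiceExceptionTranslator {
  private ServiceExceptionTranslator() {}

  public interface RpcCall<T> {
    T call() throws YarnException, IOException;
  }

  public static <T> T translate(RpcCall<T> body) throws ServiceException {
    try {
      return body.call();
    } catch (YarnException e) {
      throw new ServiceException(e);
    } catch (IOException e) {
      throw new ServiceException(e);
    } catch (RuntimeException e) {
      // covers YarnRuntimeException and friends
      throw new ServiceException(RPCUtil.getRemoteException(e));
    }
  }
}
{code}
Each PBServiceImpl method would then wrap its body in {{translate(...)}} (an 
anonymous {{RpcCall}} on Java 7) instead of repeating the catch blocks.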



[jira] [Created] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server

2015-08-04 Thread Robert Kanter (JIRA)
Robert Kanter created MAPREDUCE-6443:


 Summary: Add JvmPauseMonitor to Job History Server
 Key: MAPREDUCE-6443
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6443
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: jobhistoryserver
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter


We should add the {{JvmPauseMonitor}} from HADOOP-9618 to the Job History 
Server.
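
The wiring for this is typically small. A rough sketch of the shape of the 
change (shown standalone and hedged: the exact {{JvmPauseMonitor}} constructor 
and lifecycle hooks may differ by branch):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.JvmPauseMonitor;

// Sketch only: start/stop a JvmPauseMonitor with the JobHistoryServer's
// service lifecycle, mirroring how other daemons adopted HADOOP-9618.
public class JhsPauseMonitorSketch {
  private JvmPauseMonitor pauseMonitor;

  public void start(Configuration conf) {
    pauseMonitor = new JvmPauseMonitor(conf); // ctor may vary by branch
    pauseMonitor.start();
  }

  public void stop() {
    if (pauseMonitor != null) {
      pauseMonitor.stop();
    }
  }
}
{code}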





[jira] [Updated] (MAPREDUCE-6415) Create a tool to combine aggregated logs into HAR files

2015-08-04 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated MAPREDUCE-6415:
-
Attachment: MAPREDUCE-6415_branch-2_prelim_002.patch
MAPREDUCE-6415_prelim_002.patch

The prelim_002 patch:
- Uses {{YARN_SHELL_ID}} from YARN-3950 instead of parsing {{CONTAINER_ID}}
- Runs 'hadoop archive' and the FileSystem commands from a Java program, so we 
can limit the JVM startup cost
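
For anyone curious, the in-process invocation could look roughly like this (a 
sketch; the archive name and paths are placeholders):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.tools.HadoopArchives;
import org.apache.hadoop.util.ToolRunner;

// Sketch only: drive the 'hadoop archive' Tool in-process via ToolRunner,
// so archiving each application's logs doesn't fork a fresh JVM.
public class HarLogsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int rc = ToolRunner.run(conf, new HadoopArchives(conf), new String[] {
        "-archiveName", "application_1234567890123_0001.har",
        "-p", "/tmp/logs/someuser/logs/application_1234567890123_0001",
        "/tmp/logs/someuser/logs-archived"});
    System.exit(rc);
  }
}
{code}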

 Create a tool to combine aggregated logs into HAR files
 ---

 Key: MAPREDUCE-6415
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6415
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Affects Versions: 2.8.0
Reporter: Robert Kanter
Assignee: Robert Kanter
 Attachments: HAR-ableAggregatedLogs_v1.pdf, 
 MAPREDUCE-6415_branch-2_prelim_001.patch, 
 MAPREDUCE-6415_branch-2_prelim_002.patch, MAPREDUCE-6415_prelim_001.patch, 
 MAPREDUCE-6415_prelim_002.patch


 While we wait for YARN-2942 to become viable, it would still be great to 
 improve the aggregated logs problem. We can write a tool that combines 
 aggregated log files into a single HAR file per application, which should 
 solve the "too many files" and "too many blocks" problems. See the design 
 document for details.
 See YARN-2942 for more context.





[jira] [Commented] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server

2015-08-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654655#comment-14654655
 ] 

Hadoop QA commented on MAPREDUCE-6443:
--

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 1s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 40s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 29s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 23s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 0m 51s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests | 5m 47s | Tests passed in hadoop-mapreduce-client-hs. |
| | |  42m 49s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12748751/MAPREDUCE-6443.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / c95993c |
| hadoop-mapreduce-client-hs test log | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5928/artifact/patchprocess/testrun_hadoop-mapreduce-client-hs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5928/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5928/console |


This message was automatically generated.



[jira] [Updated] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server

2015-08-04 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated MAPREDUCE-6443:
-
Attachment: MAPREDUCE-6443.001.patch



[jira] [Updated] (MAPREDUCE-6443) Add JvmPauseMonitor to Job History Server

2015-08-04 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated MAPREDUCE-6443:
-
Status: Patch Available  (was: Open)
