[jira] [Commented] (MAPREDUCE-5831) Old MR client is not compatible with new MR application

2014-04-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977689#comment-13977689
 ] 

Wangda Tan commented on MAPREDUCE-5831:
---

Link this issue with MAPREDUCE-4150

 Old MR client is not compatible with new MR application
 ---

 Key: MAPREDUCE-5831
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5831
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: client, mr-am
Affects Versions: 2.2.0, 2.3.0
Reporter: Zhijie Shen
Priority: Critical

 Recently, we saw the following scenario:
 1. The user set up a Hadoop 2.3 cluster, which contains YARN 2.3 and MR 2.3.
 2. The user ran the client on a machine where MR 2.2 was installed and on the 
 classpath.
 Then, when the user submitted a simple wordcount job, he saw the following 
 message:
 {code}
 16:00:41,027  INFO main mapreduce.Job:1345 -  map 100% reduce 100%
 16:00:41,036  INFO main mapreduce.Job:1356 - Job job_1396468045458_0006 
 completed successfully
 16:02:20,535  WARN main mapreduce.JobRunner:212 - Cannot start job 
 [wordcountJob]
 java.lang.IllegalArgumentException: No enum constant 
 org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_REDUCES
   at java.lang.Enum.valueOf(Enum.java:236)
   at 
 org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148)
   at 
 org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182)
   at 
 org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154)
   at 
 org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240)
   at 
 org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370)
   at 
 org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511)
   at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756)
   at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
   at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753)
   at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361)
   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
 . . .
 {code}
 The problem is that the wordcount job was running on one or more nodes of the 
 YARN cluster, where the MR 2.3 libs were installed and 
 JobCounter.MB_MILLIS_REDUCES is available in the counters. On the other hand, 
 due to the classpath setting, the client was likely running with the MR 2.2 
 libs. After the client retrieved the counters from the MR AM, it tried to 
 construct the Counter object with the received counter name. Unfortunately, 
 that enum constant didn't exist in the client's classpath; therefore, the "No 
 enum constant" exception is thrown here.
 JobCounter.MB_MILLIS_REDUCES was brought into MR2 via MAPREDUCE-5464, as of 
 Hadoop 2.3.
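
 A minimal, self-contained sketch of this failure mode (using a hypothetical 
 OldJobCounter enum rather than the real JobCounter):
 {code}
 // Hypothetical two-constant enum standing in for the 2.2 client's JobCounter.
 enum OldJobCounter { MILLIS_MAPS, MILLIS_REDUCES }

 public class EnumMismatchDemo {
   public static void main(String[] args) {
     // Counter name sent by a newer AM; this constant was added in 2.3.
     String fromNewerAm = "MB_MILLIS_REDUCES";
     try {
       OldJobCounter c = Enum.valueOf(OldJobCounter.class, fromNewerAm);
       System.out.println("resolved: " + c);
     } catch (IllegalArgumentException e) {
       // The same "No enum constant" failure as in the stack trace above.
       System.out.println("unknown counter on this client: " + fromNewerAm);
     }
   }
 }
 {code}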



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5844) Reducer Preemption is too aggressive

2014-05-23 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007184#comment-14007184
 ] 

Wangda Tan commented on MAPREDUCE-5844:
---

Hi [~maysamyabandeh], 
Thanks for your patch. I think the headroom calculation currently needs to be 
much improved in the fair and capacity schedulers, so it's better to make your 
method the default behavior (with the time threshold changed to 0 or another 
reasonable number, in my opinion).
A suggestion: can we simply record the time at which we got the last mapper 
container, and use that time to check whether we have run into the 
hard-to-allocate-mapper situation? That would avoid modifying the 
ContainerRequest code.
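
A minimal sketch of that suggestion (field and method names are hypothetical, 
not from any patch):
{code}
// Track when the most recent mapper container arrived; a long gap since then
// suggests mappers are hard to allocate, without touching ContainerRequest.
private volatile long lastMapperAllocatedMs = System.currentTimeMillis();

void onMapperContainerAllocated() {
  lastMapperAllocatedMs = System.currentTimeMillis();
}

boolean mapperAllocationStalled(long thresholdMs) {
  return System.currentTimeMillis() - lastMapperAllocatedMs > thresholdMs;
}
{code}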

 Reducer Preemption is too aggressive
 

 Key: MAPREDUCE-5844
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5844
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh
 Attachments: MAPREDUCE-5844.patch


 We observed cases where the reducer preemption makes the job finish much 
 later, and the preemption does not seem to be necessary since after 
 preemption both the preempted reducer and the mapper are assigned 
 immediately--meaning that there was already enough space for the mapper.
 The logic for triggering preemption is at 
 RMContainerAllocator::preemptReducesIfNeeded
 The preemption is triggered if the following is true:
 {code}
 headroom + am * |m| + pr * |r| < mapResourceRequest
 {code} 
 where am is the number of assigned mappers, |m| is the mapper size, pr is the 
 number of reducers being preempted, and |r| is the reducer size.
 The original idea apparently was that if headroom is not big enough for the 
 new mapper requests, reducers should be preempted. This would work if the job 
 is alone in the cluster. Once we have queues, the headroom calculation 
 becomes more complicated and it would require a separate headroom calculation 
 per queue/job.
 So, as a result, the headroom variable has effectively been given up on: 
 *headroom is always set to 0*. What this implies for preemption is that it 
 becomes very aggressive, not considering whether there is actually enough 
 space for the mappers or not.
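
 As a simplified sketch of that trigger (a hypothetical helper, not the actual 
 RMContainerAllocator code): with headroom pinned at 0, the left-hand side 
 underestimates the free space, so the check fires too eagerly.
 {code}
 boolean shouldPreemptReducers(long headroom, int assignedMappers,
     long mapperSize, int preemptingReducers, long reducerSize,
     long mapResourceRequest) {
   // headroom is reported as 0 in multi-queue setups, per the description.
   return headroom + (long) assignedMappers * mapperSize
       + (long) preemptingReducers * reducerSize < mapResourceRequest;
 }
 {code}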



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5196) CheckpointAMPreemptionPolicy implements preemption in MR AM via checkpointing

2014-06-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016357#comment-14016357
 ] 

Wangda Tan commented on MAPREDUCE-5196:
---


Hi [~curino], 
While trying to understand this part of the change, I have several questions; 
I hope you could share some ideas about them.

My understanding of the workflow (please forgive my current ignorance of this 
code):
1) CheckpointAMPreemptionPolicy keeps track of which containers 
(task-attempts) need to be preempted
2) TaskAttemptListener sets AMFeedback.preempted to true when 
AMPreemptionPolicy.isPreempted(TaskID) returns true
3) The Task gets the AMFeedback and sets mustPreempt. The Task takes some 
action such as checkpointing, etc., and calls umbilical.preempted(taskId, 
taskStatus)
One question here: I found that the Task does nothing except print some logs 
after mustPreempt is set. Is that the expected behavior?
If it is, the task will keep running until it completes or is killed by the 
NM, so resources cannot be proactively released by the task. I think we should 
call umbilical.preempted when we find mustPreempt is true, correct?

Another question: I found the following in Task.java:
{code}
public void done(TaskUmbilicalProtocol umbilical,
                 TaskReporter reporter
                 ) throws IOException, InterruptedException {
  updateCounters();
  if (taskStatus.getRunState() == TaskStatus.State.PREEMPTED) {
    // If we are preempted, do no output promotion; signal done and exit
    committer.commitTask(taskContext);
    umbilical.preempted(taskId, taskStatus);
    taskDone.set(true);
    reporter.stopCommunicationThread();
    return;
  }
  ...
}
{code}
It relies on taskStatus.getRunState() == PREEMPTED, but I found that nobody 
sets taskStatus.runState to PREEMPTED. Could you please point me to the part 
of the code that sets it? If nobody sets runState to PREEMPTED, this path 
cannot invoke umbilical.preempted properly.

Thanks,
Wangda
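
For context, a hedged sketch of where such a state change could happen (the 
method names here are assumptions, not code from the patch): the 
umbilical.preempted branch in done() can only fire if the status-update path 
first marks the attempt PREEMPTED when the AM signals preemption.
{code}
// Hypothetical sketch, not from MAPREDUCE-5196: on a heartbeat response that
// carries the preemption flag, flip the run state so that done() takes the
// preempted branch shown above.
if (amFeedback.getPreemption()) {
  taskStatus.setRunState(TaskStatus.State.PREEMPTED);
  // ... checkpoint as needed; done() will then call
  // umbilical.preempted(taskId, taskStatus).
}
{code}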


 CheckpointAMPreemptionPolicy implements preemption in MR AM via checkpointing 
 --

 Key: MAPREDUCE-5196
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5196
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: mr-am, mrv2
Reporter: Carlo Curino
Assignee: Carlo Curino
 Fix For: 3.0.0

 Attachments: MAPREDUCE-5196.1.patch, MAPREDUCE-5196.2.patch, 
 MAPREDUCE-5196.3.patch, MAPREDUCE-5196.patch, MAPREDUCE-5196.patch


 This JIRA tracks a checkpoint-based AM preemption policy. The policy handles 
 propagation of the preemption requests received from the RM to the 
 appropriate tasks, and bookkeeping of checkpoints. Actual checkpointing of the 
 task state is handled in upcoming JIRAs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5900) Container preemption interpreted as task failures and eventually job failures

2014-06-03 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016365#comment-14016365
 ] 

Wangda Tan commented on MAPREDUCE-5900:
---

Hi [~mayank_bansal],
Thanks for your patch; it almost looks good to me, since TaskAttemptImpl will 
not treat kill/preempt separately, and neither of them will increase the 
task's retry count.
One comment: the current TaskAttemptImpl already has a PreemptTransition, 
which may confuse people. In my understanding, this is more like the proactive 
preemption introduced by MAPREDUCE-5196 (see my comment 
https://issues.apache.org/jira/browse/MAPREDUCE-5196?focusedCommentId=14016357&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14016357,
 which is awaiting Carlo's reply). My suggestion is to either rename this 
transition to ProactivePreemptionTransition and TA_PREEMPTED to 
TA_PROACTIVE_PREEMPTED, or simply remove this transition and merge its logic 
into KillTransition. Does that make sense to you?

Thanks,
Wangda

 Container preemption interpreted as task failures and eventually job failures 
 --

 Key: MAPREDUCE-5900
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5900
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mr-am, mrv2
Affects Versions: 2.4.1
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: MAPREDUCE-5900-1.patch, 
 MAPREDUCE-5900-branch-241-2.patch, MAPREDUCE-5900-trunk-1.patch, 
 MAPREDUCE-5900-trunk-2.patch


 The preemption exit code we have added needs to be incorporated: MR needs to 
 recognize the special exit code value of -102 and interpret it as the 
 container being killed instead of a container failure.
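
 As a simplified sketch of the intended interpretation (the helper is 
 hypothetical; -102 is assumed to match YARN's ContainerExitStatus.PREEMPTED):
 {code}
 // A container killed for preemption (exit status -102) must not count
 // toward the task-attempt failure limit.
 static final int PREEMPTED_EXIT_STATUS = -102;

 boolean countsAsTaskFailure(int containerExitStatus) {
   return containerExitStatus != PREEMPTED_EXIT_STATUS;
 }
 {code}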



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5196) CheckpointAMPreemptionPolicy implements preemption in MR AM via checkpointing

2014-06-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018357#comment-14018357
 ] 

Wangda Tan commented on MAPREDUCE-5196:
---

Hi [~curino],
Thanks for your clarifications on my questions; it's all clear to me now.
For MAPREDUCE-5269, please feel free to let me know if I can help with the 
review.

Wangda


 CheckpointAMPreemptionPolicy implements preemption in MR AM via checkpointing 
 --

 Key: MAPREDUCE-5196
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5196
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: mr-am, mrv2
Reporter: Carlo Curino
Assignee: Carlo Curino
 Fix For: 3.0.0

 Attachments: MAPREDUCE-5196.1.patch, MAPREDUCE-5196.2.patch, 
 MAPREDUCE-5196.3.patch, MAPREDUCE-5196.patch, MAPREDUCE-5196.patch


 This JIRA tracks a checkpoint-based AM preemption policy. The policy handles 
 propagation of the preemption requests received from the RM to the 
 appropriate tasks, and bookkeeping of checkpoints. Actual checkpointing of the 
 task state is handled in upcoming JIRAs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5831) Old MR client is not compatible with new MR application

2014-06-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039768#comment-14039768
 ] 

Wangda Tan commented on MAPREDUCE-5831:
---

+1 for moving towards MAPREDUCE-4421 as a short-term solution to the 
client/server compatibility issue. For the longer term, as mentioned by Vinod, 
wire compatibility and rolling upgrade should be the ultimate solution. So 
IMHO, we don't need to fix this issue for now.

 Old MR client is not compatible with new MR application
 ---

 Key: MAPREDUCE-5831
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5831
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: client, mr-am
Affects Versions: 2.2.0, 2.3.0
Reporter: Zhijie Shen
Assignee: Tan, Wangda
Priority: Critical

 Recently, we saw the following scenario:
 1. The user set up a Hadoop 2.3 cluster, which contains YARN 2.3 and MR 2.3.
 2. The user ran the client on a machine where MR 2.2 was installed and on the 
 classpath.
 Then, when the user submitted a simple wordcount job, he saw the following 
 message:
 {code}
 16:00:41,027  INFO main mapreduce.Job:1345 -  map 100% reduce 100%
 16:00:41,036  INFO main mapreduce.Job:1356 - Job job_1396468045458_0006 
 completed successfully
 16:02:20,535  WARN main mapreduce.JobRunner:212 - Cannot start job 
 [wordcountJob]
 java.lang.IllegalArgumentException: No enum constant 
 org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_REDUCES
   at java.lang.Enum.valueOf(Enum.java:236)
   at 
 org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148)
   at 
 org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182)
   at 
 org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154)
   at 
 org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240)
   at 
 org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370)
   at 
 org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511)
   at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756)
   at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
   at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753)
   at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361)
   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
 . . .
 {code}
 The problem is that the wordcount job was running on one or more nodes of the 
 YARN cluster, where the MR 2.3 libs were installed and 
 JobCounter.MB_MILLIS_REDUCES is available in the counters. On the other hand, 
 due to the classpath setting, the client was likely running with the MR 2.2 
 libs. After the client retrieved the counters from the MR AM, it tried to 
 construct the Counter object with the received counter name. Unfortunately, 
 that enum constant didn't exist in the client's classpath; therefore, the "No 
 enum constant" exception is thrown here.
 JobCounter.MB_MILLIS_REDUCES was brought into MR2 via MAPREDUCE-5464, as of 
 Hadoop 2.3.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5900) Container preemption interpreted as task failures and eventually job failures

2014-06-30 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048406#comment-14048406
 ] 

Wangda Tan commented on MAPREDUCE-5900:
---

Makes sense. LGTM, +1.

 Container preemption interpreted as task failures and eventually job failures 
 --

 Key: MAPREDUCE-5900
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5900
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mr-am, mrv2
Affects Versions: 2.4.1
Reporter: Mayank Bansal
Assignee: Mayank Bansal
 Attachments: MAPREDUCE-5900-1.patch, 
 MAPREDUCE-5900-branch-241-2.patch, MAPREDUCE-5900-trunk-1.patch, 
 MAPREDUCE-5900-trunk-2.patch


 The preemption exit code we have added needs to be incorporated: MR needs to 
 recognize the special exit code value of -102 and interpret it as the 
 container being killed instead of a container failure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-07-02 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned MAPREDUCE-5956:
-

Assignee: Wangda Tan  (was: Vinod Kumar Vavilapalli)

 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker

 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Work started] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-07-02 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAPREDUCE-5956 started by Wangda Tan.

 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker

 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-07-02 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050924#comment-14050924
 ] 

Wangda Tan commented on MAPREDUCE-5956:
---

Assigned it to me; already started working on this.

 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker

 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-07-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056026#comment-14056026
 ] 

Wangda Tan commented on MAPREDUCE-5956:
---

Thanks for the thoughts provided by [~vinodkv]. I had an offline discussion 
with Vinod; posting a summary here.

Basically, there are 3 cases that need cleanup:
a. The job completed (failed or succeeded, no matter whether it's the last 
retry or not)
b. A failure happened and was captured by the MRAppMaster shutdown hook
c. A failure happened and was not captured by the MRAppMaster shutdown hook

And regarding the thoughts provided by Vinod:
{code}
1. YARN informs AM that it is the last retry as part of AM start-up or the 
register API
2. YARN informs the AM that this is the last retry as part of AM unregister
3. YARN has a way to run a separate cleanup container after it knows for sure 
that the application finished exhausting all its attempts
{code}

(1) can solve a. and part of b.
Why only part of b? Because it is possible that the MRAppMaster shutdown hook 
was triggered but some other failure happened, causing cleanup not to 
complete.
(2) can only solve a.
The reason is, if we don't have isLastRetry (or mayBeTheLastAttempt) properly 
set at register time, we don't know whether we should do cleanup or not.
(3) can solve a., b. and c.
Refer to YARN-2261 for more details.

I tried to work on (1) first; however, I found that moving the isLastRetry 
setup from MRAppMaster.init to RMCommunicator causes a lot of code changes, 
lots of unit-test failures, etc.
So my suggestion is to quickly finish (2) to make the job-completed case 
correct, which is the most common case, and to push (3) forward.

I'll upload a patch using approach (2) for review soon.

Thanks,
Wangda
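
A hedged sketch of approach (2), with hypothetical names (not the actual 
patch): decide cleanup only when the job actually finishes and the AM is about 
to unregister, instead of comparing the attempt id against maxAppAttempts up 
front.
{code}
// Hypothetical sketch: the job-finished path is the one place where we know
// the AM is done for good, so mark this as the last retry and clean up there.
void handleJobFinish(JobFinishEvent event) throws IOException {
  isLastAMRetry = true;          // we are unregistering normally
  rmCommunicator.unregister();   // tell the RM the application is finished
  cleanupStagingDir();           // no further attempt will need it
}
{code}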

 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker

 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-07-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-5956:
--

Attachment: MR-5956.patch
MR-5956.patch

Attached a fix for this. Please kindly review, [~vinodkv], [~zjshen].

 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker
 Attachments: MR-5956.patch, MR-5956.patch


 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-07-10 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057260#comment-14057260
 ] 

Wangda Tan commented on MAPREDUCE-5956:
---

Hi [~mayank_bansal],
I should say, we will retry the AM indefinitely until it *completes* and calls 
unregister. AM completion includes various states like job 
failed/killed/internal-error, etc. More specifically, when the 
JobFinishEventHandler in MRAppMaster receives a JobFinishEvent, it will call 
unregister/cleanup.
Does this answer your question?

Thanks,
Wangda

 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker
 Attachments: MR-5956.patch, MR-5956.patch


 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-07-10 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-5956:
--

Affects Version/s: 2.4.0
   Status: Patch Available  (was: In Progress)

 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 2.4.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker
 Attachments: MR-5956.patch, MR-5956.patch


 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-07-10 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-5956:
--

Attachment: (was: MR-5956.patch)

 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 2.4.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker
 Attachments: MR-5956.patch


 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-07-10 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058110#comment-14058110
 ] 

Wangda Tan commented on MAPREDUCE-5956:
---

Hi [~mayank_bansal] and [~hitesh],
Currently, when trying to kill a container, the YARN NM will first send 
SIGTERM, then sleep for a while (the default is 250ms, set by 
yarn.nodemanager.sleep-delay-before-sigkill.ms), and then send SIGKILL if the 
process is still alive.
The MR shutdown hook can catch SIGTERM. So in Hitesh's scenario, if the AM 
hits an OOM at the last retry and is killed by the NM ContainersMonitor, the 
AM will not do cleanup. If the AM is not the last attempt, it will be 
restarted by the RM.
Thanks,
Wangda
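
For reference, a minimal sketch of why a shutdown hook gives the AM that 
cleanup window (plain JVM behavior, not MRAppMaster code): the hook runs on 
SIGTERM, while the later SIGKILL cannot be caught.
{code}
// A JVM shutdown hook runs when the process receives SIGTERM; the NM's
// follow-up SIGKILL (after the configured delay) terminates it uncatchably.
Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
  public void run() {
    // Best-effort cleanup only; keep it shorter than the NM's kill delay.
    System.out.println("SIGTERM received, attempting cleanup");
  }
}));
{code}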


 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 2.4.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker
 Attachments: MR-5956.patch


 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-07-10 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-5956:
--

Attachment: MR-5956.patch

Zhijie, thanks for the review!
Uploaded a patch that addresses your comment and removes the max-attempt field 
in MRAppMaster, since we don't use it now.

 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 2.4.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker
 Attachments: MR-5956.patch, MR-5956.patch


 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-07-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058565#comment-14058565
 ] 

Wangda Tan commented on MAPREDUCE-5956:
---

Thanks [~zjshen] for the review and commit!

 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 2.4.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker
 Fix For: 2.6.0

 Attachments: MR-5956.patch, MR-5956.patch


 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down

2014-07-23 Thread Wangda Tan (JIRA)
Wangda Tan created MAPREDUCE-6002:
-

 Summary: MR task should prevent report error to AM when process is 
shutting down
 Key: MAPREDUCE-6002
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Affects Versions: 2.5.0
Reporter: Wangda Tan
Assignee: Wangda Tan


With MAPREDUCE-5900, a preempted MR task should not be treated as failed. 
But it is still possible for an MR task to fail and report to the AM when 
preemption takes effect and the AM hasn't received the completed container 
from the RM yet. This will cause the task attempt to be marked failed instead 
of preempted.

An example: FileSystem has a shutdown hook that closes all FileSystem 
instances. If the FileSystem is in use at the same time (like reading split 
details from HDFS), the MR task will fail and report the fatal error to the 
MR AM. An exception will be raised:
{code}
2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: 
attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: 
Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645)
at java.io.DataInputStream.readByte(DataInputStream.java:265)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at 
org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
at org.apache.hadoop.io.Text.readString(Text.java:464)
at org.apache.hadoop.io.Text.readString(Text.java:457)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
{code}

We should prevent this: since it is possible for other exceptions to happen 
while shutting down, we shouldn't report any such exceptions to the AM.
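
A hedged sketch of one way to detect that situation (a hypothetical helper, 
not the actual patch): the JVM refuses new shutdown hooks once shutdown has 
begun, which can serve as an "are we shutting down?" probe before reporting a 
fatal error to the AM.
{code}
// Hypothetical helper: returns true if the JVM is already shutting down,
// in which case a fatal-error report to the AM should be suppressed.
static boolean inShutdown() {
  try {
    Thread probe = new Thread(new Runnable() { public void run() {} });
    Runtime.getRuntime().addShutdownHook(probe);
    Runtime.getRuntime().removeShutdownHook(probe);
    return false;  // hook registered and removed: not shutting down
  } catch (IllegalStateException e) {
    return true;   // addShutdownHook throws once shutdown is in progress
  }
}
{code}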



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down

2014-07-24 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6002:
--

Status: Patch Available  (was: Open)

 MR task should prevent report error to AM when process is shutting down
 ---

 Key: MAPREDUCE-6002
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Affects Versions: 2.5.0
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: MR-6002.patch


 With MAPREDUCE-5900, a preempted MR task should not be treated as failed. 
 But it is still possible for an MR task to fail and report to the AM when 
 preemption takes effect and the AM hasn't received the completed container 
 from the RM yet. This will cause the task attempt to be marked failed instead 
 of preempted.
 An example: FileSystem has a shutdown hook that closes all FileSystem 
 instances. If the FileSystem is in use at the same time (like reading split 
 details from HDFS), the MR task will fail and report the fatal error to the 
 MR AM. An exception will be raised:
 {code}
 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] 
 org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: 
 attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: 
 Filesystem closed
   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
   at 
 org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645)
   at java.io.DataInputStream.readByte(DataInputStream.java:265)
   at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
   at 
 org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
   at org.apache.hadoop.io.Text.readString(Text.java:464)
   at org.apache.hadoop.io.Text.readString(Text.java:457)
   at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
 {code}
 We should prevent this: since it is possible for other exceptions to happen 
 while shutting down, we shouldn't report any such exceptions to the AM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down

2014-07-24 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6002:
--

Attachment: MR-6002.patch

Attached a patch for review.

 MR task should prevent report error to AM when process is shutting down
 ---

 Key: MAPREDUCE-6002
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Affects Versions: 2.5.0
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: MR-6002.patch


 With MAPREDUCE-5900, a preempted MR task should not be treated as failed. 
 But it is still possible for an MR task to fail and report to the AM when 
 preemption takes effect and the AM hasn't received the completed container 
 from the RM yet. This will cause the task attempt to be marked failed instead 
 of preempted.
 An example: FileSystem has a shutdown hook that closes all FileSystem 
 instances. If the FileSystem is in use at the same time (like reading split 
 details from HDFS), the MR task will fail and report the fatal error to the 
 MR AM. An exception will be raised:
 {code}
 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] 
 org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: 
 attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: 
 Filesystem closed
   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
   at 
 org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645)
   at java.io.DataInputStream.readByte(DataInputStream.java:265)
   at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
   at 
 org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
   at org.apache.hadoop.io.Text.readString(Text.java:464)
   at org.apache.hadoop.io.Text.readString(Text.java:457)
   at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
 {code}
 We should prevent this: since it is possible for other exceptions to happen 
 while shutting down, we shouldn't report any such exceptions to the AM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down

2014-07-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074228#comment-14074228
 ] 

Wangda Tan commented on MAPREDUCE-6002:
---

Hi [~zjshen],
Thanks for the review.
I totally agree with you; I think we should ignore the extremely rare race 
condition you mentioned too, since we don't deprive the task of its right to 
retry :)

Wangda

 MR task should prevent report error to AM when process is shutting down
 ---

 Key: MAPREDUCE-6002
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Affects Versions: 2.5.0
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: MR-6002.patch


 With MAPREDUCE-5900, a preempted MR task should not be treated as failed. 
 But it is still possible for an MR task to fail and report to the AM when 
 preemption takes effect and the AM hasn't received the completed container 
 from the RM yet. This will cause the task attempt to be marked failed instead 
 of preempted.
 An example: FileSystem has a shutdown hook that closes all FileSystem 
 instances. If the FileSystem is in use at the same time (like reading split 
 details from HDFS), the MR task will fail and report the fatal error to the 
 MR AM. An exception will be raised:
 {code}
 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] 
 org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: 
 attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: 
 Filesystem closed
   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
   at 
 org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645)
   at java.io.DataInputStream.readByte(DataInputStream.java:265)
   at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
   at 
 org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
   at org.apache.hadoop.io.Text.readString(Text.java:464)
   at org.apache.hadoop.io.Text.readString(Text.java:457)
   at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
 {code}
 We should prevent this: since it is possible for other exceptions to happen 
 while shutting down, we shouldn't report any such exceptions to the AM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down

2014-07-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075183#comment-14075183
 ] 

Wangda Tan commented on MAPREDUCE-6002:
---

Jason, thanks for your comments,
bq. I suspect this is a rare situation when it occurs, probably correctable in 
the user's code in many of those cases, and the attempt logs should be able to 
sort things out if it does occur.
I agree. For a normal failure, no matter what kind of exception is thrown, 
YarnChild should be able to catch it and report to the AM. In the rare case 
where some error causes the JVM to start shutting down before the task 
reports to the AM, the report would most likely fail anyway, even if we don't 
change this.

To Zhijie,
bq. Isn't it possible that PREEMPTED from RM still comes before AM knows the 
task attempt FAILED?
I think what Jason mentioned is another case: no preemption happens; a failure 
happens on the TA side, and the JVM shutdown happens before the TA can report 
the error to the AM.

Thanks,
Wangda

 MR task should prevent report error to AM when process is shutting down
 ---

 Key: MAPREDUCE-6002
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: task
Affects Versions: 2.5.0
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: MR-6002.patch


 With MAPREDUCE-5900, a preempted MR task should not be treated as failed. 
 But it is still possible for an MR task to fail and report to the AM when 
 preemption takes effect and the AM hasn't received the completed container 
 from the RM yet. This will cause the task attempt to be marked failed instead 
 of preempted.
 An example: FileSystem has a shutdown hook that closes all FileSystem 
 instances. If the FileSystem is in use at the same time (like reading split 
 details from HDFS), the MR task will fail and report the fatal error to the 
 MR AM. An exception will be raised:
 {code}
 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] 
 org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: 
 attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: 
 Filesystem closed
   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
   at 
 org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645)
   at java.io.DataInputStream.readByte(DataInputStream.java:265)
   at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
   at 
 org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
   at org.apache.hadoop.io.Text.readString(Text.java:464)
   at org.apache.hadoop.io.Text.readString(Text.java:457)
   at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
 {code}
 We should prevent this: since it is possible for other exceptions to happen 
 while shutting down, we shouldn't report any such exceptions to the AM.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5956) MapReduce AM should not use maxAttempts to determine if this is the last retry

2014-08-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106425#comment-14106425
 ] 

Wangda Tan commented on MAPREDUCE-5956:
---

Hi [~ashwinshankar77],
You can check my comment 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14056026&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14056026
 and Zhijie's comment: 
https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14057067&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14057067.

bq. In such cases, would the job go missing on the history server ? Any other 
impact ?
I think the job will not go missing on the JHS; no other impact comes to mind 
right now.
Please let me know if you have any other questions,

Thanks,
Wangda

 MapReduce AM should not use maxAttempts to determine if this is the last retry
 --

 Key: MAPREDUCE-5956
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5956
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 2.4.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Wangda Tan
Priority: Blocker
 Fix For: 2.6.0

 Attachments: MR-5956.patch, MR-5956.patch


 Found this while reviewing YARN-2074. The problem is that after YARN-2074, we 
 don't count AM preemption towards AM failures on RM side, but MapReduce AM 
 itself checks the attempt id against the max-attempt count to determine if 
 this is the last attempt.
 {code}
 public void computeIsLastAMRetry() {
   isLastAMRetry = appAttemptID.getAttemptId() >= maxAppAttempts;
 }
 {code}
 This causes issues w.r.t. deletion of the staging directory, etc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5831) Old MR client is not compatible with new MR application

2014-09-16 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-5831:
--
Assignee: Junping Du  (was: Tan, Wangda)

 Old MR client is not compatible with new MR application
 ---

 Key: MAPREDUCE-5831
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5831
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: client, mr-am
Affects Versions: 2.2.0, 2.3.0
Reporter: Zhijie Shen
Assignee: Junping Du
Priority: Critical

 Recently, we saw the following scenario:
 1. The user set up a Hadoop 2.3 cluster, which contains YARN 2.3 and MR 2.3.
 2. The user ran the client on a machine where MR 2.2 was installed and on the 
 classpath.
 Then, when the user submitted a simple wordcount job, he saw the following 
 message:
 {code}
 16:00:41,027  INFO main mapreduce.Job:1345 -  map 100% reduce 100%
 16:00:41,036  INFO main mapreduce.Job:1356 - Job job_1396468045458_0006 
 completed successfully
 16:02:20,535  WARN main mapreduce.JobRunner:212 - Cannot start job 
 [wordcountJob]
 java.lang.IllegalArgumentException: No enum constant 
 org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_REDUCES
   at java.lang.Enum.valueOf(Enum.java:236)
   at 
 org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148)
   at 
 org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182)
   at 
 org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154)
   at 
 org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240)
   at 
 org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370)
   at 
 org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511)
   at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756)
   at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
   at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753)
   at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361)
   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
 . . .
 {code}
 The problem is that the wordcount job was running on one or more nodes of the 
 YARN cluster, where the MR 2.3 libs were installed and 
 JobCounter.MB_MILLIS_REDUCES is available in the counters. On the other hand, 
 due to the classpath setting, the client was likely running with the MR 2.2 
 libs. After the client retrieved the counters from the MR AM, it tried to 
 construct the Counter object with the received counter name. Unfortunately, 
 that enum constant didn't exist in the client's classpath; therefore, the "No 
 enum constant" exception is thrown here.
 JobCounter.MB_MILLIS_REDUCES was brought into MR2 via MAPREDUCE-5464, as of 
 Hadoop 2.3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAPREDUCE-6098) org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat intermittently failed in trunk

2014-09-19 Thread Wangda Tan (JIRA)
Wangda Tan created MAPREDUCE-6098:
-

 Summary: 
org.apache.hadoop.mapreduce.lib.input.TestMRCJCFileInputFormat intermittently 
failed in trunk
 Key: MAPREDUCE-6098
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6098
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: test
Reporter: Wangda Tan
 Fix For: trunk


See: 
https://issues.apache.org/jira/browse/YARN-611?focusedCommentId=14129761&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14129761
 for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6130) Mapreduce tests fail with IllegalArgumentException in trunk

2014-10-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175486#comment-14175486
 ] 

Wangda Tan commented on MAPREDUCE-6130:
---

Assigned to me, looking at it.

 Mapreduce tests fail with IllegalArgumentException in trunk
 ---

 Key: MAPREDUCE-6130
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6130
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Ted Yu
Assignee: Wangda Tan

 From https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1929/console :
 {code}
 testComplexNameWithRegex(org.apache.hadoop.mapred.TestJobName)  Time elapsed: 
 5.153 sec  <<< ERROR!
 java.lang.IllegalArgumentException: Illegal capacity of -1.0 for label=x in 
 queue=root.default
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.getNodeLabelCapacities(CapacitySchedulerConfiguration.java:473)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.init(AbstractCSQueue.java:119)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.init(LeafQueue.java:120)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:567)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:587)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:462)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:294)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:323)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:537)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:976)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:239)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.initResourceManager(MiniYARNCluster.java:291)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.access$400(MiniYARNCluster.java:95)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceInit(MiniYARNCluster.java:442)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:267)
   at 
 org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceInit(MiniMRYarnCluster.java:183)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 {code}
 A lot of tests failed due to the 'Illegal capacity' exception.
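
 As an illustrative sketch only (simplified, with assumed property names, not 
 the actual CapacitySchedulerConfiguration code), the validation that throws 
 above looks roughly like:
 {code}
 // If a per-label capacity is unset, a -1.0 sentinel can leak through and
 // fail the range check, producing the IllegalArgumentException above.
 float capacity = conf.getFloat(
     queuePrefix + "accessible-node-labels." + label + ".capacity", -1.0f);
 if (capacity < 0.0f || capacity > 100.0f) {
   throw new IllegalArgumentException("Illegal capacity of " + capacity
       + " for label=" + label + " in queue=" + queue);
 }
 {code}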



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MAPREDUCE-6130) Mapreduce tests fail with IllegalArgumentException in trunk

2014-10-17 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned MAPREDUCE-6130:
-

Assignee: Wangda Tan

 Mapreduce tests fail with IllegalArgumentException in trunk
 ---

 Key: MAPREDUCE-6130
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6130
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Ted Yu
Assignee: Wangda Tan

 From https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1929/console :
 {code}
 testComplexNameWithRegex(org.apache.hadoop.mapred.TestJobName)  Time elapsed: 
 5.153 sec   ERROR!
 java.lang.IllegalArgumentException: Illegal capacity of -1.0 for label=x in 
 queue=root.default
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.getNodeLabelCapacities(CapacitySchedulerConfiguration.java:473)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.init(AbstractCSQueue.java:119)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.init(LeafQueue.java:120)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:567)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:587)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:462)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:294)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:323)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:537)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:976)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:239)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.initResourceManager(MiniYARNCluster.java:291)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.access$400(MiniYARNCluster.java:95)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceInit(MiniYARNCluster.java:442)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:267)
   at 
 org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceInit(MiniMRYarnCluster.java:183)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 {code}
 A lot of tests failed due to the 'Illegal capacity' exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6130) Mapreduce tests fail with IllegalArgumentException in trunk

2014-10-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175526#comment-14175526
 ] 

Wangda Tan commented on MAPREDUCE-6130:
---

Opened YARN-2705 to resolve the issues caused by the default NodeLabelManager.

 Mapreduce tests fail with IllegalArgumentException in trunk
 ---

 Key: MAPREDUCE-6130
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6130
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Ted Yu
Assignee: Wangda Tan

 From https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1929/console :
 {code}
 testComplexNameWithRegex(org.apache.hadoop.mapred.TestJobName)  Time elapsed: 
 5.153 sec   ERROR!
 java.lang.IllegalArgumentException: Illegal capacity of -1.0 for label=x in 
 queue=root.default
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.getNodeLabelCapacities(CapacitySchedulerConfiguration.java:473)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.init(AbstractCSQueue.java:119)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.init(LeafQueue.java:120)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:567)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:587)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:462)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:294)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:323)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:537)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:976)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:239)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.initResourceManager(MiniYARNCluster.java:291)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.access$400(MiniYARNCluster.java:95)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceInit(MiniYARNCluster.java:442)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:267)
   at 
 org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceInit(MiniMRYarnCluster.java:183)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 {code}
 A lot of tests failed due to the 'Illegal capacity' exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6212) Hadoop 2.6.0: Basic error “starting MRAppMaster” after installing

2015-01-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6212:
--
Priority: Major  (was: Critical)

 Hadoop 2.6.0: Basic error “starting MRAppMaster” after installing
 -

 Key: MAPREDUCE-6212
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6212
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: security
Affects Versions: 2.6.0
 Environment: Ubuntu 64bit
Reporter: Dinh Hoang Mai
 Fix For: 2.6.0


 I have just started working with Hadoop 2. After installing it with basic 
 configs, I always fail to run any examples. Has anyone seen this problem? 
 Please help me.
 This is the log:
 2015-01-08 01:52:01,599 INFO [main] 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for 
 application appattempt_1420648881673_0004_01
 2015-01-08 01:52:01,764 FATAL [main] 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
 java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
   at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
   at org.apache.hadoop.security.Groups.init(Groups.java:70)
   at org.apache.hadoop.security.Groups.init(Groups.java:66)
   at 
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
   at 
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
   at 
 org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:299)
   at 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1473)
   at 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1429)
 Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
   at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
   ... 7 more
 Caused by: java.lang.UnsatisfiedLinkError: 
 org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
   at 
 org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native 
 Method)
   at 
 org.apache.hadoop.security.JniBasedUnixGroupsMapping.clinit(JniBasedUnixGroupsMapping.java:49)
   at 
 org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.init(JniBasedUnixGroupsMappingWithFallback.java:39)
   ... 12 more
 2015-01-08 01:52:01,767 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting 
 with status 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6212) UnsatisfiedLinkError: org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative() happened when starting MRAppMaster

2015-01-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6212:
--
Summary: UnsatisfiedLinkError: 
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative() happened 
when starting MRAppMaster  (was: Hadoop 2.6.0: Basic error “starting 
MRAppMaster” after installing)

 UnsatisfiedLinkError: 
 org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative() happened 
 when starting MRAppMaster
 

 Key: MAPREDUCE-6212
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6212
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: security
Affects Versions: 2.6.0
 Environment: Ubuntu 64bit
Reporter: Dinh Hoang Mai
 Fix For: 2.6.0


 I have just started working with Hadoop 2. After installing it with basic 
 configs, I always fail to run any examples. Has anyone seen this problem? 
 Please help me.
 This is the log:
 2015-01-08 01:52:01,599 INFO [main] 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for 
 application appattempt_1420648881673_0004_01
 2015-01-08 01:52:01,764 FATAL [main] 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
 java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
   at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
   at org.apache.hadoop.security.Groups.init(Groups.java:70)
   at org.apache.hadoop.security.Groups.init(Groups.java:66)
   at 
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
   at 
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
   at 
 org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:299)
   at 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1473)
   at 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1429)
 Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
   at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
   ... 7 more
 Caused by: java.lang.UnsatisfiedLinkError: 
 org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
   at 
 org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native 
 Method)
   at 
 org.apache.hadoop.security.JniBasedUnixGroupsMapping.clinit(JniBasedUnixGroupsMapping.java:49)
   at 
 org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.init(JniBasedUnixGroupsMappingWithFallback.java:39)
   ... 12 more
 2015-01-08 01:52:01,767 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting 
 with status 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6212) Hadoop 2.6.0: Basic error “starting MRAppMaster” after installing

2015-01-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268354#comment-14268354
 ] 

Wangda Tan commented on MAPREDUCE-6212:
---

This looks more like a configuration issue than a bug, [~maidh91]. Could you 
provide more information about your basic configuration?
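
The UnsatisfiedLinkError on anchorNative() usually indicates that a libhadoop 
native library was loaded but does not match this Hadoop version. A minimal 
check along these lines may help narrow it down (NativeCheck is an 
illustrative class name, not part of Hadoop):
{code}
import org.apache.hadoop.util.NativeCodeLoader;

public class NativeCheck {
  public static void main(String[] args) {
    // true only when a libhadoop native library was found and loaded
    System.out.println("native loaded: "
        + NativeCodeLoader.isNativeCodeLoaded());
    // the path the JVM searches for libhadoop.so
    System.out.println("java.library.path = "
        + System.getProperty("java.library.path"));
  }
}
{code}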

 Hadoop 2.6.0: Basic error “starting MRAppMaster” after installing
 -

 Key: MAPREDUCE-6212
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6212
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: security
Affects Versions: 2.6.0
 Environment: Ubuntu 64bit
Reporter: Dinh Hoang Mai
Priority: Critical
 Fix For: 2.6.0


 I have just started working with Hadoop 2. After installing it with basic 
 configs, I always fail to run any examples. Has anyone seen this problem? 
 Please help me.
 This is the log:
 2015-01-08 01:52:01,599 INFO [main] 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for 
 application appattempt_1420648881673_0004_01
 2015-01-08 01:52:01,764 FATAL [main] 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
 java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
   at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
   at org.apache.hadoop.security.Groups.init(Groups.java:70)
   at org.apache.hadoop.security.Groups.init(Groups.java:66)
   at 
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
   at 
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
   at 
 org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:299)
   at 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1473)
   at 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1429)
 Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
   at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
   ... 7 more
 Caused by: java.lang.UnsatisfiedLinkError: 
 org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
   at 
 org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native 
 Method)
   at 
 org.apache.hadoop.security.JniBasedUnixGroupsMapping.clinit(JniBasedUnixGroupsMapping.java:49)
   at 
 org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.init(JniBasedUnixGroupsMappingWithFallback.java:39)
   ... 12 more
 2015-01-08 01:52:01,767 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting 
 with status 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAPREDUCE-6130) Mapreduce tests fail with IllegalArgumentException in trunk

2015-01-13 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved MAPREDUCE-6130.
---
Resolution: Done

This problem was already solved by YARN-2705.

 Mapreduce tests fail with IllegalArgumentException in trunk
 ---

 Key: MAPREDUCE-6130
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6130
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Ted Yu
Assignee: Wangda Tan

 From https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1929/console :
 {code}
 testComplexNameWithRegex(org.apache.hadoop.mapred.TestJobName)  Time elapsed: 
 5.153 sec   ERROR!
 java.lang.IllegalArgumentException: Illegal capacity of -1.0 for label=x in 
 queue=root.default
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.getNodeLabelCapacities(CapacitySchedulerConfiguration.java:473)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.init(AbstractCSQueue.java:119)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.init(LeafQueue.java:120)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:567)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:587)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:462)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:294)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:323)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:537)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:976)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:239)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.initResourceManager(MiniYARNCluster.java:291)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.access$400(MiniYARNCluster.java:95)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceInit(MiniYARNCluster.java:442)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:267)
   at 
 org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster.serviceInit(MiniMRYarnCluster.java:183)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 {code}
 A lot of tests failed due to the 'Illegal capacity' exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-04-01 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6304:
--
Description: Per the discussion on YARN-796, we need a mechanism in 
MAPREDUCE to specify node labels when submitting MR jobs.  (was: Per the 
discussion on Yarn-796, we need a mechanism in MAPREDUCE to specify node labels 
when submitting MR jobs.)

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang

 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAPREDUCE-6326) Cleanup ResourceManagerAdministrationProtocol interface audience

2015-04-21 Thread Wangda Tan (JIRA)
Wangda Tan created MAPREDUCE-6326:
-

 Summary: Cleanup ResourceManagerAdministrationProtocol interface 
audience
 Key: MAPREDUCE-6326
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6326
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: client, resourcemanager
Reporter: Wangda Tan


I noticed that ResourceManagerAdministrationProtocol has a @Private audience 
for the class and a @Public audience for its methods, which doesn't make sense 
to me. We should make the class and method audiences consistent.
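
A minimal sketch of the inconsistency (hypothetical interface and method 
names, for illustration only):
{code}
import org.apache.hadoop.classification.InterfaceAudience;

// The class-level audience says the interface is internal-only...
@InterfaceAudience.Private
public interface ExampleAdminProtocol {

  // ...while the method-level audience claims a public contract.
  @InterfaceAudience.Public
  void refreshSomething();
}
{code}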



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-04-27 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14515853#comment-14515853
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

bq. Well, my view is that we should support them: in some cases the mapper can 
run on any node, but the reducer may require more memory, so clients might 
require it to run on high-memory nodes (maybe with some memory constraint on 
the nodes) or with other node constraints.
That's correct. So you agree we should add mapper/reducer 
node-label-expressions?

bq. Well, IIUC, ApplicationSubmissionContext.getNodeLabelExpression() is for 
the mapper/reducer, and the AM's node label expression is set in 
ApplicationSubmissionContext.getAMContainerResourceRequest. I could 
cross-verify the same with the code in ApplicationMasterService (line number 
495) and RMAppManager (line number 378), and so did not get what you meant by 
the above statement; correct me if my understanding is wrong.
What I meant is:
When job.label = x and the am/mapper/reducer labels are not set, all 
containers should get allocated on x.
When job.label = x and am.label = y, and the mapper/reducer labels are not 
set, the AM will get allocated on y (overriding x), but mappers/reducers 
should get allocated on x.
When job.label = x, am.label = y, and mapper.label = z, the AM will get 
allocated on y, mappers will get allocated on z (overriding x), and reducers 
will get allocated on x.
This should be the existing behavior. The mapper/reducer label should be added 
to the ResourceRequest (which may require modifying {{RMContainerRequestor}}) 
instead of the ApplicationSubmissionContext.
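
A minimal sketch of that precedence (illustrative code, not the actual MR AM 
implementation):
{code}
public class LabelPrecedence {
  // The more specific expression (am/mapper/reducer) wins; otherwise fall
  // back to the job-level label.
  static String resolve(String jobLabel, String specificLabel) {
    return specificLabel != null ? specificLabel : jobLabel;
  }

  public static void main(String[] args) {
    // job.label = x, am.label = y, mapper.label = z, reducer label unset:
    System.out.println(resolve("x", "y"));   // AM allocated on y
    System.out.println(resolve("x", "z"));   // mappers allocated on z
    System.out.println(resolve("x", null));  // reducers fall back to x
  }
}
{code}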

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
 Fix For: 2.8.0

 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-01 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523543#comment-14523543
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

[~Naganarasimha], 
Patch generally LGTM, one minor comment:
- If label="" is specified in mapred-default.xml, the job will always set 
label="" when adding a resource request, but the previous behavior was to send 
null instead of an empty string. So I suggest changing the default to NOT 
SPECIFIED (which is not a valid label, since it contains a space) and 
replacing it with null. Does this make sense to you? Do you have any other 
ideas on this?
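
A sketch of the suggested default handling (the property name and the exact 
sentinel string are assumptions for illustration):
{code}
import org.apache.hadoop.conf.Configuration;

public class LabelDefault {
  // Contains a space, so it can never collide with a real label.
  static final String NOT_SET = "NOT SPECIFIED";

  static String getJobLabelExpression(Configuration conf) {
    String label = conf.get("mapreduce.job.node-label-expression", NOT_SET);
    // Preserve the previous behavior: send null, not an empty string.
    return NOT_SET.equals(label) ? null : label;
  }
}
{code}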

Also, could you deploy a cluster with a few nodes and 2-3 labels, and run MR 
jobs with am.label, mapper.label, reducer.label, etc.? That will be very 
important for making sure everything works well.

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
 Fix For: 2.8.0

 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-04-27 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514669#comment-14514669
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

Hi [~Naganarasimha],
Just took a look at the patch; some comments about the syntax:
- Beyond the AM/job node-label settings, should we support mapper/reducer 
settings?
- When the job and the AM (or mapper/reducer) labels are set together, the 
AM's (or mapper/reducer's) node-label-expression should overwrite the job's 
setting.
- If we need to support mapper/reducer node-label-expressions, we may need 
some changes on the MR AM side and some tests.

Any ideas? [~lohit]. 

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
 Fix For: 2.8.0

 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6302) deadlock in a job between map and reduce cores allocation

2015-05-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6302:
--
Status: Patch Available  (was: Open)

 deadlock in a job between map and reduce cores allocation 
 --

 Key: MAPREDUCE-6302
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: mai shurong
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
 log.txt, mr-6302-prelim.patch, queue_with_max163cores.png, 
 queue_with_max263cores.png, queue_with_max333cores.png


 I submit a big job, which has 500 maps and 350 reduces, to a queue (fair 
 scheduler) with 300 max cores. When the big MapReduce job is running 100% of 
 its maps, the 300 reduces occupy all 300 max cores in the queue. Then a map 
 fails and retries, waiting for a core, while the 300 reduces are waiting for 
 the failed map to finish, so a deadlock occurs. As a result, the job is 
 blocked, and later jobs in the queue cannot run because no cores are 
 available in the queue.
 I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6302) deadlock in a job between map and reduce cores allocation

2015-05-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533618#comment-14533618
 ] 

Wangda Tan commented on MAPREDUCE-6302:
---

[~kasha],
Thanks for working on this.

Just took a look at your patch; the overall approach looks good. Some comments 
about the configuration:
{{MR_JOB_REDUCER_FORCE_PREEMPT_DELAY_SEC}}
This is not really a reducer force-preempt delay; it is the timeout on mapper 
allocation before starting reducer preemption. I suggest renaming it to 
mapreduce.job.mapper.timeout-to-start-reducer-preemption-ms; I also think it's 
better to use ms instead of sec, for finer control.

In addition, should we add a value that lets the user disable this, for 
example -1?

And could you add some tests?
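
A sketch of the suggested timeout semantics (the config name follows the 
suggestion above and is not necessarily the final one):
{code}
import org.apache.hadoop.conf.Configuration;

public class ReducerPreemptionTimer {
  private final long timeoutMs;

  ReducerPreemptionTimer(Configuration conf) {
    timeoutMs = conf.getLong(
        "mapreduce.job.mapper.timeout-to-start-reducer-preemption-ms",
        60000L);
  }

  // Preempt reducers only once a mapper request has starved longer than the
  // timeout; a negative value (e.g. -1) disables forced preemption.
  boolean shouldPreemptReducers(long mapperStarvedSinceMs, long nowMs) {
    return timeoutMs >= 0 && nowMs - mapperStarvedSinceMs >= timeoutMs;
  }
}
{code}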

 deadlock in a job between map and reduce cores allocation 
 --

 Key: MAPREDUCE-6302
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: mai shurong
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
 log.txt, mr-6302-prelim.patch, queue_with_max163cores.png, 
 queue_with_max263cores.png, queue_with_max333cores.png


 I submit a big job, which has 500 maps and 350 reduces, to a queue (fair 
 scheduler) with 300 max cores. When the big MapReduce job is running 100% of 
 its maps, the 300 reduces occupy all 300 max cores in the queue. Then a map 
 fails and retries, waiting for a core, while the 300 reduces are waiting for 
 the failed map to finish, so a deadlock occurs. As a result, the job is 
 blocked, and later jobs in the queue cannot run because no cores are 
 available in the queue.
 I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6302) deadlock in a job between map and reduce cores allocation

2015-05-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533619#comment-14533619
 ] 

Wangda Tan commented on MAPREDUCE-6302:
---

Linked to YARN-1680: one is for more accurate headroom calculation, and one is 
to guard against inaccurate calculation.

 deadlock in a job between map and reduce cores allocation 
 --

 Key: MAPREDUCE-6302
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: mai shurong
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
 log.txt, mr-6302-prelim.patch, queue_with_max163cores.png, 
 queue_with_max263cores.png, queue_with_max333cores.png


 I submit a big job, which has 500 maps and 350 reduces, to a queue (fair 
 scheduler) with 300 max cores. When the big MapReduce job is running 100% of 
 its maps, the 300 reduces occupy all 300 max cores in the queue. Then a map 
 fails and retries, waiting for a core, while the 300 reduces are waiting for 
 the failed map to finish, so a deadlock occurs. As a result, the job is 
 blocked, and later jobs in the queue cannot run because no cores are 
 available in the queue.
 I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-04 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6304:
--
Status: Open  (was: Patch Available)

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
 Fix For: 2.8.0

 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527488#comment-14527488
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

[~Naganarasimha], removing it from mapred-default.xml is not good; 
mapred-default.xml is the way we document all the options we have. How about 
making its default USE_QUEUE_DEFINED_DEFAULT (or a better name)? It isn't 
really "not specified" or null; it uses the queue-defined node label 
expression. Thoughts?

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
 Fix For: 2.8.0

 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-05 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528957#comment-14528957
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

[~Naganarasimha], thanks for pointing me to yarn.ipc.*.factory, etc. I think 
it's important to:
- not bring in additional unnecessary default config,
- follow what we have in *-default.xml, and
- make it easy for admins to understand.

So I think it's fine to do what you suggested, but could you please mention in 
the description that by default the job's node-label-expression is not set, 
and the queue's default-node-label-expression will be used?

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
 Fix For: 2.8.0

 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539024#comment-14539024
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

Thanks [~Naganarasimha] for working on this and testing it; mostly LGTM. Could 
you add the overwriting behavior to mapred-default.xml? (For example: by 
default the queue's default-node-label-expression is used, AM.expression can 
overwrite job.expression, etc.)

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
  Labels: mapreduce
 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch, 
 MAPREDUCE-6304.20150510-1.patch, MAPREDUCE-6304.20150511-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-12 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540701#comment-14540701
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

Thanks for the update, [~Naganarasimha]. +1 for the latest patch.

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
  Labels: mapreduce
 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch, 
 MAPREDUCE-6304.20150510-1.patch, MAPREDUCE-6304.20150511-1.patch, 
 MAPREDUCE-6304.20150512-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6302) deadlock in a job between map and reduce cores allocation

2015-05-14 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6302:
--
Target Version/s: 2.8.0

Marked its target version as 2.8.0.

 deadlock in a job between map and reduce cores allocation 
 --

 Key: MAPREDUCE-6302
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: mai shurong
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
 log.txt, mr-6302-prelim.patch, queue_with_max163cores.png, 
 queue_with_max263cores.png, queue_with_max333cores.png


 I submit a big job, which has 500 maps and 350 reduces, to a queue (fair 
 scheduler) with 300 max cores. When the big MapReduce job is running 100% of 
 its maps, the 300 reduces occupy all 300 max cores in the queue. Then a map 
 fails and retries, waiting for a core, while the 300 reduces are waiting for 
 the failed map to finish, so a deadlock occurs. As a result, the job is 
 blocked, and later jobs in the queue cannot run because no cores are 
 available in the queue.
 I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-14 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544102#comment-14544102
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

[~Naganarasimha],
Actually, the CS now checks that the label-expression of the ANY resource 
request matches the labels on a node before allocating rack/node-local 
requests. In other words, the node-label-expression of a rack/node-local 
request will be ignored, and the request will be treated the same as the ANY 
request. To reply to your concern: MR tasks run only on the labeled nodes.
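
A simplified sketch of that matching rule (not the actual CapacityScheduler 
code; single-label expressions assumed):
{code}
import java.util.Set;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class AnyRequestLabelCheck {
  // Only the ANY request's label expression is consulted; expressions on
  // node-local/rack-local requests are effectively ignored.
  static boolean nodeMatches(ResourceRequest anyRequest,
                             Set<String> nodeLabels) {
    String expr = anyRequest.getNodeLabelExpression();
    return expr == null || expr.isEmpty() || nodeLabels.contains(expr);
  }
}
{code}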

Since this changes the MR configuration, I will wait another couple of days 
before committing it.

Thanks,

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
  Labels: mapreduce
 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch, 
 MAPREDUCE-6304.20150510-1.patch, MAPREDUCE-6304.20150511-1.patch, 
 MAPREDUCE-6304.20150512-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAPREDUCE-1439) Learning Scheduler

2015-05-13 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved MAPREDUCE-1439.
---
Resolution: Not A Problem

[~jaideep], thanks for sharing this, but JIRA is for tracking issues and 
feature proposals. I suggest sharing it via the mailing list. Closing as Not A 
Problem.

 Learning Scheduler
 --

 Key: MAPREDUCE-1439
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1439
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: jobtracker
Reporter: Jaideep
 Attachments: learning-scheduler-description.pdf


 I would like to contribute the scheduler I have written to the MapReduce 
 project. Presently the scheduler source code is available on 
 http://code.google.com/p/learnsched/. It has been tested to work with Hadoop 
 0.20, although the code available at the URL had been modified to build with 
 trunk and needs testing. Currently the scheduler is in experimental stages, 
 and any feedback for improvement will be extremely useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (MAPREDUCE-1439) Learning Scheduler

2015-05-13 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reopened MAPREDUCE-1439:
---
  Assignee: Jaideep

My bad, this should be a new feature; reopening and assigning to [~jaideep].

 Learning Scheduler
 --

 Key: MAPREDUCE-1439
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1439
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: jobtracker
Reporter: Jaideep
Assignee: Jaideep
 Attachments: learning-scheduler-description.pdf


 I would like to contribute the scheduler I have written to the MapReduce 
 project. Presently the scheduler source code is available on 
 http://code.google.com/p/learnsched/. It has been tested to work with Hadoop 
 0.20, although the code available at the URL had been modified to build with 
 trunk and needs testing. Currently the scheduler is in experimental stages, 
 and any feedback for improvement will be extremely useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-15 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546230#comment-14546230
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

[~Naganarasimha],
Thanks for replying; I'm glad to hear such cases are already addressed.

I think that will be problematic if we enable_any_resource_request_only, since 
we currently calculate pending-resource-by-label only for the ANY request. It 
is fine with me to leave the node-local/rack-local out-of-partition issue 
behind; the MR job can still start and run.

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
  Labels: mapreduce
 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch, 
 MAPREDUCE-6304.20150510-1.patch, MAPREDUCE-6304.20150511-1.patch, 
 MAPREDUCE-6304.20150512-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-15 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545831#comment-14545831
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

[~Naganarasimha], I think I made a mistake in my previous comment: rack/node 
local resource requests are not allowed to specify a label-expression for now 
(see SchedulerUtils.validateResourceRequest). I think you need to add a check 
on the MR side: if the resourceName is not ANY, don't set a 
node-label-expression on it.
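
Something along these lines (a rough sketch, not the committed patch):
{code}
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class LabelGuard {
  static void applyLabel(ResourceRequest req, String labelExpression) {
    // Only the ANY request may carry a node-label-expression; leave it
    // unset for host- and rack-level requests.
    if (ResourceRequest.ANY.equals(req.getResourceName())) {
      req.setNodeLabelExpression(labelExpression);
    }
  }
}
{code}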

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
  Labels: mapreduce
 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch, 
 MAPREDUCE-6304.20150510-1.patch, MAPREDUCE-6304.20150511-1.patch, 
 MAPREDUCE-6304.20150512-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6302) deadlock in a job between map and reduce cores allocation

2015-04-14 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494812#comment-14494812
 ] 

Wangda Tan commented on MAPREDUCE-6302:
---

bq. In other words, I guess I am proposing MR use the headroom from YARN more 
as a heuristic than an absolute guarantee. MR should use the resources given to 
it in the best possible way it can.
+1 to making the headroom more heuristic; actually, it can only be a 
heuristic, since the YARN RM cannot precisely know an app's headroom in most 
cases. Treating it as a heuristic can avoid many deadlocks between mappers and 
reducers like this one.
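
Roughly (illustrative only, not the actual AM scheduling code):
{code}
public class HeadroomHeuristic {
  // Preempt when the headroom already says mappers cannot fit, but also
  // after a starvation timeout even when the headroom claims there is
  // room, since the reported headroom may be stale or wrong.
  static boolean preemptReducers(boolean headroomSaysMapperFits,
                                 long mapperStarvedMs, long timeoutMs) {
    if (!headroomSaysMapperFits) {
      return true;
    }
    return mapperStarvedMs >= timeoutMs;
  }
}
{code}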

 deadlock in a job between map and reduce cores allocation 
 --

 Key: MAPREDUCE-6302
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: mai shurong
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
 queue_with_max163cores.png, queue_with_max263cores.png, 
 queue_with_max333cores.png


 I submit a big job, which has 500 maps and 350 reduces, to a queue (fair 
 scheduler) with 300 max cores. When the big MapReduce job is running 100% of 
 its maps, the 300 reduces occupy all 300 max cores in the queue. Then a map 
 fails and retries, waiting for a core, while the 300 reduces are waiting for 
 the failed map to finish, so a deadlock occurs. As a result, the job is 
 blocked, and later jobs in the queue cannot run because no cores are 
 available in the queue.
 I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6189) TestMRTimelineEventHandling fails in trunk

2015-04-08 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485641#comment-14485641
 ] 

Wangda Tan commented on MAPREDUCE-6189:
---

bq. We do have 
user.getResourceUsage().decAMUsed(application.getAMResource()); in 
finishApplicationAttempt() in LeafQueue.java. However, I didn't see the method 
get called after putting some debug log, as well as APP_ATTEMPT_REMOVED event 
get logged. Sounds weird.

The application isn't finished on the RM side: the app is in the FINISHING 
state, and the RM will wait for an NM heartbeat carrying the AM container exit 
message before releasing the AM container's resources (MAPREDUCE-4099).
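
A schematic of that lifecycle (simplified; not the actual RMApp/LeafQueue 
code):
{code}
public class AmLifecycleSketch {
  enum AppState { RUNNING, FINISHING, FINISHED }

  AppState state = AppState.RUNNING;

  void onAmUnregister() {
    // The job is done, but the AM container is still alive on its NM.
    state = AppState.FINISHING;
  }

  void onNmHeartbeatReportsAmExit() {
    // Only now are the AM container's resources (and the queue's AMUsed
    // accounting) released; see MAPREDUCE-4099.
    state = AppState.FINISHED;
  }
}
{code}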

 TestMRTimelineEventHandling fails in trunk
 --

 Key: MAPREDUCE-6189
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6189
 Project: Hadoop Map/Reduce
  Issue Type: Test
Reporter: Ted Yu
Assignee: Junping Du

 From https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1988/:
 {code}
 REGRESSION:  
 org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMRTimelineEventHandling
 Error Message:
 Job didn't finish in 30 seconds
 Stack Trace:
 java.io.IOException: Job didn't finish in 30 seconds
 at 
 org.apache.hadoop.mapred.UtilsForTests.runJobSucceed(UtilsForTests.java:622)
 at 
 org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMRTimelineEventHandling(TestMRTimelineEventHandling.java:105)
 REGRESSION:  
 org.apache.hadoop.mapred.TestMRTimelineEventHandling.testTimelineServiceStartInMiniCluster
 Error Message:
 Job didn't finish in 30 seconds
 Stack Trace:
 java.io.IOException: Job didn't finish in 30 seconds
 at 
 org.apache.hadoop.mapred.UtilsForTests.runJobSucceed(UtilsForTests.java:622)
 at 
 org.apache.hadoop.mapred.TestMRTimelineEventHandling.testTimelineServiceStartInMiniCluster(TestMRTimelineEventHandling.java:61)
 REGRESSION:  
 org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMapreduceJobTimelineServiceEnabled
 Error Message:
 Job didn't finish in 30 seconds
 Stack Trace:
 java.io.IOException: Job didn't finish in 30 seconds
 at 
 org.apache.hadoop.mapred.UtilsForTests.runJobSucceed(UtilsForTests.java:622)
 at 
 org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMapreduceJobTimelineServiceEnabled(TestMRTimelineEventHandling.java:198)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6311) AM JVM hangs after job unregister and finished

2015-04-08 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485675#comment-14485675
 ] 

Wangda Tan commented on MAPREDUCE-6311:
---

This seems like a big issue; thanks for working on this, [~rohithsharma].

Do you know in which version of Hadoop this issue happens?

 AM JVM hangs after job unregister and finished
 --

 Key: MAPREDUCE-6311
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6311
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Rohith
Assignee: Rohith
 Attachments: 0001-MAPREDUCE-6311.patch, 0001-MAPREDUCE-6311.patch, 
 MR_TD.out


 It is observed that the MRAppMaster JVM hangs after unregistering with the 
 ResourceManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6302) deadlock in a job between map and reduce cores allocation

2015-04-01 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391208#comment-14391208
 ] 

Wangda Tan commented on MAPREDUCE-6302:
---

Moved to mapreduce. And [~shurong.mai], could you confirm the Hadoop version 
you're currently using?

 deadlock in a job between map and reduce cores allocation 
 --

 Key: MAPREDUCE-6302
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: mai shurong
Priority: Critical
 Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
 queue_with_max163cores.png, queue_with_max263cores.png, 
 queue_with_max333cores.png


 I submit a big job, which has 500 maps and 350 reduces, to a queue (fair 
 scheduler) with 300 max cores. When the big MapReduce job is running 100% of 
 its maps, the 300 reduces occupy all 300 max cores in the queue. Then a map 
 fails and retries, waiting for a core, while the 300 reduces are waiting for 
 the failed map to finish, so a deadlock occurs. As a result, the job is 
 blocked, and later jobs in the queue cannot run because no cores are 
 available in the queue.
 I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Moved] (MAPREDUCE-6302) deadlock in a job between map and reduce cores allocation

2015-04-01 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan moved YARN-3416 to MAPREDUCE-6302:
-

  Component/s: (was: fairscheduler)
Affects Version/s: (was: 2.6.0)
   2.6.0
  Key: MAPREDUCE-6302  (was: YARN-3416)
  Project: Hadoop Map/Reduce  (was: Hadoop YARN)

 deadlock in a job between map and reduce cores allocation 
 --

 Key: MAPREDUCE-6302
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: mai shurong
Priority: Critical
 Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
 queue_with_max163cores.png, queue_with_max263cores.png, 
 queue_with_max333cores.png


 I submit a big job, which has 500 maps and 350 reduces, to a queue (fair 
 scheduler) with 300 max cores. When the big MapReduce job is running 100% of 
 its maps, the 300 reduces occupy all 300 max cores in the queue. Then a map 
 fails and retries, waiting for a core, while the 300 reduces are waiting for 
 the failed map to finish, so a deadlock occurs. As a result, the job is 
 blocked, and later jobs in the queue cannot run because no cores are 
 available in the queue.
 I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-04-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487941#comment-14487941
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

Just found I'm not on the watcher list, so I missed some discussions.

I think [~john.jian.fang]'s use case is very unique but also interesting: the 
platform provider has no control over or knowledge of users' applications, and 
users don't understand the platform's details, such as which machines are 
temporary.

Maybe a possible way to solve this problem is to add a 
global-am-resource-request setting, which could include a 
node-label-expression and node/rack information. As the YARN RM doesn't know 
the application's running model, it only knows which container is the AM, and 
that seems enough for your requirements.

But adding such a global-am-resource-request setting can also be dangerous; 
for example, what happens when a queue cannot access the 
node-label-expression of the global-am-resource-request?
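
For reference, a sketch of steering AM placement with today's per-application 
API (the global setting itself would be new; the label name and sizes here 
are made-up values):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class AmPlacementSketch {
  static void steerAm(ApplicationSubmissionContext appContext) {
    ResourceRequest amRequest = ResourceRequest.newInstance(
        Priority.newInstance(0), ResourceRequest.ANY,
        Resource.newInstance(1536, 1), 1);
    // The RM only knows which container is the AM, so this request is the
    // natural hook a global setting could populate.
    amRequest.setNodeLabelExpression("stable");
    appContext.setAMContainerResourceRequest(amRequest);
  }
}
{code}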

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R

 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-27 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6304:
--
   Resolution: Fixed
Fix Version/s: 2.8.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
  Labels: mapreduce
 Fix For: 2.8.0

 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch, 
 MAPREDUCE-6304.20150510-1.patch, MAPREDUCE-6304.20150511-1.patch, 
 MAPREDUCE-6304.20150512-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-27 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561824#comment-14561824
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

Just committed to branch-2/trunk. Thanks, [~Naganarasimha], and thanks for the 
reviews from [~john.jian.fang] and [~yufeldman]. I cannot resolve it because 
the JIRA system is so slow today.

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
  Labels: mapreduce
 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch, 
 MAPREDUCE-6304.20150510-1.patch, MAPREDUCE-6304.20150511-1.patch, 
 MAPREDUCE-6304.20150512-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-05-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14556454#comment-14556454
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

This is the last call for comments; I plan to get this in today. :)

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
  Labels: mapreduce
 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch, 
 MAPREDUCE-6304.20150510-1.patch, MAPREDUCE-6304.20150511-1.patch, 
 MAPREDUCE-6304.20150512-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-07-13 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625631#comment-14625631
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

[~Naganarasimha], I just found that the problem is already resolved by 
MAPREDUCE-6421; please ignore my previous comment.

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
  Labels: mapreduce
 Fix For: 2.8.0

 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch, 
 MAPREDUCE-6304.20150510-1.patch, MAPREDUCE-6304.20150511-1.patch, 
 MAPREDUCE-6304.20150512-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2015-07-13 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625618#comment-14625618
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

[~Naganarasimha], I just found there's one findbugs warning that I missed 
before committing. Could you reopen this issue and add an addendum patch to 
fix it?

Thanks,

 Specifying node labels when submitting MR jobs
 --

 Key: MAPREDUCE-6304
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Jian Fang
Assignee: Naganarasimha G R
  Labels: mapreduce
 Fix For: 2.8.0

 Attachments: MAPREDUCE-6304.20150410-1.patch, 
 MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch, 
 MAPREDUCE-6304.20150510-1.patch, MAPREDUCE-6304.20150511-1.patch, 
 MAPREDUCE-6304.20150512-1.patch


 Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
 node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6302) Preempt reducers after a configurable timeout irrespective of headroom

2015-11-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6302:
--
Target Version/s: 2.8.0, 2.6.3, 2.7.3  (was: 2.8.0)

> Preempt reducers after a configurable timeout irrespective of headroom
> --
>
> Key: MAPREDUCE-6302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: mai shurong
>Assignee: Karthik Kambatla
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
> log.txt, mr-6302-1.patch, mr-6302-2.patch, mr-6302-3.patch, mr-6302-4.patch, 
> mr-6302-5.patch, mr-6302-6.patch, mr-6302-7.patch, mr-6302-prelim.patch, 
> mr-6302_branch-2.patch, queue_with_max163cores.png, 
> queue_with_max263cores.png, queue_with_max333cores.png
>
>
> I submit a big job, which has 500 maps and 350 reduces, to a queue (fair 
> scheduler) with 300 max cores. When the big MapReduce job is running at 100% 
> maps, the 300 reduces have occupied the 300 max cores in the queue. Then a 
> map fails and retries, waiting for a core, while the 300 reduces are waiting 
> for the failed map to finish. So a deadlock occurs. As a result, the job is 
> blocked, and later jobs in the queue cannot run because there are no 
> available cores in the queue.
> I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6302) Preempt reducers after a configurable timeout irrespective of headroom

2015-11-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6302:
--
Attachment: MAPREDUCE-6302.branch-2.7.0001.patch
MAPREDUCE-6302.branch-2.6.0001.patch

Attached patches for branch-2.6/branch-2.7 for review, and added 2.6.3/2.7.3 
to the target versions.
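
For anyone trying the backport, a minimal sketch of enabling the new timeout 
(assuming the property name used by the trunk patch; please verify against 
your branch):

{code}
import org.apache.hadoop.conf.Configuration;

public class ReducerPreemptSketch {
  static Configuration withUnconditionalPreempt(Configuration conf) {
    // Preempt reducers 300s after a mapper request starts starving,
    // regardless of what the reported headroom says.
    conf.setInt("mapreduce.job.reducer.unconditional-preempt.delay.sec", 300);
    return conf;
  }
}
{code}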

> Preempt reducers after a configurable timeout irrespective of headroom
> --
>
> Key: MAPREDUCE-6302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: mai shurong
>Assignee: Karthik Kambatla
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
> MAPREDUCE-6302.branch-2.6.0001.patch, MAPREDUCE-6302.branch-2.7.0001.patch, 
> log.txt, mr-6302-1.patch, mr-6302-2.patch, mr-6302-3.patch, mr-6302-4.patch, 
> mr-6302-5.patch, mr-6302-6.patch, mr-6302-7.patch, mr-6302-prelim.patch, 
> mr-6302_branch-2.patch, queue_with_max163cores.png, 
> queue_with_max263cores.png, queue_with_max333cores.png
>
>
> I submit a big job, which has 500 maps and 350 reduces, to a queue (fair 
> scheduler) with 300 max cores. When the big MapReduce job is running at 100% 
> maps, the 300 reduces have occupied the 300 max cores in the queue. Then a 
> map fails and retries, waiting for a core, while the 300 reduces are waiting 
> for the failed map to finish. So a deadlock occurs. As a result, the job is 
> blocked, and later jobs in the queue cannot run because there are no 
> available cores in the queue.
> I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6541) Exclude scheduled reducer memory when calculating available mapper slots from headroom to avoid deadlock

2015-11-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996998#comment-14996998
 ] 

Wangda Tan commented on MAPREDUCE-6541:
---

[~varun_saxena], yes, you're correct; I updated the title/description.

Thanks,

> Exclude scheduled reducer memory when calculating available mapper slots from 
> headroom to avoid deadlock 
> -
>
> Key: MAPREDUCE-6541
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6541
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Varun Saxena
>
> We saw an MR deadlock recently:
> - When NMs are restarted by the framework without recovery enabled, 
> containers running on these nodes are identified as "ABORTED", and the MR AM 
> will try to reschedule the "ABORTED" mapper containers.
> - Since such lost mappers are "ABORTED" containers, the MR AM gives normal 
> mapper priority (priority=20) to such mapper requests. If there is any 
> pending reducer (priority=10) at the same time, mapper requests need to wait 
> until the reducer requests are satisfied.
> - In our test, one mapper needs 700+ MB, a reducer needs 1000+ MB, and RM 
> available resource = mapper-request = (700+ MB); only one job was running in 
> the system, so the scheduler cannot allocate more reducer containers AND the 
> MR AM thinks there's enough headroom for mappers, so reducer containers will 
> not be preempted.
> MAPREDUCE-6302 can solve most of the problem, but on the other hand, I think 
> we may need to exclude the scheduled reducers' resource when calculating 
> #available-mapper-slots from the headroom, so that we can avoid excessive 
> reducer preemption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6541) Exclude scheduled reducer memory when calculating available mapper slots from headroom to avoid deadlock

2015-11-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6541:
--
Summary: Exclude scheduled reducer memory when calculating available mapper 
slots from headroom to avoid deadlock   (was: Exclude pending reducer memory 
when calculating available mapper slots from headroom to avoid deadlock )

> Exclude scheduled reducer memory when calculating available mapper slots from 
> headroom to avoid deadlock 
> -
>
> Key: MAPREDUCE-6541
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6541
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Varun Saxena
>
> We saw an MR deadlock recently:
> - When NMs are restarted by the framework without recovery enabled, 
> containers running on these nodes are identified as "ABORTED", and the MR AM 
> will try to reschedule the "ABORTED" mapper containers.
> - Since such lost mappers are "ABORTED" containers, the MR AM gives normal 
> mapper priority (priority=20) to such mapper requests. If there is any 
> pending reducer (priority=10) at the same time, mapper requests need to wait 
> until the reducer requests are satisfied.
> - In our test, one mapper needs 700+ MB, a reducer needs 1000+ MB, and RM 
> available resource = mapper-request = (700+ MB); only one job was running in 
> the system, so the scheduler cannot allocate more reducer containers AND the 
> MR AM thinks there's enough headroom for mappers, so reducer containers will 
> not be preempted.
> MAPREDUCE-6302 can solve most of the problem, but on the other hand, I think 
> we may need to exclude the pending reducers' resource when calculating 
> #available-mapper-slots from the headroom, so that we can avoid excessive 
> reducer preemption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6541) Exclude scheduled reducer memory when calculating available mapper slots from headroom to avoid deadlock

2015-11-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6541:
--
Description: 
We saw an MR deadlock recently:

- When NMs are restarted by the framework without recovery enabled, containers 
running on these nodes are identified as "ABORTED", and the MR AM will try to 
reschedule the "ABORTED" mapper containers.
- Since such lost mappers are "ABORTED" containers, the MR AM gives normal 
mapper priority (priority=20) to such mapper requests. If there is any pending 
reducer (priority=10) at the same time, mapper requests need to wait until the 
reducer requests are satisfied.
- In our test, one mapper needs 700+ MB, a reducer needs 1000+ MB, and RM 
available resource = mapper-request = (700+ MB); only one job was running in 
the system, so the scheduler cannot allocate more reducer containers AND the 
MR AM thinks there's enough headroom for mappers, so reducer containers will 
not be preempted.

MAPREDUCE-6302 can solve most of the problem, but on the other hand, I think 
we may need to exclude the scheduled reducers' resource when calculating 
#available-mapper-slots from the headroom, so that we can avoid excessive 
reducer preemption.

  was:
We saw an MR deadlock recently:

- When NMs are restarted by the framework without recovery enabled, containers 
running on these nodes are identified as "ABORTED", and the MR AM will try to 
reschedule the "ABORTED" mapper containers.
- Since such lost mappers are "ABORTED" containers, the MR AM gives normal 
mapper priority (priority=20) to such mapper requests. If there is any pending 
reducer (priority=10) at the same time, mapper requests need to wait until the 
reducer requests are satisfied.
- In our test, one mapper needs 700+ MB, a reducer needs 1000+ MB, and RM 
available resource = mapper-request = (700+ MB); only one job was running in 
the system, so the scheduler cannot allocate more reducer containers AND the 
MR AM thinks there's enough headroom for mappers, so reducer containers will 
not be preempted.

MAPREDUCE-6302 can solve most of the problem, but on the other hand, I think 
we may need to exclude the pending reducers' resource when calculating 
#available-mapper-slots from the headroom, so that we can avoid excessive 
reducer preemption.


> Exclude scheduled reducer memory when calculating available mapper slots from 
> headroom to avoid deadlock 
> -
>
> Key: MAPREDUCE-6541
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6541
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Varun Saxena
>
> We saw an MR deadlock recently:
> - When NMs are restarted by the framework without recovery enabled, 
> containers running on these nodes are identified as "ABORTED", and the MR AM 
> will try to reschedule the "ABORTED" mapper containers.
> - Since such lost mappers are "ABORTED" containers, the MR AM gives normal 
> mapper priority (priority=20) to such mapper requests. If there is any 
> pending reducer (priority=10) at the same time, mapper requests need to wait 
> until the reducer requests are satisfied.
> - In our test, one mapper needs 700+ MB, a reducer needs 1000+ MB, and RM 
> available resource = mapper-request = (700+ MB); only one job was running in 
> the system, so the scheduler cannot allocate more reducer containers AND the 
> MR AM thinks there's enough headroom for mappers, so reducer containers will 
> not be preempted.
> MAPREDUCE-6302 can solve most of the problem, but on the other hand, I think 
> we may need to exclude the scheduled reducers' resource when calculating 
> #available-mapper-slots from the headroom, so that we can avoid excessive 
> reducer preemption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6302) Preempt reducers after a configurable timeout irrespective of headroom

2015-11-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007092#comment-15007092
 ] 

Wangda Tan commented on MAPREDUCE-6302:
---

Update:
We ran jobs in a test cluster with this fix for branch-2.6/branch-2.7 for a 
few days and didn't see the deadlock issue come back. Will backport the 
patches to branch-2.6/branch-2.7 in a few days if there are no objections.

> Preempt reducers after a configurable timeout irrespective of headroom
> --
>
> Key: MAPREDUCE-6302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: mai shurong
>Assignee: Karthik Kambatla
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
> MAPREDUCE-6302.branch-2.6.0001.patch, MAPREDUCE-6302.branch-2.7.0001.patch, 
> log.txt, mr-6302-1.patch, mr-6302-2.patch, mr-6302-3.patch, mr-6302-4.patch, 
> mr-6302-5.patch, mr-6302-6.patch, mr-6302-7.patch, mr-6302-prelim.patch, 
> mr-6302_branch-2.patch, queue_with_max163cores.png, 
> queue_with_max263cores.png, queue_with_max333cores.png
>
>
> I submit a big job, which has 500 maps and 350 reduces, to a queue (fair 
> scheduler) with 300 max cores. When the big MapReduce job is running at 100% 
> maps, the 300 reduces have occupied the 300 max cores in the queue. Then a 
> map fails and retries, waiting for a core, while the 300 reduces are waiting 
> for the failed map to finish. So a deadlock occurs. As a result, the job is 
> blocked, and later jobs in the queue cannot run because there are no 
> available cores in the queue.
> I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6541) Exclude pending reducer memory when calculating available mapper slots from headroom to avoid deadlock

2015-11-06 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994678#comment-14994678
 ] 

Wangda Tan commented on MAPREDUCE-6541:
---

[~Naganarasimha], that's also different: MAPREDUCE-6514 fixes a bug in the 
existing reducer preemption logic (reducer requests are removed locally but 
the RM is not notified). This issue is an enhancement to how available mapper 
slots are calculated.

> Exclude pending reducer memory when calculating available mapper slots from 
> headroom to avoid deadlock 
> ---
>
> Key: MAPREDUCE-6541
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6541
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Wangda Tan
>
> We saw an MR deadlock recently:
> - When NMs are restarted by the framework without recovery enabled, 
> containers running on these nodes are identified as "ABORTED", and the MR AM 
> will try to reschedule the "ABORTED" mapper containers.
> - Since such lost mappers are "ABORTED" containers, the MR AM gives normal 
> mapper priority (priority=20) to such mapper requests. If there is any 
> pending reducer (priority=10) at the same time, mapper requests need to wait 
> until the reducer requests are satisfied.
> - In our test, one mapper needs 700+ MB, a reducer needs 1000+ MB, and RM 
> available resource = mapper-request = (700+ MB); only one job was running in 
> the system, so the scheduler cannot allocate more reducer containers AND the 
> MR AM thinks there's enough headroom for mappers, so reducer containers will 
> not be preempted.
> MAPREDUCE-6302 can solve most of the problem, but on the other hand, I think 
> we may need to exclude the pending reducers' resource when calculating 
> #available-mapper-slots from the headroom, so that we can avoid excessive 
> reducer preemption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6541) Exclude pending reducer memory when calculating available mapper slots from headroom to avoid deadlock

2015-11-06 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994582#comment-14994582
 ] 

Wangda Tan commented on MAPREDUCE-6541:
---

[~Naganarasimha], thanks for pointing me to MAPREDUCE-6513. I think they're 
similar issues, but the proposals may be a little different:
- MAPREDUCE-6513 is trying to give retried mappers higher priority.
- This JIRA is trying to exclude pending reducer memory so that reducer 
preemption kicks in earlier.

I'm linking the two JIRAs.
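
To make the proposal concrete, a rough sketch of the adjusted calculation (the 
helper and its names are made up for illustration; the real change would live 
in RMContainerAllocator):

{code}
import org.apache.hadoop.yarn.api.records.Resource;

public class HeadroomSketch {
  // Hypothetical helper: subtract memory already promised to scheduled
  // reducers before deciding how many mappers the headroom can hold.
  static int availableMapperSlots(Resource headroom, int scheduledReducerMb,
      int mapperMb) {
    int usable = headroom.getMemory() - scheduledReducerMb;
    return Math.max(0, usable / mapperMb);
  }
  // With the numbers from the description: headroom = 700 MB, one scheduled
  // reducer = 1000 MB, mapper = 700 MB -> 0 slots, so the AM preempts a
  // reducer instead of waiting forever.
}
{code}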

> Exclude pending reducer memory when calculating available mapper slots from 
> headroom to avoid deadlock 
> ---
>
> Key: MAPREDUCE-6541
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6541
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Wangda Tan
>
> We saw an MR deadlock recently:
> - When NMs are restarted by the framework without recovery enabled, 
> containers running on these nodes are identified as "ABORTED", and the MR AM 
> will try to reschedule the "ABORTED" mapper containers.
> - Since such lost mappers are "ABORTED" containers, the MR AM gives normal 
> mapper priority (priority=20) to such mapper requests. If there is any 
> pending reducer (priority=10) at the same time, mapper requests need to wait 
> until the reducer requests are satisfied.
> - In our test, one mapper needs 700+ MB, a reducer needs 1000+ MB, and RM 
> available resource = mapper-request = (700+ MB); only one job was running in 
> the system, so the scheduler cannot allocate more reducer containers AND the 
> MR AM thinks there's enough headroom for mappers, so reducer containers will 
> not be preempted.
> MAPREDUCE-6302 can solve most of the problem, but on the other hand, I think 
> we may need to exclude the pending reducers' resource when calculating 
> #available-mapper-slots from the headroom, so that we can avoid excessive 
> reducer preemption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

2015-11-06 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994592#comment-14994592
 ] 

Wangda Tan commented on MAPREDUCE-6513:
---

I linked MAPREDUCE-6541 to this JIRA; they're different fixes for similar 
issues.

> MR job got hanged forever when one NM unstable for some time
> 
>
> Key: MAPREDUCE-6513
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Bob
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: MAPREDUCE-6513.01.patch
>
>
> While a job with many tasks was in progress, one node became unstable due to 
> an OS issue. After the node became unstable, the status of the maps on this 
> node changed to the KILLED state.
> Currently, maps which were running on the unstable node are rescheduled; all 
> are in the scheduled state, waiting for the RM to assign containers. We saw 
> ask requests for the maps until the node became good again (all of those 
> failed); there are no ask requests after this. But the AM keeps preempting 
> the reducers (it keeps recycling them).
> Finally, the reducers are waiting for the mappers to complete, and the 
> mappers didn't get containers.
> My question is:
> Why were map requests not sent by the AM once the node recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6541) Exclude pending reducer memory when calculating available mapper slots from headroom to avoid deadlock

2015-11-06 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994785#comment-14994785
 ] 

Wangda Tan commented on MAPREDUCE-6541:
---

Thanks for sharing your thoughts, [~varun_saxena], and thanks for taking this; 
please go ahead.
Having reconsidered these issues, I think all 3 fixes are required:
- MAPREDUCE-6513: failed/killed mappers should have higher priority.
- MAPREDUCE-6514: reducer preemption should also clean up resource requests on 
the RM side.
- And also this one.

I think the previous two are more important; this one is just an optimization.

+[~vinodkv].

> Exclude pending reducer memory when calculating available mapper slots from 
> headroom to avoid deadlock 
> ---
>
> Key: MAPREDUCE-6541
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6541
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Varun Saxena
>
> We saw an MR deadlock recently:
> - When NMs are restarted by the framework without recovery enabled, 
> containers running on these nodes are identified as "ABORTED", and the MR AM 
> will try to reschedule the "ABORTED" mapper containers.
> - Since such lost mappers are "ABORTED" containers, the MR AM gives normal 
> mapper priority (priority=20) to such mapper requests. If there is any 
> pending reducer (priority=10) at the same time, mapper requests need to wait 
> until the reducer requests are satisfied.
> - In our test, one mapper needs 700+ MB, a reducer needs 1000+ MB, and RM 
> available resource = mapper-request = (700+ MB); only one job was running in 
> the system, so the scheduler cannot allocate more reducer containers AND the 
> MR AM thinks there's enough headroom for mappers, so reducer containers will 
> not be preempted.
> MAPREDUCE-6302 can solve most of the problem, but on the other hand, I think 
> we may need to exclude the pending reducers' resource when calculating 
> #available-mapper-slots from the headroom, so that we can avoid excessive 
> reducer preemption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6541) Exclude pending reducer memory when calculating available mapper slots from headroom to avoid deadlock

2015-11-06 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994507#comment-14994507
 ] 

Wangda Tan commented on MAPREDUCE-6541:
---

+[~jlowe], [~kasha].

> Exclude pending reducer memory when calculating available mapper slots from 
> headroom to avoid deadlock 
> ---
>
> Key: MAPREDUCE-6541
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6541
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Wangda Tan
>
> We saw an MR deadlock recently:
> - When NMs are restarted by the framework without recovery enabled, 
> containers running on these nodes are identified as "ABORTED", and the MR AM 
> will try to reschedule the "ABORTED" mapper containers.
> - Since such lost mappers are "ABORTED" containers, the MR AM gives normal 
> mapper priority (priority=20) to such mapper requests. If there is any 
> pending reducer (priority=10) at the same time, mapper requests need to wait 
> until the reducer requests are satisfied.
> - In our test, one mapper needs 700+ MB, a reducer needs 1000+ MB, and RM 
> available resource = mapper-request = (700+ MB); only one job was running in 
> the system, so the scheduler cannot allocate more reducer containers AND the 
> MR AM thinks there's enough headroom for mappers, so reducer containers will 
> not be preempted.
> MAPREDUCE-6302 can solve most of the problem, but on the other hand, I think 
> we may need to exclude the pending reducers' resource when calculating 
> #available-mapper-slots from the headroom, so that we can avoid excessive 
> reducer preemption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAPREDUCE-6541) Exclude pending reducer memory when calculating available mapper slots from headroom to avoid deadlock

2015-11-06 Thread Wangda Tan (JIRA)
Wangda Tan created MAPREDUCE-6541:
-

 Summary: Exclude pending reducer memory when calculating available 
mapper slots from headroom to avoid deadlock 
 Key: MAPREDUCE-6541
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6541
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Wangda Tan


We saw an MR deadlock recently:

- When NMs are restarted by the framework without recovery enabled, containers 
running on these nodes are identified as "ABORTED", and the MR AM will try to 
reschedule the "ABORTED" mapper containers.
- Since such lost mappers are "ABORTED" containers, the MR AM gives normal 
mapper priority (priority=20) to such mapper requests. If there is any pending 
reducer (priority=10) at the same time, mapper requests need to wait until the 
reducer requests are satisfied.
- In our test, one mapper needs 700+ MB, a reducer needs 1000+ MB, and RM 
available resource = mapper-request = (700+ MB); only one job was running in 
the system, so the scheduler cannot allocate more reducer containers AND the 
MR AM thinks there's enough headroom for mappers, so reducer containers will 
not be preempted.

MAPREDUCE-6302 can solve most of the problem, but on the other hand, I think 
we may need to exclude the pending reducers' resource when calculating 
#available-mapper-slots from the headroom, so that we can avoid excessive 
reducer preemption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6302) Preempt reducers after a configurable timeout irrespective of headroom

2015-10-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961050#comment-14961050
 ] 

Wangda Tan commented on MAPREDUCE-6302:
---

+1 to backport this issue to 2.6.x and 2.7.x

> Preempt reducers after a configurable timeout irrespective of headroom
> --
>
> Key: MAPREDUCE-6302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: mai shurong
>Assignee: Karthik Kambatla
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
> log.txt, mr-6302-1.patch, mr-6302-2.patch, mr-6302-3.patch, mr-6302-4.patch, 
> mr-6302-5.patch, mr-6302-6.patch, mr-6302-7.patch, mr-6302-prelim.patch, 
> mr-6302_branch-2.patch, queue_with_max163cores.png, 
> queue_with_max263cores.png, queue_with_max333cores.png
>
>
> I submit a big job, which has 500 maps and 350 reduces, to a queue (fair 
> scheduler) with 300 max cores. When the big MapReduce job is running at 100% 
> maps, the 300 reduces have occupied the 300 max cores in the queue. Then a 
> map fails and retries, waiting for a core, while the 300 reduces are waiting 
> for the failed map to finish. So a deadlock occurs. As a result, the job is 
> blocked, and later jobs in the queue cannot run because there are no 
> available cores in the queue.
> I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6478) Add an option to skip cleanupJob stage or ignore cleanup failure during commitJob().

2015-09-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790844#comment-14790844
 ] 

Wangda Tan commented on MAPREDUCE-6478:
---

Thanks [~djp], the patch looks good. I will wait a few days to see if there 
are any objections.

> Add an option to skip cleanupJob stage or ignore cleanup failure during 
> commitJob().
> 
>
> Key: MAPREDUCE-6478
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6478
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Junping Du
>Assignee: Junping Du
> Attachments: MAPREDUCE-6478-v1.1.patch, MAPREDUCE-6478-v1.patch
>
>
> In some of our test cases for MR in public cloud scenarios, a very big MR 
> job with hundreds or thousands of reducers cannot finish successfully 
> because of job cleanup failures. These are caused by the different 
> scale/performance characteristics of file systems on the cloud (like 
> AzureFS), which replace HDFS's whole-directory deletion with REST API calls 
> that delete each sub-directory recursively. Even when cleanup succeeds, it 
> can take much longer (hours), which is unnecessary and wastes 
> time/resources, especially in public cloud scenarios.
> In these scenarios, it makes more sense to ignore some cleanupJob failures, 
> or to let the user choose to skip cleanupJob() completely. Letting the whole 
> job finish successfully, with the side effect of wasting some user space, is 
> much better: users' jobs usually come and go in the public cloud, so having 
> the choice to tolerate some leftover temporary files in exchange for 
> avoiding a big job re-run (or saving the job's running time) is quite 
> effective in time/resource cost.
> We should give the user this option (ignore failures, or skip the job 
> cleanup stage completely), especially when the user knows the cleanup 
> failure is not due to abnormal HDFS status but to another FS's different 
> performance trade-offs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6478) Add an option to skip cleanupJob stage or ignore cleanup failure during commitJob().

2015-09-18 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6478:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk/branch-2, thanks [~djp]!
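
For users hitting this, a minimal sketch of the two new switches (property 
names as in the committed patch; please verify against your build):

{code}
import org.apache.hadoop.conf.Configuration;

public class CleanupOptionsSketch {
  static Configuration tolerantCleanup(Configuration conf) {
    // Skip the cleanup stage of commitJob() entirely ...
    conf.setBoolean("mapreduce.fileoutputcommitter.cleanup.skipped", true);
    // ... or keep the cleanup but let the job succeed even if it fails.
    conf.setBoolean("mapreduce.fileoutputcommitter.cleanup-failures.ignored",
        true);
    return conf;
  }
}
{code}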

> Add an option to skip cleanupJob stage or ignore cleanup failure during 
> commitJob().
> 
>
> Key: MAPREDUCE-6478
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6478
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Junping Du
>Assignee: Junping Du
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-6478-v1.1.patch, MAPREDUCE-6478-v1.patch
>
>
> In some of our test cases for MR in public cloud scenarios, a very big MR 
> job with hundreds or thousands of reducers cannot finish successfully 
> because of job cleanup failures. These are caused by the different 
> scale/performance characteristics of file systems on the cloud (like 
> AzureFS), which replace HDFS's whole-directory deletion with REST API calls 
> that delete each sub-directory recursively. Even when cleanup succeeds, it 
> can take much longer (hours), which is unnecessary and wastes 
> time/resources, especially in public cloud scenarios.
> In these scenarios, it makes more sense to ignore some cleanupJob failures, 
> or to let the user choose to skip cleanupJob() completely. Letting the whole 
> job finish successfully, with the side effect of wasting some user space, is 
> much better: users' jobs usually come and go in the public cloud, so having 
> the choice to tolerate some leftover temporary files in exchange for 
> avoiding a big job re-run (or saving the job's running time) is quite 
> effective in time/resource cost.
> We should give the user this option (ignore failures, or skip the job 
> cleanup stage completely), especially when the user knows the cleanup 
> failure is not due to abnormal HDFS status but to another FS's different 
> performance trade-offs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAPREDUCE-6574) MR AM should print host of failed tasks.

2015-12-16 Thread Wangda Tan (JIRA)
Wangda Tan created MAPREDUCE-6574:
-

 Summary: MR AM should print host of failed tasks.
 Key: MAPREDUCE-6574
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6574
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Wangda Tan


Sometimes tasks fail because of issues on NMs. For example, a bad disk/network 
could cause reducer fetch failures, and mappers would need to be re-scheduled.

It would be very helpful for identifying such issues if we printed the host of 
failed tasks; then we could simply grep the MR AM's log to see what happened.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6476) InvalidResourceException when Nodelabel don't have access to queue should be handled

2015-12-30 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075343#comment-15075343
 ] 

Wangda Tan commented on MAPREDUCE-6476:
---

Looks good, +1, thanks [~bibinchundatt].

> InvalidResourceException when Nodelabel don't have access to queue should be 
> handled
> 
>
> Key: MAPREDUCE-6476
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6476
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-MAPREDUCE-6476.patch, 0002-MAPREDUCE-6476.patch
>
>
> Steps to reproduce
> ===
> Submit a mapreduce job with
> # map to label x
> # reduce to label y
> Precondition
> # Queue b, to which the reduce is submitted, does not have access to the 
> specified label
> *Impact*
> # Jobs fail only after the RM-AM communication timeout (about 10 mins, I 
> think)
> Should kill the job immediately when InvalidResourceException is received in 
> {{RMContainerRequestor#makeRemoteRequest}}
> *Logs*
> {noformat}
> 2015-09-11 16:44:30,116 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. 
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, queue=b1 doesn't have permission to access all labels in 
> resource request. labelExpression of resource request=1. Queue labels=3
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:457)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2230)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2224)
>   at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
>   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:251)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>   at com.sun.proxy.$Proxy37.allocate(Unknown Source)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:203)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:694)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:263)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:281)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> 

[jira] [Commented] (MAPREDUCE-6574) MR AM should print host of failed tasks.

2015-12-21 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067083#comment-15067083
 ] 

Wangda Tan commented on MAPREDUCE-6574:
---

Hi [~mohdshahidkhan],

Thanks for working on this. Some comments:
- I think we may not need to print the nodeId for every transition; 
{{DiagnosticInformationUpdater}} already adds such info to many different 
transitions.
- Also, DiagnosticInformationUpdater isn't used by some transitions to the 
FAILED state, for example the one from succeeded -> failed: 
{{TooManyFetchFailureTransition}}.
- Could you print a separate message when the status changes from any state to 
the FAILED state, with the id and event type? Such as: {{Attempt X 
transitioned from state A to B, event type is E and nodeId=x}}. You may need 
to check whether nodeId == null; IIRC, some task attempts can fail before 
being allocated to any node. A sketch of what I mean follows below.
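
Something like this (illustrative only; the real code lives in 
TaskAttemptImpl's transitions, and the names here are made up):

{code}
import org.apache.hadoop.yarn.api.records.NodeId;

public class FailedAttemptLogSketch {
  // nodeId may be null when the attempt fails before any container is
  // allocated, hence the explicit check.
  static String failureMessage(String attemptId, String fromState,
      String eventType, NodeId nodeId) {
    String host = (nodeId == null) ? "not-yet-allocated" : nodeId.getHost();
    return "Attempt " + attemptId + " transitioned from state " + fromState
        + " to FAILED, event type is " + eventType + " and nodeId=" + host;
  }
}
{code}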

> MR AM should print host of failed tasks.
> 
>
> Key: MAPREDUCE-6574
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6574
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Mohammad Shahid Khan
> Attachments: MAPREDUCE-6574-v1.patch
>
>
> Sometimes tasks fail because of issues on NMs. For example, a bad 
> disk/network could cause reducer fetch failures, and mappers would need to 
> be re-scheduled.
> It would be very helpful for identifying such issues if we printed the host 
> of failed tasks; then we could simply grep the MR AM's log to see what 
> happened.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6574) MR AM should print host of failed tasks.

2015-12-21 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6574:
--
Assignee: Mohammad Shahid Khan

> MR AM should print host of failed tasks.
> 
>
> Key: MAPREDUCE-6574
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6574
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Mohammad Shahid Khan
> Attachments: MAPREDUCE-6574-v1.patch
>
>
> Sometimes tasks fail because of issues on NMs. For example, a bad 
> disk/network could cause reducer fetch failures, and mappers would need to 
> be re-scheduled.
> It would be very helpful for identifying such issues if we printed the host 
> of failed tasks; then we could simply grep the MR AM's log to see what 
> happened.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6574) MR AM should print host of failed tasks.

2015-12-23 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6574:
--
Status: Patch Available  (was: Open)

> MR AM should print host of failed tasks.
> 
>
> Key: MAPREDUCE-6574
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6574
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Mohammad Shahid Khan
> Attachments: MAPREDUCE-6574-v1.patch, MAPREDUCE-6574-v2.patch
>
>
> Sometimes tasks fail because of issues on NMs. For example, a bad 
> disk/network could cause reducer fetch failures, and mappers would need to 
> be re-scheduled.
> It would be very helpful for identifying such issues if we printed the host 
> of failed tasks; then we could simply grep the MR AM's log to see what 
> happened.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6476) InvalidResourceException when Nodelabel don't have access to queue should be handled

2015-12-28 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073040#comment-15073040
 ] 

Wangda Tan commented on MAPREDUCE-6476:
---

Thanks [~bibinchundatt],
Some comments:
1) You can put InvalidLabelResourceRequestException into a separate catch, 
like:
{code}
try {
  ...
} catch (InvalidLabelResourceRequestException e) {
  ...
}
{code}
instead of checking {{e instanceof InvalidLabelResourceRequestException}}. 
TestSchedulerUtils could be updated as well.

2) Diagnostic message:
{code}
String diagMsg = "Request dont have access to label so killing app";
{code}
I suggest updating it to "Requested node-label-expression is invalid:", plus 
the exception message.
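
Putting both suggestions together, roughly (a sketch only, reusing identifiers 
from RMContainerAllocator/RMContainerRequestor; not the final patch):

{code}
// In RMContainerRequestor#makeRemoteRequest (sketch):
try {
  allocateResponse = scheduler.allocate(allocateRequest);
} catch (InvalidLabelResourceRequestException e) {
  // Fail fast instead of retrying until the RM-AM timeout expires.
  String diagMsg = "Requested node-label-expression is invalid: "
      + e.getMessage();
  eventHandler.handle(new JobDiagnosticsUpdateEvent(jobId, diagMsg));
  eventHandler.handle(new JobEvent(jobId, JobEventType.JOB_KILL));
}
{code}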

> InvalidResourceException when Nodelabel don't have access to queue should be 
> handled
> 
>
> Key: MAPREDUCE-6476
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6476
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-MAPREDUCE-6476.patch
>
>
> Steps to reproduce
> ===
> Submit a mapreduce job with
> # map to label x
> # reduce to label y
> Precondition
> # Queue b, to which the reduce is submitted, does not have access to the 
> specified label
> *Impact*
> # Jobs fail only after the RM-AM communication timeout (about 10 mins, I 
> think)
> Should kill the job immediately when InvalidResourceException is received in 
> {{RMContainerRequestor#makeRemoteRequest}}
> *Logs*
> {noformat}
> 2015-09-11 16:44:30,116 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. 
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, queue=b1 doesn't have permission to access all labels in 
> resource request. labelExpression of resource request=1. Queue labels=3
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:457)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2230)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2224)
>   at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
>   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:251)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>   at com.sun.proxy.$Proxy37.allocate(Unknown Source)
>   at 
> 

[jira] [Updated] (MAPREDUCE-6574) MR AM should print host of failed tasks.

2015-12-28 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6574:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk/branch-2/branch-2.8, thanks [~mohdshahidkhan]!

> MR AM should print host of failed tasks.
> 
>
> Key: MAPREDUCE-6574
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6574
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Mohammad Shahid Khan
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-6574-v1.patch, MAPREDUCE-6574-v2.patch, 
> MAPREDUCE-6574-v3.patch, MAPREDUCE-6574-v4.patch
>
>
> Sometimes tasks fail because of issues on NMs. For example, a bad 
> disk/network could cause reducer fetch failures, and mappers would need to 
> be re-scheduled.
> It would be very helpful for identifying such issues if we printed the host 
> of failed tasks; then we could simply grep the MR AM's log to see what 
> happened.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6476) InvalidResourceException when Nodelabel don't have access to queue should be handled

2015-11-23 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022729#comment-15022729
 ] 

Wangda Tan commented on MAPREDUCE-6476:
---

[~bibinchundatt],

I think approach #1 is more straightforward. In addition, I'm not sure whether 
there are other InvalidResourceRequestException cases we should handle as 
well, such as when the requested resource exceeds max-allocation.

> InvalidResourceException when Nodelabel don't have access to queue should be 
> handled
> 
>
> Key: MAPREDUCE-6476
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6476
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> Steps to reproduce
> ===
> Submit a mapreduce job with
> # map to label x
> # reduce to label y
> Precondition
> # Queue b, to which the reduce is submitted, does not have access to the 
> specified label
> *Impact*
> # Jobs fail only after the RM-AM communication timeout (about 10 mins, I 
> think)
> Should kill the job immediately when InvalidResourceException is received in 
> {{RMContainerRequestor#makeRemoteRequest}}
> *Logs*
> {noformat}
> 2015-09-11 16:44:30,116 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. 
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, queue=b1 doesn't have permission to access all labels in 
> resource request. labelExpression of resource request=1. Queue labels=3
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:457)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2230)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2224)
>   at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
>   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:251)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>   at com.sun.proxy.$Proxy37.allocate(Unknown Source)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:203)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:694)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:263)
>   at 
> 

[jira] [Updated] (MAPREDUCE-6476) Label-related invalid resource request exception should be properly handled by application

2016-01-11 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6476:
--
Summary: Label-related invalid resource request exception should be 
properly handled by application  (was: InvalidResourceException when Nodelabel 
don't have access to queue should be handled)

> Label-related invalid resource request exception should be properly handled 
> by application
> --
>
> Key: MAPREDUCE-6476
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6476
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
> Attachments: 0001-MAPREDUCE-6476.patch, 0002-MAPREDUCE-6476.patch
>
>
> Steps to reproduce
> ===
> Submit a mapreduce job with
> # map to label x
> # reduce to label y
> Precondition
> # Queue b, to which the reduce is submitted, does not have access to the 
> specified label
> *Impact*
> # Jobs fail only after the RM-AM communication timeout (about 10 mins, I 
> think)
> Should kill the job immediately when InvalidResourceException is received in 
> {{RMContainerRequestor#makeRemoteRequest}}
> *Logs*
> {noformat}
> 2015-09-11 16:44:30,116 ERROR [RMCommunicator Allocator] 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. 
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, queue=b1 doesn't have permission to access all labels in 
> resource request. labelExpression of resource request=1. Queue labels=3
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:457)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2230)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2224)
>   at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
>   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:251)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>   at com.sun.proxy.$Proxy37.allocate(Unknown Source)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:203)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:694)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:263)
>   ...
> {noformat}
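
The fail-fast idea above would look roughly like this inside 
{{RMContainerRequestor#makeRemoteRequest}} (a hedged sketch, not the attached 
patch; the kill-event wiring is illustrative):

{code}
try {
  allocateResponse = scheduler.allocate(allocateRequest);
} catch (InvalidResourceRequestException e) {
  // This ask can never be satisfied (the queue lacks the label), so retrying
  // only delays the failure until the RM-AM handshake times out (~10 min).
  LOG.error("Invalid resource request, failing the job immediately", e);
  eventHandler.handle(new JobEvent(jobId, JobEventType.JOB_KILL));
  throw e;
}
{code}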

[jira] [Commented] (MAPREDUCE-6579) Test failure : TestNetworkedJob

2016-01-27 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120898#comment-15120898
 ] 

Wangda Tan commented on MAPREDUCE-6579:
---

Thanks [~ajisakaa]/[~Naganarasimha],

A couple of comments:
- Do you think we should rename JobStatus#getFailureInfo? A diagnostic message 
doesn't always mean failure. JobStatus's InterfaceStability is marked 
"Evolving", so we can update it.
- I would suggest not comparing the diagnostic info: it is evolving, only 
humans are supposed to read this message, and we already have tests for it in 
YARN.
{code}
String diag = runningJob.getFailureInfo();
assertTrue("The state of the ApplicationMaster should be activated, " +
    "assigned, or launched.",
    diag.contains(AMState.ACTIVATED.getDiagnosticMessage()) ||
    diag.contains(AMState.ASSIGNED.getDiagnosticMessage()) ||
    diag.contains(AMState.LAUNCHED.getDiagnosticMessage()));
{code}

Thoughts?
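
For illustration, a looser check could be (a hypothetical replacement; it 
relies only on the stable state API plus the presence of some diagnostic 
text):

{code}
// Hypothetical looser assertion: pin the coarse job state, and require only
// that some diagnostic text exists, not its exact (evolving) wording.
assertEquals(JobStatus.RUNNING, runningJob.getJobState());
assertNotNull(runningJob.getFailureInfo());
{code}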

> Test failure : TestNetworkedJob
> ---
>
> Key: MAPREDUCE-6579
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6579
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
> Attachments: MAPREDUCE-6579.01.patch, MAPREDUCE-6579.02.patch, 
> MAPREDUCE-6579.03.patch
>
>
> From 
> [https://builds.apache.org/job/PreCommit-YARN-Build/9976/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-jobclient-jdk1.8.0_66.txt]
>  TestNetworkedJob fails intermittently.
> {code}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 81.131 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 30.55 sec  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[[Tue Dec 15 14:02:45 + 2015] 
> Application is Activated, waiting for resources to be assigned for AM.  
> Details : AM Partition =  ; Partition Resource = 
>  ; Queue's Absolute capacity = 100.0 % ; Queue's 
> Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; ]> 
> but was:<[]>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:174)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6579) Test failure : TestNetworkedJob

2016-01-28 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122804#comment-15122804
 ] 

Wangda Tan commented on MAPREDUCE-6579:
---

Agreed, the latest patch LGTM, thanks!

> Test failure : TestNetworkedJob
> ---
>
> Key: MAPREDUCE-6579
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6579
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
> Attachments: MAPREDUCE-6579.01.patch, MAPREDUCE-6579.02.patch, 
> MAPREDUCE-6579.03.patch, MAPREDUCE-6579.04.patch
>
>
> From 
> [https://builds.apache.org/job/PreCommit-YARN-Build/9976/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-jobclient-jdk1.8.0_66.txt]
>  TestNetworkedJob fails intermittently.
> {code}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 81.131 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 30.55 sec  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[[Tue Dec 15 14:02:45 + 2015] 
> Application is Activated, waiting for resources to be assigned for AM.  
> Details : AM Partition =  ; Partition Resource = 
>  ; Queue's Absolute capacity = 100.0 % ; Queue's 
> Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; ]> 
> but was:<[]>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:174)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6579) JobStatus#getFailureInfo should not output diagnostic information when the job is running

2016-02-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149092#comment-15149092
 ] 

Wangda Tan commented on MAPREDUCE-6579:
---

Apologies for the delay; I was on vacation for the past couple of weeks.

I went through the discussions.

For concerns from [~jlowe]:
bq. It's a little unfortunate that YARN-3946 started putting non-fatal messages 
into what is typically an app-driven diagnostic repository
bq. It seems these new messages only make sense to report when the job is 
active and are mostly noise afterwards.
As 
[mentioned|https://issues.apache.org/jira/browse/MAPREDUCE-6579?focusedCommentId=15146537=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15146537]
 by [~Naganarasimha], the diagnostic message added by YARN-3946 only exists 
before the AM registers with the RM, and it is cleaned up once the AM 
registers. So I think it will not pollute the final failure message from the 
app.

The only issue I can see is that MR assumes diagnostic message == failure 
message. To make the YARN-3946 changes backward compatible with MR apps, I 
think the 05 patch is the simplest fix: it avoids reading any field modified 
by YARN-3946 and keeps everything else the same.

Thoughts? [~Naganarasimha Garla]/[~ajisakaa]/[~sunilg].
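
The intent, as a rough sketch (not the actual 05 patch):

{code}
// Sketch only: suppress transient YARN-3946 diagnostics while the job is
// still in flight, so getFailureInfo keeps meaning "failure info".
public synchronized String getFailureInfo() {
  State state = getState();
  if (state == State.PREP || state == State.RUNNING) {
    return "";
  }
  return failureInfo;
}
{code}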

> JobStatus#getFailureInfo should not output diagnostic information when the 
> job is running
> -
>
> Key: MAPREDUCE-6579
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6579
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
>Priority: Blocker
> Attachments: MAPREDUCE-6579.01.patch, MAPREDUCE-6579.02.patch, 
> MAPREDUCE-6579.03.patch, MAPREDUCE-6579.04.patch, MAPREDUCE-6579.05.patch, 
> MAPREDUCE-6579.06.patch
>
>
> From 
> [https://builds.apache.org/job/PreCommit-YARN-Build/9976/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-jobclient-jdk1.8.0_66.txt]
>  TestNetworkedJob fails intermittently.
> {code}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 81.131 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 30.55 sec  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[[Tue Dec 15 14:02:45 + 2015] 
> Application is Activated, waiting for resources to be assigned for AM.  
> Details : AM Partition =  ; Partition Resource = 
>  ; Queue's Absolute capacity = 100.0 % ; Queue's 
> Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; ]> 
> but was:<[]>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:174)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6304) Specifying node labels when submitting MR jobs

2016-02-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149292#comment-15149292
 ] 

Wangda Tan commented on MAPREDUCE-6304:
---

[~sunilg],

I would suggest not backporting this issue; it's a new feature rather than a 
critical bug fix. And I think users who want to try the node label feature 
should move to the more stable 2.7.x releases.
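
For reference, with the committed patch a job opts into labels roughly like 
this (property names as added by this JIRA; please verify against your 
release):

{code}
// Hedged usage sketch of the configuration knobs added by MAPREDUCE-6304.
Configuration conf = new Configuration();
conf.set("mapreduce.job.node-label-expression", "x");    // job-wide default
conf.set("mapreduce.map.node-label-expression", "x");    // maps only
conf.set("mapreduce.reduce.node-label-expression", "y"); // reduces only
Job job = Job.getInstance(conf, "labeled-job");
{code}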

> Specifying node labels when submitting MR jobs
> --
>
> Key: MAPREDUCE-6304
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6304
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: Jian Fang
>Assignee: Naganarasimha G R
>  Labels: mapreduce
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-6304.20150410-1.patch, 
> MAPREDUCE-6304.20150411-1.patch, MAPREDUCE-6304.20150501-1.patch, 
> MAPREDUCE-6304.20150510-1.patch, MAPREDUCE-6304.20150511-1.patch, 
> MAPREDUCE-6304.20150512-1.patch
>
>
> Per the discussion on YARN-796, we need a mechanism in MAPREDUCE to specify 
> node labels when submitting MR jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6579) JobStatus#getFailureInfo should not output diagnostic information when the job is running

2016-02-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151219#comment-15151219
 ] 

Wangda Tan commented on MAPREDUCE-6579:
---

Hi [~Naganarasimha],

I think we need to fix both:
- JobStatus.getFailureInfo
- NotRunningJob.getDiagnostics

In original MR, the diagnostic message can only be set when the application 
finishes.

In the 07 patch, the diagnostic message is returned only when the job has 
failed or been killed; I think we should at least include the finished state 
as well. I'm not sure whether existing MR sets a diagnostic message when a job 
finishes successfully, but it is very possible that people will do so in the 
future.

Thoughts?
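
In other words, something along these lines (illustrative only):

{code}
// Illustrative state filter: surface diagnostics for every terminal state,
// not only FAILED/KILLED, since apps may annotate a successful finish too.
private static boolean isTerminal(JobStatus.State state) {
  return state == JobStatus.State.SUCCEEDED
      || state == JobStatus.State.FAILED
      || state == JobStatus.State.KILLED;
}
{code}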

> JobStatus#getFailureInfo should not output diagnostic information when the 
> job is running
> -
>
> Key: MAPREDUCE-6579
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6579
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
>Priority: Blocker
> Attachments: MAPREDUCE-6579.01.patch, MAPREDUCE-6579.02.patch, 
> MAPREDUCE-6579.03.patch, MAPREDUCE-6579.04.patch, MAPREDUCE-6579.05.patch, 
> MAPREDUCE-6579.06.patch, MAPREDUCE-6579.07.patch
>
>
> From 
> [https://builds.apache.org/job/PreCommit-YARN-Build/9976/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-jobclient-jdk1.8.0_66.txt]
>  TestNetworkedJob fails intermittently.
> {code}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 81.131 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 30.55 sec  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[[Tue Dec 15 14:02:45 + 2015] 
> Application is Activated, waiting for resources to be assigned for AM.  
> Details : AM Partition =  ; Partition Resource = 
>  ; Queue's Absolute capacity = 100.0 % ; Queue's 
> Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; ]> 
> but was:<[]>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:174)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6579) JobStatus#getFailureInfo should not output diagnostic information when the job is running

2016-02-18 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152840#comment-15152840
 ] 

Wangda Tan commented on MAPREDUCE-6579:
---

[~Naganarasimha], [~ajisakaa].

I agree to only handle the impact of YARN-3946 in this JIRA.

However, I still think we should show the diagnostic message for completed 
applications. Without YARN-3946, the diagnostic message is added to 
NotRunningJob in any case; we'd better not change that behavior.

Even though, by YARN's definition, the FINISHED state means the application 
completed successfully, it is still possible for an individual application to 
attach some information or explanation to the FINISHED state.
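
On the report side, the behavior worth preserving is roughly (a sketch, not 
the committed patch):

{code}
// Sketch: keep passing the application's diagnostics through for any
// completed application, FINISHED included, as before YARN-3946.
String diagnostics = (applicationReport == null)
    ? "Job is not available."
    : applicationReport.getDiagnostics();
jobReport.setDiagnostics(diagnostics);
{code}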

> JobStatus#getFailureInfo should not output diagnostic information when the 
> job is running
> -
>
> Key: MAPREDUCE-6579
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6579
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
>Priority: Blocker
> Attachments: MAPREDUCE-6579.01.patch, MAPREDUCE-6579.02.patch, 
> MAPREDUCE-6579.03.patch, MAPREDUCE-6579.04.patch, MAPREDUCE-6579.05.patch, 
> MAPREDUCE-6579.06.patch, MAPREDUCE-6579.07.patch
>
>
> From 
> [https://builds.apache.org/job/PreCommit-YARN-Build/9976/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-jobclient-jdk1.8.0_66.txt]
>  TestNetworkedJob fails intermittently.
> {code}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 81.131 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 30.55 sec  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[[Tue Dec 15 14:02:45 + 2015] 
> Application is Activated, waiting for resources to be assigned for AM.  
> Details : AM Partition =  ; Partition Resource = 
>  ; Queue's Absolute capacity = 100.0 % ; Queue's 
> Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; ]> 
> but was:<[]>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:174)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6579) JobStatus#getFailureInfo should not output diagnostic information when the job is running

2016-03-15 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6579:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk/branch-2/branch-2.8. Thanks [~ajisakaa] for the patch, and 
[~Naganarasimha]/[~rohithsharma] for the reviews.

> JobStatus#getFailureInfo should not output diagnostic information when the 
> job is running
> -
>
> Key: MAPREDUCE-6579
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6579
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
>Priority: Blocker
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-6579.01.patch, MAPREDUCE-6579.02.patch, 
> MAPREDUCE-6579.03.patch, MAPREDUCE-6579.04.patch, MAPREDUCE-6579.05.patch, 
> MAPREDUCE-6579.06.patch, MAPREDUCE-6579.07.patch, MAPREDUCE-6579.08.patch
>
>
> From 
> [https://builds.apache.org/job/PreCommit-YARN-Build/9976/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-jobclient-jdk1.8.0_66.txt]
>  TestNetworkedJob fails intermittently.
> {code}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 81.131 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 30.55 sec  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[[Tue Dec 15 14:02:45 + 2015] 
> Application is Activated, waiting for resources to be assigned for AM.  
> Details : AM Partition =  ; Partition Resource = 
>  ; Queue's Absolute capacity = 100.0 % ; Queue's 
> Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; ]> 
> but was:<[]>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:174)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6579) JobStatus#getFailureInfo should not output diagnostic information when the job is running

2016-03-15 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195842#comment-15195842
 ] 

Wangda Tan commented on MAPREDUCE-6579:
---

Hi [~ajisakaa],

Sorry for the delay. I thought I had already +1'ed.

Thanks for the updates, +1 to the latest patch. Will commit shortly.

> JobStatus#getFailureInfo should not output diagnostic information when the 
> job is running
> -
>
> Key: MAPREDUCE-6579
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6579
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Rohith Sharma K S
>Assignee: Akira AJISAKA
>Priority: Blocker
> Attachments: MAPREDUCE-6579.01.patch, MAPREDUCE-6579.02.patch, 
> MAPREDUCE-6579.03.patch, MAPREDUCE-6579.04.patch, MAPREDUCE-6579.05.patch, 
> MAPREDUCE-6579.06.patch, MAPREDUCE-6579.07.patch, MAPREDUCE-6579.08.patch
>
>
> From 
> [https://builds.apache.org/job/PreCommit-YARN-Build/9976/artifact/patchprocess/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-jobclient-jdk1.8.0_66.txt]
>  TestNetworkedJob fails intermittently.
> {code}
> Running org.apache.hadoop.mapred.TestNetworkedJob
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 81.131 sec 
> <<< FAILURE! - in org.apache.hadoop.mapred.TestNetworkedJob
> testNetworkedJob(org.apache.hadoop.mapred.TestNetworkedJob)  Time elapsed: 
> 30.55 sec  <<< FAILURE!
> org.junit.ComparisonFailure: expected:<[[Tue Dec 15 14:02:45 + 2015] 
> Application is Activated, waiting for resources to be assigned for AM.  
> Details : AM Partition =  ; Partition Resource = 
>  ; Queue's Absolute capacity = 100.0 % ; Queue's 
> Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; ]> 
> but was:<[]>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.mapred.TestNetworkedJob.testNetworkedJob(TestNetworkedJob.java:174)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

2016-04-12 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237699#comment-15237699
 ] 

Wangda Tan commented on MAPREDUCE-6513:
---

Patch looks good to me, thanks [~varun_saxena]!

> MR job got hanged forever when one NM unstable for some time
> 
>
> Key: MAPREDUCE-6513
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Bob.zhao
>Assignee: Varun Saxena
>Priority: Critical
> Attachments: MAPREDUCE-6513.01.patch, MAPREDUCE-6513.02.patch, 
> MAPREDUCE-6513.03.patch
>
>
> While a job with many tasks was in progress, one node became unstable due to 
> an OS issue. After the node became unstable, the status of the maps on that 
> node changed to KILLED.
> The maps that had been running on the unstable node were rescheduled; all of 
> them sat in the scheduled state, waiting for the RM to assign containers. Ask 
> requests for the maps were seen until the node became good again (all of 
> those failed), but no ask requests were sent after that. Meanwhile the AM 
> kept preempting the reducers, over and over.
> In the end, the reducers were waiting for the mappers to complete, and the 
> mappers never got containers.
> My question is:
> 
> Why were map requests not sent by the AM once the node recovered?
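
The gist of the hang, as a hedged sketch (the method name below is 
illustrative, not the actual patch):

{code}
// When a map attempt is killed because its node turned unusable, the AM
// must put a fresh container request back into the ask table; otherwise no
// ask ever reaches the RM again while the AM keeps preempting reducers.
void onMapAttemptKilledByBadNode(ContainerRequest failedReq) {
  addContainerReq(new ContainerRequest(failedReq.attemptID,
      failedReq.capability, failedReq.hosts, failedReq.racks,
      failedReq.priority));
}
{code}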



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6302) Preempt reducers after a configurable timeout irrespective of headroom

2016-04-08 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated MAPREDUCE-6302:
--
Fix Version/s: 2.6.5
   2.7.3

> Preempt reducers after a configurable timeout irrespective of headroom
> --
>
> Key: MAPREDUCE-6302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: mai shurong
>Assignee: Karthik Kambatla
>Priority: Critical
> Fix For: 2.8.0, 2.7.3, 2.6.5
>
> Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
> MAPREDUCE-6302.branch-2.6.0001.patch, MAPREDUCE-6302.branch-2.7.0001.patch, 
> log.txt, mr-6302-1.patch, mr-6302-2.patch, mr-6302-3.patch, mr-6302-4.patch, 
> mr-6302-5.patch, mr-6302-6.patch, mr-6302-7.patch, mr-6302-prelim.patch, 
> mr-6302_branch-2.patch, queue_with_max163cores.png, 
> queue_with_max263cores.png, queue_with_max333cores.png
>
>
> I submitted a big job, with 500 maps and 350 reduces, to a queue (fair 
> scheduler) with a 300-core maximum. When the job's maps reached 100%, 300 
> reduces occupied all 300 cores in the queue. Then a map failed and was 
> retried, waiting for a core, while the 300 reduces were waiting for the 
> failed map to finish. So a deadlock occurred. As a result, the job was 
> blocked, and later jobs in the queue could not run because no cores were 
> available in the queue.
> I think there is a similar issue for the memory of a queue.
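
The committed change adds an unconditional preemption delay for exactly this 
case; a hedged usage sketch (property name as added by this patch, please 
verify in mapred-default.xml of the fix versions):

{code}
// Preempt reducers after 2 minutes regardless of reported headroom, so a
// retried map can always obtain a container eventually.
Configuration conf = new Configuration();
conf.setInt("mapreduce.job.reducer.unconditional-preempt.delay.sec", 120);
{code}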



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6302) Preempt reducers after a configurable timeout irrespective of headroom

2016-04-08 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233001#comment-15233001
 ] 

Wangda Tan commented on MAPREDUCE-6302:
---

Done, committed to branch-2.6/branch-2.7.

> Preempt reducers after a configurable timeout irrespective of headroom
> --
>
> Key: MAPREDUCE-6302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: mai shurong
>Assignee: Karthik Kambatla
>Priority: Critical
> Fix For: 2.8.0, 2.7.3, 2.6.5
>
> Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
> MAPREDUCE-6302.branch-2.6.0001.patch, MAPREDUCE-6302.branch-2.7.0001.patch, 
> log.txt, mr-6302-1.patch, mr-6302-2.patch, mr-6302-3.patch, mr-6302-4.patch, 
> mr-6302-5.patch, mr-6302-6.patch, mr-6302-7.patch, mr-6302-prelim.patch, 
> mr-6302_branch-2.patch, queue_with_max163cores.png, 
> queue_with_max263cores.png, queue_with_max333cores.png
>
>
> I submitted a big job, with 500 maps and 350 reduces, to a queue (fair 
> scheduler) with a 300-core maximum. When the job's maps reached 100%, 300 
> reduces occupied all 300 cores in the queue. Then a map failed and was 
> retried, waiting for a core, while the 300 reduces were waiting for the 
> failed map to finish. So a deadlock occurred. As a result, the job was 
> blocked, and later jobs in the queue could not run because no cores were 
> available in the queue.
> I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6302) Preempt reducers after a configurable timeout irrespective of headroom

2016-04-08 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232987#comment-15232987
 ] 

Wangda Tan commented on MAPREDUCE-6302:
---

Apologies, I forgot to backport the patches to the maintenance releases. Doing 
it now.

> Preempt reducers after a configurable timeout irrespective of headroom
> --
>
> Key: MAPREDUCE-6302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: mai shurong
>Assignee: Karthik Kambatla
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, 
> MAPREDUCE-6302.branch-2.6.0001.patch, MAPREDUCE-6302.branch-2.7.0001.patch, 
> log.txt, mr-6302-1.patch, mr-6302-2.patch, mr-6302-3.patch, mr-6302-4.patch, 
> mr-6302-5.patch, mr-6302-6.patch, mr-6302-7.patch, mr-6302-prelim.patch, 
> mr-6302_branch-2.patch, queue_with_max163cores.png, 
> queue_with_max263cores.png, queue_with_max333cores.png
>
>
> I submitted a big job, with 500 maps and 350 reduces, to a queue (fair 
> scheduler) with a 300-core maximum. When the job's maps reached 100%, 300 
> reduces occupied all 300 cores in the queue. Then a map failed and was 
> retried, waiting for a core, while the 300 reduces were waiting for the 
> failed map to finish. So a deadlock occurred. As a result, the job was 
> blocked, and later jobs in the queue could not run because no cores were 
> available in the queue.
> I think there is a similar issue for the memory of a queue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAPREDUCE-6108) ShuffleError OOM while reserving memory by MergeManagerImpl

2016-05-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281053#comment-15281053
 ] 

Wangda Tan commented on MAPREDUCE-6108:
---

[~kasha], [~vinodkv], is this still an issue in the current code base? Can we 
close it as not-reproducible if it cannot be reproduced?

Thanks,

> ShuffleError OOM while reserving memory by MergeManagerImpl
> ---
>
> Key: MAPREDUCE-6108
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6108
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.5.1
>Reporter: Dongwook Kwon
>Priority: Critical
>
> The shuffle hits OOM issues from time to time, 
> such as the one reported in this email:
> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-dev/201408.mbox/%3ccabwxxjnk-on0xtrmurijd8sdgjjtamsvqw2czpm3oekj3ym...@mail.gmail.com%3E
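
If anyone hits this again, a common mitigation while investigating (hedged; 
these are the standard shuffle knobs, tune them to your reducer heap):

{code}
// Shrink the fraction of reducer heap the shuffle may reserve at once,
// which usually keeps MergeManagerImpl's in-memory reservation within heap.
Configuration conf = new Configuration();
conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.50f);
conf.setFloat("mapreduce.reduce.shuffle.memory.limit.percent", 0.15f);
{code}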



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)



