[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542809#comment-13542809
 ] 

Alejandro Abdelnur commented on MAPREDUCE-4819:
---

bq. For Job End notification. This is hitting a URL to indicate that the job 
has finished and if it has finished successfully or in error. I do need to do 
some integration tests with Oozie to validate that it can handle being informed 
more than once without having any real problems.

Oozie handles duplicate notifications correctly doing a NOP.
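
To illustrate what treating a duplicate notification as a NOP can look like on the receiving side, here is a minimal sketch. It is not Oozie's code: the class JobEndNotificationSketch, the method onJobEnd, and the in-memory set of seen job IDs are all assumptions made for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical receiver of job-end notifications (not Oozie's implementation).
class JobEndNotificationSketch {
    // Job IDs for which a notification has already been processed.
    private final Set<String> seenJobIds = ConcurrentHashMap.newKeySet();

    /**
     * Returns true if this is the first notification for the job and it was
     * processed, false if it is a duplicate and was ignored (NOP).
     */
    boolean onJobEnd(String jobId, String status) {
        if (!seenJobIds.add(jobId)) {
            return false; // duplicate: do nothing
        }
        // A real receiver would record the status and advance its workflow here.
        return true;
    }
}
```

With a receiver shaped like this, an AM that notifies a second time after a restart would not trigger the downstream action twice.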

 AM can rerun job after reporting final job status to the client
 ---

 Key: MAPREDUCE-4819
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mr-am
Affects Versions: 0.23.3, 2.0.1-alpha
Reporter: Jason Lowe
Assignee: Bikas Saha
Priority: Critical
 Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
 MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt


 If the AM reports final job status to the client but then crashes before 
 unregistering with the RM then the RM can run another AM attempt.  Currently 
 AM re-attempts assume that the previous attempts did not reach a final job 
 state, and that causes the job to rerun (from scratch, if the output format 
 doesn't support recovery).
 Re-running the job when we've already told the client the final status of the 
 job is bad for a number of reasons.  If the job failed, it's confusing at 
 best since the client was already told the job failed but the subsequent 
 attempt could succeed.  If the job succeeded there could be data loss, as a 
 subsequent job launched by the client tries to consume the job's output as 
 input just as the re-attempt starts removing output files in preparation for 
 the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task

2013-01-03 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542810#comment-13542810
 ] 

Alejandro Abdelnur commented on MAPREDUCE-2217:
---

+1. Nice job forcing the problem to verify the fix.

 The expire launching task should cover the UNASSIGNED task
 --

 Key: MAPREDUCE-2217
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: jobtracker
Affects Versions: 0.23.0, 1.1.1
Reporter: Scott Chen
Assignee: Karthik Kambatla
 Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, 
 MR-2217.patch, MR-2217.patch


 The ExpireLaunchingTask thread kills tasks that are scheduled but have not 
 responded.
 Currently, if a task is scheduled on a tasktracker and for some reason the 
 tasktracker cannot put it into RUNNING, the task will just hang in the 
 UNASSIGNED status and the JobTracker will keep waiting for it.
 JobTracker.ExpireLaunchingTask should be able to kill this task.
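
 The behaviour requested above can be sketched as a pure function that picks out tasks still UNASSIGNED after a timeout. This is only an illustration, not the JobTracker code or the attached patch: the class ExpireLaunchingSketch, the Launch record, and the expired method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of launch-expiry bookkeeping (not the JobTracker classes).
class ExpireLaunchingSketch {
    enum State { UNASSIGNED, RUNNING }

    static final class Launch {
        final long launchedAtMs;
        State state = State.UNASSIGNED; // stays UNASSIGNED until the tracker runs it

        Launch(long launchedAtMs) {
            this.launchedAtMs = launchedAtMs;
        }
    }

    /** Returns IDs of tasks that are still UNASSIGNED after timeoutMs. */
    static List<String> expired(Map<String, Launch> launching, long nowMs, long timeoutMs) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Launch> e : launching.entrySet()) {
            Launch l = e.getValue();
            if (l.state == State.UNASSIGNED && nowMs - l.launchedAtMs > timeoutMs) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

 Passing the current time in as a parameter keeps the check easy to test without a real clock or thread.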



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542814#comment-13542814
 ] 

Siddharth Seth commented on MAPREDUCE-4819:
---

Bobby, Jason, Along with trying to ensure that a commit does not happen twice, 
I think there is value in committing the job history file before changing job 
status to SUCCESS - primarily for the RPC to behave consistently. It can 
otherwise see temporary final states, if the AM crashes during the history file 
persist, and won't be able to retrieve counters or other job status till the 
next AM attempt. This does have the drawback of a small performance hit though 
- and also makes job history a critical part of a job.
Using separate files for marking success / failure - am guessing this is to 
have a smaller chance of a failing persist, as compared to persisting events 
via the HistoryFile, which may already have a backlog of events?

Wondering if it's possible to achieve the same checks via the 
CommitterEventHandler instead of checking in the MRAppMaster class. i.e follow 
the regular recovery path - except the CommitHandler emits success / failed / 
abort events depending on the presence of these files / (history events).
Alternately, the current implementation could be simplified by using a custom 
RMCommunicator - which does not depend on JobImpl. i.e. the history copier and 
an RMCommunicator to unregister from the RM.

Comments on the current patch
- If the last AM attempt were to crash - data exists since the _SUCCESS_ file 
exists, RPC will not see SUCCESS.
- While the new AM is running - it will not be able to handle status, counter 
etc requests. This seems a little problematic if a success has been reported 
over RPC from the previous AM. Since this AM is dealing with the history file - 
could possibly have it return information from the history file ?
History commit before SUCCESS may help with the previous 2 points.

- If the recovered AppMaster is not the last retry - looks like the RM 
unregistration will not happen. (isLastAMRetry)
- Is a KILLED status also required - KILLED during commit should not be 
reported as FAILED
- The check for commitSuccess / commitFailure in the AM - the failure check can 
happen before the success check (low chance but a success file could be created 
followed by an RPC failure)
- CommitEventHandler.touchz could throw an exception if the file already exists 
- to prevent lost AMs from committing. (maybe not required after MAPREDUCE-4832 
?)
- historyService creation - can move into the common if (copyHistory) check
- Don't think AMStartedEvent can be ignored - the history server will have 
no info about past AMs. I think only the current AM needs to be ignored.
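
The points above about check ordering and about touchz failing on an existing file can be sketched as follows. This is an illustration on the local filesystem using java.nio.file, not the HDFS code in the patch; the class CommitMarkerSketch, the marker names COMMIT_SUCCESS and COMMIT_FAIL, and both methods are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical marker-file helper (local filesystem stand-in for HDFS).
class CommitMarkerSketch {
    // Hypothetical marker names; the real patch may use different ones.
    static final String SUCCESS_MARKER = "COMMIT_SUCCESS";
    static final String FAILURE_MARKER = "COMMIT_FAIL";

    enum PriorOutcome { NONE, SUCCEEDED, FAILED }

    /** Checks the failure marker first, so a later failure wins over success. */
    static PriorOutcome checkPriorCommit(Path dir) {
        if (Files.exists(dir.resolve(FAILURE_MARKER))) {
            return PriorOutcome.FAILED;
        }
        if (Files.exists(dir.resolve(SUCCESS_MARKER))) {
            return PriorOutcome.SUCCEEDED;
        }
        return PriorOutcome.NONE;
    }

    /** Creates the marker; returns false if another attempt already created it. */
    static boolean tryMark(Path dir, String marker) {
        try {
            Files.createFile(dir.resolve(marker)); // fails if the file exists
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A new AM attempt could call checkPriorCommit before deciding whether to rerun, and a lost AM calling tryMark would learn that another attempt got there first.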

Wondering if it's possible to use HDFS dirs and timestamps to co-ordinate 
between an active AM and lost AMs. 
Also, are HDFS dir operations cheaper than file create operations (NN only / NN 
+DN)? Not sure if mkdir / 0 length file creation are NN-only ops.




[jira] [Commented] (MAPREDUCE-4904) TestMultipleLevelCaching failed in branch-1

2013-01-03 Thread Luke Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542827#comment-13542827
 ] 

Luke Lu commented on MAPREDUCE-4904:


I meant that you should add a comment (e.g. // fall through. see 
MAPREDUCE-4904) to the patch :) Otherwise, the switch code would look a little 
strange to later maintainers and cause some unnecessary head-scratching.
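
For readers unfamiliar with the convention, a commented fall-through in a Java switch looks roughly like this. The example is generic and is not the code from the patch; the class FallThroughSketch, the method describe, and the level numbering are made up for illustration.

```java
// Hypothetical example of documenting an intentional switch fall-through.
class FallThroughSketch {
    /** Lists every locality level that applies at or above the given level. */
    static String describe(int level) {
        StringBuilder sb = new StringBuilder();
        switch (level) {
            case 0:
                sb.append("node-local ");
                // fall through. see MAPREDUCE-4904
            case 1:
                sb.append("rack-local ");
                // fall through
            default:
                sb.append("any");
        }
        return sb.toString();
    }
}
```

Without the comments, a missing break between cases reads like a bug, which is the head-scratching the comment is meant to prevent.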

 TestMultipleLevelCaching failed in branch-1
 ---

 Key: MAPREDUCE-4904
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4904
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: test
Affects Versions: 1.2.0
Reporter: meng gong
Assignee: meng gong
 Fix For: 1.2.0

 Attachments: MAPREDUCE-4904.patch


 TestMultipleLevelCaching fails:
 {noformat}
 Testcase: testMultiLevelCaching took 30.406 sec
 FAILED
 Number of local maps expected:0 but was:1
 junit.framework.AssertionFailedError: Number of local maps expected:0 but 
 was:1
 at 
 org.apache.hadoop.mapred.TestRackAwareTaskPlacement.launchJobAndTestCounters(TestRackAwareTaskPlacement.java:78)
 at 
 org.apache.hadoop.mapred.TestMultipleLevelCaching.testCachingAtLevel(TestMultipleLevelCaching.java:113)
 at 
 org.apache.hadoop.mapred.TestMultipleLevelCaching.testMultiLevelCaching(TestMultipleLevelCaching.java:69)
 {noformat}



[jira] [Commented] (MAPREDUCE-4279) getClusterStatus() fails with null pointer exception when running jobs in local mode

2013-01-03 Thread Jarek Jarcec Cecho (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542849#comment-13542849
 ] 

Jarek Jarcec Cecho commented on MAPREDUCE-4279:
---

I've recently detected this issue as well. Would it be possible to fix it?

 getClusterStatus() fails with null pointer exception when running jobs in 
 local mode
 

 Key: MAPREDUCE-4279
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4279
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobtracker
Affects Versions: 0.23.1, 2.0.0-alpha, 3.0.0
Reporter: Rahul Jain
Assignee: Devaraj K
 Attachments: MAPREDUCE-4279.patch


 While migrating code from 0.20.2 hadoop codebase to 0.23.1 we encountered 
 this issue for jobs run in local mode of execution:
 {code}
 java.lang.NullPointerException
   at 
 org.apache.hadoop.mapred.JobClient.arrayToStringList(JobClient.java:783)
   at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:138)
   at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:815)
   at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:812)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
   at 
 org.apache.hadoop.mapred.JobClient.getClusterStatus(JobClient.java:812)
 {code}
 We are using cloudera distribution CDH4b2 for testing, however the underlying 
 code is 0.23.1 and I could see no difference in this implementation.
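
 The trace points at JobClient.arrayToStringList. One plausible cause, which is an assumption and not confirmed by this report, is that the array passed in is null in local mode. Under that assumption, a defensive conversion could look like the sketch below. The class NullSafeListSketch and the Object[] parameter are hypothetical and do not reproduce the real method signature or the attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical null-safe array-to-list helper (not the Hadoop JobClient method).
class NullSafeListSketch {
    /** Converts items to their string forms; a null array yields an empty list. */
    static List<String> arrayToStringList(Object[] items) {
        if (items == null) {
            return Collections.emptyList();
        }
        List<String> out = new ArrayList<>(items.length);
        for (Object item : items) {
            out.add(String.valueOf(item)); // String.valueOf tolerates null elements
        }
        return out;
    }
}
```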



[jira] [Commented] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to Queue configuration missing child queue names for root

2013-01-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542848#comment-13542848
 ] 

Hudson commented on MAPREDUCE-4884:
---

Integrated in Hadoop-Yarn-trunk #85 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/85/])
MAPREDUCE-4884. Streaming tests fail to start MiniMRCluster due to missing 
queue configuration. Contributed by Chris Nauroth. (Revision 1427945)

 Result = SUCCESS
suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1427945
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/pom.xml


 streaming tests fail to start MiniMRCluster due to Queue configuration 
 missing child queue names for root
 ---

 Key: MAPREDUCE-4884
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4884
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: contrib/streaming, test
Affects Versions: 3.0.0, trunk-win
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Fix For: 3.0.0

 Attachments: MAPREDUCE-4884.1.patch


 Multiple tests in hadoop-streaming, such as {{TestFileArgs}}, fail to 
 initialize {{MiniMRCluster}} due to a {{YarnException}} with reason Queue 
 configuration missing child queue names for root.



[jira] [Commented] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to Queue configuration missing child queue names for root

2013-01-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542870#comment-13542870
 ] 

Hudson commented on MAPREDUCE-4884:
---

Integrated in Hadoop-Hdfs-trunk #1274 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1274/])
MAPREDUCE-4884. Streaming tests fail to start MiniMRCluster due to missing 
queue configuration. Contributed by Chris Nauroth. (Revision 1427945)

 Result = FAILURE
suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1427945
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/pom.xml




[jira] [Updated] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task

2013-01-03 Thread Alejandro Abdelnur (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Abdelnur updated MAPREDUCE-2217:
--

Issue Type: Bug  (was: Improvement)



[jira] [Updated] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task

2013-01-03 Thread Alejandro Abdelnur (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alejandro Abdelnur updated MAPREDUCE-2217:
--

   Resolution: Fixed
Fix Version/s: 1.2.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Thanks Scott & Karthik. Committed to branch-1.



[jira] [Commented] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to Queue configuration missing child queue names for root

2013-01-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542938#comment-13542938
 ] 

Hudson commented on MAPREDUCE-4884:
---

Integrated in Hadoop-Mapreduce-trunk #1304 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1304/])
MAPREDUCE-4884. Streaming tests fail to start MiniMRCluster due to missing 
queue configuration. Contributed by Chris Nauroth. (Revision 1427945)

 Result = FAILURE
suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1427945
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/pom.xml




[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-03 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542978#comment-13542978
 ] 

Alejandro Abdelnur commented on MAPREDUCE-4049:
---

On #1, IMO APPLICATION_INIT should be sent to all auxiliary services (and 
APPLICATION_STOP)

On #2, is there a use case to load both instead of just the configured one?

 plugin for generic shuffle service
 --

 Key: MAPREDUCE-4049
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: performance, task, tasktracker
Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0
Reporter: Avner BenHanoch
Assignee: Avner BenHanoch
  Labels: merge, plugin, rdma, shuffle
 Fix For: 3.0.0

 Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, 
 mapreduce-4049.patch


 Support a generic shuffle service as a set of two plugins: ShuffleProvider & 
 ShuffleConsumer.
 This will satisfy the following needs:
 # Better shuffle and merge performance. For example: we are working on a 
 shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, 
 or Infiniband) instead of using the current HTTP shuffle. Based on the fast 
 RDMA shuffle, the plugin can also utilize a suitable merge approach during 
 the intermediate merges, hence getting much better performance.
 # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden 
 dependency of NodeManager with a specific version of mapreduce shuffle 
 (currently targeted to 0.24.0).
 References:
 # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu 
 from Auburn University with others, 
 [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
 # I am attaching 2 documents with suggested Top Level Design for both plugins 
 (currently, based on 1.0 branch)
 # I am providing a link for downloading UDA - Mellanox's open source plugin 
 that implements a generic shuffle service using RDMA and levitated merge.  
 Note: At this phase, the code is in C++ through JNI and you should consider 
 it as beta only.  Still, it can serve anyone who wants to implement or 
 contribute to levitated merge. (Please be advised that levitated merge is 
 mostly suited to very fast networks) - 
 [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69]



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543034#comment-13543034
 ] 

Robert Joseph Evans commented on MAPREDUCE-4819:


Wow, lots of comments.  Thanks to everyone for looking at the patch.
bq. I had observed that if I made my AM crash (by putting an exit(1) in 
shutdownJob()) then the history files would get orphaned and not cleaned up. Or 
something like that.

Thanks for the heads up. I will look into that.

bq. Why not end in success if the staging dir was cleaned up by the last 
attempt?

Because we crashed somewhere after staging was cleaned up and before we 
unregistered.  Crashing seems like an error to me, but I suppose we could 
change it.  As for what the client ultimately sees for success or failure, we 
will rely on the history server to report that.

bq. I am guessing that this code won't be necessary after we move the unregister 
to RM before the staging dir cleanup in MAPREDUCE-4841, right?

Yes and No.  Once MAPREDUCE-4841 goes in there is an increased possibility of 
leaking staging directories.  I have seen users in 1.0 blow away their staging 
directory to clean up, which caused jobs to fail.  Granted they are more likely 
to get errors from the distributed cache not finding the files it needs, but in 
either case I would like to be paranoid and guard against that.

bq. Why are we only eating/ignoring the JobEvents in the dispatcher? So that 
the JobImpl state machine is not triggered?

In the new code path we have not wired up everything.  JobImpl is created but 
the JobEventDispatcher is not.  I did not want to have to deal with recovering 
the complete state of the job, which in some cases may not even be possible.  
This is also why I am not bringing up the RPC server.  Now that you mention 
it, I probably also need to update the UI/client to deal with that 
appropriately. The typo you found was just there for debugging this situation.  
(I'll fix the typo by the way)

bq. This might be a question of personal preference. I think an explicit 
transition from the INIT to the final state is cleaner than overriding the state 
in the getter.

I actually wanted to put in a stubbed out Job instead, but there are too many 
places that Job is cast to JobImpl just to get the state making it difficult to 
do so.  I will look again to see if I can split the two apart, or add in a 
state transition.

bq. Oozie handles duplicate notifications correctly doing a NOP.
Great.  I will look at the javadocs for job end notification again to be sure 
that we can default to notify instead.

bq. Using separate files for marking success / failure - am guessing this is to 
have a smaller chance of a failing persist, as compared to persisting events 
via the HistoryFile, which may already have a backlog of events?

It was also a much smaller change to make.  The HistoryFile would be preferable 
if we wanted to guarantee at most once commit of the tasks, because there are 
so many of them.

bq. Wondering if it's possible to achieve the same checks via the 
CommitterEventHandler instead of checking in the MRAppMaster class. i.e follow 
the regular recovery path - except the CommitHandler emits success / failed / 
abort events depending on the presence of these files / (history events). 
bq. Alternately, the current implementation could be simplified by using a 
custom RMCommunicator - which does not depend on JobImpl. i.e. the history 
copier and an RMCommunicator to unregister from the RM.
Both of those seem like valid things to investigate.  I feel like I am close on 
this and want to get this working as is first and then I will look at the other 
approaches you suggested.  I do like the first one as it seems like it would be 
a lot simpler to implement, but I want a backup that I know functions before 
making drastic changes to the design.

bq. If the last AM attempt were to crash - data exists since the SUCCESS file 
exists, RPC will not see SUCCESS.
We have a lot of problems in general if the last AM were to crash.  It is 
possible that the history server would have no knowledge of the job whatsoever, 
even if it finished successfully.  This patch is not attempting to address 
those problems.

bq. While the new AM is running - it will not be able to handle status, counter 
etc requests. This seems a little problematic if a success has been reported 
over RPC from the previous AM. Since this AM is dealing with the history file - 
could possibly have it return information from the history file ? History 
commit before SUCCESS may help with the previous 2 points.

Yes, History commit before returning success would help with those problems. I 
will look into it as an alternative approach.  My initial thought was to update 
the client/UI to wait for the AM to report a valid address so that no client is 
trying to get counters etc. from an AM in this situation.


[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-03 Thread Avner BenHanoch (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543056#comment-13543056
 ] 

Avner BenHanoch commented on MAPREDUCE-4049:


Hi Alejandro,

On #1 - Thanks!

On #2 - YES: 
 1. ShuffleProvider is configured for the lifetime of the TT, while 
ShuffleConsumer is configured per job.  We don't want to restart 
MapReduce/TaskTrackers any time we want to use a different shuffle.

 2. In addition, I expect that a single job will use just 1 type of 
shuffle.  *Still, TT supports multiple jobs of multiple users with different 
shuffle & merge needs in parallel*.  Hence, multiple shuffle consumers may run in 
parallel (in the multiple jobs) => they will request data from multiple 
providers.  => *TT needs multiple providers in parallel* (You can consider 
multiple ShuffleProviders in MRv1 as equivalent to multiple AuxiliaryServices 
that are allowed in MRv2).

 3. It could be that a ShuffleConsumerX will be ideal for jobs of one type, 
while ShuffleConsumerY will be ideal for jobs of another type (for example Grep 
vs. TeraSort).  Hence, multiple Shuffle-Consumer plugins may run in parallel in 
multiple jobs.  Each of the consumers will contact its desired shuffle 
provider.  Hence, all providers should be available in parallel (also, one 
shuffle service can be sensitive to a type of network problem that doesn't 
disturb other shuffle services; hence, it should be possible to fall back to 
another shuffle on the fly).


On the design:
 1. It is clear that a ShuffleProvider is a daemon like TT, while 
ShuffleConsumer is a client that lives in the context of RT.
 2. It is clear that multiple providers can run in parallel and each is able to 
serve the shuffle requests it gets.
 3. A shuffle consumer instance will only contact one of the shuffle providers 
and will request its desired files only from this provider.
 4. Multiple consumers in multiple jobs may contact different providers.
 5. *A shuffle provider that doesn't serve a request doesn't consume resources 
for it.*
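
The design points above can be sketched as a small registry in which several providers are loaded side by side and each consumer talks to exactly one of them. This is only an illustration of the idea, not the plugin API from the attached patch: the one-method ShuffleProvider interface, the class ShuffleRegistrySketch, and the string-based fetch are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry of shuffle providers (not the proposed plugin API).
class ShuffleRegistrySketch {
    /** Hypothetical provider shape: returns the data for one map output. */
    interface ShuffleProvider {
        String serve(String mapOutputId);
    }

    // All configured providers are available in parallel, keyed by name.
    private final Map<String, ShuffleProvider> providers = new HashMap<>();

    void register(String name, ShuffleProvider provider) {
        providers.put(name, provider);
    }

    /** A consumer names the single provider it wants; idle providers do no work. */
    String fetch(String providerName, String mapOutputId) {
        ShuffleProvider provider = providers.get(providerName);
        if (provider == null) {
            throw new IllegalArgumentException("No shuffle provider: " + providerName);
        }
        return provider.serve(mapOutputId);
    }
}
```

Two jobs running at the same time could then use different providers, for example one registered under "http" and another under "rdma", without restarting the tracker.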



regards,
  Avner



[jira] [Commented] (MAPREDUCE-4279) getClusterStatus() fails with null pointer exception when running jobs in local mode

2013-01-03 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543076#comment-13543076
 ] 

Robert Joseph Evans commented on MAPREDUCE-4279:


The change looks fine to me, so +1.  I'll check it in.

 getClusterStatus() fails with null pointer exception when running jobs in 
 local mode
 

 Key: MAPREDUCE-4279
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4279
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobtracker
Affects Versions: 0.23.1, 2.0.0-alpha, 3.0.0
Reporter: Rahul Jain
Assignee: Devaraj K
 Attachments: MAPREDUCE-4279.patch


 While migrating code from 0.20.2 hadoop codebase to 0.23.1 we encountered 
 this issue for jobs run in local mode of execution:
 {code}
 java.lang.NullPointerException
   at 
 org.apache.hadoop.mapred.JobClient.arrayToStringList(JobClient.java:783)
   at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:138)
   at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:815)
   at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:812)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
   at 
 org.apache.hadoop.mapred.JobClient.getClusterStatus(JobClient.java:812)
 {code}
 We are using cloudera distribution CDH4b2 for testing, however the underlying 
 code is 0.23.1 and I could see no difference in this implementation.
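 The trace points at an array-to-list helper dereferencing a null array. As a minimal sketch of a defensive version (an assumption for illustration only, not the attached patch; the class name NullSafeArrays is hypothetical), a null input can be treated as an empty list:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the private helper named in the stack trace.
public class NullSafeArrays {

    /** Converts a possibly-null String array to a list; null yields an empty list. */
    static List<String> arrayToStringList(String[] values) {
        if (values == null) {
            // A runner that tracks no names may hand back null here (assumption).
            return Collections.emptyList();
        }
        List<String> result = new ArrayList<>(values.length);
        for (String value : values) {
            result.add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(arrayToStringList(null));
        System.out.println(arrayToStringList(new String[] {"tracker1", "tracker2"}));
    }
}
```

 Whether the guard belongs in the client helper or in the code that returns the array is a design choice this sketch does not settle.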



[jira] [Commented] (MAPREDUCE-4279) getClusterStatus() fails with null pointer exception when running jobs in local mode

2013-01-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543081#comment-13543081
 ] 

Hudson commented on MAPREDUCE-4279:
---

Integrated in Hadoop-trunk-Commit #3170 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3170/])
MAPREDUCE-4279. getClusterStatus() fails with null pointer exception when 
running jobs in local mode (Devaraj K via bobby) (Revision 1428482)

 Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1428482
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/test/java/org/apache/hadoop/mapred/TestJobClient.java


 getClusterStatus() fails with null pointer exception when running jobs in 
 local mode
 

 Key: MAPREDUCE-4279
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4279
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobtracker
Affects Versions: 0.23.1, 2.0.0-alpha, 3.0.0
Reporter: Rahul Jain
Assignee: Devaraj K
 Attachments: MAPREDUCE-4279.patch


 While migrating code from 0.20.2 hadoop codebase to 0.23.1 we encountered 
 this issue for jobs run in local mode of execution:
 {code}
 java.lang.NullPointerException
   at 
 org.apache.hadoop.mapred.JobClient.arrayToStringList(JobClient.java:783)
   at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:138)
   at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:815)
   at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:812)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
   at 
 org.apache.hadoop.mapred.JobClient.getClusterStatus(JobClient.java:812)
 {code}
 We are using cloudera distribution CDH4b2 for testing, however the underlying 
 code is 0.23.1 and I could see no difference in this implementation.



[jira] [Commented] (MAPREDUCE-4279) getClusterStatus() fails with null pointer exception when running jobs in local mode

2013-01-03 Thread Jarek Jarcec Cecho (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543083#comment-13543083
 ] 

Jarek Jarcec Cecho commented on MAPREDUCE-4279:
---

Awesome, thank you Robert!

 getClusterStatus() fails with null pointer exception when running jobs in 
 local mode
 

 Key: MAPREDUCE-4279
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4279
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobtracker
Affects Versions: 0.23.1, 2.0.0-alpha, 3.0.0
Reporter: Rahul Jain
Assignee: Devaraj K
 Attachments: MAPREDUCE-4279.patch


 While migrating code from 0.20.2 hadoop codebase to 0.23.1 we encountered 
 this issue for jobs run in local mode of execution:
 {code}
 java.lang.NullPointerException
   at 
 org.apache.hadoop.mapred.JobClient.arrayToStringList(JobClient.java:783)
   at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:138)
   at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:815)
   at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:812)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
   at 
 org.apache.hadoop.mapred.JobClient.getClusterStatus(JobClient.java:812)
 {code}
 We are using cloudera distribution CDH4b2 for testing, however the underlying 
 code is 0.23.1 and I could see no difference in this implementation.



[jira] [Updated] (MAPREDUCE-4279) getClusterStatus() fails with null pointer exception when running jobs in local mode

2013-01-03 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-4279:
---

   Resolution: Fixed
Fix Version/s: 0.23.6
   2.0.3-alpha
   3.0.0
   Status: Resolved  (was: Patch Available)

Thanks Devaraj and Rahul,

I put this into trunk, branch-2, and branch-0.23

 getClusterStatus() fails with null pointer exception when running jobs in 
 local mode
 

 Key: MAPREDUCE-4279
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4279
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobtracker
Affects Versions: 0.23.1, 2.0.0-alpha, 3.0.0
Reporter: Rahul Jain
Assignee: Devaraj K
 Fix For: 3.0.0, 2.0.3-alpha, 0.23.6

 Attachments: MAPREDUCE-4279.patch


 While migrating code from 0.20.2 hadoop codebase to 0.23.1 we encountered 
 this issue for jobs run in local mode of execution:
 {code}
 java.lang.NullPointerException
   at 
 org.apache.hadoop.mapred.JobClient.arrayToStringList(JobClient.java:783)
   at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:138)
   at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:815)
   at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:812)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
   at 
 org.apache.hadoop.mapred.JobClient.getClusterStatus(JobClient.java:812)
 {code}
 We are using cloudera distribution CDH4b2 for testing, however the underlying 
 code is 0.23.1 and I could see no difference in this implementation.



[jira] [Updated] (MAPREDUCE-4458) Warn if java.library.path is used for AM or Task

2013-01-03 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4458:
-

Attachment: MAPREDUCE-4458-2.patch

Added static function, removed tab indentations. Modified the test file but did 
not implement a test because the createApplicationSubmissionContext function is 
mocked in the @Before function.

 Warn if java.library.path is used for AM or Task
 

 Key: MAPREDUCE-4458
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4458
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: mrv2
Affects Versions: 0.23.3, 3.0.0, 2.0.2-alpha
Reporter: Robert Joseph Evans
Assignee: Robert Parker
 Attachments: MAPREDUCE-4458-2.patch, MAPREDUCE-4458.patch


 If java.library.path is used on the command line for launching an MRAppMaster 
 or an MR Task, it could conflict with how standard Hadoop/HDFS JNI libraries 
 and dependencies are found.  At a minimum the client should output a warning 
 and ask the user to switch to LD_LIBRARY_PATH.  It would be nice to 
 automatically do this for them but parsing the command line is scary so just 
 a warning is probably good enough for now.
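 A minimal sketch of the warning described above, assuming the client only does a substring check on the configured JVM options rather than parsing the command line. The class name, method name, and message wording are hypothetical and are not taken from the attached patch:

```java
// Hypothetical helper: flags JVM option strings that set java.library.path.
public class LibraryPathCheck {

    static final String FLAG = "-Djava.library.path";

    /**
     * Returns a warning message if the option string sets java.library.path,
     * or null if there is nothing to warn about.
     */
    static String warnIfLibraryPathSet(String optsName, String opts) {
        if (opts != null && opts.contains(FLAG)) {
            return "Usage of " + FLAG + " in " + optsName
                + " may interfere with loading of Hadoop native libraries;"
                + " consider setting LD_LIBRARY_PATH instead.";
        }
        return null;
    }

    public static void main(String[] args) {
        // Example option strings; the property names here are only labels for the message.
        System.out.println(warnIfLibraryPathSet("map opts", "-Xmx512m -Djava.library.path=/opt/native"));
        System.out.println(warnIfLibraryPathSet("am opts", "-Xmx1024m"));
    }
}
```

 A substring check can produce false positives (for example, the flag appearing inside a quoted value), which is in line with the description's choice of a warning over an automatic rewrite.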



[jira] [Updated] (MAPREDUCE-4458) Warn if java.library.path is used for AM or Task

2013-01-03 Thread Robert Parker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Parker updated MAPREDUCE-4458:
-

Target Version/s:   (was: 0.23.3)
  Status: Patch Available  (was: Open)

 Warn if java.library.path is used for AM or Task
 

 Key: MAPREDUCE-4458
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4458
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: mrv2
Affects Versions: 2.0.2-alpha, 0.23.3, 3.0.0
Reporter: Robert Joseph Evans
Assignee: Robert Parker
 Attachments: MAPREDUCE-4458-2.patch, MAPREDUCE-4458.patch


 If java.library.path is used on the command line for launching an MRAppMaster 
 or an MR Task, it could conflict with how standard Hadoop/HDFS JNI libraries 
 and dependencies are found.  At a minimum the client should output a warning 
 and ask the user to switch to LD_LIBRARY_PATH.  It would be nice to 
 automatically do this for them but parsing the command line is scary so just 
 a warning is probably good enough for now.



[jira] [Resolved] (MAPREDUCE-4655) MergeManager.reserve can OutOfMemoryError if more than 10% of max memory is used on non-MapOutputs

2013-01-03 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved MAPREDUCE-4655.
---

Resolution: Invalid

 MergeManager.reserve can OutOfMemoryError if more than 10% of max memory is 
 used on non-MapOutputs
 --

 Key: MAPREDUCE-4655
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4655
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.1-alpha
Reporter: Sandy Ryza

 The MergeManager does a memory check, using a limit that defaults to 90% of 
 Runtime.getRuntime().maxMemory(). Allocations that would bring the total 
 memory allocated by the MergeManager over this limit are asked to wait until 
 memory frees up. Disk is used for single allocations that would be over 25% 
 of the memory limit.
 If some other part of the reducer were to be using more than 10% of the 
 memory, the current check wouldn't stop an OutOfMemoryError.
 Before creating an in-memory MapOutput, a check can be done using 
 Runtime.getRuntime().freeMemory(), waiting until memory is freed up if it 
 fails.
 12/08/17 10:36:29 INFO mapreduce.Job: Task Id : attempt_1342723342632_0010_r_05_0, Status : FAILED
 Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#6
   at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:123)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:371)
   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:416)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
 Caused by: java.lang.OutOfMemoryError: Java heap space
   at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58)
   at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45)
   at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:97)
   at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:286)
   at org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:276)
   at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:327)
   at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:273)
   at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:153)
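 The free-memory check suggested in the description could look roughly like the sketch below. It is illustration only (the issue was resolved as Invalid), and the class and method names are hypothetical. One detail worth noting: Runtime.freeMemory() is relative to the currently committed heap (totalMemory()), so the sketch also counts heap the JVM has not yet claimed, up to maxMemory().

```java
// Hypothetical gate that a caller could consult before an in-memory allocation.
public class FreeMemoryGate {

    /** Pure form of the check, so the arithmetic can be exercised with fixed numbers. */
    static boolean canReserve(long requested, long free, long total, long max) {
        // Obtainable memory = free space in the committed heap
        //                   + heap the JVM may still grow into.
        long available = free + (max - total);
        return requested <= available;
    }

    /** Same check against the live JVM figures. */
    static boolean canReserve(long requested) {
        Runtime rt = Runtime.getRuntime();
        return canReserve(requested, rt.freeMemory(), rt.totalMemory(), rt.maxMemory());
    }

    public static void main(String[] args) {
        System.out.println("1 KB reservable: " + canReserve(1024));
    }
}
```

 These figures are only a snapshot: garbage that has not yet been collected is reported as used, so a real check would need to tolerate conservative answers.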



[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-03 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543235#comment-13543235
 ] 

Alejandro Abdelnur commented on MAPREDUCE-4049:
---

Got it, thanks for the detailed explanation. 

* Does this mean that providers must lazily initialize on the first request?
* Are you planning to support loading N providers in your patch?

 plugin for generic shuffle service
 --

 Key: MAPREDUCE-4049
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: performance, task, tasktracker
Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0
Reporter: Avner BenHanoch
Assignee: Avner BenHanoch
  Labels: merge, plugin, rdma, shuffle
 Fix For: 3.0.0

 Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, 
 mapreduce-4049.patch


 Support generic shuffle service as a set of two plugins: ShuffleProvider & 
 ShuffleConsumer.
 This will satisfy the following needs:
 # Better shuffle and merge performance. For example: we are working on 
 shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, 
 or Infiniband) instead of using the current HTTP shuffle. Based on the fast 
 RDMA shuffle, the plugin can also utilize a suitable merge approach during 
 the intermediate merges. Hence, getting much better performance.
 # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden 
 dependency of NodeManager with a specific version of mapreduce shuffle 
 (currently targeted to 0.24.0).
 References:
 # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu 
 from Auburn University with others, 
 [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
 # I am attaching 2 documents with suggested Top Level Design for both plugins 
 (currently, based on 1.0 branch)
 # I am providing link for downloading UDA - Mellanox's open source plugin 
 that implements generic shuffle service using RDMA and levitated merge.  
 Note: At this phase, the code is in C++ through JNI and you should consider 
 it as beta only.  Still, it can serve anyone that wants to implement or 
 contribute to levitated merge. (Please be advised that levitated merge is 
 mostly suited to very fast networks) - 
 [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69]



[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-03 Thread Avner BenHanoch (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543254#comment-13543254
 ] 

Avner BenHanoch commented on MAPREDUCE-4049:


1. I would not use the term "must".  Each provider can choose its own 
optimization.  If a provider's performance is worthwhile to the user, the user 
will use it.  In general, I think the major resource used by providers is the 
cache of MOFs.  Since this cache is filled only while serving requests, the 
cost of a provider that was loaded but is unused is low.  Other than that, I 
think providers mainly listen for incoming requests.

2. In my patch I plan to support just 1 provider (in addition to the built-in 
MapOutputHttpServlet).  This is enough for my use case.  Support for N 
providers is a legitimate idea.  If someone needs it, I would prefer that it be 
handled outside my patch.
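
To illustrate the first point, here is a minimal sketch of a provider whose cache is populated only while serving requests, so that loading it costs little until it is used. The interface shape, class names, and the fake load step are all assumptions for illustration; they are not the API proposed in the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class LazyProviderSketch {

    /** Hypothetical provider contract: return the bytes for one map output. */
    interface ShuffleProvider {
        byte[] serve(String mapOutputId);
    }

    /** Provider that allocates nothing per map output until a request arrives. */
    static class CachingProvider implements ShuffleProvider {
        private final Map<String, byte[]> cache = new HashMap<>();
        int loads = 0; // counts simulated disk reads, for demonstration only

        @Override
        public synchronized byte[] serve(String mapOutputId) {
            byte[] data = cache.get(mapOutputId);
            if (data == null) {
                data = load(mapOutputId); // first request pays the cost
                cache.put(mapOutputId, data);
            }
            return data;
        }

        // Stand-in for reading a map output file; a real provider would do I/O here.
        private byte[] load(String mapOutputId) {
            loads++;
            return mapOutputId.getBytes(StandardCharsets.UTF_8);
        }

        synchronized int cachedEntries() {
            return cache.size();
        }
    }

    public static void main(String[] args) {
        CachingProvider provider = new CachingProvider();
        System.out.println("entries before any request: " + provider.cachedEntries());
        provider.serve("map_0");
        provider.serve("map_0");
        System.out.println("loads after two requests: " + provider.loads);
    }
}
```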

 plugin for generic shuffle service
 --

 Key: MAPREDUCE-4049
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: performance, task, tasktracker
Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0
Reporter: Avner BenHanoch
Assignee: Avner BenHanoch
  Labels: merge, plugin, rdma, shuffle
 Fix For: 3.0.0

 Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, 
 mapreduce-4049.patch


 Support generic shuffle service as a set of two plugins: ShuffleProvider & 
 ShuffleConsumer.
 This will satisfy the following needs:
 # Better shuffle and merge performance. For example: we are working on 
 shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, 
 or Infiniband) instead of using the current HTTP shuffle. Based on the fast 
 RDMA shuffle, the plugin can also utilize a suitable merge approach during 
 the intermediate merges. Hence, getting much better performance.
 # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden 
 dependency of NodeManager with a specific version of mapreduce shuffle 
 (currently targeted to 0.24.0).
 References:
 # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu 
 from Auburn University with others, 
 [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
 # I am attaching 2 documents with suggested Top Level Design for both plugins 
 (currently, based on 1.0 branch)
 # I am providing link for downloading UDA - Mellanox's open source plugin 
 that implements generic shuffle service using RDMA and levitated merge.  
 Note: At this phase, the code is in C++ through JNI and you should consider 
 it as beta only.  Still, it can serve anyone that wants to implement or 
 contribute to levitated merge. (Please be advised that levitated merge is 
 mostly suited to very fast networks) - 
 [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69]



[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-4819:
---

Attachment: MR-4819-bobby-trunk.txt

This patch should be fully functional.

I have included the work by Bikas to put the Job history file in a location 
that is deleted with the staging directory.  I have fixed a few bugs in the 
original: we were not registering with the RM correctly, and the Web App Proxy 
would return a 500 error if hit while recovery was happening.

I have manually tested this by having the AM exit/halt before, during, and 
after job commit.  I tested it with the job commit failing and succeeding.  
Everything appears to be working as expected.

I did not change JobImpl forcedState because adding in the transitions was more 
than I wanted to do right now.  I am happy to file a follow-up JIRA to make 
those changes if we want them.

I have also not added in the kill state.  Again, it looked a bit tricky because 
of the multithreading, and I would prefer to get something working in now and 
add that as part of a follow-up JIRA.

I talked with Kihwal Lee about the extra HDFS load for an empty file vs. a 
directory, and he said about the only extra load is the extra RPC call to close 
it, and because it is just two files per job I left it as is.  If you feel 
strongly about it I can fix it in a separate JIRA.

About the only thing that is left for this is integration with MAPREDUCE-4832.
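
For readers following along, the general idea of recording commit progress as small marker files, so that a later AM attempt can tell what the previous attempt reached, can be sketched as below. This is a simplified illustration under stated assumptions: the marker names are hypothetical, and it uses java.nio.file on a local directory instead of the HDFS staging directory and FileSystem API that the patch works with.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CommitMarkers {

    enum PriorOutcome { NOT_STARTED, IN_PROGRESS, SUCCEEDED, FAILED }

    // Hypothetical marker names; the real patch may use different ones.
    static final String STARTED = "COMMIT_STARTED";
    static final String SUCCESS = "COMMIT_SUCCESS";
    static final String FAIL = "COMMIT_FAIL";

    /** Decides what the previous attempt reached, based on which markers exist. */
    static PriorOutcome inspect(Path stagingDir) {
        if (Files.exists(stagingDir.resolve(SUCCESS))) {
            return PriorOutcome.SUCCEEDED;   // report success, do not rerun
        }
        if (Files.exists(stagingDir.resolve(FAIL))) {
            return PriorOutcome.FAILED;      // report failure, do not rerun
        }
        if (Files.exists(stagingDir.resolve(STARTED))) {
            return PriorOutcome.IN_PROGRESS; // crashed mid-commit: outcome unknown
        }
        return PriorOutcome.NOT_STARTED;     // safe to run or recover the job
    }

    public static void main(String[] args) {
        System.out.println(inspect(Paths.get(System.getProperty("java.io.tmpdir"))));
    }
}
```

Only the NOT_STARTED case is safe to rerun; the other three cases are the ones in which the client may already have been told a final status.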

 AM can rerun job after reporting final job status to the client
 ---

 Key: MAPREDUCE-4819
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mr-am
Affects Versions: 0.23.3, 2.0.1-alpha
Reporter: Jason Lowe
Assignee: Bikas Saha
Priority: Critical
 Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
 MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, 
 MR-4819-bobby-trunk.txt


 If the AM reports final job status to the client but then crashes before 
 unregistering with the RM then the RM can run another AM attempt.  Currently 
 AM re-attempts assume that the previous attempts did not reach a final job 
 state, and that causes the job to rerun (from scratch, if the output format 
 doesn't support recovery).
 Re-running the job when we've already told the client the final status of the 
 job is bad for a number of reasons.  If the job failed, it's confusing at 
 best since the client was already told the job failed but the subsequent 
 attempt could succeed.  If the job succeeded there could be data loss, as a 
 subsequent job launched by the client tries to consume the job's output as 
 input just as the re-attempt starts removing output files in preparation for 
 the output commit.



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543297#comment-13543297
 ] 

Hadoop QA commented on MAPREDUCE-4819:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12563151/MR-4819-bobby-trunk.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 2015 javac 
compiler warnings (more than the trunk's current 2014 warnings).

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy:

  
org.apache.hadoop.mapreduce.v2.app.commit.TestCommitterEventHandler
  
org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3189//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3189//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html
Javac warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3189//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3189//console

This message is automatically generated.

 AM can rerun job after reporting final job status to the client
 ---

 Key: MAPREDUCE-4819
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mr-am
Affects Versions: 0.23.3, 2.0.1-alpha
Reporter: Jason Lowe
Assignee: Bikas Saha
Priority: Critical
 Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
 MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, 
 MR-4819-bobby-trunk.txt


 If the AM reports final job status to the client but then crashes before 
 unregistering with the RM then the RM can run another AM attempt.  Currently 
 AM re-attempts assume that the previous attempts did not reach a final job 
 state, and that causes the job to rerun (from scratch, if the output format 
 doesn't support recovery).
 Re-running the job when we've already told the client the final status of the 
 job is bad for a number of reasons.  If the job failed, it's confusing at 
 best since the client was already told the job failed but the subsequent 
 attempt could succeed.  If the job succeeded there could be data loss, as a 
 subsequent job launched by the client tries to consume the job's output as 
 input just as the re-attempt starts removing output files in preparation for 
 the output commit.



[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-03 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated MAPREDUCE-4909:
-

Component/s: test

 TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
 

 Key: MAPREDUCE-4909
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: test
Affects Versions: 1-win
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal
 Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, 
 MAPREDUCE-4909.patch, MAPREDUCE-4909.patch


 TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
 appears to be a failure to delete in-use files via LocalFileSystem.delete 
 (RawLocalFileSystem.delete).
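 As background for the root cause named above: on Windows a file that still has an open stream generally cannot be deleted, so a test needs to close its streams before cleaning up. The sketch below shows that ordering with plain java.io. It is an illustration only, not the attached patch, and the class and method names are hypothetical.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class CloseBeforeDelete {

    /** Writes one byte, closes the stream, then deletes; returns whether the delete succeeded. */
    static boolean writeThenDelete(File file) {
        // try-with-resources closes the stream before the delete below runs.
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write(42);
        } catch (IOException e) {
            throw new IllegalStateException("could not write " + file, e);
        }
        // No handle is open at this point, so the delete is not blocked by this code.
        return file.delete();
    }

    public static void main(String[] args) {
        File f = new File(System.getProperty("java.io.tmpdir"), "close-before-delete.tmp");
        System.out.println("deleted: " + writeThenDelete(f));
    }
}
```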



[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-03 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated MAPREDUCE-4909:
-

Attachment: MAPREDUCE-4909.patch

Removed the comments altogether.

 TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
 

 Key: MAPREDUCE-4909
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 1-win
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal
 Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, 
 MAPREDUCE-4909.patch, MAPREDUCE-4909.patch


 TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
 appears to be a failure to delete in-use files via LocalFileSystem.delete 
 (RawLocalFileSystem.delete).



[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-03 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated MAPREDUCE-4909:
-

Affects Version/s: (was: 1-win)
   1.2.0

 TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
 

 Key: MAPREDUCE-4909
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: test
Affects Versions: 1.2.0
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal
 Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, 
 MAPREDUCE-4909.patch, MAPREDUCE-4909.patch


 TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
 appears to be a failure to delete in-use files via LocalFileSystem.delete 
 (RawLocalFileSystem.delete).



[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-03 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated MAPREDUCE-4909:
-

Fix Version/s: 1.2.0

 TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
 

 Key: MAPREDUCE-4909
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: test
Affects Versions: 1.2.0
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal
 Fix For: 1.2.0

 Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, 
 MAPREDUCE-4909.patch, MAPREDUCE-4909.patch


 TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
 appears to be a failure to delete in-use files via LocalFileSystem.delete 
 (RawLocalFileSystem.delete).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-03 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated MAPREDUCE-4909:
-

Target Version/s:   (was: 1-win)



[jira] [Commented] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-03 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543316#comment-13543316
 ] 

Suresh Srinivas commented on MAPREDUCE-4909:


+1 for the patch.



[jira] [Resolved] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-03 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas resolved MAPREDUCE-4909.


  Resolution: Fixed
Hadoop Flags: Reviewed

I committed the patch. Thank you Arpit!



[jira] [Commented] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-03 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543325#comment-13543325
 ] 

Suresh Srinivas commented on MAPREDUCE-4909:


I also committed this change to branch-1-win.



[jira] [Commented] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-03 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543328#comment-13543328
 ] 

Arpit Agarwal commented on MAPREDUCE-4909:
--

Thanks Suresh!



[jira] [Commented] (MAPREDUCE-4832) MR AM can get in a split brain situation

2013-01-03 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543334#comment-13543334
 ] 

Siddharth Seth commented on MAPREDUCE-4832:
---

Was talking to Hitesh offline about this patch. Is this needed at the moment? 
Seems like it's possible to avoid multiple AMs by tuning the 
AM_LIVENESS_INTERVAL (10 minutes by default) and MR_AM_TO_RM_WAIT_INTERVAL_MS 
(6 minutes by default). A new AM should only be started after the existing AM 
is done.
 
That said, this is definitely an interesting approach to fix the problem.
- Could add a check to ensure the window interval is greater than the AM-RM 
heartbeat.
- Does getClock() need to be part of the RMHeartbeatHandler? Looks like the 
AppContext can provide this - I think a couple of places use the AppContext, 
others use the RMHeartbeatHandler.

Recovery and restart are still WIP. I believe the MR_AM_TO_RM_WAIT_INTERVAL_MS 
will need to be looked at again in the context of recovery. This patch, or a sync 
via HDFS, seems more useful at that point?
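
The first suggestion in the list above (checking that the window interval exceeds the AM-RM heartbeat) could look roughly like the sketch below. This is only an illustration: the class, method, and parameter names are hypothetical and do not come from the patch or from any existing Hadoop configuration key.

```java
// Hypothetical validation helper; the names are illustrative only.
class HeartbeatWindowCheck {

  // True when the window is strictly longer than one heartbeat interval,
  // so at least one heartbeat can land inside it.
  static boolean isValid(long windowMs, long heartbeatIntervalMs) {
    return heartbeatIntervalMs > 0 && windowMs > heartbeatIntervalMs;
  }

  // Fails fast on a misconfiguration instead of letting the AM start with it.
  static void validate(long windowMs, long heartbeatIntervalMs) {
    if (!isValid(windowMs, heartbeatIntervalMs)) {
      throw new IllegalArgumentException(
          "window interval (" + windowMs + " ms) must be greater than the "
              + "AM-RM heartbeat interval (" + heartbeatIntervalMs + " ms)");
    }
  }
}
```

Failing at startup makes the misconfiguration visible immediately rather than as a silent loss of protection later.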

 MR AM can get in a split brain situation
 

 Key: MAPREDUCE-4832
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster
Affects Versions: 2.0.2-alpha, 0.23.5
Reporter: Robert Joseph Evans
Assignee: Jason Lowe
Priority: Critical
 Attachments: MAPREDUCE-4832.patch


 It is possible for a networking issue to happen where the RM thinks an AM has 
 gone down and launches a replacement, but the previous AM is still up and 
 running.  If the previous AM does not need any more resources from the RM it 
 could try to commit either tasks or jobs.  This could cause lots of problems 
 where the second AM finishes and tries to commit too.  This could result in 
 data corruption.  



[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-4819:
---

Attachment: MR-4819-bobby-trunk.txt

Fixes Findbugs issue, and test failures.  Both were test issues I had missed 
previously.

 AM can rerun job after reporting final job status to the client
 ---

 Key: MAPREDUCE-4819
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mr-am
Affects Versions: 0.23.3, 2.0.1-alpha
Reporter: Jason Lowe
Assignee: Bikas Saha
Priority: Critical
 Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
 MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, 
 MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt


 If the AM reports final job status to the client but then crashes before 
 unregistering with the RM then the RM can run another AM attempt.  Currently 
 AM re-attempts assume that the previous attempts did not reach a final job 
 state, and that causes the job to rerun (from scratch, if the output format 
 doesn't support recovery).
 Re-running the job when we've already told the client the final status of the 
 job is bad for a number of reasons.  If the job failed, it's confusing at 
 best since the client was already told the job failed but the subsequent 
 attempt could succeed.  If the job succeeded there could be data loss, as a 
 subsequent job launched by the client tries to consume the job's output as 
 input just as the re-attempt starts removing output files in preparation for 
 the output commit.



[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-4819:
---

Attachment: MR-4819-bobby-trunk.txt

With the latest comments on MAPREDUCE-4832 I removed the placeholder in here 
for code from it.  Now this should be able to stand alone, and be committed if 
deemed acceptable.

 AM can rerun job after reporting final job status to the client
 ---

 Key: MAPREDUCE-4819
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mr-am
Affects Versions: 0.23.3, 2.0.1-alpha
Reporter: Jason Lowe
Assignee: Bikas Saha
Priority: Critical
 Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
 MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, 
 MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt


 If the AM reports final job status to the client but then crashes before 
 unregistering with the RM then the RM can run another AM attempt.  Currently 
 AM re-attempts assume that the previous attempts did not reach a final job 
 state, and that causes the job to rerun (from scratch, if the output format 
 doesn't support recovery).
 Re-running the job when we've already told the client the final status of the 
 job is bad for a number of reasons.  If the job failed, it's confusing at 
 best since the client was already told the job failed but the subsequent 
 attempt could succeed.  If the job succeeded there could be data loss, as a 
 subsequent job launched by the client tries to consume the job's output as 
 input just as the re-attempt starts removing output files in preparation for 
 the output commit.



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543376#comment-13543376
 ] 

Hadoop QA commented on MAPREDUCE-4819:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
http://issues.apache.org/jira/secure/attachment/12563172/MR-4819-bobby-trunk.txt
against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 2015 javac 
compiler warnings (more than the trunk's current 2014 warnings).

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy:

  org.apache.hadoop.mapreduce.v2.app.TestRecovery
  org.apache.hadoop.mapreduce.v2.app.webapp.TestAMWebServicesJobs
  org.apache.hadoop.mapreduce.v2.app.webapp.TestAMWebServicesTasks
  org.apache.hadoop.mapreduce.v2.hs.webapp.TestHsWebServicesJobs
  org.apache.hadoop.mapreduce.v2.hs.webapp.TestHsWebServicesTasks
  org.apache.hadoop.mapreduce.v2.hs.webapp.TestHsWebServicesJobsQuery

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3190//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3190//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3190//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters

2013-01-03 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543377#comment-13543377
 ] 

Chris Nauroth commented on MAPREDUCE-4892:
--

+1 for the patch

I applied the patch locally and ran {{TestCombineFileInputFormat}}.  Code looks 
good.  I can't think of any other edge cases that this patch doesn't handle.


 CombineFileInputFormat node input split can be skewed on small clusters
 ---

 Key: MAPREDUCE-4892
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4892
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Bikas Saha
Assignee: Bikas Saha
 Fix For: 3.0.0

 Attachments: MAPREDUCE-4892.1.patch


 The CombineFileInputFormat split generation logic tries to group blocks by 
 node in order to create splits. It iterates through the nodes and creates 
 splits on them until there aren't enough blocks left on a node that can be 
 grouped into a valid split. If the first few nodes have a lot of blocks on 
 them then they can end up getting a disproportionately large share of the 
 total number of splits created. This can result in poor locality of maps. 
 This problem is likely to happen on small clusters where it's easier to create 
 a skew in the distribution of blocks on nodes.
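
The skew described above can be illustrated with a simplified simulation. This is a sketch under stated assumptions, not the CombineFileInputFormat code: it groups by block count instead of by byte size, ignores racks, and all names (GreedyNodeSplits, splitsPerNode, blocksPerSplit) are hypothetical. Nodes are visited in order, and each node greedily forms splits from the still-unassigned blocks that have a replica on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of greedy per-node split creation.
class GreedyNodeSplits {

  // nodeToBlocks maps each node to the ids of blocks that have a replica on it.
  // Iteration order of the map is the order in which nodes are visited.
  // Returns the number of splits created on each node.
  static Map<String, Integer> splitsPerNode(
      Map<String, List<Integer>> nodeToBlocks, int blocksPerSplit) {
    Set<Integer> assigned = new HashSet<>();
    Map<String, Integer> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<Integer>> e : nodeToBlocks.entrySet()) {
      // Blocks on this node that no earlier node has already claimed.
      List<Integer> free = new ArrayList<>();
      for (int b : e.getValue()) {
        if (!assigned.contains(b)) {
          free.add(b);
        }
      }
      // Keep creating splits until too few blocks remain for a full split.
      int splits = free.size() / blocksPerSplit;
      for (int i = 0; i < splits * blocksPerSplit; i++) {
        assigned.add(free.get(i));
      }
      result.put(e.getKey(), splits);
    }
    return result;
  }

  // Three nodes, six blocks, every block replicated on every node.
  static Map<String, Integer> demo() {
    Map<String, List<Integer>> m = new LinkedHashMap<>();
    m.put("node1", Arrays.asList(0, 1, 2, 3, 4, 5));
    m.put("node2", Arrays.asList(0, 1, 2, 3, 4, 5));
    m.put("node3", Arrays.asList(0, 1, 2, 3, 4, 5));
    return splitsPerNode(m, 2);
  }
}
```

In the demo, the first node visited claims all three splits and the other two nodes get none, which is the kind of imbalance a small cluster with full replication can produce.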



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543380#comment-13543380
 ] 

Robert Joseph Evans commented on MAPREDUCE-4819:


I am investigating the test failures.  I think they are unrelated to this 
patch, because they work just fine for me when I run them without up-merging to 
the latest trunk.



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543383#comment-13543383
 ] 

Robert Joseph Evans commented on MAPREDUCE-4819:


For some reason all of the web service tests were failing with out-of-memory 
errors that I have not been able to reproduce myself yet.  I have not been able 
to reproduce the TestRecovery failures either, but I did not see any OOMs 
there.



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543388#comment-13543388
 ] 

Hadoop QA commented on MAPREDUCE-4819:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
http://issues.apache.org/jira/secure/attachment/12563176/MR-4819-bobby-trunk.txt
against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 2015 javac 
compiler warnings (more than the trunk's current 2014 warnings).

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3191//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3191//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3191//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-4832) MR AM can get in a split brain situation

2013-01-03 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543390#comment-13543390
 ] 

Bikas Saha commented on MAPREDUCE-4832:
---

Independent of this change, this looks like a problem that needs to be solved 
in the platform rather than in the AM. Something like making sure the NM 
maintains an expire time on its containers and terminates them when the expire 
time is reached. The expire time is extended whenever the NM heartbeats with 
the RM. So if the NM loses contact with the RM, or if the RM thinks the AM 
should not be running anymore on that NM, then the expire time will not be 
extended. The RM starts retries after the expire time has elapsed. The logic is 
similar but self-contained within the platform. AMs could do something similar 
for their containers, thus providing automatic garbage collection when an AM 
crashes.
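
The expire-time idea above amounts to a lease that is renewed on each successful heartbeat. A minimal sketch, assuming time is passed in explicitly so the behaviour is easy to follow; ContainerLease and its methods are hypothetical names, not YARN APIs.

```java
// Hypothetical lease model; not an existing NodeManager class.
class ContainerLease {
  private final long leaseMs;
  private long expiresAtMs;

  ContainerLease(long nowMs, long leaseMs) {
    this.leaseMs = leaseMs;
    this.expiresAtMs = nowMs + leaseMs;
  }

  // Called when a heartbeat with the RM succeeds: pushes the expiry forward.
  void renew(long nowMs) {
    expiresAtMs = nowMs + leaseMs;
  }

  // Once this returns true the holder should terminate the container.
  boolean isExpired(long nowMs) {
    return nowMs >= expiresAtMs;
  }
}
```

If heartbeats stop, nothing calls renew, so the container is cleaned up locally without needing any message from the RM.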



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543394#comment-13543394
 ] 

Robert Joseph Evans commented on MAPREDUCE-4819:


OK, looking at it, all of the failures appear to be associated with the hadoop4 
machine. I will work with tgraves to see if we can figure out what is happening.



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543406#comment-13543406
 ] 

Bikas Saha commented on MAPREDUCE-4819:
---

When the staging dir exists but the commitStarted marker does not, it means 
that it's a retry that should continue as normal, right?
If yes, shouldn't copyHistory be set to false for the above case? It looks like 
copyHistory should be set to true only inside the following block. Only when 
the commit has started do we need to copy history and end. In other cases, we 
should not copy history. Change the initial value of copyHistory to false and 
set it when needed?
{code}
+  } else if (commitStarted) {
{code}

Typos: errorHappendShutDown, NoopEventHanlder

If we change this code to create a new file or fail, then the AM knows when it 
has lost its race to commit. Does this provide a simpler fix for MAPREDUCE-4832? 
When AMs try to initiate commit, only the first one manages to write the 
commit_start file in HDFS, so racing AMs will fail after the first one 
succeeds. The marker still exists for the purpose of signalling start of commit 
(i.e. for this jira). It should not matter which AM commits the result because 
the computation is deterministic. The AM that failed to commit could check/wait 
for the end-of-commit marker in order to make sure that the last retry succeeds 
(if that is necessary).
{code}
+private void touchz(Path p) throws IOException {
+  fs.create(p).close();
+}
{code}
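
A minimal sketch of the create-or-fail idea above, using java.nio on the local filesystem as a stand-in for HDFS. Files.createFile fails if the file already exists, so only one caller can win. On HDFS the analogous call would presumably be FileSystem.create with overwrite disabled, but that is an assumption and is not what the attached patch does. CommitMarker, tryClaim, and the COMMIT_STARTED file name are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical marker helper; local-filesystem stand-in for the HDFS case.
class CommitMarker {

  // Returns true if this caller created the marker (it won the race),
  // false if the marker already existed (another attempt started the commit).
  static boolean tryClaim(Path marker) {
    try {
      Files.createFile(marker); // fails if the file already exists
      return true;
    } catch (FileAlreadyExistsException e) {
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Two attempts race for the same marker; only the first one succeeds.
  static boolean[] demo() {
    try {
      Path dir = Files.createTempDirectory("staging-sketch");
      Path marker = dir.resolve("COMMIT_STARTED");
      boolean first = tryClaim(marker);
      boolean second = tryClaim(marker);
      Files.delete(marker);
      Files.delete(dir);
      return new boolean[] {first, second};
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

An attempt that gets false back knows a commit has already been started elsewhere and can stop or wait instead of committing again.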




[jira] [Commented] (MAPREDUCE-4832) MR AM can get in a split brain situation

2013-01-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543414#comment-13543414
 ] 

Jason Lowe commented on MAPREDUCE-4832:
---

bq. Seems like it's possible to avoid multiple AMs by tuning the 
AM_LIVENESS_INTERVAL (10 minutes by default) and MR_AM_TO_RM_WAIT_INTERVAL_MS 
(6 minutes by default). A new AM should only be started after the existing AM 
is done.

That *almost* solves the problem, but there are some corner cases left 
unsolved.  For example:

1) AM is running on a node whose NM suddenly declares itself UNHEALTHY via 
health-check script
2) RM removes node from active nodes and kills all containers running on that 
node
3) Network cut occurs.  NM did not receive notification to kill the containers 
and/or NM crashes.  AM is unable to communicate to RM.
4) RM now thinks all containers are dead on that node, proceeds to relaunch a 
new AM attempt
5) Now for the next 6 minutes (or whatever the expiry interval is for the AM to 
RM) we have two app attempts running simultaneously.  If the old AM attempt is 
able to reach HDFS or whatever it needs to commit, we could end up committing 
twice.

bq. Could add a check to ensure the window interval is greater than the AM-RM 
heartbeat.

Actually that's not strictly necessary.  The code can function correctly even 
if the commit window is smaller than the heartbeat interval.  For example, job 
commit is woken up when a fresh heartbeat arrives, and task commit polls 
periodically for whether the heartbeat has occurred recently.  It's not 
mandatory that the interval between heartbeats is smaller than the commit 
window for a commit to proceed, but it is more likely a commit operation will 
be stalled waiting for a fresh heartbeat if configured that way.
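
To make the relationship between the commit window and the heartbeat concrete, here is a rough sketch under stated assumptions. The class HeartbeatGate and its methods are hypothetical names, not the patch's API, and time is passed in explicitly to keep the example deterministic.

```java
// Hypothetical sketch; class and method names are illustrative, not from the patch.
final class HeartbeatGate {
  private final long commitWindowMs;
  private long lastHeartbeatMs = -1;  // -1 means no heartbeat has been seen yet

  HeartbeatGate(long commitWindowMs) {
    this.commitWindowMs = commitWindowMs;
  }

  /** Records the time of a successful heartbeat to the RM. */
  synchronized void onHeartbeat(long nowMs) {
    lastHeartbeatMs = nowMs;
  }

  /** A commit may proceed only if a heartbeat happened within the window. */
  synchronized boolean mayCommit(long nowMs) {
    return lastHeartbeatMs >= 0 && nowMs - lastHeartbeatMs <= commitWindowMs;
  }
}
```

If the window is shorter than the heartbeat interval, mayCommit is still true right after each heartbeat, but it turns false before the next one arrives. A caller that polls is therefore more likely to stall until a fresh heartbeat, which is the trade-off described above.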

bq. Does getClock() need to be part of the RMHeartbeatHandler. Looks like the 
AppContext can provide this

I put it in the interface so the caller can access the same clock used to 
timestamp the heartbeat in case it could be different from the AppContext clock 
or if the caller didn't have access to the AppContext.  But that's probably 
never going to be a real concern, so I'll take it out.

And to address Bikas' comment:
bq. Independent of this change, this looks like a problem that needs to be 
solved in the platform than in the AM.

We might be able to close all the corner cases in the framework.  For example, 
the above scenario could be solved if the RM were to wait for confirmation from 
the NM of the containers actually expiring before proceeding to launch another 
attempt.  If the NM is unreachable before the confirmation is received, it 
could wait for the AM expiry interval before launching a new attempt.  It could 
mean that we wait a lot longer than necessary, but at least we'd know with 
confidence that two attempts aren't running simultaneously.
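
As a sketch of that idea only: the helper below computes the earliest time the RM could safely launch a new attempt. RelaunchPolicy and its parameters are hypothetical, and measuring the expiry interval from the last AM contact is an assumption, since the comment does not say where the wait starts.

```java
// Hypothetical sketch of the proposed RM-side rule; not actual RM code.
final class RelaunchPolicy {
  private RelaunchPolicy() {}

  /**
   * Earliest time (ms) at which a new AM attempt may be launched.
   * Assumption: the expiry interval is counted from the last contact with the old AM.
   */
  static long earliestRelaunchMs(boolean nmConfirmedKill, long nowMs,
                                 long lastAmContactMs, long amExpiryMs) {
    if (nmConfirmedKill) {
      return nowMs;  // the NM confirmed the old containers are gone
    }
    // No confirmation: wait out the AM expiry interval before relaunching.
    return Math.max(nowMs, lastAmContactMs + amExpiryMs);
  }
}
```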


 MR AM can get in a split brain situation
 

 Key: MAPREDUCE-4832
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster
Affects Versions: 2.0.2-alpha, 0.23.5
Reporter: Robert Joseph Evans
Assignee: Jason Lowe
Priority: Critical
 Attachments: MAPREDUCE-4832.patch


 It is possible for a networking issue to happen where the RM thinks an AM has 
 gone down and launches a replacement, but the previous AM is still up and 
 running.  If the previous AM does not need any more resources from the RM it 
 could try to commit either tasks or jobs.  This could cause lots of problems 
 where the second AM finishes and tries to commit too.  This could result in 
 data corruption.  



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543425#comment-13543425
 ] 

Jason Lowe commented on MAPREDUCE-4819:
---

bq. If we change this code to create new file or fail then AM knows when it has 
lost its race to commit. Does this provide a simpler fix for MAPREDUCE-4832?

If an app attempt sees the file, how does it even know whether there's an 
active race that was lost?  The other AM could have simply crashed mid-commit.  
The losing AM could just assume that's the case and unregister from the RM with 
a FAILED status assuming job commit failed.  (Or maybe wait for some 
configurable timeout just in case.)

However, this would only cover job commit; two racing app attempts could 
still commit output for tasks simultaneously.  That could be bad if, for 
example, the old attempt is re-committing output for a fetch-failure map task 
while the second attempt is trying to recover.  Task output could be lost in 
that case.  MAPREDUCE-4832 prevents two racing app attempts from committing the 
same task output, as at most one will be active and allowed to commit.
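
A rough sketch of the "wait for some configurable timeout" option for the losing attempt, assuming an end-of-commit marker file exists as suggested earlier in the thread. The local file system stands in for HDFS, and LosingAttempt, Outcome and awaitCommitEnd are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: what an attempt might do after finding the start marker.
final class LosingAttempt {
  enum Outcome { COMMITTED_BY_OTHER, ASSUME_FAILED }

  private LosingAttempt() {}

  /** Polls for the end-of-commit marker until it appears or the timeout expires. */
  static Outcome awaitCommitEnd(Path endMarker, long timeoutMs, long pollMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (Files.exists(endMarker)) {
        return Outcome.COMMITTED_BY_OTHER;  // the other attempt finished the commit
      }
      if (System.nanoTime() - deadline >= 0) {
        return Outcome.ASSUME_FAILED;       // treat as a crash mid-commit
      }
      Thread.sleep(pollMs);
    }
  }
}
```

On ASSUME_FAILED the attempt would unregister from the RM with a FAILED status. As noted above, this covers only job commit and does nothing for racing task commits.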



[jira] [Updated] (MAPREDUCE-3685) There are some bugs in implementation of MergeManager

2013-01-03 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated MAPREDUCE-3685:


Priority: Critical  (was: Minor)

 There are some bugs in implementation of MergeManager
 -

 Key: MAPREDUCE-3685
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 0.23.1
Reporter: anty.rao
Assignee: anty
Priority: Critical
 Attachments: MAPREDUCE-3685-branch-0.23.1.patch, 
 MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch, 
 MAPREDUCE-3685.patch






[jira] [Updated] (MAPREDUCE-3685) There are some bugs in implementation of MergeManager

2013-01-03 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated MAPREDUCE-3685:


Target Version/s: 2.0.0-alpha, trunk, 0.23.6  (was: 2.0.0-alpha, trunk)



[jira] [Updated] (MAPREDUCE-3685) There are some bugs in implementation of MergeManager

2013-01-03 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated MAPREDUCE-3685:


Status: Patch Available  (was: Open)

Submitting patch on behalf of Anty!



[jira] [Updated] (MAPREDUCE-4832) MR AM can get in a split brain situation

2013-01-03 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4832:
--

Attachment: MAPREDUCE-4832.patch

Updated patch to remove getClock() from RMHeartbeatHandler interface.

 MR AM can get in a split brain situation
 

 Key: MAPREDUCE-4832
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster
Affects Versions: 2.0.2-alpha, 0.23.5
Reporter: Robert Joseph Evans
Assignee: Jason Lowe
Priority: Critical
 Attachments: MAPREDUCE-4832.patch, MAPREDUCE-4832.patch


 It is possible for a networking issue to happen where the RM thinks an AM has 
 gone down and launches a replacement, but the previous AM is still up and 
 running.  If the previous AM does not need any more resources from the RM it 
 could try to commit either tasks or jobs.  This could cause lots of problems 
 where the second AM finishes and tries to commit too.  This could result in 
 data corruption.  



[jira] [Commented] (MAPREDUCE-3685) There are some bugs in implementation of MergeManager

2013-01-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543460#comment-13543460
 ] 

Hadoop QA commented on MAPREDUCE-3685:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12522248/MAPREDUCE-3685-branch-0.23.1.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3193//console

This message is automatically generated.



[jira] [Assigned] (MAPREDUCE-2286) ASF mapreduce

2013-01-03 Thread Miguel Ochoa (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miguel Ochoa reassigned MAPREDUCE-2286:
---

Assignee: Miguel Ochoa

 ASF mapreduce
 -

 Key: MAPREDUCE-2286
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2286
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: benchmarks, client, contrib/streaming, jobtracker, pipes
 Environment: 2.2 Commodore
Reporter: Miguel Ochoa
Assignee: Miguel Ochoa
Priority: Trivial
   Original Estimate: 50h
  Remaining Estimate: 50h

 This sub-net ensures versions in description, however projects or 
 manufacturing will have to be in working conditioning in the time of unknown 
 versions.



[jira] [Commented] (MAPREDUCE-4832) MR AM can get in a split brain situation

2013-01-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543475#comment-13543475
 ] 

Hadoop QA commented on MAPREDUCE-4832:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12563197/MAPREDUCE-4832.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3192//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3192//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543517#comment-13543517
 ] 

Siddharth Seth commented on MAPREDUCE-4819:
---

bq. I think it already will. We are not opening the file for append, we are 
trying to create it.
fs.create(Path) overwrites by default instead of throwing an exception. 
There's another form which does not overwrite. I don't think this is a problem 
once 4832 goes in.

bq. I have also not added in the kill state. Again it looked a bit tricky 
because of the multithreading and I would prefer to get something working in 
now and add that as part of a follow up JIRA. 
OK. This seems like it will be easier if we rely on the history file as the 
commit log instead of the three or more individual files.

RPC clients not being able to communicate with the AM / history (or getting 
alternate states) after having seen a SUCCESS state seems to be independent of 
this patch. Separate jira.

This seems ok for now since it's gotten some attention and has been tried out. 
I think handling all of this via the CommitHandler is a cleaner approach, and 
we can move to that at a later point.




[jira] [Updated] (MAPREDUCE-4904) TestMultipleLevelCaching failed in barnch-1

2013-01-03 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-4904:
--

Attachment: MAPREDUCE-4904-v2.patch

Incorporated Luke's comments by adding comments on the fall-through in the switch case.

 TestMultipleLevelCaching failed in barnch-1
 ---

 Key: MAPREDUCE-4904
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4904
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: test
Affects Versions: 1.2.0
Reporter: meng gong
Assignee: meng gong
 Fix For: 1.2.0

 Attachments: MAPREDUCE-4904.patch, MAPREDUCE-4904-v2.patch


 TestMultipleLevelCaching fails:
 {noformat}
 Testcase: testMultiLevelCaching took 30.406 sec
 FAILED
 Number of local maps expected:0 but was:1
 junit.framework.AssertionFailedError: Number of local maps expected:0 but was:1
 at org.apache.hadoop.mapred.TestRackAwareTaskPlacement.launchJobAndTestCounters(TestRackAwareTaskPlacement.java:78)
 at org.apache.hadoop.mapred.TestMultipleLevelCaching.testCachingAtLevel(TestMultipleLevelCaching.java:113)
 at org.apache.hadoop.mapred.TestMultipleLevelCaching.testMultiLevelCaching(TestMultipleLevelCaching.java:69)
 {noformat}



[jira] [Updated] (MAPREDUCE-2286) ASF mapreduce

2013-01-03 Thread Miguel Ochoa (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miguel Ochoa updated MAPREDUCE-2286:


Attachment: 01 - Mutual NDA - 2010.doc

 ASF mapreduce
 -

 Key: MAPREDUCE-2286
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2286
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: benchmarks, client, contrib/streaming, jobtracker, pipes
 Environment: 2.2 Commodore
Reporter: Miguel Ochoa
Assignee: Miguel Ochoa
Priority: Trivial
 Attachments: 01 - Mutual NDA - 2010.doc

   Original Estimate: 50h
  Remaining Estimate: 50h

 This sub-net ensures versions in description, however projects or 
 manufacturing will have to be in working conditioning in the time of unknown 
 versions.



[jira] [Updated] (MAPREDUCE-4904) TestMultipleLevelCaching failed in barnch-1

2013-01-03 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-4904:
--

Attachment: MAPREDUCE-4904-v2.patch

Used --no-prefix to generate the new v2 patch.

 TestMultipleLevelCaching failed in barnch-1
 ---

 Key: MAPREDUCE-4904
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4904
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: test
Affects Versions: 1.2.0
Reporter: meng gong
Assignee: meng gong
 Fix For: 1.2.0

 Attachments: MAPREDUCE-4904.patch, MAPREDUCE-4904-v2.patch, 
 MAPREDUCE-4904-v2.patch


 TestMultipleLevelCaching fails:
 {noformat}
 Testcase: testMultiLevelCaching took 30.406 sec
 FAILED
 Number of local maps expected:0 but was:1
 junit.framework.AssertionFailedError: Number of local maps expected:0 but was:1
 at org.apache.hadoop.mapred.TestRackAwareTaskPlacement.launchJobAndTestCounters(TestRackAwareTaskPlacement.java:78)
 at org.apache.hadoop.mapred.TestMultipleLevelCaching.testCachingAtLevel(TestMultipleLevelCaching.java:113)
 at org.apache.hadoop.mapred.TestMultipleLevelCaching.testMultiLevelCaching(TestMultipleLevelCaching.java:69)
 {noformat}



[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated MAPREDUCE-4819:
-

Priority: Blocker  (was: Critical)

 AM can rerun job after reporting final job status to the client
 ---

 Key: MAPREDUCE-4819
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mr-am
Affects Versions: 0.23.3, 2.0.1-alpha
Reporter: Jason Lowe
Assignee: Bikas Saha
Priority: Blocker
 Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
 MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, 
 MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt


 If the AM reports final job status to the client but then crashes before 
 unregistering with the RM then the RM can run another AM attempt.  Currently 
 AM re-attempts assume that the previous attempts did not reach a final job 
 state, and that causes the job to rerun (from scratch, if the output format 
 doesn't support recovery).
 Re-running the job when we've already told the client the final status of the 
 job is bad for a number of reasons.  If the job failed, it's confusing at 
 best since the client was already told the job failed but the subsequent 
 attempt could succeed.  If the job succeeded there could be data loss, as a 
 subsequent job launched by the client tries to consume the job's output as 
 input just as the re-attempt starts removing output files in preparation for 
 the output commit.



[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-03 Thread Avner BenHanoch (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543629#comment-13543629
 ] 

Avner BenHanoch commented on MAPREDUCE-4049:


Hi Alejandro,

re #2, my intuition is that supporting one external shuffle service (in addition 
to the built-in shuffle service) is the keep-it-simple solution.  I feel that 
the use case of N providers is theoretical.  Hence, I prefer to keep the conf 
and code simple.

Avner

 plugin for generic shuffle service
 --

 Key: MAPREDUCE-4049
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: performance, task, tasktracker
Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0
Reporter: Avner BenHanoch
Assignee: Avner BenHanoch
  Labels: merge, plugin, rdma, shuffle
 Fix For: 3.0.0

 Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, 
 mapreduce-4049.patch


 Support a generic shuffle service as a set of two plugins: ShuffleProvider & 
 ShuffleConsumer.
 This will satisfy the following needs:
 # Better shuffle and merge performance. For example: we are working on a 
 shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, 
 or Infiniband) instead of using the current HTTP shuffle. Based on the fast 
 RDMA shuffle, the plugin can also utilize a suitable merge approach during 
 the intermediate merges, hence getting much better performance.
 # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden 
 dependency of NodeManager with a specific version of mapreduce shuffle 
 (currently targeted to 0.24.0).
 References:
 # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu 
 from Auburn University with others, 
 [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
 # I am attaching 2 documents with suggested Top Level Design for both plugins 
 (currently, based on 1.0 branch)
 # I am providing a link for downloading UDA - Mellanox's open source plugin 
 that implements generic shuffle service using RDMA and levitated merge.  
 Note: At this phase, the code is in C++ through JNI and you should consider 
 it as beta only.  Still, it can serve anyone that wants to implement or 
 contribute to levitated merge. (Please be advised that levitated merge is 
 mostly suited to very fast networks) - 
 [http://www.mellanox.com/content/pages.php?pg=products_dynproduct_family=144menu_section=69]



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-03 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543630#comment-13543630
 ] 

Bikas Saha commented on MAPREDUCE-4819:
---

Sid, how about creating some jiras so that your ideas don't get lost as 
comments?



[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-03 Thread Alex Rosenbaum (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543632#comment-13543632
 ] 

Alex Rosenbaum commented on MAPREDUCE-4049:
---

I’ll be on vacation from Jan 6 to 13 (returning on Monday the 14th).
Redirecting issues:
· VMA - Olga Shern, ol...@mellanox.com
· UDA - Avner Ben Hanoch, avn...@mellanox.com

Regards,

Alex Rosenbaum
Director R&D Application Acceleration
Mellanox Technologies
13 Zarhin st, Raanana, Israel
+972 (74) 712-9215

Follow us on Twitter (http://twitter.com/mellanoxtech) and 
Facebook (http://www.facebook.com/pages/Mellanox-Technologies/223164879116)


 plugin for generic shuffle service
 --

 Key: MAPREDUCE-4049
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: performance, task, tasktracker
Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0
Reporter: Avner BenHanoch
Assignee: Avner BenHanoch
  Labels: merge, plugin, rdma, shuffle
 Fix For: 3.0.0

 Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, 
 mapreduce-4049.patch


 Support a generic shuffle service as a set of two plugins: ShuffleProvider & 
 ShuffleConsumer.
 This will satisfy the following needs:
 # Better shuffle and merge performance. For example, we are working on a 
 shuffle plugin that performs shuffle over RDMA in fast networks (10GbE, 40GbE, 
 or InfiniBand) instead of using the current HTTP shuffle. Based on the fast 
 RDMA shuffle, the plugin can also use a suitable merge approach during 
 the intermediate merges, and hence get much better performance.
 # Satisfy MAPREDUCE-3060 - a generic shuffle service that avoids the hidden 
 dependency of the NodeManager on a specific version of the mapreduce shuffle 
 (currently targeted to 0.24.0).
 References:
 # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu 
 from Auburn University with others, 
 [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
 # I am attaching 2 documents with a suggested Top Level Design for both plugins 
 (currently based on the 1.0 branch).
 # I am providing a link for downloading UDA - Mellanox's open source plugin 
 that implements a generic shuffle service using RDMA and levitated merge.  
 Note: At this phase, the code is in C++ through JNI and you should consider 
 it as beta only.  Still, it can serve anyone who wants to implement or 
 contribute to levitated merge. (Please be advised that levitated merge is 
 mostly suited to very fast networks.) - 
 [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69]
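
 The provider/consumer split described above can be illustrated with a small 
 language-neutral sketch. It only shows the shape of the idea: a provider 
 that serves map output segments, a consumer that fetches and merges them, 
 and a configuration lookup that picks the pair. The method names, the 
 in-memory implementations, the PLUGINS registry and the "shuffle.plugin" 
 key are all assumptions made for illustration; they are not taken from the 
 attached patches or from Hadoop.

```python
import heapq
from abc import ABC, abstractmethod


class ShuffleProvider(ABC):
    """Serves map output segments (hypothetical interface)."""

    @abstractmethod
    def get_segment(self, map_id, reduce_id):
        """Return the sorted (key, value) pairs one map produced for one reducer."""


class ShuffleConsumer(ABC):
    """Fetches segments for one reducer and merges them (hypothetical interface)."""

    @abstractmethod
    def fetch(self, provider, map_ids, reduce_id):
        """Return all pairs for reduce_id, merged in key order."""


class InMemoryProvider(ShuffleProvider):
    def __init__(self, outputs):
        # outputs maps (map_id, reduce_id) to a sorted list of (key, value) pairs
        self._outputs = outputs

    def get_segment(self, map_id, reduce_id):
        return self._outputs.get((map_id, reduce_id), [])


class MergingConsumer(ShuffleConsumer):
    def fetch(self, provider, map_ids, reduce_id):
        segments = [provider.get_segment(m, reduce_id) for m in map_ids]
        # heapq.merge performs a k-way merge of already-sorted inputs
        return list(heapq.merge(*segments))


# Hypothetical registry: a faster transport would register its own pair here.
PLUGINS = {"in-memory": (InMemoryProvider, MergingConsumer)}


def load_plugins(conf):
    """Pick the provider/consumer classes named by the (hypothetical) config key."""
    return PLUGINS[conf.get("shuffle.plugin", "in-memory")]
```

 Because the rest of the framework would only see the two interfaces, a 
 different transport or merge strategy could be swapped in by registering 
 another pair and changing the configuration value.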



[jira] [Issue Comment Deleted] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-03 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated MAPREDUCE-4049:
---

Comment: was deleted

(was: I’ll be on vacation from Jan 6 to 13 (returning on Monday the 14th).
Redirecting issues:
· VMA - Olga Shern, ol...@mellanox.com
· UDA - Avner Ben Hanoch, avn...@mellanox.com

Regards,

Alex Rosenbaum
Director R&D Application Acceleration
Mellanox Technologies
13 Zarhin st, Raanana, Israel
+972 (74) 712-9215

Follow us on Twitter (http://twitter.com/mellanoxtech) and 
Facebook (http://www.facebook.com/pages/Mellanox-Technologies/223164879116)
)

 plugin for generic shuffle service
 --

 Key: MAPREDUCE-4049
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: performance, task, tasktracker
Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0
Reporter: Avner BenHanoch
Assignee: Avner BenHanoch
  Labels: merge, plugin, rdma, shuffle
 Fix For: 3.0.0

 Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, 
 mapreduce-4049.patch


 Support a generic shuffle service as a set of two plugins: ShuffleProvider & 
 ShuffleConsumer.
 This will satisfy the following needs:
 # Better shuffle and merge performance. For example, we are working on a 
 shuffle plugin that performs shuffle over RDMA in fast networks (10GbE, 40GbE, 
 or InfiniBand) instead of using the current HTTP shuffle. Based on the fast 
 RDMA shuffle, the plugin can also use a suitable merge approach during 
 the intermediate merges, and hence get much better performance.
 # Satisfy MAPREDUCE-3060 - a generic shuffle service that avoids the hidden 
 dependency of the NodeManager on a specific version of the mapreduce shuffle 
 (currently targeted to 0.24.0).
 References:
 # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu 
 from Auburn University with others, 
 [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
 # I am attaching 2 documents with a suggested Top Level Design for both plugins 
 (currently based on the 1.0 branch).
 # I am providing a link for downloading UDA - Mellanox's open source plugin 
 that implements a generic shuffle service using RDMA and levitated merge.  
 Note: At this phase, the code is in C++ through JNI and you should consider 
 it as beta only.  Still, it can serve anyone who wants to implement or 
 contribute to levitated merge. (Please be advised that levitated merge is 
 mostly suited to very fast networks.) - 
 [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69]
