[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542809#comment-13542809 ]

Alejandro Abdelnur commented on MAPREDUCE-4819:
---

bq. For Job End notification. This is hitting a URL to indicate that the job has finished and if it has finished successfully or in error. I do need to do some integration tests with Oozie to validate that it can handle being informed more than once without having any real problems.

Oozie handles duplicate notifications correctly doing a NOP.

AM can rerun job after reporting final job status to the client
---

Key: MAPREDUCE-4819
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am
Affects Versions: 0.23.3, 2.0.1-alpha
Reporter: Jason Lowe
Assignee: Bikas Saha
Priority: Critical
Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt

If the AM reports final job status to the client but then crashes before unregistering with the RM, the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery).

Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best, since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
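To illustrate what "handling a duplicate notification as a NOP" can look like, here is a minimal sketch. It is not Oozie's code: the class `JobEndNotificationReceiver` and its methods are hypothetical names invented for this example. It only shows one way a receiver might record the first final status per job ID and ignore later deliveries.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical receiver for job-end callbacks; not part of Oozie or Hadoop. */
public class JobEndNotificationReceiver {
    // job id -> first final status that was reported for it
    private final ConcurrentMap<String, String> finalStatusByJob = new ConcurrentHashMap<>();

    /**
     * Records the final status the first time a job is reported.
     *
     * @return true if this call recorded the status, false if it was a duplicate (NOP)
     */
    public boolean onJobEnd(String jobId, String status) {
        // putIfAbsent is atomic, so concurrent duplicate deliveries cannot both "win".
        return finalStatusByJob.putIfAbsent(jobId, status) == null;
    }

    /** Returns the recorded final status, or null if none has been reported. */
    public String statusOf(String jobId) {
        return finalStatusByJob.get(jobId);
    }
}
```

With this shape, a second notification for the same job leaves the recorded status unchanged, which is the property the AM relies on if it may notify more than once.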
[jira] [Commented] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task
[ https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542810#comment-13542810 ]

Alejandro Abdelnur commented on MAPREDUCE-2217:
---

+1. Nice job forcing the problem to verify the fix.

The expire launching task should cover the UNASSIGNED task
---

Key: MAPREDUCE-2217
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: jobtracker
Affects Versions: 0.23.0, 1.1.1
Reporter: Scott Chen
Assignee: Karthik Kambatla
Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, MR-2217.patch, MR-2217.patch

The ExpireLaunchingTask thread kills tasks that have been scheduled but have not responded. Currently, if a task is scheduled on a tasktracker and for some reason the tasktracker cannot put it into RUNNING, the task will just hang in the UNASSIGNED status and the JobTracker will keep waiting for it. JobTracker.ExpireLaunchingTask should be able to kill this task.
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542814#comment-13542814 ]

Siddharth Seth commented on MAPREDUCE-4819:
---

Bobby, Jason,

Along with trying to ensure that a commit does not happen twice, I think there is value in committing the job history file before changing job status to SUCCESS - primarily for the RPC to behave consistently. Otherwise a client can see temporary final states if the AM crashes during the history file persist, and it won't be able to retrieve counters or other job status till the next AM attempt. This does have the drawback of a small performance hit though - and it also makes job history a critical part of a job.

Using separate files for marking success / failure - am guessing this is to have a smaller chance of a failing persist, as compared to persisting events via the HistoryFile, which may already have a backlog of events?

Wondering if it's possible to achieve the same checks via the CommitterEventHandler instead of checking in the MRAppMaster class, i.e. follow the regular recovery path - except the CommitHandler emits success / failed / abort events depending on the presence of these files / (history events).

Alternately, the current implementation could be simplified by using a custom RMCommunicator which does not depend on JobImpl, i.e. the history copier and an RMCommunicator to unregister from the RM.

Comments on the current patch:
- If the last AM attempt were to crash - data exists since the _SUCCESS_ file exists, but RPC will not see SUCCESS.
- While the new AM is running - it will not be able to handle status, counter etc. requests. This seems a little problematic if a success has been reported over RPC from the previous AM. Since this AM is dealing with the history file - could it possibly return information from the history file? History commit before SUCCESS may help with the previous 2 points.
- If the recovered AppMaster is not the last retry - looks like the RM unregistration will not happen. (isLastAMRetry)
- Is a KILLED status also required? KILLED during commit should not be reported as FAILED.
- The check for commitSuccess / commitFailure in the AM - the failure check can happen before the success check (low chance, but a success file could be created followed by an RPC failure).
- CommitEventHandler.touchz could throw an exception if the file already exists - to prevent lost AMs from committing. (Maybe not required after MAPREDUCE-4832?)
- historyService creation - can move into the common if (copyHistory) check.
- Don't think AMStartedEvent can be ignored - the history server will have no info about past AMs. I think only the current AM needs to be ignored.

Wondering if it's possible to use HDFS dirs and timestamps to co-ordinate between an active AM and lost AMs. Also, are HDFS dir operations cheaper than file create operations (NN only / NN + DN)? Not sure if mkdir / 0 length file creation are NN only ops.
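Two of the review points above (check for the failure marker before the success marker, and have the marker creation fail if the file already exists) can be sketched as follows. This is only an illustration under stated assumptions: it uses `java.nio.file` on the local filesystem as a stand-in for HDFS, and the class name `CommitMarkers` and the marker names `COMMIT_SUCCESS` / `COMMIT_FAIL` are hypothetical, not the names used in the patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch; the local filesystem stands in for HDFS here. */
public class CommitMarkers {
    public enum Outcome { FAILED, SUCCEEDED, UNKNOWN }

    // Marker names are assumptions made for this example only.
    public static final String SUCCESS_MARKER = "COMMIT_SUCCESS";
    public static final String FAILURE_MARKER = "COMMIT_FAIL";

    /** Creates a marker, failing if it already exists so a second writer is rejected. */
    public static void touchExclusive(Path dir, String marker) throws IOException {
        // Files.createFile throws FileAlreadyExistsException when the file is present.
        Files.createFile(dir.resolve(marker));
    }

    /** Reports what a previous attempt recorded, checking failure before success. */
    public static Outcome previousOutcome(Path dir) {
        if (Files.exists(dir.resolve(FAILURE_MARKER))) {
            return Outcome.FAILED;
        }
        if (Files.exists(dir.resolve(SUCCESS_MARKER))) {
            return Outcome.SUCCEEDED;
        }
        // No final state was recorded, so the regular recovery path would apply.
        return Outcome.UNKNOWN;
    }
}
```

Checking the failure marker first means that a success file followed by a later recorded failure is treated as a failure, which matches the ordering suggested in the comment.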
[jira] [Commented] (MAPREDUCE-4904) TestMultipleLevelCaching failed in barnch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542827#comment-13542827 ]

Luke Lu commented on MAPREDUCE-4904:
---

I meant that you should add a comment (e.g. // fall through. see MAPREDUCE-4904) to the patch :) Otherwise, the switch code would look a little strange to later maintainers and cause some unnecessary head-scratching.

TestMultipleLevelCaching failed in barnch-1
---

Key: MAPREDUCE-4904
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4904
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: test
Affects Versions: 1.2.0
Reporter: meng gong
Assignee: meng gong
Fix For: 1.2.0
Attachments: MAPREDUCE-4904.patch

TestMultipleLevelCaching fails:
{noformat}
Testcase: testMultiLevelCaching took 30.406 sec
FAILED
Number of local maps expected:0 but was:1
junit.framework.AssertionFailedError: Number of local maps expected:0 but was:1
at org.apache.hadoop.mapred.TestRackAwareTaskPlacement.launchJobAndTestCounters(TestRackAwareTaskPlacement.java:78)
at org.apache.hadoop.mapred.TestMultipleLevelCaching.testCachingAtLevel(TestMultipleLevelCaching.java:113)
at org.apache.hadoop.mapred.TestMultipleLevelCaching.testMultiLevelCaching(TestMultipleLevelCaching.java:69)
{noformat}
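For readers unfamiliar with the convention being requested, the sketch below shows an intentional switch fall-through marked with a comment. It is illustrative only: the enum `Level` and the method `creditedLevels` are hypothetical and are not taken from the MAPREDUCE-4904 patch.

```java
/** Illustrative only; the enum and method are hypothetical, not code from the patch. */
public class FallThroughExample {
    public enum Level { NODE, RACK, OFF_SWITCH }

    /** Returns how many counters are bumped when entering the switch at the given level. */
    public static int creditedLevels(Level level) {
        int credited = 0;
        switch (level) {
            case NODE:
                credited++;
                // fall through. see MAPREDUCE-4904
            case RACK:
                credited++;
                // fall through (intentional): the next case must also run
            case OFF_SWITCH:
                credited++;
                break;
            default:
                break;
        }
        return credited;
    }
}
```

Without the comments, a maintainer could reasonably assume the missing `break` statements are a bug, which is the head-scratching the review comment wants to avoid.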
[jira] [Commented] (MAPREDUCE-4279) getClusterStatus() fails with null pointer exception when running jobs in local mode
[ https://issues.apache.org/jira/browse/MAPREDUCE-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542849#comment-13542849 ]

Jarek Jarcec Cecho commented on MAPREDUCE-4279:
---

I've recently detected this issue as well. Would it be possible to fix it?

getClusterStatus() fails with null pointer exception when running jobs in local mode
---

Key: MAPREDUCE-4279
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4279
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: jobtracker
Affects Versions: 0.23.1, 2.0.0-alpha, 3.0.0
Reporter: Rahul Jain
Assignee: Devaraj K
Attachments: MAPREDUCE-4279.patch

While migrating code from 0.20.2 hadoop codebase to 0.23.1 we encountered this issue for jobs run in local mode of execution:
{code}
java.lang.NullPointerException
at org.apache.hadoop.mapred.JobClient.arrayToStringList(JobClient.java:783)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:138)
at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:815)
at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:812)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
at org.apache.hadoop.mapred.JobClient.getClusterStatus(JobClient.java:812)
{code}
We are using cloudera distribution CDH4b2 for testing, however the underlying code is 0.23.1 and I could see no difference in this implementation.
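The trace above ends in `JobClient.arrayToStringList`, which would be consistent with a null array reaching a converter in local mode. As a hedged illustration (this is not the attached patch), the sketch below shows two common defences against that class of failure: a converter that treats null as empty, and a local-mode source that returns an empty array instead of null. The class `NullSafeTrackers` and both method bodies are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical stand-ins; not the actual JobClient or LocalJobRunner code. */
public class NullSafeTrackers {
    /** Converts an array of names to a list, treating a null array as empty. */
    public static Collection<String> arrayToStringList(String[] names) {
        List<String> list = new ArrayList<>();
        if (names == null) {
            return list;
        }
        for (String name : names) {
            list.add(name);
        }
        return list;
    }

    /** In local mode there are no trackers, so report an empty array rather than null. */
    public static String[] getActiveTrackersInLocalMode() {
        return new String[0];
    }
}
```

Either defence alone would avoid the exception in this sketch; returning empty arrays at the source has the advantage that every caller benefits, not only the one converter.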
[jira] [Commented] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to Queue configuration missing child queue names for root
[ https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542848#comment-13542848 ]

Hudson commented on MAPREDUCE-4884:
---

Integrated in Hadoop-Yarn-trunk #85 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/85/])
MAPREDUCE-4884. Streaming tests fail to start MiniMRCluster due to missing queue configuration. Contributed by Chris Nauroth. (Revision 1427945)

Result = SUCCESS
suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1427945
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/pom.xml

streaming tests fail to start MiniMRCluster due to Queue configuration missing child queue names for root
---

Key: MAPREDUCE-4884
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4884
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: contrib/streaming, test
Affects Versions: 3.0.0, trunk-win
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Fix For: 3.0.0
Attachments: MAPREDUCE-4884.1.patch

Multiple tests in hadoop-streaming, such as {{TestFileArgs}}, fail to initialize {{MiniMRCluster}} due to a {{YarnException}} with reason "Queue configuration missing child queue names for root".
[jira] [Commented] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to Queue configuration missing child queue names for root
[ https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542870#comment-13542870 ]

Hudson commented on MAPREDUCE-4884:
---

Integrated in Hadoop-Hdfs-trunk #1274 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1274/])
MAPREDUCE-4884. Streaming tests fail to start MiniMRCluster due to missing queue configuration. Contributed by Chris Nauroth. (Revision 1427945)

Result = FAILURE
suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1427945
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/pom.xml
[jira] [Updated] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task
[ https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated MAPREDUCE-2217:
---

Issue Type: Bug (was: Improvement)

The expire launching task should cover the UNASSIGNED task
---

Key: MAPREDUCE-2217
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: jobtracker
Affects Versions: 0.23.0, 1.1.1
Reporter: Scott Chen
Assignee: Karthik Kambatla
Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, MR-2217.patch, MR-2217.patch
[jira] [Updated] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task
[ https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated MAPREDUCE-2217:
---

Resolution: Fixed
Fix Version/s: 1.2.0
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)

Thanks Scott and Karthik. Committed to branch-1.

The expire launching task should cover the UNASSIGNED task
---

Key: MAPREDUCE-2217
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: jobtracker
Affects Versions: 0.23.0, 1.1.1
Reporter: Scott Chen
Assignee: Karthik Kambatla
Fix For: 1.2.0
Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, MR-2217.patch, MR-2217.patch
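The behaviour this issue asks for (expire a task that was handed to a tasktracker but never left UNASSIGNED) can be sketched roughly as below. This is a simplified illustration, not the JobTracker implementation or the committed patch: the class `LaunchExpiry`, its methods, and the two-value `State` enum are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the JobTracker.ExpireLaunchingTask implementation. */
public class LaunchExpiry {
    public enum State { UNASSIGNED, RUNNING }

    private final long timeoutMs;
    // attempt id -> time the attempt was handed to a tracker
    private final Map<String, Long> launching = new LinkedHashMap<>();
    private final Map<String, State> states = new LinkedHashMap<>();

    public LaunchExpiry(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called when an attempt is scheduled on a tracker. */
    public void scheduled(String attemptId, long now) {
        launching.put(attemptId, now);
        states.put(attemptId, State.UNASSIGNED);
    }

    /** Called when the tracker reports that the attempt is running. */
    public void reportedRunning(String attemptId) {
        launching.remove(attemptId);
        states.put(attemptId, State.RUNNING);
    }

    /** Returns attempts still UNASSIGNED after the timeout and stops tracking them. */
    public List<String> expire(long now) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = launching.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            boolean timedOut = now - e.getValue() > timeoutMs;
            if (timedOut && states.get(e.getKey()) == State.UNASSIGNED) {
                expired.add(e.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A caller would invoke `expire` periodically and kill whatever it returns, so a task stuck in UNASSIGNED no longer keeps the job waiting indefinitely.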
[jira] [Commented] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to Queue configuration missing child queue names for root
[ https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542938#comment-13542938 ]

Hudson commented on MAPREDUCE-4884:
---

Integrated in Hadoop-Mapreduce-trunk #1304 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1304/])
MAPREDUCE-4884. Streaming tests fail to start MiniMRCluster due to missing queue configuration. Contributed by Chris Nauroth. (Revision 1427945)

Result = FAILURE
suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1427945
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/pom.xml
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542978#comment-13542978 ]

Alejandro Abdelnur commented on MAPREDUCE-4049:
---

On #1, IMO APPLICATION_INIT should be sent to all auxiliary services (and APPLICATION_STOP).

On #2, is there a use case to load both instead of just the configured one?

plugin for generic shuffle service
---

Key: MAPREDUCE-4049
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: performance, task, tasktracker
Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0
Reporter: Avner BenHanoch
Assignee: Avner BenHanoch
Labels: merge, plugin, rdma, shuffle
Fix For: 3.0.0
Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, mapreduce-4049.patch

Support generic shuffle service as a set of two plugins: ShuffleProvider and ShuffleConsumer. This will satisfy the following needs:
# Better shuffle and merge performance. For example: we are working on a shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, or Infiniband) instead of using the current HTTP shuffle. Based on the fast RDMA shuffle, the plugin can also utilize a suitable merge approach during the intermediate merges, hence getting much better performance.
# Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden dependency of NodeManager with a specific version of mapreduce shuffle (currently targeted to 0.24.0).

References:
# Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu from Auburn University with others, [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
# I am attaching 2 documents with suggested Top Level Design for both plugins (currently, based on 1.0 branch)
# I am providing a link for downloading UDA - Mellanox's open source plugin that implements generic shuffle service using RDMA and levitated merge. Note: At this phase, the code is in C++ through JNI and you should consider it as beta only. Still, it can serve anyone that wants to implement or contribute to levitated merge. (Please be advised that levitated merge is mostly suited to very fast networks) - [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69]
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543034#comment-13543034 ]

Robert Joseph Evans commented on MAPREDUCE-4819:
---

Wow, lots of comments. Thanks to everyone for looking at the patch.

bq. I had observed that if I made my AM crash (by putting an exit(1) in shutdownJob()) then the history files would get orphaned and not cleaned up. Or something like that.

Thanks for the heads up. I will look into that.

bq. Why not end in success if the staging dir was cleaned up by the last attempt?

Because we crashed somewhere after staging was cleaned up and before we unregistered. Crashing seems like an error to me, but I suppose we could change it. As for what the client ultimately sees for success or failure, we will rely on the history server to report that.

bq. I am guessing that this code wont be necessary after we move the unregister to RM before the staging dir cleanup in MAPREDUCE-4841, right?

Yes and no. Once MAPREDUCE-4841 goes in there is an increased possibility of leaking staging directories. I have seen users in 1.0 blow away their staging directory to clean up, which caused jobs to fail. Granted, they are more likely to get errors from the distributed cache not finding the files it needs, but in either case I would like to be paranoid and guard against that.

bq. Why are we only eating/ignoring the JobEvents in the dispatcher? So that the JobImpl state machine is not triggered?

In the new code path we have not wired up everything. JobImpl is created but the JobEventDispatcher is not. I did not want to have to deal with recovering the complete state of the job, which in some cases may not even be possible. This is also why I am not bringing up the RPC server. Now that you mention it, I probably also need to update the UI/client to deal with that appropriately. The typo you found was just there for debugging this situation. (I'll fix the typo, by the way.)

bq. This might be a question of personal preference. I think an explicit transition from the INIT to final state is cleaner than overriding the state in the getter.

I actually wanted to put in a stubbed-out Job instead, but there are too many places where Job is cast to JobImpl just to get the state, making it difficult to do so. I will look again to see if I can split the two apart, or add in a state transition.

bq. Oozie handles duplicate notifications correctly doing a NOP.

Great. I will look at the javadocs for job end notification again to be sure that we can default to notify instead.

bq. Using separate files for marking success / failure - am guessing this is to have a smaller chance of a failing persist, as compared to persisting events via the HistoryFile, which may already have a backlog of events?

It was also a much smaller change to make. The HistoryFile would be preferable if we wanted to guarantee at-most-once commit of the tasks, because there are so many of them.

bq. Wondering if it's possible to achieve the same checks via the CommitterEventHandler instead of checking in the MRAppMaster class. i.e follow the regular recovery path - except the CommitHandler emits success / failed / abort events depending on the presence of these files / (history events).
bq. Alternately, the current implementation could be simplified by using a custom RMCommunicator - which does not depend on JobImpl. i.e. the history copier and an RMCommunicator to unregister from the RM.

Both of those seem like valid things to investigate. I feel like I am close on this and want to get this working as is first, and then I will look at the other approaches you suggested. I do like the first one, as it seems like it would be a lot simpler to implement, but I want a backup that I know functions before making drastic changes to the design.

bq. If the last AM attempt were to crash - data exists since the SUCCESS file exists, RPC will not see SUCCESS.

We have a lot of problems in general if the last AM were to crash. It is possible that the history server would have no knowledge of the job whatsoever, even if it finished successfully. This patch is not attempting to address those problems.

bq. While the new AM is running - it will not be able to handle status, counter etc requests. This seems a little problematic if a success has been reported over RPC from the previous AM. Since this AM is dealing with the history file - could possibly have it return information from the history file ? History commit before SUCCESS may help with the previous 2 points.

Yes, history commit before returning success would help with those problems. I will look into it as an alternative approach. My initial thought was to update the client/UI to wait for the AM to report a valid address, so that no client is trying to get counters etc. from an AM in this situation.
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543056#comment-13543056 ]

Avner BenHanoch commented on MAPREDUCE-4049:
---

Hi Alejandro,

On #1 - Thanks!

On #2 - YES:
1. ShuffleProvider is configured for the lifetime of the TT, while ShuffleConsumer is configured per job. We don't want to restart MapReduce/TaskTrackers any time we want to use a different shuffle.
2. In addition, I expect that a single job will use just one type of shuffle. *Still, the TT supports multiple jobs of multiple users with different shuffle/merge needs in parallel*. Hence, multiple shuffle consumers may run in parallel (in the multiple jobs) => they will request data from multiple providers => *the TT needs multiple providers in parallel*. (You can consider multiple ShuffleProviders in MRv1 as equivalent to the multiple AuxiliaryServices that are allowed in MRv2.)
3. It could be that a ShuffleConsumerX will be ideal for jobs of one type, while ShuffleConsumerY will be ideal for jobs of another type (for example Grep vs. TeraSort). Hence, multiple ShuffleConsumer plugins may run in parallel in multiple jobs. Each of the consumers will contact its desired shuffle provider. Hence, all providers should be available in parallel. (Also, one shuffle service can be sensitive to a type of network problem that doesn't disturb other shuffle services; hence, it should be possible to fall back to another shuffle on the fly.)

On the design:
1. It is clear that a ShuffleProvider is a daemon like the TT, while a ShuffleConsumer is a client that lives in the context of the RT.
2. It is clear that multiple providers can run in parallel and each is able to serve the shuffle requests it gets.
3. A shuffle consumer instance will only contact one of the shuffle providers and will request its desired files only from this provider.
4. Multiple consumers in multiple jobs may contact different providers.
5. *A shuffle provider that doesn't serve a request doesn't consume resources for it.*

regards,
Avner
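The provider/consumer relationship described in the design points above can be sketched as follows. This is a hedged illustration only: the interface `ShuffleProvider`, the classes `ProviderRegistry` and `ShuffleConsumer`, and all method names are hypothetical and are not taken from the attached patch or design document. It shows several providers registered side by side for the lifetime of the host process, with each per-job consumer bound to exactly one of them.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical interfaces; the plugin API in the actual patch may differ. */
public class ShufflePlugins {
    /** Serves map output on request; idle providers do no work in this sketch. */
    public interface ShuffleProvider {
        String serve(String mapOutputId);
    }

    /** Lives as long as the TaskTracker and holds every configured provider. */
    public static class ProviderRegistry {
        private final Map<String, ShuffleProvider> providers = new HashMap<>();

        public void register(String name, ShuffleProvider provider) {
            providers.put(name, provider);
        }

        public ShuffleProvider lookup(String name) {
            ShuffleProvider p = providers.get(name);
            if (p == null) {
                throw new IllegalArgumentException("No shuffle provider named " + name);
            }
            return p;
        }
    }

    /** Configured per job; talks to exactly one provider. */
    public static class ShuffleConsumer {
        private final ShuffleProvider provider;

        public ShuffleConsumer(ProviderRegistry registry, String providerName) {
            this.provider = registry.lookup(providerName);
        }

        public String fetch(String mapOutputId) {
            return provider.serve(mapOutputId);
        }
    }
}
```

Because the registry outlives any single job, two jobs can pick different providers by name without restarting the host process, which is the use case argued for in point 1 of the reply.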
[jira] [Commented] (MAPREDUCE-4279) getClusterStatus() fails with null pointer exception when running jobs in local mode
[ https://issues.apache.org/jira/browse/MAPREDUCE-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543076#comment-13543076 ]

Robert Joseph Evans commented on MAPREDUCE-4279:
---

The change looks fine to me. +1. I'll check it in.
[jira] [Commented] (MAPREDUCE-4279) getClusterStatus() fails with null pointer exception when running jobs in local mode
[ https://issues.apache.org/jira/browse/MAPREDUCE-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543081#comment-13543081 ] Hudson commented on MAPREDUCE-4279: --- Integrated in Hadoop-trunk-Commit #3170 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3170/]) MAPREDUCE-4279. getClusterStatus() fails with null pointer exception when running jobs in local mode (Devaraj K via bobby) (Revision 1428482) Result = FAILURE bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1428482 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/test/java/org/apache/hadoop/mapred/TestJobClient.java getClusterStatus() fails with null pointer exception when running jobs in local mode Key: MAPREDUCE-4279 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4279 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Affects Versions: 0.23.1, 2.0.0-alpha, 3.0.0 Reporter: Rahul Jain Assignee: Devaraj K Attachments: MAPREDUCE-4279.patch While migrating code from 0.20.2 hadoop codebase to 0.23.1 we encountered this issue for jobs run in local mode of execution: {code} java.lang.NullPointerException at org.apache.hadoop.mapred.JobClient.arrayToStringList(JobClient.java:783) at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:138) at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:815) at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:812) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177) at org.apache.hadoop.mapred.JobClient.getClusterStatus(JobClient.java:812) {code} We are using cloudera distribution CDH4b2 for testing, however the underlying code is 0.23.1 and I could see no difference in this implementation.
[jira] [Commented] (MAPREDUCE-4279) getClusterStatus() fails with null pointer exception when running jobs in local mode
[ https://issues.apache.org/jira/browse/MAPREDUCE-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543083#comment-13543083 ] Jarek Jarcec Cecho commented on MAPREDUCE-4279: --- Awesome, thank you Robert! getClusterStatus() fails with null pointer exception when running jobs in local mode Key: MAPREDUCE-4279 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4279 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Affects Versions: 0.23.1, 2.0.0-alpha, 3.0.0 Reporter: Rahul Jain Assignee: Devaraj K Attachments: MAPREDUCE-4279.patch While migrating code from 0.20.2 hadoop codebase to 0.23.1 we encountered this issue for jobs run in local mode of execution: {code} java.lang.NullPointerException at org.apache.hadoop.mapred.JobClient.arrayToStringList(JobClient.java:783) at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:138) at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:815) at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:812) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177) at org.apache.hadoop.mapred.JobClient.getClusterStatus(JobClient.java:812) {code} We are using cloudera distribution CDH4b2 for testing, however the underlying code is 0.23.1 and I could see no difference in this implementation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4279) getClusterStatus() fails with null pointer exception when running jobs in local mode
[ https://issues.apache.org/jira/browse/MAPREDUCE-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated MAPREDUCE-4279: --- Resolution: Fixed Fix Version/s: 0.23.6 2.0.3-alpha 3.0.0 Status: Resolved (was: Patch Available) Thanks Devaraj and Rahul, I put this into trunk, branch-2, and branch-0.23 getClusterStatus() fails with null pointer exception when running jobs in local mode Key: MAPREDUCE-4279 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4279 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Affects Versions: 0.23.1, 2.0.0-alpha, 3.0.0 Reporter: Rahul Jain Assignee: Devaraj K Fix For: 3.0.0, 2.0.3-alpha, 0.23.6 Attachments: MAPREDUCE-4279.patch While migrating code from 0.20.2 hadoop codebase to 0.23.1 we encountered this issue for jobs run in local mode of execution: {code} java.lang.NullPointerException at org.apache.hadoop.mapred.JobClient.arrayToStringList(JobClient.java:783) at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:138) at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:815) at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:812) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177) at org.apache.hadoop.mapred.JobClient.getClusterStatus(JobClient.java:812) {code} We are using cloudera distribution CDH4b2 for testing, however the underlying code is 0.23.1 and I could see no difference in this implementation. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
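The stack trace in this issue points at JobClient.arrayToStringList receiving a null array in local mode. The committed patch is not reproduced in this thread (Hudson's file list shows the change touched LocalJobRunner.java), so the following is only a minimal sketch of the general null-guard pattern. The class NullSafeList and its method are hypothetical illustrations, not Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code from MAPREDUCE-4279.patch.
class NullSafeList {
    /** Converts an array to a list of strings, treating a null array as empty. */
    static List<String> arrayToStringList(Object[] items) {
        List<String> result = new ArrayList<>();
        if (items == null) {
            // A caller such as a local-mode runner may have nothing to report.
            return result;
        }
        for (Object item : items) {
            result.add(String.valueOf(item));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(arrayToStringList(null));
        System.out.println(arrayToStringList(new Object[] {"tracker1", "tracker2"}));
    }
}
```

An alternative with the same effect is for the producer to return an empty array instead of null, which keeps the conversion helper unchanged.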
[jira] [Updated] (MAPREDUCE-4458) Warn if java.library.path is used for AM or Task
[ https://issues.apache.org/jira/browse/MAPREDUCE-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Parker updated MAPREDUCE-4458: - Attachment: MAPREDUCE-4458-2.patch Added static function, removed tab indentations. Modified the test file but did not implement a test because the createApplicationSubmissionContext function is mocked in the @Before function. Warn if java.library.path is used for AM or Task Key: MAPREDUCE-4458 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4458 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv2 Affects Versions: 0.23.3, 3.0.0, 2.0.2-alpha Reporter: Robert Joseph Evans Assignee: Robert Parker Attachments: MAPREDUCE-4458-2.patch, MAPREDUCE-4458.patch If java.library.path is used on the command line for launching an MRAppMaster or an MR Task, it could conflict with how standard Hadoop/HDFS JNI libraries and dependencies are found. At a minimum the client should output a warning and ask the user to switch to LD_LIBRARY_PATH. It would be nice to automatically do this for them but parsing the command line is scary so just a warning is probably good enough for now. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4458) Warn if java.library.path is used for AM or Task
[ https://issues.apache.org/jira/browse/MAPREDUCE-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Parker updated MAPREDUCE-4458: - Target Version/s: (was: 0.23.3) Status: Patch Available (was: Open) Warn if java.library.path is used for AM or Task Key: MAPREDUCE-4458 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4458 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv2 Affects Versions: 2.0.2-alpha, 0.23.3, 3.0.0 Reporter: Robert Joseph Evans Assignee: Robert Parker Attachments: MAPREDUCE-4458-2.patch, MAPREDUCE-4458.patch If java.library.path is used on the command line for launching an MRAppMaster or an MR Task, it could conflict with how standard Hadoop/HDFS JNI libraries and dependencies are found. At a minimum the client should output a warning and ask the user to switch to LD_LIBRARY_PATH. It would be nice to automatically do this for them but parsing the command line is scary so just a warning is probably good enough for now. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
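To illustrate the kind of client-side check this issue asks for, here is a minimal sketch. It is not the attached patch: the class LibraryPathWarning, its method names, and the wording of the warning are assumptions made for illustration, and the check is a plain substring match precisely because the issue notes that parsing the command line is risky.

```java
// Hypothetical illustration only; not the code from MAPREDUCE-4458-2.patch.
class LibraryPathWarning {
    /** Returns true if the JVM options string appears to set java.library.path. */
    static boolean setsJavaLibraryPath(String javaOpts) {
        return javaOpts != null && javaOpts.contains("-Djava.library.path");
    }

    /** Builds a warning for the named option, or returns null if none is needed. */
    static String warningFor(String optionName, String javaOpts) {
        if (!setsJavaLibraryPath(javaOpts)) {
            return null;
        }
        return "WARN: " + optionName + " sets -Djava.library.path, which may conflict"
            + " with how Hadoop native libraries are found."
            + " Consider using LD_LIBRARY_PATH instead.";
    }

    public static void main(String[] args) {
        // The option name and value below are example inputs.
        String warning = warningFor("mapreduce.map.java.opts",
            "-Xmx512m -Djava.library.path=/opt/native");
        if (warning != null) {
            System.err.println(warning);
        }
    }
}
```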
[jira] [Resolved] (MAPREDUCE-4655) MergeManager.reserve can OutOfMemoryError if more than 10% of max memory is used on non-MapOutputs
[ https://issues.apache.org/jira/browse/MAPREDUCE-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved MAPREDUCE-4655. --- Resolution: Invalid MergeManager.reserve can OutOfMemoryError if more than 10% of max memory is used on non-MapOutputs -- Key: MAPREDUCE-4655 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4655 Project: Hadoop Map/Reduce Issue Type: Bug Components: nodemanager Affects Versions: 2.0.1-alpha Reporter: Sandy Ryza The MergeManager does a memory check, using a limit that defaults to 90% of Runtime.getRuntime().maxMemory(). Allocations that would bring the total memory allocated by the MergeManager over this limit are asked to wait until memory frees up. Disk is used for single allocations that would be over 25% of the memory limit. If some other part of the reducer were to be using more than 10% of the memory, the current check wouldn't stop an OutOfMemoryError. Before creating an in-memory MapOutput, a check can be done using Runtime.getRuntime().freeMemory(), waiting until memory is freed up if it fails.
12/08/17 10:36:29 INFO mapreduce.Job: Task Id : attempt_1342723342632_0010_r_05_0, Status : FAILED Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#6 at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:123) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:371) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147) Caused by: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58) at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45) at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:97) at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:286) at org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:276) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:327) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:273) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:153)
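The reservation policy described in this issue can be sketched as follows. The sketch follows only the wording of the description (a limit of 90% of the maximum heap, disk for single allocations over 25% of that limit, waiting when the total would exceed the limit). The class ReserveSketch and its enum are hypothetical, the real MergeManager may differ in detail, and the issue itself was resolved as Invalid, so this is illustrative only.

```java
// Hypothetical illustration of the policy as worded in the issue description;
// not the actual MergeManager implementation.
class ReserveSketch {
    enum Decision { DISK, WAIT, MEMORY }

    /** 90% of the maximum heap; divide first so a very large value cannot overflow. */
    static long memoryLimit(long maxMemory) {
        return maxMemory / 10 * 9;
    }

    /** Decides where a requested map output of the given size should go. */
    static Decision decide(long requested, long alreadyReserved, long limit) {
        if (requested > limit / 4) {
            return Decision.DISK;   // single allocation over 25% of the limit
        }
        if (alreadyReserved + requested > limit) {
            return Decision.WAIT;   // would push the reserved total over the limit
        }
        return Decision.MEMORY;
    }

    public static void main(String[] args) {
        long limit = memoryLimit(Runtime.getRuntime().maxMemory());
        System.out.println("limit=" + limit + " decision=" + decide(1024, 0, limit));
    }
}
```

Note that this accounting only tracks memory reserved through the policy itself. Heap used elsewhere in the reducer is invisible to it, which is the gap the description raises and the reason it suggests consulting Runtime.getRuntime().freeMemory().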
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543235#comment-13543235 ] Alejandro Abdelnur commented on MAPREDUCE-4049: --- Got it, thanks for the detailed explanation. * Does this mean that providers must lazily initialize on the first request? * Are you planning to support loading N providers in your patch? plugin for generic shuffle service -- Key: MAPREDUCE-4049 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: performance, task, tasktracker Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0 Reporter: Avner BenHanoch Assignee: Avner BenHanoch Labels: merge, plugin, rdma, shuffle Fix For: 3.0.0 Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, mapreduce-4049.patch Support a generic shuffle service as a set of two plugins: ShuffleProvider and ShuffleConsumer. This will satisfy the following needs: # Better shuffle and merge performance. For example: we are working on a shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, or Infiniband) instead of using the current HTTP shuffle. Based on the fast RDMA shuffle, the plugin can also utilize a suitable merge approach during the intermediate merges, hence getting much better performance. # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden dependency of NodeManager with a specific version of mapreduce shuffle (currently targeted to 0.24.0). References: # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu from Auburn University with others, [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf] # I am attaching 2 documents with suggested Top Level Design for both plugins (currently, based on 1.0 branch) # I am providing a link for downloading UDA - Mellanox's open source plugin that implements generic shuffle service using RDMA and levitated merge. Note: At this phase, the code is in C++ through JNI and you should consider it as beta only. Still, it can serve anyone who wants to implement or contribute to levitated merge. (Please be advised that levitated merge is mostly suited to very fast networks) - [http://www.mellanox.com/content/pages.php?pg=products_dynproduct_family=144menu_section=69]
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543254#comment-13543254 ] Avner BenHanoch commented on MAPREDUCE-4049: 1. I wouldn't use the term "must". Each provider can choose its own optimizations. If a provider's performance is worthwhile for the user, the user will use it. In general, I think that the major resource used by providers is the cache of MOFs. Since this cache is filled only when serving requests, the cost of a provider that was loaded but is unused is low. Other than that, I think that providers mainly listen for incoming requests. 2. In my patch I plan to support just 1 provider (in addition to the built-in MapOutputHttpServlet). This is enough for my use case. Support for N providers is a legitimate idea. If someone needs it, I would prefer that it be handled outside my patch. plugin for generic shuffle service -- Key: MAPREDUCE-4049 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: performance, task, tasktracker Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0 Reporter: Avner BenHanoch Assignee: Avner BenHanoch Labels: merge, plugin, rdma, shuffle Fix For: 3.0.0 Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, mapreduce-4049.patch Support a generic shuffle service as a set of two plugins: ShuffleProvider and ShuffleConsumer. This will satisfy the following needs: # Better shuffle and merge performance. For example: we are working on a shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, or Infiniband) instead of using the current HTTP shuffle. Based on the fast RDMA shuffle, the plugin can also utilize a suitable merge approach during the intermediate merges, hence getting much better performance.
# Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden dependency of NodeManager with a specific version of mapreduce shuffle (currently targeted to 0.24.0). References: # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu from Auburn University with others, [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf] # I am attaching 2 documents with suggested Top Level Design for both plugins (currently, based on 1.0 branch) # I am providing a link for downloading UDA - Mellanox's open source plugin that implements generic shuffle service using RDMA and levitated merge. Note: At this phase, the code is in C++ through JNI and you should consider it as beta only. Still, it can serve anyone who wants to implement or contribute to levitated merge. (Please be advised that levitated merge is mostly suited to very fast networks) - [http://www.mellanox.com/content/pages.php?pg=products_dynproduct_family=144menu_section=69]
[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated MAPREDUCE-4819: --- Attachment: MR-4819-bobby-trunk.txt This patch should be fully functional. I have included the work by Bikas to put the Job history file in a location that is deleted with the staging directory. I have fixed a few bugs in the original where we were not registering with the RM correctly, and also where the Web App Proxy would return a 500 error if hit while recovery was happening. I have manually tested this by having the AM exit/halt before, during, and after job commit. I tested it with the job commit failing and succeeding. Everything appears to be working as expected. I did not change JobImpl forcedState because adding in the transitions was more than I wanted to do right now. I am happy to file a follow-up JIRA to make those changes if we want them. I have also not added in the kill state. Again, it looked a bit tricky because of the multithreading, and I would prefer to get something working in now and add that as part of a follow-up JIRA. I talked with Kihwal Lee about the extra HDFS load for an empty file vs a directory and he said about the only extra load is the extra RPC call to close it, and because it is just two files per job I left it as is. If you feel strongly about it I can fix it on a separate JIRA. About the only thing that is left for this is integration with MAPREDUCE-4832.
AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
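The comment on this patch mentions empty files kept with the staging directory (two per job) and testing with the AM exiting before, during, and after job commit. One way a restarted AM could use such files to decide whether to rerun is sketched below. Everything here is an assumption made for illustration: the marker file names, the enum, and the decision logic are hypothetical, and java.nio.file on a local temporary directory stands in for HDFS. It is not the attached patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; names and logic are assumptions.
class CommitMarkerSketch {
    enum PriorState { NOT_COMMITTED, COMMIT_INTERRUPTED, COMMIT_FINISHED }

    static final String START_MARKER = "COMMIT_STARTED";   // hypothetical name
    static final String END_MARKER = "COMMIT_FINISHED";    // hypothetical name

    /** Inspects the staging directory left behind by an earlier AM attempt. */
    static PriorState inspect(Path stagingDir) {
        if (Files.exists(stagingDir.resolve(END_MARKER))) {
            return PriorState.COMMIT_FINISHED;     // final status is known: do not rerun
        }
        if (Files.exists(stagingDir.resolve(START_MARKER))) {
            return PriorState.COMMIT_INTERRUPTED;  // output state is uncertain
        }
        return PriorState.NOT_COMMITTED;           // safe to run or recover the job
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        System.out.println(inspect(staging));
        Files.createFile(staging.resolve(START_MARKER));
        System.out.println(inspect(staging));
        Files.createFile(staging.resolve(END_MARKER));
        System.out.println(inspect(staging));
    }
}
```

Because the markers live under the staging directory, they disappear when that directory is cleaned up, so no separate cleanup step is needed in this sketch.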
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543297#comment-13543297 ] Hadoop QA commented on MAPREDUCE-4819: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12563151/MR-4819-bobby-trunk.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 2015 javac compiler warnings (more than the trunk's current 2014 warnings). {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy: org.apache.hadoop.mapreduce.v2.app.commit.TestCommitterEventHandler org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. 
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3189//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3189//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3189//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3189//console This message is automatically generated. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated MAPREDUCE-4909: - Component/s: test TestKeyValueTextInputFormat fails with Open JDK 7 on Windows Key: MAPREDUCE-4909 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 Project: Hadoop Map/Reduce Issue Type: Bug Components: test Affects Versions: 1-win Reporter: Arpit Agarwal Assignee: Arpit Agarwal Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files via LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated MAPREDUCE-4909: - Attachment: MAPREDUCE-4909.patch Removed the comments altogether. TestKeyValueTextInputFormat fails with Open JDK 7 on Windows Key: MAPREDUCE-4909 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 1-win Reporter: Arpit Agarwal Assignee: Arpit Agarwal Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files via LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated MAPREDUCE-4909: - Affects Version/s: (was: 1-win) 1.2.0 TestKeyValueTextInputFormat fails with Open JDK 7 on Windows Key: MAPREDUCE-4909 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 Project: Hadoop Map/Reduce Issue Type: Bug Components: test Affects Versions: 1.2.0 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files via LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated MAPREDUCE-4909: - Fix Version/s: 1.2.0 TestKeyValueTextInputFormat fails with Open JDK 7 on Windows Key: MAPREDUCE-4909 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 Project: Hadoop Map/Reduce Issue Type: Bug Components: test Affects Versions: 1.2.0 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Fix For: 1.2.0 Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files via LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated MAPREDUCE-4909: - Target Version/s: (was: 1-win) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows Key: MAPREDUCE-4909 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 Project: Hadoop Map/Reduce Issue Type: Bug Components: test Affects Versions: 1.2.0 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Fix For: 1.2.0 Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files via LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543316#comment-13543316 ] Suresh Srinivas commented on MAPREDUCE-4909: +1 for the patch. TestKeyValueTextInputFormat fails with Open JDK 7 on Windows Key: MAPREDUCE-4909 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 Project: Hadoop Map/Reduce Issue Type: Bug Components: test Affects Versions: 1.2.0 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Fix For: 1.2.0 Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files via LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suresh Srinivas resolved MAPREDUCE-4909. Resolution: Fixed Hadoop Flags: Reviewed I committed the patch. Thank you Arpit! TestKeyValueTextInputFormat fails with Open JDK 7 on Windows Key: MAPREDUCE-4909 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 Project: Hadoop Map/Reduce Issue Type: Bug Components: test Affects Versions: 1.2.0 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Fix For: 1.2.0 Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files via LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543325#comment-13543325 ] Suresh Srinivas commented on MAPREDUCE-4909: I also committed this change to branch-1-win. TestKeyValueTextInputFormat fails with Open JDK 7 on Windows Key: MAPREDUCE-4909 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 Project: Hadoop Map/Reduce Issue Type: Bug Components: test Affects Versions: 1.2.0 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Fix For: 1.2.0 Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files via LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543328#comment-13543328 ] Arpit Agarwal commented on MAPREDUCE-4909: -- Thanks Suresh! TestKeyValueTextInputFormat fails with Open JDK 7 on Windows Key: MAPREDUCE-4909 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 Project: Hadoop Map/Reduce Issue Type: Bug Components: test Affects Versions: 1.2.0 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Fix For: 1.2.0 Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, MAPREDUCE-4909.patch TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files via LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4832) MR AM can get in a split brain situation
[ https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543334#comment-13543334 ] Siddharth Seth commented on MAPREDUCE-4832: ---
I was talking to Hitesh offline about this patch. Is this needed at the moment? It seems possible to avoid multiple AMs by tuning AM_LIVENESS_INTERVAL (10 minutes by default) and MR_AM_TO_RM_WAIT_INTERVAL_MS (6 minutes by default). A new AM should only be started after the existing AM is done. That said, this is definitely an interesting approach to fixing the problem.
- Could add a check to ensure the window interval is greater than the AM-RM heartbeat interval.
- Does getClock() need to be part of the RMHeartbeatHandler? It looks like the AppContext can provide this. I think a couple of places use the AppContext, while others use the RMHeartbeatHandler.
Recovery and restart are still WIP. I believe MR_AM_TO_RM_WAIT_INTERVAL_MS will need to be looked at again in the context of recovery. This patch, or a sync via HDFS, seems more useful at that point?
MR AM can get in a split brain situation Key: MAPREDUCE-4832 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.2-alpha, 0.23.5 Reporter: Robert Joseph Evans Assignee: Jason Lowe Priority: Critical Attachments: MAPREDUCE-4832.patch It is possible for a networking issue to happen where the RM thinks an AM has gone down and launches a replacement, but the previous AM is still up and running. If the previous AM does not need any more resources from the RM it could try to commit either tasks or jobs. This could cause lots of problems where the second AM finishes and tries to commit too. This could result in data corruption.
[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated MAPREDUCE-4819: --- Attachment: MR-4819-bobby-trunk.txt Fixes Findbugs issue, and test failures. Both were test issues I had missed previously. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated MAPREDUCE-4819: --- Attachment: MR-4819-bobby-trunk.txt With the latest comments on MAPREDUCE-4832 I removed the place holder in here for code from it. Now this should be able to stand alone, and be committed if deemed acceptable. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543376#comment-13543376 ] Hadoop QA commented on MAPREDUCE-4819: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12563172/MR-4819-bobby-trunk.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 2015 javac compiler warnings (more than the trunk's current 2014 warnings). {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. 
The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy: org.apache.hadoop.mapreduce.v2.app.TestRecovery org.apache.hadoop.mapreduce.v2.app.webapp.TestAMWebServicesJobs org.apache.hadoop.mapreduce.v2.app.webapp.TestAMWebServicesTasks org.apache.hadoop.mapreduce.v2.hs.webapp.TestHsWebServicesJobs org.apache.hadoop.mapreduce.v2.hs.webapp.TestHsWebServicesTasks org.apache.hadoop.mapreduce.v2.hs.webapp.TestHsWebServicesJobsQuery {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3190//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3190//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3190//console This message is automatically generated. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. 
Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4892) CombineFileInputFormat node input split can be skewed on small clusters
[ https://issues.apache.org/jira/browse/MAPREDUCE-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543377#comment-13543377 ] Chris Nauroth commented on MAPREDUCE-4892: -- +1 for the patch. I applied the patch locally and ran {{TestCombineFileInputFormat}}. Code looks good. I can't think of any other edge cases that this patch doesn't handle. CombineFileInputFormat node input split can be skewed on small clusters --- Key: MAPREDUCE-4892 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4892 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Bikas Saha Assignee: Bikas Saha Fix For: 3.0.0 Attachments: MAPREDUCE-4892.1.patch The CombineFileInputFormat split generation logic tries to group blocks by node in order to create splits. It iterates through the nodes and creates splits on them until there aren't enough blocks left on a node that can be grouped into a valid split. If the first few nodes have a lot of blocks on them then they can end up getting a disproportionately large share of the total number of splits created. This can result in poor locality of maps. This problem is likely to happen on small clusters where it's easier to create a skew in the distribution of blocks on nodes.
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543380#comment-13543380 ] Robert Joseph Evans commented on MAPREDUCE-4819: I am investigating the test failures. I think they are unrelated to this patch, because they work just fine for me when I run them without up-merging to the latest trunk. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543383#comment-13543383 ] Robert Joseph Evans commented on MAPREDUCE-4819: For some reason all of the web service tests were failing with out-of-memory errors, which I have not yet been able to reproduce myself. I have also not been able to reproduce the TestRecovery failures, but I did not see any OOMs there. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543388#comment-13543388 ] Hadoop QA commented on MAPREDUCE-4819: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12563176/MR-4819-bobby-trunk.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 2015 javac compiler warnings (more than the trunk's current 2014 warnings). {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. 
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3191//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3191//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3191//console This message is automatically generated. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4832) MR AM can get in a split brain situation
[ https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543390#comment-13543390 ] Bikas Saha commented on MAPREDUCE-4832: --- Independent of this change, this looks like a problem that should be solved in the platform rather than in the AM. For example, the NM could maintain an expiry time on its containers and terminate them when that time is reached. The expiry time is extended whenever the NM heartbeats with the RM. So if the NM loses contact with the RM, or if the RM thinks the AM should no longer be running on that NM, the expiry time is not extended. The RM starts retries after the expiry time has elapsed. The logic is similar but self-contained within the platform. AMs could do the same for their own containers, which would provide automatic garbage collection when an AM crashes. MR AM can get in a split brain situation Key: MAPREDUCE-4832 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.2-alpha, 0.23.5 Reporter: Robert Joseph Evans Assignee: Jason Lowe Priority: Critical Attachments: MAPREDUCE-4832.patch It is possible for a networking issue to happen where the RM thinks an AM has gone down and launches a replacement, but the previous AM is still up and running. If the previous AM does not need any more resources from the RM it could try to commit either tasks or jobs. This could cause lots of problems where the second AM finishes and tries to commit too. This could result in data corruption.
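Editorial sketch of the container-expiry idea in Bikas Saha's comment above. This is not YARN code and not part of any attached patch; the class name ContainerLeases, its methods, and the string container ids are all hypothetical, and time is passed in explicitly so the behaviour is easy to follow. It only illustrates the lease logic: a container's expiry is pushed out on each successful NM-RM heartbeat for containers the RM still considers live, and anything whose lease runs out is reaped locally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical NM-side lease table; names are illustrative, not YARN APIs. */
public class ContainerLeases {
    private final long leaseMs;
    private final Map<String, Long> expiryById = new HashMap<>();

    public ContainerLeases(long leaseMs) {
        this.leaseMs = leaseMs;
    }

    /** Registers a newly launched container with a fresh lease. */
    public void start(String containerId, long nowMs) {
        expiryById.put(containerId, nowMs + leaseMs);
    }

    /**
     * Called after a successful NM-RM heartbeat. Only containers the RM
     * still considers live get their lease extended.
     */
    public void onRmHeartbeat(Set<String> liveAccordingToRm, long nowMs) {
        for (String id : liveAccordingToRm) {
            if (expiryById.containsKey(id)) {
                expiryById.put(id, nowMs + leaseMs);
            }
        }
    }

    /** Removes and returns (sorted by id) the containers whose lease has run out. */
    public List<String> reapExpired(long nowMs) {
        List<String> expired = new ArrayList<>();
        expiryById.entrySet().removeIf(e -> {
            if (e.getValue() <= nowMs) {
                expired.add(e.getKey());
                return true;
            }
            return false;
        });
        Collections.sort(expired);
        return expired;
    }
}
```

With a lease no longer than the RM's own wait before retrying, an AM container cut off from the RM would be terminated by its NM before a replacement attempt starts, which is the property the comment is after.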
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543394#comment-13543394 ] Robert Joseph Evans commented on MAPREDUCE-4819: OK, looking at it, all of the failures appear to be associated with the hadoop4 machine. I will work with tgraves to see if we can figure out what is happening. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543406#comment-13543406 ] Bikas Saha commented on MAPREDUCE-4819: ---
When the staging dir exists but the commitStarted marker does not, it means that this is a retry that should continue as normal, right? If so, shouldn't copyHistory be set to false in that case? It looks like copyHistory should be set to true only inside the following block. Only when the commit has started do we need to copy history and end; in other cases we should not copy history. Change the initial value of copyHistory to false and set it when needed?
{code}
+ } else if (commitStarted) {
{code}
Typos: errorHappendShutDown, NoopEventHanlder
If we change the touchz code below to create a new file or fail, then the AM knows when it has lost its race to commit. Does this provide a simpler fix for MAPREDUCE-4832? When AMs try to initiate commit, only the first one manages to write the commit_start file in HDFS, so racing AMs will fail after the first one succeeds. The marker still exists for the purpose of signalling the start of commit (i.e. for this jira). It should not matter which AM commits the result because the computation is deterministic. The AM that failed to commit could check or wait for the end-of-commit marker in order to make sure that the last retry succeeds (if that is necessary).
{code}
+private void touchz(Path p) throws IOException {
+  fs.create(p).close();
+}
{code}
AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
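Editorial sketch of the "create new file or fail" suggestion in Bikas Saha's comment above. It deliberately uses the local filesystem through java.nio.file rather than HDFS so that it is self-contained; the class and method names (CommitMarker, tryStartCommit) are hypothetical and do not come from the patch. On HDFS the analogous change would presumably be to call fs.create(p, false), the non-overwriting form of create, instead of fs.create(p), but that is an assumption about how the patch might be adapted, not something the thread confirms.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical create-or-fail commit marker on a local filesystem. */
public class CommitMarker {
    /**
     * Tries to create the marker file. Files.createFile checks for existence
     * and creates the file as one atomic step, so at most one caller can win.
     *
     * @return true if this caller created the marker (it may start the commit),
     *         false if the marker already existed (some other attempt got there first)
     */
    public static boolean tryStartCommit(Path marker) {
        try {
            Files.createFile(marker);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            // Any other I/O failure is not a lost race; surface it to the caller.
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller that gets false cannot tell from the marker alone whether the other attempt is still committing or crashed mid-commit, so it still needs a policy for that case, such as waiting for an end-of-commit marker.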
[jira] [Commented] (MAPREDUCE-4832) MR AM can get in a split brain situation
[ https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543414#comment-13543414 ] Jason Lowe commented on MAPREDUCE-4832: ---
bq. Seems like it's possible to avoid multiple AMs by tuning the AM_LIVENESS_INTERVAL (10 minutes by default) and MR_AM_TO_RM_WAIT_INTERVAL_MS (6 minutes by default). A new AM should only be started after the existing AM is done.
That *almost* solves the problem, but there are some corner cases left unsolved. For example:
1) AM is running on a node whose NM suddenly declares itself UNHEALTHY via health-check script
2) RM removes node from active nodes and kills all containers running on that node
3) Network cut occurs. NM did not receive notification to kill the containers and/or NM crashes. AM is unable to communicate to RM.
4) RM now thinks all containers are dead on that node, proceeds to relaunch a new AM attempt
5) Now for the next 6 minutes (or whatever the expiry interval is for the AM to RM) we have two app attempts running simultaneously.
If the old AM attempt is able to reach HDFS or whatever it needs to commit, we could end up committing twice.
bq. Could add a check to ensure the window interval is greater than the AM-RM heartbeat.
Actually that's not strictly necessary. The code can function correctly even if the commit window is smaller than the heartbeat interval. For example, job commit is woken up when a fresh heartbeat arrives, and task commit polls periodically for whether the heartbeat has occurred recently. It's not mandatory that the interval between heartbeats is smaller than the commit window for a commit to proceed, but it is more likely a commit operation will be stalled waiting for a fresh heartbeat if configured that way.
bq. Does getClock() need to be part of the RMHeartbeatHandler. Looks like the AppContext can provide this
I put it in the interface so the caller can access the same clock used to timestamp the heartbeat in case it could be different from the AppContext clock or if the caller didn't have access to the AppContext. But that's probably never going to be a real concern, so I'll take it out.
And to address Bikas' comment:
bq. Independent of this change, this looks like a problem that needs to be solved in the platform than in the AM.
We might be able to close all the corner cases in the framework. For example, the above scenario could be solved if the RM were to wait for confirmation from the NM of the containers actually expiring before proceeding to launch another attempt. If the NM is unreachable before the confirmation is received, it could wait for the AM expiry interval before launching a new attempt. It could mean that we wait a lot longer than necessary, but at least we'd know with confidence that two attempts aren't running simultaneously.
MR AM can get in a split brain situation Key: MAPREDUCE-4832 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.2-alpha, 0.23.5 Reporter: Robert Joseph Evans Assignee: Jason Lowe Priority: Critical Attachments: MAPREDUCE-4832.patch It is possible for a networking issue to happen where the RM thinks an AM has gone down and launches a replacement, but the previous AM is still up and running. If the previous AM does not need any more resources from the RM it could try to commit either tasks or jobs. This could cause lots of problems where the second AM finishes and tries to commit too. This could result in data corruption.
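Editorial sketch of the commit-window check described in Jason Lowe's comment above: a commit may proceed only if a heartbeat to the RM succeeded recently enough. This is not the MAPREDUCE-4832 patch. The class name CommitWindowGate, its methods, and the injected LongSupplier clock are hypothetical, chosen only to make the two behaviours he mentions concrete (being woken by a fresh heartbeat, and polling for whether one occurred recently).

```java
import java.util.function.LongSupplier;

/** Hypothetical gate that allows commits only within a window after the last RM heartbeat. */
public class CommitWindowGate {
    private final LongSupplier clock;      // injected so callers and tests share one time source
    private final long commitWindowMs;
    private long lastHeartbeatMs;
    private boolean heartbeatSeen = false;

    public CommitWindowGate(LongSupplier clock, long commitWindowMs) {
        this.clock = clock;
        this.commitWindowMs = commitWindowMs;
    }

    /** Records a successful AM-RM heartbeat and wakes any thread waiting to commit. */
    public synchronized void onHeartbeat() {
        lastHeartbeatMs = clock.getAsLong();
        heartbeatSeen = true;
        notifyAll();
    }

    /** Polling form: true if the last heartbeat is recent enough to commit now. */
    public synchronized boolean canCommit() {
        return heartbeatSeen && clock.getAsLong() - lastHeartbeatMs <= commitWindowMs;
    }

    /**
     * Blocking form: waits until a heartbeat falls inside the window,
     * re-checking at least every pollMs milliseconds (pollMs must be > 0).
     */
    public synchronized void awaitCommitWindow(long pollMs) throws InterruptedException {
        while (!canCommit()) {
            wait(pollMs);
        }
    }
}
```

Nothing here requires the heartbeat interval to be shorter than the window; with a longer interval canCommit() is simply false more often, so a commit is more likely to stall until the next heartbeat, which matches the trade-off described in the comment.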
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543425#comment-13543425 ] Jason Lowe commented on MAPREDUCE-4819: --- bq. If we change this code to create new file or fail then AM knows when it has lost its race to commit. Does this provide a simpler fix for MAPREDUCE-4832? If an app attempt sees the file, how does it even know whether there's an active race that was lost? The other AM could have simply crashed mid-commit. The losing AM could just assume that's the case and unregister from the RM with a FAILED status assuming job commit failed. (Or maybe wait for some configurable timeout just in case.) However this would only cover job commit, and two racing app attempts could still commit output for tasks simultaneously. MAPREDUCE-4832 prevents two racing app attempts from committing the same task output, as at most one will be active and allowed to commit. That could be bad if the old attempt is re-committing output for a fetch-failure map task while the second attempt is trying to recover, for example. Task output could be lost in that case. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). 
Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
[jira] [Updated] (MAPREDUCE-3685) There are some bugs in implementation of MergeManager
[ https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated MAPREDUCE-3685: Priority: Critical (was: Minor) There are some bugs in implementation of MergeManager - Key: MAPREDUCE-3685 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.1 Reporter: anty.rao Assignee: anty Priority: Critical Attachments: MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685.patch
[jira] [Updated] (MAPREDUCE-3685) There are some bugs in implementation of MergeManager
[ https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated MAPREDUCE-3685: Target Version/s: 2.0.0-alpha, trunk, 0.23.6 (was: 2.0.0-alpha, trunk) There are some bugs in implementation of MergeManager - Key: MAPREDUCE-3685 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.1 Reporter: anty.rao Assignee: anty Priority: Critical Attachments: MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685.patch
[jira] [Updated] (MAPREDUCE-3685) There are some bugs in implementation of MergeManager
[ https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated MAPREDUCE-3685: Status: Patch Available (was: Open) Submitting patch on behalf of Anty! There are some bugs in implementation of MergeManager - Key: MAPREDUCE-3685 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.1 Reporter: anty.rao Assignee: anty Priority: Critical Attachments: MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685.patch
[jira] [Updated] (MAPREDUCE-4832) MR AM can get in a split brain situation
[ https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated MAPREDUCE-4832: -- Attachment: MAPREDUCE-4832.patch Updated patch to remove getClock() from RMHeartbeatHandler interface. MR AM can get in a split brain situation Key: MAPREDUCE-4832 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.2-alpha, 0.23.5 Reporter: Robert Joseph Evans Assignee: Jason Lowe Priority: Critical Attachments: MAPREDUCE-4832.patch, MAPREDUCE-4832.patch It is possible for a networking issue to happen where the RM thinks an AM has gone down and launches a replacement, but the previous AM is still up and running. If the previous AM does not need any more resources from the RM it could try to commit either tasks or jobs. This could cause lots of problems where the second AM finishes and tries to commit too. This could result in data corruption.
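The split-brain description above can be illustrated with a small sketch. The message does not say how the patch decides which attempt may commit, so the following is only an assumption: an AM refuses to commit unless it has heard from the RM recently. The class `CommitGate`, its methods, and the window length are all hypothetical names chosen for illustration, not taken from the MAPREDUCE-4832 patch. Timestamps are passed in explicitly so the behaviour is easy to check without a real clock.

```java
/**
 * Hypothetical sketch (not the MAPREDUCE-4832 patch): an AM attempt may
 * commit only while its last successful RM heartbeat is recent enough.
 */
final class CommitGate {
    private final long windowMs;            // assumed maximum heartbeat age
    private volatile long lastHeartbeatMs;  // time of last successful RM contact

    CommitGate(long windowMs, long startMs) {
        this.windowMs = windowMs;
        this.lastHeartbeatMs = startMs;
    }

    /** Record that the RM answered a heartbeat at the given time. */
    void heartbeatSucceeded(long nowMs) {
        lastHeartbeatMs = nowMs;
    }

    /** True if RM contact is fresh enough to assume this attempt is still the active one. */
    boolean mayCommit(long nowMs) {
        return nowMs - lastHeartbeatMs <= windowMs;
    }

    public static void main(String[] args) {
        CommitGate gate = new CommitGate(1000L, 0L);
        System.out.println("commit allowed at t=500:  " + gate.mayCommit(500L));
        System.out.println("commit allowed at t=1500: " + gate.mayCommit(1500L));
    }
}
```

Under this assumption, an old attempt cut off from the RM stops committing once the window expires. The window would need to be shorter than the time the RM waits before launching a replacement attempt.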
[jira] [Commented] (MAPREDUCE-3685) There are some bugs in implementation of MergeManager
[ https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543460#comment-13543460 ] Hadoop QA commented on MAPREDUCE-3685: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12522248/MAPREDUCE-3685-branch-0.23.1.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3193//console This message is automatically generated. There are some bugs in implementation of MergeManager - Key: MAPREDUCE-3685 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.1 Reporter: anty.rao Assignee: anty Priority: Critical Attachments: MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685.patch
[jira] [Assigned] (MAPREDUCE-2286) ASF mapreduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miguel Ochoa reassigned MAPREDUCE-2286: --- Assignee: Miguel Ochoa ASF mapreduce - Key: MAPREDUCE-2286 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2286 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: benchmarks, client, contrib/streaming, jobtracker, pipes Environment: 2.2 Commodore Reporter: Miguel Ochoa Assignee: Miguel Ochoa Priority: Trivial Original Estimate: 50h Remaining Estimate: 50h This sub-net ensures versions in description, however projects or manufacturing will have to be in working conditioning in the time of unknown versions.
[jira] [Commented] (MAPREDUCE-4832) MR AM can get in a split brain situation
[ https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543475#comment-13543475 ] Hadoop QA commented on MAPREDUCE-4832: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12563197/MAPREDUCE-4832.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3192//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3192//console This message is automatically generated. 
MR AM can get in a split brain situation Key: MAPREDUCE-4832 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 2.0.2-alpha, 0.23.5 Reporter: Robert Joseph Evans Assignee: Jason Lowe Priority: Critical Attachments: MAPREDUCE-4832.patch, MAPREDUCE-4832.patch It is possible for a networking issue to happen where the RM thinks an AM has gone down and launches a replacement, but the previous AM is still up and running. If the previous AM does not need any more resources from the RM it could try to commit either tasks or jobs. This could cause lots of problems where the second AM finishes and tries to commit too. This could result in data corruption.
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543517#comment-13543517 ] Siddharth Seth commented on MAPREDUCE-4819: --- bq. I think it already will. We are not opening the file for append, we are trying to create it. fs.create(Path) - overwrites by default, instead of throwing an exception. There's another form which does not overwrite. Don't think this is a problem once 4832 goes in. bq. I have also not added in the kill state. Again it looked a bit tricky because of the multithreading and I would prefer to get something working in now and add that as part of a follow up JIRA. ok. This seems like it will be easier if we rely on the history file as the commit log instead of the 3/more individual files. RPC clients not being able to communicate with the AM / history (or getting alternate states) after having seen a SUCCESS state seems to be independent of this patch. Separate jira. This seems ok for now since it's gotten some attention and has been tried out. I think handling all of this via the CommitHandler is a cleaner approach, and we can move to that at a later point. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. 
Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
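The point in the comment above, that a plain create overwrites by default while another form refuses to, is what lets an AM attempt detect that it lost a race to commit. The sketch below shows the create-or-fail idea with the JDK's `java.nio.file.Files.createFile` as a stand-in; it does not use Hadoop's `FileSystem` API (which has a `create(Path, boolean overwrite)` overload), and the class `CommitMarker`, the method `tryClaim`, and the marker file name are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: claims a commit by atomically creating a marker file. */
final class CommitMarker {
    private CommitMarker() {}

    /** Returns true if this caller created the marker, false if it already existed. */
    static boolean tryClaim(Path marker) {
        try {
            // Existence check and creation happen as one atomic step; an
            // existing file is never overwritten.
            Files.createFile(marker);
            return true;
        } catch (FileAlreadyExistsException e) {
            // Another attempt got here first (or crashed after creating it).
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path marker = Paths.get(System.getProperty("java.io.tmpdir"),
                "commit-marker-" + System.nanoTime());
        System.out.println("first attempt claimed:  " + tryClaim(marker));
        System.out.println("second attempt claimed: " + tryClaim(marker));
        marker.toFile().delete(); // clean up the demo file
    }
}
```

As the thread notes, a `false` result alone does not tell the loser whether the other attempt is still committing or crashed mid-commit, so the caller still has to decide how to report the job.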
[jira] [Updated] (MAPREDUCE-4904) TestMultipleLevelCaching failed in barnch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated MAPREDUCE-4904: -- Attachment: MAPREDUCE-4904-v2.patch Incorporated Luke's comments by adding fall-through comments to the switch case. TestMultipleLevelCaching failed in barnch-1 --- Key: MAPREDUCE-4904 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4904 Project: Hadoop Map/Reduce Issue Type: Bug Components: test Affects Versions: 1.2.0 Reporter: meng gong Assignee: meng gong Fix For: 1.2.0 Attachments: MAPREDUCE-4904.patch, MAPREDUCE-4904-v2.patch TestMultipleLevelCaching fails: {noformat} Testcase: testMultiLevelCaching took 30.406 sec FAILED Number of local maps expected:0 but was:1 junit.framework.AssertionFailedError: Number of local maps expected:0 but was:1 at org.apache.hadoop.mapred.TestRackAwareTaskPlacement.launchJobAndTestCounters(TestRackAwareTaskPlacement.java:78) at org.apache.hadoop.mapred.TestMultipleLevelCaching.testCachingAtLevel(TestMultipleLevelCaching.java:113) at org.apache.hadoop.mapred.TestMultipleLevelCaching.testMultiLevelCaching(TestMultipleLevelCaching.java:69) {noformat}
[jira] [Updated] (MAPREDUCE-2286) ASF mapreduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miguel Ochoa updated MAPREDUCE-2286: Attachment: 01 - Mutual NDA - 2010.doc ASF mapreduce - Key: MAPREDUCE-2286 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2286 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: benchmarks, client, contrib/streaming, jobtracker, pipes Environment: 2.2 Commodore Reporter: Miguel Ochoa Assignee: Miguel Ochoa Priority: Trivial Attachments: 01 - Mutual NDA - 2010.doc Original Estimate: 50h Remaining Estimate: 50h This sub-net ensures versions in description, however projects or manufacturing will have to be in working conditioning in the time of unknown versions.
[jira] [Updated] (MAPREDUCE-4904) TestMultipleLevelCaching failed in barnch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated MAPREDUCE-4904: -- Attachment: MAPREDUCE-4904-v2.patch Regenerated the v2 patch using --no-prefix. TestMultipleLevelCaching failed in barnch-1 --- Key: MAPREDUCE-4904 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4904 Project: Hadoop Map/Reduce Issue Type: Bug Components: test Affects Versions: 1.2.0 Reporter: meng gong Assignee: meng gong Fix For: 1.2.0 Attachments: MAPREDUCE-4904.patch, MAPREDUCE-4904-v2.patch, MAPREDUCE-4904-v2.patch TestMultipleLevelCaching fails: {noformat} Testcase: testMultiLevelCaching took 30.406 sec FAILED Number of local maps expected:0 but was:1 junit.framework.AssertionFailedError: Number of local maps expected:0 but was:1 at org.apache.hadoop.mapred.TestRackAwareTaskPlacement.launchJobAndTestCounters(TestRackAwareTaskPlacement.java:78) at org.apache.hadoop.mapred.TestMultipleLevelCaching.testCachingAtLevel(TestMultipleLevelCaching.java:113) at org.apache.hadoop.mapred.TestMultipleLevelCaching.testMultiLevelCaching(TestMultipleLevelCaching.java:69) {noformat}
[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated MAPREDUCE-4819: - Priority: Blocker (was: Critical) AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Blocker Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543629#comment-13543629 ] Avner BenHanoch commented on MAPREDUCE-4049: Hi Alejandro, Re #2, my intuition is that supporting one external shuffle service (in addition to the built-in shuffle service) is the keep-it-simple solution. I feel that the use case of N providers is theoretical. Hence, I prefer to keep the conf and code simple. Avner plugin for generic shuffle service -- Key: MAPREDUCE-4049 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: performance, task, tasktracker Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0 Reporter: Avner BenHanoch Assignee: Avner BenHanoch Labels: merge, plugin, rdma, shuffle Fix For: 3.0.0 Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, mapreduce-4049.patch Support generic shuffle service as a set of two plugins: ShuffleProvider and ShuffleConsumer. This will satisfy the following needs: # Better shuffle and merge performance. For example: we are working on a shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, or Infiniband) instead of using the current HTTP shuffle. Based on the fast RDMA shuffle, the plugin can also utilize a suitable merge approach during the intermediate merges, hence getting much better performance. # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden dependency of NodeManager with a specific version of mapreduce shuffle (currently targeted to 0.24.0). References: # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu from Auburn University with others, [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf] # I am attaching 2 documents with suggested Top Level Design for both plugins (currently, based on 1.0 branch) # I am providing a link for downloading UDA - Mellanox's open source plugin that implements generic shuffle service using RDMA and levitated merge. Note: At this phase, the code is in C++ through JNI and you should consider it as beta only. Still, it can serve anyone that wants to implement or contribute to levitated merge. (Please be advised that levitated merge is mostly suited to very fast networks) - [http://www.mellanox.com/content/pages.php?pg=products_dynproduct_family=144menu_section=69]
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13543630#comment-13543630 ] Bikas Saha commented on MAPREDUCE-4819: --- Sid, how about creating some jiras so that your ideas don't get lost as comments? AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Blocker Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
[jira] [Issue Comment Deleted] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated MAPREDUCE-4049: --- Comment: was deleted (was: I’ll be on vacation between Jan 6 to 13 (returning on Monday the 14th) Redirecting issues: · VMA - Olga Shern ol...@mellanox.com · UDA - Avner Ben Hanoch avn...@mellanox.com Regards, Alex Rosenbaum Director RD Application Acceleration Mellanox Technologies 13 Zarhin st, Raanana, Israel +972 (74) 712-9215 Follow us on Twitter (http://twitter.com/mellanoxtech) and Facebook (http://www.facebook.com/pages/Mellanox-Technologies/223164879116)) plugin for generic shuffle service -- Key: MAPREDUCE-4049 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: performance, task, tasktracker Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0 Reporter: Avner BenHanoch Assignee: Avner BenHanoch Labels: merge, plugin, rdma, shuffle Fix For: 3.0.0 Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, mapreduce-4049.patch Support generic shuffle service as a set of two plugins: ShuffleProvider and ShuffleConsumer. This will satisfy the following needs: # Better shuffle and merge performance. For example: we are working on a shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, or Infiniband) instead of using the current HTTP shuffle. Based on the fast RDMA shuffle, the plugin can also utilize a suitable merge approach during the intermediate merges, hence getting much better performance. # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden dependency of NodeManager with a specific version of mapreduce shuffle (currently targeted to 0.24.0). References: # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu from Auburn University with others, [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf] # I am attaching 2 documents with suggested Top Level Design for both plugins (currently, based on 1.0 branch) # I am providing a link for downloading UDA - Mellanox's open source plugin that implements generic shuffle service using RDMA and levitated merge. Note: At this phase, the code is in C++ through JNI and you should consider it as beta only. Still, it can serve anyone that wants to implement or contribute to levitated merge. (Please be advised that levitated merge is mostly suited to very fast networks) - [http://www.mellanox.com/content/pages.php?pg=products_dynproduct_family=144menu_section=69]