[jira] [Updated] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avner BenHanoch updated MAPREDUCE-4049:
---
Description:
Support generic shuffle service as a set of two plugins: ShuffleProvider and ShuffleConsumer. This will satisfy the following needs:
# Better shuffle and merge performance. For example: we are working on a shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, or Infiniband) instead of using the current HTTP shuffle. Based on the fast RDMA shuffle, the plugin can also utilize a suitable merge approach during the intermediate merges, and hence get much better performance.
# Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden dependency of NodeManager on a specific version of mapreduce shuffle (currently targeted to 0.24.0).

References:
# Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu from Auburn University with others, [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
# I am attaching 2 documents with suggested Top Level Design for both plugins (currently, based on 1.0 branch)
# I am providing a link for downloading UDA - Mellanox's open source plugin that implements generic shuffle service using RDMA and levitated merge. Note: At this phase, the code is in C++ through JNI and you should consider it as beta only. Still, it can serve anyone who wants to implement or contribute to levitated merge. (Please be advised that levitated merge is mostly suited to very fast networks.) - [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69]

was:
Support generic shuffle service as a set of two plugins: ShuffleProvider and ShuffleConsumer. This will satisfy the following needs:
# Better shuffle and merge performance. For example: we are working on a shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, or Infiniband) instead of using the current HTTP shuffle. Based on the fast RDMA shuffle, the plugin can also utilize a suitable merge approach during the intermediate merges, and hence get much better performance.
# Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden dependency of NodeManager on a specific version of mapreduce shuffle (currently targeted to 0.24.0).

References:
# Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu from Auburn University with others, [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
# I am attaching 2 documents with suggested Top Level Design for both plugins (currently, based on 1.0 branch)

plugin for generic shuffle service
--
Key: MAPREDUCE-4049
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: performance, task, tasktracker
Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0
Reporter: Avner BenHanoch
Labels: merge, plugin, rdma, shuffle
Fix For: trunk
Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, mapreduce-4049.patch, mapreduce-4049.patch, mapreduce-4049.patch, mapreduce-4049.patch

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see:
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504461#comment-13504461 ]

Avner BenHanoch commented on MAPREDUCE-4049:

Hi Laxman,

Thanks for your comment, and sorry for my late response. I just posted a link for downloading the source code of the Mellanox plugin that implements generic shuffle using RDMA and levitated merge. You are warmly welcome to help push the algorithms of this plugin into the core of vanilla Hadoop, as well as to help get my straightforward patch in this JIRA issue accepted.

Avner

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4762) repair test org.apache.hadoop.mapreduce.security.token.TestDelegationTokenRenewal
[ https://issues.apache.org/jira/browse/MAPREDUCE-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan A. Veselovsky updated MAPREDUCE-4762:
--
Attachment: MAPREDUCE-4762--b.patch, MAPREDUCE-4762-branch-0.23--b.patch

Hi, Robert, the attached patches MAPREDUCE-4762-branch-0.23--b.patch and MAPREDUCE-4762--b.patch implement your suggestion. Patch MAPREDUCE-4762--b.patch is targeted at trunk and branch-2.

repair test org.apache.hadoop.mapreduce.security.token.TestDelegationTokenRenewal
-
Key: MAPREDUCE-4762
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4762
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Ivan A. Veselovsky
Attachments: MAPREDUCE-4762--b.patch, MAPREDUCE-4762-branch-0.23--b.patch, MAPREDUCE-4762-trunk.patch

The test org.apache.hadoop.mapreduce.security.token.TestDelegationTokenRenewal is @Ignore-d. As a result, several classes in package org.apache.hadoop.mapreduce.security.token have zero unit-test coverage.

The problem is that the test assumed that class org.apache.hadoop.mapreduce.security.token.TestDelegationTokenRenewal.Renewer is used as a custom implementation of the org.apache.hadoop.security.token.TokenRenewer service, but that did not happen, because this custom service implementation was not registered.

We solved this problem by using a special classloader that is invoked to find the resource META-INF/services/org.apache.hadoop.security.token.TokenRenewer, and supplies some custom content for it. This way the custom service implementation gets instantiated.
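The classloader approach described in the issue can be illustrated with plain java.util.ServiceLoader. This is only a minimal sketch and not the code from the attached patches: the Renewer interface, the TestRenewer class and the ServiceFileClassLoader name are hypothetical stand-ins for TokenRenewer and the test's own classes, and the service file is assumed to be written to a temporary file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Enumeration;
import java.util.ServiceLoader;

public class ServiceFileClassLoaderSketch {

    /** Hypothetical stand-in for org.apache.hadoop.security.token.TokenRenewer. */
    public interface Renewer {
        String kind();
    }

    /** Hypothetical stand-in for the custom implementation used by the test. */
    public static class TestRenewer implements Renewer {
        @Override
        public String kind() {
            return "test-renewer";
        }
    }

    /** Class loader that answers the lookup of one service file with custom content. */
    static class ServiceFileClassLoader extends ClassLoader {
        private final String resourceName;
        private final URL serviceFile;

        ServiceFileClassLoader(ClassLoader parent, Class<?> service, Class<?> impl)
                throws IOException {
            super(parent);
            this.resourceName = "META-INF/services/" + service.getName();
            // The "custom content": a provider file that names the implementation class.
            Path tmp = Files.createTempFile("service", ".txt");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, (impl.getName() + "\n").getBytes(StandardCharsets.UTF_8));
            this.serviceFile = tmp.toUri().toURL();
        }

        @Override
        public Enumeration<URL> getResources(String name) throws IOException {
            if (resourceName.equals(name)) {
                return Collections.enumeration(Collections.singletonList(serviceFile));
            }
            return super.getResources(name);
        }
    }

    /** Loads the service through the custom loader and returns the first provider's kind. */
    static String loadFirstRenewerKind() {
        try {
            ClassLoader loader = new ServiceFileClassLoader(
                    ServiceFileClassLoaderSketch.class.getClassLoader(),
                    Renewer.class, TestRenewer.class);
            for (Renewer renewer : ServiceLoader.load(Renewer.class, loader)) {
                return renewer.kind();
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(loadFirstRenewerKind());
    }
}
```

No META-INF/services file exists on the classpath in this sketch; the provider is found only because the loader intercepts the resource lookup, which is the effect the issue describes.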
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504502#comment-13504502 ]

Avner BenHanoch commented on MAPREDUCE-4049:

_Alejandro,_

With all due respect, I think that something in your behavior is inappropriate:
* You were never involved in this issue; still, you gave yourself the liberty to make it a sub-issue of your supported MAPREDUCE-2454 issue, without consulting anyone.
* This is especially inappropriate since MAPREDUCE-2454 is disputable and has its acceptance problems regardless of my issue. Hence, its acceptance problems will affect my issue.
* Your justification *As all this JIRAs are small, I think we'll be able to move fast with all of them.* is inappropriate, since you actually created a linkage that will surely postpone my issue instead of leaving each issue to progress at its own pace!
* It is not the first time that the persons behind MAPREDUCE-2454 have tried to disturb this JIRA issue.

Apparently, I don't have the privileges to break this sub-task linkage; hence, I am asking that you or someone else do it.

I welcome any comment coming from a professional place with the simple target of making Hadoop better. That said, I feel that the way you blitzed my patch with every possible petty comment, sometimes with disputable claims, just before the patch is about to be accepted, is unfair, unprofessional and unfriendly, especially considering your complete silence since this JIRA issue started.

I am not sure that commenting in a blitz way will increase the quality of Hadoop. For example:

{quote} Checking for shuffleConsumerPlugin != null before closing it seems redundant, you would have never got there if shufflePlugin is NULL. {quote}
This is your mistake (I'll reach there in case isLocal == true). *There is no option to remove the nullity check!*

{quote} Visibility annotations for the ShuffleConsumerPlugin, ShuffleContext, should be Unstable {quote}
I think it is inappropriate to declare a plugin interface as Unstable, since it must stay stable for 3rd party vendors.

----

Personally, I have no problem implementing all the rest of your comments. It should be very easy for me. Still, I am raising a few points for consideration regarding your following comments:

{quote} The Shuffle class should be renamed to DefaultShuffle. The ShuffleConsumerPlugin should be renamed to Shuffle. {quote}
I chose the term 'ShuffleConsumerPlugin' and not something like 'Shuffle', because it clarifies that we are in a *plugin* of *ShuffleConsumer*, rather than a *builtin* *ShuffleProvider/ShuffleHandler*. Also, I didn't take the liberty to rename core classes of Hadoop.

{quote} ShuffleConsumerPlugin, getShuffleConsumerPlugin() method is not required, instead use the ReflectionUtils directly in the ReducerTask class. {quote}
Here, I only followed the existing convention of Hadoop as shown in ResourceCalculatorPlugin.getResourceCalculatorPlugin(). Personally, I'll be glad to follow your advice, and even to go one step further and make ShuffleConsumerPlugin an interface instead of an abstract class.

{quote} use 'mapreduce.job.reduce.shuffle.class' to be consistent with MAPREDUCE-2454. {quote}
Here I chose 'mapreduce.shuffle…', since I think it is consistent with the current convention in hadoop-3 configuration.

----

I can tell you that Arun and Todd didn't make it easy for me with their requests for this patch so far. Still, I understand, respect and accept all their comments. I am sure that everyone involved only wants the best for Hadoop. I suggest we hear Arun's consideration and move forward with the patch in the best professional way.

_*Arun,*_

I think you are very familiar with both Hadoop/MapReduce and this JIRA issue since its inception. You are also well familiar and involved with MAPREDUCE-2454. It is also safe to say you know Alejandro and Asokan better than you know me. I believe everyone involved will agree that your sole interest is Hadoop's quality. *I am asking you and everyone else to help make progress here.*

Avner
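Two of the technical points debated in the comment above, loading the consumer plugin by class name from a configuration key and the null check before close() when the job runs locally, can be sketched as follows. This is an illustration under stated assumptions, not the code from mapreduce-4049.patch: the interface is a simplified hypothetical one, java.util.Properties and Class.forName stand in for Hadoop's Configuration and ReflectionUtils, and the key used is the one proposed in the review rather than the one in the patch.

```java
import java.util.Properties;

public class ShufflePluginLoadingSketch {

    /** Hypothetical, simplified plugin contract; not the interface from the patch. */
    public interface ShuffleConsumerPlugin {
        String name();
        void close();
    }

    /** Stand-in for the built-in HTTP shuffle. */
    public static class DefaultShuffle implements ShuffleConsumerPlugin {
        @Override public String name() { return "default"; }
        @Override public void close() { }
    }

    /** Stand-in for a third-party implementation such as an RDMA shuffle. */
    public static class RdmaShuffle implements ShuffleConsumerPlugin {
        @Override public String name() { return "rdma"; }
        @Override public void close() { }
    }

    // Key suggested in the review; the patch under discussion used a 'mapreduce.shuffle…' key.
    static final String PLUGIN_KEY = "mapreduce.job.reduce.shuffle.class";

    /** Instantiates the configured plugin, falling back to the default implementation. */
    static ShuffleConsumerPlugin load(Properties conf) {
        String className = conf.getProperty(PLUGIN_KEY, DefaultShuffle.class.getName());
        try {
            return Class.forName(className)
                    .asSubclass(ShuffleConsumerPlugin.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load shuffle plugin " + className, e);
        }
    }

    /** Mimics a reduce task: no plugin is created for local jobs, so close() must be guarded. */
    static String runReduce(Properties conf, boolean isLocal) {
        ShuffleConsumerPlugin plugin = null;
        try {
            if (!isLocal) {
                plugin = load(conf);
            }
            return plugin == null ? "local" : plugin.name();
        } finally {
            if (plugin != null) { // still needed: plugin stays null when isLocal == true
                plugin.close();
            }
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(PLUGIN_KEY, RdmaShuffle.class.getName());
        System.out.println(runReduce(conf, false));
        System.out.println(runReduce(conf, true));
    }
}
```

Whether the loading code lives in a static factory on the plugin type or directly in the reduce task is exactly the design choice under discussion; the sketch keeps it in a separate load() method only for readability.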
[jira] [Commented] (MAPREDUCE-4764) repair test org.apache.hadoop.mapreduce.security.TestBinaryTokenFile
[ https://issues.apache.org/jira/browse/MAPREDUCE-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504521#comment-13504521 ]

Hudson commented on MAPREDUCE-4764:
---
Integrated in Hadoop-Yarn-trunk #49 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/49/])
MAPREDUCE-4764. repair TestBinaryTokenFile (Ivan A. Veselovsky via bobby) (Revision 1413739)

Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1413739
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/security/TestBinaryTokenFile.java

repair test org.apache.hadoop.mapreduce.security.TestBinaryTokenFile
Key: MAPREDUCE-4764
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4764
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Ivan A. Veselovsky
Fix For: 3.0.0, 2.0.3-alpha, 0.23.6
Attachments: MAPREDUCE-4764.patch, MAPREDUCE-4764-trunk.patch

The test is @Ignore-d, and fails when enabled. It is suggested to repair it to fill the coverage gap.

Problems fixed in the test:
(1) MRConfig.FRAMEWORK_NAME and YarnConfiguration.RM_PRINCIPAL properties must be correctly set in the configuration to correctly enable the security in the way this test implies.
(2) The property MRJobConfig.MAPREDUCE_JOB_CREDENTIALS_BINARY is no longer passed into the Job configuration -- it is intentionally deleted from there. So, we pass the binary file name in another dedicated property.
(3) The test was using deprecated cluster classes. All of them are updated to their modern analogs.
(4) The delegation token found in the job context is now correctly compared to the one deserialized from the binary file.
[jira] [Commented] (MAPREDUCE-4762) repair test org.apache.hadoop.mapreduce.security.token.TestDelegationTokenRenewal
[ https://issues.apache.org/jira/browse/MAPREDUCE-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504537#comment-13504537 ]

Hadoop QA commented on MAPREDUCE-4762:
--
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12554983/MAPREDUCE-4762--b.patch against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3071//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3071//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504558#comment-13504558 ]

Laxman commented on MAPREDUCE-4049:
---
bq. You are warmly welcomed to contribute to push the algorithms of this plugin to the core of vanilla Hadoop

Thank you, Avner. I wish to see this as part of Hadoop. I'm not able to build the UDA you have provided by following the BUILD.README in the downloaded bundle. The SVN repository given there is not accessible/resolvable: https://sirius.voltaire.com/repos/enterprise/uda/trunk

bq. as well as to help accepting my straight forward patch in this JIRA issue.

I will personally request a few of my friends (Hadoop contributors) to review this JIRA.
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504560#comment-13504560 ]

Laxman commented on MAPREDUCE-4049:
---
I'm trying to build as per the README available here (http://mellanox.com/downloads/UDA/UDA3.0_Release.tar.gz).
[jira] [Commented] (MAPREDUCE-4764) repair test org.apache.hadoop.mapreduce.security.TestBinaryTokenFile
[ https://issues.apache.org/jira/browse/MAPREDUCE-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504581#comment-13504581 ]

Hudson commented on MAPREDUCE-4764:
---
Integrated in Hadoop-Hdfs-0.23-Build #448 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/448/])
svn merge -c 1413739 FIXES: MAPREDUCE-4764. repair TestBinaryTokenFile (Ivan A. Veselovsky via bobby) (Revision 1413742)

Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1413742
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/security/TestBinaryTokenFile.java
[jira] [Commented] (MAPREDUCE-4764) repair test org.apache.hadoop.mapreduce.security.TestBinaryTokenFile
[ https://issues.apache.org/jira/browse/MAPREDUCE-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504589#comment-13504589 ]

Hudson commented on MAPREDUCE-4764:
---
Integrated in Hadoop-Hdfs-trunk #1239 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1239/])
MAPREDUCE-4764. repair TestBinaryTokenFile (Ivan A. Veselovsky via bobby) (Revision 1413739)

Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1413739
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/security/TestBinaryTokenFile.java
[jira] [Commented] (MAPREDUCE-4764) repair test org.apache.hadoop.mapreduce.security.TestBinaryTokenFile
[ https://issues.apache.org/jira/browse/MAPREDUCE-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504606#comment-13504606 ]

Hudson commented on MAPREDUCE-4764:
---
Integrated in Hadoop-Mapreduce-trunk #1270 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1270/])
MAPREDUCE-4764. repair TestBinaryTokenFile (Ivan A. Veselovsky via bobby) (Revision 1413739)

Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1413739
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/security/TestBinaryTokenFile.java
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504624#comment-13504624 ]

Avner BenHanoch commented on MAPREDUCE-4049:

Hi Laxman,

You are referring to an internal document (there is no external document yet :)). The svn repository is only used internally, for downloading clean sources when releasing a new version. However, you already have the sources, so you don't need it.

In fact, I think you should use:
# src/premake.sh
# build/makerpm.sh

Also, please expect these compilation dependencies:
* In the C++ part, on librdmacm-devel.
* In the Java part, you'll need to copy the Hadoop jars that are used by the plugin into the plugin's directory (see which ones according to CLASSPATH in the makefile in the plugin's directory).

Before you go on with the Java side, you may choose to edit makerpm.sh and comment out the Hadoop flavors that you don't care about.

Please be aware that you are the first one who tries to build the sources outside Mellanox. Also, I am not sure this is the place and way to get support for Mellanox products.
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504633#comment-13504633 ] Alejandro Abdelnur commented on MAPREDUCE-4049: --- Hi Avner, I respectfully disagree with your opinion that my behavior is inappropriate. First of all, it is not my intention to slow this JIRA down, but to make sure it is consistent with the related work in MAPREDUCE-2454 (you can see that in my comments). If that requires a couple of extra days, it is a small price to pay. As an Apache Hadoop developer it is my responsibility to review and provide feedback on work posted by other developers; my usual triggers are area of knowledge, related work and area of interest. This JIRA is tightly related to MAPREDUCE-2454, there is no dispute on that. Thus it should stay as a subtask of it. MAPREDUCE-2454 is not disputable; as has been commented in its JIRA, it is almost ready, it was a matter of breaking it up and doing a fast interactive review of its parts. As far as I can tell, this is already happening there. Now going to your comments on my review: * Yes, the *shuffleConsumerPlugin != null*, you are right. I noticed that after I posted my comments, so you can disregard that one. * On marking the ShuffleConsumerPlugin and ShuffleContext as *unstable*, it is appropriate: Hadoop wants to keep the right to modify these APIs in the future, if the need arises. You can also see this, not only in MAPREDUCE-2454, but in several places where Hadoop provides pluggability (e.g. ResourceManagement, authentication). * On making the ShuffleConsumerPlugin an interface, that is a good idea, it will align things with the other sub-tasks. Looking forward to seeing the updated patch. Cheers. 
plugin for generic shuffle service -- Key: MAPREDUCE-4049 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: performance, task, tasktracker Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0 Reporter: Avner BenHanoch Labels: merge, plugin, rdma, shuffle Fix For: trunk Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, mapreduce-4049.patch, mapreduce-4049.patch, mapreduce-4049.patch, mapreduce-4049.patch Support generic shuffle service as set of two plugins: ShuffleProvider ShuffleConsumer. This will satisfy the following needs: # Better shuffle and merge performance. For example: we are working on shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, or Infiniband) instead of using the current HTTP shuffle. Based on the fast RDMA shuffle, the plugin can also utilize a suitable merge approach during the intermediate merges. Hence, getting much better performance. # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden dependency of NodeManager with a specific version of mapreduce shuffle (currently targeted to 0.24.0). References: # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu from Auburn University with others, [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf] # I am attaching 2 documents with suggested Top Level Design for both plugins (currently, based on 1.0 branch) # I am providing link for downloading UDA - Mellanox's open source plugin that implements generic shuffle service using RDMA and levitated merge. Note: At this phase, the code is in C++ through JNI and you should consider it as beta only. Still, it can serve anyone that wants to implement or contribute to levitated merge. (Please be advised that levitated merge is mostly suit in very fast networks) - [http://www.mellanox.com/content/pages.php?pg=products_dynproduct_family=144menu_section=69] -- This message is automatically generated by JIRA. 
[jira] [Commented] (MAPREDUCE-4813) AM timing out during job commit
[ https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504635#comment-13504635 ] Jason Lowe commented on MAPREDUCE-4813: --- MAPREDUCE-4815 only addresses FileOutputCommitter and friends, but the committer is arbitrary user code. It could be doing all sorts of things including connecting to databases, etc. So I still think we need this, although the priority of it is reduced given how many things are built from FileOutputCommitter. AM timing out during job commit --- Key: MAPREDUCE-4813 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: MAPREDUCE-4813.patch The AM calls the output committer's {{commitJob}} method synchronously during JobImpl state transitions, which means the JobImpl write lock is held the entire time the job is being committed. Holding the write lock prevents the RM allocator thread from heartbeating to the RM. Therefore if committing the job takes too long (e.g.: the job has tons of files to commit and/or the namenode is bogged down) then the AM appears to be unresponsive to the RM and the RM kills the AM attempt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
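The locking problem described in MAPREDUCE-4813 can be illustrated with a small sketch. This is not the attached patch and does not use the real JobImpl; it only shows one possible shape of a fix, assuming the slow commit is handed to a separate thread so the write lock is held only for the short state changes. The class name `AsyncCommitSketch`, the string states, and `readStateForHeartbeat` are hypothetical stand-ins.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a job state machine; not the real JobImpl.
public class AsyncCommitSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final ExecutorService committerPool = Executors.newSingleThreadExecutor();
    private String state = "RUNNING"; // guarded by lock

    /** Takes the write lock only long enough to record the transition and queue the commit. */
    public Future<?> startCommit(Runnable commitJob) {
        lock.writeLock().lock();
        try {
            state = "COMMITTING";
            return committerPool.submit(() -> {
                commitJob.run();     // slow, arbitrary user code runs WITHOUT the lock
                onCommitFinished();  // short transition re-acquires the lock
            });
        } finally {
            lock.writeLock().unlock();
        }
    }

    private void onCommitFinished() {
        lock.writeLock().lock();
        try {
            state = "SUCCEEDED";
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Stand-in for the heartbeat path, which only needs the read lock. */
    public String readStateForHeartbeat() {
        lock.readLock().lock();
        try {
            return state;
        } finally {
            lock.readLock().unlock();
        }
    }

    public void shutdown() {
        committerPool.shutdown();
    }
}
```

In this sketch a heartbeat can read the state while the commit is still in progress. Handling of a commit that throws is left out for brevity.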
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504646#comment-13504646 ] Jason Lowe commented on MAPREDUCE-4819: --- bq. Maybe final client notification should be the last thing after all post processing is done. No, moving the client notification later just creates a different set of problems, like the client never being notified *at all* because the AM crashes after unregistering with the RM but before it notifies the client. The RM won't restart the app because it unregistered successfully, but the client is never notified. bq. In general it seems like we need to come up with a set of markers that previous AM's leave behind that can tell the next retry if the previous one failed/succeeded and so the current AM should exit or continue to run. Exactly, and the AM is already doing this in the job history file, which is how it helps support recovery. We should extend this so that even if the output committer doesn't support recovery, the AM will check for markers in the job history file and prevent the job from executing tasks and committing output if the final job status has been determined by previous attempts. That way we prevent the AM from re-committing job output or changing the final job status after notifying the client. We just need to make sure those markers are flushed to a persistent store and located properly by future AM attempts before attempting to notify the client. If subsequent attempts see the final job status marker then they should skip straight to the client notification process instead of running tasks. 
AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (MAPREDUCE-4822) Unnessisary conversions in History Events
Robert Joseph Evans created MAPREDUCE-4822: -- Summary: Unnessisary conversions in History Events Key: MAPREDUCE-4822 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4822 Project: Hadoop Map/Reduce Issue Type: Improvement Components: jobhistoryserver Affects Versions: 0.23.4 Reporter: Robert Joseph Evans Priority: Trivial There are a number of conversions in the Job History Event classes that are totally unnecessary. It appears that they were originally used to convert from the internal avro format, but now many of them do not pull the values from the avro; they store them internally. For example:
{code:title=TaskAttemptFinishedEvent.java}
/** Get the task type */
public TaskType getTaskType() {
  return TaskType.valueOf(taskType.toString());
}
{code}
The code currently is taking an enum, converting it to a string and then asking the same enum to convert it back to an enum. If Java works properly this should be a no-op and a reference to the original taskType should be returned. There are several places where a string is having toString called on it, and since strings are immutable it returns a reference to itself. The various ids are not immutable and probably should not be changed at this point. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
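The round trip described in MAPREDUCE-4822 can be checked in isolation. The sketch below uses a hypothetical stand-in enum named `TaskType` (not Hadoop's own class) and assumes the enum does not override `toString()`, in which case `valueOf(x.toString())` returns the very same constant as `x`.

```java
public class EnumRoundTrip {
    /** Hypothetical stand-in for Hadoop's TaskType; toString() is not overridden. */
    public enum TaskType { MAP, REDUCE }

    private final TaskType taskType;

    public EnumRoundTrip(TaskType taskType) {
        this.taskType = taskType;
    }

    /** The quoted pattern: enum -> String -> enum. */
    public TaskType getTaskTypeRoundTrip() {
        return TaskType.valueOf(taskType.toString());
    }

    /** The direct equivalent with no conversion. */
    public TaskType getTaskTypeDirect() {
        return taskType;
    }

    public static void main(String[] args) {
        EnumRoundTrip event = new EnumRoundTrip(TaskType.REDUCE);
        // Enum constants are singletons, so reference equality holds.
        System.out.println(event.getTaskTypeRoundTrip() == event.getTaskTypeDirect());
    }
}
```

If an enum did override `toString()` to return something other than its name, the round trip could throw `IllegalArgumentException`, which is one more reason to return the field directly.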
[jira] [Created] (MAPREDUCE-4823) NPE in jobhistory.jsp
Steve Loughran created MAPREDUCE-4823: - Summary: NPE in jobhistory.jsp Key: MAPREDUCE-4823 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4823 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobhistoryserver Affects Versions: 1.0.3 Environment: Running on a JT which had a bit of confusion w.r.t its hostname (two IP addresses in /etc/hosts for the same hostname) Reporter: Steve Loughran Priority: Minor asking for the job history page resulted in a stack trace instead of (an empty) job history -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (MAPREDUCE-4824) Provide a mechanism for jobs to indicate they should not be recovered on restart
Tom White created MAPREDUCE-4824: Summary: Provide a mechanism for jobs to indicate they should not be recovered on restart Key: MAPREDUCE-4824 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4824 Project: Hadoop Map/Reduce Issue Type: New Feature Components: mrv1 Affects Versions: 1.1.0 Reporter: Tom White Assignee: Tom White Some jobs (like Sqoop or HBase jobs) are not idempotent, so should not be recovered on jobtracker restart. MAPREDUCE-2702 solves this problem for MR2, however the approach there is not applicable for MR1, since even if we only use the job-level part of the patch and add a isRecoverySupported method to OutputCommitter, there is no way to use that information from the JT (which initiates recovery), since the JT does not instantiate OutputCommitters - and it shouldn't since they are user-level code. (In MR2 it's OK since the MR AM calls the method.) Instead, we can add a MR configuration property to say that a job is not recoverable, and the JT could safely read this from the job conf. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4823) NPE in jobhistory.jsp
[ https://issues.apache.org/jira/browse/MAPREDUCE-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504699#comment-13504699 ] Steve Loughran commented on MAPREDUCE-4823: --- Stack trace, which bears no relation to where in the JSP page the actual NPE was triggered. The generated Java pages would show it.
{code}
java.lang.NullPointerException
  at org.apache.hadoop.mapred.jobhistoryhome_jsp._jspService(jobhistoryhome_jsp.java:151)
  at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
  at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
  at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:814)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
  at org.mortbay.jetty.Server.handle(Server.java:326)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
  at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
  at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
  at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
{code}
NPE in jobhistory.jsp - Key: MAPREDUCE-4823 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4823 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobhistoryserver Affects Versions: 1.0.3 Environment: Running on a JT which had a bit of confusion w.r.t its hostname (two IP addresses in /etc/hosts for the same hostname) Reporter: Steve Loughran Priority: Minor asking for the job history page resulted in a stack trace instead of (an empty) job history -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4824) Provide a mechanism for jobs to indicate they should not be recovered on restart
[ https://issues.apache.org/jira/browse/MAPREDUCE-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom White updated MAPREDUCE-4824: - Attachment: MAPREDUCE-4824.patch Here's a patch that implements this idea. Jobs that shouldn't be recovered should set mapred.job.restart.recover to false. Provide a mechanism for jobs to indicate they should not be recovered on restart Key: MAPREDUCE-4824 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4824 Project: Hadoop Map/Reduce Issue Type: New Feature Components: mrv1 Affects Versions: 1.1.0 Reporter: Tom White Assignee: Tom White Attachments: MAPREDUCE-4824.patch Some jobs (like Sqoop or HBase jobs) are not idempotent, so should not be recovered on jobtracker restart. MAPREDUCE-2702 solves this problem for MR2, however the approach there is not applicable for MR1, since even if we only use the job-level part of the patch and add a isRecoverySupported method to OutputCommitter, there is no way to use that information from the JT (which initiates recovery), since the JT does not instantiate OutputCommitters - and it shouldn't since they are user-level code. (In MR2 it's OK since the MR AM calls the method.) Instead, we can add a MR configuration property to say that a job is not recoverable, and the JT could safely read this from the job conf. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
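As a rough illustration of the MAPREDUCE-4824 idea, the check on the JT side could look something like the sketch below. It is not taken from the attached patch: `java.util.Properties` stands in for the job conf, the class and method names are hypothetical, and the default of `true` (recover unless told otherwise) is an assumption. Only the property name `mapred.job.restart.recover` comes from the comment above.

```java
import java.util.Properties;

public class RecoverableJobCheck {
    /** Property name from the comment above. */
    public static final String RECOVER_KEY = "mapred.job.restart.recover";

    /**
     * Returns true if the job should be recovered on restart.
     * Assumption: an unset property means "recover", matching existing behavior.
     * Note that Boolean.parseBoolean treats any value other than "true"
     * (case-insensitive) as false.
     */
    public static boolean shouldRecover(Properties jobConf) {
        return Boolean.parseBoolean(jobConf.getProperty(RECOVER_KEY, "true"));
    }

    public static void main(String[] args) {
        Properties sqoopLikeJob = new Properties();
        sqoopLikeJob.setProperty(RECOVER_KEY, "false"); // non-idempotent job opts out
        System.out.println(shouldRecover(sqoopLikeJob));
        System.out.println(shouldRecover(new Properties()));
    }
}
```

The point of the design is that the decision is read from configuration alone, so no user-level OutputCommitter code has to be instantiated in the JT.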
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504758#comment-13504758 ] Koji Noguchi commented on MAPREDUCE-4819: - bq. like the client never being notified at all because the AM crashes after unregistering with the RM but before it notifies the client. As long as the client eventually fails, that's not a problem. The critical problem we have here is a false positive from the client's perspective. The client is getting 'success' but the output is incomplete or corrupt (due to a retried application/job (over)writing to the same target path). If we can have the AM and RM agree on the job status before telling the client, I think that would work. There could be a corner case where the AM and RM say the job was successful but the client thinks it failed. This false negative is much better than the false-positive issue we have now. Even in 0.20, we had cases where the JobTracker reported the job was successful but the client thought it failed due to a communication failure with the JobTracker. This is fine to happen and we should let the client handle the recovery-or-retry. bq. In general it seems like we need to come up with a set of markers that previous AM's leave behind I don't want the correctness of the job to depend on the marker on hdfs. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). 
Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504767#comment-13504767 ] Koji Noguchi commented on MAPREDUCE-4819: - bq. I don't want the correctness of the job to depend on the marker on hdfs. I meant, hdfs on user space like outputpath. If this is stored elsewhere where user cannot access, I have no problem. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data
[ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504777#comment-13504777 ] Thomas Graves commented on MAPREDUCE-4817: -- When you say knock off the ping thread I assume you really mean just the ping timeout check since the task progress happens in the same thread? So the ping serves multiple purposes. Currently it notifies the AM that the task has pinged in and is still running. This could be useful even with taskTimeout since the taskTimeout could be turned off (set to 0) and we would never know if that task got hung. Second, the task uses it to check to see if the AM is still alive. If it doesn't return true, the task is supposed to exit. 1.X also had the ping check, but it went to the taskTracker and the tasktracker validated that the parent Task of the ping checker thread was still there. Now with 0.23 the nodemanager is watching the processes and talking back to the RM to let it know that the AM died and if it died it kills the other tasks, but if the entire nodemanager goes down then the task doesn't know the AM went away. If the task isn't sending progress, and the task timeout is set to 0, and this is the last AM retry it could hang around forever. The odds of that seem pretty small and I guess if we aren't worried about the first happening, the second probably isn't that interesting either. But we could also just remove the ping timeout check in the TaskHeartBeatHandler. What exactly are you proposing? Hardcoded task ping timeout kills tasks localizing large amounts of data Key: MAPREDUCE-4817 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4817 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster, mr-am Affects Versions: 0.23.3, 2.0.3-alpha Reporter: Jason Lowe Assignee: Thomas Graves Priority: Critical When a task is launched and spends more than 5 minutes localizing files, the AM will kill the task due to ping timeout. 
The AM's TaskHeartbeatHandler currently tracks tasks via a progress timeout and a ping timeout. The progress timeout can be controlled via mapreduce.task.timeout and even disabled by setting the property to 0. The ping timeout, however, is hardcoded to 5 minutes and cannot be configured. Therefore if the task takes too long localizing, it never gets running in order to ping back to the AM and the AM kills it due to ping timeout. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
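The two timeouts described for MAPREDUCE-4817 can be sketched as a single check. This is an illustration only, not the TaskHeartbeatHandler code: the class name and parameters are hypothetical, and treating a ping timeout of 0 as "disabled" (mirroring how mapreduce.task.timeout behaves for progress) is an assumption.

```java
public class HeartbeatCheckSketch {
    private final long progressTimeoutMs; // 0 disables the progress check
    private final long pingTimeoutMs;     // configurable here instead of hardcoded; 0 disables (assumption)

    public HeartbeatCheckSketch(long progressTimeoutMs, long pingTimeoutMs) {
        this.progressTimeoutMs = progressTimeoutMs;
        this.pingTimeoutMs = pingTimeoutMs;
    }

    /** Returns true if the task should be killed given its last ping and last progress times. */
    public boolean timedOut(long nowMs, long lastPingMs, long lastProgressMs) {
        boolean pingExpired = pingTimeoutMs > 0 && nowMs - lastPingMs > pingTimeoutMs;
        boolean progressExpired = progressTimeoutMs > 0 && nowMs - lastProgressMs > progressTimeoutMs;
        return pingExpired || progressExpired;
    }

    public static void main(String[] args) {
        long sixMinutes = 6 * 60 * 1000L;
        // Task has spent six minutes localizing and has not pinged yet.
        HeartbeatCheckSketch hardcodedLike = new HeartbeatCheckSketch(0, 5 * 60 * 1000L);
        HeartbeatCheckSketch raised = new HeartbeatCheckSketch(0, 20 * 60 * 1000L);
        System.out.println(hardcodedLike.timedOut(sixMinutes, 0, 0));
        System.out.println(raised.timedOut(sixMinutes, 0, 0));
    }
}
```

With the progress timeout disabled, the 5-minute ping limit alone kills the slow-localizing task, while a larger configured value lets it survive.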
[jira] [Commented] (MAPREDUCE-4821) Unit Test: TestJobTrackerRestart fails when it is run with ant-1.8.4
[ https://issues.apache.org/jira/browse/MAPREDUCE-4821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504776#comment-13504776 ] Steve Loughran commented on MAPREDUCE-4821: --- is there a JUnit 3 jar in your Ant classpath? There has to be a junit4 one else the test case won't compile -I suspect your ant installation has a junit jar that's being picked up first at test run time. {{ant -diagnostics}} will show this. If it's there, delete it and see what happens when the original test is rerun. Unit Test: TestJobTrackerRestart fails when it is run with ant-1.8.4 Key: MAPREDUCE-4821 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4821 Project: Hadoop Map/Reduce Issue Type: Bug Components: test Affects Versions: 1.0.3, 1.0.4 Environment: RHEL 6.3 on x86 Reporter: Amir Sanjar Fix For: 1.0.3, 1.1.1 Attachments: MAPREDUCE-4821-branch1.patch, MAPREDUCE-4821-release-1.0.3.patch Problem: JUnit tag @Ignore is not recognized since the testcase is JUnit3 and not JUnit4: Solution: Migrate the testcase to JUnit4, including: * Remove extends TestCase * Remove import junit.framework.TestCase; * Add import org.junit.*; * Use appropriate annotations such as @After, @Before, @Test. uploading a patch shortly -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13504784#comment-13504784 ] Robert Joseph Evans commented on MAPREDUCE-4819: We are informing several different actors of success/failure in many different ways. # _SUCCESS file being written to HDFS by the output committer as part of commitJob() # job end notification by hitting an http server # client being informed through RPC # history server being informed by placing the log in a directory it can see # resource manager being informed that the application is done Some of these are much more important to report than others, but either way we still have at a minimum two different things that need to be tied together: the commitJob and informing the RM not to run us again. Rearranging the order of them will not fix the fact that after commitJob() finishes there is the possibility that something will fail but must not fail the job. We really need to have a two phase commit in the job history file: first record "I am about to commit the job output", then call commitJob(), then record "I finished committing the job output successfully". Without this there will always be the possibility that commitJob will be called twice, which would result in changes to the output directory. I would argue too that some of these are important enough that we consider reporting them twice and updating the listener to handle double reporting, like informing the history server about the job finishing. For others it is not so critical, like job end notification or client RPC. Koji, I get that we want to reduce the risk of a user shooting themselves in the foot, but the file must be stored in a user accessible location because the entire job is run as the user. It is stored under the .staging directory, which, if the user deletes it, will already cause many other problems and probably cause the job to fail. 
We can try to set it up so that if the previous job history file does not exist on any app attempt but the first one we fail fast. That would prevent retries from messing up the output directory. AM can rerun job after reporting final job status to the client --- Key: MAPREDUCE-4819 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 Project: Hadoop Map/Reduce Issue Type: Bug Components: mr-am Affects Versions: 0.23.3, 2.0.1-alpha Reporter: Jason Lowe Assignee: Bikas Saha Priority: Critical If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery). Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
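The two phase commit discussed for MAPREDUCE-4819 can be sketched as follows. This is only an illustration of the idea in the comments, not code from Hadoop: an in-memory list stands in for the job history file, all names are hypothetical, and the choice to fail fast when a previous attempt started but did not finish the commit is an assumption rather than something the thread settled.

```java
import java.util.ArrayList;
import java.util.List;

public class CommitMarkerSketch {
    /** Markers a previous AM attempt may have left in the (stand-in) job history. */
    public enum Marker { COMMIT_STARTED, COMMIT_SUCCEEDED }

    /** What a new AM attempt should do on startup. */
    public enum Action { RUN_JOB, SKIP_TO_NOTIFICATION, FAIL_FAST }

    /** Phase 1 marker, the commit itself, then the phase 2 marker. */
    public static void commit(List<Marker> history, Runnable commitJob) {
        history.add(Marker.COMMIT_STARTED);   // must be durable before commitJob() runs
        commitJob.run();
        history.add(Marker.COMMIT_SUCCEEDED); // must be durable before anyone is notified
    }

    /** Decide what a later attempt does, based on markers from earlier attempts. */
    public static Action onRestart(List<Marker> history) {
        if (history.contains(Marker.COMMIT_SUCCEEDED)) {
            return Action.SKIP_TO_NOTIFICATION; // never commit twice
        }
        if (history.contains(Marker.COMMIT_STARTED)) {
            return Action.FAIL_FAST; // output may be partially committed (assumed policy)
        }
        return Action.RUN_JOB;
    }

    public static void main(String[] args) {
        List<Marker> history = new ArrayList<>();
        System.out.println(onRestart(history));
        commit(history, () -> { /* output committer work would go here */ });
        System.out.println(onRestart(history));
    }
}
```

A crash between the two markers is the only window in which a later attempt cannot tell whether the output is complete, which is why that case is singled out.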
[jira] [Updated] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data
[ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated MAPREDUCE-4817: - Attachment: MAPREDUCE-4817.patch Here is the patch that adds the config for the ping timeout. Attaching it because it was already finished before the other comments, and in case we want to go that way. Hardcoded task ping timeout kills tasks localizing large amounts of data Key: MAPREDUCE-4817 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4817 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster, mr-am Affects Versions: 0.23.3, 2.0.3-alpha Reporter: Jason Lowe Assignee: Thomas Graves Priority: Critical Attachments: MAPREDUCE-4817.patch When a task is launched and spends more than 5 minutes localizing files, the AM will kill the task due to ping timeout. The AM's TaskHeartbeatHandler currently tracks tasks via a progress timeout and a ping timeout. The progress timeout can be controlled via mapreduce.task.timeout and even disabled by setting the property to 0. The ping timeout, however, is hardcoded to 5 minutes and cannot be configured. Therefore if the task takes too long localizing, it never gets running in order to ping back to the AM and the AM kills it due to ping timeout. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (MAPREDUCE-4825) JobImpl.finished doesn't expect ERROR as a final job state
Jason Lowe created MAPREDUCE-4825:
Summary: JobImpl.finished doesn't expect ERROR as a final job state
Key: MAPREDUCE-4825
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4825
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am
Reporter: Jason Lowe
TestMRApp.testJobError is causing AsyncDispatcher to exit with System.exit due to an exception being thrown. From the console output from testJobError:
{noformat}
2012-11-27 18:46:15,240 ERROR [AsyncDispatcher event handler] impl.TaskImpl (TaskImpl.java:internalError(665)) - Invalid event T_SCHEDULE on Task task_0__m_00
2012-11-27 18:46:15,242 FATAL [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(132)) - Error in dispatcher thread
java.lang.IllegalArgumentException: Illegal job state: ERROR
    at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.finished(JobImpl.java:838)
    at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InternalErrorTransition.transition(JobImpl.java:1622)
    at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InternalErrorTransition.transition(JobImpl.java:1)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:359)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:299)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$3(StateMachineFactory.java:287)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
    at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:723)
    at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:1)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:974)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:128)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
    at java.lang.Thread.run(Thread.java:662)
2012-11-27 18:46:15,242 INFO [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(135)) - Exiting, bbye..
{noformat}
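The failure mode in the trace can be sketched as a switch over final job states that rejects anything it does not recognize. The code below is a hypothetical stand-in, not the real JobImpl and not the eventual patch: the JobState enum, the returned labels, and the decision to handle ERROR the same way as FAILED are assumptions for illustration only.

```java
// Hypothetical sketch of a "finished" handler; not the actual JobImpl code.
class JobFinishSketch {
    enum JobState { RUNNING, SUCCEEDED, FAILED, KILLED, ERROR }

    // Returns a label for the bookkeeping done for a final state.
    // States that are not final are rejected with the same kind of
    // exception that appears in the console output.
    static String finished(JobState finalState) {
        switch (finalState) {
            case SUCCEEDED:
                return "completed";
            case KILLED:
                return "killed";
            case ERROR:  // assumed fix: accept ERROR and account for it like FAILED
            case FAILED:
                return "failed";
            default:
                throw new IllegalArgumentException("Illegal job state: " + finalState);
        }
    }

    public static void main(String[] args) {
        System.out.println(finished(JobState.ERROR));
    }
}
```

Without the ERROR case, the call in main would fall through to the default branch and throw, which is the behavior the test is hitting.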
[jira] [Updated] (MAPREDUCE-4825) JobImpl.finished doesn't expect ERROR as a final job state
[ https://issues.apache.org/jira/browse/MAPREDUCE-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-4825:
Attachment: MAPREDUCE-4825.patch
Simple fix. No additional unit tests since this is fixing an existing test.
JobImpl.finished doesn't expect ERROR as a final job state
Key: MAPREDUCE-4825
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4825
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am
Reporter: Jason Lowe
Attachments: MAPREDUCE-4825.patch
[jira] [Updated] (MAPREDUCE-4825) JobImpl.finished doesn't expect ERROR as a final job state
[ https://issues.apache.org/jira/browse/MAPREDUCE-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-4825:
Assignee: Jason Lowe
Target Version/s: 2.0.3-alpha, 0.23.6
Affects Version/s: 0.23.5, 2.0.3-alpha
Status: Patch Available (was: Open)
JobImpl.finished doesn't expect ERROR as a final job state
Key: MAPREDUCE-4825
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4825
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am
Affects Versions: 2.0.3-alpha, 0.23.5
Reporter: Jason Lowe
Assignee: Jason Lowe
Attachments: MAPREDUCE-4825.patch
[jira] [Commented] (MAPREDUCE-4825) JobImpl.finished doesn't expect ERROR as a final job state
[ https://issues.apache.org/jira/browse/MAPREDUCE-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504861#comment-13504861 ]
Hadoop QA commented on MAPREDUCE-4825:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12555049/MAPREDUCE-4825.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3072//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3072//console
This message is automatically generated.
JobImpl.finished doesn't expect ERROR as a final job state
Key: MAPREDUCE-4825
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4825
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am
Affects Versions: 2.0.3-alpha, 0.23.5
Reporter: Jason Lowe
Assignee: Jason Lowe
Attachments: MAPREDUCE-4825.patch
[jira] [Commented] (MAPREDUCE-4824) Provide a mechanism for jobs to indicate they should not be recovered on restart
[ https://issues.apache.org/jira/browse/MAPREDUCE-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504897#comment-13504897 ]
Harsh J commented on MAPREDUCE-4824:
Hi,
- I feel the exception message below can be improved. I think it's better to say "Job ID was not recovered since it disabled recovery-upon-restart (mapred.job.restart.recover set to false).". Also, since this case is to be expected (non-default override), I think it ought to be a simple INFO log, but I understand we need to throw an Exception to halt the loading of the JIP.
{code}
+ if (recovered && !conf.getBoolean("mapred.job.restart.recover", true)) {
+   throw new IOException("Job " + jobId + " should not be recovered " +
+       "since mapred.job.restart.recover is set to false.");
+ }
{code}
- We could also add this property to mapred-default.xml and document it that way.
The test changes look good.
Provide a mechanism for jobs to indicate they should not be recovered on restart
Key: MAPREDUCE-4824
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4824
Project: Hadoop Map/Reduce
Issue Type: New Feature
Components: mrv1
Affects Versions: 1.1.0
Reporter: Tom White
Assignee: Tom White
Attachments: MAPREDUCE-4824.patch
Some jobs (like Sqoop or HBase jobs) are not idempotent, so they should not be recovered on jobtracker restart. MAPREDUCE-2702 solves this problem for MR2; however, the approach there is not applicable to MR1. Even if we only use the job-level part of the patch and add an isRecoverySupported method to OutputCommitter, there is no way to use that information from the JT (which initiates recovery), since the JT does not instantiate OutputCommitters, and it shouldn't, since they are user-level code. (In MR2 it's OK since the MR AM calls the method.) Instead, we can add an MR configuration property to say that a job is not recoverable, and the JT could safely read this from the job conf.
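The recovery check discussed in this comment can be tried out in isolation with a small stand-in. The sketch below is an assumption-laden illustration, not the patch itself: RecoveryCheckSketch and its helper methods are hypothetical, a plain Map replaces the job configuration, and the job ID in main is made up. Only the property name mapred.job.restart.recover and the reworded message come from the discussion.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a Map stands in for the job configuration.
class RecoveryCheckSketch {
    static final String RECOVER_KEY = "mapred.job.restart.recover";

    // Reads a boolean property, falling back to the default when it is unset.
    static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        String raw = conf.get(key);
        return (raw == null) ? defaultValue : Boolean.parseBoolean(raw);
    }

    // Throws to halt loading of a job that opted out of recovery,
    // using the wording suggested in the review comment.
    static void checkRecoverable(String jobId, boolean recovered, Map<String, String> conf)
            throws IOException {
        if (recovered && !getBoolean(conf, RECOVER_KEY, true)) {
            throw new IOException("Job " + jobId + " was not recovered since it disabled "
                + "recovery-upon-restart (" + RECOVER_KEY + " set to false).");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(RECOVER_KEY, "false");
        try {
            checkRecoverable("job_example_0001", true, conf); // made-up job ID
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the default is true, jobs that never set the property keep today's behavior; only jobs that explicitly set it to false are skipped on restart.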
[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505012#comment-13505012 ]
Jason Lowe commented on MAPREDUCE-4815:
I think this will work well with a couple of caveats:
* Write permission on the parent directory of the output directory is a new implicit requirement over the original FileOutputFormat. I think in the vast majority of cases it won't be a problem, but it is a potential backwards-compatibility issue.
* There are existing output formats that override checkOutputSpecs() and explicitly remove the verification step that outputDir doesn't exist (e.g. TeraOutputFormat). If we only support this new scheme, those output formats could fail to commit since the rename in commitJob() will fail for a non-empty destination directory.
I think we should add this as an optimized path to FileOutputFormat, but keep the original, iterative rename scheme if the output directory isn't empty for backwards compatibility.
FileOutputCommitter.commitJob can be very slow for jobs with many output files
Key: MAPREDUCE-4815
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 0.23.3, 2.0.1-alpha
Reporter: Jason Lowe
Assignee: Bikas Saha
If a job generates many files to commit, then the commitJob method call at the end of the job can take minutes. This is a performance regression from 1.x, as 1.x had the tasks commit directly to the final output directory as they were completing, and commitJob had very little to do. The commit work was processed in parallel and overlapped the processing of outstanding tasks. In 0.23/2.x, the commit is single-threaded and waits until all tasks have completed before commencing.
[jira] [Updated] (MAPREDUCE-4820) MRApps distributed-cache duplicate checks are incorrect
[ https://issues.apache.org/jira/browse/MAPREDUCE-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aaron T. Myers updated MAPREDUCE-4820:
Target Version/s: 2.0.3-alpha
Fix Version/s: (was: 2.0.3-alpha)
MRApps distributed-cache duplicate checks are incorrect
Key: MAPREDUCE-4820
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4820
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am
Affects Versions: 2.0.2-alpha
Reporter: Alejandro Abdelnur
Priority: Blocker
This seems to be a combination of issues that are being exposed in 2.0.2-alpha by MAPREDUCE-4549. MAPREDUCE-4549 introduces a check to ensure there are no duplicate JARs in the distributed-cache (using the JAR name as identity). In Hadoop 2 (unlike Hadoop 1), all JARs in the distributed-cache are symlinked to the current directory of the task. MRApps, when setting up the DistributedCache (MRApps#setupDistributedCache->parseDistributedCacheArtifacts), assumes that the local resources (this includes files in CURRENT_DIR/, CURRENT_DIR/classes/ and CURRENT_DIR/lib/) are part of the distributed-cache already. For systems like Oozie, which use a launcher job to submit the real job, this poses a problem because MRApps is run from the launcher job to submit the real job. The configuration of the real job has the correct distributed-cache entries (no duplicates), but because the current dir has the same files, the submission fails. It seems that MRApps should not be checking dups in the distributed-cache against JARs in CURRENT_DIR/ or CURRENT_DIR/lib/. The dup check should be done among distributed-cache entries only. It seems YARNRunner is symlinking all files in the distributed cache into the current directory. In Hadoop 1 this was done only for files added to the distributed-cache using a fragment (i.e. #FOO) to trigger a symlink creation.
Marking as a blocker because without a fix for this, Oozie cannot submit jobs to Hadoop 2. (I've debugged Oozie in a live cluster being used by BigTop, thanks Roman, to test their release work, and I've verified that Oozie 3.3 does not create duplicated entries in the distributed-cache.)
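A minimal sketch of the behavior proposed above, a duplicate check that looks only at the distributed-cache entries themselves and never at files already present in the task's current directory, could look like the following. It is not the MRApps code: the class CacheDupCheckSketch, its method, and the example URIs are hypothetical, and comparing only the last path segment (ignoring any #fragment) is an assumption about what "JAR name as identity" means.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the actual MRApps duplicate check.
class CacheDupCheckSketch {
    // Returns file names that occur more than once among the given cache entries.
    // Only the entries passed in are compared; local directories are never consulted.
    static List<String> duplicateNames(List<URI> cacheEntries) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (URI entry : cacheEntries) {
            String path = entry.getPath(); // excludes any #fragment
            String name = path.substring(path.lastIndexOf('/') + 1);
            if (!seen.add(name) && !dups.contains(name)) {
                dups.add(name);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        List<URI> entries = Arrays.asList(
            URI.create("hdfs://nn/apps/lib/a.jar"),
            URI.create("hdfs://nn/user/lib/a.jar#alias"),
            URI.create("hdfs://nn/apps/lib/b.jar"));
        System.out.println(duplicateNames(entries));
    }
}
```

Under this scheme a launcher job whose working directory already contains the same JARs would not trip the check, since only the submitted job's cache entries are compared with each other.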
[jira] [Updated] (MAPREDUCE-4822) Unnecessary conversions in History Events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aaron T. Myers updated MAPREDUCE-4822:
Summary: Unnecessary conversions in History Events (was: Unnessisary conversions in History Events)
Unnecessary conversions in History Events
Key: MAPREDUCE-4822
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4822
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: jobhistoryserver
Affects Versions: 0.23.4
Reporter: Robert Joseph Evans
Priority: Trivial
There are a number of conversions in the Job History Event classes that are totally unnecessary. It appears that they were originally used to convert from the internal Avro format, but many of them no longer pull the values from the Avro record; they store them internally. For example:
{code:title=TaskAttemptFinishedEvent.java}
/** Get the task type */
public TaskType getTaskType() {
  return TaskType.valueOf(taskType.toString());
}
{code}
The code currently takes an enum, converts it to a string, and then asks the same enum to convert it back to an enum. If Java works properly this should be a no-op, and a reference to the original taskType should be returned. There are also several places where toString is called on a string; since strings are immutable, it returns a reference to itself. The various ids are not immutable and probably should not be changed at this point.
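The no-op claim is easy to check with a stand-alone example. The sketch below uses a hypothetical two-value TaskType enum rather than Hadoop's own, and assumes the stored field really is of the enum type, as the description says.

```java
// Hypothetical stand-in for the event class; TaskType here is not Hadoop's enum.
class EnumRoundTripSketch {
    enum TaskType { MAP, REDUCE }

    // The conversion pattern described in the issue.
    static TaskType viaString(TaskType taskType) {
        return TaskType.valueOf(taskType.toString());
    }

    // The simplified getter: just hand back the stored reference.
    static TaskType direct(TaskType taskType) {
        return taskType;
    }

    public static void main(String[] args) {
        // Enum constants are singletons, so both paths yield the same object.
        System.out.println(viaString(TaskType.MAP) == direct(TaskType.MAP)); // prints true
    }
}
```

The round trip only stays a no-op while toString() is not overridden on the enum, which is one more reason to return the field directly.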
[jira] [Commented] (MAPREDUCE-4824) Provide a mechanism for jobs to indicate they should not be recovered on restart
[ https://issues.apache.org/jira/browse/MAPREDUCE-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505234#comment-13505234 ]
Bikas Saha commented on MAPREDUCE-4824:
Agree with Harsh. I assume this config is job specific and cannot be inadvertently set to disable recovery of all jobs?
Provide a mechanism for jobs to indicate they should not be recovered on restart
Key: MAPREDUCE-4824
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4824
Project: Hadoop Map/Reduce
Issue Type: New Feature
Components: mrv1
Affects Versions: 1.1.0
Reporter: Tom White
Assignee: Tom White
Attachments: MAPREDUCE-4824.patch
[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505244#comment-13505244 ]
Bikas Saha commented on MAPREDUCE-4815:
Does this code use FileSystem or specifically DistributedFileSystem (HDFS)? If the former, then how does this relate to the comment [~eric14] made earlier about cloud stores?
FileOutputCommitter.commitJob can be very slow for jobs with many output files
Key: MAPREDUCE-4815
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 0.23.3, 2.0.1-alpha
Reporter: Jason Lowe
Assignee: Bikas Saha
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505252#comment-13505252 ]
Arun C Murthy commented on MAPREDUCE-4049:
Sorry, just caught up on this since I'm dealing with some health issues at home.
Frankly, worrying about whose work is a subset of whose is a pointless exercise. Having said that, making related tasks sub-tasks makes sense as long as there is a coherent community (one or more developers) working together; I don't see that for MAPREDUCE-4049 vis-a-vis MAPREDUCE-2454. IAC, there is no need to debate this further - it's just a time sink.
Finally, MAPREDUCE-2454 is a bunch of large-scale changes. I'm happy to commit this as soon as it's ready, without tying the two together.
Overall, I really don't like to see us egregiously rename core MR classes - at best it's pointless for private APIs, and at worst it hammers svn log. So, please do not change the existing Shuffle etc.
Avner, please upload a patch with the other changes:
# Use @LimitedPrivate; that way it is clear that this is for implementers and not end-users.
# I'm OK with the suggested config names (again, I'm not religious about naming).
With that it's good to go.
plugin for generic shuffle service
Key: MAPREDUCE-4049
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: performance, task, tasktracker
Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0
Reporter: Avner BenHanoch
Labels: merge, plugin, rdma, shuffle
Fix For: trunk
Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, mapreduce-4049.patch, mapreduce-4049.patch, mapreduce-4049.patch, mapreduce-4049.patch
Support a generic shuffle service as a set of two plugins: ShuffleProvider and ShuffleConsumer. This will satisfy the following needs:
# Better shuffle and merge performance. For example: we are working on a shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, or Infiniband) instead of using the current HTTP shuffle. Based on the fast RDMA shuffle, the plugin can also use a suitable merge approach during the intermediate merges, and hence get much better performance.
# Satisfy MAPREDUCE-3060 - a generic shuffle service for avoiding a hidden dependency of NodeManager on a specific version of mapreduce shuffle (currently targeted to 0.24.0).
References:
# Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu from Auburn University with others, [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
# I am attaching 2 documents with a suggested Top Level Design for both plugins (currently based on the 1.0 branch)
# I am providing a link for downloading UDA - Mellanox's open source plugin that implements a generic shuffle service using RDMA and levitated merge. Note: At this phase, the code is in C++ through JNI and you should consider it beta only. Still, it can serve anyone who wants to implement or contribute to levitated merge. (Please be advised that levitated merge is mostly suited to very fast networks) - [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69]
[jira] [Assigned] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy reassigned MAPREDUCE-4815:
Assignee: Arun C Murthy (was: Bikas Saha)
FileOutputCommitter.commitJob can be very slow for jobs with many output files
Key: MAPREDUCE-4815
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 0.23.3, 2.0.1-alpha
Reporter: Jason Lowe
Assignee: Arun C Murthy
[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
[ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505260#comment-13505260 ]
Arun C Murthy commented on MAPREDUCE-4815:
bq. Write permissions to the parent directory of the output directory is a new implicit requirement over the original FileOutputFormat. I think in the vast majority of cases it won't be a problem, but it is a potential backwards-compatibility issue.
Currently that is already required since FileOutputFormat creates the output dir in the parent dir itself, so that isn't a new requirement.
bq. I think we should add this as an optimized path to FileOutputFormat, but keep the original, iterative rename scheme if the output directory isn't empty for backwards compatibility.
Makes sense. It's unfortunately much more code to maintain, and I'm not sure it's worth it, but a good idea nevertheless.
I have a preliminary patch which I'm testing, I'll upload it asap.
FileOutputCommitter.commitJob can be very slow for jobs with many output files
Key: MAPREDUCE-4815
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 0.23.3, 2.0.1-alpha
Reporter: Jason Lowe
Assignee: Arun C Murthy
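The two commit paths under discussion, one directory rename when possible and the per-file rename otherwise, can be sketched against the local filesystem. This is only an illustration using java.nio.file rather than Hadoop's FileSystem API, and it is not the preliminary patch: CommitSketch and its return labels are hypothetical, the fast path is taken only when the output directory does not exist yet (a simplification of "isn't empty"), and the pending directory is assumed to be a flat directory of files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch on the local filesystem; not FileOutputCommitter.
class CommitSketch {
    // Returns "rename" when the single-rename fast path was used, "iterate" otherwise.
    static String commit(Path pending, Path output) throws IOException {
        if (!Files.exists(output)) {
            // Fast path: one rename moves every output file at once.
            Files.move(pending, output);
            return "rename";
        }
        // Fallback: the destination already exists, so move entries one by one.
        List<Path> entries = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(pending)) {
            for (Path entry : stream) {
                entries.add(entry);
            }
        }
        for (Path entry : entries) {
            Files.move(entry, output.resolve(entry.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
        }
        Files.delete(pending); // pending is empty after the moves
        return "iterate";
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("commit-sketch");
        Path pending = Files.createDirectory(root.resolve("pending"));
        Files.write(pending.resolve("part-00000"), new byte[] {1});
        System.out.println(commit(pending, root.resolve("out")));
    }
}
```

The fallback costs one rename per output file, which is the slow behavior the issue describes, so the fast path is where the time savings would come from.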
[jira] [Updated] (MAPREDUCE-4661) Add HTTPS for WebUIs on Branch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Plamen Jeliazkov updated MAPREDUCE-4661:
Attachment: (was: https.patch)
Add HTTPS for WebUIs on Branch-1
Key: MAPREDUCE-4661
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4661
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: security, webapps
Affects Versions: 1.0.3
Reporter: Plamen Jeliazkov
Assignee: Plamen Jeliazkov
Attachments: MAPREDUCE-4461.patch, MAPREDUCE-4661.patch, MAPREDUCE-4661.patch, MAPREDUCE-4661.patch
After investigating the methodology used to add HTTPS support in branch-2, I feel that this same approach should be back-ported to branch-1. I have taken many of the patches used for branch-2 and merged them in. I was working on top of HDP 1 at the time - I will provide a patch for trunk soon, once I can confirm I am adding only the necessities for supporting HTTPS on the webUIs.
As an added benefit, this patch also provides an HTTPS webUI for HBase by extension: take a hadoop-core jar compiled with this patch, put it into the hbase/lib directory, and apply the necessary configs to hbase/conf.
= OLD IDEA(s) BEHIND ADDING HTTPS (look @ Sept 17th patch) ==
In order to provide full security around the cluster, the webUI should also be secure if desired, to prevent cookie theft and user masquerading. Here is my proposed work. Currently I can only add HTTPS support. I do not know how to switch reliance of the HttpServer from HTTP to HTTPS fully.
In order to facilitate this change I propose the following configuration additions:
CONFIG PROPERTY - DEFAULT VALUE
mapred.https.enable - false
mapred.https.need.client.auth - false
mapred.https.server.keystore.resource - ssl-server.xml
mapred.job.tracker.https.port - 50035
mapred.job.tracker.https.address - IP_ADDR:50035
mapred.task.tracker.https.port - 50065
mapred.task.tracker.https.address - IP_ADDR:50065
I tested this on my local box after using keytool to generate an SSL certificate. You will need to change ssl-server.xml to point to the .keystore file afterwards. A truststore may not be necessary; you can just point it to the keystore.