[jira] [Commented] (MAPREDUCE-5746) Job diagnostics can implicate wrong task for a failed job
[ https://issues.apache.org/jira/browse/MAPREDUCE-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900223#comment-13900223 ] Hudson commented on MAPREDUCE-5746: --- SUCCESS: Integrated in Hadoop-Yarn-trunk #480 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/480/]) MAPREDUCE-5746. Job diagnostics can implicate wrong task for a failed job. (Jason Lowe via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1567666) * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/test/java/org/apache/hadoop/mapreduce/v2/hs/TestJobHistoryParsing.java Job diagnostics can implicate wrong task for a failed job - Key: MAPREDUCE-5746 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5746 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobhistoryserver Affects Versions: 0.23.10, 2.1.1-beta Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 0.23.11, 2.4.0 Attachments: MAPREDUCE-5746-v2.branch-0.23.patch, MAPREDUCE-5746-v2.patch, MAPREDUCE-5746.patch We've seen a number of cases where the history server is showing the wrong task as the reason a job failed. For example, Task task_1383802699973_515536_m_027135 failed 1 times when some other task had failed 4 times and was the real reason the job failed. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAPREDUCE-5746) Job diagnostics can implicate wrong task for a failed job
[ https://issues.apache.org/jira/browse/MAPREDUCE-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900317#comment-13900317 ] Hudson commented on MAPREDUCE-5746: --- FAILURE: Integrated in Hadoop-Hdfs-trunk #1672 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1672/]) MAPREDUCE-5746. Job diagnostics can implicate wrong task for a failed job. (Jason Lowe via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1567666) * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/test/java/org/apache/hadoop/mapreduce/v2/hs/TestJobHistoryParsing.java Job diagnostics can implicate wrong task for a failed job - Key: MAPREDUCE-5746 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5746 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobhistoryserver Affects Versions: 0.23.10, 2.1.1-beta Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 0.23.11, 2.4.0 Attachments: MAPREDUCE-5746-v2.branch-0.23.patch, MAPREDUCE-5746-v2.patch, MAPREDUCE-5746.patch We've seen a number of cases where the history server is showing the wrong task as the reason a job failed. For example, Task task_1383802699973_515536_m_027135 failed 1 times when some other task had failed 4 times and was the real reason the job failed. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAPREDUCE-5746) Job diagnostics can implicate wrong task for a failed job
[ https://issues.apache.org/jira/browse/MAPREDUCE-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900388#comment-13900388 ] Hudson commented on MAPREDUCE-5746: --- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1697 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1697/]) MAPREDUCE-5746. Job diagnostics can implicate wrong task for a failed job. (Jason Lowe via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1567666) * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/test/java/org/apache/hadoop/mapreduce/v2/hs/TestJobHistoryParsing.java Job diagnostics can implicate wrong task for a failed job - Key: MAPREDUCE-5746 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5746 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobhistoryserver Affects Versions: 0.23.10, 2.1.1-beta Reporter: Jason Lowe Assignee: Jason Lowe Fix For: 0.23.11, 2.4.0 Attachments: MAPREDUCE-5746-v2.branch-0.23.patch, MAPREDUCE-5746-v2.patch, MAPREDUCE-5746.patch We've seen a number of cases where the history server is showing the wrong task as the reason a job failed. For example, Task task_1383802699973_515536_m_027135 failed 1 times when some other task had failed 4 times and was the real reason the job failed. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAPREDUCE-5757) ConcurrentModificationException in JobControl.toList
[ https://issues.apache.org/jira/browse/MAPREDUCE-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900704#comment-13900704 ] Jason Lowe commented on MAPREDUCE-5757: --- Stacktrace: {noformat} Caused by: java.util.ConcurrentModificationException at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:953) at java.util.LinkedList$ListItr.next(LinkedList.java:886) at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.toList(JobControl.java:82) at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.getSuccessfulJobList(JobControl.java:123) at org.apache.hadoop.mapred.jobcontrol.JobControl.getSuccessfulJobs(JobControl.java:75) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.calculateProgress(Launcher.java:252) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:319) at org.apache.pig.PigServer.launchPlan(PigServer.java:1283) ... 26 more {noformat} ConcurrentModificationException in JobControl.toList Key: MAPREDUCE-5757 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5757 Project: Hadoop Map/Reduce Issue Type: Bug Components: client Affects Versions: 0.23.10, 2.2.0 Reporter: Jason Lowe Despite having the fix for MAPREDUCE-5513 we saw another ConcurrentModificationException in JobControl, so something there still isn't fixed. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
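The fail-fast behavior in the stacktrace above is easy to reproduce outside Hadoop. A minimal sketch (class name `CmeDemo` is hypothetical, not Hadoop code): a `LinkedList` iterator throws `ConcurrentModificationException` as soon as the list is structurally modified behind its back, which is exactly what happens when `toList` iterates a list another thread is mutating.

```java
import java.util.ConcurrentModificationException;
import java.util.LinkedList;
import java.util.List;

public class CmeDemo {
    public static void main(String[] args) {
        List<Integer> jobs = new LinkedList<>();
        jobs.add(1);
        jobs.add(2);
        boolean caught = false;
        try {
            for (Integer j : jobs) {   // iterator checks modCount on each next()
                jobs.add(3);           // structural modification outside the iterator
            }
        } catch (ConcurrentModificationException e) {
            caught = true;             // same exception as in the stacktrace above
        }
        System.out.println(caught);
    }
}
```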
[jira] [Created] (MAPREDUCE-5757) ConcurrentModificationException in JobControl.toList
Jason Lowe created MAPREDUCE-5757: - Summary: ConcurrentModificationException in JobControl.toList Key: MAPREDUCE-5757 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5757 Project: Hadoop Map/Reduce Issue Type: Bug Components: client Affects Versions: 2.2.0, 0.23.10 Reporter: Jason Lowe Despite having the fix for MAPREDUCE-5513 we saw another ConcurrentModificationException in JobControl, so something there still isn't fixed. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (MAPREDUCE-5757) ConcurrentModificationException in JobControl.toList
[ https://issues.apache.org/jira/browse/MAPREDUCE-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe reassigned MAPREDUCE-5757: - Assignee: Jason Lowe The locking in the fix for MAPREDUCE-5513 is mismatched. The toList method is static and therefore locking the class, while the other methods are locking the object. ConcurrentModificationException in JobControl.toList Key: MAPREDUCE-5757 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5757 Project: Hadoop Map/Reduce Issue Type: Bug Components: client Affects Versions: 0.23.10, 2.2.0 Reporter: Jason Lowe Assignee: Jason Lowe Despite having the fix for MAPREDUCE-5513 we saw another ConcurrentModificationException in JobControl, so something there still isn't fixed. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
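The monitor mismatch described above can be shown in isolation. A minimal sketch with hypothetical names (not JobControl's actual code): a `static synchronized` method locks the Class object, while an instance `synchronized` method locks `this`, so the two provide no mutual exclusion against each other.

```java
public class LockMismatchDemo {
    // Locks LockMismatchDemo.class, i.e. the Class object.
    static synchronized void staticLocked() {
        System.out.println(Thread.holdsLock(LockMismatchDemo.class));
    }

    // Locks 'this', a completely different monitor; the Class monitor is NOT held.
    synchronized void instanceLocked() {
        System.out.println(Thread.holdsLock(this) + " "
                + Thread.holdsLock(LockMismatchDemo.class));
    }

    public static void main(String[] args) {
        staticLocked();                          // holds the class monitor
        new LockMismatchDemo().instanceLocked(); // holds the instance monitor only
    }
}
```

The fix direction, then, is to make every path synchronize on the same monitor, e.g. always on the JobControl instance.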
[jira] [Commented] (MAPREDUCE-5513) ConcurrentModificationException in JobControl
[ https://issues.apache.org/jira/browse/MAPREDUCE-5513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900721#comment-13900721 ] Jason Lowe commented on MAPREDUCE-5513: --- Note that this is still occurring, see MAPREDUCE-5757. ConcurrentModificationException in JobControl - Key: MAPREDUCE-5513 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5513 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.1.0-beta, 0.23.9 Reporter: Jason Lowe Assignee: Robert Parker Fix For: 3.0.0, 0.23.10, 2.2.0 Attachments: MAPREDUCE-5513-1.patch JobControl.toList is locking individual lists to iterate them, but those lists can be modified elsewhere without holding the list lock. The locking approaches are mismatched, with toList holding the lock on the actual list object while other methods hold the JobControl lock when modifying the lists. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-5757) ConcurrentModificationException in JobControl.toList
[ https://issues.apache.org/jira/browse/MAPREDUCE-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated MAPREDUCE-5757: -- Attachment: MAPREDUCE-5757.patch Patch to always lock the object rather than the class. Don't know of an easy way to unit test this. ConcurrentModificationException in JobControl.toList Key: MAPREDUCE-5757 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5757 Project: Hadoop Map/Reduce Issue Type: Bug Components: client Affects Versions: 0.23.10, 2.2.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: MAPREDUCE-5757.patch Despite having the fix for MAPREDUCE-5513 we saw another ConcurrentModificationException in JobControl, so something there still isn't fixed. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-5757) ConcurrentModificationException in JobControl.toList
[ https://issues.apache.org/jira/browse/MAPREDUCE-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated MAPREDUCE-5757: -- Target Version/s: 0.23.11, 2.4.0 Status: Patch Available (was: Open) ConcurrentModificationException in JobControl.toList Key: MAPREDUCE-5757 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5757 Project: Hadoop Map/Reduce Issue Type: Bug Components: client Affects Versions: 2.2.0, 0.23.10 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: MAPREDUCE-5757.patch Despite having the fix for MAPREDUCE-5513 we saw another ConcurrentModificationException in JobControl, so something there still isn't fixed. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAPREDUCE-5751) MR app master fails to start in some cases if mapreduce.job.classloader is true
[ https://issues.apache.org/jira/browse/MAPREDUCE-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900752#comment-13900752 ] Gera Shegalov commented on MAPREDUCE-5751: -- I think you can easily add a test case to TestMRAppMaster MR app master fails to start in some cases if mapreduce.job.classloader is true --- Key: MAPREDUCE-5751 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5751 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 2.2.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: mapreduce-5751.patch If mapreduce.job.classloader is set to true, and the MR client includes a jetty jar in its libjars or job jar, the MR app master fails to start. A typical stack trace we get is as follows: {noformat} java.lang.ClassCastException: org.mortbay.jetty.webapp.WebInfConfiguration cannot be cast to org.mortbay.jetty.webapp.Configuration at org.mortbay.jetty.webapp.WebAppContext.loadConfigurations(WebAppContext.java:890) at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:462) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152) at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130) at org.mortbay.jetty.Server.doStart(Server.java:224) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.apache.hadoop.http.HttpServer.start(HttpServer.java:676) at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:208) at org.apache.hadoop.mapreduce.v2.app.client.MRClientService.start(MRClientService.java:151) at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68) at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.start(MRAppMaster.java:1040) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1307) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1303) at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1259) {noformat} This happens because as part of the MR app master start the jetty classes are loaded normally through the app classloader, but WebAppContext tries to load the specific Configuration class via the thread context classloader (which had been set to the user job classloader). -- This message was sent by Atlassian JIRA (v6.1.5#6160)
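A common workaround pattern for this class of failure, sketched below under the assumption that restoring the system classloader around framework startup code is acceptable (this is an illustrative pattern, not necessarily the actual MAPREDUCE-5751 patch), is to swap the thread context classloader temporarily so that TCCL-based lookups like the one in WebAppContext resolve against the same loader that loaded the framework's own jetty classes:

```java
public class ContextClassLoaderDemo {
    public static void main(String[] args) {
        // Save the current TCCL (in the AM this would be the job classloader
        // configured by mapreduce.job.classloader).
        ClassLoader saved = Thread.currentThread().getContextClassLoader();
        try {
            // Temporarily use the system classloader so framework code that
            // loads classes via the TCCL sees the framework's own copies.
            Thread.currentThread()
                  .setContextClassLoader(ClassLoader.getSystemClassLoader());
            System.out.println(Thread.currentThread().getContextClassLoader()
                    == ClassLoader.getSystemClassLoader());
            // ... start the web app / jetty server here ...
        } finally {
            // Always restore the job classloader for user code.
            Thread.currentThread().setContextClassLoader(saved);
        }
        System.out.println(Thread.currentThread().getContextClassLoader() == saved);
    }
}
```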
[jira] [Commented] (MAPREDUCE-5756) FileInputFormat.listStatus() including directories in its results
[ https://issues.apache.org/jira/browse/MAPREDUCE-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900772#comment-13900772 ] Jason Dere commented on MAPREDUCE-5756: --- Ok, looking a little more at this .. so FileInputFormat.listStatus() is returning the same results on hadoop-1 and hadoop-2, and it includes the directories, so I guess listStatus() is not the issue. It looks like what CombineFileInputFormat.getSplits() does with the file list after getting it is different between hadoop-1 and hadoop-2, where hadoop-2 includes those directories in the list of InputSplits: (Hadoop 20S means hadoop 1.x) {noformat} 2014-02-13 13:35:32,492 ERROR shims.HadoopShimsSecure (HadoopShimsSecure.java:getSplits(345)) - ** Hadoop version: 0.20S 2014-02-13 13:35:32,492 ERROR shims.HadoopShimsSecure (HadoopShimsSecure.java:getSplits(349)) - ** called super.getSplits(): [Paths:/00_0:0+50 Locations:127.0.0.1:; ] {noformat} (Hadoop 23 means hadoop 2.x) {noformat} 2014-02-13 13:38:12,425 ERROR shims.HadoopShimsSecure (HadoopShimsSecure.java:getSplits(345)) - ** Hadoop version: 0.23 2014-02-13 13:38:12,425 ERROR shims.HadoopShimsSecure (HadoopShimsSecure.java:getSplits(349)) - ** called super.getSplits(): [Paths:/00_0:0+50 Locations:127.0.0.1:; , Paths:/Users:0+0,/build:0+0,/tmp:0+0,/user:0+0 Locations:; ] {noformat} FileInputFormat.listStatus() including directories in its results - Key: MAPREDUCE-5756 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5756 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Jason Dere Trying to track down HIVE-6401, where we see some "is not a file" errors because getSplits() is giving us directories. I believe the culprit is FileInputFormat.listStatus(): {code} if (recursive && stat.isDirectory()) { addInputPathRecursively(result, fs, stat.getPath(), inputFilter); } else { result.add(stat); } {code} Which seems to be allowing directories to be added to the results if recursive is false. 
Is this meant to return directories? If not, I think it should look like this: {code} if (stat.isDirectory()) { if (recursive) { addInputPathRecursively(result, fs, stat.getPath(), inputFilter); } } else { result.add(stat); } {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAPREDUCE-5756) FileInputFormat.listStatus() including directories in its results
[ https://issues.apache.org/jira/browse/MAPREDUCE-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900776#comment-13900776 ] Jason Dere commented on MAPREDUCE-5756: --- Looks like the changes in MAPREDUCE-4470 may be causing the difference in the 1.x vs 2.x behavior. Should CombineFileInputFormat be filtering out any locations which turn out to be directories here? FileInputFormat.listStatus() including directories in its results - Key: MAPREDUCE-5756 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5756 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Jason Dere Trying to track down HIVE-6401, where we see some "is not a file" errors because getSplits() is giving us directories. I believe the culprit is FileInputFormat.listStatus(): {code} if (recursive && stat.isDirectory()) { addInputPathRecursively(result, fs, stat.getPath(), inputFilter); } else { result.add(stat); } {code} Which seems to be allowing directories to be added to the results if recursive is false. Is this meant to return directories? If not, I think it should look like this: {code} if (stat.isDirectory()) { if (recursive) { addInputPathRecursively(result, fs, stat.getPath(), inputFilter); } } else { result.add(stat); } {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-5756) CombineFileInputFormat.getSplits() including directories in its results
[ https://issues.apache.org/jira/browse/MAPREDUCE-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Dere updated MAPREDUCE-5756: -- Summary: CombineFileInputFormat.getSplits() including directories in its results (was: FileInputFormat.listStatus() including directories in its results) CombineFileInputFormat.getSplits() including directories in its results --- Key: MAPREDUCE-5756 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5756 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Jason Dere Trying to track down HIVE-6401, where we see some "is not a file" errors because getSplits() is giving us directories. I believe the culprit is FileInputFormat.listStatus(): {code} if (recursive && stat.isDirectory()) { addInputPathRecursively(result, fs, stat.getPath(), inputFilter); } else { result.add(stat); } {code} Which seems to be allowing directories to be added to the results if recursive is false. Is this meant to return directories? If not, I think it should look like this: {code} if (stat.isDirectory()) { if (recursive) { addInputPathRecursively(result, fs, stat.getPath(), inputFilter); } } else { result.add(stat); } {code} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
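The branch logic under discussion can be checked with a toy stand-in (the `Stat` record and both methods below are illustrative only, not Hadoop code). With `recursive == false`, the original condition lets directories fall through to the `else` branch and into the result, while the rewrite suggested in the issue drops them:

```java
import java.util.ArrayList;
import java.util.List;

public class ListStatusLogicDemo {
    // Minimal stand-in for FileStatus, just for illustration.
    record Stat(String path, boolean isDirectory) {}

    // The reported behavior: when recursive is false, a directory fails the
    // combined condition and is added to the result like a regular file.
    static List<Stat> original(List<Stat> stats, boolean recursive) {
        List<Stat> result = new ArrayList<>();
        for (Stat stat : stats) {
            if (recursive && stat.isDirectory()) {
                // addInputPathRecursively(...) would go here
            } else {
                result.add(stat); // directories land here when recursive == false
            }
        }
        return result;
    }

    // The suggested fix: a directory itself is never added to the result.
    static List<Stat> suggested(List<Stat> stats, boolean recursive) {
        List<Stat> result = new ArrayList<>();
        for (Stat stat : stats) {
            if (stat.isDirectory()) {
                if (recursive) {
                    // addInputPathRecursively(...) would go here
                }
            } else {
                result.add(stat);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Stat> in = List.of(new Stat("/00_0", false), new Stat("/user", true));
        System.out.println(original(in, false).size());  // directory leaks through
        System.out.println(suggested(in, false).size()); // files only
    }
}
```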
[jira] [Commented] (MAPREDUCE-5757) ConcurrentModificationException in JobControl.toList
[ https://issues.apache.org/jira/browse/MAPREDUCE-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900788#comment-13900788 ] Hadoop QA commented on MAPREDUCE-5757: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12628849/MAPREDUCE-5757.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4356//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4356//console This message is automatically generated. 
ConcurrentModificationException in JobControl.toList Key: MAPREDUCE-5757 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5757 Project: Hadoop Map/Reduce Issue Type: Bug Components: client Affects Versions: 0.23.10, 2.2.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: MAPREDUCE-5757.patch Despite having the fix for MAPREDUCE-5513 we saw another ConcurrentModificationException in JobControl, so something there still isn't fixed. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (MAPREDUCE-5758) Reducer local data is not deleted until job completes
Jason Lowe created MAPREDUCE-5758: - Summary: Reducer local data is not deleted until job completes Key: MAPREDUCE-5758 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5758 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 2.2.0, 0.23.10 Reporter: Jason Lowe Ran into an instance where a reducer shuffled a large amount of data and subsequently failed, but the local data is not purged when the task fails but only after the entire job completes. This wastes disk space unnecessarily since the data is no longer relevant after the task-attempt exits. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAPREDUCE-5758) Reducer local data is not deleted until job completes
[ https://issues.apache.org/jira/browse/MAPREDUCE-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900827#comment-13900827 ] Jason Lowe commented on MAPREDUCE-5758: --- When tasks run under YARN they are handed app-level local directories to write into and those are in turn passed down to the tasks to write local data. Since the local output locations are not under the container directory, YARN does not clean them up when the container exits. They are only reaped when the app-level directory is deleted, which occurs after the application completes. Tasks should use the container-specific local directory for temporary local outputs rather than the app-specific directory, so if they crash YARN can automatically clean them up promptly. Note that map outputs would have to be committed to the same app-level local location they are today in order to survive the container exiting and for the ShuffleHandler to find them later. However they could be accumulated before commit in a container-specific directory so if the map attempt fails the data is reaped promptly rather than only when the job completes. This would also help minimize chances of inter-task file collisions such as occurred in MAPREDUCE-5211. Reducer local data is not deleted until job completes - Key: MAPREDUCE-5758 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5758 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.10, 2.2.0 Reporter: Jason Lowe Ran into an instance where a reducer shuffled a large amount of data and subsequently failed, but the local data is not purged when the task fails but only after the entire job completes. This wastes disk space unnecessarily since the data is no longer relevant after the task-attempt exits. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAPREDUCE-5757) ConcurrentModificationException in JobControl.toList
[ https://issues.apache.org/jira/browse/MAPREDUCE-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900842#comment-13900842 ] Kihwal Lee commented on MAPREDUCE-5757: --- +1 lgtm ConcurrentModificationException in JobControl.toList Key: MAPREDUCE-5757 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5757 Project: Hadoop Map/Reduce Issue Type: Bug Components: client Affects Versions: 0.23.10, 2.2.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: MAPREDUCE-5757.patch Despite having the fix for MAPREDUCE-5513 we saw another ConcurrentModificationException in JobControl, so something there still isn't fixed. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAPREDUCE-5670) CombineFileRecordReader should report progress when moving to the next file
[ https://issues.apache.org/jira/browse/MAPREDUCE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900863#comment-13900863 ] Jason Lowe commented on MAPREDUCE-5670: --- +1 lgtm. Committing this. CombineFileRecordReader should report progress when moving to the next file --- Key: MAPREDUCE-5670 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5670 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.9 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Attachments: MR-5670v3.patch If a combine split consists of many empty files (i.e.: no record found by the underlying record reader) then theoretically a task can timeout due to lack of reported progress. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-5670) CombineFileRecordReader should report progress when moving to the next file
[ https://issues.apache.org/jira/browse/MAPREDUCE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated MAPREDUCE-5670: -- Resolution: Fixed Fix Version/s: 2.4.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Thanks, Chen! I committed this to trunk and branch-2. CombineFileRecordReader should report progress when moving to the next file --- Key: MAPREDUCE-5670 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5670 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.9 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Fix For: 2.4.0 Attachments: MR-5670v3.patch If a combine split consists of many empty files (i.e.: no record found by the underlying record reader) then theoretically a task can timeout due to lack of reported progress. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAPREDUCE-5670) CombineFileRecordReader should report progress when moving to the next file
[ https://issues.apache.org/jira/browse/MAPREDUCE-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900903#comment-13900903 ] Hudson commented on MAPREDUCE-5670: --- SUCCESS: Integrated in Hadoop-trunk-Commit #5166 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5166/]) MAPREDUCE-5670. CombineFileRecordReader should report progress when moving to the next file. Contributed by Chen He (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1568118) * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/lib/CombineFileRecordReader.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CombineFileRecordReader.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/lib/TestCombineFileRecordReader.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestCombineFileRecordReader.java CombineFileRecordReader should report progress when moving to the next file --- Key: MAPREDUCE-5670 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5670 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 0.23.9 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Fix For: 2.4.0 Attachments: MR-5670v3.patch If a combine split consists of many empty files (i.e.: no record found by the underlying record reader) then theoretically a task can timeout due to lack of reported progress. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-5641) History for failed Application Masters should be made available to the Job History Server
[ https://issues.apache.org/jira/browse/MAPREDUCE-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated MAPREDUCE-5641: - Attachment: MAPREDUCE-5641.patch I’ve attached a preliminary version of the patch. Once we all agree on the specifics of the design, I can add unit tests. The patch follows the design I outlined before, where the RM will write a file when it sees an AM die, and the JHS sees that and copies the jhist and similar files to the done_intermediate dir. I have tested this by running jobs and killing the AM. This results in incomplete information, as expected; however, in some cases some of the information won’t make 100% sense or is missing (e.g. no Finish Time if the AM didn’t actually finish). I’ve put in some code to take care of these situations. I’ve also attached a preliminary YARN patch to YARN-1731. {quote} How will the JHS copy the file to the intermediate directory? It likely won't have access to the staging directory containing the jhist file. {quote} I modified the permissions from 0700 to 0701. History for failed Application Masters should be made available to the Job History Server - Key: MAPREDUCE-5641 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5641 Project: Hadoop Map/Reduce Issue Type: Improvement Components: applicationmaster, jobhistoryserver Affects Versions: 2.2.0 Reporter: Robert Kanter Assignee: Robert Kanter Attachments: MAPREDUCE-5641.patch Currently, the JHS has no information about jobs whose AMs have failed. This is because the History is written by the AM to the intermediate folder just before finishing, so when it fails for any reason, this information isn't copied there. However, it is not lost as it's in the AM's staging directory. To make the History available in the JHS, all we need to do is have another mechanism to move the History from the staging directory to the intermediate directory. 
The AM also writes a Summary file before exiting normally, which is also unavailable when the AM fails. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAPREDUCE-4490) JVM reuse is incompatible with LinuxTaskController (and therefore incompatible with Security)
[ https://issues.apache.org/jira/browse/MAPREDUCE-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam liu updated MAPREDUCE-4490: --- Attachment: MAPREDUCE-4490.patch New patch based on the latest branch origin/branch-1.2 JVM reuse is incompatible with LinuxTaskController (and therefore incompatible with Security) - Key: MAPREDUCE-4490 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4490 Project: Hadoop Map/Reduce Issue Type: Bug Components: task-controller, tasktracker Affects Versions: 0.20.205.0, 1.0.3, 1.2.1 Reporter: George Datskos Assignee: sam liu Priority: Critical Labels: patch Fix For: 1.2.1 Attachments: MAPREDUCE-4490.patch, MAPREDUCE-4490.patch, MAPREDUCE-4490.patch When using LinuxTaskController, JVM reuse (mapred.job.reuse.jvm.num.tasks > 1) with more map tasks in a job than there are map slots in the cluster will result in immediate task failures for the second task in each JVM (and then the JVM exits). We have investigated this bug and the root cause is as follows. When using LinuxTaskController, the userlog directory for a task attempt (../userlogs/job/task-attempt) is created only on the first invocation (when the JVM is launched) because userlogs directories are created by the task-controller binary which only runs *once* per JVM. Therefore, attempting to create log.index is guaranteed to fail with ENOENT leading to immediate task failure and child JVM exit. 
{quote}
2012-07-24 14:29:11,914 INFO org.apache.hadoop.mapred.TaskLog: Starting logging for a new task attempt_201207241401_0013_m_27_0 in the same JVM as that of the first task /var/log/hadoop/mapred/userlogs/job_201207241401_0013/attempt_201207241401_0013_m_06_0
2012-07-24 14:29:11,915 WARN org.apache.hadoop.mapred.Child: Error running child
ENOENT: No such file or directory
        at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
        at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161)
        at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:296)
        at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:369)
        at org.apache.hadoop.mapred.Child.main(Child.java:229)
{quote}

The above error occurs in a JVM that runs tasks 6 and 27. Task 6 goes smoothly. Then task 27 starts. The directory /var/log/hadoop/mapred/userlogs/job_201207241401_0013/attempt_201207241401_0013_m_027_0 is never created, so when mapred.Child tries to write the log.index file for task 27, it fails with ENOENT because the attempt_201207241401_0013_m_027_0 directory does not exist. Therefore, the second task in each JVM is guaranteed to fail (and the JVM then exits) every time when using LinuxTaskController. Note that this problem does not occur with the DefaultTaskController, because the userlog directories are created for each task (not just once per JVM, as with LinuxTaskController). For each task, the TaskRunner calls the TaskController's createLogDir method before attempting to write out an index file.

* DefaultTaskController#createLogDir: creates a log directory for each task
* LinuxTaskController#createLogDir: does nothing
** the task-controller binary creates the log directory [create_attempt_directories], but only for the first task

Possible solution: add a new command to task-controller, *initialize task*, to create attempt directories, and call that command via ShellCommandExecutor in the LinuxTaskController#createLogDir method.
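The failure mode above can be reproduced in miniature: if the per-attempt log directory is created only once per JVM, the second task's attempt to write its log.index fails with a no-such-file error, while creating the directory per task (the DefaultTaskController behavior, and the essence of the proposed fix) succeeds. This is a stand-alone illustration; the paths and helper names are assumptions, not the actual TaskController code.

```java
import java.io.IOException;
import java.nio.file.*;

// Miniature reproduction of the bug described above: writing log.index
// into a per-attempt directory that was never created fails, while
// creating the directory per task (the proposed fix) succeeds.
// Paths and method names here are illustrative, not Hadoop's own.
public class LogDirSketch {

    // Mimics writing the log.index file for one task attempt.
    static boolean writeLogIndex(Path attemptDir) {
        try {
            Files.writeString(attemptDir.resolve("log.index"), "offsets...");
            return true;
        } catch (IOException e) {   // NoSuchFileException (ENOENT) when the
            return false;           // attempt directory is missing
        }
    }

    public static void main(String[] args) throws IOException {
        Path userlogs = Files.createTempDirectory("userlogs");

        // LinuxTaskController behavior: only the FIRST attempt's directory
        // is created (by the task-controller binary, once per JVM).
        Path first = userlogs.resolve("attempt_m_000006_0");
        Files.createDirectories(first);
        Path second = userlogs.resolve("attempt_m_000027_0");
        // second task's directory is never created

        System.out.println(writeLogIndex(first));   // true
        System.out.println(writeLogIndex(second));  // false: ENOENT

        // The fix: create the attempt directory for EVERY task before
        // writing its index file (what DefaultTaskController already does).
        Files.createDirectories(second);
        System.out.println(writeLogIndex(second));  // true
    }
}
```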
[jira] [Updated] (MAPREDUCE-4490) JVM reuse is incompatible with LinuxTaskController (and therefore incompatible with Security)
[ https://issues.apache.org/jira/browse/MAPREDUCE-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sam liu updated MAPREDUCE-4490:
-------------------------------
    Status: Open  (was: Patch Available)

Will upload a new patch for the latest code base of branch origin/branch-1.2.