[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down
[ https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075607#comment-14075607 ] Hudson commented on MAPREDUCE-6002: --- FAILURE: Integrated in Hadoop-Yarn-trunk #625 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/625/]) MAPREDUCE-6002. Made MR task avoid reporting error to AM when the task process is shutting down. Contributed by Wangda Tan. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1613743) * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/LocalContainerLauncher.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java MR task should prevent report error to AM when process is shutting down --- Key: MAPREDUCE-6002 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 2.5.0 Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.5.0 Attachments: MR-6002.patch With MAPREDUCE-5900, preempted MR task should not be treat as failed. But it is still possible a MR task fail and report to AM when preemption take effect and the AM hasn't received completed container from RM yet. It will cause the task attempt marked failed instead of preempted. An example is FileSystem has shutdown hook, it will close all FileSystem instance, if at the same time, the FileSystem is in-use (like reading split details from HDFS), MR task will fail and report the fatal error to MR AM. An exception will be raised: {code} 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645) at java.io.DataInputStream.readByte(DataInputStream.java:265) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348) at org.apache.hadoop.io.Text.readString(Text.java:464) at org.apache.hadoop.io.Text.readString(Text.java:457) at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {code} We should prevent this, because it is possible other exceptions happen when shutting down, we shouldn't report any of such exceptions to AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down
[ https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075628#comment-14075628 ] Hudson commented on MAPREDUCE-6002: --- FAILURE: Integrated in Hadoop-Hdfs-trunk #1817 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1817/]) MAPREDUCE-6002. Made MR task avoid reporting error to AM when the task process is shutting down. Contributed by Wangda Tan. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1613743) * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/LocalContainerLauncher.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java MR task should prevent report error to AM when process is shutting down --- Key: MAPREDUCE-6002 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 2.5.0 Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.5.0 Attachments: MR-6002.patch With MAPREDUCE-5900, preempted MR task should not be treat as failed. But it is still possible a MR task fail and report to AM when preemption take effect and the AM hasn't received completed container from RM yet. It will cause the task attempt marked failed instead of preempted. An example is FileSystem has shutdown hook, it will close all FileSystem instance, if at the same time, the FileSystem is in-use (like reading split details from HDFS), MR task will fail and report the fatal error to MR AM. An exception will be raised: {code} 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645) at java.io.DataInputStream.readByte(DataInputStream.java:265) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348) at org.apache.hadoop.io.Text.readString(Text.java:464) at org.apache.hadoop.io.Text.readString(Text.java:457) at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {code} We should prevent this, because it is possible other exceptions happen when shutting down, we shouldn't report any of such exceptions to AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down
[ https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075636#comment-14075636 ] Hudson commented on MAPREDUCE-6002: --- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1844 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1844/]) MAPREDUCE-6002. Made MR task avoid reporting error to AM when the task process is shutting down. Contributed by Wangda Tan. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1613743) * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/LocalContainerLauncher.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java MR task should prevent report error to AM when process is shutting down --- Key: MAPREDUCE-6002 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 2.5.0 Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.5.0 Attachments: MR-6002.patch With MAPREDUCE-5900, preempted MR task should not be treat as failed. But it is still possible a MR task fail and report to AM when preemption take effect and the AM hasn't received completed container from RM yet. It will cause the task attempt marked failed instead of preempted. An example is FileSystem has shutdown hook, it will close all FileSystem instance, if at the same time, the FileSystem is in-use (like reading split details from HDFS), MR task will fail and report the fatal error to MR AM. An exception will be raised: {code} 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645) at java.io.DataInputStream.readByte(DataInputStream.java:265) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348) at org.apache.hadoop.io.Text.readString(Text.java:464) at org.apache.hadoop.io.Text.readString(Text.java:457) at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {code} We should prevent this, because it is possible other exceptions happen when shutting down, we shouldn't report any of such exceptions to AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down
[ https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075523#comment-14075523 ] Hudson commented on MAPREDUCE-6002: --- FAILURE: Integrated in Hadoop-trunk-Commit #5977 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5977/]) MAPREDUCE-6002. Made MR task avoid reporting error to AM when the task process is shutting down. Contributed by Wangda Tan. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1613743) * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/LocalContainerLauncher.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java MR task should prevent report error to AM when process is shutting down --- Key: MAPREDUCE-6002 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 2.5.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: MR-6002.patch With MAPREDUCE-5900, preempted MR task should not be treat as failed. But it is still possible a MR task fail and report to AM when preemption take effect and the AM hasn't received completed container from RM yet. It will cause the task attempt marked failed instead of preempted. An example is FileSystem has shutdown hook, it will close all FileSystem instance, if at the same time, the FileSystem is in-use (like reading split details from HDFS), MR task will fail and report the fatal error to MR AM. An exception will be raised: {code} 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645) at java.io.DataInputStream.readByte(DataInputStream.java:265) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348) at org.apache.hadoop.io.Text.readString(Text.java:464) at org.apache.hadoop.io.Text.readString(Text.java:457) at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {code} We should prevent this, because it is possible other exceptions happen when shutting down, we shouldn't report any of such exceptions to AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down
[ https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074185#comment-14074185 ] Zhijie Shen commented on MAPREDUCE-6002: TaskUmbilicalProtocol#fsError and #fatalError are the two calls that will result in TA_FAILMSG, and consequently move a task attempt to failure. Checking whether the task attempt process is stopping and only notifying the listener when the process is NOT stopping can prevent the task attempt being moved to FAILED because of the exception caused by stopping a process, such as the aforementioned case. In general, the solution makes sense to me. Just one concern: it may result in another race condition on the contradictory. For example, an exception which is NOT caused by stopping the task attempt process happens MERELY before the shutdown hook is invoked. Then, when we check whether the task attempt process is stopping, it already returns true. In this extreme case, the exception is going to be missed by the listener, and the task attempt is moved to PREEMPTED instead of FAILED. While marking a TA that is supposed to PREEMPTED as FAILED and vice versa are the rare cases, IMHO, they have different levels of down side. Marking a TA that is supposed to PREEMPTED as FAILED is likely to make the task not be able to retry. IMHO, On the other side, marking a TA that is supposed to FAILED as PREEMPTED will make the attempt retry even it used up the retry quota, which is not too bad. Offering users more what the are promised sounds better than offering less. Any thoughts? MR task should prevent report error to AM when process is shutting down --- Key: MAPREDUCE-6002 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 2.5.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: MR-6002.patch With MAPREDUCE-5900, preempted MR task should not be treat as failed. But it is still possible a MR task fail and report to AM when preemption take effect and the AM hasn't received completed container from RM yet. It will cause the task attempt marked failed instead of preempted. An example is FileSystem has shutdown hook, it will close all FileSystem instance, if at the same time, the FileSystem is in-use (like reading split details from HDFS), MR task will fail and report the fatal error to MR AM. An exception will be raised: {code} 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645) at java.io.DataInputStream.readByte(DataInputStream.java:265) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348) at org.apache.hadoop.io.Text.readString(Text.java:464) at org.apache.hadoop.io.Text.readString(Text.java:457) at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {code} We should prevent this, because it is possible other exceptions happen when shutting down, we shouldn't report any of such exceptions to AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down
[ https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074228#comment-14074228 ] Wangda Tan commented on MAPREDUCE-6002: --- Hi [~zjshen], Thanks for review I totally agree with you, I think we should ignore the extremely race condition you mentioned too, since we don't deprive its right to retry :) Wangda MR task should prevent report error to AM when process is shutting down --- Key: MAPREDUCE-6002 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 2.5.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: MR-6002.patch With MAPREDUCE-5900, preempted MR task should not be treat as failed. But it is still possible a MR task fail and report to AM when preemption take effect and the AM hasn't received completed container from RM yet. It will cause the task attempt marked failed instead of preempted. An example is FileSystem has shutdown hook, it will close all FileSystem instance, if at the same time, the FileSystem is in-use (like reading split details from HDFS), MR task will fail and report the fatal error to MR AM. An exception will be raised: {code} 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645) at java.io.DataInputStream.readByte(DataInputStream.java:265) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348) at org.apache.hadoop.io.Text.readString(Text.java:464) at org.apache.hadoop.io.Text.readString(Text.java:457) at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {code} We should prevent this, because it is possible other exceptions happen when shutting down, we shouldn't report any of such exceptions to AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down
[ https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074398#comment-14074398 ] Jason Lowe commented on MAPREDUCE-6002: --- bq. In this extreme case, the exception is going to be missed by the listener, and the task attempt is moved to PREEMPTED instead of FAILED. This will only be true if the task was trying to be preempted *as* it failed, correct? The AM will see the container completion event from the RM, and since the attempt didn't explicitly report a completion status it will key off the container status code to determine the attempt's fate. If the attempt really happened to fail independently just as it was being preempted then that's a race we can live with either way, IMHO. The thing we don't want is to have the attempt fail _because_ of a preemption or task-kill, so I think it will be safe to squelch errors that are occurring during shutdown. I think the biggest issue will be if an error in the task attempt causes the entire JVM to start shutting down before the error is reported via the umbilical (e.g.: the user code calls System.exit on an error). The good news is that the task attempt will still end up in the FAILED state but any useful, context-specific error messages from the attempt will not be reported via the umbilical. The AM will only know that the task attempt exited without saying why. I suspect this is a rare situation when it occurs, probably correctable in the user's code in many of those cases, and the attempt logs should be able to sort things out if it does occur. MR task should prevent report error to AM when process is shutting down --- Key: MAPREDUCE-6002 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 2.5.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: MR-6002.patch With MAPREDUCE-5900, preempted MR task should not be treat as failed. But it is still possible a MR task fail and report to AM when preemption take effect and the AM hasn't received completed container from RM yet. It will cause the task attempt marked failed instead of preempted. An example is FileSystem has shutdown hook, it will close all FileSystem instance, if at the same time, the FileSystem is in-use (like reading split details from HDFS), MR task will fail and report the fatal error to MR AM. An exception will be raised: {code} 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645) at java.io.DataInputStream.readByte(DataInputStream.java:265) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348) at org.apache.hadoop.io.Text.readString(Text.java:464) at org.apache.hadoop.io.Text.readString(Text.java:457) at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {code} We should prevent this, because it is possible other exceptions happen when shutting down, we shouldn't report any of such exceptions to AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down
[ https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074592#comment-14074592 ] Zhijie Shen commented on MAPREDUCE-6002: Thanks for your feedback, [~jlowe]! bq. This will only be true if the task was trying to be preempted as it failed, correct? The AM will see the container completion event from the RM, and since the attempt didn't explicitly report a completion status it will key off the container status code to determine the attempt's fate. If the attempt really happened to fail independently just as it was being preempted then that's a race we can live with either way, IMHO. The thing we don't want is to have the attempt fail because of a preemption or task-kill, so I think it will be safe to squelch errors that are occurring during shutdown. Exactly. This was the point I'd like to make, and the patch is actually solving the problem in this way. bq. The good news is that the task attempt will still end up in the FAILED state but any useful, Isn't it possible that PREEMPTED from RM still comes before AM knows the task attempt FAILED? Say preemption logic has already happened on RM, and the completed container status has already be sent to AM, but NM hasn't notified RM and PingChecker hasn't found it. Anyway, it is still safe, because it doesn't break the agreement that we don't want is to have the attempt fail because of a preemption or task-kill. MR task should prevent report error to AM when process is shutting down --- Key: MAPREDUCE-6002 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 2.5.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: MR-6002.patch With MAPREDUCE-5900, preempted MR task should not be treat as failed. But it is still possible a MR task fail and report to AM when preemption take effect and the AM hasn't received completed container from RM yet. It will cause the task attempt marked failed instead of preempted. An example is FileSystem has shutdown hook, it will close all FileSystem instance, if at the same time, the FileSystem is in-use (like reading split details from HDFS), MR task will fail and report the fatal error to MR AM. An exception will be raised: {code} 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645) at java.io.DataInputStream.readByte(DataInputStream.java:265) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348) at org.apache.hadoop.io.Text.readString(Text.java:464) at org.apache.hadoop.io.Text.readString(Text.java:457) at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {code} We should prevent this, because it is possible other exceptions happen when shutting down, we shouldn't report any of such exceptions to AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down
[ https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075183#comment-14075183 ] Wangda Tan commented on MAPREDUCE-6002: --- Jason, thanks for your comments, bq. I suspect this is a rare situation when it occurs, probably correctable in the user's code in many of those cases, and the attempt logs should be able to sort things out if it does occur. I agree, in normal failure, no matter what kind of exception throw, YarnChild should be able to catch them and report to AM. In some rare cases, if some error cause JVM starting shutdown before reporting to AM, it cannot successfully report to AM in a big chance even if we don't change this. To Zhijie, bq. Isn't it possible that PREEMPTED from RM still comes before AM knows the task attempt FAILED? I think what Jason mentioned is another case: there's no preemption happens, it's a failure happens in TA side, and JVM shutdown happens before TA can report such error to AM. Thanks, Wangda MR task should prevent report error to AM when process is shutting down --- Key: MAPREDUCE-6002 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 2.5.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: MR-6002.patch With MAPREDUCE-5900, preempted MR task should not be treat as failed. But it is still possible a MR task fail and report to AM when preemption take effect and the AM hasn't received completed container from RM yet. It will cause the task attempt marked failed instead of preempted. An example is FileSystem has shutdown hook, it will close all FileSystem instance, if at the same time, the FileSystem is in-use (like reading split details from HDFS), MR task will fail and report the fatal error to MR AM. An exception will be raised: {code} 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645) at java.io.DataInputStream.readByte(DataInputStream.java:265) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348) at org.apache.hadoop.io.Text.readString(Text.java:464) at org.apache.hadoop.io.Text.readString(Text.java:457) at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {code} We should prevent this, because it is possible other exceptions happen when shutting down, we shouldn't report any of such exceptions to AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down
[ https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075253#comment-14075253 ] Zhijie Shen commented on MAPREDUCE-6002: +1 for the patch. I will commit the patch late tomorrow, given [~jlowe] some time, if you'd like to look into the patch as well. MR task should prevent report error to AM when process is shutting down --- Key: MAPREDUCE-6002 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 2.5.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: MR-6002.patch With MAPREDUCE-5900, preempted MR task should not be treat as failed. But it is still possible a MR task fail and report to AM when preemption take effect and the AM hasn't received completed container from RM yet. It will cause the task attempt marked failed instead of preempted. An example is FileSystem has shutdown hook, it will close all FileSystem instance, if at the same time, the FileSystem is in-use (like reading split details from HDFS), MR task will fail and report the fatal error to MR AM. An exception will be raised: {code} 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645) at java.io.DataInputStream.readByte(DataInputStream.java:265) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348) at org.apache.hadoop.io.Text.readString(Text.java:464) at org.apache.hadoop.io.Text.readString(Text.java:457) at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {code} We should prevent this, because it is possible other exceptions happen when shutting down, we shouldn't report any of such exceptions to AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-6002) MR task should prevent report error to AM when process is shutting down
[ https://issues.apache.org/jira/browse/MAPREDUCE-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072957#comment-14072957 ] Hadoop QA commented on MAPREDUCE-6002: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657552/MR-6002.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4764//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4764//console This message is automatically generated. MR task should prevent report error to AM when process is shutting down --- Key: MAPREDUCE-6002 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6002 Project: Hadoop Map/Reduce Issue Type: Bug Components: task Affects Versions: 2.5.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: MR-6002.patch With MAPREDUCE-5900, preempted MR task should not be treat as failed. But it is still possible a MR task fail and report to AM when preemption take effect and the AM hasn't received completed container from RM yet. It will cause the task attempt marked failed instead of preempted. An example is FileSystem has shutdown hook, it will close all FileSystem instance, if at the same time, the FileSystem is in-use (like reading split details from HDFS), MR task will fail and report the fatal error to MR AM. An exception will be raised: {code} 2014-07-22 01:46:19,613 FATAL [IPC Server handler 10 on 56903] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1405985051088_0018_m_25_0 - exited : java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:645) at java.io.DataInputStream.readByte(DataInputStream.java:265) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348) at org.apache.hadoop.io.Text.readString(Text.java:464) at org.apache.hadoop.io.Text.readString(Text.java:457) at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:357) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:731) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {code} We should prevent this, because it is possible other exceptions happen when shutting down, we shouldn't report any of such exceptions to AM. -- This message was sent by Atlassian JIRA (v6.2#6252)