[jira] [Commented] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling
[ https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424923#comment-16424923 ]

chen xiao commented on SPARK-19276:
-----------------------------------

[~imranr], I have created another task about this issue several days ago: https://issues.apache.org/jira/browse/SPARK-23816. Could you please help review it?

> FetchFailures can be hidden by user (or sql) exception handling
> ---------------------------------------------------------------
>
>                 Key: SPARK-19276
>                 URL: https://issues.apache.org/jira/browse/SPARK-19276
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core, SQL
>    Affects Versions: 2.1.0
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>            Priority: Critical
>             Fix For: 2.2.0
>
> The scheduler handles node failures by looking for a special
> {{FetchFailedException}} thrown by the shuffle block fetcher. This is
> handled in {{Executor}} and then passed as a special message back to the driver:
> https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/core/src/main/scala/org/apache/spark/executor/Executor.scala#L403
> However, user code exists in between the shuffle block fetcher and that catch
> block -- it could intercept the exception, wrap it with something else, and
> throw a different exception. If that happens, Spark treats it as an ordinary
> task failure and retries the task, rather than regenerating the missing
> shuffle data. The task is eventually retried 4 times, it's doomed to fail
> each time, and the job fails.
> You might think that no user code should do that -- but even Spark SQL does it:
> https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L214
> Here's an example stack trace. This is from Spark 1.6, so the SQL code is
> not the same, but the problem is still there:
> {noformat}
> 17/01/13 19:18:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 1983.0 (TID 304851, xxx): org.apache.spark.SparkException: Task failed while
> writing rows.
>         at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
>         at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>         at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>         at org.apache.spark.scheduler.Task.run(Task.scala:89)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect
> to xxx/yyy:zzz
>         at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
>         ...
> 17/01/13 19:19:29 ERROR scheduler.TaskSetManager: Task 0 in stage 1983.0
> failed 4 times; aborting job
> {noformat}
> I think the right fix here is to also set a fetch failure status in the
> {{TaskContextImpl}}, so the executor can check that instead of just one
> exception.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
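The failure mode in the issue description can be reproduced in miniature without Spark. The following is a hedged sketch with illustrative class and method names (none of these are Spark's actual classes): a user-level catch block rewraps the fetch failure, so a type match on the thrown exception, which is roughly what the executor does, never sees the `FetchFailedException` itself:

```scala
// Minimal sketch (hypothetical names, no Spark dependency) of how a
// user-level try/catch hides a fetch failure from the executor.
class FetchFailedException(msg: String) extends Exception(msg)

// Mirrors the FileFormatWriter pattern: any Throwable from the write loop
// is rewrapped, so the fetch failure survives only as the cause.
def userWriteLoop(fetchRow: () => Int): Unit =
  try {
    fetchRow()
  } catch {
    case t: Throwable =>
      throw new RuntimeException("Task failed while writing rows.", t)
  }

// What the executor effectively does: match on the exception's own type.
def classifyTaskFailure(t: Throwable): String = t match {
  case _: FetchFailedException => "FetchFailed"          // driver regenerates shuffle data
  case _                       => "OrdinaryTaskFailure"  // task is merely retried
}
```

With these definitions, a fetch failure thrown inside `userWriteLoop` surfaces as a `RuntimeException`, which `classifyTaskFailure` labels an ordinary failure even though `getCause` is the original `FetchFailedException`.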
[jira] [Commented] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling
[ https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422588#comment-16422588 ]

Imran Rashid commented on SPARK-19276:
--------------------------------------

Oh, thanks for pointing that out [~xchen12138], I think you are absolutely correct. Would you mind opening another bug and cc'ing me? It would be great if you could attach logs as well.
[jira] [Commented] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling
[ https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421879#comment-16421879 ]

chen xiao commented on SPARK-19276:
-----------------------------------

I believe this causes another bug. If a task is killed with the reason "another attempt succeeded", the kill interrupts the task's thread, and a thread blocked on I/O then throws a ClosedByInterruptException. That is an IO exception, but not a fetch-failure exception. So when a speculative task is creating an input stream and is killed via the thread's interrupt, Spark will treat it as a fetch failure and try to regenerate the shuffle data, which is completely wrong.
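The distinction described in the comment above can be sketched as follows (a hedged illustration: `shouldReportFetchFailure` and its parameters are hypothetical, not Spark's actual code). An interrupt delivered to a thread blocked on NIO raises `java.nio.channels.ClosedByInterruptException`, which is an `IOException` but says nothing about lost shuffle data, so mapping every `IOException` seen during a fetch to a fetch failure misclassifies killed speculative tasks:

```scala
import java.io.IOException
import java.nio.channels.ClosedByInterruptException

// Illustrative helper (not Spark's actual logic): decide whether an
// exception seen while fetching shuffle blocks should be reported to
// the driver as a fetch failure.
def shouldReportFetchFailure(t: Throwable, taskWasKilled: Boolean): Boolean = t match {
  // The thread was interrupted (e.g. a speculative duplicate was killed);
  // the remote shuffle data is intact, so this must NOT trigger regeneration.
  case _: ClosedByInterruptException => false
  // A genuine I/O problem on a live task: the shuffle data may be lost.
  case _: IOException if !taskWasKilled => true
  case _ => false
}
```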
[jira] [Commented] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling
[ https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830519#comment-15830519 ]

Mark Hamstra commented on SPARK-19276:
--------------------------------------

Ok, I haven't read your PR closely yet, so I missed that. This question looks like something that could use more eyes and insights. [~kayousterhout][~matei][~r...@databricks.com]
[jira] [Commented] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling
[ https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830487#comment-15830487 ]

Imran Rashid commented on SPARK-19276:
--------------------------------------

[~markhamstra]

bq. I guess my only real question is if we should allow for the possibility of a FetchFailedException not only being caught, but also the failure being remedied by some means other than the usual handling in the driver.

I wondered about this too -- in fact, in the end, the PR *does* allow that. You'll notice that if the fetch failure is set in the task context, but the task succeeds, then I log an error but otherwise let the task continue. We only send it back to the driver if there is also a task failure. I decided to do it that way in case there is some crazy existing code out there which relies on this behavior ... but honestly I'm not sure that is the right decision; maybe it should still send the fetch failure back to the driver and fail the task in that case too.
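The behavior discussed above (record the fetch failure on the task context, and report it to the driver only when the task also fails) can be sketched like this. All names are simplified approximations for illustration, not the actual Spark API:

```scala
// Simplified sketch of the proposed fix: the fetcher records the failure on
// the task context before throwing, so the executor can consult the context
// instead of relying on the thrown exception's type alone.
class FetchFailedException(msg: String) extends Exception(msg)

class TaskContextImpl {
  @volatile private var fetchFailure: Option[FetchFailedException] = None
  def setFetchFailed(e: FetchFailedException): Unit = fetchFailure = Some(e)
  def fetchFailed: Option[FetchFailedException] = fetchFailure
}

// Shuffle-fetcher side: record on the context, then throw. Even if user
// code swallows or rewraps the exception, the context still remembers it.
def throwFetchFailed(ctx: TaskContextImpl, msg: String): Nothing = {
  val e = new FetchFailedException(msg)
  ctx.setFetchFailed(e)
  throw e
}

// Executor side, after the task finishes: a recorded fetch failure plus a
// task failure is reported as FetchFailed; a recorded failure on a task
// that nonetheless succeeded is only logged, per the comment above.
def classifyOutcome(ctx: TaskContextImpl, taskError: Option[Throwable]): String =
  (ctx.fetchFailed, taskError) match {
    case (Some(_), Some(_)) => "FetchFailed"         // driver regenerates shuffle data
    case (Some(_), None)    => "SuccessWithWarning"  // log an error, let it pass
    case (None, Some(_))    => "OrdinaryTaskFailure"
    case (None, None)       => "Success"
  }
```

Even when user code rewraps the thrown `FetchFailedException` in some other exception, the recorded status on the context still lets `classifyOutcome` report a fetch failure.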
[jira] [Commented] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling
[ https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830387#comment-15830387 ]

Mark Hamstra commented on SPARK-19276:
--------------------------------------

This all makes sense, and the PR is a good effort to fix this kind of "accidental" swallowing of FetchFailedException. I guess my only real question is if we should allow for the possibility of a FetchFailedException not only being caught, but also the failure being remedied by some means other than the usual handling in the driver. I'm not sure exactly how or why that kind of "fix the fetch failure before Spark tries to handle it" would be done, and it would seem that something like that would be prone to subtle errors, so maybe we should just set it in stone that nobody but the driver should try to fix a fetch failure -- which would make the approach of your "guarantee that the FetchFailedException is seen by the driver" PR completely correct.
[jira] [Commented] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling
[ https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15829396#comment-15829396 ]

Apache Spark commented on SPARK-19276:
--------------------------------------

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/16639
[jira] [Commented] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling
[ https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828518#comment-15828518 ]

Imran Rashid commented on SPARK-19276:
--------------------------------------

I haven't been successful in creating a test case to reproduce this. In my attempts, I see retries sometimes failing to fetch block *status*, which happens in the initialization of the {{ShuffleBlockFetcherIterator}}. Since that is outside user code, it does get handled correctly, and the shuffle data is regenerated. But the problem is clearly there, and I've seen it affect real users. Open to ideas for the test case & reproduction.