[jira] [Commented] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed

2018-03-28 Thread Li Yuanjian (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418434#comment-16418434 ]

Li Yuanjian commented on SPARK-23811:
-

 

The scenario can be reproduced by the test case below, added to {{DAGSchedulerSuite}}:
{code:java}
/**
 * This tests the case where the original task succeeds after its speculative
 * attempt got a FetchFailed earlier.
 */
test("[SPARK-23811] Fetch failed task should kill other attempt") {
  // Create 3 RDDs with shuffle dependencies on each other: rddA <--- rddB <--- rddC
  val rddA = new MyRDD(sc, 2, Nil)
  val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
  val shuffleIdA = shuffleDepA.shuffleId

  val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
  val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))

  val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)

  submit(rddC, Array(0, 1))

  // Complete both tasks in rddA.
  assert(taskSets(0).stageId === 0 && taskSets(0).stageAttemptId === 0)
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostA", 2)),
    (Success, makeMapStatus("hostB", 2))))

  // The first task succeeds.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(0), Success, makeMapStatus("hostB", 2)))

  // The second task's speculative attempt fails first, but the task itself is
  // still running. This may be caused by ExecutorLost.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(1),
    FetchFailed(makeBlockManagerId("hostA"), shuffleIdA, 0, 0, "ignored"),
    null))
  // Check the currently missing partition.
  assert(mapOutputTracker.findMissingPartitions(shuffleDepB.shuffleId).get.size === 1)
  val missingPartition =
    mapOutputTracker.findMissingPartitions(shuffleDepB.shuffleId).get(0)

  // Soon afterwards, the second task itself succeeds.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(1), Success, makeMapStatus("hostB", 2)))
  // No missing partitions here; this is what makes the child stage never succeed.
  assert(mapOutputTracker.findMissingPartitions(shuffleDepB.shuffleId).get.size === 0)
}
{code}
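To make the bookkeeping behind the two assertions above concrete, here is a minimal, self-contained model of the race. Everything in it ({{ShuffleRegistry}}, {{register}}, {{removeOutputsOnHost}}) is an invented name for illustration; this is not Spark's {{MapOutputTracker}} API, only a sketch of the partition accounting the test observes:
{code:scala}
import scala.collection.mutable

// Invented model, NOT Spark's MapOutputTracker: tracks which partition's
// map output is registered on which host.
class ShuffleRegistry(numPartitions: Int) {
  private val outputs = mutable.Map.empty[Int, String] // partition -> host

  def register(partition: Int, host: String): Unit = outputs(partition) = host

  // Models mapOutputTracker.removeOutputsOnExecutor: drop outputs on a lost host.
  def removeOutputsOnHost(host: String): Unit = {
    val stale = outputs.collect { case (p, h) if h == host => p }
    outputs --= stale
  }

  def findMissingPartitions(): Seq[Int] =
    (0 until numPartitions).filterNot(outputs.contains)
}

object FetchFailedRace extends App {
  val shuffleA = new ShuffleRegistry(numPartitions = 2) // parent stage's shuffle
  shuffleA.register(0, "hostA")
  shuffleA.register(1, "hostB")

  val shuffleB = new ShuffleRegistry(numPartitions = 2) // child map stage's shuffle
  shuffleB.register(0, "hostB")               // task 0 of the map stage succeeds

  shuffleA.removeOutputsOnHost("hostA")       // ExecutorA lost; speculative attempt 1.1
                                              // then hits FetchFailed reading shuffle A
  println(shuffleB.findMissingPartitions())   // Vector(1): stage will be resubmitted

  shuffleB.register(1, "hostB")               // late Success of the original attempt 1.0
  println(shuffleB.findMissingPartitions())   // Vector(): nothing left to rerun
}
{code}
The second println is the heart of the bug: by the time the resubmitted stage asks what is missing, the late Success has already filled the hole, so the stage submits zero tasks and never produces the completion that the child stage is waiting for.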
 

> Same tasks' FetchFailed event comes before Success will cause child stage 
> never succeed
> ---
>
> Key: SPARK-23811
> URL: https://issues.apache.org/jira/browse/SPARK-23811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> This is a bug caused by the abnormal scenario described below:
>  # ShuffleMapTask 1.0 is running; this task fetches data from ExecutorA.
>  # ExecutorA is lost, which triggers `mapOutputTracker.removeOutputsOnExecutor(execId)`; the shuffleStatus changes.
>  # Speculative ShuffleMapTask 1.1 starts and gets a FetchFailed immediately.
>  # ShuffleMapTask 1 is the last task of its stage, so the stage will never succeed, because the DAGScheduler can find no missing task to resubmit (a sketch of one possible fix follows this list).
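The test title above ("Fetch failed task should kill other attempt") points at one possible fix: once any attempt of a task fails with FetchFailed, kill its sibling attempts so that a late Success cannot re-register the map output. The sketch below is hypothetical bookkeeping under that reading, not the actual patch in the pull request; {{AttemptTracker}} and its methods are invented names:
{code:scala}
import scala.collection.mutable

// Hypothetical bookkeeping only; a sketch, not the real DAGScheduler change.
class AttemptTracker(killTask: Long => Unit) {
  // (stageId, partitionId) -> IDs of attempts that are still running
  private val liveAttempts = mutable.Map.empty[(Int, Int), mutable.Set[Long]]

  def onTaskStart(stageId: Int, partition: Int, taskId: Long): Unit =
    liveAttempts.getOrElseUpdate((stageId, partition), mutable.Set.empty) += taskId

  def onTaskEnd(stageId: Int, partition: Int, taskId: Long): Unit =
    liveAttempts.get((stageId, partition)).foreach(_ -= taskId)

  // On FetchFailed, kill the sibling attempts of the same partition so a late
  // Success cannot re-register a map output for a stage that the scheduler has
  // already decided to resubmit.
  def onFetchFailed(stageId: Int, partition: Int, failedTaskId: Long): Unit = {
    for {
      attempts <- liveAttempts.get((stageId, partition)).toSeq
      sibling  <- attempts if sibling != failedTaskId
    } killTask(sibling)
    liveAttempts.remove((stageId, partition))
  }
}

object Demo extends App {
  val tracker = new AttemptTracker(id => println(s"killing attempt $id"))
  tracker.onTaskStart(1, 1, taskId = 10L)         // original ShuffleMapTask 1.0
  tracker.onTaskStart(1, 1, taskId = 11L)         // speculative ShuffleMapTask 1.1
  tracker.onFetchFailed(1, 1, failedTaskId = 11L) // prints: killing attempt 10
}
{code}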






[jira] [Commented] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed

2018-03-28 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418431#comment-16418431 ]

Apache Spark commented on SPARK-23811:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/20930
