[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-21 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1478924572

   @mridulm 
   Yep, it's me.
   Username: StoveM
   Full name: Fencheng Mei


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-21 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1478924279

   Yep, it's me.
   Username: StoveM
   Full name: Fencheng Mei





[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-21 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1478849048

   > I could not cherry pick this into 3.4 and 3.3 - we should fix for those 
branches as well IMO. Can you create a PR against those two branches as well 
@Stove-hust ? Thanks
   
   No problem





[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-20 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1475695975

   > Technically, 3 :-) The UT that I added will generate 2 tests - one for 
push based shuffle and one without. And we have the initial test you added.
   > 
   > You dont need to mark it as written by me ! We can include it in your PR - 
with any changes you make as part of the adding it.
   
   Thanks for your answer. I have added all three UTs (including the one you wrote).





[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-19 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1475528785

   > @Stove-hust To clarify - I meant add this as well (after you had a chance 
to look at it and clean it up if required - this was from my test setup). We 
should keep the UT you had added - and it is important to test the specific 
code expectation as it stands today.
   
   Sorry, I misunderstood what you meant.
   I think the UT you wrote is great. Can I include it in my PR? I 
will mark that part of the UT as written by you.
   One more question: this PR will then have two UTs, is that right?





[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-18 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1474918516

   > Instead of only testing specifically for the flag - which is subject to 
change as the implementation evolves, we should also test for behavior here.
   > 
   > This is the reproducible test I was using (with some changes) to test 
approaches for this bug - and it mimics the case I saw in our production 
reasonably well. (In DAGSchedulerSuite):
   > 
   > ```
   >   for (pushBasedShuffleEnabled <- Seq(true, false)) {
   >     test("SPARK-40082: recomputation of shuffle map stage with no pending partitions should not " +
   >         s"hang. pushBasedShuffleEnabled = $pushBasedShuffleEnabled") {
   > 
   >       if (pushBasedShuffleEnabled) {
   >         initPushBasedShuffleConfs(conf)
   >         DAGSchedulerSuite.clearMergerLocs()
   >         DAGSchedulerSuite.addMergerLocs(Seq("host1", "host2", "host3", "host4", "host5"))
   >       }
   > 
   >       var taskIdCount = 0
   > 
   >       var completedStage: List[Int] = Nil
   >       val listener = new SparkListener() {
   >         override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
   >           completedStage = completedStage :+ event.stageInfo.stageId
   >         }
   >       }
   >       sc.addSparkListener(listener)
   > 
   >       val fetchFailParentPartition = 0
   > 
   >       val shuffleMapRdd0 = new MyRDD(sc, 2, Nil)
   >       val shuffleDep0 = new ShuffleDependency(shuffleMapRdd0, new HashPartitioner(2))
   > 
   >       val shuffleMapRdd1 = new MyRDD(sc, 2, List(shuffleDep0), tracker = mapOutputTracker)
   >       val shuffleDep1 = new ShuffleDependency(shuffleMapRdd1, new HashPartitioner(2))
   > 
   >       val reduceRdd = new MyRDD(sc, 2, List(shuffleDep1), tracker = mapOutputTracker)
   > 
   >       // submit the initial mapper stage, generate shuffle output for first reducer stage.
   >       submitMapStage(shuffleDep0)
   > 
   >       // Map stage completes successfully,
   >       completeShuffleMapStageSuccessfully(0, 0, 3, Seq("hostA", "hostB"))
   >       taskIdCount += 2
   >       assert(completedStage === List(0))
   > 
   >       // Now submit the first reducer stage
   >       submitMapStage(shuffleDep1)
   > 
   >       def createTaskInfo(speculative: Boolean): TaskInfo = {
   >         val taskInfo = new TaskInfo(
   >           taskId = taskIdCount,
   >           index = 0,
   >           attemptNumber = 0,
   >           partitionId = 0,
   >           launchTime = 0L,
   >           executorId = "",
   >           host = "hostC",
   >           TaskLocality.ANY,
   >           speculative = speculative)
   >         taskIdCount += 1
   >         taskInfo
   >       }
   > 
   >       val normalTask = createTaskInfo(speculative = false)
   >       val speculativeTask = createTaskInfo(speculative = true)
   > 
   >       // fail task 1.0 due to FetchFailed, and make 1.1 succeed.
   >       runEvent(makeCompletionEvent(taskSets(1).tasks(0),
   >         FetchFailed(makeBlockManagerId("hostA"), shuffleDep0.shuffleId, normalTask.taskId,
   >           fetchFailParentPartition, normalTask.index, "ignored"),
   >         result = null,
   >         Seq.empty,
   >         Array.empty,
   >         normalTask))
   > 
   >       // Make the speculative task succeed after initial task has failed
   >       runEvent(makeCompletionEvent(taskSets(1).tasks(0), Success,
   >         result = MapStatus(BlockManagerId("hostD-exec1", "hostD", 34512),
   >           Array.fill[Long](2)(2), mapTaskId = speculativeTask.taskId),
   >         taskInfo = speculativeTask))
   > 
   >       // The second task, for partition 1 succeeds as well.
   >       runEvent(makeCompletionEvent(taskSets(1).tasks(1), Success,
   >         result = MapStatus(BlockManagerId("hostE-exec2", "hostE", 23456),
   >           Array.fill[Long](2)(2), mapTaskId = taskIdCount),
   >       ))
   >       taskIdCount += 1
   > 
   >       sc.listenerBus.waitUntilEmpty()
   >       assert(completedStage === List(0, 2))
   > 
   >       // the stages will now get resubmitted due to the failure
   >       Thread.sleep(DAGScheduler.RESUBMIT_TIMEOUT * 2)
   > 
   >       // parent map stage resubmitted
   >       assert(scheduler.runningStages.size === 1)
   >       val mapStage = scheduler.runningStages.head
   > 
   >       // Stage 1 is same as Stage 0 - but created for the ShuffleMapTask 2, as it is a
   >       // different job
   >       assert(mapStage.id === 1)
   >       assert(mapStage.latestInfo.failureReason.isEmpty)
   >       // only the partition reported in fetch failure is resubmitted
   >       assert(mapStage.latestInfo.numTasks === 1)
   > 
   >       val stage0Retry = taskSets.filter(_.stageId == 1)
   >       assert(stage0Retry.size === 1)
   >       // make the original task succeed
   >       runEvent(makeCompletionEvent(stage0Retry.head.tasks(fetchFailParentPartition), Success,
   > 

[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-17 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1474162354

   > So this is an interesting coincidence, I literally encountered a 
production job which seems to be hitting this exact same issue :-) I was in the 
process of creating a test case, but my intuition was along the same lines as 
this PR.
   > 
   > Can you create a test case to validate this behavior @Stove-hust ? 
Essentially it should fail with current master, and succeed after this change.
   > 
   > Thanks for working on this fix
   
   Added UT





[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-17 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1473303194

   > So this is an interesting coincidence, I literally encountered a 
production job which seems to be hitting this exact same issue :-) I was in the 
process of creating a test case, but my intuition was along the same lines as 
this PR.
   > 
   > Can you create a test case to validate this behavior @Stove-hust ? 
Essentially it should fail with current master, and succeed after this change.
   > 
   > Thanks for working on this fix
   
   No problem





[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-15 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1469402096

   > @Stove-hust Haven't had a chance to look at it yet. I'll take a look at it 
this week.
   
   Thanks





[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks

2023-03-14 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-146938

   @otterc Hello, is there anything else I should add?





[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082]

2023-03-13 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1467340408

   > @Stove-hust Thank you for reporting and the patch. Would you be able to 
share driver logs?
   
   Sure (with some comments added):
   --- stage 10 failed 
   22/10/15 10:55:58 WARN task-result-getter-1 TaskSetManager: Lost task 435.1 
in stage 10.0 (TID 6822, zw02-data-hdp-dn21102.mt, executor 101): 
FetchFailed(null, shuffleId=3, mapIndex=-1, mapId=-1, reduceId=435, message=
   org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) failed in 601.792 s due 
to org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   
   --- resubmit stage 10 && parentStage 9
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) and ShuffleMapStage 10 
(processCmd at CliDriver.java:386) due to fetch failure
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
failed stages
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Push-based 
shuffle disabled for ShuffleMapStage 9 (processCmd at CliDriver.java:386) since 
it is already shuffle merge finalized
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 3 
missing tasks from ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at 
CliDriver.java:386) (first 15 tasks are for partitions Vector(98, 372, 690))
   22/10/15 10:55:58 INFO dag-scheduler-event-loop YarnClusterScheduler: Adding 
task set 9.1 with 3 tasks
   
   --- The tasks of the first attempt of stage 10 finish one after another, and 
notifyDriverAboutPushCompletion ends stage 10 and schedules the finalize task. 
Because the stage is not in runningStages, it cannot be marked 
shuffleMergeFinalized.
   22/10/15 10:55:58 INFO task-result-getter-0 TaskSetManager: Finished task 
325.0 in stage 10.0 (TID 6166) in 154455 ms on zw02-data-hdp-dn25537.mt 
(executor 117) (494/500)
   22/10/15 10:55:59 WARN task-result-getter-1 TaskSetManager: Lost task 325.1 
in stage 10.0 (TID 6671, zw02-data-hdp-dn23160.mt, executor 47): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 WARN task-result-getter-1 TaskSetManager: Lost task 358.1 
in stage 10.0 (TID 6731, zw02-data-hdp-dn25537.mt, executor 95): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 INFO task-result-getter-1 TaskSetManager: Task 358.1 in 
stage 10.0 (TID 6731) failed, but the task will not be re-executed (either 
because the task failed with a shuffle data fetch failure, so the previous 
stage needs to be re-run, or because a different copy of the task has already 
succeeded).
   
   --- Removed TaskSet 10.0, whose tasks have all completed
   22/10/15 10:56:22 INFO task-result-getter-1 TaskSetManager: Ignoring 
task-finished event for 435.0 in stage 10.0 because task 435 has already 
completed successfully
   22/10/15 10:56:22 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 10.0, whose tasks have all completed, from pool 
   
   --- notifyDriverAboutPushCompletion stage 10
   22/10/15 10:56:23 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) scheduled for finalizing 
shuffle merge in 0 s
   22/10/15 10:56:23 INFO shuffle-merge-finalizer-2 DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) finalizing the shuffle 
merge with registering merge results set to true
   
   --- stage 9 finished 
   22/10/15 10:57:51 INFO task-result-getter-1 TaskSetManager: Finished task 
2.0 in stage 9.1 (TID 6825) in 112825 ms on zw02-data-hdp-dn25559.mt (executor 
74) (3/3)
   22/10/15 10:57:51 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 9.1, whose tasks have all completed, from pool 
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) finished in 112.832 s
   
   --- resubmit stage 10
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: looking for 
newly runnable stages
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: running: 
Set(ShuffleMapStage 11, ShuffleMapStage 8)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: waiting: 
Set(ShuffleMapStage 12, ShuffleMapStage 10)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: failed: Set()
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 10 (MapPartitionsRDD[36] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:57:51 INFO dag-scheduler-event-loop OutputCommitCoordinator: 

[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082]

2023-03-13 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1467339828

   > @Stove-hust Thank you for reporting and the patch. Would you be able to 
share driver logs?
   
   **
   --- stage 10 failed 
   22/10/15 10:55:58 WARN task-result-getter-1 TaskSetManager: Lost task 435.1 
in stage 10.0 (TID 6822, zw02-data-hdp-dn21102.mt, executor 101): 
FetchFailed(null, shuffleId=3, mapIndex=-1, mapId=-1, reduceId=435, message=
   org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) failed in 601.792 s due 
to org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   
   --- resubmit stage 10 && parentStage 9
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) and ShuffleMapStage 10 
(processCmd at CliDriver.java:386) due to fetch failure
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
failed stages
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Push-based 
shuffle disabled for ShuffleMapStage 9 (processCmd at CliDriver.java:386) since 
it is already shuffle merge finalized
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 3 
missing tasks from ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at 
CliDriver.java:386) (first 15 tasks are for partitions Vector(98, 372, 690))
   22/10/15 10:55:58 INFO dag-scheduler-event-loop YarnClusterScheduler: Adding 
task set 9.1 with 3 tasks
   
   --- The tasks of the first attempt of stage 10 finish one after another, and 
notifyDriverAboutPushCompletion ends stage 10 and schedules the finalize task. 
Because the stage is not in runningStages, it cannot be marked 
shuffleMergeFinalized.
   22/10/15 10:55:58 INFO task-result-getter-0 TaskSetManager: Finished task 
325.0 in stage 10.0 (TID 6166) in 154455 ms on zw02-data-hdp-dn25537.mt 
(executor 117) (494/500)
   22/10/15 10:55:59 WARN task-result-getter-1 TaskSetManager: Lost task 325.1 
in stage 10.0 (TID 6671, zw02-data-hdp-dn23160.mt, executor 47): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 WARN task-result-getter-1 TaskSetManager: Lost task 358.1 
in stage 10.0 (TID 6731, zw02-data-hdp-dn25537.mt, executor 95): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 INFO task-result-getter-1 TaskSetManager: Task 358.1 in 
stage 10.0 (TID 6731) failed, but the task will not be re-executed (either 
because the task failed with a shuffle data fetch failure, so the previous 
stage needs to be re-run, or because a different copy of the task has already 
succeeded).
   
   --- Removed TaskSet 10.0, whose tasks have all completed
   22/10/15 10:56:22 INFO task-result-getter-1 TaskSetManager: Ignoring 
task-finished event for 435.0 in stage 10.0 because task 435 has already 
completed successfully
   22/10/15 10:56:22 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 10.0, whose tasks have all completed, from pool 
   
   --- notifyDriverAboutPushCompletion stage 10
   22/10/15 10:56:23 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) scheduled for finalizing 
shuffle merge in 0 s
   22/10/15 10:56:23 INFO shuffle-merge-finalizer-2 DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) finalizing the shuffle 
merge with registering merge results set to true
   
   --- stage 9 finished 
   22/10/15 10:57:51 INFO task-result-getter-1 TaskSetManager: Finished task 
2.0 in stage 9.1 (TID 6825) in 112825 ms on zw02-data-hdp-dn25559.mt (executor 
74) (3/3)
   22/10/15 10:57:51 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 9.1, whose tasks have all completed, from pool 
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) finished in 112.832 s
   
   --- resubmit stage 10
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: looking for 
newly runnable stages
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: running: 
Set(ShuffleMapStage 11, ShuffleMapStage 8)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: waiting: 
Set(ShuffleMapStage 12, ShuffleMapStage 10)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: failed: Set()
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 10 (MapPartitionsRDD[36] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:57:51 INFO dag-scheduler-event-loop OutputCommitCoordinator: 
Reusing state from 

[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082]

2023-03-13 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1467339346

   > @Stove-hust Thank you for reporting and the patch. Would you be able to 
share driver logs?
   
   Sure.
   `# stage 10 failed 
   22/10/15 10:55:58 WARN task-result-getter-1 TaskSetManager: Lost task 435.1 
in stage 10.0 (TID 6822, zw02-data-hdp-dn21102.mt, executor 101): 
FetchFailed(null, shuffleId=3, mapIndex=-1, mapId=-1, reduceId=435, message=
   org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) failed in 601.792 s due 
to org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   
   # resubmit stage 10 && parentStage 9
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) and ShuffleMapStage 10 
(processCmd at CliDriver.java:386) due to fetch failure
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
failed stages
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Push-based 
shuffle disabled for ShuffleMapStage 9 (processCmd at CliDriver.java:386) since 
it is already shuffle merge finalized
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 3 
missing tasks from ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at 
CliDriver.java:386) (first 15 tasks are for partitions Vector(98, 372, 690))
   22/10/15 10:55:58 INFO dag-scheduler-event-loop YarnClusterScheduler: Adding 
task set 9.1 with 3 tasks
   
   # The tasks of the first attempt of stage 10 finish one after another, and 
notifyDriverAboutPushCompletion ends stage 10 and schedules the finalize task. 
Because the stage is not in runningStages, it cannot be marked 
shuffleMergeFinalized.
   22/10/15 10:55:58 INFO task-result-getter-0 TaskSetManager: Finished task 
325.0 in stage 10.0 (TID 6166) in 154455 ms on zw02-data-hdp-dn25537.mt 
(executor 117) (494/500)
   22/10/15 10:55:59 WARN task-result-getter-1 TaskSetManager: Lost task 325.1 
in stage 10.0 (TID 6671, zw02-data-hdp-dn23160.mt, executor 47): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 WARN task-result-getter-1 TaskSetManager: Lost task 358.1 
in stage 10.0 (TID 6731, zw02-data-hdp-dn25537.mt, executor 95): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 INFO task-result-getter-1 TaskSetManager: Task 358.1 in 
stage 10.0 (TID 6731) failed, but the task will not be re-executed (either 
because the task failed with a shuffle data fetch failure, so the previous 
stage needs to be re-run, or because a different copy of the task has already 
succeeded).
   
   # Removed TaskSet 10.0, whose tasks have all completed
   22/10/15 10:56:22 INFO task-result-getter-1 TaskSetManager: Ignoring 
task-finished event for 435.0 in stage 10.0 because task 435 has already 
completed successfully
   22/10/15 10:56:22 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 10.0, whose tasks have all completed, from pool 
   
   # notifyDriverAboutPushCompletion stage 10
   22/10/15 10:56:23 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) scheduled for finalizing 
shuffle merge in 0 s
   22/10/15 10:56:23 INFO shuffle-merge-finalizer-2 DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) finalizing the shuffle 
merge with registering merge results set to true
   
   # stage 9 finished 
   22/10/15 10:57:51 INFO task-result-getter-1 TaskSetManager: Finished task 
2.0 in stage 9.1 (TID 6825) in 112825 ms on zw02-data-hdp-dn25559.mt (executor 
74) (3/3)
   22/10/15 10:57:51 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 9.1, whose tasks have all completed, from pool 
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) finished in 112.832 s
   
   # resubmit stage 10
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: looking for 
newly runnable stages
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: running: 
Set(ShuffleMapStage 11, ShuffleMapStage 8)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: waiting: 
Set(ShuffleMapStage 12, ShuffleMapStage 10)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: failed: Set()
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 10 (MapPartitionsRDD[36] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:57:51 INFO dag-scheduler-event-loop OutputCommitCoordinator: 
Reusing state from previous
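
The sequence annotated in the log above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Stage`, `ToyScheduler`, `finalizeMerge`, `submitRetry` and the `scheduleFinalizeWhenNoTasks` switch are hypothetical names invented for illustration and are not Spark's DAGScheduler API. It assumes, as the log comments describe, that a finalize request is ignored while the stage is not in `runningStages`, and that a retried stage with no missing tasks has nothing left that would trigger finalization later.

```scala
// Toy model only: hypothetical classes, not Spark's DAGScheduler.
object MergeFinalizeSketch {
  final class Stage(val id: Int) {
    var mergeFinalized: Boolean = false
  }

  final class ToyScheduler(scheduleFinalizeWhenNoTasks: Boolean) {
    val runningStages = scala.collection.mutable.Set.empty[Int]

    // A finalize request is dropped when the stage is not currently running,
    // mirroring the annotation in the log.
    def finalizeMerge(stage: Stage): Unit =
      if (runningStages.contains(stage.id)) stage.mergeFinalized = true

    // Returns true if the retried stage can complete, false if it would wait forever.
    def submitRetry(stage: Stage, missingTasks: Int): Boolean = {
      runningStages += stage.id
      // The idea named in the PR title: schedule finalization when the retry has no tasks to run.
      if (missingTasks == 0 && scheduleFinalizeWhenNoTasks) finalizeMerge(stage)
      // With missing tasks, their completion would trigger finalization later (assumed here).
      missingTasks > 0 || stage.mergeFinalized
    }
  }

  // Replays the logged order: finalize arrives while stage 10 is not running, then the retry.
  def replay(scheduleFinalizeWhenNoTasks: Boolean): Boolean = {
    val stage = new Stage(10)
    val scheduler = new ToyScheduler(scheduleFinalizeWhenNoTasks)
    scheduler.finalizeMerge(stage) // ignored: stage 10 is not in runningStages yet
    scheduler.submitRetry(stage, missingTasks = 0)
  }

  def main(args: Array[String]): Unit = {
    println(replay(scheduleFinalizeWhenNoTasks = false)) // false: the retry would hang
    println(replay(scheduleFinalizeWhenNoTasks = true))  // true: the retry can complete
  }
}
```

In this model the only difference between the two runs is whether the retry itself schedules finalization when it has zero tasks to launch, which is the behavior the PR title describes.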