yaooqinn edited a comment on issue #25941: [WIP][SPARK-29257][Core][Shuffle] Use task attempt number as noop reduce id to handle disk failures during shuffle URL: https://github.com/apache/spark/pull/25941#issuecomment-535545941

We will not create duplicate shuffle files at all. An unsuccessful attempt may still create or pick a subdirectory under "blockmgr-xxx" before the disk fails, but it cannot commit its index and data files because of the disk failure. The "shuffle_$shuffleId_$mapId_$attemptNumber_0.index" and "shuffle_$shuffleId_$mapId_$attemptNumber_0.data" files may be located on different disks, and if either lands on a bad disk, the write aborts and the task attempt fails. Only the successful task attempt can create those files.

In the other case, where the disk failure happens during the shuffle read phase, a fetch failed exception may be thrown and the dependent map tasks re-run, but I believe our processing logic for those re-created files does not change.
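To illustrate the naming scheme described above, here is a minimal sketch (not Spark's real resolver API; the function names are hypothetical) showing why embedding the task attempt number in the shuffle file name prevents a retried attempt from colliding with files left behind by a failed attempt:

```python
# Hypothetical helpers mirroring the file-name pattern from the comment:
# "shuffle_$shuffleId_$mapId_$attemptNumber_0.{index,data}".
# The trailing 0 is the noop reduce id this PR proposes to repurpose.

def shuffle_index_file(shuffle_id: int, map_id: int, attempt_number: int) -> str:
    return f"shuffle_{shuffle_id}_{map_id}_{attempt_number}_0.index"

def shuffle_data_file(shuffle_id: int, map_id: int, attempt_number: int) -> str:
    return f"shuffle_{shuffle_id}_{map_id}_{attempt_number}_0.data"

# Two attempts of the same map task write to distinct paths, so a retry
# after a disk failure never overwrites or reads the failed attempt's files.
first = shuffle_index_file(shuffle_id=3, map_id=7, attempt_number=0)
retry = shuffle_index_file(shuffle_id=3, map_id=7, attempt_number=1)
assert first == "shuffle_3_7_0_0.index"
assert retry == "shuffle_3_7_1_0.index"
assert first != retry
```

Since only the successful attempt commits its index and data pair, readers that resolve blocks by the winning attempt's number never see partial files from aborted attempts.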