Faiz Halde created SPARK-43407:
----------------------------------

             Summary: Can executors recover/reuse shuffle files upon failure?
                 Key: SPARK-43407
                 URL: https://issues.apache.org/jira/browse/SPARK-43407
             Project: Spark
          Issue Type: Question
          Components: Spark Core
    Affects Versions: 3.3.1
            Reporter: Faiz Halde
Hello,

We've been in touch with a few Spark specialists who suggested a potential solution to improve the reliability of our shuffle-heavy jobs. Here is what our setup looks like:
 * Spark version: 3.3.1
 * Java version: 1.8
 * We do not use the external shuffle service
 * We use spot instances

We run Spark jobs on clusters that use Amazon EBS volumes, and spark.local.dir is mounted on the EBS volume. One of the offerings from the service we use is EBS migration, which basically means that if a host is about to be evicted, a new host is created and the EBS volume is attached to it.

When Spark assigns a new executor to the newly created instance, can it recover all the shuffle files that are already persisted on the migrated EBS volume? Is this how it works? Do executors recover / re-register the shuffle files that they find? So far I have not come across any recovery mechanism. I can only see {noformat}KubernetesLocalDiskShuffleDataIO{noformat}, which has a pre-init step where it tries to register the available shuffle files with itself.

A natural follow-up: if what they claim is true, then ideally, when an executor is killed or OOM'd and a new executor is spawned on the same host, the new executor should register the existing shuffle files with itself. Is that so?

Thanks

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
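For reference, a minimal sketch of how the pre-init recovery path mentioned above is wired up, assuming a Spark version that ships {noformat}KubernetesLocalDiskShuffleDataIO{noformat} (it is not in 3.3.1; the plugin class and the PVC-reuse options below are taken from the Spark on Kubernetes configuration and may differ between releases):

```properties
# Plug the K8s local-disk shuffle recovery implementation into the
# sort-shuffle IO plugin hook; its pre-init step scans spark.local.dir
# and re-registers any shuffle files it finds there.
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO

# Keep and reuse the executor PVCs across executor restarts so the
# replacement executor sees the old spark.local.dir contents.
spark.kubernetes.driver.ownPersistentVolumeClaim=true
spark.kubernetes.driver.reusePersistentVolumeClaim=true
```

Note this mechanism is specific to the Kubernetes scheduler backend; on a non-K8s deployment with migrated EBS volumes (as in our setup), nothing analogous appears to scan the local dirs on executor start.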