Hi Spark Community,
I am working on an optimization where I need to map Spark partition IDs to
their underlying input file names before job execution starts.

My Approach:
I traverse *df.queryExecution.executedPlan -> FileSourceScanExec ->
FileScanRDD*
<https://github.com/apache/spark/blob/6df8d57b30e7fad18cb9e05309eed4e801128b62/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L80>
to extract the file mapping directly from the driver's metadata. This gives
me a *Map[PartitionID, Seq[FileName]]* instantly, without triggering a Spark
job. Later, I use a SparkListener to identify finished files (a sketch of
the listener follows my questions below).
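
For concreteness, here is roughly how I build the mapping. This is only a
minimal sketch against Spark 3.x internals: fileScanRdds and
partitionFileMap are my own helper names, FileScanRDD.filePartitions and
PartitionedFile.filePath are internal fields that may differ between
versions, and it assumes queryExecution.toRdd reflects the plan I care
about.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.FileScanRDD

// Walk the RDD lineage of the physical plan and collect every FileScanRDD.
// Internal API: this traversal may break across Spark versions.
def fileScanRdds(rdd: RDD[_]): Seq[FileScanRDD] = rdd match {
  case f: FileScanRDD => Seq(f)
  case other => other.dependencies.flatMap(d => fileScanRdds(d.rdd))
}

def partitionFileMap(df: DataFrame): Map[Int, Seq[String]] = {
  fileScanRdds(df.queryExecution.toRdd).flatMap { scan =>
    // Each FilePartition carries its partition index and the files it reads.
    scan.filePartitions.map { part =>
      part.index -> part.files.map(_.filePath.toString).toSeq
    }
  }.toMap
}

The mapping is captured on the driver before I trigger the actual job.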

My Questions:
- *Immutability*: Can I guarantee that the mapping inside
FileScanRDD.partitions is immutable for the lifespan of that specific
DataFrame/RDD execution plan?

- *Dynamic Allocation & Failures*: If a job runs on a cluster with Dynamic
Allocation enabled (executors added/removed), or if node failures cause task
retries: is it guaranteed that the Partition ID -> Files mapping remains
constant? My assumption: the scheduler may reschedule a task onto a
different node, but the partition mapping itself stays the same. Is this
correct? (This is what the listener sketch below relies on.)

- *Adaptive Query Execution (AQE)*: I am aware that since Spark 3.2.0,
spark.sql.adaptive.enabled
<https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution>
is true by default. Does AQE ever modify the initial FileSourceScanExec
partitions at runtime in a simple Read -> Write flow (without explicit
shuffles)? (My fallback, sketched below, would be to disable it.)
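
For the second question: here is roughly what my listener does. Again only
a minimal sketch; it assumes the scan stage ID is already known (resolving
it is elided here) and that a task's index equals its partition ID, which
holds for a plain scan stage but can break if partitions are coalesced or
skipped.

import scala.collection.concurrent.TrieMap
import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Record a partition's files as finished once any attempt for that
// partition succeeds. A retried or speculative attempt carries the same
// partition ID and simply overwrites the same entry, which is why the
// stability of the Partition ID -> Files mapping matters to me.
class FinishedFilesListener(
    scanStageId: Int,                       // assumed known up front
    partitionFiles: Map[Int, Seq[String]])  // from partitionFileMap above
  extends SparkListener {

  val finished = TrieMap.empty[Int, Seq[String]]

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.stageId == scanStageId && taskEnd.reason == Success) {
      // Assumption: task index == partition ID for the scan stage.
      val pid = taskEnd.taskInfo.index
      partitionFiles.get(pid).foreach(files => finished.put(pid, files))
    }
  }
}

It is registered via spark.sparkContext.addSparkListener(...) before the
action is triggered.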
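
For the third question: if AQE can in fact re-plan the scan, my fallback
would be to pin it off for the session that builds and relies on the
mapping, accepting the loss of AQE's other optimizations:

// Fallback under consideration, not a recommendation: disable AQE for
// the session whose scan partitioning I need to stay fixed.
spark.conf.set("spark.sql.adaptive.enabled", "false")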

I understand this relies on internal APIs, but I want to ensure the logic
regarding partition ID stability is predictable.
Thanks in advance!
