Hi Mich, Thanks for the reply.

I did come across that file, but it didn't line up with where `PartitionedFile` appears:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala

In fact, the code snippet you shared also references the type `PartitionedFile`.

There is a javadoc.io page for a `PartitionedFile` at org.apache.spark.sql.execution.datasources for spark-sql_2.12:3.0.2:
https://javadoc.io/doc/org.apache.spark/spark-sql_2.12/3.0.2/org/apache/spark/sql/execution/datasources/PartitionedFile.html

I double-checked the source code for version 3.0.2 and it doesn't seem to exist there either:
https://github.com/apache/spark/tree/v3.0.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources

Ashley

On Mon, 8 Apr 2024 at 22:41, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> I believe this is the package
>
> https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala
>
> And the code
>
> case class FilePartition(index: Int, files: Array[PartitionedFile])
>   extends Partition with InputPartition {
>   override def preferredLocations(): Array[String] = {
>     // Computes total number of bytes that can be retrieved from each host.
>     val hostToNumBytes = mutable.HashMap.empty[String, Long]
>     files.foreach { file =>
>       file.locations.filter(_ != "localhost").foreach { host =>
>         hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + file.length
>       }
>     }
>
>     // Selects the first 3 hosts with the most data to be retrieved.
>     hostToNumBytes.toSeq.sortBy {
>       case (host, numBytes) => numBytes
>     }.reverse.take(3).map {
>       case (host, numBytes) => host
>     }.toArray
>   }
> }
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer | Generative AI
> London
> United Kingdom
>
> On Mon, 8 Apr 2024 at 20:31, Ashley McManamon <ashley.mcmana...@quantcast.com> wrote:
>
>> Hi All,
>>
>> I've been diving into the source code to get a better understanding of
>> how file splitting works from a user perspective. I've hit a dead end at
>> `PartitionedFile`, for which I cannot seem to find a definition. It appears
>> as though it should be found at org.apache.spark.sql.execution.datasources,
>> but I find no definition in the entire source code. Am I missing something?
>>
>> I appreciate there may be an obvious answer here, apologies if I'm being
>> naive.
>>
>> Thanks,
>> Ashley McManamon
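
As a side note, the host-ranking logic in the quoted `preferredLocations` can be tried out without Spark on the classpath. The sketch below is only an illustration under stated assumptions: `FileSlice` and `PreferredHosts.topHosts` are hypothetical names, and `FileSlice` is a stand-in that models just the two fields the quoted snippet reads (`length` and `locations`). It is not Spark's actual `PartitionedFile`.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's PartitionedFile: only the two fields
// read by the quoted preferredLocations() are modelled here.
final case class FileSlice(length: Long, locations: Array[String])

object PreferredHosts {
  // Same steps as the quoted snippet: sum bytes per host (ignoring
  // "localhost"), sort ascending by byte count, reverse, keep the first n.
  def topHosts(files: Seq[FileSlice], n: Int = 3): Array[String] = {
    val hostToNumBytes = mutable.HashMap.empty[String, Long]
    for (file <- files; host <- file.locations if host != "localhost") {
      hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + file.length
    }
    hostToNumBytes.toSeq
      .sortBy { case (_, numBytes) => numBytes }
      .reverse
      .take(n)
      .map { case (host, _) => host }
      .toArray
  }
}
```

For example, given slices of 100 bytes on `host-a` and `host-b`, 50 bytes on `host-b` and `localhost`, 10 bytes on `host-c`, and 5 bytes on `host-d`, the per-host totals are 150, 100, 10 and 5, so `topHosts` keeps `host-b`, `host-a` and `host-c` in that order.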