When using spark.read on a large number of small files, these are automatically coalesced into fewer partitions. The only documentation I can find on this is in the Spark 2.0.0 release notes (http://spark.apache.org/releases/spark-release-2-0-0.html), where it simply says:

"Automatic file coalescing for native data sources"

Can anyone point me to documentation explaining what triggers this feature, how it decides how many partitions to coalesce to, and what counts as a "native data source"? I couldn't find any mention of this feature in the SQL Programming Guide, and Google was not helpful.

--
Daniel Siegmann
Senior Software Engineer
SecurityScorecard Inc.
214 W 29th Street, 5th Floor
New York, NY 10001
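P.S. For reference, here is a sketch of the packing logic as I understand it from reading Spark's source (FileSourceScanExec / FilePartition), not from any documentation, so corrections are welcome. The config keys spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes are real; the helper function names are mine, and the sketch ignores splitting of large files and locality:

```python
# Sketch of the partition-packing logic Spark 2.x appears to use for file
# sources. Assumption: based on my reading of the source, NOT documented
# behavior. Ignores large-file splitting and data locality.

DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024  # spark.sql.files.maxPartitionBytes
DEFAULT_OPEN_COST_BYTES = 4 * 1024 * 1024        # spark.sql.files.openCostInBytes

def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES,
                    open_cost=DEFAULT_OPEN_COST_BYTES):
    # Each file is charged its size plus a notional "open cost"; the target
    # partition size is the per-core share, capped by maxPartitionBytes.
    total = sum(file_sizes) + len(file_sizes) * open_cost
    bytes_per_core = total / default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

def pack_files(file_sizes, default_parallelism):
    # Greedily pack files (largest first) into partitions until the running
    # cost would exceed the target split size.
    target = max_split_bytes(file_sizes, default_parallelism)
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + DEFAULT_OPEN_COST_BYTES
    if current:
        partitions.append(current)
    return partitions

# 1000 files of 1 KB each on an 8-core cluster pack into 32 partitions
# rather than 1000, because the 4 MB per-file open cost dominates.
print(len(pack_files([1024] * 1000, default_parallelism=8)))
```

On this reading, the answer to "how many partitions?" would be driven mostly by the two config values above plus the default parallelism, which is what I would like to see confirmed in the docs.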