I think, like all other read operations, it is driven by the input format used, and I believe some variation of CombineFileInputFormat is applied by default. You can test this by forcing a particular input format that yields one file per split; then you should end up with the same number of partitions as data files.
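To make the behaviour concrete: since Spark 2.0 the native file sources bin-pack small files into shared read partitions, governed by `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`. The sketch below is my own approximation of that packing logic in plain Python, not Spark's actual code; the function names (`max_split_bytes`, `coalesce_files`) are made up for illustration.

```python
# Rough sketch of the file-coalescing logic Spark's native data sources
# use (Spark 2.x). The two config names are real Spark SQL options, but
# the packing details here are an approximation for illustration only.

def max_split_bytes(file_sizes, max_partition_bytes=128 * 1024 * 1024,
                    open_cost_in_bytes=4 * 1024 * 1024,
                    default_parallelism=8):
    # spark.sql.files.maxPartitionBytes / spark.sql.files.openCostInBytes
    total = sum(file_sizes) + len(file_sizes) * open_cost_in_bytes
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

def coalesce_files(file_sizes, **kwargs):
    """Greedily pack files into partitions of roughly max_split_bytes each."""
    target = max_split_bytes(file_sizes, **kwargs)
    open_cost = kwargs.get("open_cost_in_bytes", 4 * 1024 * 1024)
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        padded = size + open_cost  # each file "costs" an extra open overhead
        if current and current_size + padded > target:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += padded
    if current:
        partitions.append(current)
    return partitions

# 1000 small files of 1 MB each collapse into far fewer read partitions.
parts = coalesce_files([1024 * 1024] * 1000)
print(len(parts))  # → 40
```

With the defaults above, each 1 MB file is padded with a 4 MB open cost, so about 25 files fit into each 128 MB partition; raising `openCostInBytes` biases Spark toward fewer, larger partitions when there are many tiny files.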
On Sat, 20 May 2017 at 5:12 am, Aakash Basu <aakash.spark....@gmail.com> wrote:

> Hey all,
>
> A reply on this would be great!
>
> Thanks,
> A.B.
>
> On 17-May-2017 1:43 AM, "Daniel Siegmann" <dsiegm...@securityscorecard.io> wrote:
>
>> When using spark.read on a large number of small files, these are
>> automatically coalesced into fewer partitions. The only documentation I can
>> find on this is in the Spark 2.0.0 release notes, where it simply says
>> (http://spark.apache.org/releases/spark-release-2-0-0.html):
>>
>> "Automatic file coalescing for native data sources"
>>
>> Can anyone point me to documentation explaining what triggers this
>> feature, how it decides how many partitions to coalesce to, and what counts
>> as a "native data source"? I couldn't find any mention of this feature in
>> the SQL Programming Guide and Google was not helpful.
>>
>> --
>> Daniel Siegmann
>> Senior Software Engineer
>> *SecurityScorecard Inc.*
>> 214 W 29th Street, 5th Floor
>> New York, NY 10001

--
Best Regards,
Ayan Guha