I think, like all other read operations, it is driven by the input format used, and I believe some variation of CombineFileInputFormat is applied by default. You can test this by forcing a particular input format that yields one file per split; then you should end up with the same number of partitions as data files.
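To make the behaviour concrete: since Spark 2.0 the native file sources bin-pack small files into shared read partitions, governed by `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`. The sketch below is my own approximation of that packing logic in plain Python, not Spark's actual code; the function names (`max_split_bytes`, `coalesce_files`) are made up for illustration.

```python
# Rough sketch of the file-coalescing logic Spark's native data sources
# use (Spark 2.x). The two config names are real Spark SQL options, but
# the packing details here are an approximation for illustration only.

def max_split_bytes(file_sizes, max_partition_bytes=128 * 1024 * 1024,
                    open_cost_in_bytes=4 * 1024 * 1024,
                    default_parallelism=8):
    # spark.sql.files.maxPartitionBytes / spark.sql.files.openCostInBytes
    total = sum(file_sizes) + len(file_sizes) * open_cost_in_bytes
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

def coalesce_files(file_sizes, **kwargs):
    """Greedily pack files into partitions of roughly max_split_bytes each."""
    target = max_split_bytes(file_sizes, **kwargs)
    open_cost = kwargs.get("open_cost_in_bytes", 4 * 1024 * 1024)
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        padded = size + open_cost  # each file "costs" an extra open overhead
        if current and current_size + padded > target:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += padded
    if current:
        partitions.append(current)
    return partitions

# 1000 small files of 1 MB each collapse into far fewer read partitions.
parts = coalesce_files([1024 * 1024] * 1000)
print(len(parts))  # → 40
```

With the defaults above, each 1 MB file is padded with a 4 MB open cost, so about 25 files fit into each 128 MB partition; raising `openCostInBytes` biases Spark toward fewer, larger partitions when there are many tiny files.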
On Sat, 20 May 2017 at 5:12 am, Aakash Basu <aakash.spark....@gmail.com> wrote:

> Hey all,
>
> A reply on this would be great!
>
> Thanks,
> A.B.
>
> On 17-May-2017 1:43 AM, "Daniel Siegmann" <dsiegm...@securityscorecard.io> wrote:
>
>> When using spark.read on a large number of small files, these are
>> automatically coalesced into fewer partitions. The only documentation I can
>> find on this is in the Spark 2.0.0 release notes, where it simply says
>> (http://spark.apache.org/releases/spark-release-2-0-0.html):
>>
>> "Automatic file coalescing for native data sources"
>>
>> Can anyone point me to documentation explaining what triggers this
>> feature, how it decides how many partitions to coalesce to, and what counts
>> as a "native data source"? I couldn't find any mention of this feature in
>> the SQL Programming Guide and Google was not helpful.
>>
>> --
>> Daniel Siegmann
>> Senior Software Engineer
>> *SecurityScorecard Inc.*
>> 214 W 29th Street, 5th Floor
>> New York, NY 10001

--
Best Regards,
Ayan Guha