Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-26 Thread Daniel Siegmann
Thanks for the help, everyone. It seems the automatic coalescing doesn't happen when accessing ORC data through a Hive metastore unless you configure spark.sql.hive.convertMetastoreOrc to be true (it is false by default). I'm not sure if this is documented somewhere, or if there's any reason not to enable it by default.
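A minimal sketch of enabling this, assuming Spark 2.x with Hive support (the database and table names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-native-read")
  .enableHiveSupport()
  // Route metastore ORC tables through Spark's native ORC reader,
  // which performs the automatic file coalescing discussed here.
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .getOrCreate()

// Hypothetical table name for illustration.
val df = spark.table("my_db.my_orc_table")
println(df.rdd.getNumPartitions)
```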

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-20 Thread Kabeer Ahmed
Thank you, Takeshi. As far as I can see from the code pointed to, the default number of bytes to pack into a partition is set to 128MB, the size of the Parquet block size. Daniel, it seems you have a need to modify the number of bytes you pack per partition. I am curious to know the scenario. Please share.
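For reference, a sketch of how the packing size could be tuned at runtime (the 32MB value and the path are just examples):

```scala
// Lower the per-partition packing target from the 128MB default to 32MB.
// The config value is in bytes.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)

// Hypothetical directory of many small Parquet files.
val df = spark.read.parquet("/data/small-files")
println(df.rdd.getNumPartitions) // expect more partitions than with the default
```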

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-20 Thread Takeshi Yamamuro
I think this document points to the logic here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L418 This logic merges small files into a single partition, and you can control the threshold via `spark.sql.files.maxPartitionBytes`.
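As a rough paraphrase of that logic (a simplified sketch, not the actual source), the scan derives a target split size from a few configs and then bin-packs files into partitions up to that size:

```scala
// Simplified sketch of how the target split size is computed before files
// are packed into partitions. Parameter names mirror the relevant configs.
def targetSplitBytes(
    maxPartitionBytes: Long,  // spark.sql.files.maxPartitionBytes (128MB default)
    openCostInBytes: Long,    // spark.sql.files.openCostInBytes (4MB default)
    defaultParallelism: Int,
    fileSizes: Seq[Long]): Long = {
  // Each file is charged an "open cost" so tiny files still carry some weight.
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
}
```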

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-19 Thread ayan guha
I think, like all other read operations, it is driven by the input format used, and I think some variation of a combine file input format is used by default. I think you can test it by forcing a particular input format that gets one file per split; then you should end up with the same number of partitions as the number of input files.
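One way to run that comparison, as a hedged sketch (the path is illustrative):

```scala
// Native reader: small files are packed together, so the partition count
// is typically far below the file count.
val coalesced = spark.read.text("/data/many-small-files")
println(coalesced.rdd.getNumPartitions)

// RDD API with the default TextInputFormat: each small file yields its own
// split, so the partition count should roughly match the file count.
val perFile = spark.sparkContext.textFile("/data/many-small-files")
println(perFile.getNumPartitions)
```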

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-19 Thread Aakash Basu
Hey all, a reply on this would be great! Thanks, A.B. On 17-May-2017 1:43 AM, "Daniel Siegmann" wrote: > When using spark.read on a large number of small files, these are automatically coalesced into fewer partitions. The only documentation I can find on this is in the Spark 2.0.0 release notes.

Documentation on "Automatic file coalescing for native data sources"?

2017-05-16 Thread Daniel Siegmann
When using spark.read on a large number of small files, these are automatically coalesced into fewer partitions. The only documentation I can find on this is in the Spark 2.0.0 release notes, where it simply says (http://spark.apache.org/releases/spark-release-2-0-0.html): "Automatic file coalescing for native data sources".
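For concreteness, the behavior in question looks roughly like this (the path is illustrative):

```scala
// Thousands of small Parquet files come back as far fewer partitions;
// the packing is governed by spark.sql.files.maxPartitionBytes.
val df = spark.read.parquet("/data/thousands-of-small-files")
println(df.rdd.getNumPartitions)
```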