Do you partition by day in your CTAS? If so, CTAS will produce at
least one parquet file for each value of "day". If you have 100 days,
you will end up with at least 100 files, and if the query is executed
in distributed mode there can be more than one file per value.
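
For example, a partitioned CTAS along these lines (table and column
names here are purely illustrative) writes at least one file per
distinct day, and possibly several per day when multiple fragments
write in parallel:

  create table dfs.tmp.daily_agg
  partition by (`day`)
  as
  select `day`, account, sum(amount) as total
  from dfs.tmp.raw_events
  group by `day`, account;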

In order to get one and only one parquet file for each partition
value, turn on this option:

alter session set `store.partition.hash_distribute` = true;
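
The option is session-scoped, so set it right before the CTAS in the
same session; roughly speaking, it hash-distributes rows on the
partition key so that each partition value is written by a single
fragment. Reusing the illustrative names from above:

  alter session set `store.partition.hash_distribute` = true;

  create table dfs.tmp.daily_agg
  partition by (`day`)
  as
  select `day`, account, sum(amount) as total
  from dfs.tmp.raw_events
  group by `day`, account;

  alter session set `store.partition.hash_distribute` = false;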



On Thu, Jun 23, 2016 at 10:26 AM, Jason Altekruse <[email protected]> wrote:
> Apply a sort in your CTAS; this will force the data down to a single
> stream before writing.
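>
> For example, something like this (table and column names are only
> illustrative):
>
>   create table dfs.tmp.daily_agg
>   as
>   select `day`, account, sum(amount) as total
>   from dfs.tmp.raw_events
>   group by `day`, account
>   order by `day`;
>
> The trailing order by is what forces everything down to one stream
> before the writer runs.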
>
> Jason Altekruse
> Software Engineer at Dremio
> Apache Drill Committer
>
> On Thu, Jun 23, 2016 at 10:23 AM, John Omernik <[email protected]> wrote:
>
>> When I have a small query writing smaller data (like aggregate tables
>> for faster aggregates for dashboards, etc.), it appears to write a ton
>> of small files. Not sure why; maybe it's just how the join worked out.
>> I have a "day" that is 1.5 MB in total size, but 400 files total. This
>> seems excessive.
>>
>> While I don't have the "small files" issue, because I run MapR-FS,
>> having 400 files that make up 1.5 MB of total data kills me in the
>> planning phase. How can I get Drill, when doing a CTAS, to go through
>> a round of consolidation on the parquet files?
>>
>> Thanks
>>
>> John
>>
