Jinfeng - I wrote my item prior to reading yours. Just an FYI: when I ran with that setting, I got a "CannotPlanException" with an error message that is easily the longest "non-verbose" one I've ever seen (heck, this beats all the verbose errors I've had). I'd post it here, but I am not sure my Google account has enough storage to handle this message....
(kidding... sorta)

John

On Thu, Jun 23, 2016 at 12:37 PM, Jinfeng Ni <[email protected]> wrote:
> Do you partition by day in your CTAS? If that's the case, CTAS will
> produce at least one parquet file for each value of "day". If you
> have 100 days, then you will end up with at least 100 files. However,
> if the query is executed in distributed mode, there could be more
> than one file per value.
>
> In order to get one and only one parquet file for each partition
> value, turn on this option:
>
> alter session set `store.partition.hash_distribute` = true;
>
> On Thu, Jun 23, 2016 at 10:26 AM, Jason Altekruse <[email protected]> wrote:
> > Apply a sort in your CTAS; this will force the data down to a
> > single stream before writing.
> >
> > Jason Altekruse
> > Software Engineer at Dremio
> > Apache Drill Committer
> >
> > On Thu, Jun 23, 2016 at 10:23 AM, John Omernik <[email protected]> wrote:
> >> When I have a small query writing smaller data (like aggregate
> >> tables for faster dashboard aggregates, etc.), it appears to write
> >> a ton of small files. Not sure why; maybe it's just how the join
> >> worked out. I have a "day" that is 1.5 MB in total size, but 400
> >> files total. This seems excessive.
> >>
> >> While I don't have the "small files" issue because I run MapR-FS,
> >> having 400 files that make up 1.5 MB of total data kills me in the
> >> planning phase. How can I get Drill, when doing a CTAS, to go
> >> through a round of consolidation on the parquet files?
> >>
> >> Thanks
> >>
> >> John
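For reference, a minimal sketch of the partitioned CTAS being discussed, with Jinfeng's hash-distribution option turned on. The table and column names here (dfs.tmp.`events_by_day`, dfs.tmp.`events_raw`, `day`) are placeholders, not taken from the thread:

  -- hash-distribute rows on the partition key, so each partition
  -- value lands on a single writer (one parquet file per value)
  alter session set `store.partition.hash_distribute` = true;

  -- hypothetical source and target names
  create table dfs.tmp.`events_by_day`
  partition by (`day`)
  as
  select `day`, count(*) as event_cnt
  from dfs.tmp.`events_raw`
  group by `day`;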
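And a sketch of Jason's suggestion, using the same placeholder names: per the thread, a top-level sort in the CTAS forces the data down to a single stream before writing, which collapses the output into far fewer files:

  create table dfs.tmp.`events_by_day_sorted`
  as
  select `day`, count(*) as event_cnt
  from dfs.tmp.`events_raw`
  group by `day`
  -- the final sort merges everything into one stream before the writer
  order by `day`;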
