Jinfeng - I wrote my item prior to reading yours. Just an FYI: when I ran with that setting, I got a "CannotPlanException" with an error message that is easily the longest "non-verbose" one I've ever seen (heck, this beats all the verbose errors I've had). I'd post it here, but I am not sure my Google account has enough storage to handle this message....
(kidding... sorta)

John

On Thu, Jun 23, 2016 at 12:37 PM, Jinfeng Ni <[email protected]> wrote:
> Do you partition by day in your CTAS? If that's the case, CTAS will
> produce at least one parquet file for each value of "day". If you
> have 100 days, then you will end up with at least 100 files. However,
> if the query is executed in distributed mode, there could be more
> than one file per value.
>
> In order to get one and only one parquet file for each partition
> value, turn on this option:
>
> alter session set `store.partition.hash_distribute` = true;
>
> On Thu, Jun 23, 2016 at 10:26 AM, Jason Altekruse <[email protected]> wrote:
> > Apply a sort in your CTAS; this will force the data down to a
> > single stream before writing.
> >
> > Jason Altekruse
> > Software Engineer at Dremio
> > Apache Drill Committer
> >
> > On Thu, Jun 23, 2016 at 10:23 AM, John Omernik <[email protected]> wrote:
> >> When I have a small query writing smaller data (like aggregate
> >> tables for faster dashboard aggregates, etc.), it appears to write
> >> a ton of small files. Not sure why; maybe it's just how the join
> >> worked out. I have a "day" that is 1.5 MB in total size, but 400
> >> files total. This seems excessive.
> >>
> >> While I don't have the "small files" issue because I run MapR-FS,
> >> having 400 files that make up 1.5 MB of total data kills me in the
> >> planning phase. How can I get Drill, when doing a CTAS, to go
> >> through a round of consolidation on the parquet files?
> >>
> >> Thanks
> >>
> >> John
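For reference, a minimal sketch of the partitioned CTAS being discussed, with Jinfeng's hash-distribution option turned on. The table and column names here (dfs.tmp.`events_by_day`, dfs.tmp.`events_raw`, `day`) are placeholders, not taken from the thread:

  -- hash-distribute rows on the partition key, so each partition
  -- value lands on a single writer (one parquet file per value)
  alter session set `store.partition.hash_distribute` = true;

  -- hypothetical source and target names
  create table dfs.tmp.`events_by_day`
  partition by (`day`)
  as
  select `day`, count(*) as event_cnt
  from dfs.tmp.`events_raw`
  group by `day`;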
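And a sketch of Jason's suggestion, using the same placeholder names: per the thread, a top-level sort in the CTAS forces the data down to a single stream before writing, which collapses the output into far fewer files:

  create table dfs.tmp.`events_by_day_sorted`
  as
  select `day`, count(*) as event_cnt
  from dfs.tmp.`events_raw`
  group by `day`
  -- the final sort merges everything into one stream before the writer
  order by `day`;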
