Ah, yes, I am not using the PARTITION BY; I am using directory partitions, because I have ongoing days to load (and intraday loads).
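(For reference, the directory-partition approach mentioned above can be sketched as follows. This is a hedged example, not from the thread itself: the workspace `dfs.tmp` and the names `daily_agg`, `src`, `event_day`, `id`, and `amount` are all hypothetical.)

```sql
USE dfs.tmp;

-- Directory partitioning: run one CTAS per load, writing into a
-- date-named subdirectory instead of using PARTITION BY.
CREATE TABLE `daily_agg/2015-05-05` AS
SELECT id, SUM(amount) AS total
FROM `src`
WHERE event_day = '2015-05-05'
GROUP BY id;

-- Later queries can prune by directory via the dir0 pseudo-column:
SELECT * FROM `daily_agg` WHERE dir0 = '2015-05-05';
```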
Thanks

On Fri, Jun 24, 2016 at 12:57 AM, Jinfeng Ni <[email protected]> wrote:

> This hash_distribute option should only matter when you have CTAS
> "partition by". If you do not partition in CTAS, there should be
> no impact at all (in theory). Essentially, this option
> re-distributes the data according to the partition key before Drill
> writes to the target tables. See DRILL-3381 [1].
>
> [1] https://issues.apache.org/jira/browse/DRILL-3381
>
> On Thu, Jun 23, 2016 at 11:59 AM, John Omernik <[email protected]> wrote:
> > It's basically a two-level grouping that has a LEFT JOIN, so:
> >
> > select a.field1, a.field2, sum(b.somefield) as new_thing
> > from table1 a LEFT JOIN table2 b on a.id = b.id
> > where a.field1 = '2015-05-05'
> > group by a.field1, a.field2
> >
> > It's not a very complicated query, but it doesn't like the
> > hash_distribute :)
> >
> > On Thu, Jun 23, 2016 at 1:46 PM, Jinfeng Ni <[email protected]> wrote:
> >
> >> I looked at the code. 1) Drill did log this long CannotPlan msg at
> >> error level. 2) It was replaced with a much shorter msg only
> >> when CannotPlan was caused by a cartesian join.
> >>
> >> I guess your query probably did not have a cartesian join, and
> >> CannotPlan was caused by other reasons. Either way, I think Drill
> >> should not display such a verbose error msg (which is essentially the
> >> planner's internal state).
> >>
> >> I actually do not need the error message. The query itself would be
> >> good enough in many cases. If the query has sensitive information in
> >> column names, you can replace the sensitive column names with arbitrary
> >> names and would still hit the same issue, if your data is from a
> >> schema-on-read source (parquet / csv etc.). Drill's planner does not
> >> have schema info; changing column names across the query would not
> >> affect the planner's behavior.
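(The distinction drawn above can be sketched as follows: `store.partition.hash_distribute` only takes effect when the CTAS itself uses PARTITION BY. This is an illustrative sketch; the table and column names `dfs.tmp.daily_agg`, `src`, `event_day`, `id`, and `amount` are hypothetical, not from the thread.)

```sql
-- Re-distribute rows on the partition key before the parquet writer
-- runs, so each partition value lands in as few files as possible.
ALTER SESSION SET `store.partition.hash_distribute` = true;

-- The option only matters for a CTAS of this form; without
-- PARTITION BY it should have no effect (per DRILL-3381).
CREATE TABLE dfs.tmp.`daily_agg` PARTITION BY (event_day) AS
SELECT event_day, id, SUM(amount) AS total
FROM dfs.tmp.`src`
GROUP BY event_day, id;
```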
> >>
> >> btw: the long-error change was part of DRILL-2958 [1]
> >>
> >> [1] https://issues.apache.org/jira/browse/DRILL-2958
> >>
> >> On Thu, Jun 23, 2016 at 11:13 AM, John Omernik <[email protected]> wrote:
> >> > Unfortunately, the 14 MB error message contains too much proprietary
> >> > information for me to post to a JIRA, and the query itself may also be
> >> > a bit too revealing. This is Drill 1.6, so maybe the issue isn't fixed
> >> > in my version? Do you know the original JIRA for the really long error?
> >> >
> >> > On Thu, Jun 23, 2016 at 12:56 PM, Jinfeng Ni <[email protected]> wrote:
> >> >
> >> >> This "CannotPlanException" definitely is a bug in the query planner. I
> >> >> thought we had put in code to show that extremely long error msg "only"
> >> >> in debug mode. Looks like that's not the case.
> >> >>
> >> >> Could you please open a JIRA and post your query, if possible? Thx.
> >> >>
> >> >> On Thu, Jun 23, 2016 at 10:45 AM, John Omernik <[email protected]> wrote:
> >> >> > Jinfeng -
> >> >> >
> >> >> > I wrote my item prior to reading yours. Just an FYI: when I ran with
> >> >> > that setting, I got a "CannotPlanException" with an error that is
> >> >> > easily the longest "non-verbose" error (heck, this beats all the
> >> >> > verbose errors I've had) I've ever seen. I'd post it here, but I am
> >> >> > not sure my Google account has enough storage to handle this
> >> >> > message....
> >> >> >
> >> >> > (kidding... sorta)
> >> >> >
> >> >> > John
> >> >> >
> >> >> > On Thu, Jun 23, 2016 at 12:37 PM, Jinfeng Ni <[email protected]> wrote:
> >> >> >
> >> >> >> Do you partition by day in your CTAS? If that's the case, CTAS will
> >> >> >> produce at least one parquet file for each value of "day". If you
> >> >> >> have 100 days, then you will end up with at least 100 files.
> >> >> >> However, in case the query is executed in distributed mode, there
> >> >> >> could be more than one file per value.
> >> >> >>
> >> >> >> In order to get one and only one parquet file for each partition
> >> >> >> value, turn on this option:
> >> >> >>
> >> >> >> alter session set `store.partition.hash_distribute` = true;
> >> >> >>
> >> >> >> On Thu, Jun 23, 2016 at 10:26 AM, Jason Altekruse <[email protected]> wrote:
> >> >> >> > Apply a sort in your CTAS; this will force the data down to a
> >> >> >> > single stream before writing.
> >> >> >> >
> >> >> >> > Jason Altekruse
> >> >> >> > Software Engineer at Dremio
> >> >> >> > Apache Drill Committer
> >> >> >> >
> >> >> >> > On Thu, Jun 23, 2016 at 10:23 AM, John Omernik <[email protected]> wrote:
> >> >> >> >
> >> >> >> >> When I have a small query writing smaller data (like aggregate
> >> >> >> >> tables for faster aggregates for dashboards, etc.), it appears to
> >> >> >> >> write a ton of small files. Not sure why; maybe it's just how the
> >> >> >> >> join worked out, etc. I have a "day" that is 1.5 MB in total size,
> >> >> >> >> but 400 files total. This seems excessive.
> >> >> >> >>
> >> >> >> >> While I don't have the "small files" issues because I run MapR-FS,
> >> >> >> >> having 400 files that make up 1.5 MB of total data kills me in the
> >> >> >> >> planning phase. How can I get Drill, when doing a CTAS, to go
> >> >> >> >> through a round of consolidation on the parquet files?
> >> >> >> >>
> >> >> >> >> Thanks
> >> >> >> >>
> >> >> >> >> John
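(The sort suggested in the thread can be sketched against John's own query shape. This is a hedged sketch, not a confirmed fix: the target table name is hypothetical, and the query reuses the placeholder names `table1`, `table2`, `field1`, `field2`, and `somefield` from John's paraphrase above.)

```sql
-- Adding an ORDER BY collapses the distributed plan down to a single
-- stream before the parquet writer, so the CTAS emits far fewer files
-- (typically one) instead of hundreds of tiny ones.
CREATE TABLE dfs.tmp.`agg_2015_05_05` AS
SELECT a.field1, a.field2, SUM(b.somefield) AS new_thing
FROM table1 a LEFT JOIN table2 b ON a.id = b.id
WHERE a.field1 = '2015-05-05'
GROUP BY a.field1, a.field2
ORDER BY a.field1;
```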
