It's basically a two-level grouping with a LEFT JOIN

so

select a.field1, a.field2, sum(b.somefield) as new_thing
from table1 a LEFT JOIN table2 b on a.id = b.id
where a.field1 = '2015-05-05'
group by a.field1, a.field2
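In CTAS form it's roughly this (a sketch only; `dfs.tmp.agg` and all table/column names are placeholders, not the real ones, and I'm assuming the partition column is the date field):

```sql
-- Sketch of the failing CTAS with hash_distribute enabled.
-- dfs.tmp.agg and every table/column name here is a placeholder.
alter session set `store.partition.hash_distribute` = true;

create table dfs.tmp.agg partition by (field1) as
select a.field1, a.field2, sum(b.somefield) as new_thing
from table1 a left join table2 b on a.id = b.id
where a.field1 = '2015-05-05'
group by a.field1, a.field2;
```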

It's not a very complicated query, but it doesn't like the hash_distribute
:)
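
For reference, the sort workaround Jason suggested further down the thread would look something like this applied here (again with placeholder names; the idea is that the sort forces the data down to a single stream before the writer):

```sql
-- Workaround sketch: adding ORDER BY to the CTAS collapses the data to
-- one stream before writing, so each partition value should get a
-- single file. Table and column names are placeholders.
create table dfs.tmp.agg partition by (field1) as
select a.field1, a.field2, sum(b.somefield) as new_thing
from table1 a left join table2 b on a.id = b.id
where a.field1 = '2015-05-05'
group by a.field1, a.field2
order by a.field1;
```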

On Thu, Jun 23, 2016 at 1:46 PM, Jinfeng Ni <[email protected]> wrote:

> I looked at the code. 1) Drill did log this long CanNotPlan msg at
> error level. 2) It was replaced with a much shorter version of the msg
> only when the CanNotPlan was caused by a cartesian join.
>
> I guess your query probably did not have a cartesian join, and the
> CanNotPlan was caused by something else. Either way, I think Drill
> should not display such a verbose error msg (which is essentially the
> planner's internal state).
>
> I actually do not need the error message; the query itself would be
> good enough in many cases. If the query has sensitive information in
> column names, you can replace the sensitive column names with
> arbitrary names and would still hit the same issue, if your data is
> from a schema-on-read source (parquet / csv etc.). Drill's planner
> does not have schema info; changing column names across the query
> would not impact the planner's behavior.
>
> btw: the long error change was part of DRILL-2958 [1]
>
>
> [1] https://issues.apache.org/jira/browse/DRILL-2958
>
>
>
> On Thu, Jun 23, 2016 at 11:13 AM, John Omernik <[email protected]> wrote:
> > Unfortunately, the 14 MB error message contains too much proprietary
> > information for me to post to a JIRA, and the query itself may also
> > be a bit too revealing. This is Drill 1.6, so maybe the issue isn't
> > fixed in my version? Do you know the original JIRA for the really
> > long error?
> >
> > On Thu, Jun 23, 2016 at 12:56 PM, Jinfeng Ni <[email protected]>
> wrote:
> >
> >> This "CannotPlanException" definitely is a bug in the query
> >> planner. I thought we had put in code to show that extremely long
> >> error msg "only" in debug mode. Looks like that's not the case.
> >>
> >> Could you please open a JIRA and post your query, if possible? thx.
> >>
> >> On Thu, Jun 23, 2016 at 10:45 AM, John Omernik <[email protected]>
> wrote:
> >> > Jinfeng -
> >> >
> >> > I wrote my item prior to reading yours. Just an FYI: when I ran
> >> > with that setting, I got a "CannotPlanException" with an error
> >> > that is easily the longest "non-verbose" one I've ever seen
> >> > (heck, this beats all the verbose errors I've had). I'd post it
> >> > here, but I'm not sure my Google account has enough storage to
> >> > handle this message....
> >> >
> >> > (kidding... sorta)
> >> >
> >> > John
> >> >
> >> >
> >> >
> >> > On Thu, Jun 23, 2016 at 12:37 PM, Jinfeng Ni <[email protected]>
> >> wrote:
> >> >
> >> >> Do you partition by day in your CTAS? If that's the case, CTAS
> >> >> will produce at least one parquet file for each value of "day".
> >> >> If you have 100 days, then you will end up with at least 100
> >> >> files. However, if the query is executed in distributed mode,
> >> >> there could be more than one file per value.
> >> >>
> >> >> In order to get one and only one parquet file for each partition
> >> >> value, turn on this option:
> >> >>
> >> >> alter session set `store.partition.hash_distribute` = true;
> >> >>
> >> >>
> >> >>
> >> >> On Thu, Jun 23, 2016 at 10:26 AM, Jason Altekruse <[email protected]>
> >> >> wrote:
> >> >> > Apply a sort in your CTAS; this will force the data down to a
> >> >> > single stream before writing.
> >> >> >
> >> >> > Jason Altekruse
> >> >> > Software Engineer at Dremio
> >> >> > Apache Drill Committer
> >> >> >
> >> >> > On Thu, Jun 23, 2016 at 10:23 AM, John Omernik <[email protected]>
> >> wrote:
> >> >> >
> >> >> >> I have a small query writing smaller data (like aggregate
> >> >> >> tables for faster aggregates for dashboards, etc.). It appears
> >> >> >> to write a ton of small files. Not sure why; maybe it's just
> >> >> >> how the join worked out. I have a "day" that is 1.5 MB in
> >> >> >> total size, but 400 files total. This seems excessive.
> >> >> >>
> >> >> >> While I don't have the "small files" issue because I run
> >> >> >> MapR-FS, having 400 files that make up 1.5 MB of total data
> >> >> >> kills me in the planning phase. How can I get Drill, when
> >> >> >> doing a CTAS, to go through a round of consolidation on the
> >> >> >> parquet files?
> >> >> >>
> >> >> >> Thanks
> >> >> >>
> >> >> >> John
> >> >> >>
> >> >>
> >>
>
