Ah, yes, I am not using the PARTITION BY; I am using directory partitions, because I have ongoing days to load (and intraday loads).
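(For reference, the directory-partition approach mentioned above can be sketched as follows. This is a hedged example, not from the thread itself: the workspace `dfs.tmp` and the names `daily_agg`, `src`, `event_day`, `id`, and `amount` are all hypothetical.)

```sql
USE dfs.tmp;

-- Directory partitioning: run one CTAS per load, writing into a
-- date-named subdirectory instead of using PARTITION BY.
CREATE TABLE `daily_agg/2015-05-05` AS
SELECT id, SUM(amount) AS total
FROM `src`
WHERE event_day = '2015-05-05'
GROUP BY id;

-- Later queries can prune by directory via the dir0 pseudo-column:
SELECT * FROM `daily_agg` WHERE dir0 = '2015-05-05';
```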
Thanks

On Fri, Jun 24, 2016 at 12:57 AM, Jinfeng Ni <[email protected]> wrote:

> This hash_distribute option should only matter when you have CTAS
> "partition by". If you do not partition in CTAS, there should be
> no impact at all (in theory). Essentially, this option
> re-distributes the data according to the partition key before Drill
> writes to the target tables. See DRILL-3381 [1].
>
> [1] https://issues.apache.org/jira/browse/DRILL-3381
>
> On Thu, Jun 23, 2016 at 11:59 AM, John Omernik <[email protected]> wrote:
> > It's basically a two-level grouping that has a LEFT JOIN, so:
> >
> > select a.field1, a.field2, sum(b.somefield) as new_thing
> > from table1 a LEFT JOIN table2 b on a.id = b.id
> > where a.field1 = '2015-05-05'
> > group by a.field1, a.field2
> >
> > It's not a very complicated query, but it doesn't like the
> > hash_distribute :)
> >
> > On Thu, Jun 23, 2016 at 1:46 PM, Jinfeng Ni <[email protected]> wrote:
> >
> >> I looked at the code. 1) Drill did log this long CannotPlan msg at
> >> error level. 2) It was replaced with a much shorter msg only
> >> when CannotPlan was caused by a cartesian join.
> >>
> >> I guess your query probably did not have a cartesian join, and
> >> CannotPlan was caused by other reasons. Either way, I think Drill
> >> should not display such a verbose error msg (which is essentially the
> >> planner's internal state).
> >>
> >> I actually do not need the error message. The query itself would be
> >> good enough in many cases. If the query has sensitive information in
> >> column names, you can replace the sensitive column names with arbitrary
> >> names and would still hit the same issue, if your data is from a
> >> schema-on-read source (parquet / csv etc.). Drill's planner does not
> >> have schema info; changing column names across the query would not
> >> affect the planner's behavior.
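(The distinction drawn above can be sketched as follows: `store.partition.hash_distribute` only takes effect when the CTAS itself uses PARTITION BY. This is an illustrative sketch; the table and column names `dfs.tmp.daily_agg`, `src`, `event_day`, `id`, and `amount` are hypothetical, not from the thread.)

```sql
-- Re-distribute rows on the partition key before the parquet writer
-- runs, so each partition value lands in as few files as possible.
ALTER SESSION SET `store.partition.hash_distribute` = true;

-- The option only matters for a CTAS of this form; without
-- PARTITION BY it should have no effect (per DRILL-3381).
CREATE TABLE dfs.tmp.`daily_agg` PARTITION BY (event_day) AS
SELECT event_day, id, SUM(amount) AS total
FROM dfs.tmp.`src`
GROUP BY event_day, id;
```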
> >>
> >> btw: the long-error change was part of DRILL-2958 [1]
> >>
> >> [1] https://issues.apache.org/jira/browse/DRILL-2958
> >>
> >> On Thu, Jun 23, 2016 at 11:13 AM, John Omernik <[email protected]> wrote:
> >> > Unfortunately, the 14 MB error message contains too much proprietary
> >> > information for me to post to a JIRA, and the query itself may also be
> >> > a bit too revealing. This is Drill 1.6, so maybe the issue isn't fixed
> >> > in my version? Do you know the original JIRA for the really long error?
> >> >
> >> > On Thu, Jun 23, 2016 at 12:56 PM, Jinfeng Ni <[email protected]> wrote:
> >> >
> >> >> This "CannotPlanException" definitely is a bug in the query planner. I
> >> >> thought we had put in code to show that extremely long error msg "only"
> >> >> in debug mode. Looks like that's not the case.
> >> >>
> >> >> Could you please open a JIRA and post your query, if possible? Thx.
> >> >>
> >> >> On Thu, Jun 23, 2016 at 10:45 AM, John Omernik <[email protected]> wrote:
> >> >> > Jinfeng -
> >> >> >
> >> >> > I wrote my item prior to reading yours. Just an FYI: when I ran with
> >> >> > that setting, I got a "CannotPlanException" with an error that is
> >> >> > easily the longest "non-verbose" error (heck, this beats all the
> >> >> > verbose errors I've had) I've ever seen. I'd post it here, but I am
> >> >> > not sure my Google account has enough storage to handle this
> >> >> > message....
> >> >> >
> >> >> > (kidding... sorta)
> >> >> >
> >> >> > John
> >> >> >
> >> >> > On Thu, Jun 23, 2016 at 12:37 PM, Jinfeng Ni <[email protected]> wrote:
> >> >> >
> >> >> >> Do you partition by day in your CTAS? If that's the case, CTAS will
> >> >> >> produce at least one parquet file for each value of "day". If you
> >> >> >> have 100 days, then you will end up with at least 100 files.
> >> >> >> However, in case the query is executed in distributed mode, there
> >> >> >> could be more than one file per value.
> >> >> >>
> >> >> >> In order to get one and only one parquet file for each partition
> >> >> >> value, turn on this option:
> >> >> >>
> >> >> >> alter session set `store.partition.hash_distribute` = true;
> >> >> >>
> >> >> >> On Thu, Jun 23, 2016 at 10:26 AM, Jason Altekruse <[email protected]> wrote:
> >> >> >> > Apply a sort in your CTAS; this will force the data down to a
> >> >> >> > single stream before writing.
> >> >> >> >
> >> >> >> > Jason Altekruse
> >> >> >> > Software Engineer at Dremio
> >> >> >> > Apache Drill Committer
> >> >> >> >
> >> >> >> > On Thu, Jun 23, 2016 at 10:23 AM, John Omernik <[email protected]> wrote:
> >> >> >> >
> >> >> >> >> When I have a small query writing smaller data (like aggregate
> >> >> >> >> tables for faster aggregates for dashboards, etc.), it appears to
> >> >> >> >> write a ton of small files. Not sure why; maybe it's just how the
> >> >> >> >> join worked out, etc. I have a "day" that is 1.5 MB in total size,
> >> >> >> >> but 400 files total. This seems excessive.
> >> >> >> >>
> >> >> >> >> While I don't have the "small files" issues because I run MapR-FS,
> >> >> >> >> having 400 files that make up 1.5 MB of total data kills me in the
> >> >> >> >> planning phase. How can I get Drill, when doing a CTAS, to go
> >> >> >> >> through a round of consolidation on the parquet files?
> >> >> >> >>
> >> >> >> >> Thanks
> >> >> >> >>
> >> >> >> >> John
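(The sort suggested in the thread can be sketched against John's own query shape. This is a hedged sketch, not a confirmed fix: the target table name is hypothetical, and the query reuses the placeholder names `table1`, `table2`, `field1`, `field2`, and `somefield` from John's paraphrase above.)

```sql
-- Adding an ORDER BY collapses the distributed plan down to a single
-- stream before the parquet writer, so the CTAS emits far fewer files
-- (typically one) instead of hundreds of tiny ones.
CREATE TABLE dfs.tmp.`agg_2015_05_05` AS
SELECT a.field1, a.field2, SUM(b.somefield) AS new_thing
FROM table1 a LEFT JOIN table2 b ON a.id = b.id
WHERE a.field1 = '2015-05-05'
GROUP BY a.field1, a.field2
ORDER BY a.field1;
```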
