I looked at the code. 1) Drill did log this long CannotPlan message at
error level. 2) It was replaced with a much shorter message only when
the CannotPlanException was caused by a cartesian join.

I guess your query probably did not have a cartesian join, and the
CannotPlanException was caused by something else. Either way, I think
Drill should not display such a verbose error message (which is
essentially the planner's internal state).

I actually do not need the error message; the query itself would be
good enough in many cases. If the query has sensitive information in
column names, you can replace the sensitive column names with
arbitrary ones and you would still hit the same issue, as long as your
data comes from a schema-on-read source (parquet / csv, etc.). Drill's
planner does not have schema info, so changing column names across the
query would not affect the planner's behavior.
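
For example, if a query like this one hits the issue (all names here
are made up):

select revenue_secret, region_secret
from dfs.tmp.`t1`
where region_secret = 'x';

renaming just the columns should reproduce the same failure:

select col_a, col_b
from dfs.tmp.`t1`
where col_b = 'x';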

BTW: the change to the long error message was part of DRILL-2958 [1].


[1] https://issues.apache.org/jira/browse/DRILL-2958



On Thu, Jun 23, 2016 at 11:13 AM, John Omernik <j...@omernik.com> wrote:
> Unfortunately, the 14 MB error message contains too much proprietary
> information for me to post to a JIRA, and the query itself may also be a
> bit too revealing. This is Drill 1.6, so maybe the issue isn't fixed in my
> version? Do you know the original JIRA for the really long error?
>
> On Thu, Jun 23, 2016 at 12:56 PM, Jinfeng Ni <jinfengn...@gmail.com> wrote:
>
>> This "CannotPlanException" definitely is a bug in the query planner. I
>> thought we had put in code to show that extremely long error message
>> "only" in debug mode. Looks like that's not the case.
>>
>> Could you please open a JIRA and post your query, if possible? thx.
>>
>> On Thu, Jun 23, 2016 at 10:45 AM, John Omernik <j...@omernik.com> wrote:
>> > Jinfeng -
>> >
>> > I wrote my item prior to reading yours. Just an FYI: when I ran with that
>> > setting, I got a "CannotPlanException" with an error that is easily the
>> > longest "non-verbose" error I've ever seen (heck, this beats all the
>> > verbose errors I've had). I'd post it here, but I am not sure my Google
>> > account has enough storage to handle this message....
>> >
>> > (kidding... sorta)
>> >
>> > John
>> >
>> >
>> >
>> > On Thu, Jun 23, 2016 at 12:37 PM, Jinfeng Ni <jinfengn...@gmail.com> wrote:
>> >
>> >> Do you partition by day in your CTAS? If that's the case, CTAS will
>> >> produce at least one parquet file for each value of "day". If you
>> >> have 100 days, then you will end up with at least 100 files. However,
>> >> if the query is executed in distributed mode, there could be more
>> >> than one file per value.
>> >>
>> >> In order to get one and only one parquet file for each partition
>> >> value, turn on this option:
>> >>
>> >> alter session set `store.partition.hash_distribute` = true;
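>> >>
>> >> Combined with a partitioned CTAS, it would look something like this
>> >> (sketch only; the table and column names are made up):
>> >>
>> >> alter session set `store.partition.hash_distribute` = true;
>> >>
>> >> create table dfs.tmp.`daily_agg`
>> >> partition by (`day`)
>> >> as
>> >> select `day`, dashboard, sum(metric) as total
>> >> from dfs.tmp.`events`
>> >> group by `day`, dashboard;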
>> >>
>> >>
>> >>
>> >> On Thu, Jun 23, 2016 at 10:26 AM, Jason Altekruse <ja...@dremio.com>
>> >> wrote:
>> >> > Apply a sort in your CTAS; this will force the data down to a single
>> >> > stream before writing.
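>> >> >
>> >> > For example (illustrative only; the names are made up):
>> >> >
>> >> > create table dfs.tmp.`daily_agg` as
>> >> > select `day`, dashboard, metric
>> >> > from dfs.tmp.`events`
>> >> > order by `day`;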
>> >> >
>> >> > Jason Altekruse
>> >> > Software Engineer at Dremio
>> >> > Apache Drill Committer
>> >> >
>> >> > On Thu, Jun 23, 2016 at 10:23 AM, John Omernik <j...@omernik.com> wrote:
>> >> >
>> >> >> I have a small query writing smaller data (like aggregate tables for
>> >> >> faster aggregates for dashboards, etc.). It appears to write a ton of
>> >> >> small files. Not sure why; maybe it's just how the join worked out,
>> >> >> etc. I have a "day" that is 1.5 MB in total size, but 400 files
>> >> >> total. This seems excessive.
>> >> >>
>> >> >> While I don't have the "small files" issue because I run MapR-FS,
>> >> >> having 400 files that make up 1.5 MB of total data kills me in the
>> >> >> planning phase. How can I get Drill, when doing a CTAS, to go through
>> >> >> a round of consolidation on the parquet files?
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >> John
>> >> >>
>> >>
>>
