Understood,..
I can see Jacques and Jinfeng's point of view.

I would not bother think about this query .... but some claim that this
query ran orders of magnitude faster on some other sql engine...

I don't know how.. they got that performance from this query..

Sungwook



On Tue, Aug 25, 2015 at 10:04 AM, Jacques Nadeau <[email protected]> wrote:

> CONVERT_FROMUTF8 is a trivial operation.  I would be surprised if that was
> the bottleneck.  See here [1] for the implementation.  I'm guessing that
> the problem is simply that the best alternative on this structure would be
> a non-equality join.  The way I look at this query is you have data set
> like the following:
>
> dx1  dx2  dx3  dx4  dx5
> #    #    #    #    #
> #    #    #    #    #
> #    #    #    #    #
> #    #    #    #    #
> #    #    #    #    #
>
> And you are doing a join:
>
> select * from baseTable t
> join dxTable dx on (dx1 = t.c1 OR dx2 = t.c2 OR dx3 = t.c3 OR dx4 = t.c4 OR
> dx5 = t.c5)
>
> That isn't a pretty algorithm in general.  I'm guessing that the best
> opportunity is to convert the data to something where you can start taking
> advantage of hashing.  For example, put all the values in a single list
> with each base table row key.  For each value, prefix by column id.  Then
> do a join on the new compound key and then get to distinct rows per base
> table row key.  It isn't pretty but it would be much better than the
> current approach.
>
>
> [1]
>
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/conv/UTF8ConvertFrom.java
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Aug 25, 2015 at 7:48 AM, Jinfeng Ni <[email protected]> wrote:
>
> > I looked at the two physical plans you posted. Seems they are different :
> > the first one has multiple CONVERT_FROMUTF8() call, while the second does
> > not have.
> >
> > 1.
> >
> > Filter(condition=[AND(>=(CAST($0):INTEGER, 2009), <=(CAST($0):INTEGER,
> > 2013), OR(=(CONVERT_FROMUTF8($1), '39891'), =(CONVERT_FROMUTF8($1),
> > '4280'), =(CONVERT_FROMUTF8($1), '4281'), =(CONVERT_FROMUTF8($1),
> '42820'),
> > =(CONVERT_FROMUTF8($1), '42821'), =(CONVERT_FROMUTF8($1), '42822'),
> > =(CONVERT_FROMUTF8($1), '42823'), =(CONVERT_FROMUTF8($1), '42830'),
> > =(CONVERT_FROMUTF8($1), '42831'), =(CONVERT_FROMUTF8($1), '42832'),
> > =(CONVERT_FROMUTF8($1), '42833'), =(CONVERT_FROMUTF8($1), '42840'),
> > =(CONVERT_FROMUTF8($1), '42841'), =(CONVERT_FROMUTF8($1), '42842'),
> > =(CONVERT_FROMUTF8($1), '42843'), =(CONVERT_FROMUTF8($1), '4289'),
> > =(CONVERT_FROMUTF8($2), '39891'), =(CONVERT_FROMUTF8($2), '4280'),
> > =(CONVERT_FROMUTF8($2), '4281'), =(CONVERT_FROMUTF8($2), '42820'),
> > =(CONVERT_FROMUTF8($2), '42821'), =(CONVERT_FROMUTF8($2), '42822'),
> > =(CONVERT_FROMUTF8($2), '42823'), =(CONVERT_FROMUTF8($2), '42830'),
> > =(CONVERT_FROMUTF8($2), '42831'), =(CONVERT_FROMUTF8($2), '42832'),
> > =(CONVERT_FROMUTF8($2), '42833'), =(CONVERT_FROMUTF8($2), '42840'),
> > =(CONVERT_FROMUTF8($2), '42841'), =(CONVERT_FROMUTF8($2), '42842'),
> > =(CONVERT_FROMUTF8($2), '42843'), =(CONVERT_FROMUTF8($2), '4289')))]) :
> >
> > 2.
> > Filter(condition=[AND(>=($0, 2009), <=($0, 2013),
> > OR(=($1, '39891'), =($1, '4280'), =($1, '4281'), =($1, '42820'), =($1,
> > '42821'), =($1, '42822'), =($1, '42823'), =($1, '42830'), =($1, '42831'),
> > =($1, '42832'), =($1, '42833'), =($1, '42840'), =($1, '42841'), =($1,
> > '42842'), =($1, '42843'), =($1, '4289'), =($2, '39891'), =($2, '4280'),
> > =($2, '4281'), =($2, '42820'), =($2, '42821'), =($2, '42822'), =($2,
> > '42823'), =($2, '42830'), =($2, '42831'), =($2, '42832'), =($2, '42833'),
> > =($2, '42840'), =($2, '42841'), =($2, '42842'), =($2, '42843'), =($2,
> > '4289'), =($3, '39891'), =($3, '4280'), =($3, '4281'), =($3, '42820'),
> > =($3, '42821'), =($3, '42822'), =($3, '42823'), =($3, '42830'), =($3,
> > '42831'), =($3, '42832'), =($3, '42833'), =($3, '42840'), =($3, '42841'),
> > =($3, '42842'), =($3, '42843'), =($3, '4289'), =($4, '39891'), =($4,
> > '4280'), =($4, '4281'), =($4, '42820'), =($4, '42821'), =($4, '42822'),
> > =($4, '42823'), =($4, '42830'), =($4, '42831'), =($4, '42832'), =($4,
> > '42833'), =($4, '42840'), =($4, '42841'), =($4, '42842'), =($4, '42843'),
> > =($4, '4289'), =($5, '39891'), =($5, '4280'), =($5, '4281'), =($5,
> > '42820'), =($5, '42821'), =($5, '42822'), =($5, '42823'), =($5, '42830'),
> > =($5, '42831'), =($5, '42832'), =($5, '42833'), =($5, '42840'), =($5,
> > '42841'), =($5, '42842'), =($5, '42843'), =($5, '4289')))])
> >
> > Can you please post the query PLUS the corresponding plan together?
> >
> > For the first plan, there are room for improvement, since Drill should
> push
> > the common expression : CONVERT_FROMUTF8($2) down into a project, and
> refer
> > the result of expression directly in Filter, which will avoid the
> repeated
> > evaluation in Filter.
> >
> > For the second plan, I do not see room for improvement.
> >
> >
> >
> >
> > On Tue, Aug 25, 2015 at 2:38 AM, Sungwook Yoon <[email protected]>
> wrote:
> >
> > > Aman,
> > >
> > > Sorry for slow reply.
> > >
> > > Here is the SQL query.
> > >
> > > SELECT count(1) FROM `o` a WHERE (a.DX1 IN ('39891', '4280', '4281',
> > > '42820', '42821', '42822', '42823', '42830', '42831', '42832', '42833',
> > > '42840', '42841', '42842', '42843', '4289') OR a.DX2 IN ('39891',
> '4280',
> > > '4281', '42820', '42821', '42822', '42823', '42830', '42831', '42832',
> > > '42833', '42840', '42841', '42842', '42843', '4289') OR a.DX3 IN
> > ('39891',
> > > '4280', '4281', '42820', '42821', '42822', '42823', '42830', '42831',
> > > '42832', '42833', '42840', '42841', '42842', '42843', '4289') OR a.DX4
> IN
> > > ('39891', '4280', '4281', '42820', '42821', '42822', '42823', '42830',
> > > '42831', '42832', '42833', '42840', '42841', '42842', '42843', '4289')
> OR
> > > a.DX5 IN ('39891', '4280', '4281', '42820', '42821', '42822', '42823',
> > > '42830', '42831', '42832', '42833', '42840', '42841', '42842', '42843',
> > > '4289')) AND a.`year` BETWEEN 2009 AND 2013;
> > >
> > > For the view definition, I tried to convert_from to UTF and cast as
> > > Varchar.
> > > For integer type, I casted.
> > > All resulted in similar slow performance.
> > >
> > > Let me know if you have any insights.
> > >
> > > Thanks,
> > >
> > > Sungwook
> > >
> > >
> > > On Mon, Aug 24, 2015 at 7:20 PM, Aman Sinha <[email protected]>
> > wrote:
> > >
> > > > I was about to say that for IN lists of size 20 or more, Drill uses a
> > > more
> > > > efficient Values operator instead of OR conditions but then realized
> > the
> > > OR
> > > > filter is referencing 4 different columns : $1..$4 and each of those
> > > > individual lists is less than 20.  Sungwook,  can you please provide
> > the
> > > > SQL query and any view definitions or anything that goes with it ?
> It
> > is
> > > > difficult to figure out things without the full picture.
> > > > thanks,
> > > > Aman
> > > >
> > > > On Mon, Aug 24, 2015 at 5:10 PM, Ted Dunning <[email protected]>
> > > > wrote:
> > > >
> > > > > On Mon, Aug 24, 2015 at 4:50 PM, Sungwook Yoon <[email protected]
> >
> > > > wrote:
> > > > >
> > > > > > Still, the performance drop down due to OR filtering is just
> > > > > astounding...
> > > > > >
> > > > >
> > > > > That is what query optimizers are for and why getting them to work
> > well
> > > > is
> > > > > important.
> > > > >
> > > > > The difference in performance that you are observing is not
> > surprising
> > > > > given the redundant work that you are seeing. Using the OR operator
> > > > > prevents any significant short-circuiting and the repeated
> conversion
> > > > > operations that are happening make the evaluation much more
> expensive
> > > > than
> > > > > it would otherwise be (a dozen extra copies where only one is
> > needed).
> > > > >
> > > > > Other queries that can be subject to similar problems include
> common
> > > > table
> > > > > expressions that read the same (large) input file many times.  So
> > far,
> > > > > Drill doesn't optimize all such expressions well.
> > > > >
> > > >
> > >
> >
>

Reply via email to