CONVERT_FROMUTF8 is a trivial operation. I would be surprised if that was the bottleneck. See here [1] for the implementation. I'm guessing that the problem is simply that the best alternative on this structure would be a non-equality join. The way I look at this query is you have data set like the following:
dx1 dx2 dx3 dx4 dx5 # # # # # # # # # # # # # # # # # # # # # # # # # And you are doing a join: select * from baseTable t join dxTable dx on (dx1 = t.c1 OR dx2 = t.c2 OR dx3 = t.c3 OR dx4 = t.c4 OR dx5 = t.c5) That isn't a pretty algorithm in general. I'm guessing that the best opportunity is to convert the data to something where you can start taking advantage of hashing. For example, put all the values in a single list with each base table row key. For each value, prefix by column id. Then do a join on the new compound key and then get to distinct rows per base table row key. It isn't pretty but it would be much better than the current approach. [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/conv/UTF8ConvertFrom.java -- Jacques Nadeau CTO and Co-Founder, Dremio On Tue, Aug 25, 2015 at 7:48 AM, Jinfeng Ni <[email protected]> wrote: > I looked at the two physical plans you posted. Seems they are different : > the first one has multiple CONVERT_FROMUTF8() call, while the second does > not have. > > 1. > > Filter(condition=[AND(>=(CAST($0):INTEGER, 2009), <=(CAST($0):INTEGER, > 2013), OR(=(CONVERT_FROMUTF8($1), '39891'), =(CONVERT_FROMUTF8($1), > '4280'), =(CONVERT_FROMUTF8($1), '4281'), =(CONVERT_FROMUTF8($1), '42820'), > =(CONVERT_FROMUTF8($1), '42821'), =(CONVERT_FROMUTF8($1), '42822'), > =(CONVERT_FROMUTF8($1), '42823'), =(CONVERT_FROMUTF8($1), '42830'), > =(CONVERT_FROMUTF8($1), '42831'), =(CONVERT_FROMUTF8($1), '42832'), > =(CONVERT_FROMUTF8($1), '42833'), =(CONVERT_FROMUTF8($1), '42840'), > =(CONVERT_FROMUTF8($1), '42841'), =(CONVERT_FROMUTF8($1), '42842'), > =(CONVERT_FROMUTF8($1), '42843'), =(CONVERT_FROMUTF8($1), '4289'), > =(CONVERT_FROMUTF8($2), '39891'), =(CONVERT_FROMUTF8($2), '4280'), > =(CONVERT_FROMUTF8($2), '4281'), =(CONVERT_FROMUTF8($2), '42820'), > =(CONVERT_FROMUTF8($2), '42821'), =(CONVERT_FROMUTF8($2), '42822'), > =(CONVERT_FROMUTF8($2), '42823'), =(CONVERT_FROMUTF8($2), '42830'), > =(CONVERT_FROMUTF8($2), '42831'), =(CONVERT_FROMUTF8($2), '42832'), > =(CONVERT_FROMUTF8($2), '42833'), =(CONVERT_FROMUTF8($2), '42840'), > =(CONVERT_FROMUTF8($2), '42841'), =(CONVERT_FROMUTF8($2), '42842'), > =(CONVERT_FROMUTF8($2), '42843'), =(CONVERT_FROMUTF8($2), '4289')))]) : > > 2. > Filter(condition=[AND(>=($0, 2009), <=($0, 2013), > OR(=($1, '39891'), =($1, '4280'), =($1, '4281'), =($1, '42820'), =($1, > '42821'), =($1, '42822'), =($1, '42823'), =($1, '42830'), =($1, '42831'), > =($1, '42832'), =($1, '42833'), =($1, '42840'), =($1, '42841'), =($1, > '42842'), =($1, '42843'), =($1, '4289'), =($2, '39891'), =($2, '4280'), > =($2, '4281'), =($2, '42820'), =($2, '42821'), =($2, '42822'), =($2, > '42823'), =($2, '42830'), =($2, '42831'), =($2, '42832'), =($2, '42833'), > =($2, '42840'), =($2, '42841'), =($2, '42842'), =($2, '42843'), =($2, > '4289'), =($3, '39891'), =($3, '4280'), =($3, '4281'), =($3, '42820'), > =($3, '42821'), =($3, '42822'), =($3, '42823'), =($3, '42830'), =($3, > '42831'), =($3, '42832'), =($3, '42833'), =($3, '42840'), =($3, '42841'), > =($3, '42842'), =($3, '42843'), =($3, '4289'), =($4, '39891'), =($4, > '4280'), =($4, '4281'), =($4, '42820'), =($4, '42821'), =($4, '42822'), > =($4, '42823'), =($4, '42830'), =($4, '42831'), =($4, '42832'), =($4, > '42833'), =($4, '42840'), =($4, '42841'), =($4, '42842'), =($4, '42843'), > =($4, '4289'), =($5, '39891'), =($5, '4280'), =($5, '4281'), =($5, > '42820'), =($5, '42821'), =($5, '42822'), =($5, '42823'), =($5, '42830'), > =($5, '42831'), =($5, '42832'), =($5, '42833'), =($5, '42840'), =($5, > '42841'), =($5, '42842'), =($5, '42843'), =($5, '4289')))]) > > Can you please post the query PLUS the corresponding plan together? > > For the first plan, there are room for improvement, since Drill should push > the common expression : CONVERT_FROMUTF8($2) down into a project, and refer > the result of expression directly in Filter, which will avoid the repeated > evaluation in Filter. > > For the second plan, I do not see room for improvement. > > > > > On Tue, Aug 25, 2015 at 2:38 AM, Sungwook Yoon <[email protected]> wrote: > > > Aman, > > > > Sorry for slow reply. > > > > Here is the SQL query. > > > > SELECT count(1) FROM `o` a WHERE (a.DX1 IN ('39891', '4280', '4281', > > '42820', '42821', '42822', '42823', '42830', '42831', '42832', '42833', > > '42840', '42841', '42842', '42843', '4289') OR a.DX2 IN ('39891', '4280', > > '4281', '42820', '42821', '42822', '42823', '42830', '42831', '42832', > > '42833', '42840', '42841', '42842', '42843', '4289') OR a.DX3 IN > ('39891', > > '4280', '4281', '42820', '42821', '42822', '42823', '42830', '42831', > > '42832', '42833', '42840', '42841', '42842', '42843', '4289') OR a.DX4 IN > > ('39891', '4280', '4281', '42820', '42821', '42822', '42823', '42830', > > '42831', '42832', '42833', '42840', '42841', '42842', '42843', '4289') OR > > a.DX5 IN ('39891', '4280', '4281', '42820', '42821', '42822', '42823', > > '42830', '42831', '42832', '42833', '42840', '42841', '42842', '42843', > > '4289')) AND a.`year` BETWEEN 2009 AND 2013; > > > > For the view definition, I tried to convert_from to UTF and cast as > > Varchar. > > For integer type, I casted. > > All resulted in similar slow performance. > > > > Let me know if you have any insights. > > > > Thanks, > > > > Sungwook > > > > > > On Mon, Aug 24, 2015 at 7:20 PM, Aman Sinha <[email protected]> > wrote: > > > > > I was about to say that for IN lists of size 20 or more, Drill uses a > > more > > > efficient Values operator instead of OR conditions but then realized > the > > OR > > > filter is referencing 4 different columns : $1..$4 and each of those > > > individual lists is less than 20. Sungwook, can you please provide > the > > > SQL query and any view definitions or anything that goes with it ? It > is > > > difficult to figure out things without the full picture. > > > thanks, > > > Aman > > > > > > On Mon, Aug 24, 2015 at 5:10 PM, Ted Dunning <[email protected]> > > > wrote: > > > > > > > On Mon, Aug 24, 2015 at 4:50 PM, Sungwook Yoon <[email protected]> > > > wrote: > > > > > > > > > Still, the performance drop down due to OR filtering is just > > > > astounding... > > > > > > > > > > > > > That is what query optimizers are for and why getting them to work > well > > > is > > > > important. > > > > > > > > The difference in performance that you are observing is not > surprising > > > > given the redundant work that you are seeing. Using the OR operator > > > > prevents any significant short-circuiting and the repeated conversion > > > > operations that are happening make the evaluation much more expensive > > > than > > > > it would otherwise be (a dozen extra copies where only one is > needed). > > > > > > > > Other queries that can be subject to similar problems include common > > > table > > > > expressions that read the same (large) input file many times. So > far, > > > > Drill doesn't optimize all such expressions well. > > > > > > > > > >
