I still think a profile analysis would be good. Can you post the following on gist:
- profile for this query - source code for the slow filter operation (turn on debug logging to get this) thanks, Jacques -- Jacques Nadeau CTO and Co-Founder, Dremio On Tue, Aug 25, 2015 at 8:16 AM, Sungwook Yoon <[email protected]> wrote: > Understood,.. > I can see Jacques and Jinfeng's point of view. > > I would not bother think about this query .... but some claim that this > query ran orders of magnitude faster on some other sql engine... > > I don't know how.. they got that performance from this query.. > > Sungwook > > > > On Tue, Aug 25, 2015 at 10:04 AM, Jacques Nadeau <[email protected]> > wrote: > > > CONVERT_FROMUTF8 is a trivial operation. I would be surprised if that > was > > the bottleneck. See here [1] for the implementation. I'm guessing that > > the problem is simply that the best alternative on this structure would > be > > a non-equality join. The way I look at this query is you have data set > > like the following: > > > > dx1 dx2 dx3 dx4 dx5 > > # # # # # > > # # # # # > > # # # # # > > # # # # # > > # # # # # > > > > And you are doing a join: > > > > select * from baseTable t > > join dxTable dx on (dx1 = t.c1 OR dx2 = t.c2 OR dx3 = t.c3 OR dx4 = t.c4 > OR > > dx5 = t.c5) > > > > That isn't a pretty algorithm in general. I'm guessing that the best > > opportunity is to convert the data to something where you can start > taking > > advantage of hashing. For example, put all the values in a single list > > with each base table row key. For each value, prefix by column id. Then > > do a join on the new compound key and then get to distinct rows per base > > table row key. It isn't pretty but it would be much better than the > > current approach. > > > > > > [1] > > > > > https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/conv/UTF8ConvertFrom.java > > > > > > -- > > Jacques Nadeau > > CTO and Co-Founder, Dremio > > > > On Tue, Aug 25, 2015 at 7:48 AM, Jinfeng Ni <[email protected]> > wrote: > > > > > I looked at the two physical plans you posted. Seems they are > different : > > > the first one has multiple CONVERT_FROMUTF8() call, while the second > does > > > not have. > > > > > > 1. > > > > > > Filter(condition=[AND(>=(CAST($0):INTEGER, 2009), <=(CAST($0):INTEGER, > > > 2013), OR(=(CONVERT_FROMUTF8($1), '39891'), =(CONVERT_FROMUTF8($1), > > > '4280'), =(CONVERT_FROMUTF8($1), '4281'), =(CONVERT_FROMUTF8($1), > > '42820'), > > > =(CONVERT_FROMUTF8($1), '42821'), =(CONVERT_FROMUTF8($1), '42822'), > > > =(CONVERT_FROMUTF8($1), '42823'), =(CONVERT_FROMUTF8($1), '42830'), > > > =(CONVERT_FROMUTF8($1), '42831'), =(CONVERT_FROMUTF8($1), '42832'), > > > =(CONVERT_FROMUTF8($1), '42833'), =(CONVERT_FROMUTF8($1), '42840'), > > > =(CONVERT_FROMUTF8($1), '42841'), =(CONVERT_FROMUTF8($1), '42842'), > > > =(CONVERT_FROMUTF8($1), '42843'), =(CONVERT_FROMUTF8($1), '4289'), > > > =(CONVERT_FROMUTF8($2), '39891'), =(CONVERT_FROMUTF8($2), '4280'), > > > =(CONVERT_FROMUTF8($2), '4281'), =(CONVERT_FROMUTF8($2), '42820'), > > > =(CONVERT_FROMUTF8($2), '42821'), =(CONVERT_FROMUTF8($2), '42822'), > > > =(CONVERT_FROMUTF8($2), '42823'), =(CONVERT_FROMUTF8($2), '42830'), > > > =(CONVERT_FROMUTF8($2), '42831'), =(CONVERT_FROMUTF8($2), '42832'), > > > =(CONVERT_FROMUTF8($2), '42833'), =(CONVERT_FROMUTF8($2), '42840'), > > > =(CONVERT_FROMUTF8($2), '42841'), =(CONVERT_FROMUTF8($2), '42842'), > > > =(CONVERT_FROMUTF8($2), '42843'), =(CONVERT_FROMUTF8($2), '4289')))]) : > > > > > > 2. > > > Filter(condition=[AND(>=($0, 2009), <=($0, 2013), > > > OR(=($1, '39891'), =($1, '4280'), =($1, '4281'), =($1, '42820'), =($1, > > > '42821'), =($1, '42822'), =($1, '42823'), =($1, '42830'), =($1, > '42831'), > > > =($1, '42832'), =($1, '42833'), =($1, '42840'), =($1, '42841'), =($1, > > > '42842'), =($1, '42843'), =($1, '4289'), =($2, '39891'), =($2, '4280'), > > > =($2, '4281'), =($2, '42820'), =($2, '42821'), =($2, '42822'), =($2, > > > '42823'), =($2, '42830'), =($2, '42831'), =($2, '42832'), =($2, > '42833'), > > > =($2, '42840'), =($2, '42841'), =($2, '42842'), =($2, '42843'), =($2, > > > '4289'), =($3, '39891'), =($3, '4280'), =($3, '4281'), =($3, '42820'), > > > =($3, '42821'), =($3, '42822'), =($3, '42823'), =($3, '42830'), =($3, > > > '42831'), =($3, '42832'), =($3, '42833'), =($3, '42840'), =($3, > '42841'), > > > =($3, '42842'), =($3, '42843'), =($3, '4289'), =($4, '39891'), =($4, > > > '4280'), =($4, '4281'), =($4, '42820'), =($4, '42821'), =($4, '42822'), > > > =($4, '42823'), =($4, '42830'), =($4, '42831'), =($4, '42832'), =($4, > > > '42833'), =($4, '42840'), =($4, '42841'), =($4, '42842'), =($4, > '42843'), > > > =($4, '4289'), =($5, '39891'), =($5, '4280'), =($5, '4281'), =($5, > > > '42820'), =($5, '42821'), =($5, '42822'), =($5, '42823'), =($5, > '42830'), > > > =($5, '42831'), =($5, '42832'), =($5, '42833'), =($5, '42840'), =($5, > > > '42841'), =($5, '42842'), =($5, '42843'), =($5, '4289')))]) > > > > > > Can you please post the query PLUS the corresponding plan together? > > > > > > For the first plan, there are room for improvement, since Drill should > > push > > > the common expression : CONVERT_FROMUTF8($2) down into a project, and > > refer > > > the result of expression directly in Filter, which will avoid the > > repeated > > > evaluation in Filter. > > > > > > For the second plan, I do not see room for improvement. > > > > > > > > > > > > > > > On Tue, Aug 25, 2015 at 2:38 AM, Sungwook Yoon <[email protected]> > > wrote: > > > > > > > Aman, > > > > > > > > Sorry for slow reply. > > > > > > > > Here is the SQL query. > > > > > > > > SELECT count(1) FROM `o` a WHERE (a.DX1 IN ('39891', '4280', '4281', > > > > '42820', '42821', '42822', '42823', '42830', '42831', '42832', > '42833', > > > > '42840', '42841', '42842', '42843', '4289') OR a.DX2 IN ('39891', > > '4280', > > > > '4281', '42820', '42821', '42822', '42823', '42830', '42831', > '42832', > > > > '42833', '42840', '42841', '42842', '42843', '4289') OR a.DX3 IN > > > ('39891', > > > > '4280', '4281', '42820', '42821', '42822', '42823', '42830', '42831', > > > > '42832', '42833', '42840', '42841', '42842', '42843', '4289') OR > a.DX4 > > IN > > > > ('39891', '4280', '4281', '42820', '42821', '42822', '42823', > '42830', > > > > '42831', '42832', '42833', '42840', '42841', '42842', '42843', > '4289') > > OR > > > > a.DX5 IN ('39891', '4280', '4281', '42820', '42821', '42822', > '42823', > > > > '42830', '42831', '42832', '42833', '42840', '42841', '42842', > '42843', > > > > '4289')) AND a.`year` BETWEEN 2009 AND 2013; > > > > > > > > For the view definition, I tried to convert_from to UTF and cast as > > > > Varchar. > > > > For integer type, I casted. > > > > All resulted in similar slow performance. > > > > > > > > Let me know if you have any insights. > > > > > > > > Thanks, > > > > > > > > Sungwook > > > > > > > > > > > > On Mon, Aug 24, 2015 at 7:20 PM, Aman Sinha <[email protected]> > > > wrote: > > > > > > > > > I was about to say that for IN lists of size 20 or more, Drill > uses a > > > > more > > > > > efficient Values operator instead of OR conditions but then > realized > > > the > > > > OR > > > > > filter is referencing 4 different columns : $1..$4 and each of > those > > > > > individual lists is less than 20. Sungwook, can you please > provide > > > the > > > > > SQL query and any view definitions or anything that goes with it ? > > It > > > is > > > > > difficult to figure out things without the full picture. > > > > > thanks, > > > > > Aman > > > > > > > > > > On Mon, Aug 24, 2015 at 5:10 PM, Ted Dunning < > [email protected]> > > > > > wrote: > > > > > > > > > > > On Mon, Aug 24, 2015 at 4:50 PM, Sungwook Yoon < > [email protected] > > > > > > > > wrote: > > > > > > > > > > > > > Still, the performance drop down due to OR filtering is just > > > > > > astounding... > > > > > > > > > > > > > > > > > > > That is what query optimizers are for and why getting them to > work > > > well > > > > > is > > > > > > important. > > > > > > > > > > > > The difference in performance that you are observing is not > > > surprising > > > > > > given the redundant work that you are seeing. Using the OR > operator > > > > > > prevents any significant short-circuiting and the repeated > > conversion > > > > > > operations that are happening make the evaluation much more > > expensive > > > > > than > > > > > > it would otherwise be (a dozen extra copies where only one is > > > needed). > > > > > > > > > > > > Other queries that can be subject to similar problems include > > common > > > > > table > > > > > > expressions that read the same (large) input file many times. So > > > far, > > > > > > Drill doesn't optimize all such expressions well. > > > > > > > > > > > > > > > > > > > > >
