Jinfeng, Still, the performance drop down due to OR filtering is just astounding...
Sungwook On Mon, Aug 24, 2015 at 3:33 PM, Jinfeng Ni <[email protected]> wrote: > Looks like the query has Parquet Scan, in stead of HBase Scan. For > Parquet, Drill currently > does not support filter push down into Scan operator yet. > > Can you check the query profile, and verify that Filter operator is the one > which takes the > most of the query time? > > For the filter operator, Drill uses short circuit evaluation[1] to speed up > the evaluation performance. > > - If the row does not qualify for 'year' between 2009 and 2013, then the > entire OR operator > would be skipped for that row. > > - If the row does qualify for 'year' between 2009 and 2013, it will come > down to OR operator. > As long as one branch of OR is evaluated to be true, the rest of OR's > branches will be skipped. > > In other words, Drill's execution does not compare each row against all the > values listed in the > in list. > > [1]. https://issues.apache.org/jira/browse/DRILL-937 > > On Mon, Aug 24, 2015 at 2:41 PM, Sungwook Yoon <[email protected]> wrote: > > > Here is how the physical plan looks like, > > Adding simple "or" operation is killing Drill performance! > > > > 00-00 Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, > > cumulative cost = {1.15525857736625E10 rows, 7.55869183540375E10 cpu, 0.0 > > io, 4096.0 network, 0.0 memory}, id = 7682 > > 00-01 StreamAgg(group=[{}], EXPR$0=[$SUM0($0)]) : rowType = > > RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = > > {1.15525857735625E10 rows, 7.55869183539375E10 cpu, 0.0 io, 4096.0 > network, > > 0.0 memory}, id = 7681 > > 00-02 UnionExchange : rowType = RecordType(BIGINT EXPR$0): > rowcount > > = 1.0, cumulative cost = {1.15525857725625E10 rows, 7.55869183419375E10 > > cpu, 0.0 io, 4096.0 network, 0.0 memory}, id = 7680 > > 01-01 StreamAgg(group=[{}], EXPR$0=[COUNT()]) : rowType = > > RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = > > {1.15525857715625E10 rows, 7.55869183339375E10 cpu, 0.0 io, 0.0 network, > > 0.0 memory}, id = 7679 > > 01-02 Project($f0=[1]) : rowType = RecordType(INTEGER $f0): > > rowcount = 3.300738791875E8, cumulative cost = {1.1222511892375E10 rows, > > 7.16260317836875E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7678 > > 01-03 SelectionVectorRemover : rowType = RecordType(ANY > year, > > ANY DX1, ANY DX2, ANY DX3, ANY DX4, ANY DX5): rowcount = > 3.300738791875E8, > > cumulative cost = {1.08924380131875E10 rows, 7.03057362669375E10 cpu, 0.0 > > io, 0.0 network, 0.0 memory}, id = 7677 > > 01-04 Filter(condition=[AND(>=($0, 2009), <=($0, 2013), > > OR(=($1, '39891'), =($1, '4280'), =($1, '4281'), =($1, '42820'), =($1, > > '42821'), =($1, '42822'), =($1, '42823'), =($1, '42830'), =($1, '42831'), > > =($1, '42832'), =($1, '42833'), =($1, '42840'), =($1, '42841'), =($1, > > '42842'), =($1, '42843'), =($1, '4289'), =($2, '39891'), =($2, '4280'), > > =($2, '4281'), =($2, '42820'), =($2, '42821'), =($2, '42822'), =($2, > > '42823'), =($2, '42830'), =($2, '42831'), =($2, '42832'), =($2, '42833'), > > =($2, '42840'), =($2, '42841'), =($2, '42842'), =($2, '42843'), =($2, > > '4289'), =($3, '39891'), =($3, '4280'), =($3, '4281'), =($3, '42820'), > > =($3, '42821'), =($3, '42822'), =($3, '42823'), =($3, '42830'), =($3, > > '42831'), =($3, '42832'), =($3, '42833'), =($3, '42840'), =($3, '42841'), > > =($3, '42842'), =($3, '42843'), =($3, '4289'), =($4, '39891'), =($4, > > '4280'), =($4, '4281'), =($4, '42820'), =($4, '42821'), =($4, '42822'), > > =($4, '42823'), =($4, '42830'), =($4, '42831'), =($4, '42832'), =($4, > > '42833'), =($4, '42840'), =($4, '42841'), =($4, '42842'), =($4, '42843'), > > =($4, '4289'), =($5, '39891'), =($5, '4280'), =($5, '4281'), =($5, > > '42820'), =($5, '42821'), =($5, '42822'), =($5, '42823'), =($5, '42830'), > > =($5, '42831'), =($5, '42832'), =($5, '42833'), =($5, '42840'), =($5, > > '42841'), =($5, '42842'), =($5, '42843'), =($5, '4289')))]) : rowType = > > RecordType(ANY year, ANY DX1, ANY DX2, ANY DX3, ANY DX4, ANY DX5): > rowcount > > = 3.300738791875E8, cumulative cost = {1.0562364134E10 rows, > > 6.997566238775E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7676 > > 01-05 Project(year=[$2], DX1=[$0], DX2=[$1], DX3=[$5], > > DX4=[$3], DX5=[$4]) : rowType = RecordType(ANY year, ANY DX1, ANY DX2, > ANY > > DX3, ANY DX4, ANY DX5): rowcount = 5.281182067E9, cumulative cost = > > {5.281182067E9 rows, 3.1687092402E10 cpu, 0.0 io, 0.0 network, 0.0 > memory}, > > id = 7675 > > 01-06 Scan(groupscan=[ParquetGroupScan > > [entries=[ReadEntryWithPath > > [path=maprfs:///drill/parquet/mktscan_o_ttt_save]], > > selectionRoot=maprfs:/drill/parquet/mktscan_o_ttt_save, numFiles=1, > > columns=[`year`, `DX1`, `DX2`, `DX3`, `DX4`, `DX5`]]]) : rowType = > > RecordType(ANY DX1, ANY DX2, ANY year, ANY DX4, ANY DX5, ANY DX3): > rowcount > > = 5.281182067E9, cumulative cost = {5.281182067E9 rows, 3.1687092402E10 > > cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7674 > > > > > > Sungwook > > > > On Mon, Aug 24, 2015 at 4:18 PM, Jinfeng Ni <[email protected]> > wrote: > > > > > Can you post the plan including the HBaseScan part? I would like check > > if > > > the filter condition is pushed into HBaseScan or not. > > > > > > For HBaseScan, as I understand, currently, if part of the filter is > > pushed > > > into Scan operator, it will remain in the Filter operator, although the > > > filter in the Scan should have pruned out the rows which do not qualify > > for > > > the filter condition. > > > > > > > > > > > > > > > > > > On Mon, Aug 24, 2015 at 2:10 PM, Aman Sinha <[email protected]> > > wrote: > > > > > > > Indeed, it is not efficient. We are doing 16 invocations of > > > > CONVERT_FROMUTF8($1) and 16 invocations of CONVERT_FROMUTF8($2). > > > > Can you pls file a JIRA ? We should ideally be doing projection > > pushdown > > > > in conjunction with the filter pushdown in to the HBase scan and > > > > computing these functions only once. > > > > > > > > Aman > > > > > > > > > > > > On Mon, Aug 24, 2015 at 1:34 PM, Sungwook Yoon <[email protected]> > > > wrote: > > > > > > > > > Hi, > > > > > > > > > > I have a query, doing something like > > > > > > > > > > a in (v1, v2, v3, .... v15) > > > > > > > > > > The physical query plan looks like the following. > > > > > > > > > > Filter(condition=[AND(>=(CAST($0):INTEGER, 2009), > > <=(CAST($0):INTEGER, > > > > > 2013), OR(=(CONVERT_FROMUTF8($1), '39891'), =(CONVERT_FROMUTF8($1), > > > > > '4280'), =(CONVERT_FROMUTF8($1), '4281'), =(CONVERT_FROMUTF8($1), > > > > '42820'), > > > > > =(CONVERT_FROMUTF8($1), '42821'), =(CONVERT_FROMUTF8($1), '42822'), > > > > > =(CONVERT_FROMUTF8($1), '42823'), =(CONVERT_FROMUTF8($1), '42830'), > > > > > =(CONVERT_FROMUTF8($1), '42831'), =(CONVERT_FROMUTF8($1), '42832'), > > > > > =(CONVERT_FROMUTF8($1), '42833'), =(CONVERT_FROMUTF8($1), '42840'), > > > > > =(CONVERT_FROMUTF8($1), '42841'), =(CONVERT_FROMUTF8($1), '42842'), > > > > > =(CONVERT_FROMUTF8($1), '42843'), =(CONVERT_FROMUTF8($1), '4289'), > > > > > =(CONVERT_FROMUTF8($2), '39891'), =(CONVERT_FROMUTF8($2), '4280'), > > > > > =(CONVERT_FROMUTF8($2), '4281'), =(CONVERT_FROMUTF8($2), '42820'), > > > > > =(CONVERT_FROMUTF8($2), '42821'), =(CONVERT_FROMUTF8($2), '42822'), > > > > > =(CONVERT_FROMUTF8($2), '42823'), =(CONVERT_FROMUTF8($2), '42830'), > > > > > =(CONVERT_FROMUTF8($2), '42831'), =(CONVERT_FROMUTF8($2), '42832'), > > > > > =(CONVERT_FROMUTF8($2), '42833'), =(CONVERT_FROMUTF8($2), '42840'), > > > > > =(CONVERT_FROMUTF8($2), '42841'), =(CONVERT_FROMUTF8($2), '42842'), > > > > > =(CONVERT_FROMUTF8($2), '42843'), =(CONVERT_FROMUTF8($2), > > '4289')))]) : > > > > > rowType = RecordType(ANY year, ANY DX1, ANY DX2): rowcount = > > > > > 3.300738791875E8, cumulative cost = {1.0562364134E10 rows, > > > > > 5.413211618675E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7136 > > > > > > > > > > > > > > > In this plan, does the drill convert to string the same column to > > > > multiple > > > > > times as many as the values it is comparing against? > > > > > > > > > > From the performance, it looks like it is doing that ... > > > > > > > > > > Sungwook > > > > > > > > > > > > > > >
