Looks like the query has a Parquet Scan, instead of an HBase Scan. For Parquet, Drill does not yet support filter pushdown into the Scan operator.
Can you check the query profile and verify that the Filter operator is the one that takes most of the query time?

For the Filter operator, Drill uses short-circuit evaluation [1] to speed up evaluation:
- If a row does not qualify for 'year' between 2009 and 2013, the entire OR expression is skipped for that row.
- If a row does qualify for 'year' between 2009 and 2013, evaluation moves on to the OR expression. As soon as one branch of the OR evaluates to true, the remaining branches are skipped.

In other words, Drill's execution does not compare each row against all of the values listed in the IN list.

[1] https://issues.apache.org/jira/browse/DRILL-937

On Mon, Aug 24, 2015 at 2:41 PM, Sungwook Yoon <[email protected]> wrote:

> Here is how the physical plan looks like,
> Adding simple "or" operation is killing Drill performance!
>
> 00-00 Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0,
> cumulative cost = {1.15525857736625E10 rows, 7.55869183540375E10 cpu, 0.0
> io, 4096.0 network, 0.0 memory}, id = 7682
> 00-01 StreamAgg(group=[{}], EXPR$0=[$SUM0($0)]) : rowType =
> RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost =
> {1.15525857735625E10 rows, 7.55869183539375E10 cpu, 0.0 io, 4096.0 network,
> 0.0 memory}, id = 7681
> 00-02 UnionExchange : rowType = RecordType(BIGINT EXPR$0): rowcount
> = 1.0, cumulative cost = {1.15525857725625E10 rows, 7.55869183419375E10
> cpu, 0.0 io, 4096.0 network, 0.0 memory}, id = 7680
> 01-01 StreamAgg(group=[{}], EXPR$0=[COUNT()]) : rowType =
> RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost =
> {1.15525857715625E10 rows, 7.55869183339375E10 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 7679
> 01-02 Project($f0=[1]) : rowType = RecordType(INTEGER $f0):
> rowcount = 3.300738791875E8, cumulative cost = {1.1222511892375E10 rows,
> 7.16260317836875E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7678
> 01-03 SelectionVectorRemover : rowType = RecordType(ANY year,
> ANY DX1, ANY
> DX2, ANY DX3, ANY DX4, ANY DX5): rowcount = 3.300738791875E8,
> cumulative cost = {1.08924380131875E10 rows, 7.03057362669375E10 cpu, 0.0
> io, 0.0 network, 0.0 memory}, id = 7677
> 01-04 Filter(condition=[AND(>=($0, 2009), <=($0, 2013),
> OR(=($1, '39891'), =($1, '4280'), =($1, '4281'), =($1, '42820'), =($1,
> '42821'), =($1, '42822'), =($1, '42823'), =($1, '42830'), =($1, '42831'),
> =($1, '42832'), =($1, '42833'), =($1, '42840'), =($1, '42841'), =($1,
> '42842'), =($1, '42843'), =($1, '4289'), =($2, '39891'), =($2, '4280'),
> =($2, '4281'), =($2, '42820'), =($2, '42821'), =($2, '42822'), =($2,
> '42823'), =($2, '42830'), =($2, '42831'), =($2, '42832'), =($2, '42833'),
> =($2, '42840'), =($2, '42841'), =($2, '42842'), =($2, '42843'), =($2,
> '4289'), =($3, '39891'), =($3, '4280'), =($3, '4281'), =($3, '42820'),
> =($3, '42821'), =($3, '42822'), =($3, '42823'), =($3, '42830'), =($3,
> '42831'), =($3, '42832'), =($3, '42833'), =($3, '42840'), =($3, '42841'),
> =($3, '42842'), =($3, '42843'), =($3, '4289'), =($4, '39891'), =($4,
> '4280'), =($4, '4281'), =($4, '42820'), =($4, '42821'), =($4, '42822'),
> =($4, '42823'), =($4, '42830'), =($4, '42831'), =($4, '42832'), =($4,
> '42833'), =($4, '42840'), =($4, '42841'), =($4, '42842'), =($4, '42843'),
> =($4, '4289'), =($5, '39891'), =($5, '4280'), =($5, '4281'), =($5,
> '42820'), =($5, '42821'), =($5, '42822'), =($5, '42823'), =($5, '42830'),
> =($5, '42831'), =($5, '42832'), =($5, '42833'), =($5, '42840'), =($5,
> '42841'), =($5, '42842'), =($5, '42843'), =($5, '4289')))]) : rowType =
> RecordType(ANY year, ANY DX1, ANY DX2, ANY DX3, ANY DX4, ANY DX5): rowcount
> = 3.300738791875E8, cumulative cost = {1.0562364134E10 rows,
> 6.997566238775E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7676
> 01-05 Project(year=[$2], DX1=[$0], DX2=[$1], DX3=[$5],
> DX4=[$3], DX5=[$4]) : rowType = RecordType(ANY year, ANY DX1, ANY DX2, ANY
> DX3, ANY DX4, ANY DX5): rowcount = 5.281182067E9, cumulative cost =
> {5.281182067E9 rows,
> 3.1687092402E10 cpu, 0.0 io, 0.0 network, 0.0 memory},
> id = 7675
> 01-06 Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath
> [path=maprfs:///drill/parquet/mktscan_o_ttt_save]],
> selectionRoot=maprfs:/drill/parquet/mktscan_o_ttt_save, numFiles=1,
> columns=[`year`, `DX1`, `DX2`, `DX3`, `DX4`, `DX5`]]]) : rowType =
> RecordType(ANY DX1, ANY DX2, ANY year, ANY DX4, ANY DX5, ANY DX3): rowcount
> = 5.281182067E9, cumulative cost = {5.281182067E9 rows, 3.1687092402E10
> cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7674
>
>
> Sungwook
>
> On Mon, Aug 24, 2015 at 4:18 PM, Jinfeng Ni <[email protected]> wrote:
>
> > Can you post the plan including the HBaseScan part? I would like check
> if
> > the filter condition is pushed into HBaseScan or not.
> >
> > For HBaseScan, as I understand, currently, if part of the filter is
> pushed
> > into Scan operator, it will remain in the Filter operator, although the
> > filter in the Scan should have pruned out the rows which do not qualify
> for
> > the filter condition.
> >
> > On Mon, Aug 24, 2015 at 2:10 PM, Aman Sinha <[email protected]>
> wrote:
> >
> > > Indeed, it is not efficient. We are doing 16 invocations of
> > > CONVERT_FROMUTF8($1) and 16 invocations of CONVERT_FROMUTF8($2).
> > > Can you pls file a JIRA ? We should ideally be doing projection
> pushdown
> > > in conjunction with the filter pushdown in to the HBase scan and
> > > computing these functions only once.
> > >
> > > Aman
> > >
> > > On Mon, Aug 24, 2015 at 1:34 PM, Sungwook Yoon <[email protected]>
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a query, doing something like
> > > >
> > > > a in (v1, v2, v3, .... v15)
> > > >
> > > > The physical query plan looks like the following.
> > > >
> > > > Filter(condition=[AND(>=(CAST($0):INTEGER, 2009),
> <=(CAST($0):INTEGER,
> > > > 2013), OR(=(CONVERT_FROMUTF8($1), '39891'), =(CONVERT_FROMUTF8($1),
> > > > '4280'), =(CONVERT_FROMUTF8($1), '4281'), =(CONVERT_FROMUTF8($1),
> > > '42820'),
> > > > =(CONVERT_FROMUTF8($1), '42821'), =(CONVERT_FROMUTF8($1), '42822'),
> > > > =(CONVERT_FROMUTF8($1), '42823'), =(CONVERT_FROMUTF8($1), '42830'),
> > > > =(CONVERT_FROMUTF8($1), '42831'), =(CONVERT_FROMUTF8($1), '42832'),
> > > > =(CONVERT_FROMUTF8($1), '42833'), =(CONVERT_FROMUTF8($1), '42840'),
> > > > =(CONVERT_FROMUTF8($1), '42841'), =(CONVERT_FROMUTF8($1), '42842'),
> > > > =(CONVERT_FROMUTF8($1), '42843'), =(CONVERT_FROMUTF8($1), '4289'),
> > > > =(CONVERT_FROMUTF8($2), '39891'), =(CONVERT_FROMUTF8($2), '4280'),
> > > > =(CONVERT_FROMUTF8($2), '4281'), =(CONVERT_FROMUTF8($2), '42820'),
> > > > =(CONVERT_FROMUTF8($2), '42821'), =(CONVERT_FROMUTF8($2), '42822'),
> > > > =(CONVERT_FROMUTF8($2), '42823'), =(CONVERT_FROMUTF8($2), '42830'),
> > > > =(CONVERT_FROMUTF8($2), '42831'), =(CONVERT_FROMUTF8($2), '42832'),
> > > > =(CONVERT_FROMUTF8($2), '42833'), =(CONVERT_FROMUTF8($2), '42840'),
> > > > =(CONVERT_FROMUTF8($2), '42841'), =(CONVERT_FROMUTF8($2), '42842'),
> > > > =(CONVERT_FROMUTF8($2), '42843'), =(CONVERT_FROMUTF8($2),
> '4289')))]) :
> > > > rowType = RecordType(ANY year, ANY DX1, ANY DX2): rowcount =
> > > > 3.300738791875E8, cumulative cost = {1.0562364134E10 rows,
> > > > 5.413211618675E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7136
> > > >
> > > > In this plan, does the drill convert to string the same column to
> > > multiple
> > > > times as many as the values it is comparing against?
> > > >
> > > > From the performance, it looks like it is doing that ...
> > > >
> > > > Sungwook
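For readers following along: the short-circuit behavior described above, combined with Aman's point about evaluating CONVERT_FROMUTF8 only once per column, can be sketched roughly as follows. This is a simplified illustration, not Drill's actual generated code; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Sketch (NOT Drill's generated code) of evaluating a filter like
//   year >= 2009 AND year <= 2013 AND (DX1 IN (...) OR DX2 IN (...))
// with short-circuiting and a single UTF-8 decode per column per row.
public class ShortCircuitFilter {

    // The IN-list values from the plan in this thread.
    static final String[] IN_LIST = {
        "39891", "4280", "4281", "42820", "42821", "42822", "42823",
        "42830", "42831", "42832", "42833", "42840", "42841", "42842",
        "42843", "4289"
    };

    // Stand-in for CONVERT_FROMUTF8: decode the raw column bytes once.
    static String convertFromUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Chain of equality checks with early exit, mirroring the OR branches.
    static boolean inList(String value) {
        for (String candidate : IN_LIST) {
            if (candidate.equals(value)) {
                return true;   // remaining OR branches are skipped
            }
        }
        return false;
    }

    static boolean matches(int year, byte[] dx1Raw, byte[] dx2Raw) {
        // Cheap range predicate first: if it fails, no OR branch runs.
        if (year < 2009 || year > 2013) {
            return false;
        }
        // Decode DX1 once and reuse it for all 16 comparisons, rather
        // than one CONVERT_FROMUTF8 call per comparison as in the plan.
        if (inList(convertFromUtf8(dx1Raw))) {
            return true;       // DX2 is never decoded or compared
        }
        return inList(convertFromUtf8(dx2Raw));
    }
}
```

The key point the sketch illustrates: the per-row cost is bounded by the first predicate that decides the outcome, so a 16-value IN list rarely costs 16 comparisons, but repeated CONVERT_FROMUTF8 calls (one per comparison, as in the HBase plan) would multiply the decode cost needlessly.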
