Jinfeng,

Still, the performance drop down due to OR filtering is just astounding...

Sungwook


On Mon, Aug 24, 2015 at 3:33 PM, Jinfeng Ni <[email protected]> wrote:

> Looks like the query has Parquet Scan, in stead of HBase Scan.  For
> Parquet, Drill currently
>  does not support filter push down into Scan operator yet.
>
> Can you check the query profile, and verify that Filter operator is the one
> which takes the
> most of the query time?
>
> For the filter operator, Drill uses short circuit evaluation[1] to speed up
> the evaluation performance.
>
>    - If the row does not qualify for 'year' between 2009 and 2013, then the
> entire OR operator
> would be skipped for that row.
>
>    - If the row does qualify for 'year' between 2009 and 2013, it will come
> down to OR operator.
> As long as one branch of OR is evaluated to be true, the rest of OR's
> branches will be skipped.
>
> In other words, Drill's execution does not compare each row against all the
> values listed in the
> in list.
>
> [1]. https://issues.apache.org/jira/browse/DRILL-937
>
> On Mon, Aug 24, 2015 at 2:41 PM, Sungwook Yoon <[email protected]> wrote:
>
> > Here is how the physical plan looks like,
> > Adding simple "or" operation is killing Drill performance!
> >
> > 00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0,
> > cumulative cost = {1.15525857736625E10 rows, 7.55869183540375E10 cpu, 0.0
> > io, 4096.0 network, 0.0 memory}, id = 7682
> > 00-01      StreamAgg(group=[{}], EXPR$0=[$SUM0($0)]) : rowType =
> > RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost =
> > {1.15525857735625E10 rows, 7.55869183539375E10 cpu, 0.0 io, 4096.0
> network,
> > 0.0 memory}, id = 7681
> > 00-02        UnionExchange : rowType = RecordType(BIGINT EXPR$0):
> rowcount
> > = 1.0, cumulative cost = {1.15525857725625E10 rows, 7.55869183419375E10
> > cpu, 0.0 io, 4096.0 network, 0.0 memory}, id = 7680
> > 01-01          StreamAgg(group=[{}], EXPR$0=[COUNT()]) : rowType =
> > RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost =
> > {1.15525857715625E10 rows, 7.55869183339375E10 cpu, 0.0 io, 0.0 network,
> > 0.0 memory}, id = 7679
> > 01-02            Project($f0=[1]) : rowType = RecordType(INTEGER $f0):
> > rowcount = 3.300738791875E8, cumulative cost = {1.1222511892375E10 rows,
> > 7.16260317836875E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7678
> > 01-03              SelectionVectorRemover : rowType = RecordType(ANY
> year,
> > ANY DX1, ANY DX2, ANY DX3, ANY DX4, ANY DX5): rowcount =
> 3.300738791875E8,
> > cumulative cost = {1.08924380131875E10 rows, 7.03057362669375E10 cpu, 0.0
> > io, 0.0 network, 0.0 memory}, id = 7677
> > 01-04                Filter(condition=[AND(>=($0, 2009), <=($0, 2013),
> > OR(=($1, '39891'), =($1, '4280'), =($1, '4281'), =($1, '42820'), =($1,
> > '42821'), =($1, '42822'), =($1, '42823'), =($1, '42830'), =($1, '42831'),
> > =($1, '42832'), =($1, '42833'), =($1, '42840'), =($1, '42841'), =($1,
> > '42842'), =($1, '42843'), =($1, '4289'), =($2, '39891'), =($2, '4280'),
> > =($2, '4281'), =($2, '42820'), =($2, '42821'), =($2, '42822'), =($2,
> > '42823'), =($2, '42830'), =($2, '42831'), =($2, '42832'), =($2, '42833'),
> > =($2, '42840'), =($2, '42841'), =($2, '42842'), =($2, '42843'), =($2,
> > '4289'), =($3, '39891'), =($3, '4280'), =($3, '4281'), =($3, '42820'),
> > =($3, '42821'), =($3, '42822'), =($3, '42823'), =($3, '42830'), =($3,
> > '42831'), =($3, '42832'), =($3, '42833'), =($3, '42840'), =($3, '42841'),
> > =($3, '42842'), =($3, '42843'), =($3, '4289'), =($4, '39891'), =($4,
> > '4280'), =($4, '4281'), =($4, '42820'), =($4, '42821'), =($4, '42822'),
> > =($4, '42823'), =($4, '42830'), =($4, '42831'), =($4, '42832'), =($4,
> > '42833'), =($4, '42840'), =($4, '42841'), =($4, '42842'), =($4, '42843'),
> > =($4, '4289'), =($5, '39891'), =($5, '4280'), =($5, '4281'), =($5,
> > '42820'), =($5, '42821'), =($5, '42822'), =($5, '42823'), =($5, '42830'),
> > =($5, '42831'), =($5, '42832'), =($5, '42833'), =($5, '42840'), =($5,
> > '42841'), =($5, '42842'), =($5, '42843'), =($5, '4289')))]) : rowType =
> > RecordType(ANY year, ANY DX1, ANY DX2, ANY DX3, ANY DX4, ANY DX5):
> rowcount
> > = 3.300738791875E8, cumulative cost = {1.0562364134E10 rows,
> > 6.997566238775E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7676
> > 01-05                  Project(year=[$2], DX1=[$0], DX2=[$1], DX3=[$5],
> > DX4=[$3], DX5=[$4]) : rowType = RecordType(ANY year, ANY DX1, ANY DX2,
> ANY
> > DX3, ANY DX4, ANY DX5): rowcount = 5.281182067E9, cumulative cost =
> > {5.281182067E9 rows, 3.1687092402E10 cpu, 0.0 io, 0.0 network, 0.0
> memory},
> > id = 7675
> > 01-06                    Scan(groupscan=[ParquetGroupScan
> > [entries=[ReadEntryWithPath
> > [path=maprfs:///drill/parquet/mktscan_o_ttt_save]],
> > selectionRoot=maprfs:/drill/parquet/mktscan_o_ttt_save, numFiles=1,
> > columns=[`year`, `DX1`, `DX2`, `DX3`, `DX4`, `DX5`]]]) : rowType =
> > RecordType(ANY DX1, ANY DX2, ANY year, ANY DX4, ANY DX5, ANY DX3):
> rowcount
> > = 5.281182067E9, cumulative cost = {5.281182067E9 rows, 3.1687092402E10
> > cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7674
> >
> >
> > Sungwook
> >
> > On Mon, Aug 24, 2015 at 4:18 PM, Jinfeng Ni <[email protected]>
> wrote:
> >
> > > Can you post the plan including the HBaseScan part?  I would like check
> > if
> > > the filter condition is pushed into HBaseScan or not.
> > >
> > > For HBaseScan, as I understand, currently, if part of the filter is
> > pushed
> > > into Scan operator, it will remain in the Filter operator, although the
> > > filter in the Scan should have pruned out the rows which do not qualify
> > for
> > > the filter condition.
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Aug 24, 2015 at 2:10 PM, Aman Sinha <[email protected]>
> > wrote:
> > >
> > > > Indeed, it is not efficient. We are doing 16 invocations of
> > > > CONVERT_FROMUTF8($1)  and 16 invocations of CONVERT_FROMUTF8($2).
> > > > Can you pls file a JIRA ?  We should ideally be doing projection
> > pushdown
> > > > in conjunction with the filter pushdown in to the HBase scan and
> > > > computing these functions only once.
> > > >
> > > > Aman
> > > >
> > > >
> > > > On Mon, Aug 24, 2015 at 1:34 PM, Sungwook Yoon <[email protected]>
> > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have a query, doing something like
> > > > >
> > > > > a in (v1, v2, v3, .... v15)
> > > > >
> > > > > The physical query plan looks like the following.
> > > > >
> > > > > Filter(condition=[AND(>=(CAST($0):INTEGER, 2009),
> > <=(CAST($0):INTEGER,
> > > > > 2013), OR(=(CONVERT_FROMUTF8($1), '39891'), =(CONVERT_FROMUTF8($1),
> > > > > '4280'), =(CONVERT_FROMUTF8($1), '4281'), =(CONVERT_FROMUTF8($1),
> > > > '42820'),
> > > > > =(CONVERT_FROMUTF8($1), '42821'), =(CONVERT_FROMUTF8($1), '42822'),
> > > > > =(CONVERT_FROMUTF8($1), '42823'), =(CONVERT_FROMUTF8($1), '42830'),
> > > > > =(CONVERT_FROMUTF8($1), '42831'), =(CONVERT_FROMUTF8($1), '42832'),
> > > > > =(CONVERT_FROMUTF8($1), '42833'), =(CONVERT_FROMUTF8($1), '42840'),
> > > > > =(CONVERT_FROMUTF8($1), '42841'), =(CONVERT_FROMUTF8($1), '42842'),
> > > > > =(CONVERT_FROMUTF8($1), '42843'), =(CONVERT_FROMUTF8($1), '4289'),
> > > > > =(CONVERT_FROMUTF8($2), '39891'), =(CONVERT_FROMUTF8($2), '4280'),
> > > > > =(CONVERT_FROMUTF8($2), '4281'), =(CONVERT_FROMUTF8($2), '42820'),
> > > > > =(CONVERT_FROMUTF8($2), '42821'), =(CONVERT_FROMUTF8($2), '42822'),
> > > > > =(CONVERT_FROMUTF8($2), '42823'), =(CONVERT_FROMUTF8($2), '42830'),
> > > > > =(CONVERT_FROMUTF8($2), '42831'), =(CONVERT_FROMUTF8($2), '42832'),
> > > > > =(CONVERT_FROMUTF8($2), '42833'), =(CONVERT_FROMUTF8($2), '42840'),
> > > > > =(CONVERT_FROMUTF8($2), '42841'), =(CONVERT_FROMUTF8($2), '42842'),
> > > > > =(CONVERT_FROMUTF8($2), '42843'), =(CONVERT_FROMUTF8($2),
> > '4289')))]) :
> > > > > rowType = RecordType(ANY year, ANY DX1, ANY DX2): rowcount =
> > > > > 3.300738791875E8, cumulative cost = {1.0562364134E10 rows,
> > > > > 5.413211618675E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7136
> > > > >
> > > > >
> > > > > In this plan, does the drill convert to string the same column to
> > > > multiple
> > > > > times as many as the values it is comparing against?
> > > > >
> > > > > From the performance, it looks like it is doing that ...
> > > > >
> > > > > Sungwook
> > > > >
> > > >
> > >
> >
>

Reply via email to