Here is how the physical plan looks like,
Adding simple "or" operation is killing Drill performance!
00-00 Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0,
cumulative cost = {1.15525857736625E10 rows, 7.55869183540375E10 cpu, 0.0
io, 4096.0 network, 0.0 memory}, id = 7682
00-01 StreamAgg(group=[{}], EXPR$0=[$SUM0($0)]) : rowType =
RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost =
{1.15525857735625E10 rows, 7.55869183539375E10 cpu, 0.0 io, 4096.0 network,
0.0 memory}, id = 7681
00-02 UnionExchange : rowType = RecordType(BIGINT EXPR$0): rowcount
= 1.0, cumulative cost = {1.15525857725625E10 rows, 7.55869183419375E10
cpu, 0.0 io, 4096.0 network, 0.0 memory}, id = 7680
01-01 StreamAgg(group=[{}], EXPR$0=[COUNT()]) : rowType =
RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost =
{1.15525857715625E10 rows, 7.55869183339375E10 cpu, 0.0 io, 0.0 network,
0.0 memory}, id = 7679
01-02 Project($f0=[1]) : rowType = RecordType(INTEGER $f0):
rowcount = 3.300738791875E8, cumulative cost = {1.1222511892375E10 rows,
7.16260317836875E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7678
01-03 SelectionVectorRemover : rowType = RecordType(ANY year,
ANY DX1, ANY DX2, ANY DX3, ANY DX4, ANY DX5): rowcount = 3.300738791875E8,
cumulative cost = {1.08924380131875E10 rows, 7.03057362669375E10 cpu, 0.0
io, 0.0 network, 0.0 memory}, id = 7677
01-04 Filter(condition=[AND(>=($0, 2009), <=($0, 2013),
OR(=($1, '39891'), =($1, '4280'), =($1, '4281'), =($1, '42820'), =($1,
'42821'), =($1, '42822'), =($1, '42823'), =($1, '42830'), =($1, '42831'),
=($1, '42832'), =($1, '42833'), =($1, '42840'), =($1, '42841'), =($1,
'42842'), =($1, '42843'), =($1, '4289'), =($2, '39891'), =($2, '4280'),
=($2, '4281'), =($2, '42820'), =($2, '42821'), =($2, '42822'), =($2,
'42823'), =($2, '42830'), =($2, '42831'), =($2, '42832'), =($2, '42833'),
=($2, '42840'), =($2, '42841'), =($2, '42842'), =($2, '42843'), =($2,
'4289'), =($3, '39891'), =($3, '4280'), =($3, '4281'), =($3, '42820'),
=($3, '42821'), =($3, '42822'), =($3, '42823'), =($3, '42830'), =($3,
'42831'), =($3, '42832'), =($3, '42833'), =($3, '42840'), =($3, '42841'),
=($3, '42842'), =($3, '42843'), =($3, '4289'), =($4, '39891'), =($4,
'4280'), =($4, '4281'), =($4, '42820'), =($4, '42821'), =($4, '42822'),
=($4, '42823'), =($4, '42830'), =($4, '42831'), =($4, '42832'), =($4,
'42833'), =($4, '42840'), =($4, '42841'), =($4, '42842'), =($4, '42843'),
=($4, '4289'), =($5, '39891'), =($5, '4280'), =($5, '4281'), =($5,
'42820'), =($5, '42821'), =($5, '42822'), =($5, '42823'), =($5, '42830'),
=($5, '42831'), =($5, '42832'), =($5, '42833'), =($5, '42840'), =($5,
'42841'), =($5, '42842'), =($5, '42843'), =($5, '4289')))]) : rowType =
RecordType(ANY year, ANY DX1, ANY DX2, ANY DX3, ANY DX4, ANY DX5): rowcount
= 3.300738791875E8, cumulative cost = {1.0562364134E10 rows,
6.997566238775E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7676
01-05 Project(year=[$2], DX1=[$0], DX2=[$1], DX3=[$5],
DX4=[$3], DX5=[$4]) : rowType = RecordType(ANY year, ANY DX1, ANY DX2, ANY
DX3, ANY DX4, ANY DX5): rowcount = 5.281182067E9, cumulative cost =
{5.281182067E9 rows, 3.1687092402E10 cpu, 0.0 io, 0.0 network, 0.0 memory},
id = 7675
01-06 Scan(groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath
[path=maprfs:///drill/parquet/mktscan_o_ttt_save]],
selectionRoot=maprfs:/drill/parquet/mktscan_o_ttt_save, numFiles=1,
columns=[`year`, `DX1`, `DX2`, `DX3`, `DX4`, `DX5`]]]) : rowType =
RecordType(ANY DX1, ANY DX2, ANY year, ANY DX4, ANY DX5, ANY DX3): rowcount
= 5.281182067E9, cumulative cost = {5.281182067E9 rows, 3.1687092402E10
cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7674
Sungwook
On Mon, Aug 24, 2015 at 4:18 PM, Jinfeng Ni <[email protected]> wrote:
> Can you post the plan including the HBaseScan part? I would like check if
> the filter condition is pushed into HBaseScan or not.
>
> For HBaseScan, as I understand, currently, if part of the filter is pushed
> into Scan operator, it will remain in the Filter operator, although the
> filter in the Scan should have pruned out the rows which do not qualify for
> the filter condition.
>
>
>
>
>
> On Mon, Aug 24, 2015 at 2:10 PM, Aman Sinha <[email protected]> wrote:
>
> > Indeed, it is not efficient. We are doing 16 invocations of
> > CONVERT_FROMUTF8($1) and 16 invocations of CONVERT_FROMUTF8($2).
> > Can you pls file a JIRA ? We should ideally be doing projection pushdown
> > in conjunction with the filter pushdown in to the HBase scan and
> > computing these functions only once.
> >
> > Aman
> >
> >
> > On Mon, Aug 24, 2015 at 1:34 PM, Sungwook Yoon <[email protected]>
> wrote:
> >
> > > Hi,
> > >
> > > I have a query, doing something like
> > >
> > > a in (v1, v2, v3, .... v15)
> > >
> > > The physical query plan looks like the following.
> > >
> > > Filter(condition=[AND(>=(CAST($0):INTEGER, 2009), <=(CAST($0):INTEGER,
> > > 2013), OR(=(CONVERT_FROMUTF8($1), '39891'), =(CONVERT_FROMUTF8($1),
> > > '4280'), =(CONVERT_FROMUTF8($1), '4281'), =(CONVERT_FROMUTF8($1),
> > '42820'),
> > > =(CONVERT_FROMUTF8($1), '42821'), =(CONVERT_FROMUTF8($1), '42822'),
> > > =(CONVERT_FROMUTF8($1), '42823'), =(CONVERT_FROMUTF8($1), '42830'),
> > > =(CONVERT_FROMUTF8($1), '42831'), =(CONVERT_FROMUTF8($1), '42832'),
> > > =(CONVERT_FROMUTF8($1), '42833'), =(CONVERT_FROMUTF8($1), '42840'),
> > > =(CONVERT_FROMUTF8($1), '42841'), =(CONVERT_FROMUTF8($1), '42842'),
> > > =(CONVERT_FROMUTF8($1), '42843'), =(CONVERT_FROMUTF8($1), '4289'),
> > > =(CONVERT_FROMUTF8($2), '39891'), =(CONVERT_FROMUTF8($2), '4280'),
> > > =(CONVERT_FROMUTF8($2), '4281'), =(CONVERT_FROMUTF8($2), '42820'),
> > > =(CONVERT_FROMUTF8($2), '42821'), =(CONVERT_FROMUTF8($2), '42822'),
> > > =(CONVERT_FROMUTF8($2), '42823'), =(CONVERT_FROMUTF8($2), '42830'),
> > > =(CONVERT_FROMUTF8($2), '42831'), =(CONVERT_FROMUTF8($2), '42832'),
> > > =(CONVERT_FROMUTF8($2), '42833'), =(CONVERT_FROMUTF8($2), '42840'),
> > > =(CONVERT_FROMUTF8($2), '42841'), =(CONVERT_FROMUTF8($2), '42842'),
> > > =(CONVERT_FROMUTF8($2), '42843'), =(CONVERT_FROMUTF8($2), '4289')))]) :
> > > rowType = RecordType(ANY year, ANY DX1, ANY DX2): rowcount =
> > > 3.300738791875E8, cumulative cost = {1.0562364134E10 rows,
> > > 5.413211618675E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7136
> > >
> > >
> > > In this plan, does the drill convert to string the same column to
> > multiple
> > > times as many as the values it is comparing against?
> > >
> > > From the performance, it looks like it is doing that ...
> > >
> > > Sungwook
> > >
> >
>