Team, I created a feature request
https://issues.apache.org/jira/browse/DRILL-3710 to make the "20" magic
number configurable so that we do not need to add junk/duplicate in-list
elements.
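
If/when the JIRA lands as a session or system option, usage might look
roughly like the sketch below (the option name is purely illustrative; the
real name will be whatever DRILL-3710 settles on):

  -- Hypothetical option name, for illustration only:
  ALTER SESSION SET `planner.in_list_to_join_threshold` = 5;

  -- With the threshold lowered, even a short IN list would be planned as a
  -- hash join against VALUES rather than a chain of ORed equality filters:
  SELECT n_name, n_nationkey
  FROM cp.`tpch/nation.parquet`
  WHERE n_nationkey IN (1, 2, 3, 4, 5);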

On Tue, Aug 25, 2015 at 4:49 PM, Sungwook Yoon <[email protected]> wrote:

> This trick definitely helps...
> Not as fast as I want it to be.. but.. it helps...
>
> Sungwook
>
>
> On Tue, Aug 25, 2015 at 4:25 PM, Jinfeng Ni <[email protected]> wrote:
>
> > Hi Sungwook,
> >
> > I looked at the query profile. It shows that the bottleneck is the Filter
> > operator: running on the 4-node EC2 cluster (?), it took about 82 minutes
> > to process roughly 2.5 billion records in total. I'm not sure how powerful
> > each node's CPU is, but the query appears to be CPU-bound in the Filter
> > operator.
> >
> > There is a way to improve the query. As Vicky and Aman mentioned, we could
> > convert the IN-list predicate into a JOIN with a VALUES operator. With that
> > conversion, we build a hashtable for the IN-list values, and the IN-list
> > predicate becomes a hash probe, which is much more efficient than evaluating
> > a chain of comparisons.
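> >
> > (For illustration, here is roughly what that rewrite amounts to if written
> > by hand, assuming an inline VALUES is accepted in the FROM clause; the
> > planner produces an equivalent plan automatically once the threshold is
> > crossed, and the alias v(id) is made up for this sketch.)
> >
> > -- Hand-written equivalent of the IN-list-to-join rewrite (sketch only):
> > SELECT t.n_name, t.n_nationkey
> > FROM cp.`tpch/nation.parquet` t
> > JOIN (VALUES (1), (2), (3), (4), (5)) AS v(id)
> >   ON t.n_nationkey = v.id;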
> >
> > The problem is that currently the planner only applies this transformation
> > once the number of items in the IN list reaches a threshold, which is set
> > to 20. Your query has fewer than 20 items, so it does not currently qualify
> > for the transformation.
> >
> > A workaround is to repeat values in the IN list so that the number of items
> > exceeds 20 (adding duplicates to an IN list does not change the query
> > result).
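> >
> > (A made-up illustration of the padding, with several distinct values:
> > repeat the whole set until the list has more than 20 entries; the extra
> > copies are harmless.)
> >
> > SELECT n_name, n_nationkey
> > FROM cp.`tpch/nation.parquet`
> > WHERE n_nationkey IN (1, 2, 3, 4, 5,
> >                       1, 2, 3, 4, 5,
> >                       1, 2, 3, 4, 5,
> >                       1, 2, 3, 4, 5,
> >                       1, 2, 3, 4, 5);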
> >
> > A long-term solution is to make the threshold configurable, so that users
> > can tune it for this kind of transformation. You may file a JIRA for this.
> >
> > For instance,
> >
> > explain plan for select n_name, n_nationkey from cp.`tpch/nation.parquet`
> > where n_nationkey in (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> > 1,
> > 1, 1, 1, 1, 1, 1, 1,1);
> >
> > 00-00    Screen
> > 00-01      Project(n_name=[$0], n_nationkey=[$1])
> > 00-02        Project(n_name=[$1], n_nationkey=[$0])
> > 00-03          HashJoin(condition=[=($2, $3)], joinType=[inner])
> > 00-05            Project(n_nationkey=[$1], n_name=[$0], f2=[$1])
> > 00-07              Scan(groupscan=[ParquetGroupScan
> > [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]],
> > selectionRoot=classpath:/tpch/nation.parquet, numFiles=1,
> > columns=[`n_nationkey`, `n_name`]]])
> > 00-04            HashAgg(group=[{0}])
> > 00-06              Values
> >
> > explain plan for select n_name, n_nationkey from cp.`tpch/nation.parquet`
> > where n_nationkey in (1, 2, 3, 4, 5);
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(n_name=[$1], n_nationkey=[$0])
> > 00-02        SelectionVectorRemover
> > 00-03          Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3), =($0, 4), =($0, 5))])
> > 00-04            Project(n_nationkey=[$1], n_name=[$0])
> > 00-05              Scan(groupscan=[ParquetGroupScan
> > [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]],
> > selectionRoot=classpath:/tpch/nation.parquet, numFiles=1,
> > columns=[`n_nationkey`, `n_name`]]])
> >
> > The first query has more than 20 values in the list, so it is converted to
> > a JOIN with VALUES. The second one has fewer than 20 values, so it is
> > converted to ORed equality predicates.
> >
> > Please let me know if the workaround makes any difference.
> >
>
