Team, I created a feature request https://issues.apache.org/jira/browse/DRILL-3710 to make the "20" magic number configurable so that we do not need to add junk/duplicate in-list elements.
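Until that is configurable, here is a minimal sketch of the padding workaround described in Jinfeng's note below; the table and column names (dfs.tmp.`orders`, order_id, status) are placeholders for illustration only:

-- Hypothetical query: pad the IN list with duplicates so it exceeds the
-- 20-item threshold and the planner rewrites it as a HashJoin against a
-- VALUES operator instead of a chain of ORed equality filters.
select order_id, status
from dfs.tmp.`orders`
where status in ('NEW', 'SHIPPED', 'CANCELLED',
                 'NEW', 'NEW', 'NEW', 'NEW', 'NEW', 'NEW', 'NEW',
                 'NEW', 'NEW', 'NEW', 'NEW', 'NEW', 'NEW', 'NEW',
                 'NEW', 'NEW', 'NEW', 'NEW', 'NEW');

The duplicates do not change the query result; they only push the in-list size past the planner's threshold.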
On Tue, Aug 25, 2015 at 4:49 PM, Sungwook Yoon <[email protected]> wrote:
> This trick definitely helps... not as fast as I want it to be, but it helps.
>
> Sungwook
>
> On Tue, Aug 25, 2015 at 4:25 PM, Jinfeng Ni <[email protected]> wrote:
>
> > Hi Sungwook,
> >
> > I looked at the query profile. It shows that the bottleneck is the Filter
> > operator, which, running on a 4-node EC2 cluster (?), took about 82 minutes
> > to process approximately 2.5 billion records in total. I'm not sure how
> > powerful each node's CPU is, but the query seems to be CPU-bound in the
> > Filter operator.
> >
> > There is a way to improve the query. As Vicky and Aman mentioned, we could
> > convert the IN-list predicate into a JOIN against a VALUES operator. With
> > that conversion, a hash table is built for the in-list and the in-list
> > predicate becomes a hash probe, which is much more efficient than doing
> > multiple comparisons per row.
> >
> > The problem is that the planner currently applies this transformation only
> > when the in-list size reaches a threshold, which is set to 20. Your query
> > has fewer than 20 items, which currently disqualifies it from the
> > transformation.
> >
> > A workaround is to repeat values in the IN list so that it exceeds 20 items
> > (adding duplicates to the IN list does not change the query result).
> >
> > A long-term solution is to make the threshold configurable, so that users
> > can tune when this transformation kicks in. You may file a JIRA for this.
> >
> > For instance:
> >
> > explain plan for select n_name, n_nationkey from cp.`tpch/nation.parquet`
> > where n_nationkey in (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> > 1, 1, 1, 1, 1, 1, 1, 1, 1);
> >
> > 00-00    Screen
> > 00-01      Project(n_name=[$0], n_nationkey=[$1])
> > 00-02        Project(n_name=[$1], n_nationkey=[$0])
> > 00-03          HashJoin(condition=[=($2, $3)], joinType=[inner])
> > 00-05            Project(n_nationkey=[$1], n_name=[$0], f2=[$1])
> > 00-07              Scan(groupscan=[ParquetGroupScan
> > [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]],
> > selectionRoot=classpath:/tpch/nation.parquet, numFiles=1,
> > columns=[`n_nationkey`, `n_name`]]])
> > 00-04            HashAgg(group=[{0}])
> > 00-06              Values
> >
> > explain plan for select n_name, n_nationkey from cp.`tpch/nation.parquet`
> > where n_nationkey in (1, 2, 3, 4, 5);
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(n_name=[$1], n_nationkey=[$0])
> > 00-02        SelectionVectorRemover
> > 00-03          Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3), =($0, 4),
> > =($0, 5))])
> > 00-04            Project(n_nationkey=[$1], n_name=[$0])
> > 00-05              Scan(groupscan=[ParquetGroupScan
> > [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]],
> > selectionRoot=classpath:/tpch/nation.parquet, numFiles=1,
> > columns=[`n_nationkey`, `n_name`]]])
> >
> > The first query has more than 20 values in the list, so it is converted to
> > a JOIN with VALUES. The second has fewer than 20 values, so it is converted
> > to ORed = predicates.
> >
> > Please let me know if the workaround makes any difference.
