This trick definitely helps... Not as fast as I want it to be.. but.. it helps...
Sungwook On Tue, Aug 25, 2015 at 4:25 PM, Jinfeng Ni <[email protected]> wrote: > Hi Sungwook, > > I looked at the query profile. It did show that the bottleneck is Filter > operator, where Filter operator, > running on 4-node EC2 cluster (?), took about 82 minutes to process total > approximately 2.5 billion > records. I'm not sure how powerful each Node's CPU. But seems the query is > CPU-bounded in the > Filter operator. > > There is some way to improve the query. As Vicky and Aman mentioned, we > could convert the IN LIST > predicate to a JOIN op with a VALUES. By converting to JOIN with VALUES > op, we will build a hashtable > for the In-List, and the in list predicate becomes a hash probe, which is > much more efficient than doing multiple comparison. > > The problem is that currently the planner will do such transformation under > a threshold, which is set to 20. > Your query has fewer than 20 items, which currently disqualify for this > transformation. > > A work around is to repeat the values in the IN LIST, so that it exceed 20 > (Adding duplicate in IN list would > not change query result). > > A long term solution is to make the threshold configurable, so that user > could tune the threshold for this kind > of transformation. You may file a JIRA for this. > > For instance, > > explain plan for select n_name, n_nationkey from cp.`tpch/nation.parquet` > where n_nationkey in (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, > 1, 1, 1, 1, 1, 1, 1,1); > > 00-00 Screen > 00-01 Project(n_name=[$0], n_nationkey=[$1]) > 00-02 Project(n_name=[$1], n_nationkey=[$0]) > 00-03 HashJoin(condition=[=($2, $3)], joinType=[inner]) > 00-05 Project(n_nationkey=[$1], n_name=[$0], f2=[$1]) > 00-07 Scan(groupscan=[ParquetGroupScan > [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]], > selectionRoot=classpath:/tpch/nation.parquet, numFiles=1, > columns=[`n_nationkey`, `n_name`]]]) > 00-04 HashAgg(group=[{0}]) > 00-06 Values > > explain plan for select n_name, n_nationkey from cp.`tpch/nation.parquet` > where n_nationkey in (1, 2, 3, 4, 5); > +------+------+ > | text | json | > +------+------+ > | 00-00 Screen > 00-01 Project(n_name=[$1], n_nationkey=[$0]) > 00-02 SelectionVectorRemover > 00-03 Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3), =($0, 4), > =($0, 5))]) > 00-04 Project(n_nationkey=[$1], n_name=[$0]) > 00-05 Scan(groupscan=[ParquetGroupScan > [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]], > selectionRoot=classpath:/tpch/nation.parquet, numFiles=1, > columns=[`n_nationkey`, `n_name`]]]) > > The first one has > 20 values in the list, which will convert to JOIN > VALUES. The second one > has < 20 values, which will be converted to ORed = predicates. > > Please let me if the workaround makes any difference. >
