Hi Sungwook,

I looked at the query profile.  It did show that the bottleneck is Filter
operator,  where Filter operator,
running on 4-node EC2 cluster (?),  took about 82 minutes to process total
approximately 2.5 billion
records. I'm not sure how powerful each Node's CPU. But seems the query is
CPU-bounded in the
Filter operator.

There is some way to improve the query.  As Vicky and Aman mentioned, we
could convert the IN LIST
predicate to a JOIN op with a VALUES.  By converting to JOIN with VALUES
op, we will build a hashtable
for the In-List, and the in list predicate becomes a hash probe, which is
much more efficient than doing multiple comparison.

The problem is that currently the planner will do such transformation under
a threshold, which is set to 20.
Your query has fewer than 20 items, which currently disqualify for this
transformation.

A work around is to repeat the values in the IN LIST, so that it exceed 20
(Adding duplicate in IN list would
not change query result).

A long term solution is to make the threshold configurable, so that user
could tune the threshold for this kind
of transformation. You may file a JIRA for this.

For instance,

explain plan for select n_name, n_nationkey from cp.`tpch/nation.parquet`
where n_nationkey in (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1,1);

00-00    Screen
00-01      Project(n_name=[$0], n_nationkey=[$1])
00-02        Project(n_name=[$1], n_nationkey=[$0])
00-03          HashJoin(condition=[=($2, $3)], joinType=[inner])
00-05            Project(n_nationkey=[$1], n_name=[$0], f2=[$1])
00-07              Scan(groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]],
selectionRoot=classpath:/tpch/nation.parquet, numFiles=1,
columns=[`n_nationkey`, `n_name`]]])
00-04            HashAgg(group=[{0}])
00-06              Values

explain plan for select n_name, n_nationkey from cp.`tpch/nation.parquet`
where n_nationkey in (1, 2, 3, 4, 5);
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(n_name=[$1], n_nationkey=[$0])
00-02        SelectionVectorRemover
00-03          Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3), =($0, 4),
=($0, 5))])
00-04            Project(n_nationkey=[$1], n_name=[$0])
00-05              Scan(groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]],
selectionRoot=classpath:/tpch/nation.parquet, numFiles=1,
columns=[`n_nationkey`, `n_name`]]])

The first one has > 20 values in the list, which will convert to JOIN
VALUES. The second one
has < 20 values, which will be converted to ORed = predicates.

 Please let me if the workaround makes any difference.

Reply via email to