This trick definitely helps. Not as fast as I want it to be, but it helps.

Sungwook


On Tue, Aug 25, 2015 at 4:25 PM, Jinfeng Ni <[email protected]> wrote:

> Hi Sungwook,
>
> I looked at the query profile.  It shows that the bottleneck is the
> Filter operator: running on the 4-node EC2 cluster (?), it took about
> 82 minutes to process approximately 2.5 billion records in total. I'm
> not sure how powerful each node's CPU is, but it seems the query is
> CPU-bound in the Filter operator.
>
> There is a way to improve the query.  As Vicky and Aman mentioned, we
> could convert the IN-list predicate to a JOIN against a VALUES operator.
> With that conversion, Drill builds a hash table for the IN list, and the
> IN-list predicate becomes a hash probe, which is much more efficient than
> doing multiple comparisons.
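>
> For illustration, the same filter could also be written by hand as a join
> against a VALUES list, which is roughly what the planner produces after the
> transformation (a sketch only; the exact VALUES table syntax may vary by
> Drill version, and the alias t(id) is just for illustration):
>
> select n.n_name, n.n_nationkey
> from cp.`tpch/nation.parquet` n
> join (values (1), (2), (3), (4), (5)) as t(id)
>   on n.n_nationkey = t.id;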
>
> The problem is that the planner currently applies this transformation
> only when the number of items in the IN list exceeds a threshold, which
> is set to 20.  Your query has fewer than 20 items, so it does not
> currently qualify for the transformation.
>
> A workaround is to repeat values in the IN list so that the count
> exceeds 20 (adding duplicates to the IN list does not change the query
> result).
>
> A long-term solution is to make the threshold configurable, so that
> users can tune it for this kind of transformation.  You may file a JIRA
> for this.
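>
> If the threshold became a session option, tuning it might look like the
> following (the option name planner.in_list_threshold is hypothetical here,
> purely to illustrate how such a knob would be used):
>
> -- hypothetical option name; not available in current Drill
> alter session set `planner.in_list_threshold` = 5;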
>
> For instance,
>
> explain plan for select n_name, n_nationkey from cp.`tpch/nation.parquet`
> where n_nationkey in (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1,1);
>
> 00-00    Screen
> 00-01      Project(n_name=[$0], n_nationkey=[$1])
> 00-02        Project(n_name=[$1], n_nationkey=[$0])
> 00-03          HashJoin(condition=[=($2, $3)], joinType=[inner])
> 00-05            Project(n_nationkey=[$1], n_name=[$0], f2=[$1])
> 00-07              Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]],
> selectionRoot=classpath:/tpch/nation.parquet, numFiles=1,
> columns=[`n_nationkey`, `n_name`]]])
> 00-04            HashAgg(group=[{0}])
> 00-06              Values
>
> explain plan for select n_name, n_nationkey from cp.`tpch/nation.parquet`
> where n_nationkey in (1, 2, 3, 4, 5);
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(n_name=[$1], n_nationkey=[$0])
> 00-02        SelectionVectorRemover
> 00-03          Filter(condition=[OR(=($0, 1), =($0, 2), =($0, 3), =($0, 4),
> =($0, 5))])
> 00-04            Project(n_nationkey=[$1], n_name=[$0])
> 00-05              Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]],
> selectionRoot=classpath:/tpch/nation.parquet, numFiles=1,
> columns=[`n_nationkey`, `n_name`]]])
>
> The first query has more than 20 values in the list, so it is converted
> to a JOIN with VALUES.  The second has fewer than 20 values, so it is
> converted to OR-ed equality predicates.
>
>  Please let me know if the workaround makes any difference.
>
