Yongzhi Chen created HIVE-12189:
-----------------------------------

             Summary: The list in pushdownPreds of ppd.ExprWalkerInfo should 
not be allowed to grow very large
                 Key: HIVE-12189
                 URL: https://issues.apache.org/jira/browse/HIVE-12189
             Project: Hive
          Issue Type: Bug
          Components: Logical Optimizer
    Affects Versions: 1.1.0, 2.0.0
            Reporter: Yongzhi Chen
            Assignee: Yongzhi Chen


Some queries are very slow to compile; for example, the following query
{noformat}
select * from tt1 nf 
join tt2 a1 on (nf.col1 = a1.col1 and nf.hdp_databaseid = a1.hdp_databaseid) 
join tt3 a2 on        (a2.col2 = a1.col2 and a2.col3 = nf.col3 and 
a2.hdp_databaseid = nf.hdp_databaseid) 
join tt4 a3 on              (a3.col4 = a2.col4 and a3.col3 = a2.col3) 
join tt5 a4 on     (a4.col4 = a2.col4 and a4.col5 = a2.col5 and a4.col3 = 
a2.col3 and a4.hdp_databaseid = nf.hdp_databaseid) 
join tt6 a5 on              (a5.col3 = a2.col3 and a5.col2 = a2.col2 and 
a5.hdp_databaseid = nf.hdp_databaseid) 
JOIN tt7 a6 ON (a2.col3 = a6.col3 and a2.col2 = a6.col2 and a6.hdp_databaseid = 
nf.hdp_databaseid) 
JOIN tt8 a7 ON (a2.col3 = a7.col3 and a2.col2 = a7.col2 and a7.hdp_databaseid = 
nf.hdp_databaseid)
where nf.hdp_databaseid = 102 limit 10;
{noformat}
takes around 120 seconds to compile in Hive 1.1 when
hive.mapred.mode=strict;
hive.optimize.ppd=true;
and Hive is not in test mode.
All of the above tables are partitioned on a single column, and all of them 
are empty. If the tables are not empty, compilation is reportedly so slow 
that Hive appears to hang.
In Hive 2.0 compilation is much faster (explain takes 6.6 seconds), but that 
is still a lot of time. One of the problems that slows ppd down is that the 
list in pushdownPreds can grow very large, which gives extractPushdownPreds 
bad performance:
{noformat}
public static ExprWalkerInfo extractPushdownPreds(OpWalkerInfo opContext,
    Operator<? extends OperatorDesc> op, List<ExprNodeDesc> preds)
{noformat}
While running the query above, at the following breakpoint preds has a size 
of 12051, and most entries of the list are the same expression: 
GenericUDFOPEqual(Column[hdp_databaseid], Const int 102), 
GenericUDFOPEqual(Column[hdp_databaseid], Const int 102), 
GenericUDFOPEqual(Column[hdp_databaseid], Const int 102), ....
The following code in extractPushdownPreds clones every node in preds and 
then walks them. Hive 2.0 is faster because HIVE-11652 made startWalking 
much faster, but we still clone thousands of nodes carrying the same 
expression. Should we store that many identical predicates in the list, or 
is just one good enough?

{noformat}
    List<Node> startNodes = new ArrayList<Node>();
    List<ExprNodeDesc> clonedPreds = new ArrayList<ExprNodeDesc>();
    for (ExprNodeDesc node : preds) {
      ExprNodeDesc clone = node.clone();
      clonedPreds.add(clone);
      exprContext.getNewToOldExprMap().put(clone, node);
    }
    startNodes.addAll(clonedPreds);

    egw.startWalking(startNodes, null);

{noformat}
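One way to avoid cloning thousands of identical nodes would be to deduplicate preds before the clone loop above. The following is only a hypothetical sketch, not Hive code: plain Strings stand in for ExprNodeDesc (in Hive the dedup key could come from something like ExprNodeDesc.getExprString()), and PredDedup is an invented name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class PredDedup {
  // Drop exact repeats while keeping first-seen order; a LinkedHashSet
  // does both in one pass. Strings stand in for predicate expressions.
  static List<String> dedup(List<String> preds) {
    return new ArrayList<>(new LinkedHashSet<>(preds));
  }

  public static void main(String[] args) {
    List<String> preds = Arrays.asList(
        "GenericUDFOPEqual(Column[hdp_databaseid], Const int 102)",
        "GenericUDFOPEqual(Column[hdp_databaseid], Const int 102)",
        "GenericUDFOPEqual(Column[col1], Column[col1])");
    // Only the two distinct predicates survive
    System.out.println(dedup(preds).size());
  }
}
```

With the 12051-entry preds list above, a pass like this would shrink the list to a handful of distinct expressions before any cloning or walking happens.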

Should we change java/org/apache/hadoop/hive/ql/ppd/ExprWalkerInfo.java
methods
public void addFinalCandidate(String alias, ExprNodeDesc expr) 
and
public void addPushDowns(String alias, List<ExprNodeDesc> pushDowns) 

to only add an expr that is not already in the pushdown list for its alias?
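The proposed guard could look roughly like the sketch below. This is not the actual ExprWalkerInfo code: PushdownMap is an invented class, Strings stand in for ExprNodeDesc, and String equality stands in for whatever expression comparison Hive would use (e.g. isSame()).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PushdownMap {
  // alias -> candidate pushdown predicates, mirroring pushdownPreds
  private final Map<String, List<String>> pushdownPreds = new HashMap<>();

  void addFinalCandidate(String alias, String expr) {
    List<String> preds =
        pushdownPreds.computeIfAbsent(alias, k -> new ArrayList<>());
    // Proposed change: skip predicates already recorded for this alias
    if (!preds.contains(expr)) {
      preds.add(expr);
    }
  }

  void addPushDowns(String alias, List<String> pushDowns) {
    for (String expr : pushDowns) {
      addFinalCandidate(alias, expr);
    }
  }

  int size(String alias) {
    return pushdownPreds.getOrDefault(alias, Collections.emptyList()).size();
  }
}
```

Note that preds.contains() is a linear scan; if the per-alias lists can still grow long, a parallel HashSet of expression signatures per alias would keep the duplicate check O(1).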




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
