Yongzhi Chen created HIVE-12189:
-----------------------------------

             Summary: The list in pushdownPreds of ppd.ExprWalkerInfo should not be allowed to grow very large
                 Key: HIVE-12189
                 URL: https://issues.apache.org/jira/browse/HIVE-12189
             Project: Hive
          Issue Type: Bug
          Components: Logical Optimizer
    Affects Versions: 1.1.0, 2.0.0
            Reporter: Yongzhi Chen
            Assignee: Yongzhi Chen

Some queries are very slow to compile. For example, the following query
{noformat}
select * from tt1 nf
join tt2 a1 on (nf.col1 = a1.col1 and nf.hdp_databaseid = a1.hdp_databaseid)
join tt3 a2 on (a2.col2 = a1.col2 and a2.col3 = nf.col3 and a2.hdp_databaseid = nf.hdp_databaseid)
join tt4 a3 on (a3.col4 = a2.col4 and a3.col3 = a2.col3)
join tt5 a4 on (a4.col4 = a2.col4 and a4.col5 = a2.col5 and a4.col3 = a2.col3 and a4.hdp_databaseid = nf.hdp_databaseid)
join tt6 a5 on (a5.col3 = a2.col3 and a5.col2 = a2.col2 and a5.hdp_databaseid = nf.hdp_databaseid)
JOIN tt7 a6 ON (a2.col3 = a6.col3 and a2.col2 = a6.col2 and a6.hdp_databaseid = nf.hdp_databaseid)
JOIN tt8 a7 ON (a2.col3 = a7.col3 and a2.col2 = a7.col2 and a7.hdp_databaseid = nf.hdp_databaseid)
where nf.hdp_databaseid = 102 limit 10;
{noformat}
takes around 120 seconds to compile in Hive 1.1 when hive.mapred.mode=strict, hive.optimize.ppd=true, and Hive is not in test mode. Each of the tables above is partitioned on one column, and all of them are empty. When the tables are not empty, the compile is reportedly so slow that Hive appears to hang.

In Hive 2.0 the compile is much faster: explain takes 6.6 seconds. That is still a long time.

One of the problems that slows PPD down is that the list in pushdownPreds can grow very large, which makes extractPushdownPreds perform badly:
{noformat}
public static ExprWalkerInfo extractPushdownPreds(OpWalkerInfo opContext,
    Operator<? extends OperatorDesc> op, List<ExprNodeDesc> preds)
{noformat}
While running the query above, at a breakpoint in this method, preds has a size of 12051, and most entries in the list are:
GenericUDFOPEqual(Column[hdp_databaseid], Const int 102),
GenericUDFOPEqual(Column[hdp_databaseid], Const int 102),
GenericUDFOPEqual(Column[hdp_databaseid], Const int 102),
GenericUDFOPEqual(Column[hdp_databaseid], Const int 102),
....

The following code in extractPushdownPreds clones every node in preds and then walks the clones.
{noformat}
List<Node> startNodes = new ArrayList<Node>();
List<ExprNodeDesc> clonedPreds = new ArrayList<ExprNodeDesc>();
for (ExprNodeDesc node : preds) {
  ExprNodeDesc clone = node.clone();
  clonedPreds.add(clone);
  exprContext.getNewToOldExprMap().put(clone, node);
}
startNodes.addAll(clonedPreds);

egw.startWalking(startNodes, null);
{noformat}
Hive 2.0 is faster because HIVE-11652 makes startWalking much faster, but we still clone thousands of nodes that hold the same expression. Should we store so many identical predicates in the list, or is one copy enough?

Should we change the methods
public void addFinalCandidate(String alias, ExprNodeDesc expr)
and
public void addPushDowns(String alias, List<ExprNodeDesc> pushDowns)
in java/org/apache/hadoop/hive/ql/ppd/ExprWalkerInfo.java so that they add only expressions that are not already in the pushdown list for an alias?
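
To make the question concrete, here is a minimal, self-contained sketch of what "only add an expression that is not already in the list for an alias" could look like. It is not Hive code: Expr and PredicateStore are hypothetical stand-ins for ExprNodeDesc and ExprWalkerInfo, and the string-based isSame check is an assumption standing in for whatever structural comparison the real expression nodes support.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these types are Hive classes. */
class PushdownDedupSketch {

  /** Hypothetical stand-in for an expression node, compared structurally by its text. */
  static final class Expr {
    final String text;

    Expr(String text) {
      this.text = text;
    }

    boolean isSame(Expr other) {
      return other != null && text.equals(other.text);
    }
  }

  /** Hypothetical stand-in for the per-alias predicate lists. */
  static final class PredicateStore {
    private final Map<String, List<Expr>> pushdownPreds = new HashMap<>();

    /** Adds expr for alias unless a structurally identical predicate is already stored. */
    void addFinalCandidate(String alias, Expr expr) {
      List<Expr> preds = pushdownPreds.computeIfAbsent(alias, k -> new ArrayList<>());
      if (!contains(preds, expr)) {
        preds.add(expr);
      }
    }

    /** Adds each predicate in pushDowns, skipping ones already stored for alias. */
    void addPushDowns(String alias, List<Expr> pushDowns) {
      for (Expr expr : pushDowns) {
        addFinalCandidate(alias, expr);
      }
    }

    /** Returns the stored predicates for alias, or an empty list if there are none. */
    List<Expr> get(String alias) {
      return pushdownPreds.getOrDefault(alias, new ArrayList<>());
    }

    // Linear scan; it stays cheap because duplicates never enter the list.
    private static boolean contains(List<Expr> preds, Expr candidate) {
      for (Expr p : preds) {
        if (p.isSame(candidate)) {
          return true;
        }
      }
      return false;
    }
  }

  public static void main(String[] args) {
    PredicateStore store = new PredicateStore();
    // Simulate the same predicate being offered 12051 times for alias "nf".
    for (int i = 0; i < 12051; i++) {
      store.addFinalCandidate("nf", new Expr("hdp_databaseid = 102"));
    }
    // A batch holding one duplicate and one new (made-up) predicate.
    List<Expr> batch = new ArrayList<>();
    batch.add(new Expr("hdp_databaseid = 102"));
    batch.add(new Expr("col1 = 5"));
    store.addPushDowns("nf", batch);
    System.out.println(store.get("nf").size()); // prints 2
  }
}
```

With this kind of check, the list handed to extractPushdownPreds would hold one entry per distinct predicate, so the clone loop would run once per distinct predicate instead of once per duplicate. Whether dropping duplicates is safe everywhere these lists are used in Hive is something the sketch does not answer.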