Rajesh Balamohan created HIVE-23788:
---------------------------------------

             Summary: FilterStatsRule misestimate causes hashtable computation 
to rehash often
                 Key: HIVE-23788
                 URL: https://issues.apache.org/jira/browse/HIVE-23788
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan


Depending on the available statistics, FilterStatsRule sometimes estimates the 
row count as numRows/3. This causes a lower keyCount to be projected for the 
hashtable computation, which in turn leads to frequent rehashing.
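
To illustrate why an underestimated keyCount matters, here is a minimal sketch 
of a generic doubling hashtable that is pre-sized from the estimate. It is not 
Hive's actual hashtable code; the class name, method names, and the 0.75 load 
factor are assumptions made for illustration only.

```java
// Hypothetical sketch: counts how many times a doubling hashtable must
// resize (and therefore rehash) when it is pre-sized from an estimated
// key count but ends up receiving actualKeys entries.
public final class RehashSketch {

    // Smallest power of two that is >= n (for n >= 1).
    static long nextPowerOfTwo(long n) {
        long cap = 1;
        while (cap < n) {
            cap <<= 1;
        }
        return cap;
    }

    // Initial capacity is chosen so that estimatedKeys fits under the load
    // factor; each resize doubles the capacity and rehashes all entries.
    static int countResizes(long estimatedKeys, long actualKeys, double loadFactor) {
        long capacity = nextPowerOfTwo((long) Math.ceil(estimatedKeys / loadFactor));
        int resizes = 0;
        while (actualKeys > capacity * loadFactor) {
            capacity <<= 1;
            resizes++;
        }
        return resizes;
    }

    public static void main(String[] args) {
        long actual = 3_000_000L;   // illustrative number, not from Q74
        double loadFactor = 0.75;   // assumed load factor
        System.out.println("estimate = numRows/3: "
            + countResizes(actual / 3, actual, loadFactor) + " resize(s)");
        System.out.println("estimate = numRows:   "
            + countResizes(actual, actual, loadFactor) + " resize(s)");
    }
}
```

With an accurate estimate the table never has to grow in this model, while a 
numRows/3 estimate forces at least one full rehash of every entry inserted so 
far. The exact count depends on how the real implementation rounds capacities.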

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L952]

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L1192]

E.g. TPCDS Q74 @ 10TB. While evaluating the conditions 
"t_s_firstyear.year_total > 0, t_w_secyear.year_total / 
t_w_firstyear.year_total , t_s_secyear.year_total / t_s_firstyear.year_total ", 
it projects 1/3rd of the rows, which causes the hashtable to be rehashed in the 
downstream vertex.

We may need to check whether stats can be projected correctly for these columns.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
