Stamatis Zampetakis created HIVE-23684:
------------------------------------------

             Summary: Large underestimation in NDV stats when input and join 
cardinality ratio is big
                 Key: HIVE-23684
                 URL: https://issues.apache.org/jira/browse/HIVE-23684
             Project: Hive
          Issue Type: Bug
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis


Large underestimations of NDV values may occur after a join operation since the 
current logic will decrease the original NDV values proportionally.

The 
[code|https://github.com/apache/hive/blob/1271d08a3c51c021fa710449f8748b8cdb12b70f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2558]
 compares the number of rows of each relation before the join with the number 
of rows after the join and extracts a ratio for each side. Based on this ratio 
it adapts (reduces) the NDV accordingly.

Consider for instance the following query:
{code:sql}
select inv_warehouse_sk
     , inv_item_sk
     , stddev_samp(inv_quantity_on_hand) stdev
     , avg(inv_quantity_on_hand)         mean
from inventory
   , date_dim
where inv_date_sk = d_date_sk
  and d_year = 1999
  and d_moy = 2
group by inv_warehouse_sk, inv_item_sk;
{code}
For the sake of the discussion, I outline below some relevant stats (from 
TPCDS30tb):
 T(inventory) = 1627857000
 T(date_dim) = 73049
 T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000
 V(inventory, inv_date_sk) = 261
 V(inventory, inv_item_sk) = 420000
 V(inventory, inv_warehouse_sk) = 27
 V(date_dim, inv, d_date_sk) = 73049

For instance, in this query the join between inventory and date_dim has ~24M 
rows while inventory has ~1.5B so the NDV of the columns coming from inventory 
are reduced by a factor of ~100 so we end up with V(JOIN, inv_item_sk) = ~6K 
while the real one is 231000.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to