Stamatis Zampetakis created HIVE-23684: ------------------------------------------
Summary: Large underestimation in NDV stats when input and join cardinality ratio is big Key: HIVE-23684 URL: https://issues.apache.org/jira/browse/HIVE-23684 Project: Hive Issue Type: Bug Reporter: Stamatis Zampetakis Assignee: Stamatis Zampetakis Large underestimations of NDV values may occur after a join operation since the current logic will decrease the original NDV values proportionally. The [code|https://github.com/apache/hive/blob/1271d08a3c51c021fa710449f8748b8cdb12b70f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2558] compares the number of rows of each relation before the join with the number of rows after the join and extracts a ratio for each side. Based on this ratio it adapts (reduces) the NDV accordingly. Consider for instance the following query: {code:sql} select inv_warehouse_sk , inv_item_sk , stddev_samp(inv_quantity_on_hand) stdev , avg(inv_quantity_on_hand) mean from inventory , date_dim where inv_date_sk = d_date_sk and d_year = 1999 and d_moy = 2 group by inv_warehouse_sk, inv_item_sk; {code} For the sake of the discussion, I outline below some relevant stats (from TPCDS30tb): T(inventory) = 1627857000 T(date_dim) = 73049 T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000 V(inventory, inv_date_sk) = 261 V(inventory, inv_item_sk) = 420000 V(inventory, inv_warehouse_sk) = 27 V(date_dim, inv, d_date_sk) = 73049 For instance, in this query the join between inventory and date_dim has ~24M rows while inventory has ~1.5B so the NDV of the columns coming from inventory are reduced by a factor of ~100 so we end up with V(JOIN, inv_item_sk) = ~6K while the real one is 231000. -- This message was sent by Atlassian Jira (v8.3.4#803005)