[jira] [Updated] (HIVE-9647) Discrepancy in cardinality estimates between partitioned and un-partitioned tables

2015-04-07 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-9647:
--
Attachment: HIVE-9647.03.patch

Address the test failures. Thanks to [~thejas] for the help.

> Discrepancy in cardinality estimates between partitioned and un-partitioned 
> tables 
> ---
>
> Key: HIVE-9647
> URL: https://issues.apache.org/jira/browse/HIVE-9647
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 0.14.0
>Reporter: Mostafa Mokhtar
>Assignee: Pengcheng Xiong
> Fix For: 1.2.0
>
> Attachments: HIVE-9647.01.patch, HIVE-9647.02.patch, 
> HIVE-9647.03.patch
>
>
> High-level summary
> HiveRelMdSelectivity.computeInnerJoinSelectivity relies on the per-column number 
> of distinct values (NDV) to estimate join selectivity.
> The way statistics are aggregated for partitioned tables produces a 
> discrepancy in the number of distinct values, which leads to different plans 
> between partitioned and un-partitioned schemas.
> The table below summarizes the NDVs in computeInnerJoinSelectivity which are 
> used to estimate selectivity of joins.
> ||Column||Partitioned count distincts||Un-Partitioned count distincts||
> |sr_customer_sk   |71,245 |1,415,625|
> |sr_item_sk   |38,846|62,562|
> |sr_ticket_number |71,245 |34,931,085|
> |ss_customer_sk   |88,476|1,415,625|
> |ss_item_sk   |38,846|62,562|
> |ss_ticket_number|100,756 |56,256,175|
>   
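To see why this gap matters, the following sketch applies the textbook equi-join selectivity formula, 1 / max(NDV_left, NDV_right) per join key, to the NDVs in the table above. This is an illustration, not Hive's actual computeInnerJoinSelectivity code, and the row counts are taken from the plans quoted below.

```python
# Illustrative only -- NOT Hive's actual implementation. Shows how the
# per-column NDV gap in the table above shifts join cardinality estimates.

def join_selectivity(ndv_pairs):
    """ndv_pairs: list of (ndv_left, ndv_right), one per equi-join key.
    Textbook rule: each key contributes 1 / max(ndv_left, ndv_right)."""
    sel = 1.0
    for ndv_l, ndv_r in ndv_pairs:
        sel *= 1.0 / max(ndv_l, ndv_r)
    return sel

def join_rowcount(rows_left, rows_right, ndv_pairs):
    """Estimated output rows of an inner join = |L| * |R| * selectivity."""
    return rows_left * rows_right * join_selectivity(ndv_pairs)

# store_sales JOIN store_returns on (customer_sk, item_sk, ticket_number),
# with table cardinalities from the plan dump and NDVs from the table above.
ROWS_SS, ROWS_SR = 550_076_554, 55_578_005

partitioned_est = join_rowcount(
    ROWS_SS, ROWS_SR,
    [(88_476, 71_245), (38_846, 38_846), (100_756, 71_245)])

unpartitioned_est = join_rowcount(
    ROWS_SS, ROWS_SR,
    [(1_415_625, 1_415_625), (62_562, 62_562), (56_256_175, 34_931_085)])

# The deflated partitioned NDVs inflate the join estimate by several
# orders of magnitude relative to the un-partitioned schema.
```

Because selectivity divides by the NDVs, the too-small partitioned NDVs make the join look far less selective, which is exactly the kind of divergence visible in the plan row counts below.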
> The discrepancy arises because the NDV calculation for a partitioned table 
> assumes that the NDV range is contained within each partition, and is 
> calculated as "select max(NUM_DISTINCTS) from PART_COL_STATS".
> This is problematic for columns like ticket number which are naturally 
> increasing with the partitioned date column ss_sold_date_sk.
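A small synthetic sketch of the failure mode: when a column's values grow monotonically with the partition key (as ticket numbers do with the date partition), each partition holds a disjoint value range, so taking the max of per-partition NDVs undercounts badly. The numbers here are made up for illustration.

```python
# Sketch (synthetic data): why max(per-partition NDV) underestimates the
# table-level NDV for a column that increases with the partition key.

# 200 partitions; partition i holds ticket numbers [i*1000, (i+1)*1000),
# i.e. every partition has a disjoint range of 1000 distinct values.
partitions = [range(i * 1000, (i + 1) * 1000) for i in range(200)]

per_partition_ndv = [len(set(p)) for p in partitions]  # 1000 in each
aggregated_ndv = max(per_partition_ndv)                # what Hive computes
true_ndv = len(set().union(*partitions))               # actual distinct count

# max() is only correct when every partition spans the same value range;
# here the estimate is off by a factor equal to the partition count.
```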
> Suggestions
> Use HyperLogLog, as suggested by Gopal; there is an HLL implementation for 
> HBase coprocessors which we can use as a reference here.
> Using the global stats from TAB_COL_STATS and the per-partition stats from 
> PART_COL_STATS, extrapolate the NDV for the qualified partitions as in:
> Max( (NUM_DISTINCTS from TAB_COL_STATS) x (Number of qualified partitions) / 
> (Number of partitions), max(NUM_DISTINCTS from PART_COL_STATS) )
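The suggested extrapolation can be written out as a small function. This is a direct transcription of the formula above, not code from the attached patches; the argument names are chosen for illustration.

```python
def extrapolated_ndv(table_ndv, num_qualified, num_partitions,
                     per_partition_ndvs):
    """Heuristic from the description: scale the table-level NDV by the
    fraction of partitions that survive pruning, but never return less
    than the largest per-partition NDV (a hard lower bound)."""
    scaled = table_ndv * num_qualified / num_partitions
    return max(scaled, max(per_partition_ndvs))
```

When all partitions qualify this recovers the global NDV; when only a few qualify, the per-partition max keeps the estimate from collapsing toward zero.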
> More details
> While doing TPC-DS partitioned vs. un-partitioned runs, I noticed that many 
> of the plans were different, so I dumped the CBO logical plan and found that 
> the join estimates are drastically different.
> Unpartitioned schema :
> {code}
> 2015-02-10 11:33:27,624 DEBUG [main]: parse.SemanticAnalyzer 
> (SemanticAnalyzer.java:apply(12624)) - Plan After Join Reordering:
> HiveProjectRel(store_sales_quantitycount=[$0], store_sales_quantityave=[$1], 
> store_sales_quantitystdev=[$2], store_sales_quantitycov=[/($2, $1)], 
> as_store_returns_quantitycount=[$3], as_store_returns_quantityave=[$4], 
> as_store_returns_quantitystdev=[$5], store_returns_quantitycov=[/($5, $4)]): 
> rowcount = 1.0, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 
> io}, id = 2956
>   HiveAggregateRel(group=[{}], agg#0=[count($0)], agg#1=[avg($0)], 
> agg#2=[stddev_samp($0)], agg#3=[count($1)], agg#4=[avg($1)], 
> agg#5=[stddev_samp($1)]): rowcount = 1.0, cumulative cost = 
> {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2954
> HiveProjectRel($f0=[$4], $f1=[$8]): rowcount = 40.05611776795562, 
> cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2952
>   HiveProjectRel(ss_sold_date_sk=[$0], ss_item_sk=[$1], 
> ss_customer_sk=[$2], ss_ticket_number=[$3], ss_quantity=[$4], 
> sr_item_sk=[$5], sr_customer_sk=[$6], sr_ticket_number=[$7], 
> sr_return_quantity=[$8], d_date_sk=[$9], d_quarter_name=[$10]): rowcount = 
> 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 
> io}, id = 2982
> HiveJoinRel(condition=[=($9, $0)], joinType=[inner]): rowcount = 
> 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 
> io}, id = 2980
>   HiveJoinRel(condition=[AND(AND(=($2, $6), =($1, $5)), =($3, $7))], 
> joinType=[inner]): rowcount = 28880.460910696, cumulative cost = 
> {6.05654559E8 rows, 0.0 cpu, 0.0 io}, id = 2964
> HiveProjectRel(ss_sold_date_sk=[$0], ss_item_sk=[$2], 
> ss_customer_sk=[$3], ss_ticket_number=[$9], ss_quantity=[$10]): rowcount = 
> 5.50076554E8, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 2920
>   HiveTableScanRel(table=[[tpcds_bin_orc_200.store_sales]]): 
> rowcount = 5.50076554E8, cumulative cost = {0}, id = 2822
> HiveProjectRel(sr_item_sk=[$2], sr_customer_sk=[$3], 
> sr_ticket_number=[$9], sr_return_quantity=[$10]): rowcount = 5.5578005E7, 
> cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 29

[jira] [Updated] (HIVE-9647) Discrepancy in cardinality estimates between partitioned and un-partitioned tables

2015-04-06 Thread Pengcheng Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pengcheng Xiong updated HIVE-9647:
--
Attachment: HIVE-9647.02.patch

Address the metastore configuration flag.

> (Issue description quoted above.)