[jira] [Commented] (HIVE-9647) Discrepancy in cardinality estimates between partitioned and un-partitioned tables
[ https://issues.apache.org/jira/browse/HIVE-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488189#comment-14488189 ] Ashutosh Chauhan commented on HIVE-9647: +1 Discrepancy in cardinality estimates between partitioned and un-partitioned tables --- Key: HIVE-9647 URL: https://issues.apache.org/jira/browse/HIVE-9647 Project: Hive Issue Type: Bug Components: CBO Affects Versions: 0.14.0 Reporter: Mostafa Mokhtar Assignee: Pengcheng Xiong Fix For: 1.2.0 Attachments: HIVE-9647.01.patch, HIVE-9647.02.patch, HIVE-9647.03.patch High-level summary HiveRelMdSelectivity.computeInnerJoinSelectivity relies on per column number of distinct value to estimate join selectivity. The way statistics are aggregated for partitioned tables results in discrepancy in number of distinct values which results in different plans between partitioned and un-partitioned schemas. The table below summarizes the NDVs in computeInnerJoinSelectivity which are used to estimate selectivity of joins. ||Column ||Partitioned count distincts|| Un-Partitioned count distincts |sr_customer_sk |71,245 |1,415,625| |sr_item_sk |38,846|62,562| |sr_ticket_number |71,245 |34,931,085| |ss_customer_sk |88,476|1,415,625| |ss_item_sk |38,846|62,562| |ss_ticket_number|100,756 |56,256,175| The discrepancy is because NDV calculation for a partitioned table assumes that the NDV range is contained within each partition and is calculates as select max(NUM_DISTINCTS) from PART_COL_STATS” . This is problematic for columns like ticket number which are naturally increasing with the partitioned date column ss_sold_date_sk. Suggestions Use Hyper Log Log as suggested by Gopal, there is an HLL implementation for HBASE co-porccessors which we can use as a reference here Using the global stats from TAB_COL_STATS and the per partition stats from PART_COL_STATS extrapolate the NDV for the qualified partitions as in : Max ( (NUM_DISTINCTS from TAB_COL_STATS) x (Number of qualified partitions) / (Number of Partitions), max(NUM_DISTINCTS) from PART_COL_STATS)) More details While doing TPC-DS Partitioned vs. Un-Partitioned runs I noticed that many of the plans are different, then I dumped the CBO logical plan and I found that join estimates are drastically different Unpartitioned schema : {code} 2015-02-10 11:33:27,624 DEBUG [main]: parse.SemanticAnalyzer (SemanticAnalyzer.java:apply(12624)) - Plan After Join Reordering: HiveProjectRel(store_sales_quantitycount=[$0], store_sales_quantityave=[$1], store_sales_quantitystdev=[$2], store_sales_quantitycov=[/($2, $1)], as_store_returns_quantitycount=[$3], as_store_returns_quantityave=[$4], as_store_returns_quantitystdev=[$5], store_returns_quantitycov=[/($5, $4)]): rowcount = 1.0, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2956 HiveAggregateRel(group=[{}], agg#0=[count($0)], agg#1=[avg($0)], agg#2=[stddev_samp($0)], agg#3=[count($1)], agg#4=[avg($1)], agg#5=[stddev_samp($1)]): rowcount = 1.0, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2954 HiveProjectRel($f0=[$4], $f1=[$8]): rowcount = 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2952 HiveProjectRel(ss_sold_date_sk=[$0], ss_item_sk=[$1], ss_customer_sk=[$2], ss_ticket_number=[$3], ss_quantity=[$4], sr_item_sk=[$5], sr_customer_sk=[$6], sr_ticket_number=[$7], sr_return_quantity=[$8], d_date_sk=[$9], d_quarter_name=[$10]): rowcount = 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2982 HiveJoinRel(condition=[=($9, $0)], joinType=[inner]): rowcount = 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2980 HiveJoinRel(condition=[AND(AND(=($2, $6), =($1, $5)), =($3, $7))], joinType=[inner]): rowcount = 28880.460910696, cumulative cost = {6.05654559E8 rows, 0.0 cpu, 0.0 io}, id = 2964 HiveProjectRel(ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_customer_sk=[$3], ss_ticket_number=[$9], ss_quantity=[$10]): rowcount = 5.50076554E8, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 2920 HiveTableScanRel(table=[[tpcds_bin_orc_200.store_sales]]): rowcount = 5.50076554E8, cumulative cost = {0}, id = 2822 HiveProjectRel(sr_item_sk=[$2], sr_customer_sk=[$3], sr_ticket_number=[$9], sr_return_quantity=[$10]): rowcount = 5.5578005E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 2923 HiveTableScanRel(table=[[tpcds_bin_orc_200.store_returns]]): rowcount = 5.5578005E7, cumulative
[jira] [Commented] (HIVE-9647) Discrepancy in cardinality estimates between partitioned and un-partitioned tables
[ https://issues.apache.org/jira/browse/HIVE-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486266#comment-14486266 ] Hive QA commented on HIVE-9647: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12723797/HIVE-9647.03.patch {color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 8666 tests executed *Failed tests:* {noformat} TestMinimrCliDriver-bucketmapjoin6.q-constprog_partitioner.q-infer_bucket_sort_dyn_part.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-external_table_with_space_in_location_path.q-infer_bucket_sort_merge.q-auto_sortmerge_join_16.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-groupby2.q-import_exported_table.q-bucketizedhiveinputformat.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-index_bitmap3.q-stats_counter_partitioned.q-temp_table_external.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_map_operators.q-join1.q-bucketmapjoin7.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_num_buckets.q-disable_merge_for_bucketing.q-uber_reduce.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_reducers_power_two.q-scriptfile1.q-scriptfile1_win.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-leftsemijoin_mr.q-load_hdfs_file_with_space_in_the_name.q-root_dir_external_table.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-list_bucket_dml_10.q-bucket_num_reducers.q-bucket6.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-load_fs2.q-file_with_header_footer.q-ql_rewrite_gbtoidx_cbo_1.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-parallel_orderby.q-reduce_deduplicate.q-ql_rewrite_gbtoidx_cbo_2.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-ql_rewrite_gbtoidx.q-smb_mapjoin_8.q - did not produce a TEST-*.xml file TestMinimrCliDriver-schemeAuthority2.q-bucket4.q-input16_cc.q-and-1-more - did not produce a TEST-*.xml file org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema {noformat} Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build//testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build//console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-/ Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 14 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12723797 - PreCommit-HIVE-TRUNK-Build Discrepancy in cardinality estimates between partitioned and un-partitioned tables --- Key: HIVE-9647 URL: https://issues.apache.org/jira/browse/HIVE-9647 Project: Hive Issue Type: Bug Components: CBO Affects Versions: 0.14.0 Reporter: Mostafa Mokhtar Assignee: Pengcheng Xiong Fix For: 1.2.0 Attachments: HIVE-9647.01.patch, HIVE-9647.02.patch, HIVE-9647.03.patch High-level summary HiveRelMdSelectivity.computeInnerJoinSelectivity relies on per column number of distinct value to estimate join selectivity. The way statistics are aggregated for partitioned tables results in discrepancy in number of distinct values which results in different plans between partitioned and un-partitioned schemas. The table below summarizes the NDVs in computeInnerJoinSelectivity which are used to estimate selectivity of joins. ||Column ||Partitioned count distincts|| Un-Partitioned count distincts |sr_customer_sk |71,245 |1,415,625| |sr_item_sk |38,846|62,562| |sr_ticket_number |71,245 |34,931,085| |ss_customer_sk |88,476|1,415,625| |ss_item_sk |38,846|62,562| |ss_ticket_number|100,756 |56,256,175| The discrepancy is because NDV calculation for a partitioned table assumes that the NDV range is contained within each partition and is calculates as select max(NUM_DISTINCTS) from PART_COL_STATS” . This is problematic for columns like ticket number which are naturally increasing with the partitioned date column ss_sold_date_sk. Suggestions Use Hyper Log Log as suggested by Gopal, there is an HLL implementation for HBASE co-porccessors which we can use as a reference here Using the global stats from TAB_COL_STATS and the per
[jira] [Commented] (HIVE-9647) Discrepancy in cardinality estimates between partitioned and un-partitioned tables
[ https://issues.apache.org/jira/browse/HIVE-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486381#comment-14486381 ] Pengcheng Xiong commented on HIVE-9647: --- The failed test TestHCatClient is unrelated and it passed on my laptop. [~ashutoshc], could you please take a look? I think it is ready to go. Thanks. Discrepancy in cardinality estimates between partitioned and un-partitioned tables --- Key: HIVE-9647 URL: https://issues.apache.org/jira/browse/HIVE-9647 Project: Hive Issue Type: Bug Components: CBO Affects Versions: 0.14.0 Reporter: Mostafa Mokhtar Assignee: Pengcheng Xiong Fix For: 1.2.0 Attachments: HIVE-9647.01.patch, HIVE-9647.02.patch, HIVE-9647.03.patch High-level summary HiveRelMdSelectivity.computeInnerJoinSelectivity relies on per column number of distinct value to estimate join selectivity. The way statistics are aggregated for partitioned tables results in discrepancy in number of distinct values which results in different plans between partitioned and un-partitioned schemas. The table below summarizes the NDVs in computeInnerJoinSelectivity which are used to estimate selectivity of joins. ||Column ||Partitioned count distincts|| Un-Partitioned count distincts |sr_customer_sk |71,245 |1,415,625| |sr_item_sk |38,846|62,562| |sr_ticket_number |71,245 |34,931,085| |ss_customer_sk |88,476|1,415,625| |ss_item_sk |38,846|62,562| |ss_ticket_number|100,756 |56,256,175| The discrepancy is because NDV calculation for a partitioned table assumes that the NDV range is contained within each partition and is calculates as select max(NUM_DISTINCTS) from PART_COL_STATS” . This is problematic for columns like ticket number which are naturally increasing with the partitioned date column ss_sold_date_sk. Suggestions Use Hyper Log Log as suggested by Gopal, there is an HLL implementation for HBASE co-porccessors which we can use as a reference here Using the global stats from TAB_COL_STATS and the per partition stats from PART_COL_STATS extrapolate the NDV for the qualified partitions as in : Max ( (NUM_DISTINCTS from TAB_COL_STATS) x (Number of qualified partitions) / (Number of Partitions), max(NUM_DISTINCTS) from PART_COL_STATS)) More details While doing TPC-DS Partitioned vs. Un-Partitioned runs I noticed that many of the plans are different, then I dumped the CBO logical plan and I found that join estimates are drastically different Unpartitioned schema : {code} 2015-02-10 11:33:27,624 DEBUG [main]: parse.SemanticAnalyzer (SemanticAnalyzer.java:apply(12624)) - Plan After Join Reordering: HiveProjectRel(store_sales_quantitycount=[$0], store_sales_quantityave=[$1], store_sales_quantitystdev=[$2], store_sales_quantitycov=[/($2, $1)], as_store_returns_quantitycount=[$3], as_store_returns_quantityave=[$4], as_store_returns_quantitystdev=[$5], store_returns_quantitycov=[/($5, $4)]): rowcount = 1.0, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2956 HiveAggregateRel(group=[{}], agg#0=[count($0)], agg#1=[avg($0)], agg#2=[stddev_samp($0)], agg#3=[count($1)], agg#4=[avg($1)], agg#5=[stddev_samp($1)]): rowcount = 1.0, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2954 HiveProjectRel($f0=[$4], $f1=[$8]): rowcount = 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2952 HiveProjectRel(ss_sold_date_sk=[$0], ss_item_sk=[$1], ss_customer_sk=[$2], ss_ticket_number=[$3], ss_quantity=[$4], sr_item_sk=[$5], sr_customer_sk=[$6], sr_ticket_number=[$7], sr_return_quantity=[$8], d_date_sk=[$9], d_quarter_name=[$10]): rowcount = 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2982 HiveJoinRel(condition=[=($9, $0)], joinType=[inner]): rowcount = 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2980 HiveJoinRel(condition=[AND(AND(=($2, $6), =($1, $5)), =($3, $7))], joinType=[inner]): rowcount = 28880.460910696, cumulative cost = {6.05654559E8 rows, 0.0 cpu, 0.0 io}, id = 2964 HiveProjectRel(ss_sold_date_sk=[$0], ss_item_sk=[$2], ss_customer_sk=[$3], ss_ticket_number=[$9], ss_quantity=[$10]): rowcount = 5.50076554E8, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 2920 HiveTableScanRel(table=[[tpcds_bin_orc_200.store_sales]]): rowcount = 5.50076554E8, cumulative cost = {0}, id = 2822 HiveProjectRel(sr_item_sk=[$2], sr_customer_sk=[$3], sr_ticket_number=[$9], sr_return_quantity=[$10]): rowcount = 5.5578005E7, cumulative cost = {0.0
[jira] [Commented] (HIVE-9647) Discrepancy in cardinality estimates between partitioned and un-partitioned tables
[ https://issues.apache.org/jira/browse/HIVE-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482507#comment-14482507 ] Hive QA commented on HIVE-9647: --- {color:red}Overall{color}: -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12723440/HIVE-9647.02.patch {color:red}ERROR:{color} -1 due to 15 failed/errored test(s), 8653 tests executed *Failed tests:* {noformat} TestMinimrCliDriver-bucketmapjoin6.q-constprog_partitioner.q-infer_bucket_sort_dyn_part.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-external_table_with_space_in_location_path.q-infer_bucket_sort_merge.q-auto_sortmerge_join_16.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-groupby2.q-import_exported_table.q-bucketizedhiveinputformat.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-index_bitmap3.q-stats_counter_partitioned.q-temp_table_external.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_map_operators.q-join1.q-bucketmapjoin7.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_num_buckets.q-disable_merge_for_bucketing.q-uber_reduce.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-infer_bucket_sort_reducers_power_two.q-scriptfile1.q-scriptfile1_win.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-leftsemijoin_mr.q-bucket5.q-root_dir_external_table.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-list_bucket_dml_10.q-bucket_num_reducers.q-bucket6.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-load_fs2.q-file_with_header_footer.q-ql_rewrite_gbtoidx_cbo_1.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-ql_rewrite_gbtoidx.q-parallel_orderby.q-reduce_deduplicate.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-schemeAuthority2.q-infer_bucket_sort_bucketed_table.q-bucket4.q-and-1-more - did not produce a TEST-*.xml file TestMinimrCliDriver-smb_mapjoin_8.q - did not produce a TEST-*.xml file org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.TestSQLStdHiveAccessControllerHS2.testConfigProcessing org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.TestSQLStdHiveAccessControllerHS2.testConfigProcessingCustomSetWhitelistAppend {noformat} Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3301/testReport Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3301/console Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3301/ Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 15 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12723440 - PreCommit-HIVE-TRUNK-Build Discrepancy in cardinality estimates between partitioned and un-partitioned tables --- Key: HIVE-9647 URL: https://issues.apache.org/jira/browse/HIVE-9647 Project: Hive Issue Type: Bug Components: CBO Affects Versions: 0.14.0 Reporter: Mostafa Mokhtar Assignee: Pengcheng Xiong Fix For: 1.2.0 Attachments: HIVE-9647.01.patch, HIVE-9647.02.patch High-level summary HiveRelMdSelectivity.computeInnerJoinSelectivity relies on per column number of distinct value to estimate join selectivity. The way statistics are aggregated for partitioned tables results in discrepancy in number of distinct values which results in different plans between partitioned and un-partitioned schemas. The table below summarizes the NDVs in computeInnerJoinSelectivity which are used to estimate selectivity of joins. ||Column ||Partitioned count distincts|| Un-Partitioned count distincts |sr_customer_sk |71,245 |1,415,625| |sr_item_sk |38,846|62,562| |sr_ticket_number |71,245 |34,931,085| |ss_customer_sk |88,476|1,415,625| |ss_item_sk |38,846|62,562| |ss_ticket_number|100,756 |56,256,175| The discrepancy is because NDV calculation for a partitioned table assumes that the NDV range is contained within each partition and is calculates as select max(NUM_DISTINCTS) from PART_COL_STATS” . This is problematic for columns like ticket number which are naturally increasing with the partitioned date column ss_sold_date_sk. Suggestions Use Hyper Log Log as suggested by Gopal, there is an HLL implementation for HBASE