[jira] [Commented] (HIVE-9647) Discrepancy in cardinality estimates between partitioned and un-partitioned tables

2015-04-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488189#comment-14488189
 ] 

Ashutosh Chauhan commented on HIVE-9647:


+1

 Discrepancy in cardinality estimates between partitioned and un-partitioned 
 tables 
 ---

 Key: HIVE-9647
 URL: https://issues.apache.org/jira/browse/HIVE-9647
 Project: Hive
  Issue Type: Bug
  Components: CBO
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Pengcheng Xiong
 Fix For: 1.2.0

 Attachments: HIVE-9647.01.patch, HIVE-9647.02.patch, 
 HIVE-9647.03.patch


 High-level summary
 HiveRelMdSelectivity.computeInnerJoinSelectivity relies on per column number 
 of distinct value to estimate join selectivity.
 The way statistics are aggregated for partitioned tables results in 
 discrepancy in number of distinct values which results in different plans 
 between partitioned and un-partitioned schemas.
 The table below summarizes the NDVs in computeInnerJoinSelectivity which are 
 used to estimate selectivity of joins.
 ||Column  ||Partitioned count distincts|| Un-Partitioned count 
 distincts 
 |sr_customer_sk   |71,245 |1,415,625|
 |sr_item_sk   |38,846|62,562|
 |sr_ticket_number |71,245 |34,931,085|
 |ss_customer_sk   |88,476|1,415,625|
 |ss_item_sk   |38,846|62,562|
 |ss_ticket_number|100,756 |56,256,175|
   
 The discrepancy is because NDV calculation for a partitioned table assumes 
 that the NDV range is contained within each partition and is calculates as 
 select max(NUM_DISTINCTS) from PART_COL_STATS” .
 This is problematic for columns like ticket number which are naturally 
 increasing with the partitioned date column ss_sold_date_sk.
 Suggestions
 Use Hyper Log Log as suggested by Gopal, there is an HLL implementation for 
 HBASE co-porccessors which we can use as a reference here 
 Using the global stats from TAB_COL_STATS and the per partition stats from 
 PART_COL_STATS extrapolate the NDV for the qualified partitions as in :
 Max ( (NUM_DISTINCTS from TAB_COL_STATS) x (Number of qualified partitions) / 
 (Number of Partitions), max(NUM_DISTINCTS) from PART_COL_STATS))
 More details
 While doing TPC-DS Partitioned vs. Un-Partitioned runs I noticed that many of 
 the plans are different, then I dumped the CBO logical plan and I found that 
 join estimates are drastically different
 Unpartitioned schema :
 {code}
 2015-02-10 11:33:27,624 DEBUG [main]: parse.SemanticAnalyzer 
 (SemanticAnalyzer.java:apply(12624)) - Plan After Join Reordering:
 HiveProjectRel(store_sales_quantitycount=[$0], store_sales_quantityave=[$1], 
 store_sales_quantitystdev=[$2], store_sales_quantitycov=[/($2, $1)], 
 as_store_returns_quantitycount=[$3], as_store_returns_quantityave=[$4], 
 as_store_returns_quantitystdev=[$5], store_returns_quantitycov=[/($5, $4)]): 
 rowcount = 1.0, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 
 io}, id = 2956
   HiveAggregateRel(group=[{}], agg#0=[count($0)], agg#1=[avg($0)], 
 agg#2=[stddev_samp($0)], agg#3=[count($1)], agg#4=[avg($1)], 
 agg#5=[stddev_samp($1)]): rowcount = 1.0, cumulative cost = 
 {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2954
 HiveProjectRel($f0=[$4], $f1=[$8]): rowcount = 40.05611776795562, 
 cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2952
   HiveProjectRel(ss_sold_date_sk=[$0], ss_item_sk=[$1], 
 ss_customer_sk=[$2], ss_ticket_number=[$3], ss_quantity=[$4], 
 sr_item_sk=[$5], sr_customer_sk=[$6], sr_ticket_number=[$7], 
 sr_return_quantity=[$8], d_date_sk=[$9], d_quarter_name=[$10]): rowcount = 
 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 
 io}, id = 2982
 HiveJoinRel(condition=[=($9, $0)], joinType=[inner]): rowcount = 
 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 
 io}, id = 2980
   HiveJoinRel(condition=[AND(AND(=($2, $6), =($1, $5)), =($3, $7))], 
 joinType=[inner]): rowcount = 28880.460910696, cumulative cost = 
 {6.05654559E8 rows, 0.0 cpu, 0.0 io}, id = 2964
 HiveProjectRel(ss_sold_date_sk=[$0], ss_item_sk=[$2], 
 ss_customer_sk=[$3], ss_ticket_number=[$9], ss_quantity=[$10]): rowcount = 
 5.50076554E8, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 2920
   HiveTableScanRel(table=[[tpcds_bin_orc_200.store_sales]]): 
 rowcount = 5.50076554E8, cumulative cost = {0}, id = 2822
 HiveProjectRel(sr_item_sk=[$2], sr_customer_sk=[$3], 
 sr_ticket_number=[$9], sr_return_quantity=[$10]): rowcount = 5.5578005E7, 
 cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 2923
   HiveTableScanRel(table=[[tpcds_bin_orc_200.store_returns]]): 
 rowcount = 5.5578005E7, cumulative 

[jira] [Commented] (HIVE-9647) Discrepancy in cardinality estimates between partitioned and un-partitioned tables

2015-04-08 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486266#comment-14486266
 ] 

Hive QA commented on HIVE-9647:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12723797/HIVE-9647.03.patch

{color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 8666 tests 
executed
*Failed tests:*
{noformat}
TestMinimrCliDriver-bucketmapjoin6.q-constprog_partitioner.q-infer_bucket_sort_dyn_part.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-external_table_with_space_in_location_path.q-infer_bucket_sort_merge.q-auto_sortmerge_join_16.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-groupby2.q-import_exported_table.q-bucketizedhiveinputformat.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-index_bitmap3.q-stats_counter_partitioned.q-temp_table_external.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-infer_bucket_sort_map_operators.q-join1.q-bucketmapjoin7.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-infer_bucket_sort_num_buckets.q-disable_merge_for_bucketing.q-uber_reduce.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-infer_bucket_sort_reducers_power_two.q-scriptfile1.q-scriptfile1_win.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-leftsemijoin_mr.q-load_hdfs_file_with_space_in_the_name.q-root_dir_external_table.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-list_bucket_dml_10.q-bucket_num_reducers.q-bucket6.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-load_fs2.q-file_with_header_footer.q-ql_rewrite_gbtoidx_cbo_1.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-parallel_orderby.q-reduce_deduplicate.q-ql_rewrite_gbtoidx_cbo_2.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-ql_rewrite_gbtoidx.q-smb_mapjoin_8.q - did not produce a 
TEST-*.xml file
TestMinimrCliDriver-schemeAuthority2.q-bucket4.q-input16_cc.q-and-1-more - did 
not produce a TEST-*.xml file
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build//testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build//console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 14 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12723797 - PreCommit-HIVE-TRUNK-Build

 Discrepancy in cardinality estimates between partitioned and un-partitioned 
 tables 
 ---

 Key: HIVE-9647
 URL: https://issues.apache.org/jira/browse/HIVE-9647
 Project: Hive
  Issue Type: Bug
  Components: CBO
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Pengcheng Xiong
 Fix For: 1.2.0

 Attachments: HIVE-9647.01.patch, HIVE-9647.02.patch, 
 HIVE-9647.03.patch


 High-level summary
 HiveRelMdSelectivity.computeInnerJoinSelectivity relies on per column number 
 of distinct value to estimate join selectivity.
 The way statistics are aggregated for partitioned tables results in 
 discrepancy in number of distinct values which results in different plans 
 between partitioned and un-partitioned schemas.
 The table below summarizes the NDVs in computeInnerJoinSelectivity which are 
 used to estimate selectivity of joins.
 ||Column  ||Partitioned count distincts|| Un-Partitioned count 
 distincts 
 |sr_customer_sk   |71,245 |1,415,625|
 |sr_item_sk   |38,846|62,562|
 |sr_ticket_number |71,245 |34,931,085|
 |ss_customer_sk   |88,476|1,415,625|
 |ss_item_sk   |38,846|62,562|
 |ss_ticket_number|100,756 |56,256,175|
   
 The discrepancy is because NDV calculation for a partitioned table assumes 
 that the NDV range is contained within each partition and is calculates as 
 select max(NUM_DISTINCTS) from PART_COL_STATS” .
 This is problematic for columns like ticket number which are naturally 
 increasing with the partitioned date column ss_sold_date_sk.
 Suggestions
 Use Hyper Log Log as suggested by Gopal, there is an HLL implementation for 
 HBASE co-porccessors which we can use as a reference here 
 Using the global stats from TAB_COL_STATS and the per 

[jira] [Commented] (HIVE-9647) Discrepancy in cardinality estimates between partitioned and un-partitioned tables

2015-04-08 Thread Pengcheng Xiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486381#comment-14486381
 ] 

Pengcheng Xiong commented on HIVE-9647:
---

The failed test TestHCatClient is unrelated and it passed on my laptop. 
[~ashutoshc], could you please take a look? I think it is ready to go. Thanks.

 Discrepancy in cardinality estimates between partitioned and un-partitioned 
 tables 
 ---

 Key: HIVE-9647
 URL: https://issues.apache.org/jira/browse/HIVE-9647
 Project: Hive
  Issue Type: Bug
  Components: CBO
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Pengcheng Xiong
 Fix For: 1.2.0

 Attachments: HIVE-9647.01.patch, HIVE-9647.02.patch, 
 HIVE-9647.03.patch


 High-level summary
 HiveRelMdSelectivity.computeInnerJoinSelectivity relies on per column number 
 of distinct value to estimate join selectivity.
 The way statistics are aggregated for partitioned tables results in 
 discrepancy in number of distinct values which results in different plans 
 between partitioned and un-partitioned schemas.
 The table below summarizes the NDVs in computeInnerJoinSelectivity which are 
 used to estimate selectivity of joins.
 ||Column  ||Partitioned count distincts|| Un-Partitioned count 
 distincts 
 |sr_customer_sk   |71,245 |1,415,625|
 |sr_item_sk   |38,846|62,562|
 |sr_ticket_number |71,245 |34,931,085|
 |ss_customer_sk   |88,476|1,415,625|
 |ss_item_sk   |38,846|62,562|
 |ss_ticket_number|100,756 |56,256,175|
   
 The discrepancy is because NDV calculation for a partitioned table assumes 
 that the NDV range is contained within each partition and is calculates as 
 select max(NUM_DISTINCTS) from PART_COL_STATS” .
 This is problematic for columns like ticket number which are naturally 
 increasing with the partitioned date column ss_sold_date_sk.
 Suggestions
 Use Hyper Log Log as suggested by Gopal, there is an HLL implementation for 
 HBASE co-porccessors which we can use as a reference here 
 Using the global stats from TAB_COL_STATS and the per partition stats from 
 PART_COL_STATS extrapolate the NDV for the qualified partitions as in :
 Max ( (NUM_DISTINCTS from TAB_COL_STATS) x (Number of qualified partitions) / 
 (Number of Partitions), max(NUM_DISTINCTS) from PART_COL_STATS))
 More details
 While doing TPC-DS Partitioned vs. Un-Partitioned runs I noticed that many of 
 the plans are different, then I dumped the CBO logical plan and I found that 
 join estimates are drastically different
 Unpartitioned schema :
 {code}
 2015-02-10 11:33:27,624 DEBUG [main]: parse.SemanticAnalyzer 
 (SemanticAnalyzer.java:apply(12624)) - Plan After Join Reordering:
 HiveProjectRel(store_sales_quantitycount=[$0], store_sales_quantityave=[$1], 
 store_sales_quantitystdev=[$2], store_sales_quantitycov=[/($2, $1)], 
 as_store_returns_quantitycount=[$3], as_store_returns_quantityave=[$4], 
 as_store_returns_quantitystdev=[$5], store_returns_quantitycov=[/($5, $4)]): 
 rowcount = 1.0, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 
 io}, id = 2956
   HiveAggregateRel(group=[{}], agg#0=[count($0)], agg#1=[avg($0)], 
 agg#2=[stddev_samp($0)], agg#3=[count($1)], agg#4=[avg($1)], 
 agg#5=[stddev_samp($1)]): rowcount = 1.0, cumulative cost = 
 {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2954
 HiveProjectRel($f0=[$4], $f1=[$8]): rowcount = 40.05611776795562, 
 cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 io}, id = 2952
   HiveProjectRel(ss_sold_date_sk=[$0], ss_item_sk=[$1], 
 ss_customer_sk=[$2], ss_ticket_number=[$3], ss_quantity=[$4], 
 sr_item_sk=[$5], sr_customer_sk=[$6], sr_ticket_number=[$7], 
 sr_return_quantity=[$8], d_date_sk=[$9], d_quarter_name=[$10]): rowcount = 
 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 
 io}, id = 2982
 HiveJoinRel(condition=[=($9, $0)], joinType=[inner]): rowcount = 
 40.05611776795562, cumulative cost = {6.056835407771381E8 rows, 0.0 cpu, 0.0 
 io}, id = 2980
   HiveJoinRel(condition=[AND(AND(=($2, $6), =($1, $5)), =($3, $7))], 
 joinType=[inner]): rowcount = 28880.460910696, cumulative cost = 
 {6.05654559E8 rows, 0.0 cpu, 0.0 io}, id = 2964
 HiveProjectRel(ss_sold_date_sk=[$0], ss_item_sk=[$2], 
 ss_customer_sk=[$3], ss_ticket_number=[$9], ss_quantity=[$10]): rowcount = 
 5.50076554E8, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 2920
   HiveTableScanRel(table=[[tpcds_bin_orc_200.store_sales]]): 
 rowcount = 5.50076554E8, cumulative cost = {0}, id = 2822
 HiveProjectRel(sr_item_sk=[$2], sr_customer_sk=[$3], 
 sr_ticket_number=[$9], sr_return_quantity=[$10]): rowcount = 5.5578005E7, 
 cumulative cost = {0.0 

[jira] [Commented] (HIVE-9647) Discrepancy in cardinality estimates between partitioned and un-partitioned tables

2015-04-06 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482507#comment-14482507
 ] 

Hive QA commented on HIVE-9647:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12723440/HIVE-9647.02.patch

{color:red}ERROR:{color} -1 due to 15 failed/errored test(s), 8653 tests 
executed
*Failed tests:*
{noformat}
TestMinimrCliDriver-bucketmapjoin6.q-constprog_partitioner.q-infer_bucket_sort_dyn_part.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-external_table_with_space_in_location_path.q-infer_bucket_sort_merge.q-auto_sortmerge_join_16.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-groupby2.q-import_exported_table.q-bucketizedhiveinputformat.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-index_bitmap3.q-stats_counter_partitioned.q-temp_table_external.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-infer_bucket_sort_map_operators.q-join1.q-bucketmapjoin7.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-infer_bucket_sort_num_buckets.q-disable_merge_for_bucketing.q-uber_reduce.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-infer_bucket_sort_reducers_power_two.q-scriptfile1.q-scriptfile1_win.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-leftsemijoin_mr.q-bucket5.q-root_dir_external_table.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-list_bucket_dml_10.q-bucket_num_reducers.q-bucket6.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-load_fs2.q-file_with_header_footer.q-ql_rewrite_gbtoidx_cbo_1.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-ql_rewrite_gbtoidx.q-parallel_orderby.q-reduce_deduplicate.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-schemeAuthority2.q-infer_bucket_sort_bucketed_table.q-bucket4.q-and-1-more
 - did not produce a TEST-*.xml file
TestMinimrCliDriver-smb_mapjoin_8.q - did not produce a TEST-*.xml file
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.TestSQLStdHiveAccessControllerHS2.testConfigProcessing
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.TestSQLStdHiveAccessControllerHS2.testConfigProcessingCustomSetWhitelistAppend
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3301/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3301/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3301/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 15 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12723440 - PreCommit-HIVE-TRUNK-Build

 Discrepancy in cardinality estimates between partitioned and un-partitioned 
 tables 
 ---

 Key: HIVE-9647
 URL: https://issues.apache.org/jira/browse/HIVE-9647
 Project: Hive
  Issue Type: Bug
  Components: CBO
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Pengcheng Xiong
 Fix For: 1.2.0

 Attachments: HIVE-9647.01.patch, HIVE-9647.02.patch


 High-level summary
 HiveRelMdSelectivity.computeInnerJoinSelectivity relies on per column number 
 of distinct value to estimate join selectivity.
 The way statistics are aggregated for partitioned tables results in 
 discrepancy in number of distinct values which results in different plans 
 between partitioned and un-partitioned schemas.
 The table below summarizes the NDVs in computeInnerJoinSelectivity which are 
 used to estimate selectivity of joins.
 ||Column  ||Partitioned count distincts|| Un-Partitioned count 
 distincts 
 |sr_customer_sk   |71,245 |1,415,625|
 |sr_item_sk   |38,846|62,562|
 |sr_ticket_number |71,245 |34,931,085|
 |ss_customer_sk   |88,476|1,415,625|
 |ss_item_sk   |38,846|62,562|
 |ss_ticket_number|100,756 |56,256,175|
   
 The discrepancy is because NDV calculation for a partitioned table assumes 
 that the NDV range is contained within each partition and is calculates as 
 select max(NUM_DISTINCTS) from PART_COL_STATS” .
 This is problematic for columns like ticket number which are naturally 
 increasing with the partitioned date column ss_sold_date_sk.
 Suggestions
 Use Hyper Log Log as suggested by Gopal, there is an HLL implementation for 
 HBASE