[jira] [Commented] (HIVE-19586) Optimize Count(distinct X) pushdown based on the storage capabilities
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499334#comment-16499334 ] Hive QA commented on HIVE-19586:

Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12925917/HIVE-19586.6.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.
{color:green}SUCCESS:{color} +1 due to 14443 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/11465/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11465/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11465/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12925917 - PreCommit-HIVE-Build

> Optimize Count(distinct X) pushdown based on the storage capabilities
> ----------------------------------------------------------------------
>
>          Key: HIVE-19586
>          URL: https://issues.apache.org/jira/browse/HIVE-19586
>      Project: Hive
>   Issue Type: Improvement
>   Components: Druid integration, Logical Optimizer
>     Reporter: slim bouguerra
>     Assignee: slim bouguerra
>     Priority: Major
>  Attachments: HIVE-19586.2.patch, HIVE-19586.3.patch, HIVE-19586.3.patch, HIVE-19586.4.patch, HIVE-19586.5.patch, HIVE-19586.6.patch, HIVE-19586.patch
>
> h1. Goal
> Provide a way to rewrite queries that combine COUNT(DISTINCT) with other aggregates, such as SUM, as a series of GROUP BYs.
> This can be useful for pushing down to Druid queries like
> {code}
> select count(DISTINCT interval_marker), count(distinct dim), sum(num_l)
> FROM druid_test_table GROUP BY `__time`, `zone`;
> {code}
> In general it is useful in cases where storage handlers cannot perform count(distinct column).
> h1. How to do it
> Use the Calcite rule {code}org.apache.calcite.rel.rules.AggregateExpandDistinctAggregatesRule{code}, which breaks a distinct count down into either a single GROUP BY with grouping sets, or a series of GROUP BYs that may be linked with joins when multiple distinct counts are present.
> FYI, today Hive does have a similar rule, {code}org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveExpandDistinctAggregatesRule{code}, but it only rewrites to a grouping-sets-based plan.
> I am planning to use the actual Calcite rule. [~ashutoshc], any concerns or caveats to be aware of?
> h2. Concerns/questions
> We need a way to switch between grouping sets and a simple chained GROUP BY based on plan cost. For instance, for a Druid-based scan it always makes sense (at least today) to push down a series of GROUP BYs and stitch the result sets together in Hive later (as opposed to scanning everything).
> But this might not be true for other storage handlers: for one that can handle grouping sets, it is better to push down the grouping sets as a single table scan.
> I am still unsure how I can lean on the cost optimizer to select the best plan; [~ashutoshc]/[~jcamachorodriguez], any inputs?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
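The core of the proposed rewrite — turning count(DISTINCT x) plus an ordinary aggregate into two chained GROUP BYs — can be sketched outside of Hive. Below is a minimal, illustrative Python model: the table layout and names are invented for the example, and it covers only the single-distinct-column case (multiple distinct columns are what force the join or grouping-sets variants the description mentions).

```python
from collections import defaultdict

def direct(rows, key, distinct_col, sum_col):
    """Direct evaluation of:
       SELECT key, count(DISTINCT distinct_col), sum(sum_col) ... GROUP BY key"""
    groups = defaultdict(lambda: (set(), 0))
    for r in rows:
        seen, total = groups[r[key]]
        groups[r[key]] = (seen | {r[distinct_col]}, total + r[sum_col])
    return {k: (len(s), t) for k, (s, t) in groups.items()}

def expanded(rows, key, distinct_col, sum_col):
    """Rewritten plan: an inner GROUP BY (key, distinct_col) computing partial
    sums, then an outer GROUP BY key. The inner step makes distinct_col unique
    per key, so the outer count() needs no DISTINCT -- each phase is a plain
    aggregation that a storage handler such as Druid can execute."""
    inner = defaultdict(int)
    for r in rows:
        inner[(r[key], r[distinct_col])] += r[sum_col]
    outer = defaultdict(lambda: (0, 0))
    for (k, _), partial in inner.items():
        cnt, total = outer[k]
        outer[k] = (cnt + 1, total + partial)
    return dict(outer)
```

Both functions return identical results for any input, which is the equivalence the Calcite rule relies on; the real AggregateExpandDistinctAggregatesRule of course operates on relational algebra, not on rows.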
[jira] [Commented] (HIVE-19586) Optimize Count(distinct X) pushdown based on the storage capabilities
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499329#comment-16499329 ] Hive QA commented on HIVE-19586:

| (/) *{color:green}+1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 34s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 3m 32s{color} | {color:blue} ql in master has 2278 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 49s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 11s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 20m 49s{color} | {color:black} {color} |

|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-11465/dev-support/hive-personality.sh |
| git revision | master / 3bccc4e |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-11465/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-19586) Optimize Count(distinct X) pushdown based on the storage capabilities
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496857#comment-16496857 ] Prasanth Jayachandran commented on HIVE-19586:

Another thought would be to use RetryTestRunner; if the test fails even after that, disable it.
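Prasanth's RetryTestRunner suggestion boils down to re-running a flaky test a few times and failing only if every attempt fails. The real RetryTestRunner is a JUnit runner in Hive's test harness; as a hypothetical, language-neutral illustration of the same idea, a retry decorator might look like:

```python
import functools

def retry(times=3):
    """Re-run a flaky test up to `times` times; raise only if all attempts fail."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(times):
                try:
                    return fn(*args, **kwargs)
                except AssertionError as err:  # retry only assertion failures
                    last_error = err
            raise last_error
        return wrapper
    return decorator
```

A test that passes on any attempt is reported green; one that fails all attempts surfaces its last failure, after which disabling it is the remaining option, as suggested above.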
[jira] [Commented] (HIVE-19586) Optimize Count(distinct X) pushdown based on the storage capabilities
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496728#comment-16496728 ] Jesus Camacho Rodriguez commented on HIVE-19586:

There seems to be a history of failures due to flakiness with TestSSL: https://issues.apache.org/jira/issues/?jql=%20(%20summary%20~%20%22TestSSL%22%20or%20description%20~%20%22TestSSL%22%20or%20description%20~%20%22TestSSL.q%22%20)%0Aand%20project%20%3D%20hive%20order%20by%20updated%20desc

One of the tests within TestSSL is already disabled as part of HIVE-19509 (testSSLFetchHttp). I think it should be OK to disable TestSSL and explore why the test is failing intermittently in a new issue. [~prasanth_j], other thoughts?
[jira] [Commented] (HIVE-19586) Optimize Count(distinct X) pushdown based on the storage capabilities
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496707#comment-16496707 ] Ashutosh Chauhan commented on HIVE-19586:

TestSSL has been flaky in the past. [~jcamachorodriguez], shall we disable this test?
[jira] [Commented] (HIVE-19586) Optimize Count(distinct X) pushdown based on the storage capabilities
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496405#comment-16496405 ] Hive QA commented on HIVE-19586:

Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12925344/HIVE-19586.5.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 14419 tests executed

*Failed tests:*
{noformat}
org.apache.hive.jdbc.TestSSL.testSSLConnectionWithURL (batchId=241)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/11380/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11380/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11380/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12925344 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-19586) Optimize Count(distinct X) pushdown based on the storage capabilities
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496389#comment-16496389 ] Hive QA commented on HIVE-19586:

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 57s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 3m 33s{color} | {color:blue} ql in master has 2333 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 49s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 6 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 20m 50s{color} | {color:black} {color} |

|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-11380/dev-support/hive-personality.sh |
| git revision | master / cab1e60 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| whitespace | http://104.198.109.242/logs//PreCommit-HIVE-Build-11380/yetus/whitespace-eol.txt |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-11380/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-19586) Optimize Count(distinct X) pushdown based on the storage capabilities
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492259#comment-16492259 ] Hive QA commented on HIVE-19586: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12925144/HIVE-19586.4.patch {color:red}ERROR:{color} -1 due to build exiting with an error Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/11278/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11278/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11278/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Tests exited with: NonZeroExitCodeException Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ date '+%Y-%m-%d %T.%3N' 2018-05-28 04:26:57.868 + [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]] + export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 + JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 + export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games + PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games + export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m ' + ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m ' + export 'MAVEN_OPTS=-Xmx1g ' + MAVEN_OPTS='-Xmx1g ' + cd /data/hiveptest/working/ + tee /data/hiveptest/logs/PreCommit-HIVE-Build-11278/source-prep.txt + [[ false == \t\r\u\e ]] + mkdir -p maven ivy + [[ git = \s\v\n ]] + [[ git = \g\i\t ]] + [[ -z master ]] + [[ -d apache-github-source-source ]] + [[ ! -d apache-github-source-source/.git ]] + [[ ! 
-d apache-github-source-source ]] + date '+%Y-%m-%d %T.%3N' 2018-05-28 04:26:57.871 + cd apache-github-source-source + git fetch origin + git reset --hard HEAD HEAD is now at 43945fb Vectorization: Turning on vectorization in escape_crlf produces wrong results (Haifeng Chen, reviewed by Matt McCline) + git clean -f -d + git checkout master Already on 'master' Your branch is up-to-date with 'origin/master'. + git reset --hard origin/master HEAD is now at 43945fb Vectorization: Turning on vectorization in escape_crlf produces wrong results (Haifeng Chen, reviewed by Matt McCline) + git merge --ff-only origin/master Already up-to-date. + date '+%Y-%m-%d %T.%3N' 2018-05-28 04:26:59.066 + rm -rf ../yetus_PreCommit-HIVE-Build-11278 + mkdir ../yetus_PreCommit-HIVE-Build-11278 + git gc + cp -R . ../yetus_PreCommit-HIVE-Build-11278 + mkdir /data/hiveptest/logs/PreCommit-HIVE-Build-11278/yetus + patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh + patchFilePath=/data/hiveptest/working/scratch/build.patch + [[ -f /data/hiveptest/working/scratch/build.patch ]] + chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh + /data/hiveptest/working/scratch/smart-apply-patch.sh /data/hiveptest/working/scratch/build.patch error: patch failed: ql/src/test/queries/clientpositive/druidmini_expressions.q:51 Falling back to three-way merge... Applied patch to 'ql/src/test/queries/clientpositive/druidmini_expressions.q' with conflicts. error: patch failed: ql/src/test/results/clientpositive/druid/druidmini_expressions.q.out:257 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/druid/druidmini_expressions.q.out' with conflicts. Going to apply patch with: git apply -p0 /data/hiveptest/working/scratch/build.patch:531: trailing whitespace. Map 1 /data/hiveptest/working/scratch/build.patch:557: trailing whitespace. Reducer 2 /data/hiveptest/working/scratch/build.patch:599: trailing whitespace. 
Map 1 /data/hiveptest/working/scratch/build.patch:625: trailing whitespace. Reducer 2 /data/hiveptest/working/scratch/build.patch:667: trailing whitespace. Map 1 error: patch failed: ql/src/test/queries/clientpositive/druidmini_expressions.q:51 Falling back to three-way merge... Applied patch to 'ql/src/test/queries/clientpositive/druidmini_expressions.q' with conflicts. error: patch failed: ql/src/test/results/clientpositive/druid/druidmini_expressions.q.out:257 Falling back to three-way merge... Applied patch to 'ql/src/test/results/clientpositive/druid/druidmini_expressions.q.out' with conflicts. U ql/src/test/queries/clientpositive/druidmini_expressions.q U ql/src/test/results/clientpositive/druid/druidmini_expressions.q.out warning: squelched 12 whitespace errors warning: 17 lines add whitespace errors. + exit 1 ' {noformat}

This message is automatically generated.

ATTACHMENT ID: 12925144 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-19586) Optimize Count(distinct X) pushdown based on the storage capabilities
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490295#comment-16490295 ]

Hive QA commented on HIVE-19586:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12924745/HIVE-19586.3.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/11198/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11198/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11198/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit status 1 and output '
+ date '+%Y-%m-%d %T.%3N'
2018-05-25 06:28:46.442
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-11198/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2018-05-25 06:28:46.444
+ cd apache-github-source-source
+ git fetch origin
>From https://github.com/apache/hive
   a6832e6..c358ef5  master     -> origin/master
   fe3b15e..a93d1bd  branch-3   -> origin/branch-3
+ git reset --hard HEAD
HEAD is now at a6832e6 HIVE-19557: stats: filters for dates are not taking advantage of min/max values (Zoltan Haindrich reviewed by Ashutosh Chauhan)
+ git clean -f -d
+ git checkout master
Already on 'master'
Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded.
  (use "git pull" to update your local branch)
+ git reset --hard origin/master
HEAD is now at c358ef5 HIVE-19632: Remove webapps directory from standalone jar (Prasanth Jayachandran reviewed by Thejas Nair)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2018-05-25 06:28:47.278
+ rm -rf ../yetus_PreCommit-HIVE-Build-11198
+ mkdir ../yetus_PreCommit-HIVE-Build-11198
+ git gc
+ cp -R . ../yetus_PreCommit-HIVE-Build-11198
+ mkdir /data/hiveptest/logs/PreCommit-HIVE-Build-11198/yetus
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh /data/hiveptest/working/scratch/build.patch
error: patch failed: ql/src/test/queries/clientpositive/druidmini_expressions.q:1
Falling back to three-way merge...
Applied patch to 'ql/src/test/queries/clientpositive/druidmini_expressions.q' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/druid/druidmini_expressions.q.out:257
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/druid/druidmini_expressions.q.out' with conflicts.
Going to apply patch with: git apply -p0
/data/hiveptest/working/scratch/build.patch:536: trailing whitespace.
 Map 1
/data/hiveptest/working/scratch/build.patch:562: trailing whitespace.
 Reducer 2
/data/hiveptest/working/scratch/build.patch:604: trailing whitespace.
 Map 1
/data/hiveptest/working/scratch/build.patch:630: trailing whitespace.
 Reducer 2
/data/hiveptest/working/scratch/build.patch:672: trailing whitespace.
 Map 1
error: patch failed: ql/src/test/queries/clientpositive/druidmini_expressions.q:1
Falling back to three-way merge...
Applied patch to 'ql/src/test/queries/clientpositive/druidmini_expressions.q' with conflicts.
error: patch failed: ql/src/test/results/clientpositive/druid/druidmini_expressions.q.out:257
Falling back to three-way merge...
Applied patch to 'ql/src/test/results/clientpositive/druid/druidmini_expressions.q.out' with conflicts.
U ql/src/test/queries/clientpositive/druidmini_expressions.q
U ql/src/test/results/clientpositive/druid/druidmini_expressions.q.out
warning: squelched 12 whitespace errors
warning: 17 lines add whitespace errors.
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12924745 - PreCommit-HIVE-Build
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487142#comment-16487142 ]

slim bouguerra commented on HIVE-19586:

Done.
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486521#comment-16486521 ]

Ashutosh Chauhan commented on HIVE-19586:

[~bslim] Can you please reupload the patch for a Hive QA run?
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481232#comment-16481232 ]

Ashutosh Chauhan commented on HIVE-19586:

+1
[ https://issues.apache.org/jira/browse/HIVE-19586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479891#comment-16479891 ]

slim bouguerra commented on HIVE-19586:

After talking with [~ashutoshc] offline, we decided to tackle the simple case first. This patch expands an aggregate containing a single count distinct plus valid Druid aggregates into a series of Group Bys, producing a plan where the first Group By is executed by Druid while the rest are done in Hive. The next step is to fix a couple of related issues and enable pushdown of Grouping Sets to Druid once it is supported.
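To make the single-count-distinct expansion concrete: a query such as {code}SELECT zone, count(DISTINCT dim), sum(num_l) ... GROUP BY zone{code} is equivalent to an inner Group By on (zone, dim) that computes partial sums (the part pushed to Druid), followed by an outer Group By on zone that counts the inner rows and re-sums the partial sums (the part done in Hive). The sketch below demonstrates that equivalence in plain Python; the sample data and column names are hypothetical, and this is an illustration of the rewrite's semantics, not Hive or Calcite code.

```python
from collections import defaultdict

# Hypothetical rows of (zone, dim, num_l)
rows = [
    ("a", "x", 1), ("a", "x", 2), ("a", "y", 3),
    ("b", "x", 4), ("b", "x", 5),
]

# Phase 1 (pushed to the storage handler, e.g. Druid):
# GROUP BY (zone, dim) with a partial SUM(num_l).
inner = defaultdict(int)
for zone, dim, num_l in rows:
    inner[(zone, dim)] += num_l

# Phase 2 (stitched together in Hive): GROUP BY zone.
# COUNT(*) over the inner rows equals COUNT(DISTINCT dim),
# and summing the partial sums recovers SUM(num_l).
outer = defaultdict(lambda: [0, 0])  # zone -> [distinct_count, total_sum]
for (zone, dim), partial_sum in inner.items():
    outer[zone][0] += 1
    outer[zone][1] += partial_sum

print(dict(outer))  # zone "a": 2 distinct dims, sum 6; zone "b": 1 distinct dim, sum 9
```

With multiple count distincts over different columns, one such chain is needed per distinct column, joined on the grouping keys, which is why the Grouping Sets variant of the rewrite can be cheaper when the storage handler supports it.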