[jira] [Updated] (HIVE-24053) Pluggable HttpRequestInterceptor for Hive JDBC
[ https://issues.apache.org/jira/browse/HIVE-24053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HIVE-24053:
----------------------------------
    Labels: pull-request-available  (was: )

> Pluggable HttpRequestInterceptor for Hive JDBC
> ----------------------------------------------
>
>                 Key: HIVE-24053
>                 URL: https://issues.apache.org/jira/browse/HIVE-24053
>             Project: Hive
>          Issue Type: New Feature
>          Components: JDBC
>    Affects Versions: 3.1.2
>            Reporter: Ying Wang
>            Assignee: Ying Wang
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Allows the client to pass in the name of a custom HttpRequestInterceptor; the driver instantiates the class and adds it to the HttpClient.
> Example usage: we would like to pass in an HttpRequestInterceptor for OAuth 2.0 authentication. The interceptor will acquire and/or refresh the access token and add it as an authentication header each time HiveConnection sends an HttpRequest.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
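A minimal sketch of what such an interceptor could look like, assuming the Apache HttpClient 4.x `HttpRequestInterceptor` interface used by the HTTP-mode JDBC client; the class name and token logic here are hypothetical, not part of the patch:

```java
import java.io.IOException;

import org.apache.http.HttpException;
import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.protocol.HttpContext;

// Hypothetical OAuth 2.0 interceptor: fetches or refreshes an access token
// and attaches it as a Bearer header on every outgoing request.
public class OAuthTokenInterceptor implements HttpRequestInterceptor {

    // Placeholder for whatever token cache/refresh logic the client uses.
    private String fetchOrRefreshAccessToken() {
        return "example-access-token";
    }

    @Override
    public void process(HttpRequest request, HttpContext context)
            throws HttpException, IOException {
        // setHeader overwrites any stale Authorization value set earlier.
        request.setHeader("Authorization", "Bearer " + fetchOrRefreshAccessToken());
    }
}
```

Since the driver instantiates the class by name, it would presumably need a public no-argument constructor and has to be on the JDBC client's classpath.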
[jira] [Work logged] (HIVE-24053) Pluggable HttpRequestInterceptor for Hive JDBC
[ https://issues.apache.org/jira/browse/HIVE-24053?focusedWorklogId=473133&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-473133 ]

ASF GitHub Bot logged work on HIVE-24053:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 21/Aug/20 00:12
            Start Date: 21/Aug/20 00:12
    Worklog Time Spent: 10m

Work Description: evelyn97 opened a new pull request #1417:
URL: https://github.com/apache/hive/pull/1417

### What changes were proposed in this pull request?

This change adds an optional, pluggable HttpRequestInterceptor for HiveConnection. It allows the client to pass in the name of a custom HttpRequestInterceptor; the driver instantiates the class and adds it to the HttpClient.

### Why are the changes needed?

Example usage: we would like to pass in an HttpRequestInterceptor for OAuth 2.0 authentication. If we pass the Authorization header and access token through http.header in the JDBC connection URL, we cannot refresh the token once the connection is established. Instead, we would like to pass the name of the HttpRequestInterceptor, which will acquire and/or refresh the access token and add it as an authentication header each time HiveConnection sends an HttpRequest.

### Does this PR introduce _any_ user-facing change?

The HiveServer2 JDBC connection URL will accept an additional session variable, "http.interceptor", that allows the client to pass in the class name. Example: http.interceptor=com.example.UserInterceptor

### How was this patch tested?

Tests were not added because the HiveConnection class does not have a test to begin with, and the change does not affect existing HiveConnection behavior. It was tested with our custom JDBC driver, which uses the hive-jdbc-4.0.0-SNAPSHOT-standalone.jar.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------
            Worklog Id: (was: 473133)
    Remaining Estimate: 0h
            Time Spent: 10m
[jira] [Updated] (HIVE-24053) Pluggable HttpRequestInterceptor for Hive JDBC
[ https://issues.apache.org/jira/browse/HIVE-24053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ying Wang updated HIVE-24053:
-----------------------------
    Release Note: The HiveServer2 JDBC connection URL will accept an additional session variable, "http.interceptor", that allows the client to pass in the class name. Example: http.interceptor=com.example.UserInterceptor
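For illustration, a connection URL carrying the new session variable might look like the following, assuming the usual HTTP-mode URL format (host, port, httpPath, and the interceptor class are placeholders):

```
jdbc:hive2://host:10001/default;transportMode=http;httpPath=cliservice;http.interceptor=com.example.UserInterceptor
```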
[jira] [Assigned] (HIVE-24053) Pluggable HttpRequestInterceptor for Hive JDBC
[ https://issues.apache.org/jira/browse/HIVE-24053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ying Wang reassigned HIVE-24053:
--------------------------------
[jira] [Commented] (HIVE-22915) java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument
[ https://issues.apache.org/jira/browse/HIVE-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181467#comment-17181467 ]

Evgeniy Sh commented on HIVE-22915:
-----------------------------------

It helped me too, thanks [~javiroman]

> java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-22915
>                 URL: https://issues.apache.org/jira/browse/HIVE-22915
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.3.4
>         Environment: Ubuntu 16.04
>            Reporter: pradeepkumar
>            Priority: Critical
>
> Hi Team,
> I am not able to run Hive: I get the following error on every Hive version above 3.x that I tried. It is a very critical issue.
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/home/sreeramadasu/Downloads/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/home/sreeramadasu/Downloads/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See [http://www.slf4j.org/codes.html#multiple_bindings] for an explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
>   at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
>   at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
>   at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:536)
>   at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:554)
>   at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:448)
>   at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:4045)
>   at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:4003)
>   at org.apache.hadoop.hive.common.LogUtils.initHiveLog4jCommon(LogUtils.java:81)
>   at org.apache.hadoop.hive.common.LogUtils.initHiveLog4j(LogUtils.java:65)
>   at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:702)
>   at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
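This NoSuchMethodError is typically a Guava version conflict: Hive's lib directory ships an older Guava than the Hadoop on the classpath expects, so the three-argument checkArgument overload is missing. A commonly cited workaround is to replace Hive's Guava jar with Hadoop's; the paths and version numbers below are examples only, so verify against your own installation before deleting anything:

```shell
# Inspect which Guava versions each distribution ships.
ls "$HIVE_HOME"/lib/guava-*.jar "$HADOOP_HOME"/share/hadoop/common/lib/guava-*.jar

# Remove Hive's older Guava and copy in Hadoop's newer one
# (example versions; substitute the ones found above).
rm "$HIVE_HOME"/lib/guava-19.0.jar
cp "$HADOOP_HOME"/share/hadoop/common/lib/guava-27.0-jre.jar "$HIVE_HOME"/lib/
```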
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=473016&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-473016 ]

ASF GitHub Bot logged work on HIVE-21052:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 20/Aug/20 18:11
            Start Date: 20/Aug/20 18:11
    Worklog Time Spent: 10m

Work Description: vpnvishv opened a new pull request #1415:
URL: https://github.com/apache/hive/pull/1415

### What changes were proposed in this pull request?

The changes below apply only to branch-3.1. Design: taken from https://issues.apache.org/jira/secure/attachment/12954375/Aborted%20Txn%20w_Direct%20Write.pdf

**Overview:**
1. Add a dummy row to TXN_COMPONENTS with operation type 'p' in enqueueLockWithRetry; it will be removed in addDynamicPartitions.
2. If the txn is aborted at any point, this dummy entry will block the initiator from removing the txnId from TXNS.
3. The initiator will add a row to COMPACTION_QUEUE (with type 'p') for the aborted txn in state READY_FOR_CLEANING; at any time there will be a single entry of this type per table in COMPACTION_QUEUE.
4. The cleaner will pick up the request directly and process it via the new cleanAborted code path (scan all partitions and remove aborted dirs); once successful, the cleaner removes the dummy row from TXN_COMPONENTS.

**Cleaner Design:**
- The cleaner stays single-threaded, and this new type of cleanup is handled like any regular cleanup.

**Aborted dirs cleanup:**
- In p-type cleanup, the cleaner iterates over all partitions and removes all delta/base dirs for the given aborted writeId list.
- Cleanup of aborted base/delta dirs was also added in the worker.

**TXN_COMPONENTS cleanup:**
- On success, the p-type entry is removed from TXN_COMPONENTS during addDynamicPartitions.
- On abort, the cleaner removes it in markCleaned after successfully processing the p-type cleanup.

**TXNS cleanup:**
- No change; it will be cleaned up by the initiator.

### Why are the changes needed?

To fix the issue described in this JIRA.
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests added.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------
            Worklog Id: (was: 473016)
    Remaining Estimate: 0h
            Time Spent: 10m

> Make sure transactions get cleaned if they are aborted before addPartitions is called
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-21052
>                 URL: https://issues.apache.org/jira/browse/HIVE-21052
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 3.0.0, 3.1.1
>            Reporter: Jaume M
>            Assignee: Jaume M
>            Priority: Critical
>         Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, HIVE-21052.8.patch, HIVE-21052.9.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> If the transaction is aborted between openTxn and addPartitions and data has been written to the table, the transaction manager will think it's an empty transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables.
> As proposed by [~ekoifman], this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn; when addPartitions is called, remove this entry from TXN_COMPONENTS and add the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS specifying that a transaction was opened and aborted, it must generate jobs for the worker for every possible partition available.
> cc [~ewohlstadter]
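The marker lifecycle described above can be sketched as a toy model, with plain Java collections standing in for the metastore tables; the class and method names are invented for illustration, and this is not Hive code:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the marker-row protocol: a dummy 'p'-type row guards the txn
// until addDynamicPartitions replaces it with real partition entries; on
// abort, the cleaner processes the marker and only then drops it.
public class MarkerProtocolSketch {
    static Set<String> txnComponents = new HashSet<>();

    static void enqueueLockWithRetry(long txnId) {
        txnComponents.add(txnId + ":p");             // step 1: dummy marker row
    }

    static void addDynamicPartitions(long txnId, String partition) {
        txnComponents.remove(txnId + ":p");          // success path: marker removed
        txnComponents.add(txnId + ":" + partition);  // real partition entry
    }

    static void cleanerMarkCleaned(long txnId) {
        txnComponents.remove(txnId + ":p");          // abort path: cleaner removes marker
    }

    public static void main(String[] args) {
        enqueueLockWithRetry(42);
        cleanerMarkCleaned(42);                      // txn aborted before addDynamicPartitions
        System.out.println(txnComponents.isEmpty()); // no leftover row blocks TXNS cleanup
    }
}
```

The point of the marker is visible in the model: without the `:p` row, an abort before `addDynamicPartitions` would leave no trace in TXN_COMPONENTS, so nothing would ever trigger cleaning.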
[jira] [Updated] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HIVE-21052:
----------------------------------
    Labels: pull-request-available  (was: )
[jira] [Resolved] (HIVE-24046) Table properties external table purge
[ https://issues.apache.org/jira/browse/HIVE-24046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sthevg1 resolved HIVE-24046.
----------------------------
    Release Note: The issue was found to be not with Hive itself but with another distribution that uses Hive.
      Resolution: Invalid

> Table properties external table purge
> -------------------------------------
>
>                 Key: HIVE-24046
>                 URL: https://issues.apache.org/jira/browse/HIVE-24046
>             Project: Hive
>          Issue Type: Improvement
>          Components: distribution
>            Reporter: Sthevg1
>            Priority: Major
>
> An external table with TBLPROPERTIES ("external.table.purge"="true") moves data to .Trash when performing insert overwrite on that external table. When large data/partitions are overwritten, this fills up .Trash and affects other functionality.
> When we do insert overwrite, the data should be purged completely instead of being moved into .Trash, with or without TBLPROPERTIES ("external.table.purge"="true").
[jira] [Work logged] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
[ https://issues.apache.org/jira/browse/HIVE-23954?focusedWorklogId=472837&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-472837 ]

ASF GitHub Bot logged work on HIVE-23954:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 20/Aug/20 10:55
            Start Date: 20/Aug/20 10:55
    Worklog Time Spent: 10m

Work Description: EugeneChung opened a new pull request #1414:
URL: https://github.com/apache/hive/pull/1414

### What changes were proposed in this pull request?

It skips the reducer deduplication for the case where count functions over all rows and over distinct values are mixed.

### Why are the changes needed?

`select count(*), count(distinct mid) from db1.table1 where partitioned_column = '...'` shows wrong results, especially for `count(*)`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

I tested with the same query, `select count(*), count(distinct mid) from db1.table1 where partitioned_column = '...'`, over the same data set. With my patch, hive.optimize.countdistinct=true produces the same correct result as hive.optimize.countdistinct=false.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------
            Worklog Id: (was: 472837)
    Remaining Estimate: 0h
            Time Spent: 10m

> count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-23954
>                 URL: https://issues.apache.org/jira/browse/HIVE-23954
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Eugene Chung
>            Assignee: Eugene Chung
>            Priority: Major
>         Attachments: HIVE-23954.01.patch, HIVE-23954.01.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column = '...'{code}
> is not working properly when hive.optimize.countdistinct is true. By default, it's true for all 3.x versions.
> In the two plans below, the aggregations in the output of the Group By Operator of Map 1 are different.
>
> - hive.optimize.countdistinct=false
> {code:java}
> | Plan optimized by CBO. |
> | Vertex dependency in root stage |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE) |
> | Stage-0 |
> | Fetch Operator |
> | limit:-1 |
> | Stage-1 |
> | Reducer 2 |
> | File Output Operator [FS_7] |
> | Group By Operator [GBY_5] (rows=1 width=24) |
> | Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT KEY._col0:0._col0)"] |
> | <-Map 1 [SIMPLE_EDGE] |
> | SHUFFLE [RS_4] |
> | Group By Operator [GBY_3] (rows=343640771 width=4160) |
> | Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT mid)"],keys:mid |
> | Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> | db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> {code}
>
> - hive.optimize.countdistinct=true
> {code:java}
> | Plan optimized by CBO. |
> | Vertex dependency in root stage |
> | Stage-0
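Judging from the two quoted plans, the rewrite turns the query into a two-phase aggregation: phase one groups by mid and counts rows per key; phase two should then report count(distinct mid) as the number of groups and count(*) as the sum of the per-group counts. A toy model of the intended semantics, in plain Java rather than Hive code (the class name is invented):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountDistinctRewrite {
    // Two-phase aggregation: returns {count(*), count(distinct mid)}.
    static long[] aggregate(List<String> mids) {
        // Phase 1 (map side): group by mid and count rows per key.
        Map<String, Long> perKey = new HashMap<>();
        for (String m : mids) {
            perKey.merge(m, 1L, Long::sum);
        }
        // Phase 2 (reduce side): distinct count is the number of groups;
        // count(*) must SUM the per-group counts, not count them -- counting
        // the count column instead would collapse count(*) to the distinct
        // count, which matches the wrong result this issue reports.
        long countDistinct = perKey.size();
        long countStar = 0;
        for (long c : perKey.values()) {
            countStar += c;
        }
        return new long[] {countStar, countDistinct};
    }

    public static void main(String[] args) {
        long[] r = aggregate(Arrays.asList("a", "b", "a", "c", "b", "a"));
        System.out.println(r[0] + " " + r[1]); // prints "6 3"
    }
}
```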
[jira] [Updated] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
[ https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HIVE-23954:
----------------------------------
    Labels: pull-request-available  (was: )

> count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
> ----------------------------------------------------------------------------------------
>
> - hive.optimize.countdistinct=true
> {code:java}
> | Plan optimized by CBO. |
> | Vertex dependency in root stage |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE) |
> | Stage-0 |
> | Fetch Operator |
> | limit:-1 |
> | Stage-1 |
> | Reducer 2 |
> | File Output Operator [FS_7] |
> | Group By Operator [GBY_14] (rows=1 width=16) |
> | Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
> | Group By Operator [GBY_11] (rows=343640771 width=4160) |
> | Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
> | <-Map 1 [SIMPLE_EDGE] |
> | SHUFFLE [RS_10] |
> | PartitionCols:_col0 |
> | Group By Operator [GBY_9] (rows=343640771 width=4160) |
> | Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
> | Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> | db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> {code}
[jira] [Updated] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
[ https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Chung updated HIVE-23954:
--------------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
[ https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Chung updated HIVE-23954:
--------------------------------
    Attachment: HIVE-23954.01.patch
        Status: Patch Available  (was: Open)