[jira] [Updated] (HIVE-24053) Pluggable HttpRequestInterceptor for Hive JDBC

2020-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-24053:
--
Labels: pull-request-available  (was: )

> Pluggable HttpRequestInterceptor for Hive JDBC
> --
>
> Key: HIVE-24053
> URL: https://issues.apache.org/jira/browse/HIVE-24053
> Project: Hive
>  Issue Type: New Feature
>  Components: JDBC
>Affects Versions: 3.1.2
>Reporter: Ying Wang
>Assignee: Ying Wang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Allows the client to pass in the name of a custom HttpRequestInterceptor; 
> the driver instantiates the class and adds it to the HttpClient.
> Example usage: We would like to pass in an HttpRequestInterceptor for 
> OAuth 2.0 authentication. The HttpRequestInterceptor will acquire and/or 
> refresh the access token and add it as an authentication header each time 
> HiveConnection sends an HttpRequest.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24053) Pluggable HttpRequestInterceptor for Hive JDBC

2020-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24053?focusedWorklogId=473133&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-473133
 ]

ASF GitHub Bot logged work on HIVE-24053:
-

Author: ASF GitHub Bot
Created on: 21/Aug/20 00:12
Start Date: 21/Aug/20 00:12
Worklog Time Spent: 10m 
  Work Description: evelyn97 opened a new pull request #1417:
URL: https://github.com/apache/hive/pull/1417


   
   
   ### What changes were proposed in this pull request?
   
   This change adds an optional, pluggable HttpRequestInterceptor for 
HiveConnection. It allows the client to pass in the name of a custom 
HttpRequestInterceptor; HiveConnection instantiates the class and adds it to the HttpClient.
   
   ### Why are the changes needed?
   
   Example usage: 
   We would like to pass in an HttpRequestInterceptor for OAuth 2.0 
authentication. If we pass the Authorization header and access token 
through http.header in the JDBC connection URL, we cannot refresh the token 
once the connection is established. Instead, we would like to pass the name of 
the HttpRequestInterceptor, which will acquire and/or refresh the access token 
and add it as an authentication header each time HiveConnection sends an 
HttpRequest.
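   
   For illustration only (not part of this patch), here is a minimal sketch of 
what such an interceptor might look like, assuming Apache HttpClient 4.x's 
org.apache.http.HttpRequestInterceptor interface (which the Hive JDBC HTTP 
transport is built on); the token-provider details are placeholders:
   
```java
import java.io.IOException;

import org.apache.http.HttpException;
import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.protocol.HttpContext;

// Hypothetical example: acquires/refreshes an OAuth 2.0 access token and
// attaches it as a Bearer Authorization header on every outgoing request.
public class UserInterceptor implements HttpRequestInterceptor {

  private volatile String accessToken;
  private volatile long expiresAtMillis;

  @Override
  public void process(HttpRequest request, HttpContext context)
      throws HttpException, IOException {
    if (accessToken == null || System.currentTimeMillis() >= expiresAtMillis) {
      refreshToken(); // placeholder: call the OAuth 2.0 token endpoint
    }
    // Overwrite any stale header so the latest token is always sent.
    request.removeHeaders("Authorization");
    request.addHeader("Authorization", "Bearer " + accessToken);
  }

  private void refreshToken() throws IOException {
    // Placeholder for a real token-endpoint call; details depend on the
    // identity provider. The values below are dummies.
    this.accessToken = "...";
    this.expiresAtMillis = System.currentTimeMillis() + 300_000L;
  }
}
```
   Since the class would be instantiated by name, the sketch keeps a public 
no-arg constructor and no configuration beyond what it can obtain itself.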
   
   ### Does this PR introduce _any_ user-facing change?
   
   The HiveServer2 JDBC connection URL will accept an additional session 
variable, "http.interceptor", that allows the client to pass in the class name.
   Example: http.interceptor=com.example.UserInterceptor
   
   ### How was this patch tested?
   
   Tests were not added because the HiveConnection class does not have a test 
to begin with, and the code changes do not affect existing HiveConnection behavior. 
   It was tested with our custom JDBC driver, which uses the 
hive-jdbc-4.0.0-SNAPSHOT-standalone.jar.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 473133)
Remaining Estimate: 0h
Time Spent: 10m

> Pluggable HttpRequestInterceptor for Hive JDBC
> --
>
> Key: HIVE-24053
> URL: https://issues.apache.org/jira/browse/HIVE-24053
> Project: Hive
>  Issue Type: New Feature
>  Components: JDBC
>Affects Versions: 3.1.2
>Reporter: Ying Wang
>Assignee: Ying Wang
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Allows the client to pass in the name of a custom HttpRequestInterceptor; 
> the driver instantiates the class and adds it to the HttpClient.
> Example usage: We would like to pass in an HttpRequestInterceptor for 
> OAuth 2.0 authentication. The HttpRequestInterceptor will acquire and/or 
> refresh the access token and add it as an authentication header each time 
> HiveConnection sends an HttpRequest.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24053) Pluggable HttpRequestInterceptor for Hive JDBC

2020-08-20 Thread Ying Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying Wang updated HIVE-24053:
-
Release Note: 
The HiveServer2 JDBC connection URL will accept an additional session variable, 
"http.interceptor", that allows the client to pass in the class name.
Example: http.interceptor=com.example.UserInterceptor
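For example, in an HTTP-transport connection URL the variable would sit 
alongside the other session variables (host, port, database, and httpPath below 
are placeholders):
jdbc:hive2://host:10001/default;transportMode=http;httpPath=cliservice;http.interceptor=com.example.UserInterceptor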

> Pluggable HttpRequestInterceptor for Hive JDBC
> --
>
> Key: HIVE-24053
> URL: https://issues.apache.org/jira/browse/HIVE-24053
> Project: Hive
>  Issue Type: New Feature
>  Components: JDBC
>Affects Versions: 3.1.2
>Reporter: Ying Wang
>Assignee: Ying Wang
>Priority: Minor
>
> Allows the client to pass in the name of a custom HttpRequestInterceptor; 
> the driver instantiates the class and adds it to the HttpClient.
> Example usage: We would like to pass in an HttpRequestInterceptor for 
> OAuth 2.0 authentication. The HttpRequestInterceptor will acquire and/or 
> refresh the access token and add it as an authentication header each time 
> HiveConnection sends an HttpRequest.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24053) Pluggable HttpRequestInterceptor for Hive JDBC

2020-08-20 Thread Ying Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying Wang reassigned HIVE-24053:



> Pluggable HttpRequestInterceptor for Hive JDBC
> --
>
> Key: HIVE-24053
> URL: https://issues.apache.org/jira/browse/HIVE-24053
> Project: Hive
>  Issue Type: New Feature
>  Components: JDBC
>Affects Versions: 3.1.2
>Reporter: Ying Wang
>Assignee: Ying Wang
>Priority: Minor
>
> Allows the client to pass in the name of a custom HttpRequestInterceptor; 
> the driver instantiates the class and adds it to the HttpClient.
> Example usage: We would like to pass in an HttpRequestInterceptor for 
> OAuth 2.0 authentication. The HttpRequestInterceptor will acquire and/or 
> refresh the access token and add it as an authentication header each time 
> HiveConnection sends an HttpRequest.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-22915) java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument

2020-08-20 Thread Evgeniy Sh (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181467#comment-17181467
 ] 

Evgeniy Sh commented on HIVE-22915:
---

It helped me too, thanks [~javiroman]

> java.lang.NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument
> ---
>
> Key: HIVE-22915
> URL: https://issues.apache.org/jira/browse/HIVE-22915
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.4
> Environment: Ubuntu 16.04
>Reporter: pradeepkumar
>Priority: Critical
>
> Hi Team,
> I am not able to run Hive. I am getting the following error on Hive versions 
> above 3.x; I tried all the versions. It is a very critical issue.
> SLF4J: Class path contains multiple SLF4J bindings.
>  SLF4J: Found binding in 
> [jar:file:/home/sreeramadasu/Downloads/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>  SLF4J: Found binding in 
> [jar:file:/home/sreeramadasu/Downloads/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>  SLF4J: See [http://www.slf4j.org/codes.html#multiple_bindings] for an 
> explanation.
>  SLF4J: Actual binding is of type 
> [org.apache.logging.slf4j.Log4jLoggerFactory]
>  Exception in thread "main" java.lang.NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
>  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
>  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
>  at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:536)
>  at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:554)
>  at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:448)
>  at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:4045)
>  at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:4003)
>  at 
> org.apache.hadoop.hive.common.LogUtils.initHiveLog4jCommon(LogUtils.java:81)
>  at org.apache.hadoop.hive.common.LogUtils.initHiveLog4j(LogUtils.java:65)
>  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:702)
>  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2020-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=473016&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-473016
 ]

ASF GitHub Bot logged work on HIVE-21052:
-

Author: ASF GitHub Bot
Created on: 20/Aug/20 18:11
Start Date: 20/Aug/20 18:11
Worklog Time Spent: 10m 
  Work Description: vpnvishv opened a new pull request #1415:
URL: https://github.com/apache/hive/pull/1415


   
   
   ### What changes were proposed in this pull request?
   
   The changes below apply only to branch-3.1.
   
   Design: taken from 
https://issues.apache.org/jira/secure/attachment/12954375/Aborted%20Txn%20w_Direct%20Write.pdf
   
   **Overview:**
   1. Add a dummy row to TXN_COMPONENTS with operation type 'p' in 
enqueueLockWithRetry; it is removed in addDynamicPartitions.
   2. If the txn is aborted at any time, this dummy entry will block the 
initiator from removing this txnId from TXNS.
   3. The initiator will add a row to COMPACTION_QUEUE (with type 'p') for the 
above aborted txn with state READY_FOR_CLEANING; at any time there will be at 
most a single entry of this type per table in COMPACTION_QUEUE.
   4. The cleaner will directly pick up the above request and process it via 
the new cleanAborted code path (scan all partitions and remove aborted dirs); 
once successful, the cleaner will remove the dummy row from TXN_COMPONENTS.
   
   **Cleaner Design:**
   - We are keeping the cleaner single-threaded, and this new type of cleanup 
will be handled similarly to any regular cleanup.
   
   **Aborted dirs cleanup:**
   - In p-type cleanup, the cleaner will iterate over all the partitions and 
remove all delta/base dirs matching the given aborted writeId list.
   - Cleanup of aborted base/delta dirs was also added in the worker.
   
   **TXN_COMPONENTS cleanup:**
   - If successful, the p-type entry will be removed from TXN_COMPONENTS during 
addDynamicPartitions.
   - If aborted, the cleaner will remove it in markCleaned after successfully 
processing the p-type cleanup.
   
   **TXNS cleanup:**
   - No change; entries will be cleaned up by the initiator as before.
   
   
   
   
   ### Why are the changes needed?
   To fix the above-mentioned issue.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   
   ### How was this patch tested?
   Unit tests were added.
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 473016)
Remaining Estimate: 0h
Time Spent: 10m

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0, 3.1.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, 
> HIVE-21052.8.patch, HIVE-21052.9.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written to the table, the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman], this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn; when 
> addPartitions is called, remove this entry from TXN_COMPONENTS and add the 
> corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted, it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2020-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-21052:
--
Labels: pull-request-available  (was: )

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0, 3.1.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
>  Labels: pull-request-available
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, 
> HIVE-21052.8.patch, HIVE-21052.9.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written to the table, the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman], this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn; when 
> addPartitions is called, remove this entry from TXN_COMPONENTS and add the 
> corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted, it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-24046) Table properties external table purge

2020-08-20 Thread Sthevg1 (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sthevg1 resolved HIVE-24046.

Release Note: The issue was found to be not with Hive itself but with another 
distribution that uses Hive.
  Resolution: Invalid

> Table properties external table purge
> -
>
> Key: HIVE-24046
> URL: https://issues.apache.org/jira/browse/HIVE-24046
> Project: Hive
>  Issue Type: Improvement
>  Components: distribution
>Reporter: Sthevg1
>Priority: Major
>
> An external table with table properties TBLPROPERTIES 
> ("external.table.purge"="true") moves data to .Trash while performing an 
> insert overwrite on that external table. When large data sets/partitions are 
> overwritten, this fills up .Trash and affects other functionality.
> When we do an insert overwrite, the data should be purged completely instead 
> of being moved into .Trash, with or without TBLPROPERTIES 
> ("external.table.purge"="true").
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

2020-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23954?focusedWorklogId=472837&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-472837
 ]

ASF GitHub Bot logged work on HIVE-23954:
-

Author: ASF GitHub Bot
Created on: 20/Aug/20 10:55
Start Date: 20/Aug/20 10:55
Worklog Time Spent: 10m 
  Work Description: EugeneChung opened a new pull request #1414:
URL: https://github.com/apache/hive/pull/1414


   
   
   
   ### What changes were proposed in this pull request?
   It skips the reducer deduplication for the case where count(*) and 
count(distinct) are mixed in the same query.
   
   
   
   
   ### Why are the changes needed?
   
   `select count(*), count(distinct mid) from db1.table1 where 
partitioned_column = '...'` shows wrong results, especially for `count(*)`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   
   ### How was this patch tested?
   
   I've tested with the same query, `select count(*), count(distinct mid) 
from db1.table1 where partitioned_column = '...'`, over the same data set. With 
my patch, hive.optimize.countdistinct=true produces the same correct result as 
hive.optimize.countdistinct=false.
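
   For reference, a minimal sketch (not part of this patch) of one way to 
compare the two settings through the Hive JDBC driver; the connection URL and 
credentials are placeholders, and the query is the one from this issue:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical check: run the same query with the optimization off and on
// and print both results so they can be compared by hand.
public class CountDistinctCheck {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:hive2://host:10000/db1"; // placeholder URL
    String query = "select count(*), count(distinct mid) from db1.table1 "
                 + "where partitioned_column = '...'";
    try (Connection conn = DriverManager.getConnection(url, "user", "");
         Statement stmt = conn.createStatement()) {
      for (String flag : new String[] {"false", "true"}) {
        // Toggle the optimization at the session level before each run.
        stmt.execute("set hive.optimize.countdistinct=" + flag);
        try (ResultSet rs = stmt.executeQuery(query)) {
          if (rs.next()) {
            System.out.printf("countdistinct=%s -> count(*)=%d, count(distinct mid)=%d%n",
                flag, rs.getLong(1), rs.getLong(2));
          }
        }
      }
    }
  }
}
```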



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 472837)
Remaining Estimate: 0h
Time Spent: 10m

> count(*) with count(distinct) gives wrong results with 
> hive.optimize.countdistinct=true
> ---
>
> Key: HIVE-23954
> URL: https://issues.apache.org/jira/browse/HIVE-23954
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Eugene Chung
>Assignee: Eugene Chung
>Priority: Major
> Attachments: HIVE-23954.01.patch, HIVE-23954.01.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column 
> = '...'{code}
>  
> is not working properly when hive.optimize.countdistinct is true. By default, 
> it's true for all 3.x versions.
> In the two plans below, the aggregations in the output of the Group By 
> Operator of Map 1 are different.
>  
> - hive.optimize.countdistinct=false
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2|
> |   File Output Operator [FS_7]  |
> | Group By Operator [GBY_5] (rows=1 width=24) |
> |   
> Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT 
> KEY._col0:0._col0)"] |
> | <-Map 1 [SIMPLE_EDGE]  |
> |   SHUFFLE [RS_4]   |
> | Group By Operator [GBY_3] (rows=343640771 width=4160) |
> |   
> Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT 
> mid)"],keys:mid |
> |   Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> |   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> ||
> ++{code}
>  
> - hive.optimize.countdistinct=true
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
> ||
> | Stage-0 

[jira] [Updated] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

2020-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-23954:
--
Labels: pull-request-available  (was: )

> count(*) with count(distinct) gives wrong results with 
> hive.optimize.countdistinct=true
> ---
>
> Key: HIVE-23954
> URL: https://issues.apache.org/jira/browse/HIVE-23954
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Eugene Chung
>Assignee: Eugene Chung
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23954.01.patch, HIVE-23954.01.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column 
> = '...'{code}
>  
> is not working properly when hive.optimize.countdistinct is true. By default, 
> it's true for all 3.x versions.
> In the two plans below, the aggregations in the output of the Group By 
> Operator of Map 1 are different.
>  
> - hive.optimize.countdistinct=false
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2|
> |   File Output Operator [FS_7]  |
> | Group By Operator [GBY_5] (rows=1 width=24) |
> |   
> Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT 
> KEY._col0:0._col0)"] |
> | <-Map 1 [SIMPLE_EDGE]  |
> |   SHUFFLE [RS_4]   |
> | Group By Operator [GBY_3] (rows=343640771 width=4160) |
> |   
> Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT 
> mid)"],keys:mid |
> |   Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> |   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> ||
> ++{code}
>  
> - hive.optimize.countdistinct=true
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2|
> |   File Output Operator [FS_7]  |
> | Group By Operator [GBY_14] (rows=1 width=16) |
> |   
> Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
> |   Group By Operator [GBY_11] (rows=343640771 width=4160) |
> | 
> Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
> |   <-Map 1 [SIMPLE_EDGE]|
> | SHUFFLE [RS_10]|
> |   PartitionCols:_col0  |
> |   Group By Operator [GBY_9] (rows=343640771 width=4160) |
> | Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
> | Select Operator [SEL_2] (rows=343640771 width=4160) |
> |   Output:["mid"]   |
> |   TableScan [TS_0] (rows=343640771 width=4160) |
> | db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> ||
> ++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

2020-08-20 Thread Eugene Chung (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Chung updated HIVE-23954:

Status: Open  (was: Patch Available)

> count(*) with count(distinct) gives wrong results with 
> hive.optimize.countdistinct=true
> ---
>
> Key: HIVE-23954
> URL: https://issues.apache.org/jira/browse/HIVE-23954
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 3.1.0, 3.0.0
>Reporter: Eugene Chung
>Assignee: Eugene Chung
>Priority: Major
> Attachments: HIVE-23954.01.patch, HIVE-23954.01.patch
>
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column 
> = '...'{code}
>  
> is not working properly when hive.optimize.countdistinct is true. By default, 
> it's true for all 3.x versions.
> In the two plans below, the aggregations in the output of the Group By 
> Operator of Map 1 are different.
>  
> - hive.optimize.countdistinct=false
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2|
> |   File Output Operator [FS_7]  |
> | Group By Operator [GBY_5] (rows=1 width=24) |
> |   
> Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT 
> KEY._col0:0._col0)"] |
> | <-Map 1 [SIMPLE_EDGE]  |
> |   SHUFFLE [RS_4]   |
> | Group By Operator [GBY_3] (rows=343640771 width=4160) |
> |   
> Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT 
> mid)"],keys:mid |
> |   Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> |   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> ||
> ++{code}
>  
> - hive.optimize.countdistinct=true
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2|
> |   File Output Operator [FS_7]  |
> | Group By Operator [GBY_14] (rows=1 width=16) |
> |   
> Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
> |   Group By Operator [GBY_11] (rows=343640771 width=4160) |
> | 
> Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
> |   <-Map 1 [SIMPLE_EDGE]|
> | SHUFFLE [RS_10]|
> |   PartitionCols:_col0  |
> |   Group By Operator [GBY_9] (rows=343640771 width=4160) |
> | Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
> | Select Operator [SEL_2] (rows=343640771 width=4160) |
> |   Output:["mid"]   |
> |   TableScan [TS_0] (rows=343640771 width=4160) |
> | db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> ||
> ++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

2020-08-20 Thread Eugene Chung (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Chung updated HIVE-23954:

Attachment: HIVE-23954.01.patch
Status: Patch Available  (was: Open)

> count(*) with count(distinct) gives wrong results with 
> hive.optimize.countdistinct=true
> ---
>
> Key: HIVE-23954
> URL: https://issues.apache.org/jira/browse/HIVE-23954
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Affects Versions: 3.1.0, 3.0.0
>Reporter: Eugene Chung
>Assignee: Eugene Chung
>Priority: Major
> Attachments: HIVE-23954.01.patch, HIVE-23954.01.patch
>
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column 
> = '...'{code}
>  
> is not working properly when hive.optimize.countdistinct is true. By default, 
> it's true for all 3.x versions.
> In the two plans below, the aggregations in the output of the Group By 
> Operator of Map 1 are different.
>  
> - hive.optimize.countdistinct=false
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2|
> |   File Output Operator [FS_7]  |
> | Group By Operator [GBY_5] (rows=1 width=24) |
> |   
> Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT 
> KEY._col0:0._col0)"] |
> | <-Map 1 [SIMPLE_EDGE]  |
> |   SHUFFLE [RS_4]   |
> | Group By Operator [GBY_3] (rows=343640771 width=4160) |
> |   
> Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT 
> mid)"],keys:mid |
> |   Select Operator [SEL_2] (rows=343640771 width=4160) |
> | Output:["mid"] |
> | TableScan [TS_0] (rows=343640771 width=4160) |
> |   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> ||
> ++{code}
>  
> - hive.optimize.countdistinct=true
> {code:java}
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2|
> |   File Output Operator [FS_7]  |
> | Group By Operator [GBY_14] (rows=1 width=16) |
> |   
> Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
> |   Group By Operator [GBY_11] (rows=343640771 width=4160) |
> | 
> Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
> |   <-Map 1 [SIMPLE_EDGE]|
> | SHUFFLE [RS_10]|
> |   PartitionCols:_col0  |
> |   Group By Operator [GBY_9] (rows=343640771 width=4160) |
> | Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
> | Select Operator [SEL_2] (rows=343640771 width=4160) |
> |   Output:["mid"]   |
> |   TableScan [TS_0] (rows=343640771 width=4160) |
> | db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> ||
> ++
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)