date:20210203

[jira] [Comment Edited] (HIVE-22126) hive-exec packaging should shade guava

2021-02-03 Thread Pranay (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278548#comment-17278548
 ] 

Pranay edited comment on HIVE-22126 at 2/4/21, 5:35 AM:


[~csun] I have the same issue on hive-3.1.2. Did you happen to find a workaround for this?


was (Author: pranayvyas):
[~csun] I have the same issue, I am trying it on hive 3.1.2. Did you happen to 
work around this?

> hive-exec packaging should shade guava
> --
>
> Key: HIVE-22126
> URL: https://issues.apache.org/jira/browse/HIVE-22126
> Project: Hive
>  Issue Type: Bug
>Reporter: Vihang Karajgaonkar
>Assignee: Eugene Chung
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-22126.01.patch, HIVE-22126.02.patch, 
> HIVE-22126.03.patch, HIVE-22126.04.patch, HIVE-22126.05.patch, 
> HIVE-22126.06.patch, HIVE-22126.07.patch, HIVE-22126.08.patch, 
> HIVE-22126.09.patch, HIVE-22126.09.patch, HIVE-22126.09.patch, 
> HIVE-22126.09.patch, HIVE-22126.09.patch
>
>
> The ql/pom.xml bundles the complete Guava library into hive-exec.jar: 
> https://github.com/apache/hive/blob/master/ql/pom.xml#L990 This causes 
> problems for downstream clients of Hive that have hive-exec.jar on their 
> classpath, since they are pinned to the same Guava version as Hive. 
> We should shade the Guava classes so that other components which depend on 
> hive-exec can independently use a different version of Guava as needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HIVE-22126) hive-exec packaging should shade guava

2021-02-03 Thread Pranay (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278548#comment-17278548
 ] 

Pranay commented on HIVE-22126:
---

[~csun] I have the same issue, I am trying it on hive 3.1.2. Did you happen to 
work around this?


[jira] [Work logged] (HIVE-24664) Support column aliases in Values clause

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24664?focusedWorklogId=547426=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547426
 ]

ASF GitHub Bot logged work on HIVE-24664:
-

Author: ASF GitHub Bot
Created on: 04/Feb/21 05:01
Start Date: 04/Feb/21 05:01
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1892:
URL: https://github.com/apache/hive/pull/1892#discussion_r569947228



##
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##
@@ -1632,4 +1641,19 @@ public static String getFunctionText(ASTNode expr, boolean isFunction) {
     return BaseSemanticAnalyzer.unescapeIdentifier(funcText);
   }
 
+  private SemanticNodeProcessor getValueAliasProcessor() {
+    return new ValueAliasProcessor();
+  }
+
+  public static final Object ALIAS_PLACEHOLDER = new Object();

Review comment:
   Can we reduce the visibility of this object?
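
One way to act on this review comment, sketched under assumptions: keep the sentinel private and expose only a package-private accessor and an identity check. The class and method names below (AliasPlaceholderSketch, aliasPlaceholder, isAliasPlaceholder) are hypothetical and do not come from the pull request.

```java
// Hypothetical sketch; this is not the code from pull request #1892.
final class AliasPlaceholderSketch {
  // Private sentinel: code outside this class cannot reference the field directly.
  private static final Object ALIAS_PLACEHOLDER = new Object();

  private AliasPlaceholderSketch() {
  }

  // Package-private accessor so collaborators in the same package can emit the marker.
  static Object aliasPlaceholder() {
    return ALIAS_PLACEHOLDER;
  }

  // Identity comparison is intentional: the sentinel carries no state to compare.
  static boolean isAliasPlaceholder(Object candidate) {
    return candidate == ALIAS_PLACEHOLDER;
  }
}
```

Whether private or package-private is the right level depends on which classes actually need the marker, which the quoted hunk does not show.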





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 547426)
Time Spent: 2h 10m  (was: 2h)

> Support column aliases in Values clause
> ---
>
> Key: HIVE-24664
> URL: https://issues.apache.org/jira/browse/HIVE-24664
> Project: Hive
>  Issue Type: Improvement
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Enable explicitly specifying column aliases in the first row of a Values 
> clause. If not all columns have an alias specified, generate one.
> {code:java}
> values(1, 2 b, 3 c),(4, 5, 6);
> {code}
> {code:java}
> _col1   b   c
>   1 2   3
>   4 5   6
> {code}
>  This is not a standard SQL feature, but some database engines such as 
> Impala support it.




[jira] [Commented] (HIVE-24485) Make the slow-start behavior tunable

2021-02-03 Thread Gopal Vijayaraghavan (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-24485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278522#comment-17278522
 ] 

Gopal Vijayaraghavan commented on HIVE-24485:
-

[~okumin]: can you change the patch to move the parameters into edgeProp (as in, 
set them there and send them in, instead of adding a conf)?

{code}
  public void setSlowStart(boolean slowStart) {
    this.isSlowStart = slowStart;
  }
{code}

To be clear, this doesn't change the current behavior, but it makes it easier 
to tweak via the edgeProp during planning (and debugging is neater, because 
the object carries everything).
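
A rough sketch of what this suggestion could look like, under assumptions: the planner sets the slow-start fractions on the edge property object, and whatever configures the vertex manager reads them from that object rather than from a separate conf key. EdgePropertySketch and its fields are hypothetical stand-ins, not Hive's actual edge property class; the 0.25/0.75 defaults simply mirror the minFrac/maxFrac values in the log line quoted in the issue.

```java
// Hypothetical stand-in for an edge property object; not Hive's actual class.
final class EdgePropertySketch {
  private boolean isSlowStart = true;
  // Defaults mirror the minFrac/maxFrac values from the log line quoted in the issue.
  private float minSrcFraction = 0.25f;
  private float maxSrcFraction = 0.75f;

  public void setSlowStart(boolean slowStart) {
    this.isSlowStart = slowStart;
  }

  public boolean isSlowStart() {
    return isSlowStart;
  }

  // Set both fractions together so the pair can be validated as a unit.
  public void setSrcFractions(float min, float max) {
    if (min < 0f || max > 1f || min > max) {
      throw new IllegalArgumentException(
          "expected 0 <= min <= max <= 1, got min=" + min + ", max=" + max);
    }
    this.minSrcFraction = min;
    this.maxSrcFraction = max;
  }

  public float getMinSrcFraction() {
    return minSrcFraction;
  }

  public float getMaxSrcFraction() {
    return maxSrcFraction;
  }
}
```

Keeping the values on one object means a debugger shows the slow-start flag and both fractions side by side, which is the convenience the comment points at.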

> Make the slow-start behavior tunable
> 
>
> Key: HIVE-24485
> URL: https://issues.apache.org/jira/browse/HIVE-24485
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive, Tez
>Affects Versions: 3.1.2, 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This ticket would enable users to configure the timing of slow-start with 
> `tez.shuffle-vertex-manager.min-src-fraction` and 
> `tez.shuffle-vertex-manager.max-src-fraction`.
> Hive on Tez currently doesn't honor these parameters and ShuffleVertexManager 
> always uses the default value.
> We can control the timing to start vertexes and the accuracy of the estimated 
> input size if we can tweak these parameters. This is useful when a vertex has 
> tasks that process different amounts of data.
>  
> We can reproduce the issue with this query.
> {code:java}
> SET hive.tez.auto.reducer.parallelism=true;
> SET hive.tez.min.partition.factor=1.0; -- enforce auto-parallelism
> SET tez.shuffle-vertex-manager.min-src-fraction=0.55;
> SET tez.shuffle-vertex-manager.max-src-fraction=0.95;
> CREATE TABLE mofu (name string);
> INSERT INTO mofu (name) VALUES ('12345');
> SELECT name, count(*) FROM mofu GROUP BY name;{code}
> The fractions are ignored.
> {code:java}
> 2020-12-04 11:41:42,484 [INFO] [Dispatcher thread {Central}] 
> |vertexmanager.ShuffleVertexManagerBase|: Settings minFrac: 0.25 maxFrac: 
> 0.75 auto: true desiredTaskIput: 25600
> {code}




[jira] [Commented] (HIVE-24485) Make the slow-start behavior tunable

2021-02-03 Thread okumin (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-24485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278519#comment-17278519
 ] 

okumin commented on HIVE-24485:
---

I got a notification that the PR has been marked as stale.

[~gopalv] or anyone familiar with Hive on Tez: Could you please take a look 
when you have a chance?


[jira] [Work logged] (HIVE-24485) Make the slow-start behavior tunable

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24485?focusedWorklogId=547373=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547373
 ]

ASF GitHub Bot logged work on HIVE-24485:
-

Author: ASF GitHub Bot
Created on: 04/Feb/21 00:45
Start Date: 04/Feb/21 00:45
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] commented on pull request #1744:
URL: https://github.com/apache/hive/pull/1744#issuecomment-772937622


   This pull request has been automatically marked as stale because it has not 
had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in 
need of reviews.




Issue Time Tracking
---

Worklog Id: (was: 547373)
Time Spent: 20m  (was: 10m)


[jira] [Resolved] (HIVE-23553) Upgrade ORC version to 1.6.7

2021-02-03 Thread Jesus Camacho Rodriguez (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez resolved HIVE-23553.

Fix Version/s: 4.0.0
   Resolution: Fixed

Pushed to master, thanks [~pgaref]! This was long overdue.

> Upgrade ORC version to 1.6.7
> 
>
> Key: HIVE-23553
> URL: https://issues.apache.org/jira/browse/HIVE-23553
> Project: Hive
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
>  Apache Hive is currently on ORC 1.5.X, and in order to take advantage of 
> the latest ORC improvements such as column encryption we have to bump to 
> 1.6.X.
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343288==12318320=Create_token=A5KQ-2QAV-T4JA-FDED_4ae78f19321c7fb1e7f337fba1dd90af751d8810_lin
> Even though the ORC reader could work out of the box, Hive LLAP depends 
> heavily on internal ORC APIs, e.g., to retrieve and store File Footers, 
> Tails, streams, un/compress RG data, etc. As there were many internal changes 
> from 1.5 to 1.6 (input stream offsets, relative BufferChunks, etc.), the 
> upgrade is not straightforward.
> This Umbrella Jira tracks this upgrade effort.




[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547351=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547351
 ]

ASF GitHub Bot logged work on HIVE-23553:
-

Author: ASF GitHub Bot
Created on: 04/Feb/21 00:02
Start Date: 04/Feb/21 00:02
Worklog Time Spent: 10m 
  Work Description: jcamachor merged pull request #1823:
URL: https://github.com/apache/hive/pull/1823


   




Issue Time Tracking
---

Worklog Id: (was: 547351)
Time Spent: 9h 20m  (was: 9h 10m)

> Upgrade ORC version to 1.6.7
> 
>
> Key: HIVE-23553
> URL: https://issues.apache.org/jira/browse/HIVE-23553
> Project: Hive
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h

[jira] [Updated] (HIVE-24733) Handle replication when db location and managed location is set to custom location on source

2021-02-03 Thread Pravin Sinha (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Sinha updated HIVE-24733:

Status: Patch Available  (was: Open)

> Handle replication when db location and managed location is set to custom 
> location on source
> 
>
> Key: HIVE-24733
> URL: https://issues.apache.org/jira/browse/HIVE-24733
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {color:#172b4d} {color}




[jira] [Work logged] (HIVE-24733) Handle replication when db location and managed location is set to custom location on source

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24733?focusedWorklogId=547336=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547336
 ]

ASF GitHub Bot logged work on HIVE-24733:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 23:20
Start Date: 03/Feb/21 23:20
Worklog Time Spent: 10m 
  Work Description: pkumarsinha opened a new pull request #1942:
URL: https://github.com/apache/hive/pull/1942


   …is set to custom location on source
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 547336)
Remaining Estimate: 0h
Time Spent: 10m

> Handle replication when db location and managed location is set to custom 
> location on source
> 
>
> Key: HIVE-24733
> URL: https://issues.apache.org/jira/browse/HIVE-24733
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {color:#172b4d} {color}




[jira] [Updated] (HIVE-24733) Handle replication when db location and managed location is set to custom location on source

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-24733:
--
Labels: pull-request-available  (was: )

> Handle replication when db location and managed location is set to custom 
> location on source
> 
>
> Key: HIVE-24733
> URL: https://issues.apache.org/jira/browse/HIVE-24733
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {color:#172b4d} {color}




[jira] [Assigned] (HIVE-24733) Handle replication when db location and managed location is set to custom location on source

2021-02-03 Thread Pravin Sinha (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Sinha reassigned HIVE-24733:
---


> Handle replication when db location and managed location is set to custom 
> location on source
> 
>
> Key: HIVE-24733
> URL: https://issues.apache.org/jira/browse/HIVE-24733
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>
> {color:#172b4d} {color}




[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547313=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547313
 ]

ASF GitHub Bot logged work on HIVE-23553:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 22:17
Start Date: 03/Feb/21 22:17
Worklog Time Spent: 10m 
  Work Description: mustafaiman commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-772864136


   Looks good to me. Thanks for the effort @pgaref 




Issue Time Tracking
---

Worklog Id: (was: 547313)
Time Spent: 9h 10m  (was: 9h)

> Upgrade ORC version to 1.6.7
> 
>
> Key: HIVE-23553
> URL: https://issues.apache.org/jira/browse/HIVE-23553
> Project: Hive
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h

[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547303=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547303
 ]

ASF GitHub Bot logged work on HIVE-23553:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 21:55
Start Date: 03/Feb/21 21:55
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-772850624


   Thanks for addressing the comments @pgaref . I am fine from my side, +1.
   
   I'd like to hear from @mustafaiman , if it's fine from his side, we can 
merge it.




Issue Time Tracking
---

Worklog Id: (was: 547303)
Time Spent: 9h  (was: 8h 50m)

> Upgrade ORC version to 1.6.7
> 
>
> Key: HIVE-23553
> URL: https://issues.apache.org/jira/browse/HIVE-23553
> Project: Hive
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 9h
>  Remaining Estimate: 0h

[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547300=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547300
 ]

ASF GitHub Bot logged work on HIVE-23553:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 21:52
Start Date: 03/Feb/21 21:52
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r569775411



##
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
   @pgaref , can we create a follow-up JIRA to implement TIMESTAMP WITH 
LOCAL TIME ZONE integration with ORC so we do not forget about it?






Issue Time Tracking
---

Worklog Id: (was: 547300)
Time Spent: 8h 50m  (was: 8h 40m)

> Upgrade ORC version to 1.6.7
> 
>
> Key: HIVE-23553
> URL: https://issues.apache.org/jira/browse/HIVE-23553
> Project: Hive
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h

[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547292=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547292
 ]

ASF GitHub Bot logged work on HIVE-23553:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 21:40
Start Date: 03/Feb/21 21:40
Worklog Time Spent: 10m 
  Work Description: pgaref commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-772842805


   Gentle ping @mustafaiman @jcamachor  -- any further comments here?




Issue Time Tracking
---

Worklog Id: (was: 547292)
Time Spent: 8h 40m  (was: 8.5h)

> Upgrade ORC version to 1.6.7
> 
>
> Key: HIVE-23553
> URL: https://issues.apache.org/jira/browse/HIVE-23553
> Project: Hive
>  Issue Type: Improvement
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h

[jira] [Updated] (HIVE-24730) Shims classes override values from hive-site.xml and tez-site.xml silently

2021-02-03 Thread László Bodor (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-24730:

Description: 
Since HIVE-14887, 
[Hadoop23Shims|https://github.com/apache/hive/blob/master/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java]
 silently overrides e.g. hive.tez.container.size which is defined in 
data/conf/hive/llap/hive-site.xml. This way, the developer will have no idea 
about what happened after setting those values in the xml.
My proposal: 
1. don't set those values, unless they contain the default value (e.g.: -1 for 
hive.tez.container.size)
2. put an INFO level log message about the override

OR:

put a comment in hive-site.xml and tez-site.xml files that shims override it 
while creating a tez mini cluster


  was:
Since HIVE-14887, 
[Hadoop23Shims|https://github.com/apache/hive/blob/master/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java]
 silently overrides e.g. hive.tez.container.size which is defined in 
data/conf/hive/llap/hive-site.xml. This way, the developer will have no idea 
about what happened after setting those values in the xml.
My proposal: 
1. don't set those values, unless they reflect an invalid default value (e.g.: 
-1 for hive.tez.container.size)
2. put an INFO level log message about the override

OR:

put a comment in hive-site.xml and tez-site.xml files that shims override it 
while creating a tez mini cluster



> Shims classes override values from hive-site.xml and tez-site.xml silently
> --
>
> Key: HIVE-24730
> URL: https://issues.apache.org/jira/browse/HIVE-24730
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since HIVE-14887, 
> [Hadoop23Shims|https://github.com/apache/hive/blob/master/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java]
>  silently overrides e.g. hive.tez.container.size which is defined in 
> data/conf/hive/llap/hive-site.xml. This way, the developer will have no idea 
> about what happened after setting those values in the xml.
> My proposal: 
> 1. don't set those values, unless they contain the default value (e.g.: -1 
> for hive.tez.container.size)
> 2. put an INFO level log message about the override
> OR:
> put a comment in hive-site.xml and tez-site.xml files that shims override it 
> while creating a tez mini cluster
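
The first proposal (override only when the value is still the default, and log the override at INFO level) could be sketched roughly as follows. This is a minimal illustration that uses a plain Map in place of a Hadoop Configuration; ShimOverrideSketch, overrideIfDefault, and the value "128" in the usage note are assumptions for illustration, not the actual Hadoop23Shims code.

```java
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch; a Map stands in for the real configuration object.
final class ShimOverrideSketch {
  private static final Logger LOG = Logger.getLogger(ShimOverrideSketch.class.getName());

  private ShimOverrideSketch() {
  }

  /**
   * Sets key to shimValue only when the current value is missing or still equals
   * defaultValue. Returns true if the value was overridden.
   */
  static boolean overrideIfDefault(Map<String, String> conf, String key,
      String defaultValue, String shimValue) {
    String current = conf.get(key);
    if (current != null && !current.equals(defaultValue)) {
      // The developer configured this explicitly (e.g. in hive-site.xml): leave it alone.
      return false;
    }
    // Make the override visible instead of silent.
    LOG.info("Overriding " + key + ": " + current + " -> " + shimValue);
    conf.put(key, shimValue);
    return true;
  }
}
```

With this shape, calling overrideIfDefault(conf, "hive.tez.container.size", "-1", "128") would replace a "-1" entry and log it, while a value the developer set in the XML would be left untouched.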




[jira] [Work logged] (HIVE-24723) Use ExecutorService in TezSessionPool

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24723?focusedWorklogId=547221=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547221
 ]

ASF GitHub Bot logged work on HIVE-24723:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 18:55
Start Date: 03/Feb/21 18:55
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on a change in pull request #1939:
URL: https://github.com/apache/hive/pull/1939#discussion_r569663760



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionPool.java
##
@@ -395,20 +379,38 @@ int getInitialSize() {
     } while (!deltaRemaining.compareAndSet(oldVal, oldVal + delta));
     int toStart = oldVal + delta;
     if (toStart <= 0) return createDummyFuture();
-    LOG.info("Resizing the pool; adding " + toStart + " sessions");
-
-    // 2) If we need to create some extra sessions, we'd do it just like startup does.
-    int threadCount = Math.max(1, Math.min(toStart,
-        HiveConf.getIntVar(initConf, ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS)));
-    List> threadTasks = new ArrayList<>(threadCount);
-    // This is an async method, so always launch threads, even for a single task.
-    for (int i = 0; i < threadCount; ++i) {
-      ListenableFutureTask task = ListenableFutureTask.create(
-          new CreateSessionsRunnable(deltaRemaining));
-      new Thread(task, "Tez pool resize " + i).start();
-      threadTasks.add(task);
+    LOG.info("Resizing the pool; adding {} sessions", toStart);
+
+    final int threadCount =
+        Math.min(toStart, HiveConf.getIntVar(initConf, ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS));
+
+    return createSessions(toStart, threadCount);
+  }
+
+  private ListenableFuture> createSessions(int sessionCount, int maxParallel) {
+    Preconditions.checkArgument(sessionCount > 0);
+    Preconditions.checkArgument(maxParallel > 0);

Review comment:
   This was previously checked in the calling methods.  I have moved it here 
to remove redundancy.  It is not an expensive operation and it is not called 
all that often, so no worries even if it is doubled somewhere.  Better safe 
than sorry as the code changes.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 547221)
Time Spent: 1h 20m  (was: 1h 10m)

> Use ExecutorService in TezSessionPool
> -
>
> Key: HIVE-24723
> URL: https://issues.apache.org/jira/browse/HIVE-24723
> Project: Hive
>  Issue Type: Improvement
>  Components: Tez
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently there is some wonky home-made thread pooling going on in 
> {{TezSessionPool}}.  Replace it with some JDK/Guava goodness.
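
As a rough illustration of the idea in this issue, the sketch below replaces hand-started threads with a JDK `ExecutorService` and validates its arguments up front, similar in spirit to the `Preconditions.checkArgument` calls in the diff. It is not the code from the pull request: it uses `CompletableFuture` rather than Guava's `ListenableFuture` so that it runs with the JDK alone, and the class name and the `createSession` callback are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntFunction;

// Hypothetical sketch; JDK-only stand-in for the Guava-based code under review.
final class SessionPoolSketch {
  /**
   * Runs sessionCount creation tasks on at most maxParallel threads and
   * returns a future that completes with all results, in submission order.
   */
  static <T> CompletableFuture<List<T>> createSessions(
      int sessionCount, int maxParallel, IntFunction<T> createSession) {
    if (sessionCount <= 0) {
      throw new IllegalArgumentException("sessionCount must be > 0");
    }
    if (maxParallel <= 0) {
      throw new IllegalArgumentException("maxParallel must be > 0");
    }

    ExecutorService executor =
        Executors.newFixedThreadPool(Math.min(sessionCount, maxParallel));
    List<CompletableFuture<T>> futures = new ArrayList<>(sessionCount);
    for (int i = 0; i < sessionCount; i++) {
      final int index = i;
      futures.add(CompletableFuture.supplyAsync(() -> createSession.apply(index), executor));
    }
    // Stop accepting new work; tasks already submitted still run to completion.
    executor.shutdown();

    return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
        .thenApply(ignored -> {
          List<T> results = new ArrayList<>(sessionCount);
          for (CompletableFuture<T> future : futures) {
            results.add(future.join()); // already complete, so this does not block
          }
          return results;
        });
  }

  public static void main(String[] args) {
    List<String> sessions =
        SessionPoolSketch.<String>createSessions(3, 2, i -> "session-" + i).join();
    System.out.println(sessions);
  }
}
```

The method stays asynchronous, as the original resize path was, while thread lifecycle management is left to the executor.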



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HIVE-24730) Shims classes override values from hive-site.xml and tez-site.xml silently

2021-02-03 Thread Jira



[ 
https://issues.apache.org/jira/browse/HIVE-24730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278275#comment-17278275
 ] 

Mustafa İman commented on HIVE-24730:
-

Why do we override configs programmatically at all? Can't we just put whatever 
is necessary in the config file?

> Shims classes override values from hive-site.xml and tez-site.xml silently
> --
>
> Key: HIVE-24730
> URL: https://issues.apache.org/jira/browse/HIVE-24730
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since HIVE-14887, 
> [Hadoop23Shims|https://github.com/apache/hive/blob/master/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java]
>  silently overrides e.g. hive.tez.container.size which is defined in 
> data/conf/hive/llap/hive-site.xml. This way, the developer will have no idea 
> about what happened after setting those values in the xml.
> My proposal: 
> 1. don't set those values, unless they reflect an invalid default value 
> (e.g.: -1 for hive.tez.container.size)
> 2. put an INFO level log message about the override
> OR:
> put a comment in hive-site.xml and tez-site.xml files that shims override it 
> while creating a tez mini cluster



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24707) Apply Sane Default for Tez Containers as Last Resort

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24707?focusedWorklogId=547167=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547167
 ]

ASF GitHub Bot logged work on HIVE-24707:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:51
Start Date: 03/Feb/21 17:51
Worklog Time Spent: 10m 
  Work Description: pgaref edited a comment on pull request #1933:
URL: https://github.com/apache/hive/pull/1933#issuecomment-772699104


   > @pgaref If only you hadn't asked me to look one more time I probably would 
have just merged. :)
   
   Np at all @belugabehr -- that's what reviews are for :) 
   Updated PR





Issue Time Tracking
---

Worklog Id: (was: 547167)
Time Spent: 3h 40m  (was: 3.5h)

> Apply Sane Default for Tez Containers as Last Resort
> 
>
> Key: HIVE-24707
> URL: https://issues.apache.org/jira/browse/HIVE-24707
> Project: Hive
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: Panagiotis Garefalakis
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> {code:java|title=DagUtils.java}
> public static Resource getContainerResource(Configuration conf) {
> int memory = HiveConf.getIntVar(conf, 
> HiveConf.ConfVars.HIVETEZCONTAINERSIZE) > 0 ?
>   HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCONTAINERSIZE) :
>   conf.getInt(MRJobConfig.MAP_MEMORY_MB, 
> MRJobConfig.DEFAULT_MAP_MEMORY_MB);
> int cpus = HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCPUVCORES) > 
> 0 ?
>   HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCPUVCORES) :
>   conf.getInt(MRJobConfig.MAP_CPU_VCORES, 
> MRJobConfig.DEFAULT_MAP_CPU_VCORES);
> return Resource.newInstance(memory, cpus);
>   }
> {code}
> If Tez Container Size or VCores is an invalid value ( <= 0 ) then it falls 
> back onto the MapReduce configurations, but if the MapReduce configurations 
> have invalid values ( <= 0 ), they are accepted regardless and this will 
> cause failures down the road.
> This code should also check the MapReduce values and fall back to MapReduce 
> default values if they are <= 0.
> Also, some logging would be nice here too, reporting about where the 
> configuration values came from.
>  
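
The fallback chain requested above (Tez value, then MapReduce value, then the MapReduce default as a last resort) can be sketched as follows. This is an illustration under stated assumptions, not the patch from the pull request: `firstPositive` is a hypothetical helper, and the literal values in `main` merely stand in for what `HiveConf` and `MRJobConfig` would supply.

```java
// Hypothetical sketch of "use the first valid value, else a sane default".
final class ContainerResourceSketch {
  /** Returns the first candidate that is > 0, or lastResort if none is. */
  static int firstPositive(int lastResort, int... candidates) {
    for (int candidate : candidates) {
      if (candidate > 0) {
        return candidate;
      }
    }
    return lastResort;
  }

  public static void main(String[] args) {
    int hiveTezContainerSize = -1; // assumed: hive.tez.container.size left unset
    int mapMemoryMb = 0;           // assumed: an invalid MapReduce setting
    int defaultMapMemoryMb = 1024; // assumed stand-in for MRJobConfig.DEFAULT_MAP_MEMORY_MB

    int memory = firstPositive(defaultMapMemoryMb, hiveTezContainerSize, mapMemoryMb);
    // A real implementation would also log which source the value came from.
    System.out.println("memory=" + memory);
  }
}
```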



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24707) Apply Sane Default for Tez Containers as Last Resort

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24707?focusedWorklogId=547166=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547166
 ]

ASF GitHub Bot logged work on HIVE-24707:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:50
Start Date: 03/Feb/21 17:50
Worklog Time Spent: 10m 
  Work Description: pgaref commented on pull request #1933:
URL: https://github.com/apache/hive/pull/1933#issuecomment-772699104


   > @pgaref If only you hadn't asked me to look one more time I probably would 
have just merged. :)
   
   Np at all @belugabehr -- that's what reviews are for :) 





Issue Time Tracking
---

Worklog Id: (was: 547166)
Time Spent: 3.5h  (was: 3h 20m)

> Apply Sane Default for Tez Containers as Last Resort
> 
>
> Key: HIVE-24707
> URL: https://issues.apache.org/jira/browse/HIVE-24707
> Project: Hive
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: Panagiotis Garefalakis
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> {code:java|title=DagUtils.java}
> public static Resource getContainerResource(Configuration conf) {
> int memory = HiveConf.getIntVar(conf, 
> HiveConf.ConfVars.HIVETEZCONTAINERSIZE) > 0 ?
>   HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCONTAINERSIZE) :
>   conf.getInt(MRJobConfig.MAP_MEMORY_MB, 
> MRJobConfig.DEFAULT_MAP_MEMORY_MB);
> int cpus = HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCPUVCORES) > 
> 0 ?
>   HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCPUVCORES) :
>   conf.getInt(MRJobConfig.MAP_CPU_VCORES, 
> MRJobConfig.DEFAULT_MAP_CPU_VCORES);
> return Resource.newInstance(memory, cpus);
>   }
> {code}
> If Tez Container Size or VCores is an invalid value ( <= 0 ) then it falls 
> back onto the MapReduce configurations, but if the MapReduce configurations 
> have invalid values ( <= 0 ), they are accepted regardless and this will 
> cause failures down the road.
> This code should also check the MapReduce values and fall back to MapReduce 
> default values if they are <= 0.
> Also, some logging would be nice here too, reporting about where the 
> configuration values came from.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24723) Use ExecutorService in TezSessionPool

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24723?focusedWorklogId=547138=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547138
 ]

ASF GitHub Bot logged work on HIVE-24723:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:16
Start Date: 03/Feb/21 17:16
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on a change in pull request #1939:
URL: https://github.com/apache/hive/pull/1939#discussion_r569594294



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionPool.java
##
@@ -395,20 +379,38 @@ int getInitialSize() {
 } while (!deltaRemaining.compareAndSet(oldVal, oldVal + delta));
 int toStart = oldVal + delta;
 if (toStart <= 0) return createDummyFuture();
-LOG.info("Resizing the pool; adding " + toStart + " sessions");
-
-// 2) If we need to create some extra sessions, we'd do it just like 
startup does.
-int threadCount = Math.max(1, Math.min(toStart,
-HiveConf.getIntVar(initConf, 
ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS)));
-List> threadTasks = new 
ArrayList<>(threadCount);
-// This is an async method, so always launch threads, even for a single 
task.
-for (int i = 0; i < threadCount; ++i) {
-  ListenableFutureTask task = ListenableFutureTask.create(
-  new CreateSessionsRunnable(deltaRemaining));
-  new Thread(task, "Tez pool resize " + i).start();
-  threadTasks.add(task);
+LOG.info("Resizing the pool; adding {} sessions", toStart);
+
+final int threadCount =
+Math.min(toStart, HiveConf.getIntVar(initConf, 
ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS));

Review comment:
   I would advise against that.  The first one initializes the pool; the 
second one grows the existing pool.  I think the naming is OK.

##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionPool.java
##
@@ -395,20 +379,38 @@ int getInitialSize() {
 } while (!deltaRemaining.compareAndSet(oldVal, oldVal + delta));
 int toStart = oldVal + delta;
 if (toStart <= 0) return createDummyFuture();
-LOG.info("Resizing the pool; adding " + toStart + " sessions");
-
-// 2) If we need to create some extra sessions, we'd do it just like 
startup does.
-int threadCount = Math.max(1, Math.min(toStart,
-HiveConf.getIntVar(initConf, 
ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS)));
-List> threadTasks = new 
ArrayList<>(threadCount);
-// This is an async method, so always launch threads, even for a single 
task.
-for (int i = 0; i < threadCount; ++i) {
-  ListenableFutureTask task = ListenableFutureTask.create(
-  new CreateSessionsRunnable(deltaRemaining));
-  new Thread(task, "Tez pool resize " + i).start();
-  threadTasks.add(task);
+LOG.info("Resizing the pool; adding {} sessions", toStart);
+
+final int threadCount =
+Math.min(toStart, HiveConf.getIntVar(initConf, 
ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS));
+
+return createSessions(toStart, threadCount);
+  }
+
+  private ListenableFuture> createSessions(int sessionCount, int 
maxParallel) {

Review comment:
   Sure







Issue Time Tracking
---

Worklog Id: (was: 547138)
Time Spent: 1h 10m  (was: 1h)

> Use ExecutorService in TezSessionPool
> -
>
> Key: HIVE-24723
> URL: https://issues.apache.org/jira/browse/HIVE-24723
> Project: Hive
>  Issue Type: Improvement
>  Components: Tez
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently there is some wonky home-made thread pooling going on in 
> {{TezSessionPool}}.  Replace it with some JDK/Guava goodness.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24723) Use ExecutorService in TezSessionPool

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24723?focusedWorklogId=547137=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547137
 ]

ASF GitHub Bot logged work on HIVE-24723:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:13
Start Date: 03/Feb/21 17:13
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on pull request #1939:
URL: https://github.com/apache/hive/pull/1939#issuecomment-772673304


   @pgaref Yes, the failing tests were caused by this change.  I made a typo.  Will 
address your comments.





Issue Time Tracking
---

Worklog Id: (was: 547137)
Time Spent: 1h  (was: 50m)

> Use ExecutorService in TezSessionPool
> -
>
> Key: HIVE-24723
> URL: https://issues.apache.org/jira/browse/HIVE-24723
> Project: Hive
>  Issue Type: Improvement
>  Components: Tez
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently there is some wonky home-made thread pooling going on in 
> {{TezSessionPool}}.  Replace it with some JDK/Guava goodness.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24673) Migrate NegativeCliDriver and NegativeMinimrCliDriver to llap

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24673?focusedWorklogId=547136=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547136
 ]

ASF GitHub Bot logged work on HIVE-24673:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:13
Start Date: 03/Feb/21 17:13
Worklog Time Spent: 10m 
  Work Description: mustafaiman closed pull request #1902:
URL: https://github.com/apache/hive/pull/1902


   





Issue Time Tracking
---

Worklog Id: (was: 547136)
Time Spent: 3h 20m  (was: 3h 10m)

> Migrate NegativeCliDriver and NegativeMinimrCliDriver to llap
> -
>
> Key: HIVE-24673
> URL: https://issues.apache.org/jira/browse/HIVE-24673
> Project: Hive
>  Issue Type: Improvement
>Reporter: Mustafa İman
>Assignee: Mustafa İman
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> These test drivers should run on llap. Otherwise we can run into situations 
> where certain queries correctly fail on MapReduce but not on Tez.
> Also, it is better if negative cli drivers do not mask "Caused by" lines in 
> test output. Otherwise, a query may start to fail for reasons other than the 
> expected one without us noticing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Resolved] (HIVE-24673) Migrate NegativeCliDriver and NegativeMinimrCliDriver to llap

2021-02-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/HIVE-24673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mustafa İman resolved HIVE-24673.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged to master. Thank you for review [~kgyrtkirk]

> Migrate NegativeCliDriver and NegativeMinimrCliDriver to llap
> -
>
> Key: HIVE-24673
> URL: https://issues.apache.org/jira/browse/HIVE-24673
> Project: Hive
>  Issue Type: Improvement
>Reporter: Mustafa İman
>Assignee: Mustafa İman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> These test drivers should run on llap. Otherwise we can run into situations 
> where certain queries correctly fail on MapReduce but not on Tez.
> Also, it is better if negative cli drivers do not mask "Caused by" lines in 
> test output. Otherwise, a query may start to fail for reasons other than the 
> expected one without us noticing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24701) Remove String Manipulation from Date Parsing TimestampTZUtil

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24701?focusedWorklogId=547135=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547135
 ]

ASF GitHub Bot logged work on HIVE-24701:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:12
Start Date: 03/Feb/21 17:12
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on pull request #1927:
URL: https://github.com/apache/hive/pull/1927#issuecomment-772672233


   ```none
   Testing / split-14 / PostProcess / testVectorUDFUnixTimeStamp – 
org.apache.hadoop.hive.ql.exec.vector.expressions.TestVectorDateExpressions
   2s
   Stacktrace
   java.lang.AssertionError: expected:<9784802439> but was:<-97972298294822>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
   ```
   
   This issue is fixed in #1938 so should pass after #1938 is merged





Issue Time Tracking
---

Worklog Id: (was: 547135)
Time Spent: 0.5h  (was: 20m)

> Remove String Manipulation from Date Parsing TimestampTZUtil
> 
>
> Key: HIVE-24701
> URL: https://issues.apache.org/jira/browse/HIVE-24701
> Project: Hive
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This operation is pretty slow:
> {code:java}
>   // Converts Date to TimestampTZ.
>   public static TimestampTZ convert(Date date, ZoneId defaultTimeZone) {
> return parse(date.toString(), defaultTimeZone);
>   }
> {code}
> To convert from Date to TimestampTZ, it creates a string, then parses it.  
> It should be able to just look at the epoch time and do the conversion without 
> all the string manipulation/parsing.
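
A minimal sketch of the string-free conversion suggested here, using only `java.time`. It assumes the date is available as days since the epoch and that a date with no time component maps to the start of that day in the default zone; Hive's own `Date` and `TimestampTZ` types are deliberately left out, so the class name and the method signature are illustrative rather than the actual fix.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical sketch; java.time types stand in for Hive's Date and TimestampTZ.
final class DateConvertSketch {
  /** Converts days-since-epoch to the start of that day in the given zone, with no string round trip. */
  static ZonedDateTime convert(long epochDay, ZoneId defaultTimeZone) {
    return LocalDate.ofEpochDay(epochDay).atStartOfDay(defaultTimeZone);
  }

  public static void main(String[] args) {
    System.out.println(convert(0L, ZoneId.of("UTC")));
  }
}
```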



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24723) Use ExecutorService in TezSessionPool

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24723?focusedWorklogId=547134=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547134
 ]

ASF GitHub Bot logged work on HIVE-24723:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:12
Start Date: 03/Feb/21 17:12
Worklog Time Spent: 10m 
  Work Description: pgaref commented on pull request #1939:
URL: https://github.com/apache/hive/pull/1939#issuecomment-772672176


   There are also some test timeouts -- are they related to this TezSessionPool 
change? cc @abstractdog 





Issue Time Tracking
---

Worklog Id: (was: 547134)
Time Spent: 50m  (was: 40m)

> Use ExecutorService in TezSessionPool
> -
>
> Key: HIVE-24723
> URL: https://issues.apache.org/jira/browse/HIVE-24723
> Project: Hive
>  Issue Type: Improvement
>  Components: Tez
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently there is some wonky home-made thread pooling going on in 
> {{TezSessionPool}}.  Replace it with some JDK/Guava goodness.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24723) Use ExecutorService in TezSessionPool

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24723?focusedWorklogId=547133=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547133
 ]

ASF GitHub Bot logged work on HIVE-24723:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:10
Start Date: 03/Feb/21 17:10
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1939:
URL: https://github.com/apache/hive/pull/1939#discussion_r569583989



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionPool.java
##
@@ -395,20 +379,38 @@ int getInitialSize() {
 } while (!deltaRemaining.compareAndSet(oldVal, oldVal + delta));
 int toStart = oldVal + delta;
 if (toStart <= 0) return createDummyFuture();
-LOG.info("Resizing the pool; adding " + toStart + " sessions");
-
-// 2) If we need to create some extra sessions, we'd do it just like 
startup does.
-int threadCount = Math.max(1, Math.min(toStart,
-HiveConf.getIntVar(initConf, 
ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS)));
-List> threadTasks = new 
ArrayList<>(threadCount);
-// This is an async method, so always launch threads, even for a single 
task.
-for (int i = 0; i < threadCount; ++i) {
-  ListenableFutureTask task = ListenableFutureTask.create(
-  new CreateSessionsRunnable(deltaRemaining));
-  new Thread(task, "Tez pool resize " + i).start();
-  threadTasks.add(task);
+LOG.info("Resizing the pool; adding {} sessions", toStart);
+
+final int threadCount =
+Math.min(toStart, HiveConf.getIntVar(initConf, 
ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS));
+
+return createSessions(toStart, threadCount);
+  }
+
+  private ListenableFuture> createSessions(int sessionCount, int 
maxParallel) {

Review comment:
   Shall we use the same variable names here? Add a method doc?

##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionPool.java
##
@@ -395,20 +379,38 @@ int getInitialSize() {
 } while (!deltaRemaining.compareAndSet(oldVal, oldVal + delta));
 int toStart = oldVal + delta;
 if (toStart <= 0) return createDummyFuture();
-LOG.info("Resizing the pool; adding " + toStart + " sessions");
-
-// 2) If we need to create some extra sessions, we'd do it just like 
startup does.
-int threadCount = Math.max(1, Math.min(toStart,
-HiveConf.getIntVar(initConf, 
ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS)));
-List> threadTasks = new 
ArrayList<>(threadCount);
-// This is an async method, so always launch threads, even for a single 
task.
-for (int i = 0; i < threadCount; ++i) {
-  ListenableFutureTask task = ListenableFutureTask.create(
-  new CreateSessionsRunnable(deltaRemaining));
-  new Thread(task, "Tez pool resize " + i).start();
-  threadTasks.add(task);
+LOG.info("Resizing the pool; adding {} sessions", toStart);
+
+final int threadCount =
+Math.min(toStart, HiveConf.getIntVar(initConf, 
ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS));

Review comment:
   Shall we rename toStart -> initialSize for consistency?

##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionPool.java
##
@@ -395,20 +379,38 @@ int getInitialSize() {
 } while (!deltaRemaining.compareAndSet(oldVal, oldVal + delta));
 int toStart = oldVal + delta;
 if (toStart <= 0) return createDummyFuture();
-LOG.info("Resizing the pool; adding " + toStart + " sessions");
-
-// 2) If we need to create some extra sessions, we'd do it just like 
startup does.
-int threadCount = Math.max(1, Math.min(toStart,
-HiveConf.getIntVar(initConf, 
ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS)));
-List> threadTasks = new 
ArrayList<>(threadCount);
-// This is an async method, so always launch threads, even for a single 
task.
-for (int i = 0; i < threadCount; ++i) {
-  ListenableFutureTask task = ListenableFutureTask.create(
-  new CreateSessionsRunnable(deltaRemaining));
-  new Thread(task, "Tez pool resize " + i).start();
-  threadTasks.add(task);
+LOG.info("Resizing the pool; adding {} sessions", toStart);
+
+final int threadCount =
+Math.min(toStart, HiveConf.getIntVar(initConf, 
ConfVars.HIVE_SERVER2_TEZ_SESSION_MAX_INIT_THREADS));
+
+return createSessions(toStart, threadCount);
+  }
+
+  private ListenableFuture> createSessions(int sessionCount, int 
maxParallel) {
+Preconditions.checkArgument(sessionCount > 0);
+Preconditions.checkArgument(maxParallel > 0);

Review comment:
   Second Precondition check is redundant






[jira] [Work logged] (HIVE-24693) Parquet Timestamp Values Read/Write Very Slow

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24693?focusedWorklogId=547132=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547132
 ]

ASF GitHub Bot logged work on HIVE-24693:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:08
Start Date: 03/Feb/21 17:08
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on pull request #1938:
URL: https://github.com/apache/hive/pull/1938#issuecomment-772669362


   I think we have some flaky tests.  Will re-build.





Issue Time Tracking
---

Worklog Id: (was: 547132)
Time Spent: 1h 40m  (was: 1.5h)

> Parquet Timestamp Values Read/Write Very Slow
> -
>
> Key: HIVE-24693
> URL: https://issues.apache.org/jira/browse/HIVE-24693
> Project: Hive
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Parquet {{DataWriteableWriter}} relies on {{NanoTimeUtils}} to convert a 
> timestamp object into a binary value.  It does this by calling 
> {{toString()}} on the timestamp object and then parsing the String.  
> This particular timestamp does not carry a timezone, so the string is something 
> like:
> {{2021-21-03 12:32:23....}}
> The parse code tries to parse the string assuming there is a time zone, and 
> if not, falls back and applies the provided "default time zone".  As was 
> noted in [HIVE-24353], if something fails to parse, it is very expensive to 
> try to parse again.  So, for each timestamp in the Parquet file, it:
> * Builds a string from the timestamp
> * Parses it (throws an exception, parses again)
> There is no need for this kind of string manipulation/parsing; it should 
> just use the epoch millis/seconds/time stored internally in the Timestamp 
> object.
> {code:java}
>   // Converts Timestamp to TimestampTZ.
>   public static TimestampTZ convert(Timestamp ts, ZoneId defaultTimeZone) {
> return parse(ts.toString(), defaultTimeZone);
>   }
> {code}
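
The conversion described above can be done directly from the stored fields. The sketch below assumes the timestamp exposes its wall-clock value as epoch seconds plus nanoseconds (measured as if in UTC) and that the default zone should simply be attached to that wall-clock time, which is what the parse fallback achieves for a zone-less string. The class and method names are hypothetical and `java.time` types stand in for Hive's `Timestamp` and `TimestampTZ`.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical sketch; avoids toString() and the parse-fail-parse-again path.
final class TimestampConvertSketch {
  /** Attaches defaultTimeZone to a zone-less timestamp given as epoch seconds and nanos. */
  static ZonedDateTime convert(long epochSecond, int nanos, ZoneId defaultTimeZone) {
    // Recover the zone-less wall-clock fields from the stored values.
    LocalDateTime wallClock = LocalDateTime.ofEpochSecond(epochSecond, nanos, ZoneOffset.UTC);
    // Interpret that wall-clock time in the default zone.
    return wallClock.atZone(defaultTimeZone);
  }

  public static void main(String[] args) {
    System.out.println(convert(3600L, 5, ZoneId.of("Asia/Tokyo")));
  }
}
```

No exception is thrown on the common path, so the per-value cost is a few arithmetic operations rather than a failed parse followed by a second parse.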



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24693) Parquet Timestamp Values Read/Write Very Slow

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24693?focusedWorklogId=547131=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547131
 ]

ASF GitHub Bot logged work on HIVE-24693:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:08
Start Date: 03/Feb/21 17:08
Worklog Time Spent: 10m 
  Work Description: belugabehr opened a new pull request #1938:
URL: https://github.com/apache/hive/pull/1938


   Replaces #1918





Issue Time Tracking
---

Worklog Id: (was: 547131)
Time Spent: 1.5h  (was: 1h 20m)

> Parquet Timestamp Values Read/Write Very Slow
> -
>
> Key: HIVE-24693
> URL: https://issues.apache.org/jira/browse/HIVE-24693
> Project: Hive
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Parquet {{DataWriteableWriter}} relies on {{NanoTimeUtils}} to convert a 
> timestamp object into a binary value.  It does this by calling 
> {{toString()}} on the timestamp object and then parsing the String.  
> This particular timestamp does not carry a timezone, so the string is something 
> like:
> {{2021-21-03 12:32:23....}}
> The parse code tries to parse the string assuming there is a time zone, and 
> if not, falls back and applies the provided "default time zone".  As was 
> noted in [HIVE-24353], if something fails to parse, it is very expensive to 
> try to parse again.  So, for each timestamp in the Parquet file, it:
> * Builds a string from the timestamp
> * Parses it (throws an exception, parses again)
> There is no need for this kind of string manipulation/parsing; it should 
> just use the epoch millis/seconds/time stored internally in the Timestamp 
> object.
> {code:java}
>   // Converts Timestamp to TimestampTZ.
>   public static TimestampTZ convert(Timestamp ts, ZoneId defaultTimeZone) {
> return parse(ts.toString(), defaultTimeZone);
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24693) Parquet Timestamp Values Read/Write Very Slow

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24693?focusedWorklogId=547129=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547129
 ]

ASF GitHub Bot logged work on HIVE-24693:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:07
Start Date: 03/Feb/21 17:07
Worklog Time Spent: 10m 
  Work Description: belugabehr closed pull request #1938:
URL: https://github.com/apache/hive/pull/1938


   





Issue Time Tracking
---

Worklog Id: (was: 547129)
Time Spent: 1h 20m  (was: 1h 10m)

> Parquet Timestamp Values Read/Write Very Slow
> -
>
> Key: HIVE-24693
> URL: https://issues.apache.org/jira/browse/HIVE-24693
> Project: Hive
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Parquet {{DataWriteableWriter}} relies on {{NanoTimeUtils}} to convert a 
> timestamp object into a binary value.  It does this by calling 
> {{toString()}} on the timestamp object and then parsing the String.  
> This particular timestamp does not carry a timezone, so the string is something 
> like:
> {{2021-21-03 12:32:23....}}
> The parse code tries to parse the string assuming there is a time zone, and 
> if not, falls back and applies the provided "default time zone".  As was 
> noted in [HIVE-24353], if something fails to parse, it is very expensive to 
> try to parse again.  So, for each timestamp in the Parquet file, it:
> * Builds a string from the timestamp
> * Parses it (throws an exception, parses again)
> There is no need for this kind of string manipulation/parsing; it should 
> just use the epoch millis/seconds/time stored internally in the Timestamp 
> object.
> {code:java}
>   // Converts Timestamp to TimestampTZ.
>   public static TimestampTZ convert(Timestamp ts, ZoneId defaultTimeZone) {
>     return parse(ts.toString(), defaultTimeZone);
>   }
> {code}
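
The difference between the two conversion paths described above can be sketched with plain java.time types. This is only an illustration under stated assumptions: LocalDateTime stands in for Hive's Timestamp, ZonedDateTime stands in for TimestampTZ, and the class and method names are hypothetical, not Hive's API.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative only: java.time types stand in for Hive's Timestamp and
// TimestampTZ classes; none of these names come from the Hive code base.
class TimestampConvertSketch {

  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

  // Slow path, in the spirit of the quoted code: format to text, parse it back.
  static ZonedDateTime convertViaString(LocalDateTime ts, ZoneId defaultTimeZone) {
    String text = ts.format(FORMAT);
    return LocalDateTime.parse(text, FORMAT).atZone(defaultTimeZone);
  }

  // Direct path: attach the default zone to the stored fields, no text round trip.
  static ZonedDateTime convertDirect(LocalDateTime ts, ZoneId defaultTimeZone) {
    return ts.atZone(defaultTimeZone);
  }

  public static void main(String[] args) {
    LocalDateTime ts = LocalDateTime.of(2021, 2, 3, 12, 32, 23);
    ZoneId zone = ZoneId.of("UTC");
    // Both paths agree for a second-precision value; only the cost differs.
    System.out.println(convertDirect(ts, zone).equals(convertViaString(ts, zone)));
  }
}
```

The direct path never builds or parses a string, so it also avoids the failed first parse attempt that the description identifies as the expensive step.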




[jira] [Work logged] (HIVE-24707) Apply Sane Default for Tez Containers as Last Resort

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24707?focusedWorklogId=547128=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547128
 ]

ASF GitHub Bot logged work on HIVE-24707:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 17:05
Start Date: 03/Feb/21 17:05
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on a change in pull request #1933:
URL: https://github.com/apache/hive/pull/1933#discussion_r569580552



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/MemoryInfo.java
##
@@ -34,9 +37,8 @@
   private Configuration conf;
   private boolean isTez;
   private boolean isLlap;
-  private long maxExecutorMemory;
-  private long mapJoinMemoryThreshold;
-  private long dynPartJoinMemoryThreshold;
+  private long maxExecutorMemory; // value in Bytes

Review comment:
   Sorry to nit, but can we make these 'final' instance variables?  Also, 
can you please move the 'value in bytes' into a proper Javadoc on the getter 
method?

##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
##
@@ -700,13 +700,26 @@ public int getPartition(Object key, Object value, int 
numPartitions) {
* container size isn't set.
*/
   public static Resource getContainerResource(Configuration conf) {
-int memory = HiveConf.getIntVar(conf, 
HiveConf.ConfVars.HIVETEZCONTAINERSIZE) > 0 ?
-  HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCONTAINERSIZE) :
-  conf.getInt(MRJobConfig.MAP_MEMORY_MB, 
MRJobConfig.DEFAULT_MAP_MEMORY_MB);
-int cpus = HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCPUVCORES) > 
0 ?
-  HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCPUVCORES) :
-  conf.getInt(MRJobConfig.MAP_CPU_VCORES, 
MRJobConfig.DEFAULT_MAP_CPU_VCORES);
-return Resource.newInstance(memory, cpus);
+int memorySizeMb = HiveConf.getIntVar(conf, 
HiveConf.ConfVars.HIVETEZCONTAINERSIZE);
+if (memorySizeMb <= 0) {
+  LOG.warn("Falling back to MapReduce container MB {}", 
MRJobConfig.MAP_MEMORY_MB);
+  memorySizeMb = conf.getInt(MRJobConfig.MAP_MEMORY_MB, 
MRJobConfig.DEFAULT_MAP_MEMORY_MB);
+  // When config is explicitly set to "-1" defaultValue does not work!
+  if (memorySizeMb <= 0) {
+LOG.warn("Falling back to default container MB {}", 
MRJobConfig.DEFAULT_MAP_MEMORY_MB);
+memorySizeMb = MRJobConfig.DEFAULT_MAP_MEMORY_MB;
+  }
+}
+int cpuCores = HiveConf.getIntVar(conf, 
HiveConf.ConfVars.HIVETEZCPUVCORES);
+if (cpuCores <= 0) {
+  LOG.warn("Falling back to MapReduce container VCores {}", 
MRJobConfig.MAP_CPU_VCORES);

Review comment:
   Can we please update to say:
   
   ```java
   LOG.warn("No Tez VCore size specified by {}.  Falling back...", 
HiveConf.ConfVars.HIVETEZCPUVCORES,  MRJobConfig.MAP_CPU_VCORES);
   ```

##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
##
@@ -700,13 +700,26 @@ public int getPartition(Object key, Object value, int 
numPartitions) {
* container size isn't set.
*/
   public static Resource getContainerResource(Configuration conf) {
-int memory = HiveConf.getIntVar(conf, 
HiveConf.ConfVars.HIVETEZCONTAINERSIZE) > 0 ?
-  HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCONTAINERSIZE) :
-  conf.getInt(MRJobConfig.MAP_MEMORY_MB, 
MRJobConfig.DEFAULT_MAP_MEMORY_MB);
-int cpus = HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCPUVCORES) > 
0 ?
-  HiveConf.getIntVar(conf, HiveConf.ConfVars.HIVETEZCPUVCORES) :
-  conf.getInt(MRJobConfig.MAP_CPU_VCORES, 
MRJobConfig.DEFAULT_MAP_CPU_VCORES);
-return Resource.newInstance(memory, cpus);
+int memorySizeMb = HiveConf.getIntVar(conf, 
HiveConf.ConfVars.HIVETEZCONTAINERSIZE);
+if (memorySizeMb <= 0) {
+  LOG.warn("Falling back to MapReduce container MB {}", 
MRJobConfig.MAP_MEMORY_MB);

Review comment:
   Can we please update to say:
   
   ```java
   LOG.warn("No Tez container size specified by {}.  Falling back...", 
HiveConf.ConfVars.HIVETEZCONTAINERSIZE,  MRJobConfig.MAP_MEMORY_MB);
   ```
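
The fallback order in the diff above (Tez setting, then MapReduce setting, then a built-in default) can be sketched without Hadoop on the classpath. A plain Map stands in for Configuration here, and the MapReduce key name and the default value of 1024 are assumptions made for the sketch rather than values taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a Map stands in for Hadoop's Configuration. The MapReduce
// key name and the default value are assumptions made for this sketch.
class ContainerSizeFallbackSketch {

  static final String TEZ_CONTAINER_SIZE = "hive.tez.container.size";
  static final String MR_MAP_MEMORY_MB = "mapreduce.map.memory.mb";
  static final int DEFAULT_MAP_MEMORY_MB = 1024;

  static int resolveMemoryMb(Map<String, Integer> conf) {
    int memoryMb = conf.getOrDefault(TEZ_CONTAINER_SIZE, -1);
    if (memoryMb <= 0) {
      // First fallback: the MapReduce setting, if present.
      memoryMb = conf.getOrDefault(MR_MAP_MEMORY_MB, DEFAULT_MAP_MEMORY_MB);
      if (memoryMb <= 0) {
        // An explicit non-positive value bypasses getOrDefault, so the
        // built-in default is applied as a last resort.
        memoryMb = DEFAULT_MAP_MEMORY_MB;
      }
    }
    return memoryMb;
  }

  public static void main(String[] args) {
    Map<String, Integer> conf = new HashMap<>();
    conf.put(TEZ_CONTAINER_SIZE, -1);
    conf.put(MR_MAP_MEMORY_MB, -1);
    System.out.println(resolveMemoryMb(conf));
  }
}
```

The inner check exists for the case called out in the patch comment: a key explicitly set to "-1" is present, so the lookup default is never used.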







Issue Time Tracking
---

Worklog Id: (was: 547128)
Time Spent: 3h 20m  (was: 3h 10m)

> Apply Sane Default for Tez Containers as Last Resort
> 
>
> Key: HIVE-24707
> URL: https://issues.apache.org/jira/browse/HIVE-24707
> Project: Hive
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: Panagiotis Garefalakis
>

[jira] [Updated] (HIVE-24730) Shims classes override values from hive-site.xml and tez-site.xml silently

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-24730:
--
Labels: pull-request-available  (was: )

> Shims classes override values from hive-site.xml and tez-site.xml silently
> --
>
> Key: HIVE-24730
> URL: https://issues.apache.org/jira/browse/HIVE-24730
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since HIVE-14887, 
> [Hadoop23Shims|https://github.com/apache/hive/blob/master/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java]
>  silently overrides e.g. hive.tez.container.size which is defined in 
> data/conf/hive/llap/hive-site.xml. This way, the developer will have no idea 
> about what happened after setting those values in the xml.
> My proposal: 
> 1. don't set those values, unless they reflect an invalid default value 
> (e.g.: -1 for hive.tez.container.size)
> 2. put an INFO level log message about the override
> OR:
> put a comment in hive-site.xml and tez-site.xml files that shims override it 
> while creating a tez mini cluster
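
Points 1 and 2 of the proposal above could look roughly like the following. This is a minimal sketch, not the Hadoop23Shims code: a Map stands in for the configuration object, java.util.logging stands in for Hive's logger, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Illustrative only: not the Hadoop23Shims implementation. A Map stands in for
// the configuration and all names here are hypothetical.
class ShimOverrideSketch {

  private static final Logger LOG =
      Logger.getLogger(ShimOverrideSketch.class.getName());

  // Overrides the key only when its current value is missing or not a positive
  // integer, and logs the override. Returns true if the value was replaced.
  static boolean overrideIfInvalid(Map<String, String> conf, String key, int miniClusterValue) {
    String current = conf.get(key);
    if (isPositiveInt(current)) {
      return false; // a valid user-provided value is left untouched
    }
    LOG.info("Overriding " + key + " (was: " + current + ") with " + miniClusterValue
        + " for the mini cluster");
    conf.put(key, Integer.toString(miniClusterValue));
    return true;
  }

  private static boolean isPositiveInt(String value) {
    if (value == null) {
      return false;
    }
    try {
      return Integer.parseInt(value.trim()) > 0;
    } catch (NumberFormatException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("hive.tez.container.size", "-1");
    System.out.println(overrideIfInvalid(conf, "hive.tez.container.size", 128));
  }
}
```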




[jira] [Work logged] (HIVE-24730) Shims classes override values from hive-site.xml and tez-site.xml silently

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24730?focusedWorklogId=547116=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547116
 ]

ASF GitHub Bot logged work on HIVE-24730:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 16:47
Start Date: 03/Feb/21 16:47
Worklog Time Spent: 10m 
  Work Description: abstractdog opened a new pull request #1941:
URL: https://github.com/apache/hive/pull/1941


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





Issue Time Tracking
---

Worklog Id: (was: 547116)
Remaining Estimate: 0h
Time Spent: 10m

> Shims classes override values from hive-site.xml and tez-site.xml silently
> --
>
> Key: HIVE-24730
> URL: https://issues.apache.org/jira/browse/HIVE-24730
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since HIVE-14887, 
> [Hadoop23Shims|https://github.com/apache/hive/blob/master/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java]
>  silently overrides e.g. hive.tez.container.size which is defined in 
> data/conf/hive/llap/hive-site.xml. This way, the developer will have no idea 
> about what happened after setting those values in the xml.
> My proposal: 
> 1. don't set those values, unless they reflect an invalid default value 
> (e.g.: -1 for hive.tez.container.size)
> 2. put an INFO level log message about the override
> OR:
> put a comment in hive-site.xml and tez-site.xml files that shims override it 
> while creating a tez mini cluster




[jira] [Work logged] (HIVE-23882) Compiler should skip MJ keyExpr for probe optimization

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23882?focusedWorklogId=547103=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547103
 ]

ASF GitHub Bot logged work on HIVE-23882:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 16:28
Start Date: 03/Feb/21 16:28
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1286:
URL: https://github.com/apache/hive/pull/1286#discussion_r569557253



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java
##
@@ -654,11 +655,18 @@ public static String findTableColNameOf(Operator 
start, String internalColNam
 continue;
   }
   // If columnName is the output of a ColumnExpr get the original 
columnName from the Expr Map
-  if (currentOp.getColumnExprMap() != null && 
currentOp.getColumnExprMap().containsKey(internalColName)
-  && currentOp.getColumnExprMap().get(internalColName) instanceof 
ExprNodeColumnDesc) {
-internalColName = ((ExprNodeColumnDesc) 
currentOp.getColumnExprMap().get(internalColName)).getColumn();
+  if (currentOp.getColumnExprMap() != null && 
currentOp.getColumnExprMap().containsKey(internalColName)) {
+// Only use colInfo that is ExprNodeColumnDesc (could even be a UDF on 
the key at this point)
+if (currentOp.getColumnExprMap().get(internalColName) instanceof 
ExprNodeColumnDesc) {
+  internalColName = ((ExprNodeColumnDesc) 
currentOp.getColumnExprMap().get(internalColName)).getColumn();
+  keyColInfo = currentOp.getSchema().getColumnInfo(internalColName);

Review comment:
   This method somewhat reminds me of the `backtrack` methods we have in 
the `ExprNodeDescUtils` class. I don't know if you will find a perfect match, 
but it might be worth taking a look at them.







Issue Time Tracking
---

Worklog Id: (was: 547103)
Time Spent: 1h 40m  (was: 1.5h)

> Compiler should skip MJ keyExpr for probe optimization
> --
>
> Key: HIVE-23882
> URL: https://issues.apache.org/jira/browse/HIVE-23882
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> In probe we cannot currently support key expressions (on the big-table side), 
> as ORC CVs probe the small-table HT directly (there is no expr evaluation at 
> that level).
> TezCompiler should take this into account when picking MJs to push probe 
> details.




[jira] [Work logged] (HIVE-23882) Compiler should skip MJ keyExpr for probe optimization

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23882?focusedWorklogId=547100=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547100
 ]

ASF GitHub Bot logged work on HIVE-23882:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 16:23
Start Date: 03/Feb/21 16:23
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1286:
URL: https://github.com/apache/hive/pull/1286#discussion_r569552909



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java
##
@@ -654,11 +655,18 @@ public static String findTableColNameOf(Operator 
start, String internalColNam
 continue;
   }
   // If columnName is the output of a ColumnExpr get the original 
columnName from the Expr Map
-  if (currentOp.getColumnExprMap() != null && 
currentOp.getColumnExprMap().containsKey(internalColName)
-  && currentOp.getColumnExprMap().get(internalColName) instanceof 
ExprNodeColumnDesc) {
-internalColName = ((ExprNodeColumnDesc) 
currentOp.getColumnExprMap().get(internalColName)).getColumn();

Review comment:
   the old code also has similar issues







Issue Time Tracking
---

Worklog Id: (was: 547100)
Time Spent: 1.5h  (was: 1h 20m)

> Compiler should skip MJ keyExpr for probe optimization
> --
>
> Key: HIVE-23882
> URL: https://issues.apache.org/jira/browse/HIVE-23882
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In probe we cannot currently support key expressions (on the big-table side), 
> as ORC CVs probe the small-table HT directly (there is no expr evaluation at 
> that level).
> TezCompiler should take this into account when picking MJs to push probe 
> details.




[jira] [Work logged] (HIVE-23882) Compiler should skip MJ keyExpr for probe optimization

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23882?focusedWorklogId=547097=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547097
 ]

ASF GitHub Bot logged work on HIVE-23882:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 16:20
Start Date: 03/Feb/21 16:20
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1286:
URL: https://github.com/apache/hive/pull/1286#discussion_r569549445



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java
##
@@ -654,11 +655,18 @@ public static String findTableColNameOf(Operator 
start, String internalColNam
 continue;
   }
   // If columnName is the output of a ColumnExpr get the original 
columnName from the Expr Map
-  if (currentOp.getColumnExprMap() != null && 
currentOp.getColumnExprMap().containsKey(internalColName)
-  && currentOp.getColumnExprMap().get(internalColName) instanceof 
ExprNodeColumnDesc) {
-internalColName = ((ExprNodeColumnDesc) 
currentOp.getColumnExprMap().get(internalColName)).getColumn();
+  if (currentOp.getColumnExprMap() != null && 
currentOp.getColumnExprMap().containsKey(internalColName)) {
+// Only use colInfo that is ExprNodeColumnDesc (could even be a UDF on 
the key at this point)
+if (currentOp.getColumnExprMap().get(internalColName) instanceof 
ExprNodeColumnDesc) {
+  internalColName = ((ExprNodeColumnDesc) 
currentOp.getColumnExprMap().get(internalColName)).getColumn();
+  keyColInfo = currentOp.getSchema().getColumnInfo(internalColName);

Review comment:
   This doesn't look right to me: these 2 lines map back to the previous 
op's column name and then look it up on the current operator. If this does fix 
some issue for you, then I guess there was already an issue with the 
mapping/schema. I think the original bug should be fixed in that case, because 
the output of the above loopback might be undefined.
   
   I think this fix may cause trouble later...







Issue Time Tracking
---

Worklog Id: (was: 547097)
Time Spent: 1h 20m  (was: 1h 10m)

> Compiler should skip MJ keyExpr for probe optimization
> --
>
> Key: HIVE-23882
> URL: https://issues.apache.org/jira/browse/HIVE-23882
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In probe we cannot currently support key expressions (on the big-table side), 
> as ORC CVs probe the small-table HT directly (there is no expr evaluation at 
> that level).
> TezCompiler should take this into account when picking MJs to push probe 
> details.




[jira] [Work logged] (HIVE-23882) Compiler should skip MJ keyExpr for probe optimization

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23882?focusedWorklogId=547096=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547096
 ]

ASF GitHub Bot logged work on HIVE-23882:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 16:19
Start Date: 03/Feb/21 16:19
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1286:
URL: https://github.com/apache/hive/pull/1286#discussion_r569549445



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java
##
@@ -654,11 +655,18 @@ public static String findTableColNameOf(Operator 
start, String internalColNam
 continue;
   }
   // If columnName is the output of a ColumnExpr get the original 
columnName from the Expr Map
-  if (currentOp.getColumnExprMap() != null && 
currentOp.getColumnExprMap().containsKey(internalColName)
-  && currentOp.getColumnExprMap().get(internalColName) instanceof 
ExprNodeColumnDesc) {
-internalColName = ((ExprNodeColumnDesc) 
currentOp.getColumnExprMap().get(internalColName)).getColumn();
+  if (currentOp.getColumnExprMap() != null && 
currentOp.getColumnExprMap().containsKey(internalColName)) {
+// Only use colInfo that is ExprNodeColumnDesc (could even be a UDF on 
the key at this point)
+if (currentOp.getColumnExprMap().get(internalColName) instanceof 
ExprNodeColumnDesc) {
+  internalColName = ((ExprNodeColumnDesc) 
currentOp.getColumnExprMap().get(internalColName)).getColumn();
+  keyColInfo = currentOp.getSchema().getColumnInfo(internalColName);

Review comment:
   This doesn't look right to me: these 2 lines map back to the previous 
op's column name and then look it up on the current operator. If something like 
this happens, then the original bug should be fixed.
   
   I think this fix may cause trouble later...







Issue Time Tracking
---

Worklog Id: (was: 547096)
Time Spent: 1h 10m  (was: 1h)

> Compiler should skip MJ keyExpr for probe optimization
> --
>
> Key: HIVE-23882
> URL: https://issues.apache.org/jira/browse/HIVE-23882
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In probe we cannot currently support key expressions (on the big-table side), 
> as ORC CVs probe the small-table HT directly (there is no expr evaluation at 
> that level).
> TezCompiler should take this into account when picking MJs to push probe 
> details.




[jira] [Updated] (HIVE-24730) Shims classes override values from hive-site.xml and tez-site.xml silently

2021-02-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/HIVE-24730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-24730:

Description: 
Since HIVE-14887, 
[Hadoop23Shims|https://github.com/apache/hive/blob/master/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java]
 silently overrides e.g. hive.tez.container.size which is defined in 
data/conf/hive/llap/hive-site.xml. This way, the developer will have no idea 
about what happened after setting those values in the xml.
My proposal: 
1. don't set those values, unless they reflect an invalid default value (e.g.: 
-1 for hive.tez.container.size)
2. put an INFO level log message about the override

OR:

put a comment in hive-site.xml and tez-site.xml files that shims override it 
while creating a tez mini cluster


  was:
Since HIVE-14887, 
[Hadoop23Shims|https://github.com/apache/hive/blob/master/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java]
 silently overrides e.g. hive.tez.container.size which is defined in 
data/conf/hive/llap/hive-site.xml. This way, the developer will have no idea 
about what happened after setting those values in the xml.
My proposal: 
1. don't set those values, unless they reflect an invalid default value (e.g.: 
-1 for hive.tez.container.size)
2. put an INFO level log message about the override


> Shims classes override values from hive-site.xml and tez-site.xml silently
> --
>
> Key: HIVE-24730
> URL: https://issues.apache.org/jira/browse/HIVE-24730
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>
> Since HIVE-14887, 
> [Hadoop23Shims|https://github.com/apache/hive/blob/master/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java]
>  silently overrides e.g. hive.tez.container.size which is defined in 
> data/conf/hive/llap/hive-site.xml. This way, the developer will have no idea 
> about what happened after setting those values in the xml.
> My proposal: 
> 1. don't set those values, unless they reflect an invalid default value 
> (e.g.: -1 for hive.tez.container.size)
> 2. put an INFO level log message about the override
> OR:
> put a comment in hive-site.xml and tez-site.xml files that shims override it 
> while creating a tez mini cluster




[jira] [Updated] (HIVE-24730) Shims classes override values from hive-site.xml and tez-site.xml silently

2021-02-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/HIVE-24730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-24730:

Description: 
Since HIVE-14887, 
[Hadoop23Shims|https://github.com/apache/hive/blob/master/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java]
 silently overrides e.g. hive.tez.container.size which is defined in 
data/conf/hive/llap/hive-site.xml. This way, the developer will have no idea 
about what happened after setting those values in the xml.
My proposal: 
1. don't set those values, unless they reflect an invalid default value (e.g.: 
-1 for hive.tez.container.size)
2. put an INFO level log message about the override

> Shims classes override values from hive-site.xml and tez-site.xml silently
> --
>
> Key: HIVE-24730
> URL: https://issues.apache.org/jira/browse/HIVE-24730
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>
> Since HIVE-14887, 
> [Hadoop23Shims|https://github.com/apache/hive/blob/master/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java]
>  silently overrides e.g. hive.tez.container.size which is defined in 
> data/conf/hive/llap/hive-site.xml. This way, the developer will have no idea 
> about what happened after setting those values in the xml.
> My proposal: 
> 1. don't set those values, unless they reflect an invalid default value 
> (e.g.: -1 for hive.tez.container.size)
> 2. put an INFO level log message about the override




[jira] [Assigned] (HIVE-24730) Shims classes override values from hive-site.xml and tez-site.xml silently

2021-02-03 Thread Jira



 [ 
https://issues.apache.org/jira/browse/HIVE-24730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor reassigned HIVE-24730:
---

Assignee: László Bodor

> Shims classes override values from hive-site.xml and tez-site.xml silently
> --
>
> Key: HIVE-24730
> URL: https://issues.apache.org/jira/browse/HIVE-24730
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>





[jira] [Work logged] (HIVE-24719) There's a getAcidState call without impersonation in compactor.Worker

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24719?focusedWorklogId=547052=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547052
 ]

ASF GitHub Bot logged work on HIVE-24719:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 15:09
Start Date: 03/Feb/21 15:09
Worklog Time Spent: 10m 
  Work Description: klcopp merged pull request #1937:
URL: https://github.com/apache/hive/pull/1937


   





Issue Time Tracking
---

Worklog Id: (was: 547052)
Time Spent: 1h 10m  (was: 1h)

> There's a getAcidState call without impersonation in compactor.Worker
> -
>
> Key: HIVE-24719
> URL: https://issues.apache.org/jira/browse/HIVE-24719
> Project: Hive
>  Issue Type: Improvement
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In compactor.Initiator and compactor.Cleaner, getAcidState is called by a 
> proxy user (the table/partition dir owner) because the HS2 user might not 
> have permission to list the files. In Worker getAcidState is not called by a 
> proxy user.




[jira] [Resolved] (HIVE-24719) There's a getAcidState call without impersonation in compactor.Worker

2021-02-03 Thread Karen Coppage (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Coppage resolved HIVE-24719.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Committed to master branch. Thanks [~pvary] for reviewing!

> There's a getAcidState call without impersonation in compactor.Worker
> -
>
> Key: HIVE-24719
> URL: https://issues.apache.org/jira/browse/HIVE-24719
> Project: Hive
>  Issue Type: Improvement
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In compactor.Initiator and compactor.Cleaner, getAcidState is called by a 
> proxy user (the table/partition dir owner) because the HS2 user might not 
> have permission to list the files. In Worker getAcidState is not called by a 
> proxy user.




[jira] [Updated] (HIVE-24723) Use ExecutorService in TezSessionPool

2021-02-03 Thread David Mollitor (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated HIVE-24723:
--
Description: Currently there are some wonky home-made thread pooling action 
going on in {{TezSessionPool}}.  Replace it with some JDK/Guava goodness.  
(was: Currently there are some wonky home-made thread pooling action going on 
in {{TezSessionPool}.  Replace it with some JDK/Guava goodness.)

> Use ExecutorService in TezSessionPool
> -
>
> Key: HIVE-24723
> URL: https://issues.apache.org/jira/browse/HIVE-24723
> Project: Hive
>  Issue Type: Improvement
>  Components: Tez
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently there are some wonky home-made thread pooling action going on in 
> {{TezSessionPool}}.  Replace it with some JDK/Guava goodness.
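
As a rough idea of what the JDK side of that replacement might look like, here is a minimal sketch that hands session start-up tasks to an ExecutorService and waits on the returned Futures. It is not the TezSessionPool code: the class name, the method name, and the string "sessions" are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: strings stand in for Tez sessions, and all names here are
// hypothetical rather than taken from TezSessionPool.
class SessionPoolStartupSketch {

  static List<String> openSessions(int sessionCount, int threads) {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    try {
      // Submit one start-up task per session instead of managing threads by hand.
      List<Future<String>> pending = new ArrayList<>();
      for (int i = 0; i < sessionCount; i++) {
        final int id = i;
        pending.add(executor.submit(() -> "session-" + id));
      }
      // Collect results in submission order; get() blocks until each task is done.
      List<String> opened = new ArrayList<>();
      for (Future<String> future : pending) {
        opened.add(future.get());
      }
      return opened;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while opening sessions", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("A session failed to open", e.getCause());
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(openSessions(3, 2));
  }
}
```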




[jira] [Work logged] (HIVE-24723) Use ExecutorService in TezSessionPool

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24723?focusedWorklogId=547023=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547023
 ]

ASF GitHub Bot logged work on HIVE-24723:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 14:23
Start Date: 03/Feb/21 14:23
Worklog Time Spent: 10m 
  Work Description: belugabehr commented on pull request #1939:
URL: https://github.com/apache/hive/pull/1939#issuecomment-772547163


   @pgaref FYI





Issue Time Tracking
---

Worklog Id: (was: 547023)
Time Spent: 0.5h  (was: 20m)

> Use ExecutorService in TezSessionPool
> -
>
> Key: HIVE-24723
> URL: https://issues.apache.org/jira/browse/HIVE-24723
> Project: Hive
>  Issue Type: Improvement
>  Components: Tez
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently there are some wonky home-made thread pooling action going on in 
> {{TezSessionPool}.  Replace it with some JDK/Guava goodness.




[jira] [Work logged] (HIVE-24310) Allow specified number of deserialize errors to be ignored

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24310?focusedWorklogId=547013=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547013
 ]

ASF GitHub Bot logged work on HIVE-24310:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 13:49
Start Date: 03/Feb/21 13:49
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 opened a new pull request #1607:
URL: https://github.com/apache/hive/pull/1607


   
   
   ### What changes were proposed in this pull request?
   Allow specified number of deserialize errors to be ignored
   
   
   
   ### Why are the changes needed?
   Sometimes a user's raw data contains a few corrupted records, for example 
one corrupted record in a file with thousands of records. Today the user has to 
either give up all the records or replay the whole data set in order to run 
successfully on Hive. We should provide a way to ignore such corrupted records. 
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   
   ### How was this patch tested?
   unit tests
   
   





Issue Time Tracking
---

Worklog Id: (was: 547013)
Time Spent: 1h 40m  (was: 1.5h)

> Allow specified number of deserialize errors to be ignored
> --
>
> Key: HIVE-24310
> URL: https://issues.apache.org/jira/browse/HIVE-24310
> Project: Hive
>  Issue Type: Improvement
>  Components: Operators
>Reporter: Zhihua Deng
>Assignee: Zhihua Deng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Users' raw data sometimes contains corrupted records, for example a single 
> corrupted record in a file that holds thousands of records. Today the user must 
> either give up all records or replay the whole data set in order to run 
> successfully on Hive. We should provide a way to ignore such corrupted 
> records. 
>  




[jira] [Work logged] (HIVE-24310) Allow specified number of deserialize errors to be ignored

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24310?focusedWorklogId=547011&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547011
 ]

ASF GitHub Bot logged work on HIVE-24310:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 13:48
Start Date: 03/Feb/21 13:48
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 closed pull request #1607:
URL: https://github.com/apache/hive/pull/1607


   





Issue Time Tracking
---

Worklog Id: (was: 547011)
Time Spent: 1.5h  (was: 1h 20m)

> Allow specified number of deserialize errors to be ignored
> --
>
> Key: HIVE-24310
> URL: https://issues.apache.org/jira/browse/HIVE-24310
> Project: Hive
>  Issue Type: Improvement
>  Components: Operators
>Reporter: Zhihua Deng
>Assignee: Zhihua Deng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Users' raw data sometimes contains corrupted records, for example a single 
> corrupted record in a file that holds thousands of records. Today the user must 
> either give up all records or replay the whole data set in order to run 
> successfully on Hive. We should provide a way to ignore such corrupted 
> records. 
>  




[jira] [Commented] (HIVE-23485) Bound GroupByOperator stats using largest NDV among columns

2021-02-03 Thread Stamatis Zampetakis (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277868#comment-17277868
 ] 

Stamatis Zampetakis commented on HIVE-23485:


Fixed in 
[47bc287f9dbc22f425ffb1968c393dc842145082|https://github.com/apache/hive/commit/47bc287f9dbc22f425ffb1968c393dc842145082].
 Thanks for the review @kgyrtkirk!

> Bound GroupByOperator stats using largest NDV among columns
> ---
>
> Key: HIVE-23485
> URL: https://issues.apache.org/jira/browse/HIVE-23485
> Project: Hive
>  Issue Type: Improvement
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23485.01.patch, HIVE-23485.02.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Consider the following SQL query:
> {code:sql}
> select id, name from person group by id, name;
> {code}
> and assume that the person table contains the following tuples:
> {code:sql}
> insert into person values (0, 'A') ;
> insert into person values (1, 'A') ;
> insert into person values (2, 'B') ;
> insert into person values (3, 'B') ;
> insert into person values (4, 'B') ;
> insert into person values (5, 'C') ;
> {code}
> If we know the number of distinct values (NDV) for every column in the group 
> by clause, then we can infer a lower bound for the number of output rows by 
> taking the maximum NDV of the involved columns. 
> Currently the query in the scenario above has the following plan:
> {noformat}
> Vertex dependency in root stage
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Reducer 2 vectorized
>       File Output Operator [FS_11]
>         Group By Operator [GBY_10] (rows=3 width=92)
>           Output:["_col0","_col1"],keys:KEY._col0, KEY._col1
>         <-Map 1 [SIMPLE_EDGE] vectorized
>           SHUFFLE [RS_9]
>             PartitionCols:_col0, _col1
>             Group By Operator [GBY_8] (rows=3 width=92)
>               Output:["_col0","_col1"],keys:id, name
>               Select Operator [SEL_7] (rows=6 width=92)
>                 Output:["id","name"]
>                 TableScan [TS_0] (rows=6 width=92)
>                   default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]
> {noformat}
> Observe that the stats for the group by report 3 rows, but since the id 
> column (6 distinct values) is one of the grouping keys, the result cannot have fewer than 6 rows.
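
The bound described in the issue can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit referenced above: the class `GroupByBounds` and both method names are hypothetical, and the estimate is additionally capped at the input row count, which the issue text does not discuss.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating NDV-based bounds on group-by cardinality. */
class GroupByBounds {
  /** Every distinct value of any key column yields at least one group. */
  static long lowerBound(long[] ndvPerKeyColumn) {
    // With no key columns there is a single group, hence the default of 1.
    return Arrays.stream(ndvPerKeyColumn).max().orElse(1L);
  }

  /** Clamps an estimate between the NDV lower bound and the input row count. */
  static long boundEstimate(long estimate, long[] ndvPerKeyColumn, long inputRows) {
    long lower = Math.min(lowerBound(ndvPerKeyColumn), inputRows);
    return Math.max(lower, Math.min(estimate, inputRows));
  }
}
```

For the person table above, NDV(id) = 6 and NDV(name) = 3, so `boundEstimate(3, new long[] {6, 3}, 6)` raises the estimate of 3 rows to 6.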




[jira] [Comment Edited] (HIVE-23485) Bound GroupByOperator stats using largest NDV among columns

2021-02-03 Thread Stamatis Zampetakis (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277868#comment-17277868
 ] 

Stamatis Zampetakis edited comment on HIVE-23485 at 2/3/21, 10:03 AM:
--

Fixed in 
[47bc287f9dbc22f425ffb1968c393dc842145082|https://github.com/apache/hive/commit/47bc287f9dbc22f425ffb1968c393dc842145082].
 Thanks for the review [~kgyrtkirk]


was (Author: zabetak):
Fixed in 
[47bc287f9dbc22f425ffb1968c393dc842145082|https://github.com/apache/hive/commit/47bc287f9dbc22f425ffb1968c393dc842145082].
 Thanks for the review @kgyrtkirk!

> Bound GroupByOperator stats using largest NDV among columns
> ---
>
> Key: HIVE-23485
> URL: https://issues.apache.org/jira/browse/HIVE-23485
> Project: Hive
>  Issue Type: Improvement
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23485.01.patch, HIVE-23485.02.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Consider the following SQL query:
> {code:sql}
> select id, name from person group by id, name;
> {code}
> and assume that the person table contains the following tuples:
> {code:sql}
> insert into person values (0, 'A') ;
> insert into person values (1, 'A') ;
> insert into person values (2, 'B') ;
> insert into person values (3, 'B') ;
> insert into person values (4, 'B') ;
> insert into person values (5, 'C') ;
> {code}
> If we know the number of distinct values (NDV) for every column in the group 
> by clause, then we can infer a lower bound for the number of output rows by 
> taking the maximum NDV of the involved columns. 
> Currently the query in the scenario above has the following plan:
> {noformat}
> Vertex dependency in root stage
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Reducer 2 vectorized
>       File Output Operator [FS_11]
>         Group By Operator [GBY_10] (rows=3 width=92)
>           Output:["_col0","_col1"],keys:KEY._col0, KEY._col1
>         <-Map 1 [SIMPLE_EDGE] vectorized
>           SHUFFLE [RS_9]
>             PartitionCols:_col0, _col1
>             Group By Operator [GBY_8] (rows=3 width=92)
>               Output:["_col0","_col1"],keys:id, name
>               Select Operator [SEL_7] (rows=6 width=92)
>                 Output:["id","name"]
>                 TableScan [TS_0] (rows=6 width=92)
>                   default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]
> {noformat}
> Observe that the stats for the group by report 3 rows, but since the id 
> column (6 distinct values) is one of the grouping keys, the result cannot have fewer than 6 rows.




[jira] [Updated] (HIVE-23485) Bound GroupByOperator stats using largest NDV among columns

2021-02-03 Thread Stamatis Zampetakis (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis updated HIVE-23485:
---
Release Note:   (was: Fixed in 
[47bc287f9dbc22f425ffb1968c393dc842145082|https://github.com/apache/hive/commit/47bc287f9dbc22f425ffb1968c393dc842145082].
 Thanks for the review @kgyrtkirk!)

> Bound GroupByOperator stats using largest NDV among columns
> ---
>
> Key: HIVE-23485
> URL: https://issues.apache.org/jira/browse/HIVE-23485
> Project: Hive
>  Issue Type: Improvement
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23485.01.patch, HIVE-23485.02.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Consider the following SQL query:
> {code:sql}
> select id, name from person group by id, name;
> {code}
> and assume that the person table contains the following tuples:
> {code:sql}
> insert into person values (0, 'A') ;
> insert into person values (1, 'A') ;
> insert into person values (2, 'B') ;
> insert into person values (3, 'B') ;
> insert into person values (4, 'B') ;
> insert into person values (5, 'C') ;
> {code}
> If we know the number of distinct values (NDV) for every column in the group 
> by clause, then we can infer a lower bound for the number of output rows by 
> taking the maximum NDV of the involved columns. 
> Currently the query in the scenario above has the following plan:
> {noformat}
> Vertex dependency in root stage
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Reducer 2 vectorized
>       File Output Operator [FS_11]
>         Group By Operator [GBY_10] (rows=3 width=92)
>           Output:["_col0","_col1"],keys:KEY._col0, KEY._col1
>         <-Map 1 [SIMPLE_EDGE] vectorized
>           SHUFFLE [RS_9]
>             PartitionCols:_col0, _col1
>             Group By Operator [GBY_8] (rows=3 width=92)
>               Output:["_col0","_col1"],keys:id, name
>               Select Operator [SEL_7] (rows=6 width=92)
>                 Output:["id","name"]
>                 TableScan [TS_0] (rows=6 width=92)
>                   default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]
> {noformat}
> Observe that the stats for the group by report 3 rows, but since the id 
> column (6 distinct values) is one of the grouping keys, the result cannot have fewer than 6 rows.




[jira] [Updated] (HIVE-23485) Bound GroupByOperator stats using largest NDV among columns

2021-02-03 Thread Stamatis Zampetakis (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis updated HIVE-23485:
---
Fix Version/s: 4.0.0
 Release Note: Fixed in 
[47bc287f9dbc22f425ffb1968c393dc842145082|https://github.com/apache/hive/commit/47bc287f9dbc22f425ffb1968c393dc842145082].
 Thanks for the review @kgyrtkirk!
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Bound GroupByOperator stats using largest NDV among columns
> ---
>
> Key: HIVE-23485
> URL: https://issues.apache.org/jira/browse/HIVE-23485
> Project: Hive
>  Issue Type: Improvement
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23485.01.patch, HIVE-23485.02.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Consider the following SQL query:
> {code:sql}
> select id, name from person group by id, name;
> {code}
> and assume that the person table contains the following tuples:
> {code:sql}
> insert into person values (0, 'A') ;
> insert into person values (1, 'A') ;
> insert into person values (2, 'B') ;
> insert into person values (3, 'B') ;
> insert into person values (4, 'B') ;
> insert into person values (5, 'C') ;
> {code}
> If we know the number of distinct values (NDV) for every column in the group 
> by clause, then we can infer a lower bound for the number of output rows by 
> taking the maximum NDV of the involved columns. 
> Currently the query in the scenario above has the following plan:
> {noformat}
> Vertex dependency in root stage
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Reducer 2 vectorized
>       File Output Operator [FS_11]
>         Group By Operator [GBY_10] (rows=3 width=92)
>           Output:["_col0","_col1"],keys:KEY._col0, KEY._col1
>         <-Map 1 [SIMPLE_EDGE] vectorized
>           SHUFFLE [RS_9]
>             PartitionCols:_col0, _col1
>             Group By Operator [GBY_8] (rows=3 width=92)
>               Output:["_col0","_col1"],keys:id, name
>               Select Operator [SEL_7] (rows=6 width=92)
>                 Output:["id","name"]
>                 TableScan [TS_0] (rows=6 width=92)
>                   default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]
> {noformat}
> Observe that the stats for the group by report 3 rows, but since the id 
> column (6 distinct values) is one of the grouping keys, the result cannot have fewer than 6 rows.




[jira] [Commented] (HIVE-24717) Migrate to listStatusIterator in moving files

2021-02-03 Thread Steve Loughran (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-24717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277866#comment-17277866
 ] 

Steve Loughran commented on HIVE-24717:
---

Happy to review a Hadoop PR with the relevant fix backported.

> Migrate to listStatusIterator in moving files
> -
>
> Key: HIVE-24717
> URL: https://issues.apache.org/jira/browse/HIVE-24717
> Project: Hive
>  Issue Type: Improvement
>Reporter: Mustafa İman
>Assignee: Mustafa İman
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Hive.java makes various calls to the HDFS listStatus API when moving 
> files/directories around. These code paths are used for insert overwrite 
> table/partition queries.
> listStatus is a blocking call, whereas listStatusIterator is backed by a 
> RemoteIterator and fetches pages in the background. Hive should take 
> advantage of that, since Hadoop recently implemented listStatusIterator for S3: 
> https://issues.apache.org/jira/browse/HADOOP-17074
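
The difference between the two listing styles can be sketched without Hadoop on the classpath. Everything below is a hypothetical stand-in: `FakeStore`, `listAll` and `listIterator` are invented for illustration and only mimic the blocking versus paged behaviour described above. The sketch fetches pages lazily on demand; it does not model background prefetching.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a paged directory listing; not the Hadoop API. */
class PagedListingDemo {
  /** Simulated store that serves a listing page by page and counts requests. */
  static class FakeStore {
    final List<String> files;
    final int pageSize;
    int pageRequests = 0;

    FakeStore(List<String> files, int pageSize) {
      this.files = files;
      this.pageSize = pageSize;
    }

    List<String> fetchPage(int pageIndex) {
      pageRequests++;
      int from = Math.min(pageIndex * pageSize, files.size());
      int to = Math.min(from + pageSize, files.size());
      return files.subList(from, to);
    }
  }

  /** Blocking style: fetch every page before returning anything. */
  static List<String> listAll(FakeStore store) {
    List<String> all = new ArrayList<>();
    for (int page = 0; ; page++) {
      List<String> chunk = store.fetchPage(page);
      if (chunk.isEmpty()) {
        break;
      }
      all.addAll(chunk);
    }
    return all;
  }

  /** Iterator style: fetch a page only when the caller reaches it. */
  static Iterator<String> listIterator(FakeStore store) {
    return new Iterator<String>() {
      private int nextPage = 0;
      private Iterator<String> current = Collections.emptyIterator();
      private boolean exhausted = false;

      @Override
      public boolean hasNext() {
        while (!current.hasNext() && !exhausted) {
          List<String> chunk = store.fetchPage(nextPage++);
          if (chunk.isEmpty()) {
            exhausted = true;
          } else {
            current = chunk.iterator();
          }
        }
        return current.hasNext();
      }

      @Override
      public String next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return current.next();
      }
    };
  }
}
```

A caller that starts moving files as soon as the first entry arrives needs only one page request with the iterator, whereas `listAll` pays for every page up front.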




[jira] [Commented] (HIVE-24711) hive metastore memory leak

2021-02-03 Thread LinZhongwei (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-24711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277863#comment-17277863
 ] 

LinZhongwei commented on HIVE-24711:


If HDP Hive does not recognize this config, restarting the Hive metastore will 
fail. 

> hive metastore memory leak
> --
>
> Key: HIVE-24711
> URL: https://issues.apache.org/jira/browse/HIVE-24711
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Metastore
>Affects Versions: 3.1.0
>Reporter: LinZhongwei
>Priority: Major
>
> hdp version:3.1.5.31-1
> hive version:3.1.0.3.1.5.31-1
> hadoop version:3.1.1.3.1.5.31-1
> We find that the Hive metastore has a memory leak if we set 
> compactor.initiator.on to true.
> If we disable the configuration, the memory leak disappears.
> How can we resolve this problem?
> Even if we set the heap size of the Hive metastore to 40 GB, after 1 month the 
> Hive metastore service goes down with an out-of-memory error.




[jira] [Commented] (HIVE-24711) hive metastore memory leak

2021-02-03 Thread LinZhongwei (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-24711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277860#comment-17277860
 ] 

LinZhongwei commented on HIVE-24711:


Yes. The following messages are from hivemetastore.log. However, in my HDP version 
of Hive I cannot find metastore.housekeeping.threads.on in the config dir or in 
the Ambari web UI; I searched with 'grep -R -i "metastore.housekeeping.threads.on"'.  
When I turned off 'compactor.Initiator', the PartitionDiscoveryTask logs 
disappeared, so I think turning off compactor.Initiator also turned off 
'metastore.housekeeping.threads.on'. I will try to set it in Hive and restart 
the Hive metastore.


2021-02-03T16:56:21,709 ERROR [PartitionDiscoveryTask-2]: 
metastore.RetryingHMSHandler (RetryingHMSHandler.java:invokeInternal(197)) - 
MetaException(message:java.security.AccessControlException: Permission denied: 
user=hive, access=WRITE, 
inode="/apps/edl_cn/staging/edl_cn.PAYMENT_EVENT_DELTA_incremental/etl_run_id=20200625005959":gp_etl_edl_batch:gp_etl_edl_batch:drwxr-xr-x

> hive metastore memory leak
> --
>
> Key: HIVE-24711
> URL: https://issues.apache.org/jira/browse/HIVE-24711
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Metastore
>Affects Versions: 3.1.0
>Reporter: LinZhongwei
>Priority: Major
>
> hdp version:3.1.5.31-1
> hive version:3.1.0.3.1.5.31-1
> hadoop version:3.1.1.3.1.5.31-1
> We find that the Hive metastore has a memory leak if we set 
> compactor.initiator.on to true.
> If we disable the configuration, the memory leak disappears.
> How can we resolve this problem?
> Even if we set the heap size of the Hive metastore to 40 GB, after 1 month the 
> Hive metastore service goes down with an out-of-memory error.




[jira] [Updated] (HIVE-24710) Optimise PTF iteration for count(*) to reduce CPU and IO cost

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-24710:
--
Labels: performance pull-request-available  (was: performance)

> Optimise PTF iteration for count(*) to reduce CPU and IO cost
> -
>
> Key: HIVE-24710
> URL: https://issues.apache.org/jira/browse/HIVE-24710
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> E.g query
> {noformat}
> select x, y, count(*) over (partition by x order by y range between 86400 
> PRECEDING and CURRENT ROW) r0 from foo
> {noformat}
> 1. In such cases, there is no need to iterate over the rowcontainers often 
> (internally it does O(n^2) operations taking forever when window frame is 
> really large). This can be optimised to reduce CPU burn and IO.
> 2. BasePartitionEvaluator::calcFunctionValue need not materialize ROW when 
> parameters are empty. This codepath can also be optimised.
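
To illustrate why the quadratic iteration is avoidable for this query shape, here is a sketch that computes count(*) over a RANGE frame of "N PRECEDING and CURRENT ROW" with two monotone pointers. It is not the change proposed for Hive; the class `RangeWindowCount` and its method are hypothetical, and the input is assumed to be the y values of a single partition, already sorted ascending.

```java
/** Hypothetical sketch: count(*) over a RANGE frame in one pass per partition. */
class RangeWindowCount {
  /**
   * @param ys        order-by values of one partition, sorted ascending
   * @param preceding non-negative width of the frame (e.g. 86400)
   * @return for each row, the number of rows with y in [ys[i] - preceding, ys[i]]
   */
  static long[] countOverRange(long[] ys, long preceding) {
    int n = ys.length;
    long[] counts = new long[n];
    int lo = 0; // first row inside the frame
    int hi = 0; // one past the last row inside the frame (includes peers of row i)
    for (int i = 0; i < n; i++) {
      // Drop rows that fell out of the frame; lo never passes i.
      while (ys[lo] < ys[i] - preceding) {
        lo++;
      }
      // Extend over all rows up to and including the peers of the current row.
      while (hi < n && ys[hi] <= ys[i]) {
        hi++;
      }
      counts[i] = hi - lo;
    }
    return counts;
  }
}
```

Both pointers only move forward, so the work is linear in the partition size even when many rows share the same y value, and no row contents need to be materialized to produce the counts.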




[jira] [Work logged] (HIVE-24710) Optimise PTF iteration for count(*) to reduce CPU and IO cost

2021-02-03 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24710?focusedWorklogId=546934&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546934
 ]

ASF GitHub Bot logged work on HIVE-24710:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 09:52
Start Date: 03/Feb/21 09:52
Worklog Time Spent: 10m 
  Work Description: rbalamohan opened a new pull request #1940:
URL: https://github.com/apache/hive/pull/1940


   https://issues.apache.org/jira/browse/HIVE-24710
   
   {noformat}
   select x, y, count(*) over (partition by x order by y range between 86400 
PRECEDING and CURRENT ROW) r0 from foo
   {noformat}
   
   When there are duplicate "y" values, the window frame becomes really large and 
the internal implementation of PTFOperator ends up doing O(n^2) work. E.g. in 
some queries we had 2.5 M entries in the window, which caused a single task to 
run forever. Along with this, there is a high amount of IO due to 
reading and discarding rows from RowContainers (note that we just need the 
count and nothing from the materialized row).
   
   1. In such cases, there is no need to iterate over the rowcontainers often 
(internally it does O(n^2) operations taking forever when window frame is 
really large). This can be optimised to reduce CPU burn and IO.
   2. BasePartitionEvaluator::calcFunctionValue need not materialize ROW when 
parameters are empty. This codepath can also be optimised.
   
   ### What changes were proposed in this pull request?
   - For count(*), the PR follows a fast path and just takes the count from 
PTFPartitionIterator.
   - When parameters are empty/null, it tries to run via an optimised iterator 
that does not materialize anything in ROW. This helps reduce IO cost. 
   
   ### How was this patch tested?
   small internal cluster





Issue Time Tracking
---

Worklog Id: (was: 546934)
Remaining Estimate: 0h
Time Spent: 10m

> Optimise PTF iteration for count(*) to reduce CPU and IO cost
> -
>
> Key: HIVE-24710
> URL: https://issues.apache.org/jira/browse/HIVE-24710
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> E.g query
> {noformat}
> select x, y, count(*) over (partition by x order by y range between 86400 
> PRECEDING and CURRENT ROW) r0 from foo
> {noformat}
> 1. In such cases, there is no need to iterate over the rowcontainers often 
> (internally it does O(n^2) operations taking forever when window frame is 
> really large). This can be optimised to reduce CPU burn and IO.
> 2. BasePartitionEvaluator::calcFunctionValue need not materialize ROW when 
> parameters are empty. This codepath can also be optimised.




[jira] [Commented] (HIVE-24710) Optimise PTF iteration for count(*) to reduce CPU and IO cost

2021-02-03 Thread Rajesh Balamohan (Jira)



[ 
https://issues.apache.org/jira/browse/HIVE-24710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277853#comment-17277853
 ] 

Rajesh Balamohan commented on HIVE-24710:
-

Updated the subject and description of this ticket based on further debugging.

> Optimise PTF iteration for count(*) to reduce CPU and IO cost
> -
>
> Key: HIVE-24710
> URL: https://issues.apache.org/jira/browse/HIVE-24710
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
>
> E.g query
> {noformat}
> select x, y, count(*) over (partition by x order by y range between 86400 
> PRECEDING and CURRENT ROW) r0 from foo
> {noformat}
> 1. In such cases, there is no need to iterate over the rowcontainers often 
> (internally it does O(n^2) operations taking forever when window frame is 
> really large). This can be optimised to reduce CPU burn and IO.
> 2. BasePartitionEvaluator::calcFunctionValue need not materialize ROW when 
> parameters are empty. This codepath can also be optimised.




[jira] [Updated] (HIVE-24710) Optimise PTF iteration for count(*) to reduce CPU and IO cost

2021-02-03 Thread Rajesh Balamohan (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HIVE-24710:

Description: 
E.g query

{noformat}
select x, y, count(*) over (partition by x order by y range between 86400 
PRECEDING and CURRENT ROW) r0 from foo
{noformat}

1. In such cases, there is no need to iterate over the rowcontainers often 
(internally it does O(n^2) operations taking forever when window frame is 
really large). This can be optimised to reduce CPU burn and IO.
2. BasePartitionEvaluator::calcFunctionValue need not materialize ROW when 
parameters are empty. This codepath can also be optimised.







  was:
PTFRowContainer could be reading the same block repeatedly for the first block. 
Default block size is around 25000. For the first 25000 rowIdx, it would read 
the block repeatedly due to ("rowIdx < currentReadBlockStartRow ") condition.

{noformat}
  public Row getAt(int rowIdx) throws HiveException {
    int blockSize = getBlockSize();
    if ( rowIdx < currentReadBlockStartRow || rowIdx >= currentReadBlockStartRow + blockSize ) {
      readBlock(getBlockNum(rowIdx));
    }
    return getReadBlockRow(rowIdx - currentReadBlockStartRow);
  }
{noformat} 

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java#L167

 


> Optimise PTF iteration for count(*) to reduce CPU and IO cost
> -
>
> Key: HIVE-24710
> URL: https://issues.apache.org/jira/browse/HIVE-24710
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
>
> E.g query
> {noformat}
> select x, y, count(*) over (partition by x order by y range between 86400 
> PRECEDING and CURRENT ROW) r0 from foo
> {noformat}
> 1. In such cases, there is no need to iterate over the rowcontainers often 
> (internally it does O(n^2) operations taking forever when window frame is 
> really large). This can be optimised to reduce CPU burn and IO.
> 2. BasePartitionEvaluator::calcFunctionValue need not materialize ROW when 
> parameters are empty. This codepath can also be optimised.




[jira] [Updated] (HIVE-24710) Optimise PTF iteration for count(*) to reduce CPU and IO cost

2021-02-03 Thread Rajesh Balamohan (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated HIVE-24710:

Summary: Optimise PTF iteration for count(*) to reduce CPU and IO cost  
(was: PTFRowContainer could be reading more number of blocks than needed)

> Optimise PTF iteration for count(*) to reduce CPU and IO cost
> -
>
> Key: HIVE-24710
> URL: https://issues.apache.org/jira/browse/HIVE-24710
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
>
> PTFRowContainer could be reading the same block repeatedly for the first 
> block. Default block size is around 25000. For the first 25000 rowIdx, it 
> would read the block repeatedly due to ("rowIdx < currentReadBlockStartRow ") 
> condition.
> {noformat}
>   public Row getAt(int rowIdx) throws HiveException {
>     int blockSize = getBlockSize();
>     if ( rowIdx < currentReadBlockStartRow || rowIdx >= currentReadBlockStartRow + blockSize ) {
>       readBlock(getBlockNum(rowIdx));
>     }
>     return getReadBlockRow(rowIdx - currentReadBlockStartRow);
>   }
> {noformat} 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java#L167
>  




[jira] [Assigned] (HIVE-24727) Cache hydration api in llap proto

2021-02-03 Thread Antal Sinkovits (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antal Sinkovits reassigned HIVE-24727:
--

Assignee: Antal Sinkovits

> Cache hydration api in llap proto
> -
>
> Key: HIVE-24727
> URL: https://issues.apache.org/jira/browse/HIVE-24727
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Antal Sinkovits
>Assignee: Antal Sinkovits
>Priority: Major
>





[jira] [Assigned] (HIVE-24725) Collect top priority items from llap cache policy

2021-02-03 Thread Antal Sinkovits (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antal Sinkovits reassigned HIVE-24725:
--

Assignee: Antal Sinkovits

> Collect top priority items from llap cache policy
> -
>
> Key: HIVE-24725
> URL: https://issues.apache.org/jira/browse/HIVE-24725
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Antal Sinkovits
>Assignee: Antal Sinkovits
>Priority: Major
>





[jira] [Assigned] (HIVE-24726) Track required data for cache hydration

2021-02-03 Thread Antal Sinkovits (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antal Sinkovits reassigned HIVE-24726:
--

Assignee: Antal Sinkovits

> Track required data for cache hydration
> ---
>
> Key: HIVE-24726
> URL: https://issues.apache.org/jira/browse/HIVE-24726
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Antal Sinkovits
>Assignee: Antal Sinkovits
>Priority: Major
>





[jira] [Assigned] (HIVE-24729) Implement strategy for llap cache hydration

2021-02-03 Thread Antal Sinkovits (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antal Sinkovits reassigned HIVE-24729:
--

Assignee: Antal Sinkovits

> Implement strategy for llap cache hydration
> ---
>
> Key: HIVE-24729
> URL: https://issues.apache.org/jira/browse/HIVE-24729
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Antal Sinkovits
>Assignee: Antal Sinkovits
>Priority: Major
>





[jira] [Assigned] (HIVE-24728) Low level reader for llap cache hydration

2021-02-03 Thread Antal Sinkovits (Jira)



 [ 
https://issues.apache.org/jira/browse/HIVE-24728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antal Sinkovits reassigned HIVE-24728:
--

Assignee: Antal Sinkovits

> Low level reader for llap cache hydration
> -
>
> Key: HIVE-24728
> URL: https://issues.apache.org/jira/browse/HIVE-24728
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Antal Sinkovits
>Assignee: Antal Sinkovits
>Priority: Major
>




