[jira] [Work logged] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23851?focusedWorklogId=497097&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-497097
 ]

ASF GitHub Bot logged work on HIVE-23851:
-

Author: ASF GitHub Bot
Created on: 08/Oct/20 05:08
Start Date: 08/Oct/20 05:08
Worklog Time Spent: 10m 
  Work Description: shameersss1 commented on pull request #1271:
URL: https://github.com/apache/hive/pull/1271#issuecomment-705332460


   @kgyrtkirk Thank you for taking the time to review this!
   
   Yes, +1 from my side for deprecating the Kryo stuff. The string-based 
approach is cool, but I am not sure how easy or difficult it will be to make 
the changes. I will try to explore this from my side as well.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 497097)
Time Spent: 5h  (was: 4h 50m)

> MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
> 
>
> Key: HIVE-23851
> URL: https://issues.apache.org/jira/browse/HIVE-23851
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> *Steps to reproduce:*
> # Create external table
> # Run msck command to sync all the partitions with metastore
> # Remove one of the partition paths
> # Run msck repair with partition filtering
> *Stack Trace:*
> {code:java}
>  2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] 
> ppr.PartitionExpressionForMetastore: Failed to deserialize the expression
>  java.lang.IndexOutOfBoundsException: Index: 110, Size: 0
>  at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192]
>  at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT]
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192]
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_192]
> {code}
> *Cause:*
> In the case of msck repair with partition filtering, we expect the expression 
> proxy class to be set to PartitionExpressionForMetastore ( 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java#L78
>  ), while when dropping partitions we serialize the drop partition filter 
> expression as ( 
> https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L589
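
A minimal sketch of the serialize/deserialize round-trip that
PartitionExpressionForMetastore assumes may help here. The deserializer name
comes from the stack trace above; the serializer name and the wrapper class
are illustrative assumptions, not the actual Msck code:

{code:java}
import org.apache.hadoop.hive.ql.exec.SerializationUtilities;
import org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc;

public class FilterExprRoundTrip {

  // What the metastore-side proxy expects to receive: a Kryo-serialized
  // expression produced by the matching serializer.
  static byte[] serialize(ExprNodeGenericFuncDesc filterExpr) {
    return SerializationUtilities.serializeExpressionToKryo(filterExpr);
  }

  // PartitionExpressionForMetastore.deserializeExpr ends up here. If the
  // payload was produced by a different serializer (as in the drop-partition
  // path of Msck.java), Kryo fails with the IndexOutOfBoundsException above.
  static ExprNodeGenericFuncDesc deserialize(byte[] payload) {
    return SerializationUtilities.deserializeExpressionFromKryo(payload);
  }
}
{code}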

[jira] [Work logged] (HIVE-23811) deleteReader SARG rowId/bucketId are not getting validated properly

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23811?focusedWorklogId=497027&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-497027
 ]

ASF GitHub Bot logged work on HIVE-23811:
-

Author: ASF GitHub Bot
Created on: 08/Oct/20 00:44
Start Date: 08/Oct/20 00:44
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] commented on pull request #1214:
URL: https://github.com/apache/hive/pull/1214#issuecomment-705266436


   This pull request has been automatically marked as stale because it has not 
had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in 
need of reviews.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 497027)
Time Spent: 0.5h  (was: 20m)

> deleteReader SARG rowId/bucketId are not getting validated properly
> ---
>
> Key: HIVE-23811
> URL: https://issues.apache.org/jira/browse/HIVE-23811
> Project: Hive
>  Issue Type: Bug
>Reporter: Naresh P R
>Assignee: Naresh P R
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Though we are iterating over the min/max stripe index, we always seem to pick 
> the ColumnStats from the first stripe:
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L596]
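
A minimal sketch of the bug pattern being described (hypothetical names, not
the actual VectorizedOrcAcidRowBatchReader code): the loop walks the stripe
range, but the lookup ignores the loop variable.

{code:java}
import java.util.List;

public class StripeStatsScan {

  // Sketch only: each long[] holds per-stripe column stats.
  static long minAcrossStripes(List<long[]> stripeStats, int minStripe, int maxStripe) {
    long min = Long.MAX_VALUE;
    for (int i = minStripe; i <= maxStripe; i++) {
      long[] stats = stripeStats.get(0); // bug: always reads stripe 0
      // fix: long[] stats = stripeStats.get(i);
      min = Math.min(min, stats[0]);
    }
    return min;
  }
}
{code}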



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23955) Classification of Error Codes in Replication

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23955?focusedWorklogId=497025&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-497025
 ]

ASF GitHub Bot logged work on HIVE-23955:
-

Author: ASF GitHub Bot
Created on: 08/Oct/20 00:43
Start Date: 08/Oct/20 00:43
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] commented on pull request #1358:
URL: https://github.com/apache/hive/pull/1358#issuecomment-705266417


   This pull request has been automatically marked as stale because it has not 
had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in 
need of reviews.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 497025)
Time Spent: 1h 50m  (was: 1h 40m)

> Classification of Error Codes in Replication
> 
>
> Key: HIVE-23955
> URL: https://issues.apache.org/jira/browse/HIVE-23955
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23955.01.patch, HIVE-23955.02.patch, 
> HIVE-23955.03.patch, HIVE-23955.04.patch, Retry Logic for Replication.pdf
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-21611) Date.getTime() can be changed to System.currentTimeMillis()

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-21611?focusedWorklogId=497026&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-497026
 ]

ASF GitHub Bot logged work on HIVE-21611:
-

Author: ASF GitHub Bot
Created on: 08/Oct/20 00:43
Start Date: 08/Oct/20 00:43
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #595:
URL: https://github.com/apache/hive/pull/595


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 497026)
Time Spent: 2h 10m  (was: 2h)

> Date.getTime() can be changed to System.currentTimeMillis()
> ---
>
> Key: HIVE-21611
> URL: https://issues.apache.org/jira/browse/HIVE-21611
> Project: Hive
>  Issue Type: Bug
>Reporter: bd2019us
>Assignee: Hunter Logan
>Priority: Major
>  Labels: pull-request-available
> Attachments: 1.patch
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Hello,
> I found that System.currentTimeMillis() can be used here instead of new 
> Date().getTime().
> Since new Date() is just a thin wrapper around the lightweight method 
> System.currentTimeMillis(), performance suffers if it is invoked too many 
> times.
> According to my local testing in the same environment, 
> System.currentTimeMillis() achieves a speedup of about 5x (435 ms vs 2073 
> ms) when the two methods are each invoked 5,000,000 times.
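
A self-contained micro-benchmark sketch of the comparison described above
(illustrative only; such numbers are sensitive to JIT warm-up, and the figures
quoted are the reporter's, not reproduced here):

{code:java}
import java.util.Date;

public class CurrentTimeMillisBench {
  public static void main(String[] args) {
    final int n = 5_000_000;
    long sink = 0;

    long t0 = System.nanoTime();
    for (int i = 0; i < n; i++) {
      sink += new Date().getTime(); // allocates a Date per call
    }
    long t1 = System.nanoTime();
    for (int i = 0; i < n; i++) {
      sink += System.currentTimeMillis(); // no allocation
    }
    long t2 = System.nanoTime();

    System.out.printf("new Date().getTime():       %d ms%n", (t1 - t0) / 1_000_000);
    System.out.printf("System.currentTimeMillis(): %d ms%n", (t2 - t1) / 1_000_000);
    System.out.println(sink); // keep the loops from being optimized away
  }
}
{code}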



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24232) Incorrect translation of rollup expression from Calcite

2020-10-07 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-24232:
---
Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Incorrect translation of rollup expression from Calcite
> ---
>
> Key: HIVE-24232
> URL: https://issues.apache.org/jira/browse/HIVE-24232
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In Calcite, it is not necessary that the columns in the group set are in the 
> same order as the rollup. For instance, this is the Calcite representation of 
> a rollup for a given query:
> {code}
> HiveAggregate(group=[{1, 6, 7}], groups=[[{1, 6, 7}, {1, 7}, {1}, {}]], 
> agg#0=[sum($12)], agg#1=[count($12)], agg#2=[sum($4)], agg#3=[count($4)], 
> agg#4=[sum($15)], agg#5=[count($15)])
> {code}
> When we generate the Hive plan from the Calcite operator, we incorrectly 
> assume that they are.
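
A minimal sketch of order-independent handling of Calcite group sets, under
the assumption that the translation needs a per-group-set key mask; the helper
below is illustrative, not the actual Hive translation code. Here
Aggregate.getGroupSet() would supply the full group set and
Aggregate.getGroupSets() the individual grouping sets:

{code:java}
import java.util.List;
import org.apache.calcite.util.ImmutableBitSet;

public class RollupTranslation {

  // Bit i of the returned mask corresponds to the i-th key of the full group
  // set, found by an explicit position lookup -- never by assuming the group
  // sets shrink in rollup order (the incorrect assumption this issue fixes).
  static long groupingMask(ImmutableBitSet fullGroupSet, ImmutableBitSet groupSet) {
    List<Integer> keys = fullGroupSet.toList(); // e.g. [1, 6, 7]
    long mask = 0;
    for (int col : groupSet) {                  // e.g. {1, 7}
      mask |= 1L << keys.indexOf(col);
    }
    return mask;
  }
}
{code}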



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24232) Incorrect translation of rollup expression from Calcite

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24232?focusedWorklogId=496961&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496961
 ]

ASF GitHub Bot logged work on HIVE-24232:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 22:03
Start Date: 07/Oct/20 22:03
Worklog Time Spent: 10m 
  Work Description: jcamachor merged pull request #1554:
URL: https://github.com/apache/hive/pull/1554


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496961)
Time Spent: 0.5h  (was: 20m)

> Incorrect translation of rollup expression from Calcite
> ---
>
> Key: HIVE-24232
> URL: https://issues.apache.org/jira/browse/HIVE-24232
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In Calcite, it is not necessary that the columns in the group set are in the 
> same order as the rollup. For instance, this is the Calcite representation of 
> a rollup for a given query:
> {code}
> HiveAggregate(group=[{1, 6, 7}], groups=[[{1, 6, 7}, {1, 7}, {1}, {}]], 
> agg#0=[sum($12)], agg#1=[count($12)], agg#2=[sum($4)], agg#3=[count($4)], 
> agg#4=[sum($15)], agg#5=[count($15)])
> {code}
> When we generate the Hive plan from the Calcite operator, we incorrectly 
> assume that they are.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496802&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496802
 ]

ASF GitHub Bot logged work on HIVE-24203:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 17:55
Start Date: 07/Oct/20 17:55
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501203341



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, 
NodeProcessorCtx procCtx,
 }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *  / \
+   *[Select]  [Select]
+   *||
+   *| [UDTF]
+   *\   /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the 
right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right 
branch.
+   * The join has one-to-many relationship since UDTF can generate multiple 
rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) 
and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule 
implements SemanticNodeProcessor {
+@Override
+public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+  Object... nodeOutputs) throws SemanticException {
+  final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+  final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+  final HiveConf conf = aspCtx.getConf();
+
+  if (!isAllParentsContainStatistics(lop)) {
+return null;
+  }
+
+  final List<Operator<? extends OperatorDesc>> parents = 
lop.getParentOperators();
+  if (parents.size() != 2) {
+LOG.warn("LateralViewJoinOperator should have just two parents but 
actually has "
++ parents.size() + " parents.");
+return null;
+  }
+
+  final Statistics selectStats = 
parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+  final Statistics udtfStats = 
parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+  final double factor = (double) udtfStats.getNumRows() / (double) 
selectStats.getNumRows();
+  final long selectDataSize = 
StatsUtils.safeMult(selectStats.getDataSize(), factor);
+  final long dataSize = StatsUtils.safeAdd(selectDataSize, 
udtfStats.getDataSize());
+  Statistics joinedStats = new Statistics(udtfStats.getNumRows(), 
dataSize, 0, 0);
+
+  if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+final RowSchema schema = lop.getSchema();
+
+joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+final List<ColStatistics> selectColStats = StatsUtils
+.getColStatisticsFromExprMap(conf, selectStats, columnExprMap, 
schema);
+joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+final List<ColStatistics> udtfColStats = StatsUtils
+.getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, 
schema);
+joinedStats.addToColumnStats(udtfColStats);
+
+joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), 
joinedStats, lop);
+lop.setStatistics(joinedStats);
+
+if (LOG.isDebugEnabled()) {
+  LOG.debug("[0] STATS-" + lop.toString() + ": " + 
joinedStats.extendedToString());
+}

Review comment:
   I don't know what the point of these `[0]`/`[1]` markers is; from one of 
the historical commits it seems to me like these are some kind of "log message 
indexes" inside the method.
   I think we could stop doing that...





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496802)
Time Spent: 2h  (was: 1h 50m)

> Implement stats annotation rule for the LateralViewJoinOperator
> ---
>
> Key: HIVE-24203
> URL: https://issues.apache.org/jira/browse/HIVE-24203
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Affects 
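
A worked example of the arithmetic the proposed LateralViewJoinStatsRule
performs, on hypothetical numbers (the formula is taken from the hunk quoted
above):

{code:java}
public class LvjStatsArithmetic {
  public static void main(String[] args) {
    // Hypothetical inputs: the SELECT branch carries 100 rows / 10,000 bytes;
    // the UDTF branch exploded them into 500 rows / 30,000 bytes.
    long selectRows = 100, selectDataSize = 10_000;
    long udtfRows = 500, udtfDataSize = 30_000;

    // factor = T(right) / T(left), as in the quoted rule.
    double factor = (double) udtfRows / (double) selectRows;  // 5.0

    // Left-branch data size is scaled by the factor, then both sides summed.
    long scaledSelect = (long) (selectDataSize * factor);     // 50,000
    long joinedDataSize = scaledSelect + udtfDataSize;        // 80,000

    // The joined operator keeps the UDTF branch's row count.
    System.out.println("rows=" + udtfRows + " dataSize=" + joinedDataSize);
  }
}
{code}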

[jira] [Commented] (HIVE-22344) I can't run hive in command line

2020-10-07 Thread Amit Singh (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209714#comment-17209714
 ] 

Amit Singh commented on HIVE-22344:
---

Replace the Hive jar under /lib with the one shipped with Hadoop.

> I can't run hive in command line
> 
>
> Key: HIVE-22344
> URL: https://issues.apache.org/jira/browse/HIVE-22344
> Project: Hive
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 3.1.2
> Environment: hive: 3.1.2
> hadoop 3.2.1
>  
>Reporter: Smith Cruise
>Priority: Blocker
>
> I can't run hive from the command line. It tells me:
> {code:java}
> [hadoop@master lib]$ hive
> which: no hbase in 
> (/home/hadoop/apache-hive-3.1.2-bin/bin:{{pwd}}/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/hadoop/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/hadoop/hadoop3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Exception in thread "main" java.lang.NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
> at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:536)
> at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:554)
> at org.apache.hadoop.mapred.JobConf.(JobConf.java:448)
> at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:5141)
> at org.apache.hadoop.hive.conf.HiveConf.(HiveConf.java:5099)
> at 
> org.apache.hadoop.hive.common.LogUtils.initHiveLog4jCommon(LogUtils.java:97)
> at 
> org.apache.hadoop.hive.common.LogUtils.initHiveLog4j(LogUtils.java:81)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:699)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> {code}
> I don't know what's wrong with it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-24199) Incorrect result when subquery in exists contains limit

2020-10-07 Thread Krisztian Kasa (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Kasa resolved HIVE-24199.
---
Resolution: Fixed

Pushed to master, thanks [~vgarg] for review.

> Incorrect result when subquery in exists contains limit
> --
>
> Key: HIVE-24199
> URL: https://issues.apache.org/jira/browse/HIVE-24199
> Project: Hive
>  Issue Type: Bug
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {code:java}
> create table web_sales (ws_order_number int, ws_warehouse_sk int) stored as 
> orc;
> insert into web_sales values
> (1, 1),
> (1, 2),
> (2, 1),
> (2, 2);
> select * from web_sales ws1
> where exists (select 1 from web_sales ws2 where ws1.ws_order_number = 
> ws2.ws_order_number limit 1);
> 1 1
> 1 2
> {code}
> {code:java}
> CBO PLAN:
> HiveSemiJoin(condition=[=($0, $2)], joinType=[semi])
>   HiveProject(ws_order_number=[$0], ws_warehouse_sk=[$1])
> HiveFilter(condition=[IS NOT NULL($0)])
>   HiveTableScan(table=[[default, web_sales]], table:alias=[ws1])
>   HiveProject(ws_order_number=[$0])
> HiveSortLimit(fetch=[1])  <-- This shouldn't be added
>   HiveProject(ws_order_number=[$0])
> HiveFilter(condition=[IS NOT NULL($0)])
>   HiveTableScan(table=[[default, web_sales]], table:alias=[ws2])
> {code}
> Limit n on the right side of the join reduces the result set coming from the 
> right side to only n records, hence not all ws_order_number values are 
> included, which leads to a correctness issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24199) Incorrect result when subquery in exists contains limit

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24199?focusedWorklogId=496765&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496765
 ]

ASF GitHub Bot logged work on HIVE-24199:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 17:15
Start Date: 07/Oct/20 17:15
Worklog Time Spent: 10m 
  Work Description: kasakrisz merged pull request #1525:
URL: https://github.com/apache/hive/pull/1525


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496765)
Time Spent: 50m  (was: 40m)

> Incorrect result when subquery in exists contains limit
> --
>
> Key: HIVE-24199
> URL: https://issues.apache.org/jira/browse/HIVE-24199
> Project: Hive
>  Issue Type: Bug
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> {code:java}
> create table web_sales (ws_order_number int, ws_warehouse_sk int) stored as 
> orc;
> insert into web_sales values
> (1, 1),
> (1, 2),
> (2, 1),
> (2, 2);
> select * from web_sales ws1
> where exists (select 1 from web_sales ws2 where ws1.ws_order_number = 
> ws2.ws_order_number limit 1);
> 1 1
> 1 2
> {code}
> {code:java}
> CBO PLAN:
> HiveSemiJoin(condition=[=($0, $2)], joinType=[semi])
>   HiveProject(ws_order_number=[$0], ws_warehouse_sk=[$1])
> HiveFilter(condition=[IS NOT NULL($0)])
>   HiveTableScan(table=[[default, web_sales]], table:alias=[ws1])
>   HiveProject(ws_order_number=[$0])
> HiveSortLimit(fetch=[1])  <-- This shouldn't be added
>   HiveProject(ws_order_number=[$0])
> HiveFilter(condition=[IS NOT NULL($0)])
>   HiveTableScan(table=[[default, web_sales]], table:alias=[ws2])
> {code}
> Limit n on the right side of the join reduces the result set coming from the 
> right side to only n records, hence not all ws_order_number values are 
> included, which leads to a correctness issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24040) Slightly odd behaviour with CHAR comparisons and string literals

2020-10-07 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209672#comment-17209672
 ] 

Tim Armstrong commented on HIVE-24040:
--

[~kgyrtkirk] I'd recommend reading 
http://databasearchitects.blogspot.com/2015/01/fun-with-char.html for an 
interesting perspective on this (one of its conclusions is that Postgres and 
other systems do not implement the spec exactly, and that may be a good thing).

> Slightly odd behaviour with CHAR comparisons and string literals
> 
>
> Key: HIVE-24040
> URL: https://issues.apache.org/jira/browse/HIVE-24040
> Project: Hive
>  Issue Type: Bug
>Reporter: Tim Armstrong
>Priority: Major
>
> If t is a char column, this statement behaves a bit strangely - since the RHS 
> is a STRING, I would have expected it to behave consistently with other 
> CHAR/STRING comparisons, where the CHAR column has its trailing spaces 
> removed and the STRING does not have its trailing spaces removed.
> {noformat}
> select count(*) from ax where t = cast('a ' as string);
> {noformat}
> Instead it seems to be treated the same as if it was a plain literal, 
> interpreted as CHAR, i.e.
> {noformat}
> select count(*) from ax where t = 'a ';
> {noformat}
> Here are some more experiments I did based on 
> https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/in_typecheck_char.q
>  that seem to show some inconsistencies.
> {noformat}
> -- Hive version 3.1.3000.7.2.1.0-287 r4e72e59f1c2a51a64e0ff37b14bd396cd4e97b98
> create table ax(s char(1),t char(10));
> insert into ax values ('a','a'),('a','a '),('b','bb');
> -- varchar literal preserves trailing space
> select count(*) from ax where t = cast('a ' as varchar(50));
> +------+
> | _c0  |
> +------+
> | 0    |
> +------+
> -- explicit cast of literal to string removes trailing space
> select count(*) from ax where t = cast('a ' as string);
> +------+
> | _c0  |
> +------+
> | 2    |
> +------+
> -- other string expressions preserve trailing space
> select count(*) from ax where t = concat('a', ' ');
> +------+
> | _c0  |
> +------+
> | 0    |
> +------+
> -- varchar col preserves trailing space
> create table stringv as select cast('a  ' as varchar(50));
> select count(*) from ax, stringv where t = `_c0`;
> +------+
> | _c0  |
> +------+
> | 0    |
> +------+
> -- string col preserves trailing space
> create table stringa as select 'a  ';
> select count(*) from ax, stringa where t = `_c0`;
> +------+
> | _c0  |
> +------+
> | 0    |
> +------+
> {noformat}
> [~jcamachorodriguez] [~kgyrtkirk]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-24229) DirectSql fails in case of OracleDB

2020-10-07 Thread Zoltan Haindrich (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Haindrich resolved HIVE-24229.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged into master. Thank you [~ayushtkn]!

> DirectSql fails in case of OracleDB
> ---
>
> Key: HIVE-24229
> URL: https://issues.apache.org/jira/browse/HIVE-24229
> Project: Hive
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Direct SQL fails due to different data type mappings in the case of Oracle DB



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24229) DirectSql fails in case of OracleDB

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24229?focusedWorklogId=496733&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496733
 ]

ASF GitHub Bot logged work on HIVE-24229:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 16:32
Start Date: 07/Oct/20 16:32
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk merged pull request #1552:
URL: https://github.com/apache/hive/pull/1552


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496733)
Time Spent: 40m  (was: 0.5h)

> DirectSql fails in case of OracleDB
> ---
>
> Key: HIVE-24229
> URL: https://issues.apache.org/jira/browse/HIVE-24229
> Project: Hive
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Direct SQL fails due to different data type mappings in the case of Oracle DB



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=496724&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496724
 ]

ASF GitHub Bot logged work on HIVE-21052:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 16:17
Start Date: 07/Oct/20 16:17
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on a change in pull request #1415:
URL: https://github.com/apache/hive/pull/1415#discussion_r501141180



##
File path: 
standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/txn/CompactionTxnHandler.java
##
@@ -386,15 +427,27 @@ public void markCleaned(CompactionInfo info) throws 
MetaException {
   pStmt.setLong(paramCount++, info.highestWriteId);
 }
 LOG.debug("Going to execute update <" + s + ">");
-if (pStmt.executeUpdate() < 1) {
-  LOG.error("Expected to remove at least one row from 
completed_txn_components when " +
-"marking compaction entry as clean!");
+if ((updCount = pStmt.executeUpdate()) < 1) {
+  // In the case of clean abort commit hasn't happened so 
completed_txn_components hasn't been filled
+  if (!info.isCleanAbortedCompaction()) {
+LOG.error(
+"Expected to remove at least one row from 
completed_txn_components when "
++ "marking compaction entry as clean!");
+  }
 }
 
 s = "select distinct txn_id from TXNS, TXN_COMPONENTS where txn_id = 
tc_txnid and txn_state = '" +
   TXN_ABORTED + "' and tc_database = ? and tc_table = ?";
 if (info.highestWriteId != 0) s += " and tc_writeid <= ?";
 if (info.partName != null) s += " and tc_partition = ?";
+if (info.writeIds != null && info.writeIds.size() > 0) {
+  String[] wriStr = new String[info.writeIds.size()];
+  int i = 0;
+  for (Long writeId: writeIds) {
+wriStr[i++] = writeId.toString();
+  }
+  s += " and tc_writeid in (" + String.join(",", wriStr) + ")";

Review comment:
   is this even used, statement was already compiled?
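
If I read the question right, the point is that `s` is appended to after the
PreparedStatement has already been compiled, so the extra predicate never
reaches the database. A minimal sketch of the usual remedy, assuming plain
JDBC and the simplified table/column names from the hunk above ('a' stands in
for the TXN_ABORTED constant):

{code:java}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class InClausePrepare {

  // Build the IN list *before* preparing, and bind the values as parameters.
  static PreparedStatement selectAbortedTxns(Connection conn, String db,
      String table, List<Long> writeIds) throws SQLException {
    StringBuilder sql = new StringBuilder(
        "select distinct txn_id from TXNS, TXN_COMPONENTS"
        + " where txn_id = tc_txnid and txn_state = 'a'"
        + " and tc_database = ? and tc_table = ?");
    if (!writeIds.isEmpty()) {
      sql.append(" and tc_writeid in (");
      for (int i = 0; i < writeIds.size(); i++) {
        sql.append(i == 0 ? "?" : ",?");
      }
      sql.append(")");
    }
    // Compiled once, after all edits to the SQL string.
    PreparedStatement pStmt = conn.prepareStatement(sql.toString());
    int p = 1;
    pStmt.setString(p++, db);
    pStmt.setString(p++, table);
    for (Long writeId : writeIds) {
      pStmt.setLong(p++, writeId);
    }
    return pStmt;
  }
}
{code}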





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496724)
Time Spent: 6.5h  (was: 6h 20m)

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0, 3.1.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
>  Labels: pull-request-available
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, 
> HIVE-21052.8.patch, HIVE-21052.9.patch
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written to the table, the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by (see the sketch below):
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn, and 
> when addPartitions is called, removing this entry from TXN_COMPONENTS and 
> adding the corresponding partition entries to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and aborted, it must generate jobs 
> for the worker for every possible partition available.
> cc [~ewohlstadter]
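
A minimal sketch of the proposed marker scheme, assuming JDBC against the
TXN_COMPONENTS columns visible in the review hunk above; the marker value and
class are hypothetical, not the actual patch:

{code:java}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Types;

public class AbortMarkerSketch {

  // Hypothetical sentinel; the real patch defines its own representation.
  static final String OPEN_TXN_MARKER = "_OPEN_TXN_MARKER_";

  // Step 1: at openTxn, record that the txn exists even though no table or
  // partition is known yet. If the txn aborts before addPartitions, the
  // cleaner sees this row and schedules cleanup for every possible partition.
  static void writeMarker(Connection conn, long txnId) throws SQLException {
    String sql = "insert into TXN_COMPONENTS"
        + " (tc_txnid, tc_database, tc_table, tc_partition) values (?, ?, ?, ?)";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setLong(1, txnId);
      ps.setString(2, OPEN_TXN_MARKER);
      ps.setString(3, OPEN_TXN_MARKER);
      ps.setNull(4, Types.VARCHAR);
      ps.executeUpdate();
    }
  }

  // Step 2: at addPartitions, delete the marker row and insert the real
  // per-partition entries in the same database transaction (omitted here).
}
{code}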



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24236) Connection leak in TxnHandler

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24236?focusedWorklogId=496693&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496693
 ]

ASF GitHub Bot logged work on HIVE-24236:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 15:43
Start Date: 07/Oct/20 15:43
Worklog Time Spent: 10m 
  Work Description: yongzhi commented on pull request #1559:
URL: https://github.com/apache/hive/pull/1559#issuecomment-705023401


   recheck



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496693)
Time Spent: 1h 10m  (was: 1h)

> Connection leak in TxnHandler
> -
>
> Key: HIVE-24236
> URL: https://issues.apache.org/jira/browse/HIVE-24236
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Reporter: Yongzhi Chen
>Assignee: Yongzhi Chen
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We see failures in QE tests with "cannot get a connection" errors. The 
> exception stack is like the following:
> {noformat}
> 2020-09-29T18:44:26,563 INFO  [Heartbeater-0]: txn.TxnHandler 
> (TxnHandler.java:checkRetryable(3733)) - Non-retryable error in 
> heartbeat(HeartbeatRequest(lockid:0, txnid:11908)) : Cannot get a connection, 
> general error (SQLState=null, ErrorCode=0)
> 2020-09-29T18:44:26,564 ERROR [Heartbeater-0]: metastore.RetryingHMSHandler 
> (RetryingHMSHandler.java:invokeInternal(201)) - MetaException(message:Unable 
> to select from transaction database 
> org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, general 
> error
> at 
> org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:118)
> at 
> org.apache.hadoop.hive.metastore.txn.TxnHandler.getDbConn(TxnHandler.java:3605)
> at 
> org.apache.hadoop.hive.metastore.txn.TxnHandler.getDbConn(TxnHandler.java:3598)
> at 
> org.apache.hadoop.hive.metastore.txn.TxnHandler.heartbeat(TxnHandler.java:2739)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.heartbeat(HiveMetaStore.java:8452)
> at sun.reflect.GeneratedMethodAccessor415.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)
> at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108)
> at com.sun.proxy.$Proxy63.heartbeat(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.heartbeat(HiveMetaStoreClient.java:3247)
> at sun.reflect.GeneratedMethodAccessor414.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:213)
> at com.sun.proxy.$Proxy64.heartbeat(Unknown Source)
> at 
> org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.heartbeat(DbTxnManager.java:671)
> at 
> org.apache.hadoop.hive.ql.lockmgr.DbTxnManager$Heartbeater.lambda$run$0(DbTxnManager.java:1102)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
> at 
> org.apache.hadoop.hive.ql.lockmgr.DbTxnManager$Heartbeater.run(DbTxnManager.java:1101)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at 
> 
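
A minimal sketch of the try-with-resources pattern that prevents this class of
leak, assuming a javax.sql.DataSource-backed pool; the query and method shape
are illustrative, not the actual TxnHandler fix:

{code:java}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class HeartbeatSketch {

  // try-with-resources returns the connection to the pool on every path,
  // including exceptions; a Connection leaked on an error path is what
  // eventually exhausts the pool and yields "Cannot get a connection".
  static void heartbeat(DataSource pool, long txnId) throws SQLException {
    try (Connection dbConn = pool.getConnection();
         PreparedStatement ps = dbConn.prepareStatement(
             "select txn_id from TXNS where txn_id = ?")) {
      ps.setLong(1, txnId);
      try (ResultSet rs = ps.executeQuery()) {
        if (!rs.next()) {
          throw new SQLException("No such transaction: " + txnId);
        }
        // ... update the heartbeat timestamp here ...
      }
    } // dbConn is released here even if the query throws
  }
}
{code}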

[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496688&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496688
 ]

ASF GitHub Bot logged work on HIVE-24203:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 15:35
Start Date: 07/Oct/20 15:35
Worklog Time Spent: 10m 
  Work Description: okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501110922



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, 
NodeProcessorCtx procCtx,
 }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *  / \
+   *[Select]  [Select]
+   *||
+   *| [UDTF]
+   *\   /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the 
right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right 
branch.
+   * The join has one-to-many relationship since UDTF can generate multiple 
rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) 
and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule 
implements SemanticNodeProcessor {
+@Override
+public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+  Object... nodeOutputs) throws SemanticException {
+  final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+  final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+  final HiveConf conf = aspCtx.getConf();
+
+  if (!isAllParentsContainStatistics(lop)) {
+return null;
+  }
+
+  final List<Operator<? extends OperatorDesc>> parents = 
lop.getParentOperators();
+  if (parents.size() != 2) {
+LOG.warn("LateralViewJoinOperator should have just two parents but 
actually has "
++ parents.size() + " parents.");
+return null;
+  }
+
+  final Statistics selectStats = 
parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+  final Statistics udtfStats = 
parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+  final double factor = (double) udtfStats.getNumRows() / (double) 
selectStats.getNumRows();
+  final long selectDataSize = 
StatsUtils.safeMult(selectStats.getDataSize(), factor);
+  final long dataSize = StatsUtils.safeAdd(selectDataSize, 
udtfStats.getDataSize());
+  Statistics joinedStats = new Statistics(udtfStats.getNumRows(), 
dataSize, 0, 0);
+
+  if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+final RowSchema schema = lop.getSchema();
+
+joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+final List<ColStatistics> selectColStats = StatsUtils
+.getColStatisticsFromExprMap(conf, selectStats, columnExprMap, 
schema);
+joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+final List<ColStatistics> udtfColStats = StatsUtils
+.getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, 
schema);
+joinedStats.addToColumnStats(udtfColStats);
+
+joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), 
joinedStats, lop);
+lop.setStatistics(joinedStats);
+
+if (LOG.isDebugEnabled()) {
+  LOG.debug("[0] STATS-" + lop.toString() + ": " + 
joinedStats.extendedToString());
+}
+  } else {
+joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), 
joinedStats, lop);
+lop.setStatistics(joinedStats);
+
+if (LOG.isDebugEnabled()) {
+  LOG.debug("[1] STATS-" + lop.toString() + ": " + 
joinedStats.extendedToString());
+}
+  }
+  return null;
+}
+
+private List<ColStatistics> multiplyColStats(List<ColStatistics> 
colStatistics, double factor) {
+  for (ColStatistics colStats : colStatistics) {
+colStats.setNumFalses(StatsUtils.safeMult(colStats.getNumFalses(), 
factor));
+colStats.setNumTrues(StatsUtils.safeMult(colStats.getNumTrues(), 
factor));
+colStats.setNumNulls(StatsUtils.safeMult(colStats.getNumNulls(), 
factor));
+// When factor > 1, the same records are duplicated and countDistinct 
never changes.
+if (factor < 1.0) {
+  
colStats.setCountDistint(StatsUtils.safeMult(colStats.getCountDistint(), 
factor));

Review comment:
   This method may include additional logging and logic to optimize JOIN 
such as 
[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496681&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496681
 ]

ASF GitHub Bot logged work on HIVE-24203:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 15:23
Start Date: 07/Oct/20 15:23
Worklog Time Spent: 10m 
  Work Description: okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501101794



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, 
NodeProcessorCtx procCtx,
 }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *  / \
+   *[Select]  [Select]
+   *||
+   *| [UDTF]
+   *\   /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the 
right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right 
branch.
+   * The join has one-to-many relationship since UDTF can generate multiple 
rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) 
and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule 
implements SemanticNodeProcessor {
+@Override
+public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+  Object... nodeOutputs) throws SemanticException {
+  final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+  final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+  final HiveConf conf = aspCtx.getConf();
+
+  if (!isAllParentsContainStatistics(lop)) {
+return null;
+  }
+
+  final List<Operator<? extends OperatorDesc>> parents = 
lop.getParentOperators();
+  if (parents.size() != 2) {
+LOG.warn("LateralViewJoinOperator should have just two parents but 
actually has "
++ parents.size() + " parents.");
+return null;
+  }
+
+  final Statistics selectStats = 
parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+  final Statistics udtfStats = 
parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+  final double factor = (double) udtfStats.getNumRows() / (double) 
selectStats.getNumRows();
+  final long selectDataSize = 
StatsUtils.safeMult(selectStats.getDataSize(), factor);
+  final long dataSize = StatsUtils.safeAdd(selectDataSize, 
udtfStats.getDataSize());
+  Statistics joinedStats = new Statistics(udtfStats.getNumRows(), 
dataSize, 0, 0);
+
+  if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+final RowSchema schema = lop.getSchema();
+
+joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+final List<ColStatistics> selectColStats = StatsUtils
+.getColStatisticsFromExprMap(conf, selectStats, columnExprMap, 
schema);
+joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+final List<ColStatistics> udtfColStats = StatsUtils
+.getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, 
schema);
+joinedStats.addToColumnStats(udtfColStats);
+
+joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), 
joinedStats, lop);
+lop.setStatistics(joinedStats);
+
+if (LOG.isDebugEnabled()) {
+  LOG.debug("[0] STATS-" + lop.toString() + ": " + 
joinedStats.extendedToString());
+}

Review comment:
   I wonder if we should switch between `[0]` and `[1]` based on a condition. 
I can see that some rules use a different marker, maybe based on the existence 
of column stats.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496681)
Time Spent: 1h 40m  (was: 1.5h)

> Implement stats annotation rule for the LateralViewJoinOperator
> ---
>
> Key: HIVE-24203
> URL: https://issues.apache.org/jira/browse/HIVE-24203
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Affects Versions: 4.0.0, 3.1.2, 2.3.7
>Reporter: okumin
>

[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496677&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496677
 ]

ASF GitHub Bot logged work on HIVE-24203:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 15:17
Start Date: 07/Oct/20 15:17
Worklog Time Spent: 10m 
  Work Description: okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501096999



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, 
NodeProcessorCtx procCtx,
 }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *  / \
+   *[Select]  [Select]
+   *||
+   *| [UDTF]
+   *\   /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the 
right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right 
branch.
+   * The join has one-to-many relationship since UDTF can generate multiple 
rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) 
and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule 
implements SemanticNodeProcessor {
+@Override
+public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+  Object... nodeOutputs) throws SemanticException {
+  final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+  final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+  final HiveConf conf = aspCtx.getConf();
+
+  if (!isAllParentsContainStatistics(lop)) {
+return null;
+  }
+
+  final List<Operator<? extends OperatorDesc>> parents = 
lop.getParentOperators();
+  if (parents.size() != 2) {
+LOG.warn("LateralViewJoinOperator should have just two parents but 
actually has "
++ parents.size() + " parents.");
+return null;
+  }
+
+  final Statistics selectStats = 
parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+  final Statistics udtfStats = 
parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+  final double factor = (double) udtfStats.getNumRows() / (double) 
selectStats.getNumRows();
+  final long selectDataSize = 
StatsUtils.safeMult(selectStats.getDataSize(), factor);
+  final long dataSize = StatsUtils.safeAdd(selectDataSize, 
udtfStats.getDataSize());
+  Statistics joinedStats = new Statistics(udtfStats.getNumRows(), 
dataSize, 0, 0);
+
+  if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+final RowSchema schema = lop.getSchema();
+
+joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+final List<ColStatistics> selectColStats = StatsUtils
+.getColStatisticsFromExprMap(conf, selectStats, columnExprMap, 
schema);
+joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+final List<ColStatistics> udtfColStats = StatsUtils
+.getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, 
schema);
+joinedStats.addToColumnStats(udtfColStats);
+
+joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), 
joinedStats, lop);
+lop.setStatistics(joinedStats);
+
+if (LOG.isDebugEnabled()) {
+  LOG.debug("[0] STATS-" + lop.toString() + ": " + 
joinedStats.extendedToString());
+}
+  } else {
+joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), 
joinedStats, lop);
+lop.setStatistics(joinedStats);
+
+if (LOG.isDebugEnabled()) {
+  LOG.debug("[1] STATS-" + lop.toString() + ": " + 
joinedStats.extendedToString());
+}
+  }
+  return null;
+}
+
+private List<ColStatistics> multiplyColStats(List<ColStatistics> 
colStatistics, double factor) {
+  for (ColStatistics colStats : colStatistics) {
+colStats.setNumFalses(StatsUtils.safeMult(colStats.getNumFalses(), 
factor));
+colStats.setNumTrues(StatsUtils.safeMult(colStats.getNumTrues(), 
factor));
+colStats.setNumNulls(StatsUtils.safeMult(colStats.getNumNulls(), 
factor));
+// When factor > 1, the same records are duplicated and countDistinct 
never changes.
+if (factor < 1.0) {
+  
colStats.setCountDistint(StatsUtils.safeMult(colStats.getCountDistint(), 
factor));

Review comment:
   Now I think this is available for this purpose if we add updating num 
trues and 

[jira] [Commented] (HIVE-24040) Slightly odd behaviour with CHAR comparisons and string literals

2020-10-07 Thread Zoltan Haindrich (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209610#comment-17209610
 ] 

Zoltan Haindrich commented on HIVE-24040:
-

{code}
select cast('a' as char(10)) = cast('a ' as varchar(50))
{code}

in psql I got some interesting results:
{code}
select length(cast('a ' as varchar(10))), length(cast('a ' as char(10))), cast('a ' as varchar(10)) = cast('a ' as char(10));
 length | length | ?column? 
--------+--------+----------
      2 |      1 | t
{code}

in Hive, for the above case, the comparison should happen in "string", for 
which the lengths are different => they will not match
{code}
select length(cast(cast('a' as char(10)) as string)),length(cast(cast('a ' as 
varchar(50)) as string))
+--+--+
| _c0  | _c1  |
+--+--+
| 1| 2|
+--+--+
{code}

I feel that this is somewhere in the gray zone... I will dig into the SQL specs...

> Slightly odd behaviour with CHAR comparisons and string literals
> 
>
> Key: HIVE-24040
> URL: https://issues.apache.org/jira/browse/HIVE-24040
> Project: Hive
>  Issue Type: Bug
>Reporter: Tim Armstrong
>Priority: Major
>
> If t is a char column, this statement behaves a bit strangely - since the RHS 
> is a STRING, I would have expected it to behave consistently with other 
> CHAR/STRING comparisons, where the CHAR column has its trailing spaces 
> removed and the STRING does not have its trailing spaces removed.
> {noformat}
> select count(*) from ax where t = cast('a ' as string);
> {noformat}
> Instead it seems to be treated the same as if it was a plain literal, 
> interpreted as CHAR, i.e.
> {noformat}
> select count(*) from ax where t = 'a ';
> {noformat}
> Here are some more experiments I did based on 
> https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/in_typecheck_char.q
>  that seem to show some inconsistencies.
> {noformat}
> -- Hive version 3.1.3000.7.2.1.0-287 r4e72e59f1c2a51a64e0ff37b14bd396cd4e97b98
> create table ax(s char(1),t char(10));
> insert into ax values ('a','a'),('a','a '),('b','bb');
> -- varchar literal preserves trailing space
> select count(*) from ax where t = cast('a ' as varchar(50));
> +------+
> | _c0  |
> +------+
> | 0    |
> +------+
> -- explicit cast of literal to string removes trailing space
> select count(*) from ax where t = cast('a ' as string);
> +------+
> | _c0  |
> +------+
> | 2    |
> +------+
> -- other string expressions preserve trailing space
> select count(*) from ax where t = concat('a', ' ');
> +------+
> | _c0  |
> +------+
> | 0    |
> +------+
> -- varchar col preserves trailing space
> create table stringv as select cast('a  ' as varchar(50));
> select count(*) from ax, stringv where t = `_c0`;
> +------+
> | _c0  |
> +------+
> | 0    |
> +------+
> -- string col preserves trailing space
> create table stringa as select 'a  ';
> select count(*) from ax, stringa where t = `_c0`;
> +------+
> | _c0  |
> +------+
> | 0    |
> +------+
> {noformat}
> [~jcamachorodriguez] [~kgyrtkirk]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496654&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496654
 ]

ASF GitHub Bot logged work on HIVE-24203:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 14:53
Start Date: 07/Oct/20 14:53
Worklog Time Spent: 10m 
  Work Description: okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r501078148



##
File path: 
ql/src/test/results/clientpositive/llap/annotate_stats_lateral_view_join.q.out
##
@@ -503,14 +503,14 @@ STAGE PLANS:
 Statistics: Num rows: 1 Data size: 376 Basic 
stats: COMPLETE Column stats: COMPLETE
 Lateral View Join Operator
   outputColumnNames: _col0, _col1, _col5, _col6
-  Statistics: Num rows: 0 Data size: 24 Basic 
stats: PARTIAL Column stats: NONE
+  Statistics: Num rows: 0 Data size: 24 Basic 
stats: PARTIAL Column stats: COMPLETE

Review comment:
   This is an edge case since `HIVE_STATS_UDTF_FACTOR` is greater than or 
equal to 1. Anyway, I created a ticket.
   https://issues.apache.org/jira/browse/HIVE-24240





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496654)
Time Spent: 1h 20m  (was: 1h 10m)

> Implement stats annotation rule for the LateralViewJoinOperator
> ---
>
> Key: HIVE-24203
> URL: https://issues.apache.org/jira/browse/HIVE-24203
> Project: Hive
>  Issue Type: Improvement
>  Components: Physical Optimizer
>Affects Versions: 4.0.0, 3.1.2, 2.3.7
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> StatsRulesProcFactory doesn't have any rules to handle a JOIN by LATERAL VIEW.
> This can cause an underestimation in case the UDTF in a LATERAL VIEW 
> generates multiple rows.
> HIVE-20262 has already added the rule for UDTF.
> This issue would add the rule for LateralViewJoinOperator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24240) Implement missing features in UDTFStatsRule

2020-10-07 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-24240:
-


> Implement missing features in UDTFStatsRule
> ---
>
> Key: HIVE-24240
> URL: https://issues.apache.org/jira/browse/HIVE-24240
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> Add the following steps.
>  * Handle the case in which the num row will be zero
>  * Compute runtime stats in case of a re-execution



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23667) Incorrect output with option hive.auto.convert.join=false

2020-10-07 Thread Zoltan Haindrich (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209579#comment-17209579
 ] 

Zoltan Haindrich commented on HIVE-23667:
-

could you please give a complete example to reproduce the issue?

> Incorrect output with option hive.auto.convert.join=false
> -
>
> Key: HIVE-23667
> URL: https://issues.apache.org/jira/browse/HIVE-23667
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: gaozhan ding
>Priority: Critical
>
> We use Hive 3.1.0 with Tez engine 0.9.1.3.
> I encountered a problem when executing a Hive SQL query. The SQL is as follows:
> {code:java}
> set mapreduce.job.queuename=root.xxx;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.dynamic.partition=true;
> set hive.exec.max.dynamic.partitions.pernode=1;
> set hive.exec.max.dynamic.partitions=1;
> set hive.fileformat.check=false;
> set mapred.reduce.tasks=50;
> set hive.auto.convert.join=true;
> use xxx;
> select count(*) from   230_dim_site  join dw_fact_inverter_detail on  
> dw_fact_inverter_detail.site=230_dim_site.id;{code}
> with the output.
> {code:java}
> +--+ | _c0 | +--+ | 4954736 | +--+
> {code}
> But when the hive.auto.convert.join option is set to false, the output is not 
> as expected.
> The SQL is as follows
> {code:java}
> set mapreduce.job.queuename=root.xxx;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.dynamic.partition=true;
> set hive.exec.max.dynamic.partitions.pernode=1;
> set hive.exec.max.dynamic.partitions=1;
> set hive.fileformat.check=false;  
> set mapred.reduce.tasks=50;
> set hive.auto.convert.join=false; //changed
> use xxx;
> select count(*) from   230_dim_site  join dw_fact_inverter_detail on  
> dw_fact_inverter_detail.site=230_dim_site.id;{code}
> with output:
> {code:java}
> +--+ | _c0 | +--+ | 0 | +--+
> {code}
> Besides, both tables participating in the join are partitioned tables.
> Notably, if the option mapred.reduce.tasks=50 was not set, all of the above SQL 
> produced the expected results.
> We just upgraded Hive from 1.2 to 3.1.0, and we found that these problems 
> occurred only with the old Hive tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496615&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496615
 ]

ASF GitHub Bot logged work on HIVE-24203:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 13:51
Start Date: 07/Oct/20 13:51
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r500976202



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, 
NodeProcessorCtx procCtx,
 }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *        [Lateral View Forward]
+   *            /       \
+   *      [Select]     [Select]
+   *          |           |
+   *          |        [UDTF]
+   *           \        /
+   *        [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes the UDTF.
+   * Then LVJ joins a row from the left branch with the rows from the right branch.
+   * The join has a one-to-many relationship, since a UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule 
implements SemanticNodeProcessor {
+@Override
+public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+  Object... nodeOutputs) throws SemanticException {
+  final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+  final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+  final HiveConf conf = aspCtx.getConf();
+
+  if (!isAllParentsContainStatistics(lop)) {
+return null;
+  }
+
+  final List<Operator<? extends OperatorDesc>> parents = 
lop.getParentOperators();
+  if (parents.size() != 2) {
+LOG.warn("LateralViewJoinOperator should have just two parents but 
actually has "
++ parents.size() + " parents.");
+return null;
+  }
+
+  final Statistics selectStats = 
parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+  final Statistics udtfStats = 
parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+  final double factor = (double) udtfStats.getNumRows() / (double) 
selectStats.getNumRows();

Review comment:
   I know `selectStats.getNumRows()` should not be zero - but just in 
case... could you also add the resulting logic to `StatsUtils` or something 
like that? 
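
A minimal sketch of the kind of guard being requested here (hypothetical helper name, not the actual patch):

{code:java}
// Hypothetical StatsUtils-style helper: guard the division so a zero row
// count on the Select branch cannot produce an Infinity/NaN factor.
public static double safeRowRatio(long numeratorRows, long denominatorRows) {
  if (denominatorRows <= 0) {
    return 1.0d; // neutral factor when stats are missing or zero
  }
  return (double) numeratorRows / (double) denominatorRows;
}
{code}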

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, 
NodeProcessorCtx procCtx,
 }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *        [Lateral View Forward]
+   *            /       \
+   *      [Select]     [Select]
+   *          |           |
+   *          |        [UDTF]
+   *           \        /
+   *        [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes the UDTF.
+   * Then LVJ joins a row from the left branch with the rows from the right branch.
+   * The join has a one-to-many relationship, since a UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule 
implements SemanticNodeProcessor {
+@Override
+public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+  Object... nodeOutputs) throws SemanticException {
+  final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+  final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+  final HiveConf conf = aspCtx.getConf();
+
+  if (!isAllParentsContainStatistics(lop)) {
+return null;
+  }
+
+  final List<Operator<? extends OperatorDesc>> parents = 
lop.getParentOperators();
+  if (parents.size() != 2) {
+LOG.warn("LateralViewJoinOperator should have just two parents but 
actually has "
++ parents.size() + " parents.");
+return null;
+  }
+
+  final Statistics selectStats = 
parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+  final Statistics udtfStats = 
parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+  final double factor = (double) udtfStats.getNumRows() / (double) 
selectStats.getNumRows();
+  final long selectDataSize = 
StatsUtils.safeMult(selectStats.getDataSize(), factor);
+  final 

[jira] [Work logged] (HIVE-24229) DirectSql fails in case of OracleDB

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24229?focusedWorklogId=496559&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496559
 ]

ASF GitHub Bot logged work on HIVE-24229:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 12:47
Start Date: 07/Oct/20 12:47
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on pull request #1552:
URL: https://github.com/apache/hive/pull/1552#issuecomment-704911416


   Yes, this surfaced in an internal test when run on Oracle DB. The table 
had a partition of type int, and I tried to access it from Spark, using an 
extension.
   Something like:
   `sql("select * from store_sales where ss_store_sk=10").show` 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496559)
Time Spent: 0.5h  (was: 20m)

> DirectSql fails in case of OracleDB
> ---
>
> Key: HIVE-24229
> URL: https://issues.apache.org/jira/browse/HIVE-24229
> Project: Hive
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Direct SQL fails due to different data type mappings in case of Oracle DB



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24229) DirectSql fails in case of OracleDB

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24229?focusedWorklogId=496542&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496542
 ]

ASF GitHub Bot logged work on HIVE-24229:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 12:22
Start Date: 07/Oct/20 12:22
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on pull request #1552:
URL: https://github.com/apache/hive/pull/1552#issuecomment-704898236


   this "clob" stuff keeps coming back again-and-again... do you have a way to 
reproduce the issue?
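
(For context: with Oracle, string-typed metastore columns can come back from direct SQL as java.sql.Clob objects rather than String. A hypothetical sketch of the kind of normalization involved, not the actual patch:)

{code:java}
import java.sql.Clob;
import java.sql.SQLException;

// Hypothetical sketch: coerce a direct-SQL result cell to String, handling
// Oracle's CLOB mapping for string columns.
static String toStringValue(Object cell) throws SQLException {
  if (cell == null) {
    return null;
  }
  if (cell instanceof Clob) {
    Clob clob = (Clob) cell;
    // CLOB character positions are 1-based.
    return clob.getSubString(1, (int) clob.length());
  }
  return cell.toString();
}
{code}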



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496542)
Time Spent: 20m  (was: 10m)

> DirectSql fails in case of OracleDB
> ---
>
> Key: HIVE-24229
> URL: https://issues.apache.org/jira/browse/HIVE-24229
> Project: Hive
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Direct SQL fails due to different data type mappings in case of Oracle DB



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23800) Add hooks when HiveServer2 stops due to OutOfMemoryError

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23800?focusedWorklogId=496538&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496538
 ]

ASF GitHub Bot logged work on HIVE-23800:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 12:14
Start Date: 07/Oct/20 12:14
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1205:
URL: https://github.com/apache/hive/pull/1205#discussion_r500961287



##
File path: ql/src/java/org/apache/hadoop/hive/ql/HookRunner.java
##
@@ -39,57 +36,27 @@
 import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHook;
 import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
 import org.apache.hadoop.hive.ql.session.SessionState;
-import org.apache.hadoop.hive.ql.session.SessionState.LogHelper;
 import org.apache.hive.common.util.HiveStringUtils;
 
+import static org.apache.hadoop.hive.ql.hooks.HookContext.HookType.*;
+
 /**
  * Handles hook executions for {@link Driver}.
  */
 public class HookRunner {
 
   private static final String CLASS_NAME = Driver.class.getName();
   private final HiveConf conf;
-  private LogHelper console;
-  private List<QueryLifeTimeHook> queryHooks = new ArrayList<>();
-  private List<HiveSemanticAnalyzerHook> saHooks = new ArrayList<>();
-  private List<HiveDriverRunHook> driverRunHooks = new ArrayList<>();
-  private List<ExecuteWithHookContext> preExecHooks = new ArrayList<>();
-  private List<ExecuteWithHookContext> postExecHooks = new ArrayList<>();
-  private List<ExecuteWithHookContext> onFailureHooks = new ArrayList<>();
-  private boolean initialized = false;
+  private final HooksLoader loader;

Review comment:
   this is great!
   since from now on we can also dynamically add new hooks to it at runtime, 
we may want to rename it from "Loader" to something else.

##
File path: ql/src/java/org/apache/hadoop/hive/ql/hooks/HookContext.java
##
@@ -45,7 +47,50 @@
 public class HookContext {
 
   static public enum HookType {
-PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK
+

Review comment:
   I like this approach - could you make a small check:
   
   * if we have a hook compiled for the old API (which uses, say, the enum key 
`HookType.PRE_EXEC_HOOK`)
   * will it work or not (without recompilation) with the new implementation?
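
A rough sketch of the enum-driven HookType design under discussion (the config keys are the real HiveConf hook lists, but the enum shape here is assumed for illustration, not the actual patch):

{code:java}
// Hypothetical sketch: each HookType carries the config key that lists its
// hook classes and the interface those classes must implement, so the
// runner can load and invoke any hook kind generically.
public enum HookType {
  PRE_EXEC_HOOK("hive.exec.pre.hooks", ExecuteWithHookContext.class),
  POST_EXEC_HOOK("hive.exec.post.hooks", ExecuteWithHookContext.class),
  ON_FAILURE_HOOK("hive.exec.failure.hooks", ExecuteWithHookContext.class);

  private final String confKey;          // lists hook class names
  private final Class<?> hookInterface;  // interface the hooks implement

  HookType(String confKey, Class<?> hookInterface) {
    this.confKey = confKey;
    this.hookInterface = hookInterface;
  }

  public String getConfKey() { return confKey; }
  public Class<?> getHookInterface() { return hookInterface; }
}
{code}

On the compatibility question: enum constants are resolved by name at link time, so a hook compiled against the old `HookType.PRE_EXEC_HOOK` should still resolve as long as the constant names are kept.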





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496538)
Time Spent: 5h 10m  (was: 5h)

> Add hooks when HiveServer2 stops due to OutOfMemoryError
> 
>
> Key: HIVE-23800
> URL: https://issues.apache.org/jira/browse/HIVE-23800
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Zhihua Deng
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> Make the OOM hook an interface of HiveServer2, so users can implement the hook to 
> do something before HS2 stops, such as dumping the heap or alerting the 
> devops team.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23851?focusedWorklogId=496526&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496526
 ]

ASF GitHub Bot logged work on HIVE-23851:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 12:03
Start Date: 07/Oct/20 12:03
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk edited a comment on pull request #1271:
URL: https://github.com/apache/hive/pull/1271#issuecomment-704887635


   first of all; sorry for being very slow to respond - there were a bunch of 
things (renovation things :D) ... things look better now, so I'll be more likely 
to respond in a reasonable timeframe :)
   
   I now wonder what's the benefit of this kryo stuff... I think there is no 
client in the world which could really use that correctly - I think this even 
binds our metastore/hive versions together - since it uses some internal ql 
classes inside the kryo byte array.
   
   what do you think about the following - would it be possible to:
   * remove (or at least deprecate) the `byte[]` kryo stuff from the thrift api
   * replace it with the string-based approach...
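
A minimal sketch of the transition the two bullets describe (the parser entry point here is an assumed placeholder, not a real API):

{code:java}
// Hypothetical sketch: prefer a string filter, keep the Kryo byte[] only as
// a deprecated fallback for old clients that still send serialized ql trees.
static ExprNodeGenericFuncDesc toFilter(String filterExpr, byte[] kryoExpr)
    throws MetaException {
  if (filterExpr != null) {
    // String-based path: client-agnostic, no internal ql classes on the wire.
    return parsePartitionFilter(filterExpr); // assumed parser entry point
  }
  if (kryoExpr != null) {
    // Deprecated path: only works when client and metastore share ql classes.
    return SerializationUtilities.deserializeExpressionFromKryo(kryoExpr);
  }
  throw new MetaException("No partition filter supplied");
}
{code}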



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496526)
Time Spent: 4h 50m  (was: 4h 40m)

> MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
> 
>
> Key: HIVE-23851
> URL: https://issues.apache.org/jira/browse/HIVE-23851
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> *Steps to reproduce:*
> # Create external table
> # Run msck command to sync all the partitions with metastore
> # Remove one of the partition path
> # Run msck repair with partition filtering
> *Stack Trace:*
> {code:java}
>  2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] 
> ppr.PartitionExpressionForMetastore: Failed to deserialize the expression
>  java.lang.IndexOutOfBoundsException: Index: 110, Size: 0
>  at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192]
>  at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT]
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192]
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_192]
> {code}
> *Cause:*
> In case of msck repair with partition filtering we expect expression proxy 

[jira] [Work logged] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23851?focusedWorklogId=496523&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496523
 ]

ASF GitHub Bot logged work on HIVE-23851:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 12:01
Start Date: 07/Oct/20 12:01
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on pull request #1271:
URL: https://github.com/apache/hive/pull/1271#issuecomment-704887635


   I now wonder what's the benefit of this kryo stuff... I think there is no 
client in the world which could really use that correctly - I think this even 
binds our metastore/hive versions together - since it uses some internal ql 
classes inside the kryo byte array.
   
   what do you think about the following - would it be possible to:
   * remove (or at least deprecate) the `byte[]` kryo stuff from the thrift api
   * replace it with the string-based approach...



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496523)
Time Spent: 4h 40m  (was: 4.5h)

> MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
> 
>
> Key: HIVE-23851
> URL: https://issues.apache.org/jira/browse/HIVE-23851
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> *Steps to reproduce:*
> # Create external table
> # Run msck command to sync all the partitions with metastore
> # Remove one of the partition path
> # Run msck repair with partition filtering
> *Stack Trace:*
> {code:java}
>  2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] 
> ppr.PartitionExpressionForMetastore: Failed to deserialize the expression
>  java.lang.IndexOutOfBoundsException: Index: 110, Size: 0
>  at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192]
>  at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT]
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192]
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_192]
> {code}
> *Cause:*
> In case of msck repair with partition filtering we expect expression proxy 
> class to be set as PartitionExpressionForMetastore ( 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java#L78
>  ), While dropping partition we 

[jira] [Work logged] (HIVE-24225) FIX S3A recordReader policy selection

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24225?focusedWorklogId=496489&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496489
 ]

ASF GitHub Bot logged work on HIVE-24225:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 10:54
Start Date: 07/Oct/20 10:54
Worklog Time Spent: 10m 
  Work Description: pgaref edited a comment on pull request #1547:
URL: https://github.com/apache/hive/pull/1547#issuecomment-704856290


   Hey @steveloughran  --- the approach of the above patch was a bit off; one 
problem was that the FS objects were lazily initialized and could end up 
throwing exceptions when setting the option eagerly.
   The most important issue was that the LLAP IO creates its own FS object (and 
the above were only used for output), so the option itself was not properly 
propagated.
   
   A solution for all this could be the S3A **openFileWithOptions** call, which 
adds file options to the openFile call instead of setting them on the FS (it 
still needs to add support for **fadvise** though)
   
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L4828
   
   Talking about this TODO:
   
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L1136
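
For reference, a small sketch of per-open policy selection with the FileSystem.openFile() builder (available in recent Hadoop releases; the fadvise handling here is illustrative, since as noted above openFile does not support it yet):

{code:java}
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: request a read policy on the open call itself instead of mutating
// the (possibly shared, lazily initialized) FileSystem object.
static FSDataInputStream openWithPolicy(FileSystem fs, Path path, String policy)
    throws Exception {
  return fs.openFile(path)
      .opt("fs.s3a.experimental.input.fadvise", policy) // e.g. "random" for ORC
      .build()   // returns a CompletableFuture<FSDataInputStream>
      .get();
}
{code}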



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496489)
Time Spent: 1h  (was: 50m)

> FIX S3A recordReader policy selection
> -
>
> Key: HIVE-24225
> URL: https://issues.apache.org/jira/browse/HIVE-24225
> Project: Hive
>  Issue Type: Bug
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Dynamic S3A recordReader policy selection can cause issues on lazily 
> initialized FS objects



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24225) FIX S3A recordReader policy selection

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24225?focusedWorklogId=496487&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496487
 ]

ASF GitHub Bot logged work on HIVE-24225:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 10:53
Start Date: 07/Oct/20 10:53
Worklog Time Spent: 10m 
  Work Description: pgaref edited a comment on pull request #1547:
URL: https://github.com/apache/hive/pull/1547#issuecomment-704856290


   Hey @steveloughran  --- the approach of the above patch was a bit off; one 
problem was that the FS objects were lazily initialized and could end up 
throwing exceptions when setting the option eagerly.
   The most important issue was that the LLAP IO creates its own FS object (and 
the above were only used for output), so the option itself was not properly 
propagated.
   
   A solution for all this could be the S3A **openFileWithOptions** call, which 
adds file options to the openFile call instead of setting them on the FS (it 
still needs to add support for **fadvise** though)
   
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L4828



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496487)
Time Spent: 50m  (was: 40m)

> FIX S3A recordReader policy selection
> -
>
> Key: HIVE-24225
> URL: https://issues.apache.org/jira/browse/HIVE-24225
> Project: Hive
>  Issue Type: Bug
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Dynamic S3A recordReader policy selection can cause issues on lazily 
> initialized FS objects



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24225) FIX S3A recordReader policy selection

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24225?focusedWorklogId=496486&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496486
 ]

ASF GitHub Bot logged work on HIVE-24225:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 10:52
Start Date: 07/Oct/20 10:52
Worklog Time Spent: 10m 
  Work Description: pgaref commented on pull request #1547:
URL: https://github.com/apache/hive/pull/1547#issuecomment-704856290


   Hey @steveloughran  --- the approach of the above patch was a bit off; one 
problem was that the FS objects were lazily initialized and could end up 
throwing exceptions when setting the option eagerly.
   The most important issue was that the LLAP IO creates its own FS object (and 
the above were only used for output).
   
   A solution for all this could be the S3A **openFileWithOptions** call, which 
adds file options to the openFile call instead of setting them on the FS (it 
still needs to add support for **fadvise** though)
   
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L4828



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496486)
Time Spent: 40m  (was: 0.5h)

> FIX S3A recordReader policy selection
> -
>
> Key: HIVE-24225
> URL: https://issues.apache.org/jira/browse/HIVE-24225
> Project: Hive
>  Issue Type: Bug
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Dynamic S3A recordReader policy selection can cause issues on lazily 
> initialized FS objects



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24199) Incorrect result when subquery in exists contains limit

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24199?focusedWorklogId=496476&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496476
 ]

ASF GitHub Bot logged work on HIVE-24199:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 10:39
Start Date: 07/Oct/20 10:39
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #1525:
URL: https://github.com/apache/hive/pull/1525#discussion_r500910826



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSubQueryRemoveRule.java
##
@@ -406,6 +409,16 @@ private RexNode rewriteInExists(RexSubQuery e, 
> Set<CorrelationId> variablesSet,
 offset = offset + 1;
 builder.push(e.rel);
   }
+} else if (e.getKind() == SqlKind.EXISTS && !variablesSet.isEmpty()) {
+  // Query has 'exists' and correlation:

Review comment:
   Added comment





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496476)
Time Spent: 40m  (was: 0.5h)

> Incorrect result when subquery in exists contains limit
> --
>
> Key: HIVE-24199
> URL: https://issues.apache.org/jira/browse/HIVE-24199
> Project: Hive
>  Issue Type: Bug
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {code:java}
> create table web_sales (ws_order_number int, ws_warehouse_sk int) stored as 
> orc;
> insert into web_sales values
> (1, 1),
> (1, 2),
> (2, 1),
> (2, 2);
> select * from web_sales ws1
> where exists (select 1 from web_sales ws2 where ws1.ws_order_number = 
> ws2.ws_order_number limit 1);
> 1 1
> 1 2
> {code}
> {code:java}
> CBO PLAN:
> HiveSemiJoin(condition=[=($0, $2)], joinType=[semi])
>   HiveProject(ws_order_number=[$0], ws_warehouse_sk=[$1])
> HiveFilter(condition=[IS NOT NULL($0)])
>   HiveTableScan(table=[[default, web_sales]], table:alias=[ws1])
>   HiveProject(ws_order_number=[$0])
> HiveSortLimit(fetch=[1])  <-- This shouldn't be added
>   HiveProject(ws_order_number=[$0])
> HiveFilter(condition=[IS NOT NULL($0)])
>   HiveTableScan(table=[[default, web_sales]], table:alias=[ws2])
> {code}
> Limit n on the right side of the join reduces the result set coming from the 
> right side to only n records, hence not all ws_order_number values are included, 
> which leads to a correctness issue.
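
A sketch of the guard discussed in the linked review (shape inferred from the quoted diff; simplified, not the actual patch). For a correlated EXISTS, a fetch-only Sort can be skipped, since EXISTS only tests whether any matching row exists per correlation key:

{code:java}
// Hypothetical sketch: when rewriting a correlated EXISTS subquery, ignore a
// Sort that only carries fetch/offset (a bare LIMIT): applying that LIMIT
// globally before the semijoin would drop rows for some correlation keys.
if (e.getKind() == SqlKind.EXISTS && !variablesSet.isEmpty()
    && e.rel instanceof Sort
    && ((Sort) e.rel).getCollation().getFieldCollations().isEmpty()) {
  builder.push(((Sort) e.rel).getInput()); // push the LIMIT's input instead
} else {
  builder.push(e.rel);
}
{code}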



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24225) FIX S3A recordReader policy selection

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24225?focusedWorklogId=496472&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496472
 ]

ASF GitHub Bot logged work on HIVE-24225:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 10:33
Start Date: 07/Oct/20 10:33
Worklog Time Spent: 10m 
  Work Description: steveloughran commented on pull request #1547:
URL: https://github.com/apache/hive/pull/1547#issuecomment-704847820


   why the revert?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496472)
Time Spent: 0.5h  (was: 20m)

> FIX S3A recordReader policy selection
> -
>
> Key: HIVE-24225
> URL: https://issues.apache.org/jira/browse/HIVE-24225
> Project: Hive
>  Issue Type: Bug
>Reporter: Panagiotis Garefalakis
>Assignee: Panagiotis Garefalakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Dynamic S3A recordReader policy selection can cause issues on lazy 
> initialized FS objects



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24234) Improve checkHashModeEfficiency in VectorGroupByOperator

2020-10-07 Thread Rajesh Balamohan (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209327#comment-17209327
 ] 

Rajesh Balamohan commented on HIVE-24234:
-

Thanks [~mustafaiman]. 

>> (outputRecords) / (inputRecords * 1.0f) can be larger than 1 when grouping 
>> sets are present. 

No, it is the other way around. {{sumBatchSize}} already includes the computation 
needed for grouping sets. So in the worst possible case, the max ratio would be 
"1.0". Since "1.0 > 1.0" would be false, the config still holds good (i.e. 
setting 1.0 would never move to streaming mode).

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java#L206
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java#L494


The basic idea is to ensure that hashing with grouping sets is super 
effective (otherwise we end up paying the penalty of JVM memory pressure); 
if not, it needs to bail out quickly and move to streaming mode. 
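
A condensed sketch of the check described above (variable names follow the comment; the threshold semantics are illustrative):

{code:java}
// Hypothetical sketch: decide whether to abandon hash aggregation.
// sumBatchSize counts processed rows *including* the grouping-set expansion,
// so outputRecords / sumBatchSize is at most 1.0 in the worst case.
static boolean shouldSwitchToStreaming(long sumBatchSize, long outputRecords,
    float maxRatioThreshold /* 1.0f effectively disables the switch */) {
  if (sumBatchSize == 0) {
    return false; // nothing processed yet
  }
  float ratio = outputRecords / (sumBatchSize * 1.0f);
  return ratio > maxRatioThreshold; // poor reduction -> move to streaming
}
{code}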

> Improve checkHashModeEfficiency in VectorGroupByOperator
> 
>
> Key: HIVE-24234
> URL: https://issues.apache.org/jira/browse/HIVE-24234
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-24234.wip.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, {{VectorGroupByOperator::checkHashModeEfficiency}} compares the 
> number of entries with the number of input records that have been processed. For 
> grouping sets, it accounts for the grouping set length as well.
> The issue is that the condition becomes invalid after processing a large number of 
> input records. This prevents the system from switching over to streaming 
> mode. 
> e.g. Assume 500,000 input records processed, with 9 grouping sets, with 
> 100,000 entries in the hashtable. The hashtable would never cross 4,500,000 entries, 
> as the max size itself is 1M by default. 
> It would be good to compare the input records (adjusted for grouping sets) 
> with the number of output records (along with the size of the hashtable) to 
> determine hashing or streaming mode.
> E.g. Q67.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24234) Improve checkHashModeEfficiency in VectorGroupByOperator

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-24234:
--
Labels: pull-request-available  (was: )

> Improve checkHashModeEfficiency in VectorGroupByOperator
> 
>
> Key: HIVE-24234
> URL: https://issues.apache.org/jira/browse/HIVE-24234
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-24234.wip.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, {{VectorGroupByOperator::checkHashModeEfficiency}} compares the 
> number of entries with the number of input records that have been processed. For 
> grouping sets, it accounts for the grouping set length as well.
> The issue is that the condition becomes invalid after processing a large number of 
> input records. This prevents the system from switching over to streaming 
> mode. 
> e.g. Assume 500,000 input records processed, with 9 grouping sets, with 
> 100,000 entries in the hashtable. The hashtable would never cross 4,500,000 entries, 
> as the max size itself is 1M by default. 
> It would be good to compare the input records (adjusted for grouping sets) 
> with the number of output records (along with the size of the hashtable) to 
> determine hashing or streaming mode.
> E.g. Q67.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24234) Improve checkHashModeEfficiency in VectorGroupByOperator

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24234?focusedWorklogId=496334&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496334
 ]

ASF GitHub Bot logged work on HIVE-24234:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 06:39
Start Date: 07/Oct/20 06:39
Worklog Time Spent: 10m 
  Work Description: rbalamohan opened a new pull request #1560:
URL: https://github.com/apache/hive/pull/1560


   https://issues.apache.org/jira/browse/HIVE-24234
   
   Queries with grouping sets process input records multiple times and 
significantly increase the number of hash aggregation lookup operations. When 
there is no significant reduction from aggregation, this becomes memory 
intensive and adds to JVM memory pressure. 
   
   Earlier, due to a minor bug, it wasn't switching over to streaming mode. This 
has been fixed in the current patch, which also takes care of the situation when 
grouping sets are not very effective in reduction. 
   
   Tried out Q67 of TPC-DS on an internal cluster, which shows a significant 
improvement with this patch.
   
   For standalone tests, TestVectorGroupByOperator covers this.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496334)
Remaining Estimate: 0h
Time Spent: 10m

> Improve checkHashModeEfficiency in VectorGroupByOperator
> 
>
> Key: HIVE-24234
> URL: https://issues.apache.org/jira/browse/HIVE-24234
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Major
> Attachments: HIVE-24234.wip.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, {{VectorGroupByOperator::checkHashModeEfficiency}} compares the 
> number of entries with the number of input records that have been processed. For 
> grouping sets, it accounts for the grouping set length as well.
> The issue is that the condition becomes invalid after processing a large number of 
> input records. This prevents the system from switching over to streaming 
> mode. 
> e.g. Assume 500,000 input records processed, with 9 grouping sets, with 
> 100,000 entries in the hashtable. The hashtable would never cross 4,500,000 entries, 
> as the max size itself is 1M by default. 
> It would be good to compare the input records (adjusted for grouping sets) 
> with the number of output records (along with the size of the hashtable) to 
> determine hashing or streaming mode.
> E.g. Q67.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24238) ClassCastException in vectorized order-by query over avro table with uniontype column

2020-10-07 Thread Gabriel C Balan (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel C Balan updated HIVE-24238:
---
Component/s: Vectorization

> ClassCastException in vectorized order-by query over avro table with 
> uniontype column
> -
>
> Key: HIVE-24238
> URL: https://issues.apache.org/jira/browse/HIVE-24238
> Project: Hive
>  Issue Type: Bug
>  Components: Avro, Vectorization
>Affects Versions: 3.1.0, 3.1.2
>Reporter: Gabriel C Balan
>Priority: Minor
>
> {noformat:title=Reproducer}
> create table avro_reproducer (key int, union_col uniontype<int,string>) 
> stored as avro location '/tmp/avro_reproducer';
> INSERT INTO TABLE avro_reproducer values (0, create_union(0, 123, 'not me')), 
>  (1, create_union(1, -1, 'me, me, me!'));
> --these queries are ok:
> select count(*) from avro_reproducer;  
> select * from avro_reproducer;  
> --these queries are not ok
> select * from avro_reproducer order by union_col; 
> select * from avro_reproducer sort by key; 
> select * from avro_reproducer order by 'does not have to be a column, 
> really'; 
> {noformat}
> I have verified this reproducer on CDH703, HDP301.
>  It seems the issue is restricted to AVRO; this reproducer does not trigger 
> failures against textfile tables, orc tables, and parquet tables.
> Also, the issue is restricted to vectorized execution; it goes away if I set 
> hive.vectorized.execution.enabled=false
> {noformat:title=Error message in CLI}
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> Caused by: java.lang.RuntimeException: Error processing row: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
> processing row 
> at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:155)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
> at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1315)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row 
> at 
> org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:970)
> at 
> 

[jira] [Work logged] (HIVE-24082) Expose information whether AcidUtils.ParsedDelta contains statementId

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24082?focusedWorklogId=496321&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496321
 ]

ASF GitHub Bot logged work on HIVE-24082:
-

Author: ASF GitHub Bot
Created on: 07/Oct/20 06:10
Start Date: 07/Oct/20 06:10
Worklog Time Spent: 10m 
  Work Description: harmandeeps commented on a change in pull request #1438:
URL: https://github.com/apache/hive/pull/1438#discussion_r500759183



##
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
##
@@ -1031,8 +1031,12 @@ public Path getPath() {
   return path;
 }
 
+public boolean hasStatementId() {

Review comment:
   yeah, we may need this information outside of Hive to figure out 
whether a statementId is present for the delta.
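
A minimal sketch of what such a getter could look like, assuming the usual -1 sentinel for a missing statementId in ParsedDelta (illustrative; see the patch for the real field handling):

{code:java}
// Hypothetical sketch: expose presence instead of collapsing the sentinel.
// The existing getter maps the "absent" sentinel to 0, which hides whether a
// statementId was actually parsed from the delta directory name.
public boolean hasStatementId() {
  return statementId != -1; // assumes -1 marks "no statementId in the path"
}
{code}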





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 496321)
Time Spent: 2h 20m  (was: 2h 10m)

> Expose information whether AcidUtils.ParsedDelta contains statementId
> -
>
> Key: HIVE-24082
> URL: https://issues.apache.org/jira/browse/HIVE-24082
> Project: Hive
>  Issue Type: Improvement
>Reporter: Piotr Findeisen
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> In [Presto|https://prestosql.io] we support reading ORC ACID tables by 
> leveraging AcidUtils rather than duplicating the file name parsing logic in our 
> code.
> To do this fully correctly, we need to know whether 
> {{org.apache.hadoop.hive.ql.io.AcidUtils.ParsedDelta}} contains 
> {{statementId}} information or not. 
> Currently, the getter of that property does not allow us to access this 
> information.
> [https://github.com/apache/hive/blob/468907eab36f78df3e14a24005153c9a23d62555/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L804-L806]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)