[jira] [Work logged] (HIVE-23953) Use task counter information to compute keycount during hashtable loading

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23953?focusedWorklogId=466033=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466033
 ]

ASF GitHub Bot logged work on HIVE-23953:
-

Author: ASF GitHub Bot
Created on: 04/Aug/20 05:56
Start Date: 04/Aug/20 05:56
Worklog Time Spent: 10m 
  Work Description: rbalamohan commented on a change in pull request #1340:
URL: https://github.com/apache/hive/pull/1340#discussion_r464815905



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java
##
@@ -125,13 +126,24 @@ public void load(MapJoinTableContainer[] mapJoinTables,
 KeyValueReader kvReader = (KeyValueReader) input.getReader();
 
 Long keyCountObj = parentKeyCounts.get(pos);
-long keyCount = (keyCountObj == null) ? -1 : keyCountObj.longValue();
+long estKeyCount = (keyCountObj == null) ? -1 : keyCountObj;
+
+long inputRecords = -1;
+try {
+  inputRecords = ((AbstractLogicalInput) input).getContext().getCounters().

Review comment:
   Can you add a TODO or a follow-up ticket to replace this string with the 
actual TaskCounter enum from Tez (in the next Tez release)?

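   For illustration, a hedged sketch of what that follow-up might look like once a Tez release exposes the counter on the TaskCounter enum. The enum constant name comes from the discussion (see HIVE-23981 below); whether TezCounters offers a single-argument findCounter(Enum) overload is an assumption, so this is a sketch, not the code in this PR:

```java
// Hypothetical follow-up, not the code in this PR: look the counter up via the
// TaskCounter enum (assumes a Tez release that adds APPROXIMATE_INPUT_RECORDS
// to org.apache.tez.common.counters.TaskCounter and an Enum-based findCounter).
long inputRecords = -1;
try {
  inputRecords = ((AbstractLogicalInput) input).getContext().getCounters()
      .findCounter(org.apache.tez.common.counters.TaskCounter.APPROXIMATE_INPUT_RECORDS)
      .getValue();
} catch (Exception e) {
  LOG.debug("Failed to get value for counter APPROXIMATE_INPUT_RECORDS", e);
}
long keyCount = Math.max(estKeyCount, inputRecords);
```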
##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java
##
@@ -125,13 +126,24 @@ public void load(MapJoinTableContainer[] mapJoinTables,
 KeyValueReader kvReader = (KeyValueReader) input.getReader();
 
 Long keyCountObj = parentKeyCounts.get(pos);
-long keyCount = (keyCountObj == null) ? -1 : keyCountObj.longValue();
+long estKeyCount = (keyCountObj == null) ? -1 : keyCountObj;
+
+long inputRecords = -1;
+try {
+  inputRecords = ((AbstractLogicalInput) input).getContext().getCounters().
+  findCounter("org.apache.tez.common.counters.TaskCounter",
+  "APPROXIMATE_INPUT_RECORDS").getValue();
+} catch (Exception e) {
+  LOG.debug("Failed to get value for counter APPROXIMATE_INPUT_RECORDS", e);
+}
+long keyCount = Math.max(estKeyCount, inputRecords);
 
 VectorMapJoinFastTableContainer vectorMapJoinFastTableContainer =
 new VectorMapJoinFastTableContainer(desc, hconf, keyCount);
 
-LOG.info("Loading hash table for input: {} cacheKey: {} tableContainer: {} smallTablePos: {}", inputName,
-  cacheKey, vectorMapJoinFastTableContainer.getClass().getSimpleName(), pos);
+LOG.info("Loading hash table for input: {} cacheKey: {} tableContainer: {} smallTablePos: {} " +

Review comment:
   Can you add "delta" (line 171) in the log as well, to have details on 
the hash table load time?

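   For illustration, a hedged sketch of the kind of log statement being asked for. The variable `delta` (the hash table load time referenced as "line 171", which is not shown in this hunk) and the exact message layout are assumptions:

```java
// Hypothetical: 'delta' is assumed to hold the elapsed hash table load time in ms.
LOG.info("Loaded hash table for input: {} cacheKey: {} tableContainer: {} smallTablePos: {} "
    + "estKeyCount: {} keyCount: {} load time (ms): {}", inputName, cacheKey,
    vectorMapJoinFastTableContainer.getClass().getSimpleName(), pos, estKeyCount, keyCount, delta);
```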




This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 466033)
Time Spent: 20m  (was: 10m)

> Use task counter information to compute keycount during hashtable loading
> -
>
> Key: HIVE-23953
> URL: https://issues.apache.org/jira/browse/HIVE-23953
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There are cases when the compiler misestimates the key count, which results in a 
> number of hashtable resizes at runtime.
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java#L128]
> In such cases, it would be good to get the "approximate_input_records" (TEZ-4207) 
> counter from upstream to compute the key count more accurately at runtime.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-21196) Support semijoin reduction on multiple column join

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-21196?focusedWorklogId=466026=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466026
 ]

ASF GitHub Bot logged work on HIVE-21196:
-

Author: ASF GitHub Bot
Created on: 04/Aug/20 04:26
Start Date: 04/Aug/20 04:26
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1325:
URL: https://github.com/apache/hive/pull/1325#discussion_r464764059



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java
##
@@ -53,6 +53,34 @@
 
   private static final Logger LOG = 
LoggerFactory.getLogger(OperatorUtils.class);
 
+  /**
+   * Return the ancestor of the specified operator at the provided path or 
null if the path is invalid.
+   *
+   * The method is equivalent to following code:
+   * {@code
+   * op.getParentOperators().get(path[0])
+   * .getParentOperators().get(path[1])
+   * ...
+   * .getParentOperators().get(path[n])
+   * }
+   * with additional checks about the validity of the provided path and the 
type of the ancestor.
+   *
+   * @param op the operator for which we want to find the ancestor
+   * @param clazz the class of the ancestor operator
+   * @param path the path leading to the desired ancestor
+   * @param <T> the type of the ancestor
+   * @return the ancestor of the specified operator at the provided path or 
null if the path is invalid.
+   */
+  public static <T> T ancestor(Operator op, Class<T> clazz, int... path) {
+Operator target = op;
+for (int i = 0; i < path.length; i++) {
+  if (target.getParentOperators() == null || path[i] > target.getParentOperators().size())

Review comment:
   nit. We use `{` `}` even for single line statements.
   
   Please check other code changes in this PR below as well, since I have seen 
the same issue.

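   For illustration, a hedged usage sketch of the new helper; the operator variable and the path below are hypothetical and not taken from this PR:

```java
// Hypothetical usage: from filterOp, go to parent 0, then that operator's parent 1,
// and return it only if it is a TableScanOperator; otherwise null is returned.
TableScanOperator ts = OperatorUtils.ancestor(filterOp, TableScanOperator.class, 0, 1);
if (ts == null) {
  // invalid path, or the operator at the end of the path is not a TableScanOperator
}
```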
##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java
##
@@ -53,6 +53,34 @@
 
   private static final Logger LOG = 
LoggerFactory.getLogger(OperatorUtils.class);
 
+  /**
+   * Return the ancestor of the specified operator at the provided path or 
null if the path is invalid.
+   *
+   * The method is equivalent to following code:
+   * {@code
+   * op.getParentOperators().get(path[0])

Review comment:
   Neat! Interesting method... Reminds me of our good old times with XPath 
 

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SemiJoinReductionMerge.java
##
@@ -0,0 +1,399 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer;
+
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.exec.ColumnInfo;
+import org.apache.hadoop.hive.ql.exec.FilterOperator;
+import org.apache.hadoop.hive.ql.exec.GroupByOperator;
+import org.apache.hadoop.hive.ql.exec.Operator;
+import org.apache.hadoop.hive.ql.exec.OperatorFactory;
+import org.apache.hadoop.hive.ql.exec.OperatorUtils;
+import org.apache.hadoop.hive.ql.exec.ReduceSinkOperator;
+import org.apache.hadoop.hive.ql.exec.RowSchema;
+import org.apache.hadoop.hive.ql.exec.SelectOperator;
+import org.apache.hadoop.hive.ql.exec.TableScanOperator;
+import org.apache.hadoop.hive.ql.exec.Utilities;
+import org.apache.hadoop.hive.ql.io.AcidUtils;
+import org.apache.hadoop.hive.ql.parse.GenTezUtils;
+import org.apache.hadoop.hive.ql.parse.ParseContext;
+import org.apache.hadoop.hive.ql.parse.RuntimeValuesInfo;
+import org.apache.hadoop.hive.ql.parse.SemanticAnalyzer;
+import org.apache.hadoop.hive.ql.parse.SemanticException;
+import org.apache.hadoop.hive.ql.parse.SemiJoinBranchInfo;
+import org.apache.hadoop.hive.ql.plan.AggregationDesc;
+import org.apache.hadoop.hive.ql.plan.DynamicValue;
+import org.apache.hadoop.hive.ql.plan.ExprNodeColumnDesc;
+import org.apache.hadoop.hive.ql.plan.ExprNodeConstantDesc;
+import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
+import org.apache.hadoop.hive.ql.plan.ExprNodeDynamicValueDesc;
+import org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc;
+import org.apache.hadoop.hive.ql.plan.FilterDesc;
+import org.apache.hadoop.hive.ql.plan.GroupByDesc;

[jira] [Updated] (HIVE-21196) Support semijoin reduction on multiple column join

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-21196:
--
Labels: pull-request-available  (was: )

> Support semijoin reduction on multiple column join
> --
>
> Key: HIVE-21196
> URL: https://issues.apache.org/jira/browse/HIVE-21196
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, a query involving a join on multiple columns creates separate 
> semijoin edges for each key, which in turn create a bloom filter for each of 
> them, like below:
> EXPLAIN select count(*) from srcpart_date_n7 join srcpart_small_n3 on 
> (srcpart_date_n7.key = srcpart_small_n3.key1 and srcpart_date_n7.value = 
> srcpart_small_n3.value1)
> {code:java}
> Map 1 <- Reducer 5 (BROADCAST_EDGE)
> Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
> Reducer 5 <- Map 4 (CUSTOM_SIMPLE_EDGE)
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: srcpart_date_n7
>   filterExpr: (key is not null and value is not null and (key 
> BETWEEN DynamicValue(RS_7_srcpart_small_n3_key1_min) AND 
> DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, 
> DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter (type: boolean)
>   Statistics: Num rows: 2000 Data size: 356000 Basic stats: 
> COMPLETE Column stats: COMPLETE
>   Filter Operator
> predicate: ((key BETWEEN 
> DynamicValue(RS_7_srcpart_small_n3_key1_min) AND 
> DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, 
> DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter))) and key is not null 
> and value is not null) (type: boolean)
> Statistics: Num rows: 2000 Data size: 356000 Basic stats: 
> COMPLETE Column stats: COMPLETE
> Select Operator
>   expressions: key (type: string), value (type: string)
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 2000 Data size: 356000 Basic 
> stats: COMPLETE Column stats: COMPLETE
>   Reduce Output Operator
> key expressions: _col0 (type: string), _col1 (type: 
> string)
> sort order: ++
> Map-reduce partition columns: _col0 (type: string), 
> _col1 (type: string)
> Statistics: Num rows: 2000 Data size: 356000 Basic 
> stats: COMPLETE Column stats: COMPLETE
> Execution mode: vectorized, llap
> LLAP IO: all inputs
> Map 4 
> Map Operator Tree:
> TableScan
>   alias: srcpart_small_n3
>   filterExpr: (key1 is not null and value1 is not null) 
> (type: boolean)
>   Statistics: Num rows: 20 Data size: 3560 Basic stats: 
> PARTIAL Column stats: PARTIAL
>   Filter Operator
> predicate: (key1 is not null and value1 is not null) 
> (type: boolean)
> Statistics: Num rows: 20 Data size: 3560 Basic stats: 
> PARTIAL Column stats: PARTIAL
> Select Operator
>   expressions: key1 (type: string), value1 (type: string)
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 20 Data size: 3560 Basic stats: 
> PARTIAL Column stats: PARTIAL
>   Reduce Output Operator
> key expressions: _col0 (type: string), _col1 (type: 
> string)
> sort order: ++
> Map-reduce partition columns: _col0 (type: string), 
> _col1 (type: string)
> Statistics: Num rows: 20 Data size: 3560 Basic stats: 
> PARTIAL Column stats: PARTIAL
>   Select Operator
> expressions: _col0 (type: string)
> outputColumnNames: _col0
> Statistics: Num rows: 20 Data size: 3560 Basic stats: 
> PARTIAL Column stats: PARTIAL
> Group By Operator
>   aggregations: min(_col0), max(_col0), 
> bloom_filter(_col0, expectedEntries=20)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Statistics: Num rows: 1 Data size: 730 Basic stats: 
> PARTIAL Column stats: PARTIAL
>   Reduce 

[jira] [Commented] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions

2020-08-03 Thread Syed Shameerur Rahman (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170537#comment-17170537
 ] 

Syed Shameerur Rahman commented on HIVE-23851:
--

[~kgyrtkirk] [~jcamachorodriguez] Ping for review request!

> MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
> 
>
> Key: HIVE-23851
> URL: https://issues.apache.org/jira/browse/HIVE-23851
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> *Steps to reproduce:*
> # Create external table
> # Run msck command to sync all the partitions with metastore
> # Remove one of the partition path
> # Run msck repair with partition filtering
> *Stack Trace:*
> {code:java}
>  2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] 
> ppr.PartitionExpressionForMetastore: Failed to deserialize the expression
>  java.lang.IndexOutOfBoundsException: Index: 110, Size: 0
>  at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192]
>  at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT]
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192]
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_192]
> {code}
> *Cause:*
> In case of msck repair with partition filtering, we expect the expression proxy 
> class to be set to PartitionExpressionForMetastore ( 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java#L78
>  ). While dropping partitions, we serialize the drop partition filter 
> expression as ( 
> https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L589
>  ), which is incompatible with the deserialization happening in 
> PartitionExpressionForMetastore ( 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionExpressionForMetastore.java#L52
>  ), hence the query fails with "Failed to deserialize the expression".
> *Solutions*:
> I could think of two approaches to this problem:
> # Since PartitionExpressionForMetastore is required only during the partition 
> pruning step, we can switch the expression proxy class back to 
> MsckPartitionExpressionProxy once the partition pruning step is done.
> # The other solution is to make the serialization of the msck drop partition 
> filter expression compatible with the one expected by 
> PartitionExpressionForMetastore. We can do this via reflection, since the drop 
> partition serialization happens in the Msck class (standalone-metastore); this 
> way we can completely remove the need for 

[jira] [Work logged] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23851?focusedWorklogId=466018=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466018
 ]

ASF GitHub Bot logged work on HIVE-23851:
-

Author: ASF GitHub Bot
Created on: 04/Aug/20 04:08
Start Date: 04/Aug/20 04:08
Worklog Time Spent: 10m 
  Work Description: shameersss1 commented on pull request #1271:
URL: https://github.com/apache/hive/pull/1271#issuecomment-668367912


   @kgyrtkirk @jcamachor Could you please take a look?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 466018)
Time Spent: 1h 50m  (was: 1h 40m)

> MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
> 
>
> Key: HIVE-23851
> URL: https://issues.apache.org/jira/browse/HIVE-23851
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> *Steps to reproduce:*
> # Create external table
> # Run msck command to sync all the partitions with metastore
> # Remove one of the partition path
> # Run msck repair with partition filtering
> *Stack Trace:*
> {code:java}
>  2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] 
> ppr.PartitionExpressionForMetastore: Failed to deserialize the expression
>  java.lang.IndexOutOfBoundsException: Index: 110, Size: 0
>  at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192]
>  at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT]
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192]
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_192]
> {code}
> *Cause:*
> In case of msck repair with partition filtering, we expect the expression proxy 
> class to be set to PartitionExpressionForMetastore ( 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java#L78
>  ). While dropping partitions, we serialize the drop partition filter 
> expression as ( 
> https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L589
>  ), which is incompatible with the deserialization happening in 
> PartitionExpressionForMetastore ( 
> 

[jira] [Work logged] (HIVE-20441) NPE in GenericUDF when hive.allow.udf.load.on.demand is set to true

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-20441?focusedWorklogId=466016=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466016
 ]

ASF GitHub Bot logged work on HIVE-20441:
-

Author: ASF GitHub Bot
Created on: 04/Aug/20 04:01
Start Date: 04/Aug/20 04:01
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 commented on pull request #1242:
URL: https://github.com/apache/hive/pull/1242#issuecomment-668366427


   Hi @pvary, Is there anything I can do to move this forward?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 466016)
Time Spent: 2h 10m  (was: 2h)

> NPE in GenericUDF  when hive.allow.udf.load.on.demand is set to true
> 
>
> Key: HIVE-20441
> URL: https://issues.apache.org/jira/browse/HIVE-20441
> Project: Hive
>  Issue Type: Bug
>  Components: CLI, HiveServer2
>Affects Versions: 1.2.1, 2.3.3
>Reporter: Hui Huang
>Assignee: Zhihua Deng
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-20441.1.patch, HIVE-20441.2.patch, 
> HIVE-20441.3.patch, HIVE-20441.4.patch, HIVE-20441.patch
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> When hive.allow.udf.load.on.demand is set to true and HiveServer2 has been 
> started, a newly created function from another client or HiveServer2 instance 
> will be loaded from the metastore the first time it is used. 
> When such a UDF is used in a WHERE clause, we get an NPE like:
> {code:java}
> Error executing statement:
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED: NullPointerException null
> at 
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
>  ~[hive-service-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHOT]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:206)
>  ~[hive-service-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHOT]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
>  ~[hive-service-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHOT]
> at 
> org.apache.hive.service.cli.operation.Operation.run(Operation.java:320) 
> ~[hive-service-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHOT]
> at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530)
>  ~[hive-service-2.3.4-SNAPSHOT.jar:2.3.4-SNAP
> SHOT]
> at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:517)
>  ~[hive-service-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHO
> T]
> at 
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:310)
>  ~[hive-service-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHOT]
> at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:542)
>  ~[hive-service-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHOT]
> at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
>  ~[hive-exec-2.3.4-SNAPSHOT.jar:2.3.4-SNA
> PSHOT]
> at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
>  ~[hive-exec-2.3.4-SNAPSHOT.jar:2.3.4-SNA
> PSHOT]
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
> ~[hive-exec-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHOT]
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> ~[hive-exec-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHOT]
> at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:57)
>  ~[hive-service-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHOT]
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>  ~[hive-exec-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHOT]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_77]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_77]
> at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:236)
>  ~[hive-exec-2.3.4-SNAPSHOT.jar:2.3.4-SNAPSHOT]
> at 
> 

[jira] [Work logged] (HIVE-23980) Shade guava from existing Hive versions

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23980?focusedWorklogId=466011=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466011
 ]

ASF GitHub Bot logged work on HIVE-23980:
-

Author: ASF GitHub Bot
Created on: 04/Aug/20 03:25
Start Date: 04/Aug/20 03:25
Worklog Time Spent: 10m 
  Work Description: viirya commented on pull request #1356:
URL: https://github.com/apache/hive/pull/1356#issuecomment-668358395


   cc @sunchao 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 466011)
Time Spent: 20m  (was: 10m)

> Shade guava from existing Hive versions
> ---
>
> Key: HIVE-23980
> URL: https://issues.apache.org/jira/browse/HIVE-23980
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7
>Reporter: L. C. Hsieh
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I'm trying to upgrade the Guava version in Spark. The JIRA ticket is SPARK-32502.
> Running the tests hits an error:
> {code}
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: 
> tried to access method 
> com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator;
>  from class org.apache.hadoop.hive.ql.exec.FetchOperator
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.(FetchOperator.java:108)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
> {code}
> I know that hive-exec doesn't shade Guava until HIVE-22126 but that work 
> targets 4.0.0. I'm wondering if there is a solution for current Hive 
> versions, e.g. Hive 2.3.7? Any ideas?
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23980) Shade guava from existing Hive versions

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-23980:
--
Labels: pull-request-available  (was: )

> Shade guava from existing Hive versions
> ---
>
> Key: HIVE-23980
> URL: https://issues.apache.org/jira/browse/HIVE-23980
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7
>Reporter: L. C. Hsieh
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm trying to upgrade the Guava version in Spark. The JIRA ticket is SPARK-32502.
> Running the tests hits an error:
> {code}
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: 
> tried to access method 
> com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator;
>  from class org.apache.hadoop.hive.ql.exec.FetchOperator
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.(FetchOperator.java:108)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
> {code}
> I know that hive-exec doesn't shade Guava until HIVE-22126 but that work 
> targets 4.0.0. I'm wondering if there is a solution for current Hive 
> versions, e.g. Hive 2.3.7? Any ideas?
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23980) Shade guava from existing Hive versions

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23980?focusedWorklogId=466010=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466010
 ]

ASF GitHub Bot logged work on HIVE-23980:
-

Author: ASF GitHub Bot
Created on: 04/Aug/20 03:25
Start Date: 04/Aug/20 03:25
Worklog Time Spent: 10m 
  Work Description: viirya opened a new pull request #1356:
URL: https://github.com/apache/hive/pull/1356


   
   
   ### What changes were proposed in this pull request?
   
   
   This PR proposes to shade Guava from hive-exec in Hive 2.3 branch.
   
   ### Why are the changes needed?
   
   
   While trying to upgrade Guava in Spark, I found the following error. A Guava 
method became package-private in Guava version 20, so there is an 
incompatibility with Guava versions > 19.0.
   
   ```
   sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: 
java.lang.IllegalAccessError: tried to access method 
com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator;
 from class org.apache.hadoop.hive.ql.exec.FetchOperator
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.(FetchOperator.java:108)
at 
org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
   ```
   
   This is a problem for downstream clients. The Hive project noticed this problem 
too in [HIVE-22126](https://issues.apache.org/jira/browse/HIVE-22126); however, 
that work only targets 4.0.0. It would be nice if we could also shade Guava in 
current Hive versions, e.g. the Hive 2.3 line.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes. Guava will be shaded from hive-exec.
   
   ### How was this patch tested?
   
   
   Built Hive locally and checked jar content.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 466010)
Remaining Estimate: 0h
Time Spent: 10m

> Shade guava from existing Hive versions
> ---
>
> Key: HIVE-23980
> URL: https://issues.apache.org/jira/browse/HIVE-23980
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7
>Reporter: L. C. Hsieh
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm trying to upgrade the Guava version in Spark. The JIRA ticket is SPARK-32502.
> Running the tests hits an error:
> {code}
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: 
> tried to access method 
> com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator;
>  from class org.apache.hadoop.hive.ql.exec.FetchOperator
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.(FetchOperator.java:108)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
> {code}
> I know that hive-exec doesn't shade Guava until HIVE-22126 but that work 
> targets 4.0.0. I'm wondering if there is a solution for current Hive 
> versions, e.g. Hive 2.3.7? Any ideas?
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-23981) Use task counter enum to get the approximate counter value

2020-08-03 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-23981:
--

Assignee: mahesh kumar behera

> Use task counter enum to get the approximate counter value
> --
>
> Key: HIVE-23981
> URL: https://issues.apache.org/jira/browse/HIVE-23981
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>
> The value for APPROXIMATE_INPUT_RECORDS should be obtained using the enum 
> name instead of a static string. Once a Tez release that exposes this counter is 
> available, we should change the lookup to 
> org.apache.tez.common.counters.TaskCounter.APPROXIMATE_INPUT_RECORDS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-23953) Use task counter information to compute keycount during hashtable loading

2020-08-03 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-23953:
--

Assignee: mahesh kumar behera

> Use task counter information to compute keycount during hashtable loading
> -
>
> Key: HIVE-23953
> URL: https://issues.apache.org/jira/browse/HIVE-23953
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are cases when the compiler misestimates the key count, which results in a 
> number of hashtable resizes at runtime.
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java#L128]
> In such cases, it would be good to get the "approximate_input_records" (TEZ-4207) 
> counter from upstream to compute the key count more accurately at runtime.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23981) Use task counter enum to get the approximate counter value

2020-08-03 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-23981:
---
Description: The value for APPROXIMATE_INPUT_RECORDS should be obtained 
using the enum name instead of a static string. Once a Tez release that exposes 
this counter is available, we should change the lookup to 
org.apache.tez.common.counters.TaskCounter.APPROXIMATE_INPUT_RECORDS.  (was: 
There are cases when compiler misestimates key count and this results in a 
number of hashtable resizes during runtime.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java#L128]

In such cases, it would be good to get "approximate_input_records" (TEZ-4207) 
counter from upstream to compute the key count more accurately at runtime.

 
)

> Use task counter enum to get the approximate counter value
> --
>
> Key: HIVE-23981
> URL: https://issues.apache.org/jira/browse/HIVE-23981
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>
> The value for APPROXIMATE_INPUT_RECORDS should be obtained using the enum 
> name instead of a static string. Once a Tez release that exposes this counter is 
> available, we should change the lookup to 
> org.apache.tez.common.counters.TaskCounter.APPROXIMATE_INPUT_RECORDS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23959) Provide an option to wipe out column stats for partitioned tables in case of column removal

2020-08-03 Thread Yushi Hayasaka (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170498#comment-17170498
 ] 

Yushi Hayasaka commented on HIVE-23959:
---

Hello, I'm interested in working on this issue since we are having difficulty with 
it.
Just curious, how does it affect performance?
Also, it seems to call `clearColumnStatsState` instead of 
`updateOrGetPartitionColumnStats` for partitions; I think that is where the 
performance improvement comes from.
Is that correct, or does it bring any other improvement too?

> Provide an option to wipe out column stats for partitioned tables in case of 
> column removal
> ---
>
> Key: HIVE-23959
> URL: https://issues.apache.org/jira/browse/HIVE-23959
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In case of column removal / replacement, an update for each partition is 
> necessary, which could take a while.
> The goal here is to provide an option to switch to bulk removal of column 
> statistics instead of working hard to retain as much as possible from the old 
> stats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

2020-08-03 Thread Eugene Chung (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Chung updated HIVE-23954:

Description: 
{code:java}
select count(*), count(distinct mid) from db1.table1 where partitioned_column = 
'...'{code}
 

is not working properly when hive.optimize.countdistinct is true. By default, 
it's true for all 3.x versions.

In the two plans below, the aggregations part in the Output of Group By 
Operator of Map 1 is different.

 

- hive.optimize.countdistinct=false
{code:java}
++
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:-1   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_7]  |
| Group By Operator [GBY_5] (rows=1 width=24) |
|   
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT 
KEY._col0:0._col0)"] |
| <-Map 1 [SIMPLE_EDGE]  |
|   SHUFFLE [RS_4]   |
| Group By Operator [GBY_3] (rows=343640771 width=4160) |
|   
Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT 
mid)"],keys:mid |
|   Select Operator [SEL_2] (rows=343640771 width=4160) |
| Output:["mid"] |
| TableScan [TS_0] (rows=343640771 width=4160) |
|   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
||
++{code}
 

- hive.optimize.countdistinct=true
{code:java}
++
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:-1   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_7]  |
| Group By Operator [GBY_14] (rows=1 width=16) |
|   
Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
|   Group By Operator [GBY_11] (rows=343640771 width=4160) |
| 
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|   <-Map 1 [SIMPLE_EDGE]|
| SHUFFLE [RS_10]|
|   PartitionCols:_col0  |
|   Group By Operator [GBY_9] (rows=343640771 width=4160) |
| Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
| Select Operator [SEL_2] (rows=343640771 width=4160) |
|   Output:["mid"]   |
|   TableScan [TS_0] (rows=343640771 width=4160) |
| db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
||
++
{code}

  was:
{code:java}
select count(*), count(distinct mycol) from db1.table1 where partitioned_column 
= '...'{code}
 

is not working properly when hive.optimize.countdistinct is true. By default, 
it's true for all 3.x versions.

In the two plans below, the aggregations part in the Output of Group By 
Operator of Map 1 are different.

 

- hive.optimize.countdistinct=false
{code:java}
++
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:-1   

[jira] [Updated] (HIVE-23980) Shade guava from existing Hive versions

2020-08-03 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated HIVE-23980:
---
Affects Version/s: 2.3.7

> Shade guava from existing Hive versions
> ---
>
> Key: HIVE-23980
> URL: https://issues.apache.org/jira/browse/HIVE-23980
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7
>Reporter: L. C. Hsieh
>Priority: Major
>
> I'm trying to upgrade the Guava version in Spark. The JIRA ticket is SPARK-32502.
> Running the tests hits an error:
> {code}
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: 
> tried to access method 
> com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator;
>  from class org.apache.hadoop.hive.ql.exec.FetchOperator
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.(FetchOperator.java:108)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
> {code}
> I know that hive-exec doesn't shade Guava until HIVE-22126 but that work 
> targets 4.0.0. I'm wondering if there is a solution for current Hive 
> versions, e.g. Hive 2.3.7? Any ideas?
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23353) Atlas metadata replication scheduling

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23353?focusedWorklogId=465974=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465974
 ]

ASF GitHub Bot logged work on HIVE-23353:
-

Author: ASF GitHub Bot
Created on: 04/Aug/20 00:36
Start Date: 04/Aug/20 00:36
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #1021:
URL: https://github.com/apache/hive/pull/1021


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465974)
Time Spent: 4h 20m  (was: 4h 10m)

> Atlas metadata replication scheduling
> -
>
> Key: HIVE-23353
> URL: https://issues.apache.org/jira/browse/HIVE-23353
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23353.01.patch, HIVE-23353.02.patch, 
> HIVE-23353.03.patch, HIVE-23353.04.patch, HIVE-23353.05.patch, 
> HIVE-23353.06.patch, HIVE-23353.07.patch, HIVE-23353.08.patch, 
> HIVE-23353.08.patch, HIVE-23353.08.patch, HIVE-23353.08.patch, 
> HIVE-23353.09.patch, HIVE-23353.10.patch, HIVE-23353.10.patch
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23951?focusedWorklogId=465969=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465969
 ]

ASF GitHub Bot logged work on HIVE-23951:
-

Author: ASF GitHub Bot
Created on: 04/Aug/20 00:24
Start Date: 04/Aug/20 00:24
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1315:
URL: https://github.com/apache/hive/pull/1315#discussion_r464729855



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java
##
@@ -49,6 +50,8 @@
 import com.google.common.collect.Lists;
 import com.google.common.collect.Multimap;
 
+import static 
org.apache.hadoop.hive.ql.optimizer.physical.AnnotateRunTimeStatsOptimizer.getAllOperatorsForSimpleFetch;

Review comment:
   Yes, I meant `getAllOperatorsForSimpleFetch`, since it seems it is used 
beyond the scope of `AnnotateRunTimeStatsOptimizer`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465969)
Time Spent: 2h 40m  (was: 2.5h)

> Support parameterized queries in WHERE/HAVING clause
> 
>
> Key: HIVE-23951
> URL: https://issues.apache.org/jira/browse/HIVE-23951
> Project: Hive
>  Issue Type: Sub-task
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23951?focusedWorklogId=465963=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465963
 ]

ASF GitHub Bot logged work on HIVE-23951:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:57
Start Date: 03/Aug/20 23:57
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1315:
URL: https://github.com/apache/hive/pull/1315#discussion_r464722308



##
File path: ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java
##
@@ -205,7 +205,9 @@
   DROP_MAPPING("DROP MAPPING", HiveParser.TOK_DROP_MAPPING, null, null, false, 
false),
   CREATE_SCHEDULED_QUERY("CREATE SCHEDULED QUERY", 
HiveParser.TOK_CREATE_SCHEDULED_QUERY, null, null),
   ALTER_SCHEDULED_QUERY("ALTER SCHEDULED QUERY", 
HiveParser.TOK_ALTER_SCHEDULED_QUERY, null, null),
-  DROP_SCHEDULED_QUERY("DROP SCHEDULED QUERY", 
HiveParser.TOK_DROP_SCHEDULED_QUERY, null, null)
+  DROP_SCHEDULED_QUERY("DROP SCHEDULED QUERY", 
HiveParser.TOK_DROP_SCHEDULED_QUERY, null, null),
+  PREPARE("PREPARE QUERY", HiveParser.TOK_PREPARE, null, null),
+  EXECUTE("EXECUTE QUERY", HiveParser.TOK_EXECUTE, null, null)

Review comment:
   IIRC I had to make this change for explain plan to work. Let me 
re-investigate, I will get back.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465963)
Time Spent: 2.5h  (was: 2h 20m)

> Support parameterized queries in WHERE/HAVING clause
> 
>
> Key: HIVE-23951
> URL: https://issues.apache.org/jira/browse/HIVE-23951
> Project: Hive
>  Issue Type: Sub-task
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23951?focusedWorklogId=465962=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465962
 ]

ASF GitHub Bot logged work on HIVE-23951:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:56
Start Date: 03/Aug/20 23:56
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1315:
URL: https://github.com/apache/hive/pull/1315#discussion_r464722096



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##
@@ -283,6 +283,33 @@ public Object process(Node nd, Stack stack, 
NodeProcessorCtx procCtx,
 
   }
 
+  /**
+   * Processor for processing Dynamic expression.
+   */
+  public class DynamicParameterProcessor implements SemanticNodeProcessor {
+
+@Override
+public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx,
+Object... nodeOutputs) throws SemanticException {
+  TypeCheckCtx ctx = (TypeCheckCtx) procCtx;
+  if (ctx.getError() != null) {
+return null;
+  }
+
+  T desc = processGByExpr(nd, procCtx);

Review comment:
   No I believe this is not required. I will remove it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465962)
Time Spent: 2h 20m  (was: 2h 10m)

> Support parameterized queries in WHERE/HAVING clause
> 
>
> Key: HIVE-23951
> URL: https://issues.apache.org/jira/browse/HIVE-23951
> Project: Hive
>  Issue Type: Sub-task
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23951?focusedWorklogId=465959=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465959
 ]

ASF GitHub Bot logged work on HIVE-23951:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:55
Start Date: 03/Aug/20 23:55
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1315:
URL: https://github.com/apache/hive/pull/1315#discussion_r464721831



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/ddl/table/drop/PrepareStatementAnalyzer.java
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.ddl.table.drop;
+
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.ddl.DDLSemanticAnalyzerFactory.DDLType;
+import org.apache.hadoop.hive.ql.parse.ASTNode;
+import org.apache.hadoop.hive.ql.parse.CalcitePlanner;
+import org.apache.hadoop.hive.ql.parse.HiveParser;
+import org.apache.hadoop.hive.ql.parse.SemanticException;
+import org.apache.hadoop.hive.ql.session.SessionState;
+
+/**
+ * Analyzer for Prepare queries. This analyzer generates a plan for the parameterized query
+ * and saves it in the cache.
+ */
+@DDLType(types = HiveParser.TOK_PREPARE)
+public class PrepareStatementAnalyzer extends CalcitePlanner {
+
+  public PrepareStatementAnalyzer(QueryState queryState) throws 
SemanticException {
+super(queryState);
+  }
+
+  private String getQueryName(ASTNode root) {
+ASTNode queryNameAST = (ASTNode)(root.getChild(1));
+return queryNameAST.getText();
+  }
+
+  /**
+   * This method saves the current {@link PrepareStatementAnalyzer} object as well as
+   * the config used to compile the plan.
+   * @param queryName
+   * @throws SemanticException
+   */
+  private void savePlan(String queryName) throws SemanticException{
+SessionState ss = SessionState.get();
+assert(ss != null);
+
+if (ss.getPreparePlans().containsKey(queryName)) {
+  throw new SemanticException("Prepare query: " + queryName + " already exists.");
+}
+ss.getPreparePlans().put(queryName, this);

Review comment:
   Yes, will update the code.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465959)
Time Spent: 2h  (was: 1h 50m)

> Support parameterized queries in WHERE/HAVING clause
> 
>
> Key: HIVE-23951
> URL: https://issues.apache.org/jira/browse/HIVE-23951
> Project: Hive
>  Issue Type: Sub-task
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23951?focusedWorklogId=465961=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465961
 ]

ASF GitHub Bot logged work on HIVE-23951:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:55
Start Date: 03/Aug/20 23:55
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1315:
URL: https://github.com/apache/hive/pull/1315#discussion_r464722011



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorUtils.java
##
@@ -49,6 +50,8 @@
 import com.google.common.collect.Lists;
 import com.google.common.collect.Multimap;
 
+import static 
org.apache.hadoop.hive.ql.optimizer.physical.AnnotateRunTimeStatsOptimizer.getAllOperatorsForSimpleFetch;

Review comment:
   You mean move `getAllOperatorsForSimpleFetch` from 
`AnnotateRunTimeStatsOptimizer`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465961)
Time Spent: 2h 10m  (was: 2h)

> Support parameterized queries in WHERE/HAVING clause
> 
>
> Key: HIVE-23951
> URL: https://issues.apache.org/jira/browse/HIVE-23951
> Project: Hive
>  Issue Type: Sub-task
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23951?focusedWorklogId=465951=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465951
 ]

ASF GitHub Bot logged work on HIVE-23951:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:27
Start Date: 03/Aug/20 23:27
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1315:
URL: https://github.com/apache/hive/pull/1315#discussion_r464713600



##
File path: ql/src/test/results/clientpositive/llap/prepare_plan.q.out
##
@@ -0,0 +1,1575 @@
+PREHOOK: query: explain extended prepare pcount from select count(*) from src 
where key > ?
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src
+ A masked pattern was here 
+POSTHOOK: query: explain extended prepare pcount from select count(*) from src 
where key > ?
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src
+ A masked pattern was here 
+OPTIMIZED SQL: SELECT COUNT(*) AS `$f0`
+FROM `default`.`src`
+WHERE `key` > CAST(? AS STRING)
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+Tez
+ A masked pattern was here 
+  Edges:
+Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
+ A masked pattern was here 
+  Vertices:
+Map 1 
+Map Operator Tree:
+TableScan
+  alias: src
+  filterExpr: (key > CAST( Dynamic Parameter  index: 1 AS 
STRING)) (type: boolean)
+  Statistics: Num rows: 500 Data size: 43500 Basic stats: 
COMPLETE Column stats: COMPLETE
+  GatherStats: false
+  Filter Operator
+isSamplingPred: false
+predicate: (key > CAST( Dynamic Parameter  index: 1 AS 
STRING)) (type: boolean)
+Statistics: Num rows: 166 Data size: 14442 Basic stats: 
COMPLETE Column stats: COMPLETE
+Select Operator
+  Statistics: Num rows: 166 Data size: 14442 Basic stats: 
COMPLETE Column stats: COMPLETE
+  Group By Operator
+aggregations: count()
+minReductionHashAggr: 0.99
+mode: hash
+outputColumnNames: _col0
+Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: COMPLETE
+Reduce Output Operator
+  bucketingVersion: 2
+  null sort order: 
+  numBuckets: -1
+  sort order: 
+  Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: COMPLETE
+  tag: -1
+  value expressions: _col0 (type: bigint)
+  auto parallelism: false
+Execution mode: llap
+LLAP IO: no inputs
+Path -> Alias:
+ A masked pattern was here 
+Path -> Partition:
+ A masked pattern was here 
+Partition
+  base file name: src
+  input format: org.apache.hadoop.mapred.TextInputFormat
+  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
+  properties:
+bucket_count -1
+bucketing_version 2
+column.name.delimiter ,
+columns key,value
+columns.types string:string
+ A masked pattern was here 
+name default.src
+serialization.format 1
+serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+input format: org.apache.hadoop.mapred.TextInputFormat
+output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
+properties:
+  bucketing_version 2
+  column.name.delimiter ,
+  columns key,value
+  columns.comments 'default','default'
+  columns.types string:string
+ A masked pattern was here 
+  name default.src
+  serialization.format 1
+  serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+name: default.src
+  name: default.src
+Truncated Path -> Alias:
+  /src [src]
+Reducer 2 
+Execution mode: llap
+Needs Tagging: false
+Reduce 

[jira] [Work logged] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23951?focusedWorklogId=465952=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465952
 ]

ASF GitHub Bot logged work on HIVE-23951:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:30
Start Date: 03/Aug/20 23:30
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1315:
URL: https://github.com/apache/hive/pull/1315#discussion_r464714632



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java
##
@@ -1619,6 +1620,9 @@ public static ColStatistics 
getColStatisticsFromExpression(HiveConf conf, Statis
   colName = enfd.getFieldName();
   colType = enfd.getTypeString();
   countDistincts = numRows;
+} else if (end instanceof ExprDynamicParamDesc) {
+  //skip collecting stats for parameters

Review comment:
   This method tries to figure out the column statistics involved in the given 
expression. I guess the stats are used by parent callers to do various 
estimations, such as map join conversion and aggregate min/max. For a dynamic 
expression, stats are currently returned as null. I think it makes more sense to 
do what `buildColStatForConstant` does and return an estimation instead of null. 
I will update the code.
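
   To make that concrete, here is a small, self-contained sketch of the estimation idea (the `ColStatsEstimate` type and method names below are illustrative stand-ins, not Hive's actual `ColStatistics`/`StatsUtils` API, and this is not the actual patch):

{code:java}
// Illustrative sketch only: stand-in types, not the real HIVE-23951 change.
// Idea: treat a dynamic parameter like a single constant value and return an
// estimate instead of null, so parent callers (e.g. map join sizing) still get usable stats.
public class DynamicParamStatsSketch {

  /** Stand-in for a column statistics holder. */
  static final class ColStatsEstimate {
    final String colName;
    final String colType;
    long countDistinct;
    long numNulls;
    double avgColLen;

    ColStatsEstimate(String colName, String colType) {
      this.colName = colName;
      this.colType = colType;
    }
  }

  /** Builds an estimate for a dynamic parameter, mirroring a constant-style estimate. */
  static ColStatsEstimate estimateForDynamicParam(String colName, String colType, double defaultColLen) {
    ColStatsEstimate cs = new ColStatsEstimate(colName, colType);
    cs.countDistinct = 1;         // one (still unknown) value will be bound at execution time
    cs.numNulls = 0;              // assume the bound value is non-null
    cs.avgColLen = defaultColLen; // fall back to a type-based default length
    return cs;
  }

  public static void main(String[] args) {
    ColStatsEstimate cs = estimateForDynamicParam("key", "string", 10.0);
    System.out.println(cs.colName + ": ndv=" + cs.countDistinct + ", nulls=" + cs.numNulls);
  }
}
{code}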





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465952)
Time Spent: 1h 50m  (was: 1h 40m)

> Support parameterized queries in WHERE/HAVING clause
> 
>
> Key: HIVE-23951
> URL: https://issues.apache.org/jira/browse/HIVE-23951
> Project: Hive
>  Issue Type: Sub-task
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23951?focusedWorklogId=465948=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465948
 ]

ASF GitHub Bot logged work on HIVE-23951:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:16
Start Date: 03/Aug/20 23:16
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1315:
URL: https://github.com/apache/hive/pull/1315#discussion_r464710291



##
File path: ql/src/test/results/clientpositive/llap/prepare_plan.q.out
##
@@ -0,0 +1,1575 @@
+PREHOOK: query: explain extended prepare pcount from select count(*) from src 
where key > ?
+PREHOOK: type: QUERY
+PREHOOK: Input: default@src
+ A masked pattern was here 
+POSTHOOK: query: explain extended prepare pcount from select count(*) from src 
where key > ?
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@src
+ A masked pattern was here 
+OPTIMIZED SQL: SELECT COUNT(*) AS `$f0`
+FROM `default`.`src`
+WHERE `key` > CAST(? AS STRING)
+STAGE DEPENDENCIES:
+  Stage-1 is a root stage
+  Stage-0 depends on stages: Stage-1
+
+STAGE PLANS:
+  Stage: Stage-1
+Tez
+ A masked pattern was here 
+  Edges:
+Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
+ A masked pattern was here 
+  Vertices:
+Map 1 
+Map Operator Tree:
+TableScan
+  alias: src
+  filterExpr: (key > CAST( Dynamic Parameter  index: 1 AS 
STRING)) (type: boolean)
+  Statistics: Num rows: 500 Data size: 43500 Basic stats: 
COMPLETE Column stats: COMPLETE
+  GatherStats: false
+  Filter Operator
+isSamplingPred: false
+predicate: (key > CAST( Dynamic Parameter  index: 1 AS 
STRING)) (type: boolean)
+Statistics: Num rows: 166 Data size: 14442 Basic stats: 
COMPLETE Column stats: COMPLETE
+Select Operator
+  Statistics: Num rows: 166 Data size: 14442 Basic stats: 
COMPLETE Column stats: COMPLETE
+  Group By Operator
+aggregations: count()
+minReductionHashAggr: 0.99
+mode: hash
+outputColumnNames: _col0
+Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: COMPLETE
+Reduce Output Operator
+  bucketingVersion: 2
+  null sort order: 
+  numBuckets: -1
+  sort order: 
+  Statistics: Num rows: 1 Data size: 8 Basic stats: 
COMPLETE Column stats: COMPLETE
+  tag: -1
+  value expressions: _col0 (type: bigint)
+  auto parallelism: false
+Execution mode: llap
+LLAP IO: no inputs
+Path -> Alias:
+ A masked pattern was here 
+Path -> Partition:
+ A masked pattern was here 
+Partition
+  base file name: src
+  input format: org.apache.hadoop.mapred.TextInputFormat
+  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
+  properties:
+bucket_count -1
+bucketing_version 2
+column.name.delimiter ,
+columns key,value
+columns.types string:string
+ A masked pattern was here 
+name default.src
+serialization.format 1
+serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+
+input format: org.apache.hadoop.mapred.TextInputFormat
+output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
+properties:
+  bucketing_version 2
+  column.name.delimiter ,
+  columns key,value
+  columns.comments 'default','default'
+  columns.types string:string
+ A masked pattern was here 
+  name default.src
+  serialization.format 1
+  serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
+name: default.src
+  name: default.src
+Truncated Path -> Alias:
+  /src [src]
+Reducer 2 
+Execution mode: llap
+Needs Tagging: false
+Reduce 

[jira] [Work logged] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23951?focusedWorklogId=465946=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465946
 ]

ASF GitHub Bot logged work on HIVE-23951:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:13
Start Date: 03/Aug/20 23:13
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1315:
URL: https://github.com/apache/hive/pull/1315#discussion_r464709177



##
File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ExprDynamicParamDesc.java
##
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.plan;
+
+import java.io.Serializable;
+import java.util.List;
+
+import org.apache.commons.lang3.builder.HashCodeBuilder;
+import org.apache.hadoop.hive.common.StringInternUtils;
+import org.apache.hadoop.hive.serde.serdeConstants;
+import org.apache.hadoop.hive.serde2.objectinspector.ConstantObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
+import org.apache.hadoop.hive.serde2.typeinfo.BaseCharTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
+
+/**
+ * A dynamic parameter expression.
+ */
+public class ExprDynamicParamDesc extends ExprNodeDesc implements Serializable 
{
+  private static final long serialVersionUID = 1L;
+  final protected transient static char[] hexArray = 
"0123456789ABCDEF".toCharArray();
+
+  private int index;
+  private Object value;
+
+  public ExprDynamicParamDesc() {
+  }
+
+  public ExprDynamicParamDesc(TypeInfo typeInfo, int index, Object value) {
+super(typeInfo);
+this.index =  index;
+this.value = value;
+  }
+
+  public Object getValue() {
+return value;
+  }
+
+  public int getIndex() {
+return index;
+  }
+
+
+  @Override
+  public String toString() {
+return "Dynamic Parameter " + " index: " + index;

Review comment:
   "Dynamic Parameter" makes it clear that the expression in an explain 
plan is dynamic expression. Just showing index will make it hard to read.
   What is the benefit of making it more compact?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465946)
Time Spent: 1h 20m  (was: 1h 10m)

> Support parameterized queries in WHERE/HAVING clause
> 
>
> Key: HIVE-23951
> URL: https://issues.apache.org/jira/browse/HIVE-23951
> Project: Hive
>  Issue Type: Sub-task
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=465943=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465943
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:03
Start Date: 03/Aug/20 23:03
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on pull request #1147:
URL: https://github.com/apache/hive/pull/1147#issuecomment-668281996


   @maheshk114, thanks for addressing the first batch of comments. The PR looks 
better. I have done a second pass and left some additional comments that should 
be addressed before merging. Please also merge master into your branch, since 
there seem to be some conflicts.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465943)
Time Spent: 14h  (was: 13h 50m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 14h
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that needs an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is 
> added to get the desired result. This causes:
>  # Extra computation - The left outer join projects the redundant columns 
> from the right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided with an anti join, as the anti join projects 
> only the required columns and rows from the left-side table.
>  # Extra shuffle - With an anti join, the duplicate records moved to the join 
> node can be avoided at the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra memory usage - For a map-based anti join, a hash set is 
> sufficient, as just the key is required to check whether a record matches the 
> join condition. For a left join, we need the key and the non-key columns 
> as well, and thus a hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number values in the web_sales table in a typical 10TB 
> TPCDS setup is just 10% of the total records. So when we convert this query to an 
> anti join, instead of 7 billion rows, only 600 million rows are moved to the join 
> node.
> In the current patch, just one conversion is done. The pattern 
> project->filter->left-join is converted to project->anti-join. This takes 
> care of subqueries with a "not exists" clause. Queries with "not exists" 
> are converted first to filter + left-join and then converted to anti 
> join. Queries with "not in" are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=465942=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465942
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:02
Start Date: 03/Aug/20 23:02
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r464673502



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptUtil.java
##
@@ -747,6 +747,8 @@ public static RewritablePKFKJoinInfo 
isRewritablePKFKJoin(Join join,
 final RelNode nonFkInput = leftInputPotentialFK ? join.getRight() : 
join.getLeft();
 final RewritablePKFKJoinInfo nonRewritable = 
RewritablePKFKJoinInfo.of(false, null);
 
+// TODO : Need to handle Anti join.

Review comment:
   Thanks for creating HIVE-23906. Can we simply return `nonRewritable` if 
it is an anti-join for the time being, rather than proceeding? This certainly 
requires a bit of extra thinking and specific tests to make sure it is working 
as expected (for which we already have HIVE-23906).
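
   A minimal sketch of that bail-out (assuming the surrounding variables of `isRewritablePKFKJoin`, i.e. `join` and the `nonRewritable` result defined a few lines above; an illustration, not the actual patch):

{code:java}
// Sketch only: early exit for anti join until HIVE-23906 covers PK-FK rewriting for it.
// JoinRelType is Calcite's org.apache.calcite.rel.core.JoinRelType.
if (join.getJoinType() == JoinRelType.ANTI) {
  // Rewriting based on PK-FK constraints has not been validated for anti join yet,
  // so treat it as non-rewritable for the time being.
  return nonRewritable;
}
{code}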

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinConstraintsRule.java
##
@@ -183,6 +189,7 @@ public void onMatch(RelOptRuleCall call) {
 switch (joinType) {
 case SEMI:
 case INNER:
+case ANTI:

Review comment:
   This should be removed to avoid confusion, since we bail out above.

##
File path: ql/src/test/queries/clientpositive/subquery_in_having.q
##
@@ -140,6 +140,22 @@ CREATE TABLE src_null_n4 (key STRING COMMENT 'default', 
value STRING COMMENT 'de
 LOAD DATA LOCAL INPATH "../../data/files/kv1.txt" INTO TABLE src_null_n4;
 INSERT INTO src_null_n4 values('5444', null);
 
+explain
+select key, value, count(*)

Review comment:
   Should we execute this query with conversion=true?

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java
##
@@ -1233,4 +1233,21 @@ public FixNullabilityShuttle(RexBuilder rexBuilder,
 }
   }
 
+  // Checks if any of the expressions in the given list come from the right 
side of the join.

Review comment:
   nit. Change comment to javadoc

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAntiSemiJoinRule.java
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAntiJoin;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveAntiSemiJoinRule extends RelOptRule {
+  protected static final Logger LOG = 
LoggerFactory.getLogger(HiveAntiSemiJoinRule.class);
+  public static final HiveAntiSemiJoinRule INSTANCE = new 
HiveAntiSemiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], 
cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveAntiSemiJoinRule() {
+super(operand(Project.class, operand(Filter.class, operand(Join.class, 
RelOptRule.any(,
+"HiveJoinWithFilterToAntiJoinRule:filter");
+  }

[jira] [Work logged] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23951?focusedWorklogId=465938=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465938
 ]

ASF GitHub Bot logged work on HIVE-23951:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 22:47
Start Date: 03/Aug/20 22:47
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1315:
URL: https://github.com/apache/hive/pull/1315#discussion_r464701187



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/ddl/table/drop/ExecuteStatementAnalyzer.java
##
@@ -0,0 +1,377 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.ddl.table.drop;
+
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.ddl.DDLSemanticAnalyzerFactory.DDLType;
+import org.apache.hadoop.hive.ql.exec.ExplainTask;
+import org.apache.hadoop.hive.ql.exec.FetchTask;
+import org.apache.hadoop.hive.ql.exec.FilterOperator;
+import org.apache.hadoop.hive.ql.exec.Operator;
+import org.apache.hadoop.hive.ql.exec.OperatorUtils;
+import org.apache.hadoop.hive.ql.exec.SelectOperator;
+import org.apache.hadoop.hive.ql.exec.SerializationUtilities;
+import org.apache.hadoop.hive.ql.exec.Task;
+import org.apache.hadoop.hive.ql.exec.Utilities;
+import org.apache.hadoop.hive.ql.exec.tez.TezTask;
+import org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator;
+import org.apache.hadoop.hive.ql.parse.ASTNode;
+import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer;
+import org.apache.hadoop.hive.ql.parse.HiveParser;
+import org.apache.hadoop.hive.ql.parse.SemanticException;
+import org.apache.hadoop.hive.ql.parse.type.ExprNodeDescExprFactory;
+import org.apache.hadoop.hive.ql.plan.BaseWork;
+import org.apache.hadoop.hive.ql.plan.ExprDynamicParamDesc;
+import org.apache.hadoop.hive.ql.plan.ExprNodeConstantDesc;
+import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.hadoop.hive.serde2.typeinfo.CharTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
+import org.apache.hadoop.hive.serde2.typeinfo.VarcharTypeInfo;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ * Analyzer for Execute statement.
+ * This analyzer
+ *  retrieves the cached {@link BaseSemanticAnalyzer},
+ *  makes a copy of all tasks by serializing/deserializing them,
+ *  binds dynamic parameters inside the cached {@link BaseSemanticAnalyzer} using the provided values
+ */
+@DDLType(types = HiveParser.TOK_EXECUTE)
+public class ExecuteStatementAnalyzer extends BaseSemanticAnalyzer {
+
+  public ExecuteStatementAnalyzer(QueryState queryState) throws 
SemanticException {
+super(queryState);
+  }
+
+  /**
+   * This class encapsulates all {@link Task}s required to be copied.
+   * This is required because the {@link FetchTask} and the list of {@link Task}s may hold references to the same
+   * objects (e.g. the list of result files) and must be serialized/de-serialized together.
+   */
+  private class PlanCopy {
+FetchTask fetchTask;
+List<Task<?>> tasks;
+
+PlanCopy(FetchTask fetchTask, List<Task<?>> tasks) {
+  this.fetchTask = fetchTask;
+  this.tasks = tasks;
+}
+
+FetchTask getFetchTask() {
+  return fetchTask;
+}
+
+List<Task<?>> getTasks()  {
+  return tasks;
+}
+  }
+
+  private String getQueryName(ASTNode root) {
+ASTNode queryNameAST = (ASTNode)(root.getChild(1));
+return queryNameAST.getText();
+  }
+
+  /**
+   * Utility method to create a copy of the provided object using Kryo serialization/de-serialization.
+   */
+  private <T> T makeCopy(final Object task, Class<T> objClass) {
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+SerializationUtilities.serializePlan(task, baos);
+
+return SerializationUtilities.deserializePlan(
+new ByteArrayInputStream(baos.toByteArray()), objClass);
+ 

[jira] [Work logged] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23951?focusedWorklogId=465931=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465931
 ]

ASF GitHub Bot logged work on HIVE-23951:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 22:38
Start Date: 03/Aug/20 22:38
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1315:
URL: https://github.com/apache/hive/pull/1315#discussion_r464698359



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/ddl/table/drop/ExecuteStatementAnalyzer.java
##
@@ -0,0 +1,377 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.ddl.table.drop;
+
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.ddl.DDLSemanticAnalyzerFactory.DDLType;
+import org.apache.hadoop.hive.ql.exec.ExplainTask;
+import org.apache.hadoop.hive.ql.exec.FetchTask;
+import org.apache.hadoop.hive.ql.exec.FilterOperator;
+import org.apache.hadoop.hive.ql.exec.Operator;
+import org.apache.hadoop.hive.ql.exec.OperatorUtils;
+import org.apache.hadoop.hive.ql.exec.SelectOperator;
+import org.apache.hadoop.hive.ql.exec.SerializationUtilities;
+import org.apache.hadoop.hive.ql.exec.Task;
+import org.apache.hadoop.hive.ql.exec.Utilities;
+import org.apache.hadoop.hive.ql.exec.tez.TezTask;
+import org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator;
+import org.apache.hadoop.hive.ql.parse.ASTNode;
+import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer;
+import org.apache.hadoop.hive.ql.parse.HiveParser;
+import org.apache.hadoop.hive.ql.parse.SemanticException;
+import org.apache.hadoop.hive.ql.parse.type.ExprNodeDescExprFactory;
+import org.apache.hadoop.hive.ql.plan.BaseWork;
+import org.apache.hadoop.hive.ql.plan.ExprDynamicParamDesc;
+import org.apache.hadoop.hive.ql.plan.ExprNodeConstantDesc;
+import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.hadoop.hive.serde2.typeinfo.CharTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
+import org.apache.hadoop.hive.serde2.typeinfo.VarcharTypeInfo;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ * Analyzer for Execute statement.
+ * This analyzer
+ *  retrieves the cached {@link BaseSemanticAnalyzer},
+ *  makes a copy of all tasks by serializing/deserializing them,
+ *  binds dynamic parameters inside the cached {@link BaseSemanticAnalyzer} using the provided values
+ */
+@DDLType(types = HiveParser.TOK_EXECUTE)
+public class ExecuteStatementAnalyzer extends BaseSemanticAnalyzer {
+
+  public ExecuteStatementAnalyzer(QueryState queryState) throws 
SemanticException {
+super(queryState);
+  }
+
+  /**
+   * This class encapsulates all {@link Task}s required to be copied.
+   * This is required because the {@link FetchTask} and the list of {@link Task}s may hold references to the same
+   * objects (e.g. the list of result files) and must be serialized/de-serialized together.
+   */
+  private class PlanCopy {
+FetchTask fetchTask;
+List<Task<?>> tasks;
+
+PlanCopy(FetchTask fetchTask, List<Task<?>> tasks) {
+  this.fetchTask = fetchTask;
+  this.tasks = tasks;
+}
+
+FetchTask getFetchTask() {
+  return fetchTask;
+}
+
+List<Task<?>> getTasks()  {
+  return tasks;
+}
+  }
+
+  private String getQueryName(ASTNode root) {
+ASTNode queryNameAST = (ASTNode)(root.getChild(1));
+return queryNameAST.getText();
+  }
+
+  /**
+   * Utility method to create a copy of the provided object using Kryo serialization/de-serialization.
+   */
+  private <T> T makeCopy(final Object task, Class<T> objClass) {
+ByteArrayOutputStream baos = new ByteArrayOutputStream();
+SerializationUtilities.serializePlan(task, baos);
+
+return SerializationUtilities.deserializePlan(
+new ByteArrayInputStream(baos.toByteArray()), objClass);
+ 

[jira] [Commented] (HIVE-9020) When dropping external tables, Hive should not verify whether user has access to the data.

2020-08-03 Thread Tom Kiefer (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-9020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170396#comment-17170396
 ] 

Tom Kiefer commented on HIVE-9020:
--

[~cwsteinbach], can it (or its equivalent, if the underlying code has otherwise 
changed) be?

This appears to still be a valid and problematic issue.

> When dropping external tables, Hive should not verify whether user has access 
> to the data. 
> ---
>
> Key: HIVE-9020
> URL: https://issues.apache.org/jira/browse/HIVE-9020
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.13.1
>Reporter: Anant Nag
>Priority: Major
> Attachments: dropExternal.patch
>
>
> When dropping tables, Hive verifies whether the user has access to the data 
> on HDFS. It fails if the user doesn't have access. That makes sense for internal 
> tables, since the data has to be deleted when dropping internal tables, but for 
> external tables Hive should not check for data access.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23941) Refactor TypeCheckProcFactory to be database agnostic

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23941?focusedWorklogId=465881=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465881
 ]

ASF GitHub Bot logged work on HIVE-23941:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 19:41
Start Date: 03/Aug/20 19:41
Worklog Time Spent: 10m 
  Work Description: jcamachor merged pull request #1326:
URL: https://github.com/apache/hive/pull/1326


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465881)
Time Spent: 40m  (was: 0.5h)

> Refactor TypeCheckProcFactory to be database agnostic
> -
>
> Key: HIVE-23941
> URL: https://issues.apache.org/jira/browse/HIVE-23941
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Steve Carlin
>Assignee: Steve Carlin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Part of the code has already been refactored to become database agnostic 
> (i.e. HiveFunctionHelper).  
> Further refactoring needs to be done on TypeCheckProcFactory which also 
> should be database agnostic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-23941) Refactor TypeCheckProcFactory to be database agnostic

2020-08-03 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez resolved HIVE-23941.

Fix Version/s: 4.0.0
   Resolution: Fixed

Pushed to master, thanks [~scarlin]!

> Refactor TypeCheckProcFactory to be database agnostic
> -
>
> Key: HIVE-23941
> URL: https://issues.apache.org/jira/browse/HIVE-23941
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Steve Carlin
>Assignee: Steve Carlin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Part of the code has already been refactored to become database agnostic 
> (i.e. HiveFunctionHelper).  
> Further refactoring needs to be done on TypeCheckProcFactory which also 
> should be database agnostic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-23941) Refactor TypeCheckProcFactory to be database agnostic

2020-08-03 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez reassigned HIVE-23941:
--

Assignee: Steve Carlin

> Refactor TypeCheckProcFactory to be database agnostic
> -
>
> Key: HIVE-23941
> URL: https://issues.apache.org/jira/browse/HIVE-23941
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Steve Carlin
>Assignee: Steve Carlin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Part of the code has already been refactored to become database agnostic 
> (i.e. HiveFunctionHelper).  
> Further refactoring needs to be done on TypeCheckProcFactory which also 
> should be database agnostic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-3562) Some limit can be pushed down to map stage

2020-08-03 Thread Girish Kadli (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170367#comment-17170367
 ] 

Girish Kadli commented on HIVE-3562:


I have a Hive query that returns different results with and without LIMIT.

Let's call the result set of the query with LIMIT R1 and the result set of the 
query without LIMIT R2.

These are the discrepancies:
 * R1 contains null values for some of the columns.
 * R2 doesn't contain the rows returned by R1.
 * R2 contains all non-null column values.
 * R2 returns correct results; R1 returns wrong results.

After debugging, I realised that *hive.limit.pushdown.memory.usage=0.1*

is the root cause of this issue. After I set this property to -1, R1 starts 
returning correct rows with non-null column values, and the R1 results are part 
of the R2 results.

What could be the problem with setting a lower value for 
*hive.limit.pushdown.memory.usage*?

Can it cause data issues (wrong results) in Hive queries that use LIMIT?

> Some limit can be pushed down to map stage
> --
>
> Key: HIVE-3562
> URL: https://issues.apache.org/jira/browse/HIVE-3562
> Project: Hive
>  Issue Type: Bug
>Reporter: Navis Ryu
>Assignee: Navis Ryu
>Priority: Trivial
> Fix For: 0.12.0
>
> Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch, 
> HIVE-3562.D5967.3.patch, HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch, 
> HIVE-3562.D5967.6.patch, HIVE-3562.D5967.7.patch, HIVE-3562.D5967.8.patch, 
> HIVE-3562.D5967.9.patch
>
>
> Queries with limit clause (with reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> makes operator tree, 
> TS-SEL-RS-EXT-LIMIT-FS
> But LIMIT can be partially calculated in RS, reducing size of shuffling.
> TS-SEL-RS(TOP-N)-EXT-LIMIT-FS



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23820) [HS2] Send tableId in request for get_table_request API

2020-08-03 Thread Kishen Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kishen Das updated HIVE-23820:
--
Summary: [HS2] Send tableId in request for get_table_request API  (was: 
[HS2] Send tableId in request for all the new HMS get_parition_* APIs that are 
in request/response form)

> [HS2] Send tableId in request for get_table_request API
> ---
>
> Key: HIVE-23820
> URL: https://issues.apache.org/jira/browse/HIVE-23820
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Kishen Das
>Assignee: Kishen Das
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HIVE-23821) [HS2] Send tableId in request for all the new HMS get_parition_* APIs that are in request/response form

2020-08-03 Thread Kishen Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-23821 started by Kishen Das.
-
> [HS2] Send tableId in request for all the new HMS get_parition_* APIs that 
> are in request/response form
> ---
>
> Key: HIVE-23821
> URL: https://issues.apache.org/jira/browse/HIVE-23821
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Kishen Das
>Assignee: Kishen Das
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23821) [HS2] Send tableId in request for all the new HMS get_parition_* APIs that are in request/response form

2020-08-03 Thread Kishen Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kishen Das updated HIVE-23821:
--
Summary: [HS2] Send tableId in request for all the new HMS get_parition_* 
APIs that are in request/response form  (was: [HS2] Send tableId in request for 
get_table_request API)

> [HS2] Send tableId in request for all the new HMS get_parition_* APIs that 
> are in request/response form
> ---
>
> Key: HIVE-23821
> URL: https://issues.apache.org/jira/browse/HIVE-23821
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Kishen Das
>Assignee: Kishen Das
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23980) Shade guava from existing Hive versions

2020-08-03 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170362#comment-17170362
 ] 

L. C. Hsieh commented on HIVE-23980:


[~csun] What do you think about shading Guava in existing Hive artifacts? 

> Shade guava from existing Hive versions
> ---
>
> Key: HIVE-23980
> URL: https://issues.apache.org/jira/browse/HIVE-23980
> Project: Hive
>  Issue Type: Bug
>Reporter: L. C. Hsieh
>Priority: Major
>
> I'm trying to upgrade Guava version in Spark. The JIRA ticket is SPARK-32502.
> Running test hits an error:
> {code}
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: 
> tried to access method 
> com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator;
>  from class org.apache.hadoop.hive.ql.exec.FetchOperator
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.(FetchOperator.java:108)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
> {code}
> I know that hive-exec doesn't shade Guava until HIVE-22126 but that work 
> targets 4.0.0. I'm wondering if there is a solution for current Hive 
> versions, e.g. Hive 2.3.7? Any ideas?
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23820) [HS2] Send tableId in request for all the new HMS get_parition_* APIs that are in request/response form

2020-08-03 Thread Kishen Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kishen Das updated HIVE-23820:
--
Summary: [HS2] Send tableId in request for all the new HMS get_parition_* 
APIs that are in request/response form  (was: [HS2] Send tableId in request for 
all the new HMS get_* APIs that are in request/response form)

> [HS2] Send tableId in request for all the new HMS get_parition_* APIs that 
> are in request/response form
> ---
>
> Key: HIVE-23820
> URL: https://issues.apache.org/jira/browse/HIVE-23820
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Kishen Das
>Assignee: Kishen Das
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-23821) [HS2] Send tableId in request for get_table_request API

2020-08-03 Thread Kishen Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kishen Das reassigned HIVE-23821:
-

Assignee: Kishen Das

> [HS2] Send tableId in request for get_table_request API
> ---
>
> Key: HIVE-23821
> URL: https://issues.apache.org/jira/browse/HIVE-23821
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Kishen Das
>Assignee: Kishen Das
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23821) [HS2] Send tableId in request for get_table_request API

2020-08-03 Thread Kishen Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kishen Das updated HIVE-23821:
--
Summary: [HS2] Send tableId in request for get_table_request API  (was: 
[HS2] Send tableId in request for all the new HMS get_* APIs that are in 
request/response form)

> [HS2] Send tableId in request for get_table_request API
> ---
>
> Key: HIVE-23821
> URL: https://issues.apache.org/jira/browse/HIVE-23821
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Kishen Das
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-22062) WriteId is not updated for a partitioned ACID table when schema changes

2020-08-03 Thread Vihang Karajgaonkar (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170290#comment-17170290
 ] 

Vihang Karajgaonkar commented on HIVE-22062:


DDLs advance the table level writeId after HIVE-23573. Can you recheck if this 
still is a problem?
cc [~gaborkaszab]

> WriteId is not updated for a partitioned ACID table when schema changes
> ---
>
> Key: HIVE-22062
> URL: https://issues.apache.org/jira/browse/HIVE-22062
> Project: Hive
>  Issue Type: Bug
>Reporter: Gabor Kaszab
>Assignee: Laszlo Kovari
>Priority: Major
>  Labels: ACID
>
> Changing the schema (e.g. adding a new column) of a non-partitioned ACID 
> table results in the table-level writeId being incremented. This is as 
> expected.
> However, if you do the same on a partitioned ACID table then neither the 
> table-level nor the partition-level writeIds are updated. I would expect in 
> this case to increment the table-level writeId to reflect that the table has 
> been changed.
> Note that get_valid_write_ids() shows that the high watermark is incremented 
> even though the writeId isn't.
> Update: I'd extend the scope of this Jira a bit further. There are a number 
> of use cases in Hive that don't result in a writeId change on ACID tables, 
> and as a result there is no way for other systems (like Impala) to judge whether 
> a refresh should be run on a table or not. The only option is to update all the 
> data for a table every time, which is expensive. E.g., in addition to the 
> above use case, compaction is not noticeable outside of Hive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23980) Shade guava from existing Hive versions

2020-08-03 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170187#comment-17170187
 ] 

L. C. Hsieh commented on HIVE-23980:


And before HIVE-22126, I think Guava is not shaded in hive-exec.

> Shade guava from existing Hive versions
> ---
>
> Key: HIVE-23980
> URL: https://issues.apache.org/jira/browse/HIVE-23980
> Project: Hive
>  Issue Type: Bug
>Reporter: L. C. Hsieh
>Priority: Major
>
> I'm trying to upgrade Guava version in Spark. The JIRA ticket is SPARK-32502.
> Running test hits an error:
> {code}
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: 
> tried to access method 
> com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator;
>  from class org.apache.hadoop.hive.ql.exec.FetchOperator
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.(FetchOperator.java:108)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
> {code}
> I know that hive-exec doesn't shade Guava until HIVE-22126 but that work 
> targets 4.0.0. I'm wondering if there is a solution for current Hive 
> versions, e.g. Hive 2.3.7? Any ideas?
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23980) Shade guava from existing Hive versions

2020-08-03 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170186#comment-17170186
 ] 

L. C. Hsieh commented on HIVE-23980:


I think Spark already uses the core classifier in its Hive dependencies.

> Shade guava from existing Hive versions
> ---
>
> Key: HIVE-23980
> URL: https://issues.apache.org/jira/browse/HIVE-23980
> Project: Hive
>  Issue Type: Bug
>Reporter: L. C. Hsieh
>Priority: Major
>
> I'm trying to upgrade Guava version in Spark. The JIRA ticket is SPARK-32502.
> Running test hits an error:
> {code}
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: 
> tried to access method 
> com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator;
>  from class org.apache.hadoop.hive.ql.exec.FetchOperator
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.(FetchOperator.java:108)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
> {code}
> I know that hive-exec doesn't shade Guava until HIVE-22126 but that work 
> targets 4.0.0. I'm wondering if there is a solution for current Hive 
> versions, e.g. Hive 2.3.7? Any ideas?
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23975) Reuse evicted keys from aggregation buffers

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23975?focusedWorklogId=465807=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465807
 ]

ASF GitHub Bot logged work on HIVE-23975:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 16:45
Start Date: 03/Aug/20 16:45
Worklog Time Spent: 10m 
  Work Description: mustafaiman commented on a change in pull request #1352:
URL: https://github.com/apache/hive/pull/1352#discussion_r464532531



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/wrapper/VectorHashKeyWrapperGeneral.java
##
@@ -262,6 +255,138 @@ private void duplicateTo(VectorHashKeyWrapperGeneral 
clone) {
 
 clone.hashcode = hashcode;
 assert clone.equals(this);
+
+return clone;
+  }
+
+  private long[] copyInPlaceOrAllocate(long[] from, long[] to) {
+if (from.length > 0) {
+  if (to != null && to.length == from.length) {
+System.arraycopy(from, 0, to, 0, from.length);
+return to;
+  } else {
+return from.clone();
+  }
+} else {
+  return EMPTY_LONG_ARRAY;
+}
+  }
+
+  private double[] copyInPlaceOrAllocate(double[] from, double[] to) {
+if (from.length > 0) {
+  if (to != null && to.length == from.length) {
+System.arraycopy(from, 0, to, 0, from.length);
+return to;
+  } else {
+return from.clone();
+  }
+} else {
+  return EMPTY_DOUBLE_ARRAY;
+}
+  }
+
+  private boolean[] copyInPlaceOrAllocate(boolean[] from, boolean[] to) {
+if (to != null && to.length == from.length) {
+  System.arraycopy(from, 0, to, 0, from.length);
+  return to;
+} else {
+  return from.clone();
+}
+  }
+
+  private HiveDecimalWritable[] copyInPlaceOrAllocate(HiveDecimalWritable[] 
from, HiveDecimalWritable[] to) {
+if (from.length > 0) {
+  if (to == null || to.length != from.length) {
+to = new HiveDecimalWritable[from.length];
+  }
+  for (int i = 0; i < from.length; i++) {
+to[i] = new HiveDecimalWritable(from[i]);
+  }
+  return to;
+} else {
+  return EMPTY_DECIMAL_ARRAY;
+}
+  }
+
+  private Timestamp[] copyInPlaceOrAllocate(Timestamp[] from, Timestamp[] to) {
+if (from.length > 0) {
+  if (to == null || to.length != from.length) {
+to = new Timestamp[from.length];
+  }
+  for (int i = 0; i < from.length; i++) {
+to[i] = (Timestamp) from[i].clone();
+  }
+  return to;
+} else {
+  return EMPTY_TIMESTAMP_ARRAY;
+}
+  }
+
+  @Override
+  public void copyKey(KeyWrapper oldWrapper) {
+VectorHashKeyWrapperGeneral clone = (VectorHashKeyWrapperGeneral) 
oldWrapper;
+clone.hashCtx = hashCtx;
+clone.keyCount = keyCount;
+clone.longValues = copyInPlaceOrAllocate(longValues, clone.longValues);
+clone.doubleValues = copyInPlaceOrAllocate(doubleValues, 
clone.doubleValues);
+clone.isNull = copyInPlaceOrAllocate(isNull, clone.isNull);
+clone.decimalValues = copyInPlaceOrAllocate(decimalValues, 
clone.decimalValues);
+
+if (byteLengths.length > 0) {
+  if (clone.byteLengths == null || clone.byteValues.length != 
byteValues.length) {
+// byteValues and byteStarts are always the same length
+clone.byteValues = new byte[byteValues.length][];
+clone.byteStarts = new int[byteValues.length];
+clone.byteLengths = byteLengths.clone();
+for (int i = 0; i < byteValues.length; ++i) {
+  // avoid allocation/copy of nulls, because it is potentially expensive.
+  // branch instead.
+  if (byteLengths[i] != -1) {
+clone.byteValues[i] = Arrays.copyOfRange(byteValues[i],
+byteStarts[i], byteStarts[i] + byteLengths[i]);
+  }
+}
+  } else {
+System.arraycopy(byteLengths, 0, clone.byteLengths, 0, 
byteValues.length);
+Arrays.fill(byteStarts, 0);
+System.arraycopy(byteStarts, 0, clone.byteStarts, 0, 
byteValues.length);
+for (int i = 0; i < byteValues.length; ++i) {
+  // avoid allocation/copy of nulls, because it is potentially expensive.
+  // branch instead.
+  if (byteLengths[i] != -1) {
+if (clone.byteValues[i] != null && clone.byteValues[i].length >= 
byteValues[i].length) {
+  System.arraycopy(byteValues[i], byteStarts[i], 
clone.byteValues[i], 0, byteLengths[i]);
+} else {
+  clone.byteValues[i] = Arrays.copyOfRange(byteValues[i],
+  byteStarts[i], byteStarts[i] + byteLengths[i]);

Review comment:
   `clone.byteStarts[i]` is always zero, but `byteStarts[i]` can take 
different values, depending on how it was assigned initially in the 
`VectorHashKeyWrapperBatch#assignString` methods. Example: 
`assignStringNullsNoRepeatingSelection`.





This is an 

[jira] [Commented] (HIVE-23980) Shade guava from existing Hive versions

2020-08-03 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170176#comment-17170176
 ] 

Chao Sun commented on HIVE-23980:
-

[~viirya] have you considered using {{hive-exec-\-core.jar}}? see 
[here|https://github.com/apache/hive/blob/master/ql/pom.xml#L992]. it has the 
same content as {{hive-exec-.jar}} but shades many dependencies. 

> Shade guava from existing Hive versions
> ---
>
> Key: HIVE-23980
> URL: https://issues.apache.org/jira/browse/HIVE-23980
> Project: Hive
>  Issue Type: Bug
>Reporter: L. C. Hsieh
>Priority: Major
>
> I'm trying to upgrade Guava version in Spark. The JIRA ticket is SPARK-32502.
> Running test hits an error:
> {code}
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: 
> tried to access method 
> com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator;
>  from class org.apache.hadoop.hive.ql.exec.FetchOperator
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.(FetchOperator.java:108)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
> {code}
> I know that hive-exec doesn't shade Guava until HIVE-22126 but that work 
> targets 4.0.0. I'm wondering if there is a solution for current Hive 
> versions, e.g. Hive 2.3.7? Any ideas?
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23251) Provide a way to have only a selection of datasets loaded

2020-08-03 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170175#comment-17170175
 ] 

Stamatis Zampetakis commented on HIVE-23251:


What is supposed to happen if in the same file we have multiple occurrences of 
ONLY?

For instance:
{noformat}
qt:dataset:src,part:ONLY
qt:dataset:lineitem:ONLY
{noformat}


> Provide a way to have only a selection of datasets loaded
> -
>
> Key: HIVE-23251
> URL: https://issues.apache.org/jira/browse/HIVE-23251
> Project: Hive
>  Issue Type: Sub-task
>  Components: Test
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23251.01.patch
>
>
> for example sysdb.q is listing all the tables known; which can change 
> depending on tests executed prior to this qtest 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23975) Reuse evicted keys from aggregation buffers

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23975?focusedWorklogId=465792=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465792
 ]

ASF GitHub Bot logged work on HIVE-23975:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 16:28
Start Date: 03/Aug/20 16:28
Worklog Time Spent: 10m 
  Work Description: mustafaiman commented on a change in pull request #1352:
URL: https://github.com/apache/hive/pull/1352#discussion_r464523610



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java
##
@@ -514,7 +526,8 @@ private void 
prepareBatchAggregationBufferSets(VectorizedRowBatch batch) throws
   // is very important to clone the keywrapper, the one we have from 
our
   // keyWrappersBatch is going to be reset/reused on next batch.
   aggregationBuffer = allocateAggregationBuffer();

Review comment:
   @rbalamohan I have another patch for that.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java
##
@@ -514,7 +526,8 @@ private void 
prepareBatchAggregationBufferSets(VectorizedRowBatch batch) throws
   // is very important to clone the keywrapper, the one we have from 
our
   // keyWrappersBatch is going to be reset/reused on next batch.
   aggregationBuffer = allocateAggregationBuffer();

Review comment:
   @rbalamohan I have another patch for that: 
https://github.com/apache/hive/pull/1337/files





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465792)
Time Spent: 0.5h  (was: 20m)

> Reuse evicted keys from aggregation buffers
> ---
>
> Key: HIVE-23975
> URL: https://issues.apache.org/jira/browse/HIVE-23975
> Project: Hive
>  Issue Type: Improvement
>Reporter: Mustafa Iman
>Assignee: Mustafa Iman
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23980) Shade guava from existing Hive versions

2020-08-03 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated HIVE-23980:
---
Summary: Shade guava from existing Hive versions  (was: Shade guava from 
existing Hive modules)

> Shade guava from existing Hive versions
> ---
>
> Key: HIVE-23980
> URL: https://issues.apache.org/jira/browse/HIVE-23980
> Project: Hive
>  Issue Type: Bug
>Reporter: L. C. Hsieh
>Priority: Major
>
> I'm trying to upgrade Guava version in Spark. The JIRA ticket is SPARK-32502.
> Running test hits an error:
> {code}
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.IllegalAccessError: 
> tried to access method 
> com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator;
>  from class org.apache.hadoop.hive.ql.exec.FetchOperator
>   at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.(FetchOperator.java:108)
>   at 
> org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
> {code}
> I know that hive-exec doesn't shade Guava until HIVE-22126 but that work 
> targets 4.0.0. I'm wondering if there is a solution for current Hive 
> versions, e.g. Hive 2.3.7? Any ideas?
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23978) Enable logging with PerfLogger in HMS client

2020-08-03 Thread Vineet Garg (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg updated HIVE-23978:
---
Issue Type: Improvement  (was: New Feature)

> Enable logging with PerfLogger in HMS client
> 
>
> Key: HIVE-23978
> URL: https://issues.apache.org/jira/browse/HIVE-23978
> Project: Hive
>  Issue Type: Improvement
>Reporter: Soumyakanti Das
>Assignee: Soumyakanti Das
>Priority: Minor
>
> Currently we cannot use PerfLogger in HiveMetaStoreClient.java to log 
> duration of API calls. When PerfLogger.java is moved from metastore-server to 
> metastore-common, without changing the package definition, many tests fail, 
> although metastore-server has a dependency on metastore-common.
> More analysis and investigation is needed to understand the root cause of 
> this issue.
> Related to [HIVE-23949|https://issues.apache.org/jira/browse/HIVE-23949]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23946) Improve control flow and error handling in QTest dataset loading/unloading

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23946?focusedWorklogId=465775=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465775
 ]

ASF GitHub Bot logged work on HIVE-23946:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:56
Start Date: 03/Aug/20 15:56
Worklog Time Spent: 10m 
  Work Description: zabetak commented on a change in pull request #1331:
URL: https://github.com/apache/hive/pull/1331#discussion_r464505336



##
File path: 
itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java
##
@@ -84,23 +83,25 @@ public boolean initDataset(String table, CliDriver 
cliDriver) throws Exception {
 
 try {
   CommandProcessorResponse result = cliDriver.processLine(commands);
-  LOG.info("Result from cliDrriver.processLine in initFromDatasets=" + 
result);
+  LOG.info("Result from cliDrriver.processLine in initDataset=" + result);
 } catch (CommandProcessorException e) {
-  Assert.fail("Failed during initFromDatasets processLine with code=" + e);
+  throw new RuntimeException("Failed while loading table " + table, e);
 }
-
-return true;
+// Add the table in sources if it is loaded successfully
+addSrcTable(table);
   }
 
-  public boolean unloadDataset(String table, CliDriver cliDriver) throws 
Exception {
+  private void unloadDataset(String table, CliDriver cliDriver) {
 try {
+  // Remove table from sources otherwise the following command will fail 
due to EnforceReadOnlyTables.
+  removeSrcTable(table);
   CommandProcessorResponse result = cliDriver.processLine("drop table " + 
table);
-  LOG.info("Result from cliDrriver.processLine in initFromDatasets=" + 
result);
+  LOG.info("Result from cliDrriver.processLine in unloadDataset=" + 
result);
 } catch (CommandProcessorException e) {
-  Assert.fail("Failed during initFromDatasets processLine with code=" + e);
+  // If the unloading fails for any reason then add again the table to 
sources since it is still there.
+  addSrcTable(table);
+  throw new RuntimeException("Failed while unloading table " + table, e);

Review comment:
   Given that there are no test failures I would say no but let's wait to 
see what @kgyrtkirk has to say :) On the other hand, as I noted in the JIRA 
there is code that does not work well if an assertion is thrown here. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465775)
Time Spent: 50m  (was: 40m)

> Improve control flow and error handling in QTest dataset loading/unloading
> --
>
> Key: HIVE-23946
> URL: https://issues.apache.org/jira/browse/HIVE-23946
> Project: Hive
>  Issue Type: Improvement
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This issue focuses mainly on the following methods:
> [QTestDatasetHandler#initDataset| 
> https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L76]
> [QTestDatasetHandler#unloadDataset|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L95]
> related to QTest dataset loading and unloading.
> The boolean return type in these methods is redundant since they either fail 
> or return true (they never return false).
> The methods should throw an Exception instead of an AssertionError to 
> indicate failure. This allows code higher up the stack to perform proper 
> recovery and properly report the failure. At the moment, if an AssertionError 
> is raised from these methods dependent code (eg., 
> [CoreCliDriver|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/cli/control/CoreCliDriver.java#L188])
>  fails to notice that the query has failed. 
> In case of failure in loading/unloading the environment (instance and class 
> variables) is not properly cleaned leading to failures in all subsequent 
> tests.
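
As a hedged illustration of why the exception type matters (simplified standalone code, not the actual QTestDatasetHandler or CoreCliDriver signatures): a catch of Exception sees a RuntimeException but not an AssertionError, so the latter escapes the recovery path described above.

{code}
public class DatasetFailureSketch {
  // hypothetical stand-in for a dataset loader that fails
  static void loadDataset(boolean failWithAssertion) {
    if (failWithAssertion) {
      throw new AssertionError("Failed during initFromDatasets");
    }
    throw new RuntimeException("Failed while loading table src");
  }

  public static void main(String[] args) {
    try {
      loadDataset(false);
    } catch (Exception e) {
      // reached for the RuntimeException; with an AssertionError this handler is
      // skipped, so the caller never records the failure or cleans up its state
      System.out.println("query marked as failed: " + e.getMessage());
    }
  }
}
{code}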



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23946) Improve control flow and error handling in QTest dataset loading/unloading

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23946?focusedWorklogId=465769=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465769
 ]

ASF GitHub Bot logged work on HIVE-23946:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:53
Start Date: 03/Aug/20 15:53
Worklog Time Spent: 10m 
  Work Description: zabetak commented on a change in pull request #1331:
URL: https://github.com/apache/hive/pull/1331#discussion_r464503259



##
File path: 
itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java
##
@@ -52,8 +51,8 @@
 
   private File datasetDir;
   private static Set srcTables;
-  private static Set missingTables = new HashSet<>();

Review comment:
   Indeed there is a check-then-act race condition here. I was hoping to 
fix this without making `missingTables` and `tablesToUnload` static but looking 
at the code that I committed it seems that I screwed up something while I was 
rebasing :D I will address this in the following commits (hopefully 
:crossed_fingers: ). Thanks for catching this!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465769)
Time Spent: 40m  (was: 0.5h)

> Improve control flow and error handling in QTest dataset loading/unloading
> --
>
> Key: HIVE-23946
> URL: https://issues.apache.org/jira/browse/HIVE-23946
> Project: Hive
>  Issue Type: Improvement
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This issue focuses mainly on the following methods:
> [QTestDatasetHandler#initDataset| 
> https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L76]
> [QTestDatasetHandler#unloadDataset|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L95]
> related to QTest dataset loading and unloading.
> The boolean return type in these methods is redundant since they either fail 
> or return true (they never return false).
> The methods should throw an Exception instead of an AssertionError to 
> indicate failure. This allows code higher up the stack to perform proper 
> recovery and properly report the failure. At the moment, if an AssertionError 
> is raised from these methods dependent code (eg., 
> [CoreCliDriver|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/cli/control/CoreCliDriver.java#L188])
>  fails to notice that the query has failed. 
> In case of failure in loading/unloading the environment (instance and class 
> variables) is not properly cleaned leading to failures in all subsequent 
> tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23763) Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23763?focusedWorklogId=465767=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465767
 ]

ASF GitHub Bot logged work on HIVE-23763:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:50
Start Date: 03/Aug/20 15:50
Worklog Time Spent: 10m 
  Work Description: kuczoram commented on a change in pull request #1327:
URL: https://github.com/apache/hive/pull/1327#discussion_r464501671



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/QueryCompactor.java
##
@@ -115,6 +115,10 @@ void runCompactionQueries(HiveConf conf, String 
tmpTableName, StorageDescriptor
   }
   for (String query : compactionQueries) {
 LOG.info("Running {} compaction via query: {}", 
compactionInfo.isMajorCompaction() ? "major" : "minor", query);
+if (!compactionInfo.isMajorCompaction()) {

Review comment:
   Sure, added a comment.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465767)
Time Spent: 2h  (was: 1h 50m)

> Query based minor compaction produces wrong files when rows with different 
> buckets Ids are processed by the same FileSinkOperator
> -
>
> Key: HIVE-23763
> URL: https://issues.apache.org/jira/browse/HIVE-23763
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> How to reproduce:
> - Create an unbucketed ACID table
> - Insert a bigger amount of data into this table so there would be multiple 
> bucket files in the table
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_0_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_1_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_2_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_3_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_4_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_5_0
> - Do some delete on rows with different bucket Ids
> The files in a delete delta should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_002_002_/bucket_0
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_3
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_1
> - Run the query-based minor compaction
> After the compaction the newly created delete delta contains only 1 bucket 
> file. This file contains rows from all buckets and the table becomes unusable
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_001_007_v066/bucket_0
> The issue happens only if rows with different bucket Ids are processed by the 
> same FileSinkOperator. 
> In the FileSinkOperator.process method, the files for the compaction table 
> are created like this:
> {noformat}
> if (!bDynParts && !filesCreated) {
>   if (lbDirName != null) {
> if (valToPaths.get(lbDirName) == null) {
>   createNewPaths(null, lbDirName);
> }
>   } else {
> if (conf.isCompactionTable()) {
>   int bucketProperty = getBucketProperty(row);
>   bucketId = 
> BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
> }
> createBucketFiles(fsp);
>   }
> }
> {noformat}
> When the first row is processed, the file is created and then the 
> filesCreated variable is set to true. Then when the other rows are processed, 
> the first if statement will be false, so no new file gets created, but the 
> row will be written into the file created for the first row.
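
A hedged sketch of the flag behaviour described above (rows reduced to bare bucket ids; this is an illustration, not the actual FileSinkOperator code):

{code}
import java.util.*;

public class BucketFileSketch {
  public static void main(String[] args) {
    int[] rowBucketIds = {0, 3, 1};       // rows with different bucket ids hit one operator
    boolean filesCreated = false;
    int fileBucket = -1;
    Map<Integer, List<Integer>> bucketFiles = new HashMap<>();
    for (int bucketId : rowBucketIds) {
      if (!filesCreated) {
        fileBucket = bucketId;            // a file is created for the first row's bucket only
        bucketFiles.put(fileBucket, new ArrayList<>());
        filesCreated = true;
      }
      bucketFiles.get(fileBucket).add(bucketId);  // later rows land in that same file
    }
    System.out.println(bucketFiles);      // {0=[0, 3, 1]} -> a single bucket_0 with mixed rows
  }
}
{code}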



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23978) Enable logging with PerfLogger in HMS client

2020-08-03 Thread Soumyakanti Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soumyakanti Das updated HIVE-23978:
---
Issue Type: New Feature  (was: Bug)

> Enable logging with PerfLogger in HMS client
> 
>
> Key: HIVE-23978
> URL: https://issues.apache.org/jira/browse/HIVE-23978
> Project: Hive
>  Issue Type: New Feature
>Reporter: Soumyakanti Das
>Assignee: Soumyakanti Das
>Priority: Minor
>
> Currently we cannot use PerfLogger in HiveMetaStoreClient.java to log 
> duration of API calls. When PerfLogger.java is moved from metastore-server to 
> metastore-common, without changing the package definition, many tests fail, 
> although metastore-server has a dependency on metastore-common.
> More analysis and investigation is needed to understand the root cause of 
> this issue.
> Related to [HIVE-23949|https://issues.apache.org/jira/browse/HIVE-23949]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23979) Resolve spotbugs errors in JsonReporter.java, Metrics.java, and PerfLogger.java

2020-08-03 Thread Soumyakanti Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soumyakanti Das updated HIVE-23979:
---
Issue Type: Bug  (was: New Feature)

> Resolve spotbugs errors in JsonReporter.java, Metrics.java, and 
> PerfLogger.java
> ---
>
> Key: HIVE-23979
> URL: https://issues.apache.org/jira/browse/HIVE-23979
> Project: Hive
>  Issue Type: Bug
>Reporter: Soumyakanti Das
>Assignee: Soumyakanti Das
>Priority: Minor
>
> Resolve these spotbugs errors:
> [ERROR] Found reliance on default encoding in 
> org.apache.hadoop.hive.metastore.metrics.JsonReporter.report(SortedMap, 
> SortedMap, SortedMap, SortedMap, SortedMap): new java.io.FileWriter(File) 
> [org.apache.hadoop.hive.metastore.metrics.JsonReporter] At 
> JsonReporter.java:[line 159] DM_DEFAULT_ENCODING
> [ERROR] Incorrect lazy initialization of static field 
> org.apache.hadoop.hive.metastore.metrics.Metrics.self in 
> org.apache.hadoop.hive.metastore.metrics.Metrics.shutdown() 
> [org.apache.hadoop.hive.metastore.metrics.Metrics] At Metrics.java:[lines 
> 79-85] LI_LAZY_INIT_STATIC
> [ERROR] The method name 
> org.apache.hadoop.hive.metastore.metrics.PerfLogger.PerfLogBegin(String, 
> String) doesn't start with a lower case letter 
> [org.apache.hadoop.hive.metastore.metrics.PerfLogger] At 
> PerfLogger.java:[lines 92-98] NM_METHOD_NAMING_CONVENTION
> [ERROR] The method name 
> org.apache.hadoop.hive.metastore.metrics.PerfLogger.PerfLogEnd(String, 
> String) doesn't start with a lower case letter 
> [org.apache.hadoop.hive.metastore.metrics.PerfLogger] At 
> PerfLogger.java:[line 106] NM_METHOD_NAMING_CONVENTION
> [ERROR] The method name 
> org.apache.hadoop.hive.metastore.metrics.PerfLogger.PerfLogEnd(String, 
> String, String) doesn't start with a lower case letter 
> [org.apache.hadoop.hive.metastore.metrics.PerfLogger] At 
> PerfLogger.java:[lines 116-138] NM_METHOD_NAMING_CONVENTION



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-23979) Resolve spotbugs errors in JsonReporter.java, Metrics.java, and PerfLogger.java

2020-08-03 Thread Soumyakanti Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soumyakanti Das reassigned HIVE-23979:
--


> Resolve spotbugs errors in JsonReporter.java, Metrics.java, and 
> PerfLogger.java
> ---
>
> Key: HIVE-23979
> URL: https://issues.apache.org/jira/browse/HIVE-23979
> Project: Hive
>  Issue Type: New Feature
>Reporter: Soumyakanti Das
>Assignee: Soumyakanti Das
>Priority: Minor
>
> Resolve these spotbugs errors:
> [ERROR] Found reliance on default encoding in 
> org.apache.hadoop.hive.metastore.metrics.JsonReporter.report(SortedMap, 
> SortedMap, SortedMap, SortedMap, SortedMap): new java.io.FileWriter(File) 
> [org.apache.hadoop.hive.metastore.metrics.JsonReporter] At 
> JsonReporter.java:[line 159] DM_DEFAULT_ENCODING
> [ERROR] Incorrect lazy initialization of static field 
> org.apache.hadoop.hive.metastore.metrics.Metrics.self in 
> org.apache.hadoop.hive.metastore.metrics.Metrics.shutdown() 
> [org.apache.hadoop.hive.metastore.metrics.Metrics] At Metrics.java:[lines 
> 79-85] LI_LAZY_INIT_STATIC
> [ERROR] The method name 
> org.apache.hadoop.hive.metastore.metrics.PerfLogger.PerfLogBegin(String, 
> String) doesn't start with a lower case letter 
> [org.apache.hadoop.hive.metastore.metrics.PerfLogger] At 
> PerfLogger.java:[lines 92-98] NM_METHOD_NAMING_CONVENTION
> [ERROR] The method name 
> org.apache.hadoop.hive.metastore.metrics.PerfLogger.PerfLogEnd(String, 
> String) doesn't start with a lower case letter 
> [org.apache.hadoop.hive.metastore.metrics.PerfLogger] At 
> PerfLogger.java:[line 106] NM_METHOD_NAMING_CONVENTION
> [ERROR] The method name 
> org.apache.hadoop.hive.metastore.metrics.PerfLogger.PerfLogEnd(String, 
> String, String) doesn't start with a lower case letter 
> [org.apache.hadoop.hive.metastore.metrics.PerfLogger] At 
> PerfLogger.java:[lines 116-138] NM_METHOD_NAMING_CONVENTION



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23978) Enable logging with PerfLogger in HMS client

2020-08-03 Thread Soumyakanti Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soumyakanti Das updated HIVE-23978:
---
Issue Type: Bug  (was: New Feature)

> Enable logging with PerfLogger in HMS client
> 
>
> Key: HIVE-23978
> URL: https://issues.apache.org/jira/browse/HIVE-23978
> Project: Hive
>  Issue Type: Bug
>Reporter: Soumyakanti Das
>Assignee: Soumyakanti Das
>Priority: Minor
>
> Currently we cannot use PerfLogger in HiveMetaStoreClient.java to log 
> duration of API calls. When PerfLogger.java is moved from metastore-server to 
> metastore-common, without changing the package definition, many tests fail, 
> although metastore-server has a dependency on metastore-common.
> More analysis and investigation is needed to understand the root cause of 
> this issue.
> Related to [HIVE-23949|https://issues.apache.org/jira/browse/HIVE-23949]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23977) Consolidate partition fetch to one place

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23977?focusedWorklogId=465756=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465756
 ]

ASF GitHub Bot logged work on HIVE-23977:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:37
Start Date: 03/Aug/20 15:37
Worklog Time Spent: 10m 
  Work Description: scarlin-cloudera opened a new pull request #1354:
URL: https://github.com/apache/hive/pull/1354


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465756)
Remaining Estimate: 0h
Time Spent: 10m

> Consolidate partition fetch to one place
> 
>
> Key: HIVE-23977
> URL: https://issues.apache.org/jira/browse/HIVE-23977
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Steve Carlin
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23977) Consolidate partition fetch to one place

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-23977:
--
Labels: pull-request-available  (was: )

> Consolidate partition fetch to one place
> 
>
> Key: HIVE-23977
> URL: https://issues.apache.org/jira/browse/HIVE-23977
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Steve Carlin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-23978) Enable logging with PerfLogger in HMS client

2020-08-03 Thread Soumyakanti Das (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soumyakanti Das reassigned HIVE-23978:
--

Assignee: Soumyakanti Das

> Enable logging with PerfLogger in HMS client
> 
>
> Key: HIVE-23978
> URL: https://issues.apache.org/jira/browse/HIVE-23978
> Project: Hive
>  Issue Type: New Feature
>Reporter: Soumyakanti Das
>Assignee: Soumyakanti Das
>Priority: Minor
>
> Currently we cannot use PerfLogger in HiveMetaStoreClient.java to log 
> duration of API calls. When PerfLogger.java is moved from metastore-server to 
> metastore-common, without changing the package definition, many tests fail, 
> although metastore-server has a dependency on metastore-common.
> More analysis and investigation is needed to understand the root cause of 
> this issue.
> Related to [HIVE-23949|https://issues.apache.org/jira/browse/HIVE-23949]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23843) Improve key evictions in VectorGroupByOperator

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23843?focusedWorklogId=465755=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465755
 ]

ASF GitHub Bot logged work on HIVE-23843:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:34
Start Date: 03/Aug/20 15:34
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk closed pull request #1250:
URL: https://github.com/apache/hive/pull/1250


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465755)
Time Spent: 2h 40m  (was: 2.5h)

> Improve key evictions in VectorGroupByOperator
> --
>
> Key: HIVE-23843
> URL: https://issues.apache.org/jira/browse/HIVE-23843
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Keys in {{mapKeysAggregationBuffers}} are evicted in random order. Tasks also 
> get into GC issues when multiple keys are involved in groupbys. It would be 
> good to provide an option to have LRU based eviction for 
> mapKeysAggregationBuffers.
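
A hedged sketch of LRU-ordered eviction with an access-ordered LinkedHashMap (illustration only; the class and map below are invented, not the VectorGroupByOperator internals):

{code}
import java.util.*;

public class LruEvictionSketch {
  public static void main(String[] args) {
    final int maxEntries = 2;
    Map<String, Long> buffers = new LinkedHashMap<String, Long>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        return size() > maxEntries;       // evict the least recently used key, not a random one
      }
    };
    buffers.put("k1", 1L);
    buffers.put("k2", 2L);
    buffers.get("k1");                     // touch k1 so k2 becomes the eldest entry
    buffers.put("k3", 3L);                 // inserting k3 evicts k2
    System.out.println(buffers.keySet());  // [k1, k3]
  }
}
{code}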



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23959) Provide an option to wipe out column stats for partitioned tables in case of column removal

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23959?focusedWorklogId=465754=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465754
 ]

ASF GitHub Bot logged work on HIVE-23959:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:33
Start Date: 03/Aug/20 15:33
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on pull request #1341:
URL: https://github.com/apache/hive/pull/1341#issuecomment-668089074


   @pvary could you please take another look?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465754)
Time Spent: 1h 10m  (was: 1h)

> Provide an option to wipe out column stats for partitioned tables in case of 
> column removal
> ---
>
> Key: HIVE-23959
> URL: https://issues.apache.org/jira/browse/HIVE-23959
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In case of column removal / replacement, an update for each partition is 
> necessary, which could take a while.
> The goal here is to provide an option to switch to the bulk removal of column 
> statistics instead of working hard to retain as much as possible from the old 
> stats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23819) Use ranges in ValidReadTxnList serialization

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23819?focusedWorklogId=465753=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465753
 ]

ASF GitHub Bot logged work on HIVE-23819:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:29
Start Date: 03/Aug/20 15:29
Worklog Time Spent: 10m 
  Work Description: pvargacl commented on pull request #1230:
URL: https://github.com/apache/hive/pull/1230#issuecomment-668087223


   @pvary could you merge this?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465753)
Time Spent: 40m  (was: 0.5h)

> Use ranges in ValidReadTxnList serialization
> 
>
> Key: HIVE-23819
> URL: https://issues.apache.org/jira/browse/HIVE-23819
> Project: Hive
>  Issue Type: Improvement
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> From time to time we see a case when the open / aborted transaction count is 
> high and the aborted transactions often come in contiguous ranges.
> When the transaction count goes high, the serialization / deserialization of 
> the hive.txn.valid.txns conf gets slower and produces a large config value.
> Using ranges in the string representation can mitigate the issue somewhat.
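
A hedged sketch of the range idea (the separators and format below are made up for the example, not the actual ValidReadTxnList wire format): contiguous runs of ids collapse to a single lo-hi token.

{code}
public class RangeEncodeSketch {
  static String encode(long[] sortedIds) {
    StringBuilder sb = new StringBuilder();
    int i = 0;
    while (i < sortedIds.length) {
      int j = i;
      while (j + 1 < sortedIds.length && sortedIds[j + 1] == sortedIds[j] + 1) {
        j++;                               // extend the contiguous run
      }
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(i == j ? Long.toString(sortedIds[i]) : sortedIds[i] + "-" + sortedIds[j]);
      i = j + 1;
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // five aborted txns in one run plus one isolated txn -> "100-104,200"
    System.out.println(encode(new long[]{100, 101, 102, 103, 104, 200}));
  }
}
{code}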



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465749=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465749
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:21
Start Date: 03/Aug/20 15:21
Worklog Time Spent: 10m 
  Work Description: pvargacl commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464483465



##
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidInputFormat.java
##
@@ -118,70 +126,217 @@
  */
 private long visibilityTxnId;
 
+private List<DeltaFileMetaData> deltaFiles;
+
 public DeltaMetaData() {
-  this(0,0,new ArrayList(), 0);
+  this(0, 0, new ArrayList<>(), 0, new ArrayList<>());
 }
+
 /**
+ * @param minWriteId min writeId of the delta directory
+ * @param maxWriteId max writeId of the delta directory
  * @param stmtIds delta dir suffixes when a single txn writes > 1 delta in 
the same partition
  * @param visibilityTxnId maybe 0, if the dir name didn't have it.  
txnid:0 is always visible
+ * @param deltaFiles bucketFiles in the directory
  */
-DeltaMetaData(long minWriteId, long maxWriteId, List stmtIds, 
long visibilityTxnId) {
+public DeltaMetaData(long minWriteId, long maxWriteId, List 
stmtIds, long visibilityTxnId,
+List deltaFiles) {
   this.minWriteId = minWriteId;
   this.maxWriteId = maxWriteId;
   if (stmtIds == null) {
 throw new IllegalArgumentException("stmtIds == null");
   }
   this.stmtIds = stmtIds;
   this.visibilityTxnId = visibilityTxnId;
+  this.deltaFiles = ObjectUtils.defaultIfNull(deltaFiles, new 
ArrayList<>());
 }
-long getMinWriteId() {
+
+public long getMinWriteId() {
   return minWriteId;
 }
-long getMaxWriteId() {
+
+public long getMaxWriteId() {
   return maxWriteId;
 }
-List getStmtIds() {
+
+public List getStmtIds() {
   return stmtIds;
 }
-long getVisibilityTxnId() {
+
+public long getVisibilityTxnId() {
   return visibilityTxnId;
 }
+
+public List getDeltaFiles() {
+  return deltaFiles;
+}
+
+public List getDeltaFilesForStmtId(final Integer 
stmtId) {
+  if (stmtIds.size() <= 1 || stmtId == null) {
+// If it is not a multistatement delta, we do not store the stmtId in 
the file list
+return deltaFiles;
+  } else {
+return deltaFiles.stream().filter(df -> 
stmtId.equals(df.getStmtId())).collect(Collectors.toList());

Review comment:
   It will be a very small list, I don't think it matters.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465749)
Time Spent: 4.5h  (was: 4h 20m)

> Delete delta directory file information should be pushed to execution side
> --
>
> Key: HIVE-23956
> URL: https://issues.apache.org/jira/browse/HIVE-23956
> Project: Hive
>  Issue Type: Improvement
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Since HIVE-23840 LLAP cache is used to retrieve the tail of the ORC bucket 
> files in the delete deltas, but to use the cache the fileId must be 
> determined, so one more FileSystem call is issued for each bucket.
> This fileId is already available during compilation in the AcidState 
> calculation, we should serialise this to the OrcSplit, and remove the 
> unnecessary FS calls.
> Furthermore, instead of sending the SyntheticFileId directly, we should pass 
> the attemptId instead of the standard path hash; this way the path and the 
> SyntheticFileId can be calculated, and it will work even if move-free 
> delete operations are introduced.
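
A minimal, hedged sketch of the Writable-style round trip that shipping file metadata in the split relies on (plain java.io streams; the two fields below are invented for the example, not the exact DeltaMetaData wire format):

{code}
import java.io.*;

public class SplitMetadataSketch {
  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buffer);
    out.writeLong(42L);      // e.g. a file id already known at compile time
    out.writeLong(1024L);    // e.g. the file length
    out.flush();

    DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
    long fileId = in.readLong();
    long length = in.readLong();
    // fields come back in the order they were written, so the execution side
    // does not need a FileSystem call to rediscover them
    System.out.println(fileId + " " + length);   // 42 1024
  }
}
{code}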



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465743=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465743
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:09
Start Date: 03/Aug/20 15:09
Worklog Time Spent: 10m 
  Work Description: pvargacl commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464476338



##
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidInputFormat.java
##
@@ -118,70 +126,217 @@
  */
 private long visibilityTxnId;
 
+private List<DeltaFileMetaData> deltaFiles;
+
 public DeltaMetaData() {
-  this(0,0,new ArrayList(), 0);
+  this(0, 0, new ArrayList<>(), 0, new ArrayList<>());
 }
+
 /**
+ * @param minWriteId min writeId of the delta directory
+ * @param maxWriteId max writeId of the delta directory
  * @param stmtIds delta dir suffixes when a single txn writes > 1 delta in 
the same partition
  * @param visibilityTxnId maybe 0, if the dir name didn't have it.  
txnid:0 is always visible
+ * @param deltaFiles bucketFiles in the directory
  */
-DeltaMetaData(long minWriteId, long maxWriteId, List stmtIds, 
long visibilityTxnId) {
+public DeltaMetaData(long minWriteId, long maxWriteId, List 
stmtIds, long visibilityTxnId,
+List deltaFiles) {
   this.minWriteId = minWriteId;
   this.maxWriteId = maxWriteId;
   if (stmtIds == null) {
 throw new IllegalArgumentException("stmtIds == null");
   }
   this.stmtIds = stmtIds;
   this.visibilityTxnId = visibilityTxnId;
+  this.deltaFiles = ObjectUtils.defaultIfNull(deltaFiles, new 
ArrayList<>());
 }
-long getMinWriteId() {
+
+public long getMinWriteId() {
   return minWriteId;
 }
-long getMaxWriteId() {
+
+public long getMaxWriteId() {
   return maxWriteId;
 }
-List getStmtIds() {
+
+public List getStmtIds() {
   return stmtIds;
 }
-long getVisibilityTxnId() {
+
+public long getVisibilityTxnId() {
   return visibilityTxnId;
 }
+
+public List getDeltaFiles() {
+  return deltaFiles;
+}
+
+public List getDeltaFilesForStmtId(final Integer 
stmtId) {
+  if (stmtIds.size() <= 1 || stmtId == null) {
+// If it is not a multistatement delta, we do not store the stmtId in 
the file list
+return deltaFiles;
+  } else {
+return deltaFiles.stream().filter(df -> 
stmtId.equals(df.getStmtId())).collect(Collectors.toList());
+  }
+}
+
 @Override
 public void write(DataOutput out) throws IOException {
   out.writeLong(minWriteId);
   out.writeLong(maxWriteId);
   out.writeInt(stmtIds.size());
-  for(Integer id : stmtIds) {
+  for (Integer id : stmtIds) {
 out.writeInt(id);
   }
   out.writeLong(visibilityTxnId);
+  out.writeInt(deltaFiles.size());
+  for (DeltaFileMetaData fileMeta : deltaFiles) {
+fileMeta.write(out);
+  }
 }
+
 @Override
 public void readFields(DataInput in) throws IOException {
   minWriteId = in.readLong();
   maxWriteId = in.readLong();
   stmtIds.clear();
   int numStatements = in.readInt();
-  for(int i = 0; i < numStatements; i++) {
+  for (int i = 0; i < numStatements; i++) {
 stmtIds.add(in.readInt());
   }
   visibilityTxnId = in.readLong();
+
+  deltaFiles.clear();
+  int numFiles = in.readInt();
+  for (int i = 0; i < numFiles; i++) {
+DeltaFileMetaData file = new DeltaFileMetaData();
+file.readFields(in);
+deltaFiles.add(file);
+  }
 }
-String getName() {
+
+private String getName() {
   assert stmtIds.isEmpty() : "use getName(int)";
-  return AcidUtils.addVisibilitySuffix(AcidUtils
-  .deleteDeltaSubdir(minWriteId, maxWriteId), visibilityTxnId);
+  return 
AcidUtils.addVisibilitySuffix(AcidUtils.deleteDeltaSubdir(minWriteId, 
maxWriteId), visibilityTxnId);
 }
-String getName(int stmtId) {
+
+private String getName(int stmtId) {
   assert !stmtIds.isEmpty() : "use getName()";
   return AcidUtils.addVisibilitySuffix(AcidUtils
   .deleteDeltaSubdir(minWriteId, maxWriteId, stmtId), visibilityTxnId);
 }
+
+public List<Pair<Path, Integer>> getPaths(Path root) {
+  if (stmtIds.isEmpty()) {
+return Collections.singletonList(new ImmutablePair<>(new Path(root, 
getName()), null));
+  } else {
+// To support multistatement transactions we may have multiple 
directories corresponding to one DeltaMetaData
+return getStmtIds().stream()
+.map(stmtId -> new ImmutablePair<>(new Path(root, 
getName(stmtId)), stmtId)).collect(Collectors.toList());
+  }
+}
+
 @Override
 public String toString() {
   return "Delta(?," + 

[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465739=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465739
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:04
Start Date: 03/Aug/20 15:04
Worklog Time Spent: 10m 
  Work Description: pvargacl commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464473125



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -1641,28 +1645,26 @@ public int compareTo(CompressedOwid other) {
  * Check if the delete delta folder needs to be scanned for a given 
split's min/max write ids.
  *
  * @param orcSplitMinMaxWriteIds
- * @param deleteDeltaDir
+ * @param deleteDelta
+ * @param stmtId statementId of the deleteDelta if present
  * @return true when  delete delta dir has to be scanned.
  */
 @VisibleForTesting
 protected static boolean 
isQualifiedDeleteDeltaForSplit(AcidOutputFormat.Options orcSplitMinMaxWriteIds,
-Path deleteDeltaDir)
-{
-  AcidUtils.ParsedDelta deleteDelta = 
AcidUtils.parsedDelta(deleteDeltaDir, false);
+AcidInputFormat.DeltaMetaData deleteDelta, Integer stmtId) {

Review comment:
   It is the second line of parameters, no extra space here.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465739)
Time Spent: 4h 10m  (was: 4h)

> Delete delta directory file information should be pushed to execution side
> --
>
> Key: HIVE-23956
> URL: https://issues.apache.org/jira/browse/HIVE-23956
> Project: Hive
>  Issue Type: Improvement
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Since HIVE-23840 LLAP cache is used to retrieve the tail of the ORC bucket 
> files in the delete deltas, but to use the cache the fileId must be 
> determined, so one more FileSystem call is issued for each bucket.
> This fileId is already available during compilation in the AcidState 
> calculation, we should serialise this to the OrcSplit, and remove the 
> unnecessary FS calls.
> Furthermore, instead of sending the SyntheticFileId directly, we should pass 
> the attemptId instead of the standard path hash; this way the path and the 
> SyntheticFileId can be calculated, and it will work even if move-free 
> delete operations are introduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465736=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465736
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:03
Start Date: 03/Aug/20 15:03
Worklog Time Spent: 10m 
  Work Description: pvargacl commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464472005



##
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidInputFormat.java
##
@@ -118,70 +126,217 @@
  */
 private long visibilityTxnId;
 
+private List<DeltaFileMetaData> deltaFiles;
+
 public DeltaMetaData() {
-  this(0,0,new ArrayList(), 0);
+  this(0, 0, new ArrayList<>(), 0, new ArrayList<>());
 }
+
 /**
+ * @param minWriteId min writeId of the delta directory
+ * @param maxWriteId max writeId of the delta directory
  * @param stmtIds delta dir suffixes when a single txn writes > 1 delta in 
the same partition
  * @param visibilityTxnId maybe 0, if the dir name didn't have it.  
txnid:0 is always visible
+ * @param deltaFiles bucketFiles in the directory
  */
-DeltaMetaData(long minWriteId, long maxWriteId, List stmtIds, 
long visibilityTxnId) {
+public DeltaMetaData(long minWriteId, long maxWriteId, List 
stmtIds, long visibilityTxnId,
+List deltaFiles) {
   this.minWriteId = minWriteId;
   this.maxWriteId = maxWriteId;
   if (stmtIds == null) {
 throw new IllegalArgumentException("stmtIds == null");
   }
   this.stmtIds = stmtIds;
   this.visibilityTxnId = visibilityTxnId;
+  this.deltaFiles = ObjectUtils.defaultIfNull(deltaFiles, new 
ArrayList<>());
 }
-long getMinWriteId() {
+
+public long getMinWriteId() {
   return minWriteId;
 }
-long getMaxWriteId() {
+
+public long getMaxWriteId() {
   return maxWriteId;
 }
-List getStmtIds() {
+
+public List getStmtIds() {
   return stmtIds;
 }
-long getVisibilityTxnId() {
+
+public long getVisibilityTxnId() {
   return visibilityTxnId;
 }
+
+public List getDeltaFiles() {
+  return deltaFiles;
+}
+
+public List getDeltaFilesForStmtId(final Integer 
stmtId) {
+  if (stmtIds.size() <= 1 || stmtId == null) {
+// If it is not a multistatement delta, we do not store the stmtId in 
the file list
+return deltaFiles;
+  } else {
+return deltaFiles.stream().filter(df -> 
stmtId.equals(df.getStmtId())).collect(Collectors.toList());
+  }
+}
+
 @Override
 public void write(DataOutput out) throws IOException {
   out.writeLong(minWriteId);
   out.writeLong(maxWriteId);
   out.writeInt(stmtIds.size());
-  for(Integer id : stmtIds) {
+  for (Integer id : stmtIds) {
 out.writeInt(id);
   }
   out.writeLong(visibilityTxnId);
+  out.writeInt(deltaFiles.size());
+  for (DeltaFileMetaData fileMeta : deltaFiles) {
+fileMeta.write(out);
+  }
 }
+
 @Override
 public void readFields(DataInput in) throws IOException {
   minWriteId = in.readLong();
   maxWriteId = in.readLong();
   stmtIds.clear();
   int numStatements = in.readInt();
-  for(int i = 0; i < numStatements; i++) {
+  for (int i = 0; i < numStatements; i++) {
 stmtIds.add(in.readInt());
   }
   visibilityTxnId = in.readLong();
+
+  deltaFiles.clear();
+  int numFiles = in.readInt();
+  for (int i = 0; i < numFiles; i++) {
+DeltaFileMetaData file = new DeltaFileMetaData();
+file.readFields(in);
+deltaFiles.add(file);
+  }
 }
-String getName() {
+
+private String getName() {
   assert stmtIds.isEmpty() : "use getName(int)";
-  return AcidUtils.addVisibilitySuffix(AcidUtils
-  .deleteDeltaSubdir(minWriteId, maxWriteId), visibilityTxnId);
+  return 
AcidUtils.addVisibilitySuffix(AcidUtils.deleteDeltaSubdir(minWriteId, 
maxWriteId), visibilityTxnId);
 }
-String getName(int stmtId) {
+
+private String getName(int stmtId) {
   assert !stmtIds.isEmpty() : "use getName()";
   return AcidUtils.addVisibilitySuffix(AcidUtils
   .deleteDeltaSubdir(minWriteId, maxWriteId, stmtId), visibilityTxnId);
 }
+
+public List<Pair<Path, Integer>> getPaths(Path root) {

Review comment:
   I think the List is much more straightforward, it will keep the stmtId 
order.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465736)
Time Spent: 4h  

[jira] [Resolved] (HIVE-23949) Introduce caching layer in HS2 to accelerate query compilation

2020-08-03 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez resolved HIVE-23949.

Fix Version/s: 4.0.0
   Resolution: Fixed

Pushed to master, thanks [~soumyakanti.das]!

> Introduce caching layer in HS2 to accelerate query compilation
> --
>
> Key: HIVE-23949
> URL: https://issues.apache.org/jira/browse/HIVE-23949
> Project: Hive
>  Issue Type: New Feature
>  Components: HiveServer2
>Reporter: Soumyakanti Das
>Assignee: Soumyakanti Das
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> One of the major contributors to compilation latency is the retrieval of 
> metadata from HMS. This JIRA introduces a caching layer in HS2 for this 
> metadata. Its design is simple, relying on snapshot information being queried 
> to cache and invalidate the metadata. This will help us to reduce the time 
> spent in compilation by using HS2 memory more effectively, and it will allow 
> us to improve HMS throughput for multi-tenant workloads by reducing the 
> number of calls it needs to serve.
> This patch only caches partition retrieval information, which is often one of 
> the most costly metadata operations. It also sets the foundation to cache 
> additional calls in follow-up work.
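
A hedged sketch of a snapshot-keyed lookup (hypothetical names and key layout, not the actual HS2 cache): an entry is only reused for the snapshot it was read under, so a newer snapshot id simply misses and refetches.

{code}
import java.util.*;

public class SnapshotCacheSketch {
  static final Map<String, List<String>> cache = new HashMap<>();

  static List<String> getPartitions(String table, long snapshotId) {
    // the cache key includes the snapshot id, so stale entries are never returned
    return cache.computeIfAbsent(table + "@" + snapshotId,
        key -> fetchFromMetastore(table));
  }

  static List<String> fetchFromMetastore(String table) {
    // stand-in for the HMS call whose frequency the cache is meant to reduce
    return Arrays.asList(table + "/p=1", table + "/p=2");
  }

  public static void main(String[] args) {
    getPartitions("t", 100L);   // miss: one metastore call
    getPartitions("t", 100L);   // hit: served from HS2 memory
    getPartitions("t", 101L);   // new snapshot id: miss again, old entry ignored
  }
}
{code}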



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23949) Introduce caching layer in HS2 to accelerate query compilation

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23949?focusedWorklogId=465733=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465733
 ]

ASF GitHub Bot logged work on HIVE-23949:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 15:00
Start Date: 03/Aug/20 15:00
Worklog Time Spent: 10m 
  Work Description: jcamachor closed pull request #1317:
URL: https://github.com/apache/hive/pull/1317


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465733)
Time Spent: 50m  (was: 40m)

> Introduce caching layer in HS2 to accelerate query compilation
> --
>
> Key: HIVE-23949
> URL: https://issues.apache.org/jira/browse/HIVE-23949
> Project: Hive
>  Issue Type: New Feature
>  Components: HiveServer2
>Reporter: Soumyakanti Das
>Assignee: Soumyakanti Das
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> One of the major contributors to compilation latency is the retrieval of 
> metadata from HMS. This JIRA introduces a caching layer in HS2 for this 
> metadata. Its design is simple, relying on snapshot information being queried 
> to cache and invalidate the metadata. This will help us to reduce the time 
> spent in compilation by using HS2 memory more effectively, and it will allow 
> us to improve HMS throughput for multi-tenant workloads by reducing the 
> number of calls it needs to serve.
> This patch only caches partition retrieval information, which is often one of 
> the most costly metadata operations. It also sets the foundation to cache 
> additional calls in follow-up work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23873) Querying Hive JDBCStorageHandler table fails with NPE when CBO is off

2020-08-03 Thread Jesus Camacho Rodriguez (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170079#comment-17170079
 ] 

Jesus Camacho Rodriguez commented on HIVE-23873:


Thanks for the review [~srahman], I had not checked the JIRA before merging and 
did not add your name to the commit message, but credit where credit is due.

> Querying Hive JDBCStorageHandler table fails with NPE when CBO is off
> -
>
> Key: HIVE-23873
> URL: https://issues.apache.org/jira/browse/HIVE-23873
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, JDBC
>Affects Versions: 3.1.0, 3.1.1, 3.1.2
>Reporter: Chiran Ravani
>Assignee: Chiran Ravani
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23873.01.patch, HIVE-23873.02.patch, 
> HIVE-23873.3.patch, HIVE-23873.4.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Scenario is Hive table having same schema as table in Oracle, however when we 
> query the table with data it fails with NPE, below is the trace.
> {code}
> Caused by: java.io.IOException: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:617)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:524) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2739) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:229)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:473)
>  ~[hive-service-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> ... 34 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hive.storage.jdbc.JdbcSerDe.deserialize(JdbcSerDe.java:164) 
> ~[hive-jdbc-handler-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:598)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:524) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2739) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:229)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:473)
>  ~[hive-service-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> ... 34 more
> {code}
> The problem appears when column names in Oracle are in upper case: in Hive, 
> table and column names are forced to lowercase during creation, so the user 
> runs into an NPE while fetching data.
> While deserializing data, the input uses the lowercase column names, which 
> fail to get the value:
> https://github.com/apache/hive/blob/rel/release-3.1.2/jdbc-handler/src/main/java/org/apache/hive/storage/jdbc/JdbcSerDe.java#L136
> {code}
> rowVal = ((ObjectWritable)value).get();
> {code}
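
A hedged sketch of the lookup mismatch with a plain HashMap (not the actual JdbcSerDe structures): the stored key and the lookup key differ only in case, so the lookup returns null and the later dereference throws the NPE.

{code}
import java.util.*;

public class CaseMismatchSketch {
  public static void main(String[] args) {
    Map<String, Object> row = new HashMap<>();
    row.put("fname", "Name1");          // the row map is keyed in one case
    Object value = row.get("FNAME");    // the column key arrives in the other case
    System.out.println(value);          // null -> dereferencing it later throws NPE
  }
}
{code}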
> Log Snippet:
> =
> {code}
> 2020-07-17T16:49:09,598 INFO  [04ed42ec-91d2-4662-aee7-37e840a06036 
> HiveServer2-Handler-Pool: Thread-104]: dao.GenericJdbcDatabaseAccessor (:()) 
> - Query to execute is [select * from TESTHIVEJDBCSTORAGE]
> 2020-07-17T16:49:10,642 INFO  [04ed42ec-91d2-4662-aee7-37e840a06036 
> HiveServer2-Handler-Pool: Thread-104]: jdbc.JdbcSerDe (:()) - *** ColumnKey = 
> ID
> 2020-07-17T16:49:10,642 INFO  [04ed42ec-91d2-4662-aee7-37e840a06036 
> HiveServer2-Handler-Pool: Thread-104]: jdbc.JdbcSerDe (:()) - *** Blob value 
> = {fname=OW[class=class java.lang.String,value=Name1], id=OW[class=class 
> java.lang.Integer,value=1]}
> {code}
> Simple Reproducer for this case.
> =
> 1. Create table in Oracle
> {code}
> create table TESTHIVEJDBCSTORAGE(ID INT, FNAME VARCHAR(20));
> {code}
> 2. Insert dummy data.
> {code}
> Insert into TESTHIVEJDBCSTORAGE values (1, 'Name1');
> {code}
> 3. Create JDBCStorageHandler table in Hive.
> {code}
> CREATE EXTERNAL TABLE default.TESTHIVEJDBCSTORAGE_HIVE_TBL (ID INT, FNAME 
> VARCHAR(20)) 
> STORED BY 

[jira] [Updated] (HIVE-23873) Querying Hive JDBCStorageHandler table fails with NPE when CBO is off

2020-08-03 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-23873:
---
Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Pushed to master, thanks [~chiran54321]!
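
For context, a minimal standalone sketch of the case-sensitivity mismatch behind 
this fix (the deserialized row map carries lowercase keys while the external 
Oracle column key is uppercase) and of the lookup normalization that avoids the 
NPE; this is a hypothetical illustration, not the committed JdbcSerDe patch.

{code}
import java.util.HashMap;
import java.util.Map;

public class CaseInsensitiveLookupSketch {
  public static void main(String[] args) {
    // Deserialized row as in the log: lowercase keys coming from the Hive schema.
    Map<String, Object> row = new HashMap<>();
    row.put("id", 1);
    row.put("fname", "Name1");

    // Column key as reported in the log ("ColumnKey = ID"), i.e. the Oracle casing.
    String columnKey = "ID";

    Object broken = row.get(columnKey);               // null, later dereference -> NPE
    Object fixed = row.get(columnKey.toLowerCase());  // normalize the case before the lookup

    System.out.println(broken); // null
    System.out.println(fixed);  // 1
  }
}
{code}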

> Querying Hive JDBCStorageHandler table fails with NPE when CBO is off
> -
>
> Key: HIVE-23873
> URL: https://issues.apache.org/jira/browse/HIVE-23873
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, JDBC
>Affects Versions: 3.1.0, 3.1.1, 3.1.2
>Reporter: Chiran Ravani
>Assignee: Chiran Ravani
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23873.01.patch, HIVE-23873.02.patch, 
> HIVE-23873.3.patch, HIVE-23873.4.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The scenario is a Hive table having the same schema as a table in Oracle; 
> however, when we query the table with data, it fails with an NPE. Below is the trace.
> {code}
> Caused by: java.io.IOException: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:617)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:524) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2739) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:229)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:473)
>  ~[hive-service-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> ... 34 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hive.storage.jdbc.JdbcSerDe.deserialize(JdbcSerDe.java:164) 
> ~[hive-jdbc-handler-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:598)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:524) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2739) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:229)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:473)
>  ~[hive-service-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> ... 34 more
> {code}
> The problem appears when column names in Oracle are in upper case: since Hive 
> forces table and column names to lowercase during creation, the user runs into 
> an NPE while fetching data.
> While deserializing data, the input contains the column names in lower case, so 
> the lookup fails to get the value:
> https://github.com/apache/hive/blob/rel/release-3.1.2/jdbc-handler/src/main/java/org/apache/hive/storage/jdbc/JdbcSerDe.java#L136
> {code}
> rowVal = ((ObjectWritable)value).get();
> {code}
> Log Snippet:
> =
> {code}
> 2020-07-17T16:49:09,598 INFO  [04ed42ec-91d2-4662-aee7-37e840a06036 
> HiveServer2-Handler-Pool: Thread-104]: dao.GenericJdbcDatabaseAccessor (:()) 
> - Query to execute is [select * from TESTHIVEJDBCSTORAGE]
> 2020-07-17T16:49:10,642 INFO  [04ed42ec-91d2-4662-aee7-37e840a06036 
> HiveServer2-Handler-Pool: Thread-104]: jdbc.JdbcSerDe (:()) - *** ColumnKey = 
> ID
> 2020-07-17T16:49:10,642 INFO  [04ed42ec-91d2-4662-aee7-37e840a06036 
> HiveServer2-Handler-Pool: Thread-104]: jdbc.JdbcSerDe (:()) - *** Blob value 
> = {fname=OW[class=class java.lang.String,value=Name1], id=OW[class=class 
> java.lang.Integer,value=1]}
> {code}
> Simple Reproducer for this case.
> =
> 1. Create table in Oracle
> {code}
> create table TESTHIVEJDBCSTORAGE(ID INT, FNAME VARCHAR(20));
> {code}
> 2. Insert dummy data.
> {code}
> Insert into TESTHIVEJDBCSTORAGE values (1, 'Name1');
> {code}
> 3. Create JDBCStorageHandler table in Hive.
> {code}
> CREATE EXTERNAL TABLE default.TESTHIVEJDBCSTORAGE_HIVE_TBL (ID INT, FNAME 
> VARCHAR(20)) 
> STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler' 
> TBLPROPERTIES 

[jira] [Work logged] (HIVE-23873) Querying Hive JDBCStorageHandler table fails with NPE when CBO is off

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23873?focusedWorklogId=465729=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465729
 ]

ASF GitHub Bot logged work on HIVE-23873:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 14:40
Start Date: 03/Aug/20 14:40
Worklog Time Spent: 10m 
  Work Description: jcamachor merged pull request #1328:
URL: https://github.com/apache/hive/pull/1328


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465729)
Time Spent: 1h 20m  (was: 1h 10m)

> Querying Hive JDBCStorageHandler table fails with NPE when CBO is off
> -
>
> Key: HIVE-23873
> URL: https://issues.apache.org/jira/browse/HIVE-23873
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, JDBC
>Affects Versions: 3.1.0, 3.1.1, 3.1.2
>Reporter: Chiran Ravani
>Assignee: Chiran Ravani
>Priority: Critical
>  Labels: pull-request-available
> Attachments: HIVE-23873.01.patch, HIVE-23873.02.patch, 
> HIVE-23873.3.patch, HIVE-23873.4.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The scenario is a Hive table having the same schema as a table in Oracle; 
> however, when we query the table with data, it fails with an NPE. Below is the trace.
> {code}
> Caused by: java.io.IOException: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:617)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:524) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2739) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:229)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:473)
>  ~[hive-service-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> ... 34 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hive.storage.jdbc.JdbcSerDe.deserialize(JdbcSerDe.java:164) 
> ~[hive-jdbc-handler-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:598)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:524) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2739) 
> ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:229)
>  ~[hive-exec-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:473)
>  ~[hive-service-3.1.0.3.1.5.0-152.jar:3.1.0.3.1.5.0-152]
> ... 34 more
> {code}
> The problem appears when column names in Oracle are in upper case: since Hive 
> forces table and column names to lowercase during creation, the user runs into 
> an NPE while fetching data.
> While deserializing data, the input contains the column names in lower case, so 
> the lookup fails to get the value:
> https://github.com/apache/hive/blob/rel/release-3.1.2/jdbc-handler/src/main/java/org/apache/hive/storage/jdbc/JdbcSerDe.java#L136
> {code}
> rowVal = ((ObjectWritable)value).get();
> {code}
> Log Snippet:
> =
> {code}
> 2020-07-17T16:49:09,598 INFO  [04ed42ec-91d2-4662-aee7-37e840a06036 
> HiveServer2-Handler-Pool: Thread-104]: dao.GenericJdbcDatabaseAccessor (:()) 
> - Query to execute is [select * from TESTHIVEJDBCSTORAGE]
> 2020-07-17T16:49:10,642 INFO  [04ed42ec-91d2-4662-aee7-37e840a06036 
> HiveServer2-Handler-Pool: Thread-104]: jdbc.JdbcSerDe (:()) - *** ColumnKey = 
> ID
> 2020-07-17T16:49:10,642 INFO  [04ed42ec-91d2-4662-aee7-37e840a06036 
> HiveServer2-Handler-Pool: Thread-104]: jdbc.JdbcSerDe (:()) - *** Blob value 
> = {fname=OW[class=class 

[jira] [Commented] (HIVE-23963) UnsupportedOperationException in queries 74 and 84 while applying HiveCardinalityPreservingJoinRule

2020-08-03 Thread Jesus Camacho Rodriguez (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170071#comment-17170071
 ] 

Jesus Camacho Rodriguez commented on HIVE-23963:


[~kkasa], I think that should indeed be explored on the Calcite side. I was 
checking the code, and we may need to wait for a release though.

I saw we rely on that method in {{HiveRelDistribution}}. Is it possible to 
create the map using {{mapping.iterator()}} and rely on that map to get the 
value for each key? I think that may provide a valid workaround.
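
A minimal sketch of that workaround, assuming Calcite's Mapping iterates over 
IntPair (source, target) entries; the class and method names are illustrative only.

{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.calcite.util.mappings.IntPair;
import org.apache.calcite.util.mappings.Mapping;

public final class MappingWorkaroundSketch {
  private MappingWorkaroundSketch() {
  }

  // Build a source -> target map once from mapping.iterator() and use it instead of
  // mapping.getTargetOpt(key), which INVERSE_FUNCTION mappings do not support.
  public static Map<Integer, Integer> sourceToTarget(Mapping mapping) {
    Map<Integer, Integer> result = new HashMap<>();
    for (IntPair pair : mapping) {
      result.put(pair.source, pair.target);
    }
    return result;
  }
}
{code}

The distribution keys would then be remapped via result.get(key), with keys that are 
absent from the map simply skipped, mirroring what getTargetOpt signals today.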

> UnsupportedOperationException in queries 74 and 84 while applying 
> HiveCardinalityPreservingJoinRule
> ---
>
> Key: HIVE-23963
> URL: https://issues.apache.org/jira/browse/HIVE-23963
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Reporter: Stamatis Zampetakis
>Assignee: Krisztian Kasa
>Priority: Major
> Attachments: cbo_query74_stacktrace.txt, cbo_query84_stacktrace.txt
>
>
> The following TPC-DS queries: 
> * cbo_query74.q
> * cbo_query84.q 
> * query74.q 
> * query84.q 
> fail on the metastore with the partitioned TPC-DS 30TB dataset.
> The stacktraces for cbo_query74 and cbo_query84 show that the problem 
> originates while applying HiveCardinalityPreservingJoinRule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23973) Use SQL constraints to improve join reordering algorithm (III)

2020-08-03 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-23973:
---
Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Pushed to master.

> Use SQL constraints to improve join reordering algorithm (III)
> --
>
> Key: HIVE-23973
> URL: https://issues.apache.org/jira/browse/HIVE-23973
> Project: Hive
>  Issue Type: Improvement
>  Components: CBO
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This issue focuses on pulling non-filtering column appending FK-PK joins to 
> the top of the plan. Among other improvements, this will avoid unnecessary 
> shuffling of data in intermediate stages.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23973) Use SQL constraints to improve join reordering algorithm (III)

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23973?focusedWorklogId=465712=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465712
 ]

ASF GitHub Bot logged work on HIVE-23973:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 14:13
Start Date: 03/Aug/20 14:13
Worklog Time Spent: 10m 
  Work Description: jcamachor merged pull request #1349:
URL: https://github.com/apache/hive/pull/1349


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465712)
Time Spent: 0.5h  (was: 20m)

> Use SQL constraints to improve join reordering algorithm (III)
> --
>
> Key: HIVE-23973
> URL: https://issues.apache.org/jira/browse/HIVE-23973
> Project: Hive
>  Issue Type: Improvement
>  Components: CBO
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This issue focuses on pulling non-filtering column appending FK-PK joins to 
> the top of the plan. Among other improvements, this will avoid unnecessary 
> shuffling of data in intermediate stages.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23973) Use SQL constraints to improve join reordering algorithm (III)

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23973?focusedWorklogId=465706=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465706
 ]

ASF GitHub Bot logged work on HIVE-23973:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 14:04
Start Date: 03/Aug/20 14:04
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #1349:
URL: https://github.com/apache/hive/pull/1349#discussion_r464309156



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterJoinRule.java
##
@@ -52,14 +55,34 @@ protected HiveFilterJoinRule(RelOptRuleOperand operand, 
String id, boolean smart
 super(operand, id, smart, relBuilderFactory, TRUE_PREDICATE);
   }
 
+  /**
+   * Rule that tries to push filter expressions into a join condition and into
+   * the inputs of the join, iff the join is a column appending
+   * non-filtering join.
+   */
+  public static class HiveFilterNonFilteringJoinMergeRule extends 
HiveFilterJoinMergeRule {
+
+@Override
+public boolean matches(RelOptRuleCall call) {
+  Join join = call.rel(1);
+  RewritablePKFKJoinInfo joinInfo = HiveRelOptUtil.isRewritablePKFKJoin(
+  join, true, call.getMetadataQuery());
+  if (!joinInfo.rewritable) {
+return false;
+  }
+  return super.matches(call);
+}
+
+  }
+
   /**
* Rule that tries to push filter expressions into a join condition and into
* the inputs of the join.
*/
   public static class HiveFilterJoinMergeRule extends HiveFilterJoinRule {
 public HiveFilterJoinMergeRule() {
-  super(RelOptRule.operand(Filter.class, RelOptRule.operand(Join.class, 
RelOptRule.any())),
-  "HiveFilterJoinRule:filter", true, HiveRelFactories.HIVE_BUILDER);
+  super(operand(Filter.class, operand(Join.class, any())),
+  null, true, HiveRelFactories.HIVE_BUILDER);

Review comment:
   nit: you can expose the `id` parameter to the constructor of 
`HiveFilterJoinMergeRule`
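
A fragment sketching that suggestion, reusing the super(operand, id, smart, 
relBuilderFactory) constructor shown in the diff; purely illustrative, not the 
committed change.

{code}
public static class HiveFilterJoinMergeRule extends HiveFilterJoinRule {
  public HiveFilterJoinMergeRule() {
    this(null); // keeps the current behaviour
  }

  // Hypothetical overload exposing the rule id, so subclasses such as
  // HiveFilterNonFilteringJoinMergeRule can register under their own name.
  protected HiveFilterJoinMergeRule(String id) {
    super(operand(Filter.class, operand(Join.class, any())), id, true,
        HiveRelFactories.HIVE_BUILDER);
  }
}
{code}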





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465706)
Time Spent: 20m  (was: 10m)

> Use SQL constraints to improve join reordering algorithm (III)
> --
>
> Key: HIVE-23973
> URL: https://issues.apache.org/jira/browse/HIVE-23973
> Project: Hive
>  Issue Type: Improvement
>  Components: CBO
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This issue focuses on pulling non-filtering column appending FK-PK joins to 
> the top of the plan. Among other improvements, this will avoid unnecessary 
> shuffling of data in intermediate stages.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23829) Compute Stats Incorrect for Binary Columns

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23829?focusedWorklogId=465667=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465667
 ]

ASF GitHub Bot logged work on HIVE-23829:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 12:45
Start Date: 03/Aug/20 12:45
Worklog Time Spent: 10m 
  Work Description: HunterL opened a new pull request #1313:
URL: https://github.com/apache/hive/pull/1313


   Updated the LazySimple SerDe to no longer attempt to auto-detect whether binary 
columns are Base64, and to use a table property instead. The previous approach was 
expensive and did not correctly check whether the values were valid Base64, which 
in niche cases could result in statistics being computed incorrectly.
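
A minimal standalone sketch of that direction: decode a binary column as Base64 
only when an explicit table property asks for it, and use a strict decoder rather 
than auto-detection. The property name and helper here are hypothetical, not the 
actual SerDe change.

{code}
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Properties;

public class BinaryColumnDecodeSketch {
  // Hypothetical property name, purely for illustration.
  static final String DECODE_AS_BASE64 = "serialization.decode.binary.as.base64";

  static byte[] decode(byte[] raw, Properties tableProps) {
    if (Boolean.parseBoolean(tableProps.getProperty(DECODE_AS_BASE64, "false"))) {
      // java.util.Base64 rejects whitespace and invalid characters instead of guessing.
      return Base64.getDecoder().decode(new String(raw, StandardCharsets.UTF_8).trim());
    }
    return raw; // no auto-detection: keep the bytes as they are
  }

  public static void main(String[] args) {
    byte[] sentence = "this is a sentence".getBytes(StandardCharsets.UTF_8);
    // Without the property the sentence is left untouched, so its size is reported correctly.
    System.out.println(decode(sentence, new Properties()).length); // 18
  }
}
{code}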



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465667)
Time Spent: 1h  (was: 50m)

> Compute Stats Incorrect for Binary Columns
> --
>
> Key: HIVE-23829
> URL: https://issues.apache.org/jira/browse/HIVE-23829
> Project: Hive
>  Issue Type: Bug
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I came across an issue when working on [HIVE-22674].
> The SerDe used for processing binary data tries to auto-detect if the data is 
> in Base-64.  It uses 
> {{org.apache.commons.codec.binary.Base64#isArrayByteBase64}} which has two 
> issues:
> # It's slow since it will check if the array is compatible,... and then 
> process the data (examines the array twice)
> # More importantly, this method _Tests a given byte array to see if it 
> contains only valid characters within the Base64 alphabet. Currently the 
> method treats whitespace as valid._
> https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html#isArrayByteBase64-byte:A-
> The 
> [qtest|https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/test/queries/clientpositive/compute_stats_binary.q]
>  for this feature uses full sentences (which includes spaces) 
> [here|https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/data/files/binary.txt]
>  and therefore it thinks this data is Base-64 and returns an incorrect 
> estimation for size.
> This should really not auto-detect Base64 data and instead it should be 
> enabled with a table property.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23829) Compute Stats Incorrect for Binary Columns

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23829?focusedWorklogId=465666=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465666
 ]

ASF GitHub Bot logged work on HIVE-23829:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 12:42
Start Date: 03/Aug/20 12:42
Worklog Time Spent: 10m 
  Work Description: HunterL closed pull request #1313:
URL: https://github.com/apache/hive/pull/1313


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465666)
Time Spent: 50m  (was: 40m)

> Compute Stats Incorrect for Binary Columns
> --
>
> Key: HIVE-23829
> URL: https://issues.apache.org/jira/browse/HIVE-23829
> Project: Hive
>  Issue Type: Bug
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> I came across an issue when working on [HIVE-22674].
> The SerDe used for processing binary data tries to auto-detect if the data is 
> in Base-64.  It uses 
> {{org.apache.commons.codec.binary.Base64#isArrayByteBase64}} which has two 
> issues:
> # It's slow since it will check if the array is compatible,... and then 
> process the data (examines the array twice)
> # More importantly, this method _Tests a given byte array to see if it 
> contains only valid characters within the Base64 alphabet. Currently the 
> method treats whitespace as valid._
> https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html#isArrayByteBase64-byte:A-
> The 
> [qtest|https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/test/queries/clientpositive/compute_stats_binary.q]
>  for this feature uses full sentences (which includes spaces) 
> [here|https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/data/files/binary.txt]
>  and therefore it thinks this data is Base-64 and returns an incorrect 
> estimation for size.
> This should really not auto-detect Base64 data and instead it should be 
> enabled with a table property.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23946) Improve control flow and error handling in QTest dataset loading/unloading

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23946?focusedWorklogId=465664=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465664
 ]

ASF GitHub Bot logged work on HIVE-23946:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 12:37
Start Date: 03/Aug/20 12:37
Worklog Time Spent: 10m 
  Work Description: abstractdog commented on a change in pull request #1331:
URL: https://github.com/apache/hive/pull/1331#discussion_r464382232



##
File path: 
itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java
##
@@ -52,8 +51,8 @@
 
   private File datasetDir;
   private static Set srcTables;
-  private static Set missingTables = new HashSet<>();

Review comment:
   @zabetak : I really want to have this non-static as you did, but I needed to 
change it in HIVE-22617 as it caused flakiness in TestMTQueries (you can take a 
look at the comments there). However, I'm not really sure parallel qtest running 
will be implemented properly in the near future, so TestMTQueries is not a very 
useful unit test.

##
File path: 
itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java
##
@@ -84,23 +83,25 @@ public boolean initDataset(String table, CliDriver 
cliDriver) throws Exception {
 
 try {
   CommandProcessorResponse result = cliDriver.processLine(commands);
-  LOG.info("Result from cliDrriver.processLine in initFromDatasets=" + 
result);
+  LOG.info("Result from cliDrriver.processLine in initDataset=" + result);
 } catch (CommandProcessorException e) {
-  Assert.fail("Failed during initFromDatasets processLine with code=" + e);
+  throw new RuntimeException("Failed while loading table " + table, e);
 }
-
-return true;
+// Add the talbe in sources if it is loaded sucessfully
+addSrcTable(table);
   }
 
-  public boolean unloadDataset(String table, CliDriver cliDriver) throws 
Exception {
+  private void unloadDataset(String table, CliDriver cliDriver) {
 try {
+  // Remove table from sources otherwise the following command will fail 
due to EnforceReadOnlyTables.
+  removeSrcTable(table);
   CommandProcessorResponse result = cliDriver.processLine("drop table " + 
table);
-  LOG.info("Result from cliDrriver.processLine in initFromDatasets=" + 
result);
+  LOG.info("Result from cliDrriver.processLine in unloadDataset=" + 
result);
 } catch (CommandProcessorException e) {
-  Assert.fail("Failed during initFromDatasets processLine with code=" + e);
+  // If the unloading fails for any reason then add again the table to 
sources since it is still there.
+  addSrcTable(table);
+  throw new RuntimeException("Failed while unloading table " + table, e);

Review comment:
   I'm fine with this change, but I don't know what the original purpose 
was... @kgyrtkirk, is there some place in the code which relies on assertion 
failures?

##
File path: 
itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java
##
@@ -84,23 +83,25 @@ public boolean initDataset(String table, CliDriver 
cliDriver) throws Exception {
 
 try {
   CommandProcessorResponse result = cliDriver.processLine(commands);
-  LOG.info("Result from cliDrriver.processLine in initFromDatasets=" + 
result);
+  LOG.info("Result from cliDrriver.processLine in initDataset=" + result);
 } catch (CommandProcessorException e) {
-  Assert.fail("Failed during initFromDatasets processLine with code=" + e);
+  throw new RuntimeException("Failed while loading table " + table, e);
 }
-
-return true;
+// Add the talbe in sources if it is loaded sucessfully

Review comment:
   minor typo: "talbe"





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465664)
Time Spent: 0.5h  (was: 20m)

> Improve control flow and error handling in QTest dataset loading/unloading
> --
>
> Key: HIVE-23946
> URL: https://issues.apache.org/jira/browse/HIVE-23946
> Project: Hive
>  Issue Type: Improvement
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This issue focuses mainly on the following methods:
> [QTestDatasetHandler#initDataset| 
> 

[jira] [Assigned] (HIVE-23976) Enable vectorization for multi-col semi join reducers

2020-08-03 Thread Stamatis Zampetakis (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis reassigned HIVE-23976:
--


> Enable vectorization for multi-col semi join reducers
> -
>
> Key: HIVE-23976
> URL: https://issues.apache.org/jira/browse/HIVE-23976
> Project: Hive
>  Issue Type: Improvement
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>
> HIVE-21196 introduces multi-column semi-join reducers in the query engine. 
> However, the implementation relies on GenericUDFMurmurHash which is not 
> vectorized thus the respective operators cannot be executed in vectorized 
> mode. 
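
For illustration only, a small sketch of what the semi-join reducer has to compute: 
a single hash over all join-key columns. Guava's Murmur3 stands in for 
GenericUDFMurmurHash here; the Hive UDF is the piece that currently lacks a 
vectorized implementation.

{code}
import java.nio.charset.StandardCharsets;

import com.google.common.hash.Hashing;

public class MultiColumnHashSketch {
  // Combine several join-key columns into one hash value that can feed the
  // min/max/bloom-filter runtime filter on the probe side.
  static long hash(long keyCol1, String keyCol2) {
    return Hashing.murmur3_128().newHasher()
        .putLong(keyCol1)
        .putString(keyCol2, StandardCharsets.UTF_8)
        .hash()
        .asLong();
  }

  public static void main(String[] args) {
    System.out.println(hash(42L, "key"));
  }
}
{code}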



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23963) UnsupportedOperationException in queries 74 and 84 while applying HiveCardinalityPreservingJoinRule

2020-08-03 Thread Krisztian Kasa (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169985#comment-17169985
 ] 

Krisztian Kasa commented on HIVE-23963:
---

When the cost is calculated in HiveOnTezCostModel
https://github.com/apache/hive/blob/28b6384e9ba287188015418b4c38c85dfdde8133/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java#L403
we need the RelDistribution of a Project node.

The RelDistribution is calculated by:
1. Get the Project input's RelDistribution. In this case it is a TableScan on 
the customer table partitioned by the c_customer_sk column -> 
HiveRelDistribution(hash[0]). The number *0* indicates that c_customer_sk is 
the 0th column in the customer table, but in general it can be anywhere.
2. c_customer_sk is also the 0th expression in the Project, but in general it 
can be anywhere, so we need a mapping to remap the key indexes in the 
RelDistribution. 

This mapping is an INVERSE_FUNCTION
https://github.com/apache/calcite/blob/2088488ac8327b19512a76a122cae2961fc551c3/core/src/main/java/org/apache/calcite/rel/core/Project.java#L375
which does not support *getTargetOpt*

[~jcamachorodriguez] 
Can the mapping type be changed to something else which supports getTargetOpt?

> UnsupportedOperationException in queries 74 and 84 while applying 
> HiveCardinalityPreservingJoinRule
> ---
>
> Key: HIVE-23963
> URL: https://issues.apache.org/jira/browse/HIVE-23963
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Reporter: Stamatis Zampetakis
>Assignee: Krisztian Kasa
>Priority: Major
> Attachments: cbo_query74_stacktrace.txt, cbo_query84_stacktrace.txt
>
>
> The following TPC-DS queries: 
> * cbo_query74.q
> * cbo_query84.q 
> * query74.q 
> * query84.q 
> fail on the metastore with the partitioned TPC-DS 30TB dataset.
> The stacktraces for cbo_query74 and cbo_query84 show that the problem 
> originates while applying HiveCardinalityPreservingJoinRule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23763) Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23763?focusedWorklogId=465655=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465655
 ]

ASF GitHub Bot logged work on HIVE-23763:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 12:11
Start Date: 03/Aug/20 12:11
Worklog Time Spent: 10m 
  Work Description: kuczoram commented on a change in pull request #1327:
URL: https://github.com/apache/hive/pull/1327#discussion_r464373202



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
##
@@ -1063,7 +1076,11 @@ public void process(Object row, int tag) throws 
HiveException {
   // RecordUpdater expects to get the actual row, not a serialized version 
of it.  Thus we
   // pass the row rather than recordValue.
   if (conf.getWriteType() == AcidUtils.Operation.NOT_ACID || 
conf.isMmTable() || conf.isCompactionTable()) {
-rowOutWriters[findWriterOffset(row)].write(recordValue);
+writerOffset = bucketId;
+if (!conf.isCompactionTable()) {
+  writerOffset = findWriterOffset(row);
+}
+rowOutWriters[writerOffset].write(recordValue);

Review comment:
   They should be in order, because the result temp table for the 
compaction is created like "clustered by (`bucket`) sorted by (`bucket`, 
`originalTransaction`, `rowId`) into 10 buckets". I would assume that because 
of this, the rows in the table will be in order by bucket, originalTransaction 
and rowId. I haven't seen otherwise during my testing.
   I don't think we can close the writers here, because they will be used in 
the closeOp method as well and they are closed there.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465655)
Time Spent: 1h 50m  (was: 1h 40m)

> Query based minor compaction produces wrong files when rows with different 
> buckets Ids are processed by the same FileSinkOperator
> -
>
> Key: HIVE-23763
> URL: https://issues.apache.org/jira/browse/HIVE-23763
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> How to reproduce:
> - Create an unbucketed ACID table
> - Insert a bigger amount of data into this table so there would be multiple 
> bucket files in the table
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_0_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_1_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_2_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_3_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_4_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_5_0
> - Do some delete on rows with different bucket Ids
> The files in a delete delta should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_002_002_/bucket_0
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_3
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_1
> - Run the query-based minor compaction
> After the compaction the newly created delete delta contains only 1 bucket 
> file. This file contains rows from all buckets and the table becomes unusable
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_001_007_v066/bucket_0
> The issue happens only if rows with different bucket Ids are processed by the 
> same FileSinkOperator. 
> In the FileSinkOperator.process method, the files for the compaction table 
> are created like this:
> {noformat}
> if (!bDynParts && !filesCreated) {
>   if (lbDirName != null) {
> if (valToPaths.get(lbDirName) == null) {
>   createNewPaths(null, lbDirName);
> }
>   } else {
> if (conf.isCompactionTable()) {
>   int bucketProperty = getBucketProperty(row);
>   bucketId = 
> BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
> 

[jira] [Work logged] (HIVE-23763) Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23763?focusedWorklogId=465649=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465649
 ]

ASF GitHub Bot logged work on HIVE-23763:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 12:02
Start Date: 03/Aug/20 12:02
Worklog Time Spent: 10m 
  Work Description: kuczoram commented on a change in pull request #1327:
URL: https://github.com/apache/hive/pull/1327#discussion_r464369294



##
File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorOnTezTest.java
##
@@ -261,22 +326,77 @@ protected void insertMmTestData(String tblName, int 
iterations) throws Exception
 }
 
 List getAllData(String tblName) throws Exception {
-  return getAllData(null, tblName);
+  return getAllData(null, tblName, false);
 }
 
-List getAllData(String dbName, String tblName) throws Exception {
+List getAllData(String tblName, boolean withRowId) throws 
Exception {
+  return getAllData(null, tblName, withRowId);
+}
+
+List getAllData(String dbName, String tblName, boolean withRowId) 
throws Exception {
   if (dbName != null) {
 tblName = dbName + "." + tblName;
   }
-  List result = executeStatementOnDriverAndReturnResults("select * 
from " + tblName, driver);
+  StringBuffer query = new StringBuffer();
+  query.append("select ");
+  if (withRowId) {
+query.append("ROW__ID, ");
+  }
+  query.append("* from ");
+  query.append(tblName);
+  List result = 
executeStatementOnDriverAndReturnResults(query.toString(), driver);
   Collections.sort(result);
   return result;
 }
 
+List getDataWithInputFileNames(String dbName, String tblName) 
throws Exception {
+  if (dbName != null) {
+tblName = dbName + "." + tblName;
+  }
+  StringBuffer query = new StringBuffer();
+  query.append("select ");
+  query.append("INPUT__FILE__NAME, a from ");

Review comment:
   Thanks a lot for the info. I will check, but I haven't seen any issues 
with this. Maybe because I don't check the row number here, just the file names?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465649)
Time Spent: 1h 40m  (was: 1.5h)

> Query based minor compaction produces wrong files when rows with different 
> buckets Ids are processed by the same FileSinkOperator
> -
>
> Key: HIVE-23763
> URL: https://issues.apache.org/jira/browse/HIVE-23763
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> How to reproduce:
> - Create an unbucketed ACID table
> - Insert a bigger amount of data into this table so there would be multiple 
> bucket files in the table
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_0_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_1_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_2_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_3_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_4_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_5_0
> - Do some delete on rows with different bucket Ids
> The files in a delete delta should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_002_002_/bucket_0
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_3
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_1
> - Run the query-based minor compaction
> After the compaction the newly created delete delta contains only 1 bucket 
> file. This file contains rows from all buckets and the table becomes unusable
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_001_007_v066/bucket_0
> The issue happens only if rows with different bucket Ids are processed by the 
> same FileSinkOperator. 
> In the 

[jira] [Work logged] (HIVE-23763) Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23763?focusedWorklogId=465648=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465648
 ]

ASF GitHub Bot logged work on HIVE-23763:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 12:00
Start Date: 03/Aug/20 12:00
Worklog Time Spent: 10m 
  Work Description: kuczoram commented on a change in pull request #1327:
URL: https://github.com/apache/hive/pull/1327#discussion_r464368518



##
File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorOnTezTest.java
##
@@ -217,6 +224,64 @@ void insertTestData(String dbName, String tblName) throws 
Exception {
   executeStatementOnDriver("delete from " + tblName + " where a = '1'", 
driver);
 }
 
+void createTableWithoutBucketWithMultipleSplits(String dbName, String 
tblName, String tempTblName,

Review comment:
   Sure, added a comment.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465648)
Time Spent: 1.5h  (was: 1h 20m)

> Query based minor compaction produces wrong files when rows with different 
> buckets Ids are processed by the same FileSinkOperator
> -
>
> Key: HIVE-23763
> URL: https://issues.apache.org/jira/browse/HIVE-23763
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> How to reproduce:
> - Create an unbucketed ACID table
> - Insert a bigger amount of data into this table so there would be multiple 
> bucket files in the table
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_0_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_1_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_2_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_3_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_4_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_5_0
> - Do some delete on rows with different bucket Ids
> The files in a delete delta should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_002_002_/bucket_0
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_3
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_1
> - Run the query-based minor compaction
> After the compaction the newly created delete delta contains only 1 bucket 
> file. This file contains rows from all buckets and the table becomes unusable
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_001_007_v066/bucket_0
> The issue happens only if rows with different bucket Ids are processed by the 
> same FileSinkOperator. 
> In the FileSinkOperator.process method, the files for the compaction table 
> are created like this:
> {noformat}
> if (!bDynParts && !filesCreated) {
>   if (lbDirName != null) {
> if (valToPaths.get(lbDirName) == null) {
>   createNewPaths(null, lbDirName);
> }
>   } else {
> if (conf.isCompactionTable()) {
>   int bucketProperty = getBucketProperty(row);
>   bucketId = 
> BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
> }
> createBucketFiles(fsp);
>   }
> }
> {noformat}
> When the first row is processed, the file is created and then the 
> filesCreated variable is set to true. Then when the other rows are processed, 
> the first if statement will be false, so no new file gets created, but the 
> row will be written into the file created for the first row.
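
A standalone sketch of the idea behind the fix (route each row to a writer chosen 
by its bucket id, creating writers lazily, instead of creating a single file for 
the first row); this mirrors the intent, not the actual FileSinkOperator code.

{code}
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class PerBucketWriterSketch {
  private final Map<Integer, BufferedWriter> writers = new HashMap<>();
  private final Path dir;

  PerBucketWriterSketch(Path dir) {
    this.dir = dir;
  }

  void process(int bucketId, String row) throws IOException {
    BufferedWriter writer = writers.get(bucketId);
    if (writer == null) { // create bucket_N lazily, once per bucket id
      writer = Files.newBufferedWriter(dir.resolve("bucket_" + bucketId));
      writers.put(bucketId, writer);
    }
    writer.write(row);
    writer.newLine();
  }

  void close() throws IOException { // analogous to closeOp: writers are closed at the end
    for (BufferedWriter writer : writers.values()) {
      writer.close();
    }
  }

  public static void main(String[] args) throws IOException {
    PerBucketWriterSketch sink = new PerBucketWriterSketch(Files.createTempDirectory("buckets"));
    sink.process(0, "row from bucket 0");
    sink.process(3, "row from bucket 3"); // lands in bucket_3, not in bucket_0
    sink.close();
  }
}
{code}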



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-8950) Add support in ParquetHiveSerde to create table schema from a parquet file

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-8950:
-
Labels: pull-request-available  (was: )

> Add support in ParquetHiveSerde to create table schema from a parquet file
> --
>
> Key: HIVE-8950
> URL: https://issues.apache.org/jira/browse/HIVE-8950
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ashish Singh
>Assignee: Ashish Singh
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-8950.1.patch, HIVE-8950.10.patch, 
> HIVE-8950.11.patch, HIVE-8950.2.patch, HIVE-8950.3.patch, HIVE-8950.4.patch, 
> HIVE-8950.5.patch, HIVE-8950.6.patch, HIVE-8950.7.patch, HIVE-8950.8.patch, 
> HIVE-8950.9.patch, HIVE-8950.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> PARQUET-76 and PARQUET-47 ask for creating parquet backed tables without 
> having to specify the column names and types. As, parquet files store schema 
> in their footer, it is possible to generate hive schema from parquet file's 
> metadata. This will improve usability of parquet backed tables.
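
A minimal sketch of the idea, assuming the parquet-hadoop footer API; the printed 
type names are a simplification and this is not the patch itself.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public class ParquetFooterSchemaSketch {
  public static void main(String[] args) throws Exception {
    // readFooter is deprecated in newer Parquet releases but widely available.
    ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
    MessageType schema = footer.getFileMetaData().getSchema();
    for (Type field : schema.getFields()) {
      String type = field.isPrimitive()
          ? field.asPrimitiveType().getPrimitiveTypeName().name()
          : "group"; // nested types would need a real Hive-type mapping
      System.out.println(field.getName() + " : " + type);
    }
  }
}
{code}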



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-8950) Add support in ParquetHiveSerde to create table schema from a parquet file

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-8950?focusedWorklogId=465633=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465633
 ]

ASF GitHub Bot logged work on HIVE-8950:


Author: ASF GitHub Bot
Created on: 03/Aug/20 10:58
Start Date: 03/Aug/20 10:58
Worklog Time Spent: 10m 
  Work Description: szehonCriteo opened a new pull request #1353:
URL: https://github.com/apache/hive/pull/1353


   …om a parquet file
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465633)
Remaining Estimate: 0h
Time Spent: 10m

> Add support in ParquetHiveSerde to create table schema from a parquet file
> --
>
> Key: HIVE-8950
> URL: https://issues.apache.org/jira/browse/HIVE-8950
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ashish Singh
>Assignee: Ashish Singh
>Priority: Major
> Attachments: HIVE-8950.1.patch, HIVE-8950.10.patch, 
> HIVE-8950.11.patch, HIVE-8950.2.patch, HIVE-8950.3.patch, HIVE-8950.4.patch, 
> HIVE-8950.5.patch, HIVE-8950.6.patch, HIVE-8950.7.patch, HIVE-8950.8.patch, 
> HIVE-8950.9.patch, HIVE-8950.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> PARQUET-76 and PARQUET-47 ask for creating parquet backed tables without 
> having to specify the column names and types. As, parquet files store schema 
> in their footer, it is possible to generate hive schema from parquet file's 
> metadata. This will improve usability of parquet backed tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23763) Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23763?focusedWorklogId=465619=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465619
 ]

ASF GitHub Bot logged work on HIVE-23763:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 09:56
Start Date: 03/Aug/20 09:56
Worklog Time Spent: 10m 
  Work Description: kuczoram commented on a change in pull request #1327:
URL: https://github.com/apache/hive/pull/1327#discussion_r464313931



##
File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorOnTezTest.java
##
@@ -95,6 +98,10 @@ private void setupTez(HiveConf conf) {
 conf.set("hive.tez.container.size", "128");
 conf.setBoolean("hive.merge.tezfiles", false);
 conf.setBoolean("hive.in.tez.test", true);
+if (!mmCompaction) {
+  conf.set("tez.grouping.max-size", "1024");

Review comment:
   Sure, added a comment.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465619)
Time Spent: 1h 20m  (was: 1h 10m)

> Query based minor compaction produces wrong files when rows with different 
> buckets Ids are processed by the same FileSinkOperator
> -
>
> Key: HIVE-23763
> URL: https://issues.apache.org/jira/browse/HIVE-23763
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> How to reproduce:
> - Create an unbucketed ACID table
> - Insert a bigger amount of data into this table so there would be multiple 
> bucket files in the table
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_0_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_1_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_2_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_3_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_4_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_5_0
> - Do some delete on rows with different bucket Ids
> The files in a delete delta should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_002_002_/bucket_0
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_3
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_1
> - Run the query-based minor compaction
> After the compaction the newly created delete delta contains only 1 bucket 
> file. This file contains rows from all buckets and the table becomes unusable
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_001_007_v066/bucket_0
> The issue happens only if rows with different bucket Ids are processed by the 
> same FileSinkOperator. 
> In the FileSinkOperator.process method, the files for the compaction table 
> are created like this:
> {noformat}
> if (!bDynParts && !filesCreated) {
>   if (lbDirName != null) {
> if (valToPaths.get(lbDirName) == null) {
>   createNewPaths(null, lbDirName);
> }
>   } else {
> if (conf.isCompactionTable()) {
>   int bucketProperty = getBucketProperty(row);
>   bucketId = 
> BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
> }
> createBucketFiles(fsp);
>   }
> }
> {noformat}
> When the first row is processed, the file is created and then the 
> filesCreated variable is set to true. Then when the other rows are processed, 
> the first if statement will be false, so no new file gets created, but the 
> row will be written into the file created for the first row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465618=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465618
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 09:54
Start Date: 03/Aug/20 09:54
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464312615



##
File path: ql/src/test/org/apache/hadoop/hive/ql/TestTxnCommands.java
##
@@ -618,7 +618,13 @@ public void testMultipleInserts() throws Exception {
 dumpTableData(Table.ACIDTBL, 1, 1);
 List rs1 = runStatementOnDriver("select a,b from " + Table.ACIDTBL 
+ " order by a,b");
 Assert.assertEquals("Content didn't match after commit rs1", allData, rs1);
+runStatementOnDriver("delete from " + Table.ACIDTBL + " where b = 2");

Review comment:
   This is a valid test, but I think testMultipleInserts is a test for 
inserts, and this is a test for deletes. Maybe create its own test method named 
testDeleteOfInserts, like testUpdateOfInserts?
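
A sketch of such a test method, reusing the helpers visible in the diff above 
(runStatementOnDriver, Table.ACIDTBL); the rows and assertion are illustrative, 
not the committed test.

{code}
@Test
public void testDeleteOfInserts() throws Exception {
  runStatementOnDriver("insert into " + Table.ACIDTBL + "(a,b) values(1,2),(3,4)");
  runStatementOnDriver("delete from " + Table.ACIDTBL + " where b = 2");
  List<String> rs = runStatementOnDriver("select a,b from " + Table.ACIDTBL + " order by a,b");
  // only the row that was not deleted should remain
  Assert.assertEquals("Content didn't match after delete", 1, rs.size());
}
{code}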





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465618)
Time Spent: 3h 50m  (was: 3h 40m)

> Delete delta directory file information should be pushed to execution side
> --
>
> Key: HIVE-23956
> URL: https://issues.apache.org/jira/browse/HIVE-23956
> Project: Hive
>  Issue Type: Improvement
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Since HIVE-23840 LLAP cache is used to retrieve the tail of the ORC bucket 
> files in the delete deltas, but to use the cache the fileId must be 
> determined, so one more FileSystem call is issued for each bucket.
> This fileId is already available during compilation in the AcidState 
> calculation, we should serialise this to the OrcSplit, and remove the 
> unnecessary FS calls.
> Furthermore instead of sending the SyntheticFileId directly, we should pass 
> the attemptId instead of the standard path hash, this way the path and the 
> SyntheticFileId. can be calculated, and it will work even, if the move free 
> delete operations will be introduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465617=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465617
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 09:51
Start Date: 03/Aug/20 09:51
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464311382



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -1641,28 +1645,26 @@ public int compareTo(CompressedOwid other) {
  * Check if the delete delta folder needs to be scanned for a given 
split's min/max write ids.
  *
  * @param orcSplitMinMaxWriteIds
- * @param deleteDeltaDir
+ * @param deleteDelta
+ * @param stmtId statementId of the deleteDelta if present
  * @return true when  delete delta dir has to be scanned.
  */
 @VisibleForTesting
 protected static boolean 
isQualifiedDeleteDeltaForSplit(AcidOutputFormat.Options orcSplitMinMaxWriteIds,
-Path deleteDeltaDir)
-{
-  AcidUtils.ParsedDelta deleteDelta = 
AcidUtils.parsedDelta(deleteDeltaDir, false);
+AcidInputFormat.DeltaMetaData deleteDelta, Integer stmtId) {

Review comment:
   nit: extra spaces?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465617)
Time Spent: 3h 40m  (was: 3.5h)

> Delete delta directory file information should be pushed to execution side
> --
>
> Key: HIVE-23956
> URL: https://issues.apache.org/jira/browse/HIVE-23956
> Project: Hive
>  Issue Type: Improvement
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Since HIVE-23840 LLAP cache is used to retrieve the tail of the ORC bucket 
> files in the delete deltas, but to use the cache the fileId must be 
> determined, so one more FileSystem call is issued for each bucket.
> This fileId is already available during compilation in the AcidState 
> calculation, we should serialise this to the OrcSplit, and remove the 
> unnecessary FS calls.
> Furthermore instead of sending the SyntheticFileId directly, we should pass 
> the attemptId instead of the standard path hash, this way the path and the 
> SyntheticFileId. can be calculated, and it will work even, if the move free 
> delete operations will be introduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465616=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465616
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 09:49
Start Date: 03/Aug/20 09:49
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464310228



##
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidInputFormat.java
##
@@ -118,70 +126,217 @@
  */
 private long visibilityTxnId;
 
+private List deltaFiles;
+
 public DeltaMetaData() {
-  this(0,0,new ArrayList(), 0);
+  this(0, 0, new ArrayList<>(), 0, new ArrayList<>());
 }
+
 /**
+ * @param minWriteId min writeId of the delta directory
+ * @param maxWriteId max writeId of the delta directory
  * @param stmtIds delta dir suffixes when a single txn writes > 1 delta in 
the same partition
  * @param visibilityTxnId maybe 0, if the dir name didn't have it.  
txnid:0 is always visible
+ * @param deltaFiles bucketFiles in the directory
  */
-DeltaMetaData(long minWriteId, long maxWriteId, List<Integer> stmtIds, 
long visibilityTxnId) {
+public DeltaMetaData(long minWriteId, long maxWriteId, List<Integer> 
stmtIds, long visibilityTxnId,
+List<DeltaFileMetaData> deltaFiles) {
   this.minWriteId = minWriteId;
   this.maxWriteId = maxWriteId;
   if (stmtIds == null) {
 throw new IllegalArgumentException("stmtIds == null");
   }
   this.stmtIds = stmtIds;
   this.visibilityTxnId = visibilityTxnId;
+  this.deltaFiles = ObjectUtils.defaultIfNull(deltaFiles, new 
ArrayList<>());
 }
-long getMinWriteId() {
+
+public long getMinWriteId() {
   return minWriteId;
 }
-long getMaxWriteId() {
+
+public long getMaxWriteId() {
   return maxWriteId;
 }
-List<Integer> getStmtIds() {
+
+public List<Integer> getStmtIds() {
   return stmtIds;
 }
-long getVisibilityTxnId() {
+
+public long getVisibilityTxnId() {
   return visibilityTxnId;
 }
+
+public List<DeltaFileMetaData> getDeltaFiles() {
+  return deltaFiles;
+}
+
+public List<DeltaFileMetaData> getDeltaFilesForStmtId(final Integer 
stmtId) {
+  if (stmtIds.size() <= 1 || stmtId == null) {
+// If it is not a multistatement delta, we do not store the stmtId in 
the file list
+return deltaFiles;
+  } else {
+return deltaFiles.stream().filter(df -> 
stmtId.equals(df.getStmtId())).collect(Collectors.toList());
+  }
+}
+
 @Override
 public void write(DataOutput out) throws IOException {
   out.writeLong(minWriteId);
   out.writeLong(maxWriteId);
   out.writeInt(stmtIds.size());
-  for(Integer id : stmtIds) {
+  for (Integer id : stmtIds) {
 out.writeInt(id);
   }
   out.writeLong(visibilityTxnId);
+  out.writeInt(deltaFiles.size());
+  for (DeltaFileMetaData fileMeta : deltaFiles) {
+fileMeta.write(out);
+  }
 }
+
 @Override
 public void readFields(DataInput in) throws IOException {
   minWriteId = in.readLong();
   maxWriteId = in.readLong();
   stmtIds.clear();
   int numStatements = in.readInt();
-  for(int i = 0; i < numStatements; i++) {
+  for (int i = 0; i < numStatements; i++) {
 stmtIds.add(in.readInt());
   }
   visibilityTxnId = in.readLong();
+
+  deltaFiles.clear();
+  int numFiles = in.readInt();
+  for (int i = 0; i < numFiles; i++) {
+DeltaFileMetaData file = new DeltaFileMetaData();
+file.readFields(in);
+deltaFiles.add(file);
+  }
 }
-String getName() {
+
+private String getName() {
   assert stmtIds.isEmpty() : "use getName(int)";
-  return AcidUtils.addVisibilitySuffix(AcidUtils
-  .deleteDeltaSubdir(minWriteId, maxWriteId), visibilityTxnId);
+  return 
AcidUtils.addVisibilitySuffix(AcidUtils.deleteDeltaSubdir(minWriteId, 
maxWriteId), visibilityTxnId);
 }
-String getName(int stmtId) {
+
+private String getName(int stmtId) {
   assert !stmtIds.isEmpty() : "use getName()";
   return AcidUtils.addVisibilitySuffix(AcidUtils
   .deleteDeltaSubdir(minWriteId, maxWriteId, stmtId), visibilityTxnId);
 }
+
+public List<Pair<Path, Integer>> getPaths(Path root) {

Review comment:
   Do we need the order? Why not a map?
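For illustration, a sketch of the map-based shape this question points at, meant to live inside DeltaMetaData and reusing the getName()/getName(stmtId) helpers shown above; whether ordering can be dropped is exactly the open question, so this is only an alternative, not a drop-in replacement:

{code:java}
// Requires java.util.LinkedHashMap and java.util.Map imports; keeps insertion order so
// callers iterate over deltas in the same sequence as the List<Pair<Path, Integer>> version.
public Map<Path, Integer> getPathsByStmtId(Path root) {
  Map<Path, Integer> paths = new LinkedHashMap<>();
  if (stmtIds.isEmpty()) {
    paths.put(new Path(root, getName()), null);           // no statement id for single-stmt deltas
  } else {
    for (Integer stmtId : stmtIds) {
      paths.put(new Path(root, getName(stmtId)), stmtId);
    }
  }
  return paths;
}
{code}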





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465616)
Time Spent: 3.5h  (was: 3h 20m)

> Delete delta directory 

[jira] [Work logged] (HIVE-23800) Add hooks when HiveServer2 stops due to OutOfMemoryError

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23800?focusedWorklogId=465614=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465614
 ]

ASF GitHub Bot logged work on HIVE-23800:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 09:47
Start Date: 03/Aug/20 09:47
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 opened a new pull request #1205:
URL: https://github.com/apache/hive/pull/1205


   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HIVE-X: Fix a typo in YYY)
   For more details, please see 
https://cwiki.apache.org/confluence/display/Hive/HowToContribute
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465614)
Time Spent: 3h  (was: 2h 50m)

> Add hooks when HiveServer2 stops due to OutOfMemoryError
> 
>
> Key: HIVE-23800
> URL: https://issues.apache.org/jira/browse/HIVE-23800
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Zhihua Deng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Make the OOM hook an interface of HiveServer2, so users can implement the hook to 
> do something before HS2 stops, such as dumping the heap or alerting the 
> devops.
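A hypothetical sketch of what such a hook could look like; the interface name, method signature, and wiring are illustrative only and not the actual HiveServer2 API introduced by this patch:

{code:java}
// Hypothetical shape only; HIVE-23800 may define a different name and registration mechanism.
public interface OomHook {
  /** Invoked once before HiveServer2 stops because of an OutOfMemoryError. */
  void onOutOfMemory(OutOfMemoryError error);
}

// Example of what an operator-supplied implementation might do.
class HeapDumpOomHook implements OomHook {
  @Override
  public void onOutOfMemory(OutOfMemoryError error) {
    System.err.println("HiveServer2 stopping due to OOM: " + error.getMessage());
    // e.g. trigger a heap dump or notify the on-call rotation here
  }
}
{code}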



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465615=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465615
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 09:47
Start Date: 03/Aug/20 09:47
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464309265



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -1574,20 +1577,23 @@ public int compareTo(CompressedOwid other) {
   this.orcSplit = orcSplit;
 
   try {
-final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
-if (deleteDeltaDirs.length > 0) {
+if (orcSplit.getDeltas().size() > 0) {
   AcidOutputFormat.Options orcSplitMinMaxWriteIds =
   AcidUtils.parseBaseOrDeltaBucketFilename(orcSplit.getPath(), 
conf);
   int totalDeleteEventCount = 0;
-  for (Path deleteDeltaDir : deleteDeltaDirs) {
-if (!isQualifiedDeleteDeltaForSplit(orcSplitMinMaxWriteIds, 
deleteDeltaDir)) {
-  continue;
-}
-Path[] deleteDeltaFiles = 
OrcRawRecordMerger.getDeltaFiles(deleteDeltaDir, bucket,
-new OrcRawRecordMerger.Options().isCompacting(false), null);
-for (Path deleteDeltaFile : deleteDeltaFiles) {
-  try {
-ReaderData readerData = getOrcTail(deleteDeltaFile, conf, 
cacheTag);
+  for (AcidInputFormat.DeltaMetaData deltaMetaData : 
orcSplit.getDeltas()) {
+// We got one path for each statement in a multiStmt transaction
+for (Pair<Path, Integer> deleteDeltaDir : 
deltaMetaData.getPaths(orcSplit.getRootDir())) {

Review comment:
   nit: space 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465615)
Time Spent: 3h 20m  (was: 3h 10m)

> Delete delta directory file information should be pushed to execution side
> --
>
> Key: HIVE-23956
> URL: https://issues.apache.org/jira/browse/HIVE-23956
> Project: Hive
>  Issue Type: Improvement
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Since HIVE-23840, the LLAP cache is used to retrieve the tail of the ORC bucket 
> files in the delete deltas, but to use the cache the fileId must be 
> determined, so one more FileSystem call is issued for each bucket.
> This fileId is already available during compilation in the AcidState 
> calculation; we should serialise it to the OrcSplit and remove the 
> unnecessary FS calls.
> Furthermore, instead of sending the SyntheticFileId directly, we should pass 
> the attemptId instead of the standard path hash; this way the path and the 
> SyntheticFileId can be recalculated, and it will keep working even if 
> move-free delete operations are introduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465613=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465613
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 09:45
Start Date: 03/Aug/20 09:45
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464308530



##
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
##
@@ -2493,7 +2514,7 @@ private static Path chooseFile(Path baseOrDeltaDir, 
FileSystem fs) throws IOExce
   }
   FileStatus[] dataFiles;
   try {
-dataFiles = fs.listStatus(new Path[]{baseOrDeltaDir}, 
originalBucketFilter);
+dataFiles = fs.listStatus(baseOrDeltaDir , originalBucketFilter);

Review comment:
   nit: extra space





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465613)
Time Spent: 3h 10m  (was: 3h)

> Delete delta directory file information should be pushed to execution side
> --
>
> Key: HIVE-23956
> URL: https://issues.apache.org/jira/browse/HIVE-23956
> Project: Hive
>  Issue Type: Improvement
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Since HIVE-23840, the LLAP cache is used to retrieve the tail of the ORC bucket 
> files in the delete deltas, but to use the cache the fileId must be 
> determined, so one more FileSystem call is issued for each bucket.
> This fileId is already available during compilation in the AcidState 
> calculation; we should serialise it to the OrcSplit and remove the 
> unnecessary FS calls.
> Furthermore, instead of sending the SyntheticFileId directly, we should pass 
> the attemptId instead of the standard path hash; this way the path and the 
> SyntheticFileId can be recalculated, and it will keep working even if 
> move-free delete operations are introduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23800) Add hooks when HiveServer2 stops due to OutOfMemoryError

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23800?focusedWorklogId=465612=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465612
 ]

ASF GitHub Bot logged work on HIVE-23800:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 09:45
Start Date: 03/Aug/20 09:45
Worklog Time Spent: 10m 
  Work Description: dengzhhu653 closed pull request #1205:
URL: https://github.com/apache/hive/pull/1205


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465612)
Time Spent: 2h 50m  (was: 2h 40m)

> Add hooks when HiveServer2 stops due to OutOfMemoryError
> 
>
> Key: HIVE-23800
> URL: https://issues.apache.org/jira/browse/HIVE-23800
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Zhihua Deng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Make the OOM hook an interface of HiveServer2, so users can implement the hook to 
> do something before HS2 stops, such as dumping the heap or alerting the 
> devops.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465607=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465607
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 09:41
Start Date: 03/Aug/20 09:41
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464306341



##
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidInputFormat.java
##
@@ -118,70 +126,217 @@
  */
 private long visibilityTxnId;
 
+private List<DeltaFileMetaData> deltaFiles;
+
 public DeltaMetaData() {
-  this(0,0,new ArrayList<Integer>(), 0);
+  this(0, 0, new ArrayList<>(), 0, new ArrayList<>());
 }
+
 /**
+ * @param minWriteId min writeId of the delta directory
+ * @param maxWriteId max writeId of the delta directory
  * @param stmtIds delta dir suffixes when a single txn writes > 1 delta in 
the same partition
  * @param visibilityTxnId maybe 0, if the dir name didn't have it.  
txnid:0 is always visible
+ * @param deltaFiles bucketFiles in the directory
  */
-DeltaMetaData(long minWriteId, long maxWriteId, List<Integer> stmtIds, 
long visibilityTxnId) {
+public DeltaMetaData(long minWriteId, long maxWriteId, List<Integer> 
stmtIds, long visibilityTxnId,
+List<DeltaFileMetaData> deltaFiles) {
   this.minWriteId = minWriteId;
   this.maxWriteId = maxWriteId;
   if (stmtIds == null) {
 throw new IllegalArgumentException("stmtIds == null");
   }
   this.stmtIds = stmtIds;
   this.visibilityTxnId = visibilityTxnId;
+  this.deltaFiles = ObjectUtils.defaultIfNull(deltaFiles, new 
ArrayList<>());
 }
-long getMinWriteId() {
+
+public long getMinWriteId() {
   return minWriteId;
 }
-long getMaxWriteId() {
+
+public long getMaxWriteId() {
   return maxWriteId;
 }
-List<Integer> getStmtIds() {
+
+public List<Integer> getStmtIds() {
   return stmtIds;
 }
-long getVisibilityTxnId() {
+
+public long getVisibilityTxnId() {
   return visibilityTxnId;
 }
+
+public List<DeltaFileMetaData> getDeltaFiles() {
+  return deltaFiles;
+}
+
+public List<DeltaFileMetaData> getDeltaFilesForStmtId(final Integer 
stmtId) {
+  if (stmtIds.size() <= 1 || stmtId == null) {
+// If it is not a multistatement delta, we do not store the stmtId in 
the file list
+return deltaFiles;
+  } else {
+return deltaFiles.stream().filter(df -> 
stmtId.equals(df.getStmtId())).collect(Collectors.toList());

Review comment:
   Question: How often do we call this? Is it ok to calculate this every 
time, or would it be better to store it in an already-filtered form, like a 
map?
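For illustration, a sketch of the prefiltered alternative this question suggests, grouping the files once by statement id; DeltaFileMetaData#getStmtId() is taken from the patch above, while the map field and grouping helper are illustrative:

{code:java}
// Requires java.util.{ArrayList, Collections, HashMap, List, Map} imports.
private Map<Integer, List<DeltaFileMetaData>> deltaFilesByStmtId;

private static Map<Integer, List<DeltaFileMetaData>> groupByStmtId(List<DeltaFileMetaData> files) {
  Map<Integer, List<DeltaFileMetaData>> byStmtId = new HashMap<>();
  for (DeltaFileMetaData df : files) {
    // HashMap accepts a null key, which covers single-statement deltas with no stmtId.
    byStmtId.computeIfAbsent(df.getStmtId(), k -> new ArrayList<>()).add(df);
  }
  return byStmtId;
}

public List<DeltaFileMetaData> getDeltaFilesForStmtId(final Integer stmtId) {
  if (stmtIds.size() <= 1 || stmtId == null) {
    return deltaFiles;   // single-statement delta: stmtId is not stored in the file list
  }
  return deltaFilesByStmtId.getOrDefault(stmtId, Collections.emptyList());
}
{code}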





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465607)
Time Spent: 3h  (was: 2h 50m)

> Delete delta directory file information should be pushed to execution side
> --
>
> Key: HIVE-23956
> URL: https://issues.apache.org/jira/browse/HIVE-23956
> Project: Hive
>  Issue Type: Improvement
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Since HIVE-23840, the LLAP cache is used to retrieve the tail of the ORC bucket 
> files in the delete deltas, but to use the cache the fileId must be 
> determined, so one more FileSystem call is issued for each bucket.
> This fileId is already available during compilation in the AcidState 
> calculation; we should serialise it to the OrcSplit and remove the 
> unnecessary FS calls.
> Furthermore, instead of sending the SyntheticFileId directly, we should pass 
> the attemptId instead of the standard path hash; this way the path and the 
> SyntheticFileId can be recalculated, and it will keep working even if 
> move-free delete operations are introduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465604=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465604
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 09:40
Start Date: 03/Aug/20 09:40
Worklog Time Spent: 10m 
  Work Description: szlta commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464303366



##
File path: 
llap-server/src/test/org/apache/hadoop/hive/llap/cache/TestOrcMetadataCache.java
##
@@ -250,18 +255,71 @@ public void testGetOrcTailForPath() throws Exception {
 Configuration jobConf = new Configuration();
 Configuration daemonConf = new Configuration();
 CacheTag tag = CacheTag.build("test-table");
-OrcTail uncached = OrcEncodedDataReader.getOrcTailForPath(path, jobConf, 
tag, daemonConf, cache);
+OrcTail uncached = OrcEncodedDataReader.getOrcTailForPath(path, jobConf, 
tag, daemonConf, cache, null);
 jobConf.set(HiveConf.ConfVars.LLAP_IO_CACHE_ONLY.varname, "true");
-OrcTail cached = OrcEncodedDataReader.getOrcTailForPath(path, jobConf, 
tag, daemonConf, cache);
+OrcTail cached = OrcEncodedDataReader.getOrcTailForPath(path, jobConf, 
tag, daemonConf, cache, null);
 assertEquals(uncached.getSerializedTail(), cached.getSerializedTail());
 assertEquals(uncached.getFileTail(), cached.getFileTail());
   }
 
+  @Test
+  public void testGetOrcTailForPathWithFileId() throws Exception {
+DummyMemoryManager mm = new DummyMemoryManager();
+DummyCachePolicy cp = new DummyCachePolicy();
+final int MAX_ALLOC = 64;
+LlapDaemonCacheMetrics metrics = LlapDaemonCacheMetrics.create("", "");
+BuddyAllocator alloc = new BuddyAllocator(
+false, false, 8, MAX_ALLOC, 1, 4096, 0, null, mm, metrics, null, true);
+MetadataCache cache = new MetadataCache(alloc, mm, cp, true, metrics);
+
+Path path = new Path("../data/files/alltypesorc");
+Configuration jobConf = new Configuration();
+Configuration daemonConf = new Configuration();
+CacheTag tag = CacheTag.build("test-table");
+FileSystem fs = FileSystem.get(daemonConf);
+FileStatus fileStatus = fs.getFileStatus(path);
+OrcTail uncached = 
OrcEncodedDataReader.getOrcTailForPath(fileStatus.getPath(), jobConf, tag, 
daemonConf, cache, new SyntheticFileId(fileStatus));
+jobConf.set(HiveConf.ConfVars.LLAP_IO_CACHE_ONLY.varname, "true");
+// this should work from the cache, by recalculating the same fileId
+OrcTail cached = 
OrcEncodedDataReader.getOrcTailForPath(fileStatus.getPath(), jobConf, tag, 
daemonConf, cache, null);
+assertEquals(uncached.getSerializedTail(), cached.getSerializedTail());
+assertEquals(uncached.getFileTail(), cached.getFileTail());
+  }
+
+  @Test
+  public void testGetOrcTailForPathWithFileIdChange() throws Exception {
+DummyMemoryManager mm = new DummyMemoryManager();
+DummyCachePolicy cp = new DummyCachePolicy();
+final int MAX_ALLOC = 64;
+LlapDaemonCacheMetrics metrics = LlapDaemonCacheMetrics.create("", "");
+BuddyAllocator alloc = new BuddyAllocator(
+false, false, 8, MAX_ALLOC, 1, 4096, 0, null, mm, metrics, null, true);
+MetadataCache cache = new MetadataCache(alloc, mm, cp, true, metrics);
+
+Path path = new Path("../data/files/alltypesorc");
+Configuration jobConf = new Configuration();
+Configuration daemonConf = new Configuration();
+CacheTag tag = CacheTag.build("test-table");
+OrcEncodedDataReader.getOrcTailForPath(path, jobConf, tag, daemonConf, 
cache, new SyntheticFileId(path, 100, 100));
+jobConf.set(HiveConf.ConfVars.LLAP_IO_CACHE_ONLY.varname, "true");
+Exception ex = null;
+try {
+  // this should miss the cache, since the fileKey changed
+  OrcEncodedDataReader.getOrcTailForPath(path, jobConf, tag, daemonConf, 
cache, new SyntheticFileId(path, 100, 101));

Review comment:
   You can add a _fail_ call here, as it should always jump from line 308 
to the catch clause.
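For clarity, the suggested shape, using the same calls as the test above:

{code:java}
jobConf.set(HiveConf.ConfVars.LLAP_IO_CACHE_ONLY.varname, "true");
try {
  // this should miss the cache, since the fileKey changed
  OrcEncodedDataReader.getOrcTailForPath(path, jobConf, tag, daemonConf, cache,
      new SyntheticFileId(path, 100, 101));
  Assert.fail("expected an IOException: cache-only read with a changed fileId must not succeed");
} catch (IOException e) {
  Assert.assertTrue(e.getMessage().contains(HiveConf.ConfVars.LLAP_IO_CACHE_ONLY.varname));
}
{code}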

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -680,14 +681,15 @@ public void setBaseAndInnerReader(
* @param path The Orc file path we want to get the OrcTail for
* @param conf The Configuration to access LLAP
* @param cacheTag The cacheTag needed to get OrcTail from LLAP IO cache
+   * @param fileKey fileId of the Orc file (either the Long fileId of HDFS or 
the SyntheticFileId)
* @return ReaderData object where the orcTail is not null. Reader can be 
null, but if we had to create
* one we return that as well for further reuse.
*/
-  private static ReaderData getOrcTail(Path path, Configuration conf, CacheTag 
cacheTag) throws IOException {
+  private static ReaderData getOrcTail(Path path, Configuration conf, CacheTag 

[jira] [Work logged] (HIVE-23956) Delete delta directory file information should be pushed to execution side

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=465603=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465603
 ]

ASF GitHub Bot logged work on HIVE-23956:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 09:38
Start Date: 03/Aug/20 09:38
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r464304647



##
File path: 
llap-server/src/test/org/apache/hadoop/hive/llap/cache/TestOrcMetadataCache.java
##
@@ -250,18 +255,71 @@ public void testGetOrcTailForPath() throws Exception {
 Configuration jobConf = new Configuration();
 Configuration daemonConf = new Configuration();
 CacheTag tag = CacheTag.build("test-table");
-OrcTail uncached = OrcEncodedDataReader.getOrcTailForPath(path, jobConf, 
tag, daemonConf, cache);
+OrcTail uncached = OrcEncodedDataReader.getOrcTailForPath(path, jobConf, 
tag, daemonConf, cache, null);
 jobConf.set(HiveConf.ConfVars.LLAP_IO_CACHE_ONLY.varname, "true");
-OrcTail cached = OrcEncodedDataReader.getOrcTailForPath(path, jobConf, 
tag, daemonConf, cache);
+OrcTail cached = OrcEncodedDataReader.getOrcTailForPath(path, jobConf, 
tag, daemonConf, cache, null);
 assertEquals(uncached.getSerializedTail(), cached.getSerializedTail());
 assertEquals(uncached.getFileTail(), cached.getFileTail());
   }
 
+  @Test
+  public void testGetOrcTailForPathWithFileId() throws Exception {
+DummyMemoryManager mm = new DummyMemoryManager();
+DummyCachePolicy cp = new DummyCachePolicy();
+final int MAX_ALLOC = 64;
+LlapDaemonCacheMetrics metrics = LlapDaemonCacheMetrics.create("", "");
+BuddyAllocator alloc = new BuddyAllocator(
+false, false, 8, MAX_ALLOC, 1, 4096, 0, null, mm, metrics, null, true);
+MetadataCache cache = new MetadataCache(alloc, mm, cp, true, metrics);
+
+Path path = new Path("../data/files/alltypesorc");
+Configuration jobConf = new Configuration();
+Configuration daemonConf = new Configuration();
+CacheTag tag = CacheTag.build("test-table");
+FileSystem fs = FileSystem.get(daemonConf);
+FileStatus fileStatus = fs.getFileStatus(path);
+OrcTail uncached = 
OrcEncodedDataReader.getOrcTailForPath(fileStatus.getPath(), jobConf, tag, 
daemonConf, cache, new SyntheticFileId(fileStatus));
+jobConf.set(HiveConf.ConfVars.LLAP_IO_CACHE_ONLY.varname, "true");
+// this should work from the cache, by recalculating the same fileId
+OrcTail cached = 
OrcEncodedDataReader.getOrcTailForPath(fileStatus.getPath(), jobConf, tag, 
daemonConf, cache, null);
+assertEquals(uncached.getSerializedTail(), cached.getSerializedTail());
+assertEquals(uncached.getFileTail(), cached.getFileTail());
+  }
+
+  @Test
+  public void testGetOrcTailForPathWithFileIdChange() throws Exception {
+DummyMemoryManager mm = new DummyMemoryManager();
+DummyCachePolicy cp = new DummyCachePolicy();
+final int MAX_ALLOC = 64;
+LlapDaemonCacheMetrics metrics = LlapDaemonCacheMetrics.create("", "");
+BuddyAllocator alloc = new BuddyAllocator(
+false, false, 8, MAX_ALLOC, 1, 4096, 0, null, mm, metrics, null, true);
+MetadataCache cache = new MetadataCache(alloc, mm, cp, true, metrics);
+
+Path path = new Path("../data/files/alltypesorc");
+Configuration jobConf = new Configuration();
+Configuration daemonConf = new Configuration();
+CacheTag tag = CacheTag.build("test-table");
+OrcEncodedDataReader.getOrcTailForPath(path, jobConf, tag, daemonConf, 
cache, new SyntheticFileId(path, 100, 100));
+jobConf.set(HiveConf.ConfVars.LLAP_IO_CACHE_ONLY.varname, "true");
+Exception ex = null;
+try {
+  // this should miss the cache, since the fileKey changed
+  OrcEncodedDataReader.getOrcTailForPath(path, jobConf, tag, daemonConf, 
cache, new SyntheticFileId(path, 100, 101));
+} catch (IOException e) {
+  ex = e;
+}
+Assert.assertNotNull(ex);
+
Assert.assertTrue(ex.getMessage().contains(HiveConf.ConfVars.LLAP_IO_CACHE_ONLY.varname));
+  }
+
+

Review comment:
   nit: too many newlines. If any other fix is needed anyway, please remove them.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465603)
Time Spent: 2h 40m  (was: 2.5h)

> Delete delta directory file information should be pushed to execution side
> --
>
> Key: HIVE-23956
> URL: 

[jira] [Commented] (HIVE-23963) UnsupportedOperationException in queries 74 and 84 while applying HiveCardinalityPreservingJoinRule

2020-08-03 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169820#comment-17169820
 ] 

Stamatis Zampetakis commented on HIVE-23963:


The problem can be reproduced with the patch in 
[PR#1347|https://github.com/apache/hive/pull/1347].

> UnsupportedOperationException in queries 74 and 84 while applying 
> HiveCardinalityPreservingJoinRule
> ---
>
> Key: HIVE-23963
> URL: https://issues.apache.org/jira/browse/HIVE-23963
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Reporter: Stamatis Zampetakis
>Assignee: Krisztian Kasa
>Priority: Major
> Attachments: cbo_query74_stacktrace.txt, cbo_query84_stacktrace.txt
>
>
> The following TPC-DS queries: 
> * cbo_query74.q
> * cbo_query84.q 
> * query74.q 
> * query84.q 
> fail on the metastore with the partitioned TPC-DS 30TB dataset.
> The stacktraces for cbo_query74 and cbo_query84 show that the problem 
> originates while applying HiveCardinalityPreservingJoinRule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23972) Add external client ID to LLAP external client

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23972?focusedWorklogId=465577=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465577
 ]

ASF GitHub Bot logged work on HIVE-23972:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 07:43
Start Date: 03/Aug/20 07:43
Worklog Time Spent: 10m 
  Work Description: jdere commented on pull request #1350:
URL: https://github.com/apache/hive/pull/1350#issuecomment-667863766


   @prasanthj  thanks for pointing that out - I've tried to update the patch to 
use hive.query.name



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465577)
Time Spent: 0.5h  (was: 20m)

> Add external client ID to LLAP external client
> --
>
> Key: HIVE-23972
> URL: https://issues.apache.org/jira/browse/HIVE-23972
> Project: Hive
>  Issue Type: Bug
>  Components: llap
>Reporter: Jason Dere
>Assignee: Jason Dere
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There is currently no good way to tell which running LLAP tasks come 
> from external LLAP clients, and no good way to know which 
> application is submitting these external LLAP requests.
> One possible solution for this is to add an option for the external LLAP 
> client to pass in an external client ID, which can get logged by HiveServer2 
> during the getSplits request, as well as displayed from the LLAP 
> executorsStatus.
> cc [~ShubhamChaurasia]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23966) Minor query-based compaction always results in delta dirs with minWriteId=1

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23966?focusedWorklogId=465567=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465567
 ]

ASF GitHub Bot logged work on HIVE-23966:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 06:44
Start Date: 03/Aug/20 06:44
Worklog Time Spent: 10m 
  Work Description: klcopp commented on pull request #1346:
URL: https://github.com/apache/hive/pull/1346#issuecomment-667836261


   Closed and reopened to rerun tests



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465567)
Time Spent: 40m  (was: 0.5h)

> Minor query-based compaction always results in delta dirs with minWriteId=1
> ---
>
> Key: HIVE-23966
> URL: https://issues.apache.org/jira/browse/HIVE-23966
> Project: Hive
>  Issue Type: Bug
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Minor compaction after major/IOW will result in directories that look like:
>  * base_z_v
>  * delta_1_y_v
>  * delete_delta_1_y_v
> Should be:
>  * base_z_v
>  * delta_(z+1)_y_v
>  * delete_delta_(z+1)_y_v
> Issues this causes:
> For example, after running insert overwrite, then minor compaction, major 
> compaction will fail with the following error:
> {noformat}
> Found 2 equal splits: OrcSplit 
> [hdfs://.../warehouse/tablespace/managed/hive/bucketed/delta_001_006_v0001058/bucket_4,
> start=0, length=722, isOriginal=false, fileLength=722, hasFooter=false, 
> hasBase=true, deltas=1] and OrcSplit 
> [hdfs://.../warehouse/tablespace/managed/hive/bucketed/base_001/bucket_4_0,
> start=0, length=811, isOriginal=false, fileLength=811, hasFooter=false, 
> hasBase=true, deltas=1]
> {noformat}
> or it can fail with:
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order 
> of Acid rows detected for the rows: 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@201be62b
>  an
> d 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@5f97bd3f
> {noformat}
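For illustration, a small sketch of the naming issue described above, assuming the AcidUtils.deltaSubdir/deleteDeltaSubdir helpers referenced elsewhere in this digest; the visibility suffix (_vNNNNNNN) is omitted and the concrete write ids are made up:

{code:java}
import org.apache.hadoop.hive.ql.io.AcidUtils;

public class MinorCompactionNamingSketch {
  public static void main(String[] args) {
    long baseWriteId = 5;   // base left by a major compaction or insert overwrite
    long maxWriteId = 9;    // highest write id covered by the minor compaction

    // Buggy behaviour described above: the compacted delta always starts at write id 1.
    String buggyDelta = AcidUtils.deltaSubdir(1, maxWriteId);
    // Expected behaviour: the compacted delta should start right after the base.
    String fixedDelta = AcidUtils.deltaSubdir(baseWriteId + 1, maxWriteId);
    String fixedDeleteDelta = AcidUtils.deleteDeltaSubdir(baseWriteId + 1, maxWriteId);

    System.out.println("buggy:    " + buggyDelta);
    System.out.println("expected: " + fixedDelta + " and " + fixedDeleteDelta);
  }
}
{code}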



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

