[jira] [Work logged] (HIVE-23896) hiveserver2 not listening on any port, am i miss some configurations?

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23896?focusedWorklogId=463491&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463491
 ]

ASF GitHub Bot logged work on HIVE-23896:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 05:44
Start Date: 27/Jul/20 05:44
Worklog Time Spent: 10m 
  Work Description: pvary commented on pull request #1307:
URL: https://github.com/apache/hive/pull/1307#issuecomment-664132074


   > Could someone help me take a look? This error seems to have nothing to do 
with what I submitted. Thanks a lot!
   
   Just retrigger the test.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463491)
Time Spent: 0.5h  (was: 20m)

> hiveserver2 not listening on any port, am i miss some configurations?
> -
>
> Key: HIVE-23896
> URL: https://issues.apache.org/jira/browse/HIVE-23896
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 3.1.2
> Environment: hive: 3.1.2
> hadoop: 3.2.1, standalone, url: hdfs://namenode.hadoop.svc.cluster.local:9000
> {quote}$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
>  $ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
> {quote}
> Hadoop commands work in the hiveserver node (pod).
>  
>Reporter: alanwake
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
>  
>  
> I tried to deploy Hive 3.1.2 on k8s. It worked on version 2.3.2.
> The metastore node and postgres node are OK, but the hiveserver looks like I am missing some 
> important configuration properties.
> {code:java}
>  {code}
>  
>  
>  
> {code:java}
> [root@master hive]# ./get.sh 
> NAME                         READY   STATUS    RESTARTS   AGE   IP             NODE              NOMINATED NODE   READINESS GATES
> hive-7bd48747d4-5zjmh        1/1     Running   0          56s   10.244.3.110   node03.51.local
> metastore-66b58f9f76-6wsxj   1/1     Running   0          56s   10.244.3.109   node03.51.local
> postgres-57794b99b7-pqxwm    1/1     Running   0          56s   10.244.2.241   node02.51.local
> NAME        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                       AGE   SELECTOR
> hive        NodePort    10.108.40.17                   10002:30626/TCP,1:31845/TCP   56s   app=hive
> metastore   ClusterIP   10.106.159.220                 9083/TCP                      56s   app=metastore
> postgres    ClusterIP   10.108.85.47                   5432/TCP                      56s   app=postgres
> {code}
>  
>  
> {code:java}
> [root@master hive]# kubectl logs hive-7bd48747d4-5zjmh -n=hive
> Configuring core
>  - Setting hadoop.proxyuser.hue.hosts=*
>  - Setting fs.defaultFS=hdfs://namenode.hadoop.svc.cluster.local:9000
>  - Setting hadoop.http.staticuser.user=root
>  - Setting hadoop.proxyuser.hue.groups=*
> Configuring hdfs
>  - Setting dfs.namenode.datanode.registration.ip-hostname-check=false
>  - Setting dfs.webhdfs.enabled=true
>  - Setting dfs.permissions.enabled=false
> Configuring yarn
>  - Setting yarn.timeline-service.enabled=true
>  - Setting yarn.resourcemanager.system-metrics-publisher.enabled=true
>  - Setting 
> yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
>  - Setting 
> yarn.log.server.url=http://historyserver.hadoop.svc.cluster.local:8188/applicationhistory/logs/
>  - Setting yarn.resourcemanager.fs.state-store.uri=/rmstate
>  - Setting yarn.timeline-service.generic-application-history.enabled=true
>  - Setting yarn.log-aggregation-enable=true
>  - Setting 
> yarn.resourcemanager.hostname=resourcemanager.hadoop.svc.cluster.local
>  - Setting 
> yarn.resourcemanager.resource.tracker.address=resourcemanager.hadoop.svc.cluster.local:8031
>  - Setting 
> yarn.timeline-service.hostname=historyserver.hadoop.svc.cluster.local
>  - Setting 
> yarn.resourcemanager.scheduler.address=resourcemanager.hadoop.svc.cluster.local:8030
>  - Setting 
> yarn.resourcemanager.address=resourcemanager.hadoop.svc.cluster.local:8032
>  - Setting yarn.nodemanager.remote-app-log-dir=/app-logs
>  - Setting yarn.resourcemanager.recovery.enabled=true
> Configuring httpfs
> Configuring kms
> Configuring mapred
> Configuring hive
>  - Setting datanucleus.autoCreateSchema=false
>  - Setting javax.jdo.option.ConnectionPassword=hive
>  - Setting 

[jira] [Work logged] (HIVE-23916) Fix Atlas client dependency version

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23916?focusedWorklogId=463490&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463490
 ]

ASF GitHub Bot logged work on HIVE-23916:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 05:36
Start Date: 27/Jul/20 05:36
Worklog Time Spent: 10m 
  Work Description: aasha commented on a change in pull request #1318:
URL: https://github.com/apache/hive/pull/1318#discussion_r460656242



##
File path: pom.xml
##
@@ -112,7 +112,7 @@
 1.5.7
 
 0.10.0
-2.0.0
+2.1.0

Review comment:
   Please move this to the exec module





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463490)
Time Spent: 20m  (was: 10m)

> Fix Atlas client dependency version
> ---
>
> Key: HIVE-23916
> URL: https://issues.apache.org/jira/browse/HIVE-23916
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23916.01.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23835) Repl Dump should dump function binaries to staging directory

2020-07-26 Thread Pravin Sinha (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Sinha updated HIVE-23835:

Attachment: HIVE-23835.04.patch

> Repl Dump should dump function binaries to staging directory
> 
>
> Key: HIVE-23835
> URL: https://issues.apache.org/jira/browse/HIVE-23835
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23835.01.patch, HIVE-23835.02.patch, 
> HIVE-23835.03.patch, HIVE-23835.04.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> {color:#172b4d}When a Hive function's binaries are on the source HDFS, repl dump 
> should dump them to the staging location in order to break the cross-cluster 
> visibility requirement.{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23835) Repl Dump should dump function binaries to staging directory

2020-07-26 Thread Pravin Sinha (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Sinha updated HIVE-23835:

Attachment: (was: HIVE-23835.04.patch)

> Repl Dump should dump function binaries to staging directory
> 
>
> Key: HIVE-23835
> URL: https://issues.apache.org/jira/browse/HIVE-23835
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23835.01.patch, HIVE-23835.02.patch, 
> HIVE-23835.03.patch, HIVE-23835.04.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> {color:#172b4d}When a Hive function's binaries are on the source HDFS, repl dump 
> should dump them to the staging location in order to break the cross-cluster 
> visibility requirement.{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23851?focusedWorklogId=463482&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463482
 ]

ASF GitHub Bot logged work on HIVE-23851:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 05:01
Start Date: 27/Jul/20 05:01
Worklog Time Spent: 10m 
  Work Description: shameersss1 commented on pull request #1271:
URL: https://github.com/apache/hive/pull/1271#issuecomment-664119961


   @kgyrtkirk Could you please take a look?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463482)
Time Spent: 1.5h  (was: 1h 20m)

> MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
> 
>
> Key: HIVE-23851
> URL: https://issues.apache.org/jira/browse/HIVE-23851
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> *Steps to reproduce:*
> # Create an external table
> # Run the msck command to sync all the partitions with the metastore
> # Remove one of the partition paths
> # Run msck repair with partition filtering
> *Stack Trace:*
> {code:java}
>  2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] 
> ppr.PartitionExpressionForMetastore: Failed to deserialize the expression
>  java.lang.IndexOutOfBoundsException: Index: 110, Size: 0
>  at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192]
>  at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>  at 
> org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80)
>  [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT]
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192]
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_192]
> {code}
> *Cause:*
> In case of msck repair with partition filtering, we expect the expression proxy 
> class to be set to PartitionExpressionForMetastore ( 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java#L78
>  ). While dropping partitions we serialize the drop-partition filter 
> expression as ( 
> https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L589
>  ), which is incompatible with the deserialization happening in 
> PartitionExpressionForMetastore ( 
> 

[jira] [Updated] (HIVE-23863) UGI doAs privilege action to make calls to Ranger Service

2020-07-26 Thread Anishek Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anishek Agarwal updated HIVE-23863:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to master. Thanks for the patch!

> UGI doAs privilege action  to make calls to Ranger Service
> --
>
> Key: HIVE-23863
> URL: https://issues.apache.org/jira/browse/HIVE-23863
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23863.01.patch, HIVE-23863.02.patch, 
> HIVE-23863.03.patch, UGI and Replication.pdf
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
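For reference, a minimal sketch of the UGI doAs pattern the issue title refers to. The Ranger call is a hypothetical placeholder and this is not the patch itself; only the Hadoop UserGroupInformation API usage is intended to be accurate.

{code:java}
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.security.UserGroupInformation;

public class UgiDoAsSketch {
  public static void main(String[] args) throws Exception {
    // Current login user (a proxy user created via createProxyUser would also work here).
    UserGroupInformation ugi = UserGroupInformation.getLoginUser();

    // Execute the remote call with the privileges of that user.
    String response = ugi.doAs((PrivilegedExceptionAction<String>) () -> callRangerService());
    System.out.println(response);
  }

  // Hypothetical stand-in for the actual Ranger REST/client call made during replication.
  private static String callRangerService() {
    return "ok";
  }
}
{code}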




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463446&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463446
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 01:41
Start Date: 27/Jul/20 01:41
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460606016



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java
##
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+
+// TODO : This class is duplicate of semi join. Need to do a refactoring to 
merge it with semi join.
+/**
+ * This class has methods for generating vectorized join results for Anti 
joins.
+ * The big difference between inner joins and anti joins is existence testing.
+ * Inner joins use a hash map to lookup the 1 or more small table values.
+ * Anti joins are a specialized join for outputting big table rows whose key 
exists

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463446)
Time Spent: 13h 40m  (was: 13.5h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 13h 40m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. The query for an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is added 
> to get the desired result. This causes:
>  # Extra computation — The left outer join projects the redundant columns 
> from the right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided with an anti join, as the anti join projects 
> only the required columns and rows from the left-side table.
>  # Extra shuffle — In case of an anti join, the duplicate records moved to the join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra memory usage - In case of a map-based anti join, a hash set is 
> sufficient, as just the key is required to check if a record matches the 
> join condition. In case of a left join, we need the key and the non-key columns 
> as well, and thus a hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS 

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463444&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463444
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 01:40
Start Date: 27/Jul/20 01:40
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460605836



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java
##
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+
+// TODO : This class is duplicate of semi join. Need to do a refactoring to 
merge it with semi join.
+/**
+ * This class has methods for generating vectorized join results for Anti 
joins.
+ * The big difference between inner joins and anti joins is existence testing.
+ * Inner joins use a hash map to lookup the 1 or more small table values.
+ * Anti joins are a specialized join for outputting big table rows whose key does 
+ * not exist in the small table.
+ *
+ * No small table values are needed for anti since they would be empty.  So,
+ * we use a hash set as the hash table.  Hash sets just report whether a key 
exists.  This
+ * is a big performance optimization.
+ */
+public abstract class VectorMapJoinAntiJoinGenerateResultOperator
+extends VectorMapJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final Logger LOG = 
LoggerFactory.getLogger(VectorMapJoinAntiJoinGenerateResultOperator.class.getName());
+
+  // Anti join specific members.
+
+  // An array of hash set results so we can do lookups on the whole batch 
before output result
+  // generation.
+  protected transient VectorMapJoinHashSetResult hashSetResults[];
+
+  // Pre-allocated member for storing the (physical) batch index of matching 
row (single- or
+  // multi-small-table-valued) indexes during a process call.
+  protected transient int[] allMatchs;
+
+  // Pre-allocated member for storing the (physical) batch index of rows that 
need to be spilled.
+  protected transient int[] spills;
+
+  // Pre-allocated member for storing index into the hashSetResults for each 
spilled row.
+  protected transient int[] spillHashMapResultIndices;
+
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinGenerateResultOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinGenerateResultOperator(CompilationOpContext ctx) 
{
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinGenerateResultOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+ VectorizationContext 
vContext, VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  /*
+   * Setup our anti join specific members.
+   */
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Anti join specific.
+VectorMapJoinHashSet baseHashSet = (VectorMapJoinHashSet) 
vectorMapJoinHashTable;
+
+hashSetResults = new 
VectorMapJoinHashSetResult[VectorizedRowBatch.DEFAULT_SIZE];
+for (int i = 0; i < 

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463443&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463443
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 01:40
Start Date: 27/Jul/20 01:40
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460605730



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -523,11 +533,19 @@ private boolean createForwardJoinObject(boolean[] skip) 
throws HiveException {
 forward = true;
   }
 }
+return forward;
+  }
+
+  // returns whether a record was forwarded
+  private boolean createForwardJoinObject(boolean[] skip, boolean antiJoin) 
throws HiveException {
+boolean forward = fillFwdCache(skip);
 if (forward) {
   if (needsPostEvaluation) {
 forward = !JoinUtil.isFiltered(forwardCache, residualJoinFilters, 
residualJoinFiltersOIs);
   }
-  if (forward) {
+
+  // For anti join, check all right side and if nothing is matched then 
only forward.

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463443)
Time Spent: 13h 20m  (was: 13h 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 13h 20m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. The query for an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is added 
> to get the desired result. This causes:
>  # Extra computation — The left outer join projects the redundant columns 
> from the right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided with an anti join, as the anti join projects 
> only the required columns and rows from the left-side table.
>  # Extra shuffle — In case of an anti join, the duplicate records moved to the join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra memory usage - In case of a map-based anti join, a hash set is 
> sufficient, as just the key is required to check if a record matches the 
> join condition. In case of a left join, we need the key and the non-key columns 
> as well, and thus a hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in the web_sales table in a typical 10TB 
> TPCDS setup is just 10% of the total records. So when we convert this query to an 
> anti join, instead of 7 billion rows, only 600 million rows are moved to the join 
> node.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes 
> care of subqueries with a “not exists” clause. Queries with “not exists” 
> are converted first to filter + left-join and then converted to an anti 
> join. Queries with “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.
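To make the memory and filtering argument above concrete, here is a minimal sketch (not Hive code; the key values are made up) of how a map-side anti join only needs a hash set of small-table keys and forwards a big-table row only when its key has no match:

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashSetAntiJoinSketch {
  public static void main(String[] args) {
    // Small table (web_sales): only the join keys are needed, so a hash set is enough.
    Set<Long> wsOrderNumbers = new HashSet<>(Arrays.asList(1L, 3L, 5L));

    // Big table (web_returns): the rows to probe against the small table.
    List<Long> wrOrderNumbers = Arrays.asList(1L, 2L, 3L, 4L);

    // Anti join: forward a row only when its key does NOT exist in the small table.
    for (Long wrOrderNumber : wrOrderNumbers) {
      if (!wsOrderNumbers.contains(wrOrderNumber)) {
        System.out.println("wr_order_number = " + wrOrderNumber); // prints 2 and 4
      }
    }
  }
}
{code}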



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463442&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463442
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 01:39
Start Date: 27/Jul/20 01:39
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460605498



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -638,6 +657,12 @@ private void genObject(int aliasNum, boolean allLeftFirst, 
boolean allLeftNull)
   // skipping the rest of the rows in the rhs table of the semijoin
   done = !needsPostEvaluation;
 }
+  } else if (type == JoinDesc.ANTI_JOIN) {
+if (innerJoin(skip, left, right)) {
+  // if anti join found a match then the condition is not matched for 
anti join, so we can skip rest of the

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463442)
Time Spent: 13h 10m  (was: 13h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 13h 10m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. The query for an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is added 
> to get the desired result. This causes:
>  # Extra computation — The left outer join projects the redundant columns 
> from the right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided with an anti join, as the anti join projects 
> only the required columns and rows from the left-side table.
>  # Extra shuffle — In case of an anti join, the duplicate records moved to the join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra memory usage - In case of a map-based anti join, a hash set is 
> sufficient, as just the key is required to check if a record matches the 
> join condition. In case of a left join, we need the key and the non-key columns 
> as well, and thus a hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in the web_sales table in a typical 10TB 
> TPCDS setup is just 10% of the total records. So when we convert this query to an 
> anti join, instead of 7 billion rows, only 600 million rows are moved to the join 
> node.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes 
> care of subqueries with a “not exists” clause. Queries with “not exists” 
> are converted first to filter + left-join and then converted to an anti 
> join. Queries with “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23896) hiveserver2 not listening on any port, am i miss some configurations?

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23896?focusedWorklogId=463440&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463440
 ]

ASF GitHub Bot logged work on HIVE-23896:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 01:22
Start Date: 27/Jul/20 01:22
Worklog Time Spent: 10m 
  Work Description: dh20 commented on pull request #1307:
URL: https://github.com/apache/hive/pull/1307#issuecomment-664071568


   Could someone help me take a look? This error seems to have nothing to do 
with what I submitted. Thanks a lot!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463440)
Time Spent: 20m  (was: 10m)

> hiveserver2 not listening on any port, am i miss some configurations?
> -
>
> Key: HIVE-23896
> URL: https://issues.apache.org/jira/browse/HIVE-23896
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 3.1.2
> Environment: hive: 3.1.2
> hadoop: 3.2.1, standalone, url: hdfs://namenode.hadoop.svc.cluster.local:9000
> {quote}$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
>  $ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
> {quote}
> Hadoop commands work in the hiveserver node (pod).
>  
>Reporter: alanwake
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>  
>  
> I tried to deploy Hive 3.1.2 on k8s. It worked on version 2.3.2.
> The metastore node and postgres node are OK, but the hiveserver looks like I am missing some 
> important configuration properties.
> {code:java}
>  {code}
>  
>  
>  
> {code:java}
> [root@master hive]# ./get.sh 
> NAME                         READY   STATUS    RESTARTS   AGE   IP             NODE              NOMINATED NODE   READINESS GATES
> hive-7bd48747d4-5zjmh        1/1     Running   0          56s   10.244.3.110   node03.51.local
> metastore-66b58f9f76-6wsxj   1/1     Running   0          56s   10.244.3.109   node03.51.local
> postgres-57794b99b7-pqxwm    1/1     Running   0          56s   10.244.2.241   node02.51.local
> NAME        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                       AGE   SELECTOR
> hive        NodePort    10.108.40.17                   10002:30626/TCP,1:31845/TCP   56s   app=hive
> metastore   ClusterIP   10.106.159.220                 9083/TCP                      56s   app=metastore
> postgres    ClusterIP   10.108.85.47                   5432/TCP                      56s   app=postgres
> {code}
>  
>  
> {code:java}
> [root@master hive]# kubectl logs hive-7bd48747d4-5zjmh -n=hive
> Configuring core
>  - Setting hadoop.proxyuser.hue.hosts=*
>  - Setting fs.defaultFS=hdfs://namenode.hadoop.svc.cluster.local:9000
>  - Setting hadoop.http.staticuser.user=root
>  - Setting hadoop.proxyuser.hue.groups=*
> Configuring hdfs
>  - Setting dfs.namenode.datanode.registration.ip-hostname-check=false
>  - Setting dfs.webhdfs.enabled=true
>  - Setting dfs.permissions.enabled=false
> Configuring yarn
>  - Setting yarn.timeline-service.enabled=true
>  - Setting yarn.resourcemanager.system-metrics-publisher.enabled=true
>  - Setting 
> yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
>  - Setting 
> yarn.log.server.url=http://historyserver.hadoop.svc.cluster.local:8188/applicationhistory/logs/
>  - Setting yarn.resourcemanager.fs.state-store.uri=/rmstate
>  - Setting yarn.timeline-service.generic-application-history.enabled=true
>  - Setting yarn.log-aggregation-enable=true
>  - Setting 
> yarn.resourcemanager.hostname=resourcemanager.hadoop.svc.cluster.local
>  - Setting 
> yarn.resourcemanager.resource.tracker.address=resourcemanager.hadoop.svc.cluster.local:8031
>  - Setting 
> yarn.timeline-service.hostname=historyserver.hadoop.svc.cluster.local
>  - Setting 
> yarn.resourcemanager.scheduler.address=resourcemanager.hadoop.svc.cluster.local:8030
>  - Setting 
> yarn.resourcemanager.address=resourcemanager.hadoop.svc.cluster.local:8032
>  - Setting yarn.nodemanager.remote-app-log-dir=/app-logs
>  - Setting yarn.resourcemanager.recovery.enabled=true
> Configuring httpfs
> Configuring kms
> Configuring mapred
> Configuring hive
>  - Setting datanucleus.autoCreateSchema=false
>  - Setting javax.jdo.option.ConnectionPassword=hive
>  - Setting hive.metastore.uris=thrift://metastore:9083
>  - Setting 
> 

[jira] [Work logged] (HIVE-23935) Fetching primaryKey through beeline fails with NPE

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23935?focusedWorklogId=463408&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463408
 ]

ASF GitHub Bot logged work on HIVE-23935:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 21:00
Start Date: 26/Jul/20 21:00
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on pull request #1321:
URL: https://github.com/apache/hive/pull/1321#issuecomment-664039641


   Reason -
   ``public boolean primarykeys(String line) throws Exception {
       return metadata("getPrimaryKeys", new String[] {
           beeLine.getConnection().getCatalog(), null,
           arg1(line, "table name"),});
   }``
   The DB name is passed as `null` from `Commands.java`, and that is a valid value, 
which is handled later in `MetaStoreDirectSql.java`.
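   For context, a minimal JDBC sketch of the same call path (connection URL and table name are hypothetical, this is not the fix). Per the JDBC contract a null catalog/schema simply means the value is not used to narrow the search, which is why the null coming from Beeline is legal and has to be tolerated on the server side:

{code:java}
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class PrimaryKeysSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
            DriverManager.getConnection("jdbc:hive2://localhost:10000/default")) {
      DatabaseMetaData md = conn.getMetaData();
      // null catalog and schema: do not filter on them, matching what !primarykeys passes down.
      try (ResultSet rs = md.getPrimaryKeys(null, null, "Persons")) {
        while (rs.next()) {
          System.out.println(rs.getString("COLUMN_NAME") + " -> " + rs.getString("PK_NAME"));
        }
      }
    }
  }
}
{code}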



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463408)
Time Spent: 20m  (was: 10m)

> Fetching primaryKey through beeline fails with NPE
> --
>
> Key: HIVE-23935
> URL: https://issues.apache.org/jira/browse/HIVE-23935
> Project: Hive
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Fetching the PrimaryKey of a table through Beeline !primarykeys fails with NPE
> {noformat}
> 0: jdbc:hive2://localhost:1> !primarykeys Persons
> Error: MetaException(message:java.lang.NullPointerException) (state=,code=0)
> org.apache.hive.service.cli.HiveSQLException: 
> MetaException(message:java.lang.NullPointerException)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:360)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:351)
>   at 
> org.apache.hive.jdbc.HiveDatabaseMetaData.getPrimaryKeys(HiveDatabaseMetaData.java:573)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hive.beeline.Reflector.invoke(Reflector.java:89)
>   at org.apache.hive.beeline.Commands.metadata(Commands.java:125)
>   at org.apache.hive.beeline.Commands.primarykeys(Commands.java:231)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hive.beeline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:57)
>   at 
> org.apache.hive.beeline.BeeLine.execCommandWithPrefix(BeeLine.java:1465)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1504)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1364)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1134)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1082)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:546)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:528)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:236){noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23935) Fetching primaryKey through beeline fails with NPE

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23935?focusedWorklogId=463407&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463407
 ]

ASF GitHub Bot logged work on HIVE-23935:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 20:56
Start Date: 26/Jul/20 20:56
Worklog Time Spent: 10m 
  Work Description: ayushtkn opened a new pull request #1321:
URL: https://github.com/apache/hive/pull/1321


   https://issues.apache.org/jira/browse/HIVE-23935
   
   Entire Trace -
   
   ```
   0: jdbc:hive2://localhost:1> !primarykeys Persons
   Error: MetaException(message:java.lang.NullPointerException) (state=,code=0)
   org.apache.hive.service.cli.HiveSQLException: 
MetaException(message:java.lang.NullPointerException)
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:360)
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:351)
at 
org.apache.hive.jdbc.HiveDatabaseMetaData.getPrimaryKeys(HiveDatabaseMetaData.java:573)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hive.beeline.Reflector.invoke(Reflector.java:89)
at org.apache.hive.beeline.Commands.metadata(Commands.java:125)
at org.apache.hive.beeline.Commands.primarykeys(Commands.java:231)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hive.beeline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:57)
at 
org.apache.hive.beeline.BeeLine.execCommandWithPrefix(BeeLine.java:1465)
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1504)
at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1364)
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1134)
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1082)
at 
org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:546)
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:528)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
   Caused by: org.apache.hive.service.cli.HiveSQLException: 
MetaException(message:java.lang.NullPointerException)
at 
org.apache.hive.service.cli.operation.GetPrimaryKeysOperation.runInternal(GetPrimaryKeysOperation.java:120)
at 
org.apache.hive.service.cli.operation.Operation.run(Operation.java:277)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.getPrimaryKeys(HiveSessionImpl.java:997)
at 
org.apache.hive.service.cli.CLIService.getPrimaryKeys(CLIService.java:416)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.GetPrimaryKeys(ThriftCLIService.java:838)
at 
org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetPrimaryKeys.getResult(TCLIService.java:1717)
at 
org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetPrimaryKeys.getResult(TCLIService.java:1702)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
   Caused by: MetaException(message:java.lang.NullPointerException)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:7921)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.throwMetaException(HiveMetaStore.java:9105)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_primary_keys(HiveMetaStore.java:9067)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 

[jira] [Updated] (HIVE-23935) Fetching primaryKey through beeline fails with NPE

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-23935:
--
Labels: pull-request-available  (was: )

> Fetching primaryKey through beeline fails with NPE
> --
>
> Key: HIVE-23935
> URL: https://issues.apache.org/jira/browse/HIVE-23935
> Project: Hive
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fetching the PrimaryKey of a table through Beeline !primarykeys fails with NPE
> {noformat}
> 0: jdbc:hive2://localhost:1> !primarykeys Persons
> Error: MetaException(message:java.lang.NullPointerException) (state=,code=0)
> org.apache.hive.service.cli.HiveSQLException: 
> MetaException(message:java.lang.NullPointerException)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:360)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:351)
>   at 
> org.apache.hive.jdbc.HiveDatabaseMetaData.getPrimaryKeys(HiveDatabaseMetaData.java:573)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hive.beeline.Reflector.invoke(Reflector.java:89)
>   at org.apache.hive.beeline.Commands.metadata(Commands.java:125)
>   at org.apache.hive.beeline.Commands.primarykeys(Commands.java:231)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hive.beeline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:57)
>   at 
> org.apache.hive.beeline.BeeLine.execCommandWithPrefix(BeeLine.java:1465)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1504)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1364)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1134)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1082)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:546)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:528)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:236){noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-23935) Fetching primaryKey through beeline fails with NPE

2020-07-26 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena reassigned HIVE-23935:
---


> Fetching primaryKey through beeline fails with NPE
> --
>
> Key: HIVE-23935
> URL: https://issues.apache.org/jira/browse/HIVE-23935
> Project: Hive
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>
> Fetching the PrimaryKey of a table through Beeline !primarykeys fails with NPE
> {noformat}
> 0: jdbc:hive2://localhost:1> !primarykeys Persons
> Error: MetaException(message:java.lang.NullPointerException) (state=,code=0)
> org.apache.hive.service.cli.HiveSQLException: 
> MetaException(message:java.lang.NullPointerException)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:360)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:351)
>   at 
> org.apache.hive.jdbc.HiveDatabaseMetaData.getPrimaryKeys(HiveDatabaseMetaData.java:573)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hive.beeline.Reflector.invoke(Reflector.java:89)
>   at org.apache.hive.beeline.Commands.metadata(Commands.java:125)
>   at org.apache.hive.beeline.Commands.primarykeys(Commands.java:231)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hive.beeline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:57)
>   at 
> org.apache.hive.beeline.BeeLine.execCommandWithPrefix(BeeLine.java:1465)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1504)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1364)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1134)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1082)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:546)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:528)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:236){noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-23934) Refactor TezCompiler#markSemiJoinForDPP to avoid redundant operations in nested while

2020-07-26 Thread Stamatis Zampetakis (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis reassigned HIVE-23934:
--


> Refactor TezCompiler#markSemiJoinForDPP to avoid redundant operations in 
> nested while
> -
>
> Key: HIVE-23934
> URL: https://issues.apache.org/jira/browse/HIVE-23934
> Project: Hive
>  Issue Type: Improvement
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Minor
>
> Most of the code inside the nested while loop can be extracted and computed 
> only once in the outer loop. Moreover, there are catch clauses for NPE 
> which seem rather predictable and could possibly be avoided by proper checks. 
>  
> The goal of this issue is to refactor the TezCompiler#markSemiJoinForDPP method 
> to avoid redundant operations and improve code readability. As a side effect 
> of this refactoring the method will be slightly more efficient, although 
> it is unlikely to make an observable difference in practice.
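A rough illustration of the described pattern with hypothetical types and names (not the actual TezCompiler code): loop-invariant work is hoisted out of the inner loop and the predictable NPE is replaced by an explicit check.

{code:java}
import java.util.List;
import java.util.Map;

public class HoistInvariantSketch {
  // Hypothetical shapes standing in for the operator-tree objects the real method walks.
  static class Probe { String sourceKey; }
  static class Target { Map<String, Long> statsByKey; }

  // Before: the key is re-read for every (probe, target) pair and a missing entry
  // surfaces as an unboxing NullPointerException that is caught and swallowed.
  static long before(List<Probe> probes, List<Target> targets) {
    long total = 0;
    for (Probe p : probes) {
      for (Target t : targets) {
        try {
          total += t.statsByKey.get(p.sourceKey); // NPE when the key is absent
        } catch (NullPointerException e) {
          // predictable and avoidable
        }
      }
    }
    return total;
  }

  // After: per-probe work is computed once in the outer loop and the NPE is
  // replaced by an explicit null check.
  static long after(List<Probe> probes, List<Target> targets) {
    long total = 0;
    for (Probe p : probes) {
      String key = p.sourceKey; // invariant across the inner loop
      if (key == null) {
        continue;
      }
      for (Target t : targets) {
        Long stat = t.statsByKey.get(key);
        if (stat != null) {
          total += stat;
        }
      }
    }
    return total;
  }
}
{code}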



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23835) Repl Dump should dump function binaries to staging directory

2020-07-26 Thread Pravin Sinha (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Sinha updated HIVE-23835:

Attachment: HIVE-23835.04.patch

> Repl Dump should dump function binaries to staging directory
> 
>
> Key: HIVE-23835
> URL: https://issues.apache.org/jira/browse/HIVE-23835
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23835.01.patch, HIVE-23835.02.patch, 
> HIVE-23835.03.patch, HIVE-23835.04.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> {color:#172b4d}When a Hive function's binaries are on the source HDFS, repl dump 
> should dump them to the staging location in order to break the cross-cluster 
> visibility requirement.{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23835) Repl Dump should dump function binaries to staging directory

2020-07-26 Thread Pravin Sinha (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Sinha updated HIVE-23835:

Attachment: (was: HIVE-23835.04.patch)

> Repl Dump should dump function binaries to staging directory
> 
>
> Key: HIVE-23835
> URL: https://issues.apache.org/jira/browse/HIVE-23835
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23835.01.patch, HIVE-23835.02.patch, 
> HIVE-23835.03.patch, HIVE-23835.04.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> {color:#172b4d}When a Hive function's binaries are on the source HDFS, repl dump 
> should dump them to the staging location in order to break the cross-cluster 
> visibility requirement.{color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23932) Test TypeCheckProcFactory reorg

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23932?focusedWorklogId=463391&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463391
 ]

ASF GitHub Bot logged work on HIVE-23932:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 18:49
Start Date: 26/Jul/20 18:49
Worklog Time Spent: 10m 
  Work Description: scarlin-cloudera opened a new pull request #1320:
URL: https://github.com/apache/hive/pull/1320


   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HIVE-X: Fix a typo in YYY)
   For more details, please see 
https://cwiki.apache.org/confluence/display/Hive/HowToContribute
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463391)
Time Spent: 1h 50m  (was: 1h 40m)

> Test TypeCheckProcFactory reorg
> ---
>
> Key: HIVE-23932
> URL: https://issues.apache.org/jira/browse/HIVE-23932
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: Steve Carlin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23932) Test TypeCheckProcFactory reorg

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23932?focusedWorklogId=463389&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463389
 ]

ASF GitHub Bot logged work on HIVE-23932:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 18:44
Start Date: 26/Jul/20 18:44
Worklog Time Spent: 10m 
  Work Description: scarlin-cloudera opened a new pull request #1319:
URL: https://github.com/apache/hive/pull/1319


   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HIVE-X: Fix a typo in YYY)
   For more details, please see 
https://cwiki.apache.org/confluence/display/Hive/HowToContribute
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463389)
Time Spent: 1h 40m  (was: 1.5h)

> Test TypeCheckProcFactory reorg
> ---
>
> Key: HIVE-23932
> URL: https://issues.apache.org/jira/browse/HIVE-23932
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: Steve Carlin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23932) Test TypeCheckProcFactory reorg

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23932?focusedWorklogId=463388&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463388
 ]

ASF GitHub Bot logged work on HIVE-23932:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 18:41
Start Date: 26/Jul/20 18:41
Worklog Time Spent: 10m 
  Work Description: scarlin-cloudera closed pull request #1319:
URL: https://github.com/apache/hive/pull/1319


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463388)
Time Spent: 1.5h  (was: 1h 20m)

> Test TypeCheckProcFactory reorg
> ---
>
> Key: HIVE-23932
> URL: https://issues.apache.org/jira/browse/HIVE-23932
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: Steve Carlin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23932) Test TypeCheckProcFactory reorg

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23932?focusedWorklogId=463385=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463385
 ]

ASF GitHub Bot logged work on HIVE-23932:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 17:51
Start Date: 26/Jul/20 17:51
Worklog Time Spent: 10m 
  Work Description: scarlin-cloudera opened a new pull request #1319:
URL: https://github.com/apache/hive/pull/1319


   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HIVE-X: Fix a typo in YYY)
   For more details, please see 
https://cwiki.apache.org/confluence/display/Hive/HowToContribute
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463385)
Time Spent: 1h 20m  (was: 1h 10m)

> Test TypeCheckProcFactory reorg
> ---
>
> Key: HIVE-23932
> URL: https://issues.apache.org/jira/browse/HIVE-23932
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: Steve Carlin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23932) Test TypeCheckProcFactory reorg

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23932?focusedWorklogId=463383=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463383
 ]

ASF GitHub Bot logged work on HIVE-23932:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 17:46
Start Date: 26/Jul/20 17:46
Worklog Time Spent: 10m 
  Work Description: scarlin-cloudera opened a new pull request #1316:
URL: https://github.com/apache/hive/pull/1316


   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HIVE-X: Fix a typo in YYY)
   For more details, please see 
https://cwiki.apache.org/confluence/display/Hive/HowToContribute
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463383)
Time Spent: 1h 10m  (was: 1h)

> Test TypeCheckProcFactory reorg
> ---
>
> Key: HIVE-23932
> URL: https://issues.apache.org/jira/browse/HIVE-23932
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: Steve Carlin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23932) Test TypeCheckProcFactory reorg

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23932?focusedWorklogId=463382=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463382
 ]

ASF GitHub Bot logged work on HIVE-23932:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 17:46
Start Date: 26/Jul/20 17:46
Worklog Time Spent: 10m 
  Work Description: scarlin-cloudera closed pull request #1316:
URL: https://github.com/apache/hive/pull/1316


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463382)
Time Spent: 1h  (was: 50m)

> Test TypeCheckProcFactory reorg
> ---
>
> Key: HIVE-23932
> URL: https://issues.apache.org/jira/browse/HIVE-23932
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: Steve Carlin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463351=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463351
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:38
Start Date: 26/Jul/20 12:38
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460522562



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+    return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+    super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+    super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, OperatorDesc conf,
+                                           VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException {
+    super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+    super.commonSetup();
+
+    // Initialize Single-Column Long members for this specialized class.
+    singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+    super.hashTableSetup();
+
+    // Get our Single-Column Long hash set information for this specialized class.
+    hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+    useMinMax = hashSet.useMinMax();
+    if (useMinMax) {
+      min = hashSet.min();
+      max = hashSet.max();
+    }
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+    try {
+      // (Currently none)
+      // antiPerBatchSetup(batch);
+
+      // For anti joins, we may apply the filter(s) now.
+      for (VectorExpression ve : bigTableFilterExpressions) {
+        ve.evaluate(batch);
+      }
+

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463350=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463350
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:37
Start Date: 26/Jul/20 12:37
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460522454



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+    return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+    super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+    super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, OperatorDesc conf,
+                                           VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException {
+    super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+    super.commonSetup();
+
+    // Initialize Single-Column Long members for this specialized class.
+    singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+    super.hashTableSetup();
+
+    // Get our Single-Column Long hash set information for this specialized class.
+    hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+    useMinMax = hashSet.useMinMax();
+    if (useMinMax) {
+      min = hashSet.min();
+      max = hashSet.max();
+    }
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+    try {
+      // (Currently none)
+      // antiPerBatchSetup(batch);
+
+      // For anti joins, we may apply the filter(s) now.
+      for (VectorExpression ve : bigTableFilterExpressions) {
+        ve.evaluate(batch);
+      }
+

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463348=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463348
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:36
Start Date: 26/Jul/20 12:36
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460522261



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinStringOperator.java
##
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.StringExpr;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinBytesHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// Single-Column String hash table import.
+// Single-Column String specific imports.
+
+// TODO : Duplicate codes need to merge with semi join.
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a Single-Column String
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinStringOperator extends VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+
+  //---------------------------------------------------------------
+
+  private static final String CLASS_NAME = VectorMapJoinAntiJoinStringOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+
+  protected String getLoggingPrefix() {
+    return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  //---------------------------------------------------------------
+
+  // (none)
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+  //---
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinBytesHashSet hashSet;
+
+  //---
+  // Single-Column String specific members.
+  //
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  //---
+  // Pass-thru constructors.
+  //
+
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinStringOperator() {
+    super();
+  }
+
+  public VectorMapJoinAntiJoinStringOperator(CompilationOpContext ctx) {
+    super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinStringOperator(CompilationOpContext ctx, OperatorDesc conf,
+                                             VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException {
+    super(ctx, conf, vContext, vectorDesc);
+  }
+
+  //---
+  // Process Single-Column String anti Join on a vectorized row batch.
+  //
+
+  @Override
+  protected void commonSetup() throws HiveException {
+    super.commonSetup();
+
+    /*
+     * Initialize Single-Column String members for this specialized class.
+     */
+
+    singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+    super.hashTableSetup();
+
+/*
+ * Get our Single-Column String hash set information for this 

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463349=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463349
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:36
Start Date: 26/Jul/20 12:36
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460522312



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+    return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+    super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+    super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, OperatorDesc conf,
+                                           VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException {
+    super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+    super.commonSetup();
+
+    // Initialize Single-Column Long members for this specialized class.
+    singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+    super.hashTableSetup();
+
+    // Get our Single-Column Long hash set information for this specialized class.
+    hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+    useMinMax = hashSet.useMinMax();
+    if (useMinMax) {
+      min = hashSet.min();
+      max = hashSet.max();
+    }
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+    try {
+      // (Currently none)
+      // antiPerBatchSetup(batch);
+
+      // For anti joins, we may apply the filter(s) now.
+      for (VectorExpression ve : bigTableFilterExpressions) {
+        ve.evaluate(batch);
+      }
+

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463346=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463346
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:27
Start Date: 26/Jul/20 12:27
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460521384



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -509,11 +513,17 @@ protected void addToAliasFilterTags(byte alias, List object, boolean isN
 }
   }
 
+  private void createForwardJoinObjectForAntiJoin(boolean[] skip) throws HiveException {
+    boolean forward = fillFwdCache(skip);

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463346)
Time Spent: 12h 20m  (was: 12h 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 12h 20m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided in case of anti join, as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra Memory Usage - In case of map-based anti join, a hash set is 
> sufficient, as just the key is required to check if the record matches the 
> join condition. In case of left join, we need the key and the non-key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then it is converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.
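
As a concrete illustration, here is a minimal SQL sketch of the two query shapes the description above refers to. Table and column names are taken from the example in the description; whether the optimizer actually applies the anti-join conversion depends on the build and configuration, so treat the resulting plan as an assumption rather than a guarantee.

{code:sql}
-- "not exists" form: per the description, this is first rewritten to a
-- left outer join plus an IS NULL filter on the right-side join key.
SELECT wr_order_number
FROM web_returns
WHERE NOT EXISTS (SELECT 1
                  FROM web_sales
                  WHERE ws_order_number = wr_order_number);

-- Equivalent hand-written left outer join + null filter form (the query from
-- the description). The project->filter->left-join plan shape produced here
-- is what the new rule matches and converts to project->anti-join.
SELECT wr_order_number
FROM web_returns LEFT JOIN web_sales
  ON wr_order_number = ws_order_number
WHERE ws_order_number IS NULL;
{code}

Both forms return the web_returns orders with no matching ws_order_number in web_sales; the anti join produces that result directly, without projecting the right-side columns or carrying the duplicate matching rows through the join.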



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463345=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463345
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:26
Start Date: 26/Jul/20 12:26
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460521246



##
File path: parser/src/java/org/apache/hadoop/hive/ql/parse/FromClauseParser.g
##
@@ -145,6 +145,7 @@ joinToken
 | KW_RIGHT (KW_OUTER)? KW_JOIN -> TOK_RIGHTOUTERJOIN
 | KW_FULL  (KW_OUTER)? KW_JOIN -> TOK_FULLOUTERJOIN
 | KW_LEFT KW_SEMI KW_JOIN  -> TOK_LEFTSEMIJOIN
+| KW_ANTI KW_JOIN  -> TOK_ANTIJOIN

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463345)
Time Spent: 12h 10m  (was: 12h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 12h 10m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided in case of anti join, as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra Memory Usage - In case of map-based anti join, a hash set is 
> sufficient, as just the key is required to check if the record matches the 
> join condition. In case of left join, we need the key and the non-key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then it is converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463344=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463344
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:24
Start Date: 26/Jul/20 12:24
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460521056



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -153,6 +153,8 @@
 
   transient boolean hasLeftSemiJoin = false;
 
+  transient boolean hasAntiJoin = false;

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463344)
Time Spent: 12h  (was: 11h 50m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 12h
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided in case of anti join, as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra Memory Usage - In case of map-based anti join, a hash set is 
> sufficient, as just the key is required to check if the record matches the 
> join condition. In case of left join, we need the key and the non-key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then it is converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463343=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463343
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:23
Start Date: 26/Jul/20 12:23
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460520930



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelFactories.java
##
@@ -188,6 +193,20 @@ public RelNode createSemiJoin(RelNode left, RelNode right,
 }
   }
 
+  /**
+   * Implementation of {@link AntiJoinFactory} that returns
+   * {@link org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAntiJoin}
+   * .
+   */
+  private static class HiveAntiJoinFactoryImpl implements SemiJoinFactory {

Review comment:
   HiveAntiJoinFactoryImpl is removed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463343)
Time Spent: 11h 50m  (was: 11h 40m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 11h 50m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided in case of anti join, as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra Memory Usage - In case of map-based anti join, a hash set is 
> sufficient, as just the key is required to check if the record matches the 
> join condition. In case of left join, we need the key and the non-key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then it is converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463342=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463342
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:23
Start Date: 26/Jul/20 12:23
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460520903



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptMaterializationValidator.java
##
@@ -253,6 +256,14 @@ private RelNode visit(HiveSemiJoin semiJoin) {
 return visitChildren(semiJoin);
   }
 
+  // Note: Not currently part of the HiveRelNode interface
+  private RelNode visit(HiveAntiJoin antiJoin) {

Review comment:
   Not sure ..copy pasted from semi join.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463342)
Time Spent: 11h 40m  (was: 11.5h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 11h 40m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided in case of anti join, as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra Memory Usage - In case of map-based anti join, a hash set is 
> sufficient, as just the key is required to check if the record matches the 
> join condition. In case of left join, we need the key and the non-key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then it is converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463341=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463341
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:20
Start Date: 26/Jul/20 12:20
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460520647



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveSubQRemoveRelBuilder.java
##
@@ -1112,7 +1112,7 @@ public RexNode field(RexNode e, String name) {
   }
 
   public HiveSubQRemoveRelBuilder join(JoinRelType joinType, RexNode condition,
-   Set variablesSet, boolean createSemiJoin) {
+   Set variablesSet, JoinRelType semiJoinType) {

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463341)
Time Spent: 11.5h  (was: 11h 20m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 11.5h
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided in case of anti join, as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra Memory Usage - In case of map-based anti join, a hash set is 
> sufficient, as just the key is required to check if the record matches the 
> join condition. In case of left join, we need the key and the non-key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then it is converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463340=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463340
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:10
Start Date: 26/Jul/20 12:10
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460519523



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java
##
@@ -56,6 +57,9 @@
   public static final HiveJoinAddNotNullRule INSTANCE_SEMIJOIN =
  new HiveJoinAddNotNullRule(HiveSemiJoin.class, HiveRelFactories.HIVE_FILTER_FACTORY);

+  public static final HiveJoinAddNotNullRule INSTANCE_ANTIJOIN =
+  new HiveJoinAddNotNullRule(HiveAntiJoin.class, HiveRelFactories.HIVE_FILTER_FACTORY);

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463340)
Time Spent: 11h 20m  (was: 11h 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 11h 20m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided in case of anti join, as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra Memory Usage - In case of map-based anti join, a hash set is 
> sufficient, as just the key is required to check if the record matches the 
> join condition. In case of left join, we need the key and the non-key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then it is converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463339=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463339
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:05
Start Date: 26/Jul/20 12:05
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460519111



##
File path: ql/src/java/org/apache/hadoop/hive/ql/plan/VectorMapJoinDesc.java
##
@@ -89,7 +89,8 @@ public PrimitiveTypeInfo getPrimitiveTypeInfo() {
 INNER_BIG_ONLY,
 LEFT_SEMI,
 OUTER,
-FULL_OUTER
+FULL_OUTER,
+ANTI

Review comment:
   LEFT_ANTI





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463339)
Time Spent: 11h 10m  (was: 11h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided in case of anti join, as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra Memory Usage - In case of map-based anti join, a hash set is 
> sufficient, as just the key is required to check if the record matches the 
> join condition. In case of left join, we need the key and the non-key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then it is converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463338=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463338
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:04
Start Date: 26/Jul/20 12:04
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518974



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463338)
Time Spent: 11h  (was: 10h 50m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 11h
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This can be avoided in case of anti join, as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce a significant amount 
> of data movement if the number of distinct rows (join keys) is significant.
>  # Extra Memory Usage - In case of map-based anti join, a hash set is 
> sufficient, as just the key is required to check if the record matches the 
> join condition. In case of left join, we need the key and the non-key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” 

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463337=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463337
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:03
Start Date: 26/Jul/20 12:03
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518799



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class);
+  public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new HiveJoinWithFilterToAntiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //    HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveJoinWithFilterToAntiJoinRule() {
+    super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+        "HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+    final Project project = call.rel(0);
+    final Filter filter = call.rel(1);
+    final Join join = call.rel(2);
+    perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) {
+    LOG.debug("Matched HiveAntiJoinRule");
+
+    if (join.getCondition().isAlwaysTrue()) {
+      return;
+    }
+
+    // We support conversion from left outer join only.
+    if (join.getJoinType() != JoinRelType.LEFT) {
+      return;
+    }
+
+    assert (filter != null);
+
+    List<RexNode> aboveFilters = RelOptUtil.conjunctions(filter.getCondition());
+    boolean hasIsNull = false;
+
+    // Get all the filter conditions and check whether any of them is an "is null" kind.
+    for (RexNode filterNode : aboveFilters) {
+      if (filterNode.getKind() == SqlKind.IS_NULL &&
+          isFilterFromRightSide(join, filterNode, join.getJoinType())) {
+        hasIsNull = true;
+        break;
+      }
+    }
+
+    // The "is null" should be on a key from the right side of the join.
+    if (!hasIsNull) {
+      return;
+    }
+
+    // Build an anti join with the same left and right children and condition as the original left outer join.
+    Join anti = join.copy(join.getTraitSet(), join.getCondition(),
+        join.getLeft(), join.getRight(), JoinRelType.ANTI, false);
+
+    //TODO : Do we really need it
+    call.getPlanner().onCopy(join, anti);
+
+    RelNode newProject = getNewProjectNode(project, anti);
+    if (newProject != null) {
+      call.getPlanner().onCopy(project, newProject);

Review comment:
   done
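
For context, a Calcite rule like the one above is normally applied by a planner rather than called directly. The following is a minimal sketch of such an application, assuming an already-built RelNode plan of the shape Project -> Filter(IS NULL) -> LeftJoin; it is an illustration only, not code from the patch:

{code:java}
import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveJoinWithFilterToAntiJoinRule;

public final class AntiJoinRuleApplicationSketch {
  // Hypothetical helper: runs only the anti-join rewrite rule over an existing plan.
  public static RelNode applyAntiJoinRule(RelNode plan) {
    HepProgram program = new HepProgramBuilder()
        .addRuleInstance(HiveJoinWithFilterToAntiJoinRule.INSTANCE)
        .build();
    HepPlanner planner = new HepPlanner(program);
    planner.setRoot(plan);         // plan shaped Project -> Filter(IS NULL) -> LeftJoin
    return planner.findBestExp();  // becomes Project -> AntiJoin when the pattern matches
  }
}
{code}

If the IS NULL filter is not on a right-side join key, the rule returns without transforming anything, so the planner simply hands back the original plan.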

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the 

[jira] [Updated] (HIVE-23933) Add getRowCountInt and getJoinDistinctRowCount support for anti join in calcite.

2020-07-26 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-23933:
---
Summary: Add getRowCountInt and getJoinDistinctRowCount support for  anti 
join in calcite.   (was: Add getRowCountInt support for  anti join in calcite. )

> Add getRowCountInt and getJoinDistinctRowCount support for  anti join in 
> calcite. 
> --
>
> Key: HIVE-23933
> URL: https://issues.apache.org/jira/browse/HIVE-23933
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The current Calcite (1.21) does not support getRowCountInt for anti join. The 
> selectivity calculation for anti join should be different from that of semi 
> join: it should be 1 - (semi join selectivity).
> getJoinDistinctRowCount also needs to be handled in Calcite.
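
As a rough numeric sketch of the "1 - semi join selectivity" point (the figures are hypothetical and only illustrate the arithmetic):

{code:java}
public final class AntiJoinSelectivitySketch {
  public static void main(String[] args) {
    double leftRows = 1_000_000d;      // assumed left input row count
    double semiJoinSelectivity = 0.1;  // assumed selectivity of the join predicate
    // Semi join keeps the matching left rows; anti join keeps the complement.
    System.out.println(leftRows * semiJoinSelectivity);        // ~100000 rows
    System.out.println(leftRows * (1 - semiJoinSelectivity));  // ~900000 rows
  }
}
{code}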



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463336=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463336
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 11:59
Start Date: 26/Jul/20 11:59
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518454



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java
##
@@ -79,6 +80,11 @@ public Double getDistinctRowCount(HiveSemiJoin rel, RelMetadataQuery mq, Immutab
     return super.getDistinctRowCount(rel, mq, groupKey, predicate);
   }
 
+  public Double getDistinctRowCount(HiveAntiJoin rel, RelMetadataQuery mq, ImmutableBitSet groupKey,
+      RexNode predicate) {
+    return super.getDistinctRowCount(rel, mq, groupKey, predicate);

Review comment:
   https://issues.apache.org/jira/browse/HIVE-23933





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463336)
Time Spent: 10h 40m  (was: 10.5h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 10h 40m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that requires an anti join 
> is converted to a left outer join, and a null filter on the right-side join 
> key is added to get the desired result. This causes:
>  # Extra computation - The left outer join projects redundant columns from 
> the right side, and additional filtering is needed to remove the redundant 
> rows. An anti join avoids this by projecting only the required columns and 
> rows from the left-side table.
>  # Extra shuffle - With an anti join, duplicate records do not have to be 
> moved from the child node to the join node. This can save a significant 
> amount of data movement when the number of distinct rows (join keys) is much 
> smaller than the total row count.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient, 
> since only the key is needed to check whether a record matches the join 
> condition. A left join needs the key and the non-key columns as well, so a 
> hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in the web_sales table in a 
> typical 10TB TPCDS setup is just 10% of the total records. So when this query 
> is converted to an anti join, only 600 million rows are moved to the join 
> node instead of 7 billion.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes 
> care of subqueries with a “not exists” clause: such queries are converted 
> first to filter + left-join, and that pattern is then converted to anti join. 
> Queries with “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.
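
To make the "extra memory usage" point above concrete, here is a small self-contained sketch with toy data (nothing here is taken from the patch): an anti join probe only needs membership of the right-side keys, whereas the left-outer-join-plus-IS-NULL plan also has to carry right-side payload columns in its hash table.

{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class AntiJoinMemorySketch {
  public static void main(String[] args) {
    // Toy inputs: left-side join keys and the set of right-side join keys.
    List<Long> leftKeys = List.of(1L, 2L, 3L, 4L);
    Set<Long> rightKeys = new HashSet<>(List.of(2L, 4L));  // keys only, no payload columns

    // Anti join semantics: keep the left rows that have no match on the right.
    List<Long> result = new ArrayList<>();
    for (Long key : leftKeys) {
      if (!rightKeys.contains(key)) {
        result.add(key);
      }
    }
    System.out.println(result);  // [1, 3]
  }
}
{code}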



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463335=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463335
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 11:59
Start Date: 26/Jul/20 11:59
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518411



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdRowCount.java
##
@@ -118,6 +119,15 @@ public Double getRowCount(HiveJoin join, RelMetadataQuery mq) {
   }
 
   public Double getRowCount(HiveSemiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  public Double getRowCount(HiveAntiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  private Double getRowCountInt(Join rel, RelMetadataQuery mq) {

Review comment:
   super.getRowCount(rel, mq) does not support Anti join. I think we need 
to handle it.
   https://issues.apache.org/jira/browse/HIVE-23933
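
The linked JIRA is about supplying the anti join estimate itself. A rough sketch of what such an estimate could look like once a semi-join-style selectivity is available (an assumption for illustration, not the HIVE-23933 implementation):

{code:java}
import org.apache.calcite.rel.core.Join;
import org.apache.calcite.rel.metadata.RelMetadataQuery;

final class AntiJoinRowCountSketch {
  // Hypothetical estimate: left rows that find no match on the right,
  // i.e. leftRowCount * (1 - selectivity of the join condition).
  static Double estimateAntiJoinRowCount(Join join, RelMetadataQuery mq) {
    Double leftRows = mq.getRowCount(join.getLeft());
    Double selectivity = mq.getSelectivity(join, join.getCondition());
    if (leftRows == null || selectivity == null) {
      return null;
    }
    return leftRows * (1.0 - selectivity);
  }
}
{code}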





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463335)
Time Spent: 10.5h  (was: 10h 20m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that requires an anti join 
> is converted to a left outer join, and a null filter on the right-side join 
> key is added to get the desired result. This causes:
>  # Extra computation - The left outer join projects redundant columns from 
> the right side, and additional filtering is needed to remove the redundant 
> rows. An anti join avoids this by projecting only the required columns and 
> rows from the left-side table.
>  # Extra shuffle - With an anti join, duplicate records do not have to be 
> moved from the child node to the join node. This can save a significant 
> amount of data movement when the number of distinct rows (join keys) is much 
> smaller than the total row count.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient, 
> since only the key is needed to check whether a record matches the join 
> condition. A left join needs the key and the non-key columns as well, so a 
> hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in the web_sales table in a 
> typical 10TB TPCDS setup is just 10% of the total records. So when this query 
> is converted to an anti join, only 600 million rows are moved to the join 
> node instead of 7 billion.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes 
> care of subqueries with a “not exists” clause: such queries are converted 
> first to filter + left-join, and that pattern is then converted to anti join. 
> Queries with “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.
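
As a worked illustration of the conversion path described above (an illustration, not text from the ticket): the same query can be written with a "not exists" subquery, for example select wr_order_number from web_returns where not exists (select 1 from web_sales where ws_order_number = wr_order_number). The planner first rewrites that subquery into the left outer join plus IS NULL filter shown above, and the new rule then collapses that pattern into a single anti join.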



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23933) Add getRowCountInt support for anti join in calcite.

2020-07-26 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-23933:
---
Description: 
The current Calcite (1.21) does not support getRowCountInt for anti join. The 
selectivity calculation for anti join should be different from that of semi 
join: it should be 1 - (semi join selectivity).

getJoinDistinctRowCount also needs to be handled in Calcite.

  was:The current Calcite (1.21) does not support getRowCountInt for anti 
join. The selectivity calculation for anti join should be different from that 
of semi join: it should be 1 - (semi join selectivity).


> Add getRowCountInt support for  anti join in calcite. 
> --
>
> Key: HIVE-23933
> URL: https://issues.apache.org/jira/browse/HIVE-23933
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The current Calcite (1.21) does not support getRowCountInt for anti join. The 
> selectivity calculation for anti join should be different from that of semi 
> join: it should be 1 - (semi join selectivity).
> getJoinDistinctRowCount also needs to be handled in Calcite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23933) Add getRowCountInt support for anti join in calcite.

2020-07-26 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera updated HIVE-23933:
---
Description: The current Calcite (1.21) does not support getRowCountInt for 
anti join. The selectivity calculation for anti join should be different from 
that of semi join: it should be 1 - (semi join selectivity).  (was: The current 
anti join conversion does not support direct conversion of not-exists to anti 
join. The not-exists subquery is converted first to a left outer join and then 
converted to an anti join. This may cause some of the optimization rules to be 
skipped.

 )

> Add getRowCountInt support for  anti join in calcite. 
> --
>
> Key: HIVE-23933
> URL: https://issues.apache.org/jira/browse/HIVE-23933
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The current Calcite (1.21) does not support getRowCountInt for anti join. The 
> selectivity calculation for anti join should be different from that of semi 
> join: it should be 1 - (semi join selectivity).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-23933) Add getRowCountInt support for anti join in calcite.

2020-07-26 Thread mahesh kumar behera (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-23933:
--


> Add getRowCountInt support for  anti join in calcite. 
> --
>
> Key: HIVE-23933
> URL: https://issues.apache.org/jira/browse/HIVE-23933
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>
> The current anti join conversion does not support direct conversion of 
> not-exists to anti join. The not-exists subquery is converted first to a left 
> outer join and then converted to an anti join. This may cause some of the 
> optimization rules to be skipped.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463334=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463334
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 11:34
Start Date: 26/Jul/20 11:34
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460515695



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdRowCount.java
##
@@ -118,6 +119,15 @@ public Double getRowCount(HiveJoin join, RelMetadataQuery mq) {
   }
 
   public Double getRowCount(HiveSemiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  public Double getRowCount(HiveAntiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  private Double getRowCountInt(Join rel, RelMetadataQuery mq) {

Review comment:
   Yes done.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463334)
Time Spent: 10h 20m  (was: 10h 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that requires an anti join 
> is converted to a left outer join, and a null filter on the right-side join 
> key is added to get the desired result. This causes:
>  # Extra computation - The left outer join projects redundant columns from 
> the right side, and additional filtering is needed to remove the redundant 
> rows. An anti join avoids this by projecting only the required columns and 
> rows from the left-side table.
>  # Extra shuffle - With an anti join, duplicate records do not have to be 
> moved from the child node to the join node. This can save a significant 
> amount of data movement when the number of distinct rows (join keys) is much 
> smaller than the total row count.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient, 
> since only the key is needed to check whether a record matches the join 
> condition. A left join needs the key and the non-key columns as well, so a 
> hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in the web_sales table in a 
> typical 10TB TPCDS setup is just 10% of the total records. So when this query 
> is converted to an anti join, only 600 million rows are moved to the join 
> node instead of 7 billion.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes 
> care of subqueries with a “not exists” clause: such queries are converted 
> first to filter + left-join, and that pattern is then converted to anti join. 
> Queries with “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=46=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-46
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 11:29
Start Date: 26/Jul/20 11:29
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460515257



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRemoveGBYSemiJoinRule.java
##
@@ -41,17 +41,19 @@
 
   public HiveRemoveGBYSemiJoinRule() {
     super(
-        operand(HiveSemiJoin.class,
+        operand(Join.class,
             some(
                 operand(RelNode.class, any()),
                 operand(Aggregate.class, any()))),
         HiveRelFactories.HIVE_BUILDER, "HiveRemoveGBYSemiJoinRule");
   }
 
   @Override public void onMatch(RelOptRuleCall call) {
-    final HiveSemiJoin semijoin= call.rel(0);
+    final Join join= call.rel(0);

Review comment:
   done
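
Since the operand now matches Join rather than HiveSemiJoin, a rule generalized this way typically has to check the join type before removing the group-by on the right input. The following is a hedged sketch of that kind of guard, an assumption about the shape such code usually takes rather than a quote from the patch:

{code:java}
import org.apache.calcite.rel.core.Join;
import org.apache.calcite.rel.core.JoinRelType;

final class GbyRemovalGuardSketch {
  // Only semi and anti joins emit no right-side columns, so only they can
  // safely drop a distinct-producing group-by on the right input.
  static boolean qualifiesForGbyRemoval(Join join) {
    return join.getJoinType() == JoinRelType.SEMI || join.getJoinType() == JoinRelType.ANTI;
  }
}
{code}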





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 46)
Time Spent: 10h 10m  (was: 10h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that requires an anti join 
> is converted to a left outer join, and a null filter on the right-side join 
> key is added to get the desired result. This causes:
>  # Extra computation - The left outer join projects redundant columns from 
> the right side, and additional filtering is needed to remove the redundant 
> rows. An anti join avoids this by projecting only the required columns and 
> rows from the left-side table.
>  # Extra shuffle - With an anti join, duplicate records do not have to be 
> moved from the child node to the join node. This can save a significant 
> amount of data movement when the number of distinct rows (join keys) is much 
> smaller than the total row count.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient, 
> since only the key is needed to check whether a record matches the join 
> condition. A left join needs the key and the non-key columns as well, so a 
> hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in the web_sales table in a 
> typical 10TB TPCDS setup is just 10% of the total records. So when this query 
> is converted to an anti join, only 600 million rows are moved to the join 
> node instead of 7 billion.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes 
> care of subqueries with a “not exists” clause: such queries are converted 
> first to filter + left-join, and that pattern is then converted to anti join. 
> Queries with “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23916) Fix Atlas client dependency version

2020-07-26 Thread Pravin Sinha (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Sinha updated HIVE-23916:

Attachment: HIVE-23916.01.patch

> Fix Atlas client dependency version
> ---
>
> Key: HIVE-23916
> URL: https://issues.apache.org/jira/browse/HIVE-23916
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23916.01.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23916) Fix Atlas client dependency version

2020-07-26 Thread Pravin Sinha (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Sinha updated HIVE-23916:

Attachment: (was: HIVE-23916.01.patch)

> Fix Atlas client dependency version
> ---
>
> Key: HIVE-23916
> URL: https://issues.apache.org/jira/browse/HIVE-23916
> Project: Hive
>  Issue Type: Task
>Reporter: Pravin Sinha
>Assignee: Pravin Sinha
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23916.01.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)