[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=625457&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-625457
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 20/Jul/21 12:12
Start Date: 20/Jul/21 12:12
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #2004:
URL: https://github.com/apache/hive/pull/2004


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 625457)
Time Spent: 3h  (was: 2h 50m)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=625064&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-625064
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 20/Jul/21 10:01
Start Date: 20/Jul/21 10:01
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #2004:
URL: https://github.com/apache/hive/pull/2004


   




Issue Time Tracking
---

Worklog Id: (was: 625064)
Time Spent: 2h 50m  (was: 2h 40m)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-07-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=624709&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-624709
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 20/Jul/21 00:08
Start Date: 20/Jul/21 00:08
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #2004:
URL: https://github.com/apache/hive/pull/2004


   




Issue Time Tracking
---

Worklog Id: (was: 624709)
Time Spent: 2h 40m  (was: 2.5h)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-07-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=621409&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-621409
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 12/Jul/21 00:09
Start Date: 12/Jul/21 00:09
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] commented on pull request #2004:
URL: https://github.com/apache/hive/pull/2004#issuecomment-877884465


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in need of reviews.




Issue Time Tracking
---

Worklog Id: (was: 621409)
Time Spent: 2.5h  (was: 2h 20m)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-05-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=600301&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-600301
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 21/May/21 11:40
Start Date: 21/May/21 11:40
Worklog Time Spent: 10m 
  Work Description: pgaref opened a new pull request #2305:
URL: https://github.com/apache/hive/pull/2305


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 600301)
Time Spent: 2h 20m  (was: 2h 10m)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-05-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595452&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595452
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 12/May/21 16:23
Start Date: 12/May/21 16:23
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2004:
URL: https://github.com/apache/hive/pull/2004#discussion_r631202108



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableWrapper.java
##
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast;
+
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.io.BytesWritable;
+
+import java.io.IOException;
+
+public abstract class VectorMapJoinFastHashTableWrapper {
+
+  public abstract long calculateLongHashCode(long key, BytesWritable currentKey) throws HiveException, IOException;
+
+  public abstract long deserializeToKey(BytesWritable currentKey) throws HiveException, IOException;

Review comment:
   Maybe provide a default implementation of deserializeToKey() that throws an
exception or returns 0, and have only the Long implementations override it?
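
The suggestion in this review comment could look roughly like the sketch below. It is not code from the PR: `HashTableWrapperSketch`, `LongWrapperSketch` and `BytesWrapperSketch` are hypothetical names, a plain `byte[]` stands in for Hadoop's `BytesWritable` so that the example compiles on its own, and the 8-byte big-endian key layout is an assumption made only for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical stand-in for VectorMapJoinFastHashTableWrapper; byte[] replaces BytesWritable.
abstract class HashTableWrapperSketch {

  // Every implementation must be able to hash a key.
  abstract long calculateLongHashCode(long key, byte[] currentKey);

  // Default: tables without long keys do not support this call.
  long deserializeToKey(byte[] currentKey) {
    throw new UnsupportedOperationException("deserializeToKey is only supported for long keys");
  }
}

// Only the long-keyed variant overrides the default.
class LongWrapperSketch extends HashTableWrapperSketch {
  @Override
  long calculateLongHashCode(long key, byte[] currentKey) {
    return Long.hashCode(key);
  }

  @Override
  long deserializeToKey(byte[] currentKey) {
    // Assumption for this sketch: the key is serialized as 8 big-endian bytes.
    return ByteBuffer.wrap(currentKey).getLong();
  }
}

// The bytes-keyed variant simply inherits the throwing default.
class BytesWrapperSketch extends HashTableWrapperSketch {
  @Override
  long calculateLongHashCode(long key, byte[] currentKey) {
    return Arrays.hashCode(currentKey);
  }
}
```

Throwing by default makes an accidental call on a non-long table fail loudly, whereas returning 0 would silently produce a valid-looking key.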






Issue Time Tracking
---

Worklog Id: (was: 595452)
Time Spent: 2h 10m  (was: 2h)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-05-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595425&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595425
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 12/May/21 15:46
Start Date: 12/May/21 15:46
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2004:
URL: https://github.com/apache/hive/pull/2004#discussion_r631163313



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java
##
@@ -141,35 +280,64 @@ public void load(MapJoinTableContainer[] mapJoinTables,
 long keyCount = Math.max(estKeyCount, inputRecords);
 
 VectorMapJoinFastTableContainer vectorMapJoinFastTableContainer =
-new VectorMapJoinFastTableContainer(desc, hconf, keyCount);
+new VectorMapJoinFastTableContainer(desc, hconf, keyCount, numThreads);
 
 LOG.info("Loading hash table for input: {} cacheKey: {} tableContainer: {} smallTablePos: {} " +
 "estKeyCount : {} keyCount : {}", inputName, cacheKey,
 vectorMapJoinFastTableContainer.getClass().getSimpleName(), pos, estKeyCount, keyCount);
 
 vectorMapJoinFastTableContainer.setSerde(null, null); // No SerDes here.
+ExecutorService executorService = Executors.newFixedThreadPool(numThreads);
+BlockingQueue[] sharedQ = new BlockingQueue[numThreads];
+for(int i = 0; i < numThreads; ++i) {
+  sharedQ[i] = new LinkedBlockingQueue<>();
+}
+QueueElementBatch[] batches = new QueueElementBatch[numThreads];
+for (int i = 0; i < numThreads; ++i) {
+  batches[i] = new QueueElementBatch();
+}
+//start the threads
+drain(vectorMapJoinFastTableContainer, doMemCheck, inputName, memoryMonitorInfo,
+effectiveThreshold, executorService, sharedQ);
 long startTime = System.currentTimeMillis();
 while (kvReader.next()) {
-  vectorMapJoinFastTableContainer.putRow((BytesWritable)kvReader.getCurrentKey(),
-  (BytesWritable)kvReader.getCurrentValue());
+  BytesWritable currentKey = (BytesWritable) kvReader.getCurrentKey();
+  BytesWritable currentValue = (BytesWritable) kvReader.getCurrentValue();
+  long key = vectorMapJoinFastTableContainer.deserializeToKey(currentKey);
+  long hashCode = vectorMapJoinFastTableContainer.calculateLongHashCode(key, currentKey);
+  int partitionId = (int) ((numThreads - 1) & hashCode);
   numEntries++;
-  if (doMemCheck && (numEntries % memoryMonitorInfo.getMemoryCheckInterval() == 0)) {
-  final long estMemUsage = vectorMapJoinFastTableContainer.getEstimatedMemorySize();
-  if (estMemUsage > effectiveThreshold) {
-String msg = "Hash table loading exceeded memory limits for input: " + inputName +
-  " numEntries: " + numEntries + " estimatedMemoryUsage: " + estMemUsage +
-  " effectiveThreshold: " + effectiveThreshold + " memoryMonitorInfo: " + memoryMonitorInfo;
-LOG.error(msg);
-throw new MapJoinMemoryExhaustionError(msg);
-  } else {
-if (LOG.isInfoEnabled()) {
-  LOG.info("Checking hash table loader memory usage for input: {} numEntries: {} " +
-  "estimatedMemoryUsage: {} effectiveThreshold: {}", inputName, numEntries, estMemUsage,
-effectiveThreshold);
-}
-  }
+  // call getBytes as copy is called later
+  byte[] valueBytes = currentValue.copyBytes();
+  int valueLength = currentValue.getLength();
+  byte[] keyBytes = currentKey.copyBytes();
+  int keyLength = currentKey.getLength();
+  HashTableElement h = new HashTableElement(keyBytes, keyLength, valueBytes, valueLength, key, hashCode);
+  if (batches[partitionId].addElement(h)) {
+  sharedQ[partitionId].add(batches[partitionId]);
+  batches[partitionId] = new QueueElementBatch();
   }
 }
+
+LOG.info("Finished loading the queue for input: {} endTime : {}", inputName, System.currentTimeMillis());
+
+// Add sentinel at the end of queue
+for (int i=0; i<4; ++i) {
+  // add sentinel to the q not the batch
+  sharedQ[i].add(batches[i]);
+  sharedQ[i].add(sentinel);
+}
+
+executorService.shutdown();
+try {
+  executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
+} catch (InterruptedException e) {

Review comment:
   Handle this exception along with the others at the end?
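
One common way to act on this comment is to let the `InterruptedException` propagate to a single handler instead of catching it inline. The following is a minimal sketch under that assumption, not the PR's code: `AwaitSketch`, `shutdownAndAwait` and `runTasks` are hypothetical names, and the trivial counter tasks stand in for the hash table loader threads.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class AwaitSketch {

  // Waits for all submitted tasks. On interruption it cancels the workers,
  // restores the interrupt status and rethrows, so that one handler further
  // up the call chain can deal with this failure together with the others.
  static void shutdownAndAwait(ExecutorService pool) throws InterruptedException {
    pool.shutdown();
    try {
      pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    } catch (InterruptedException e) {
      pool.shutdownNow();                 // stop workers that are still running
      Thread.currentThread().interrupt(); // preserve the interrupt status
      throw e;
    }
  }

  // Runs numThreads trivial tasks and returns how many of them completed.
  static int runTasks(int numThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    AtomicInteger done = new AtomicInteger();
    for (int i = 0; i < numThreads; i++) {
      pool.execute(done::incrementAndGet);
    }
    try {
      shutdownAndAwait(pool);
    } catch (InterruptedException e) {
      // Single place where the failure is translated for the caller.
      throw new IllegalStateException("Interrupted while waiting for workers", e);
    }
    return done.get();
  }
}
```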





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-05-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595424&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595424
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 12/May/21 15:45
Start Date: 12/May/21 15:45
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2004:
URL: https://github.com/apache/hive/pull/2004#discussion_r631163313



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java
##

Review comment:
   Handle this exception along with the others at the end.





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-05-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595419&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595419
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 12/May/21 15:38
Start Date: 12/May/21 15:38
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2004:
URL: https://github.com/apache/hive/pull/2004#discussion_r631157229



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java
##
@@ -141,35 +280,64 @@ public void load(MapJoinTableContainer[] mapJoinTables,
 long keyCount = Math.max(estKeyCount, inputRecords);
 
 VectorMapJoinFastTableContainer vectorMapJoinFastTableContainer =
-new VectorMapJoinFastTableContainer(desc, hconf, keyCount);
+new VectorMapJoinFastTableContainer(desc, hconf, keyCount, numThreads);
 
 LOG.info("Loading hash table for input: {} cacheKey: {} tableContainer: {} smallTablePos: {} " +
 "estKeyCount : {} keyCount : {}", inputName, cacheKey,
 vectorMapJoinFastTableContainer.getClass().getSimpleName(), pos, estKeyCount, keyCount);
 
 vectorMapJoinFastTableContainer.setSerde(null, null); // No SerDes here.
+ExecutorService executorService = Executors.newFixedThreadPool(numThreads);
+BlockingQueue[] sharedQ = new BlockingQueue[numThreads];
+for(int i = 0; i < numThreads; ++i) {
+  sharedQ[i] = new LinkedBlockingQueue<>();
+}
+QueueElementBatch[] batches = new QueueElementBatch[numThreads];
+for (int i = 0; i < numThreads; ++i) {
+  batches[i] = new QueueElementBatch();
+}

Review comment:
   Let's use an init method for these lines.
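
A possible shape for such an init method is sketched below. This is an illustration only: `LoaderInitSketch`, `initQueues`, `initBatches` and the empty `Batch` class (standing in for the PR's `QueueElementBatch`) are hypothetical and do not come from the PR.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class LoaderInitSketch {

  // Hypothetical stand-in for the PR's QueueElementBatch.
  static class Batch { }

  // Creates one queue per loader thread.
  @SuppressWarnings("unchecked")
  static BlockingQueue<Batch>[] initQueues(int numThreads) {
    // Generic arrays cannot be created directly in Java, hence the raw array and the cast.
    BlockingQueue<Batch>[] queues = (BlockingQueue<Batch>[]) new BlockingQueue[numThreads];
    for (int i = 0; i < numThreads; ++i) {
      queues[i] = new LinkedBlockingQueue<>();
    }
    return queues;
  }

  // Creates one initially empty batch per loader thread.
  static Batch[] initBatches(int numThreads) {
    Batch[] batches = new Batch[numThreads];
    for (int i = 0; i < numThreads; ++i) {
      batches[i] = new Batch();
    }
    return batches;
  }
}
```

Pulling the two loops out of `load()` keeps the setup in one place and leaves the method body focused on reading rows and dispatching them.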






Issue Time Tracking
---

Worklog Id: (was: 595419)
Time Spent: 1h 40m  (was: 1.5h)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-05-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595409&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595409
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 12/May/21 15:18
Start Date: 12/May/21 15:18
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2004:
URL: https://github.com/apache/hive/pull/2004#discussion_r631139021



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java
##
@@ -54,11 +61,73 @@
 
   private static final Logger LOG = LoggerFactory.getLogger(VectorMapJoinFastHashTableLoader.class.getName());
 
+  public static class HashTableElement {
+byte[] keyBytes;
+int keyLength;
+byte[] valueBytes;
+int valueLength;
+long deserializeKey;
+long hashCode;
+
+public HashTableElement(byte[] keyBytes, int keyLength, byte[] valueBytes, int valueLength, long key, long hashCode) {

Review comment:
   keyLength and valueLength seem redundant? copyBytes() only copies up to
size() anyway.






Issue Time Tracking
---

Worklog Id: (was: 595409)
Time Spent: 1.5h  (was: 1h 20m)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-05-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595405&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595405
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 12/May/21 15:09
Start Date: 12/May/21 15:09
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2004:
URL: https://github.com/apache/hive/pull/2004#discussion_r631130583



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainerDirectAccess.java
##
@@ -21,11 +21,15 @@
 
 import java.io.IOException;
 
+import org.apache.hadoop.hive.ql.metadata.HiveException;
 import org.apache.hadoop.hive.serde2.SerDeException;
+import org.apache.hadoop.io.BytesWritable;
 import org.apache.hadoop.io.Writable;
 
 public interface MapJoinTableContainerDirectAccess {
 
   void put(Writable currentKey, Writable currentValue) throws SerDeException, IOException;
 
+  long calculateLongHashCode(BytesWritable currentKey) throws HiveException, IOException, SerDeException;

Review comment:
   Nit: shall we simplify this to something like getHashCode(BytesWritable
currentKey)?






Issue Time Tracking
---

Worklog Id: (was: 595405)
Time Spent: 1h 20m  (was: 1h 10m)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-05-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595404&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595404
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 12/May/21 15:06
Start Date: 12/May/21 15:06
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2004:
URL: https://github.com/apache/hive/pull/2004#discussion_r631128300



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/HybridHashTableContainer.java
##
@@ -806,6 +806,12 @@ public void put(Writable currentKey, Writable currentValue) throws SerDeExceptio
 internalPutRow(directWriteHelper, currentKey, currentValue);
   }
 
+  @Override
+  public long calculateLongHashCode(BytesWritable currentKey) throws HiveException, IOException, SerDeException {
+directWriteHelper.setKeyValue(currentKey, null);
+return (long)directWriteHelper.getHashFromKey();

Review comment:
   The cast seems redundant here.
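
The reasoning behind this comment, assuming that `getHashFromKey()` returns an `int` (an assumption, since its signature is not shown in the hunk): Java widens `int` to `long` implicitly on return, so an explicit `(long)` cast adds nothing. A minimal sketch with hypothetical names (`WideningSketch`, `intHash`, `hashAsLong`):

```java
import java.util.Arrays;

class WideningSketch {

  // Hypothetical stand-in for a helper that returns an int hash.
  static int intHash(byte[] key) {
    return Arrays.hashCode(key);
  }

  // The int result is widened to long implicitly; no cast is needed.
  // Negative values are sign-extended, exactly as with an explicit cast.
  static long hashAsLong(byte[] key) {
    return intHash(key);
  }
}
```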






Issue Time Tracking
---

Worklog Id: (was: 595404)
Time Spent: 1h 10m  (was: 1h)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-05-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595400&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595400
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 12/May/21 15:05
Start Date: 12/May/21 15:05
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2004:
URL: https://github.com/apache/hive/pull/2004#discussion_r631127066



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinBytesTableContainer.java
##
@@ -499,6 +500,13 @@ public void put(Writable currentKey, Writable currentValue) throws SerDeExceptio
 hashMap.put(directWriteHelper, -1);
   }
 
+  @Override public long calculateLongHashCode(BytesWritable currentKey)
+  throws HiveException, IOException, SerDeException {
+directWriteHelper.setKeyValue(currentKey, null);
+directWriteHelper.getHashFromKey();
+return 0;

Review comment:
   Maybe return directWriteHelper.getHashFromKey()?
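
Why the constant return value matters can be illustrated with the partitioning expression used elsewhere in this PR, `(numThreads - 1) & hashCode`: if the hash is always 0, every row lands in partition 0 and the other loader threads stay idle. The sketch below uses hypothetical names (`PartitionSketch`, `partitionFor`, `histogram`) and assumes `numThreads` is a power of two, which the mask requires in order to behave like a modulo.

```java
import java.util.function.LongUnaryOperator;

class PartitionSketch {

  // Same masking expression as in the PR; only a valid modulo when
  // numThreads is a power of two.
  static int partitionFor(long hashCode, int numThreads) {
    return (int) ((numThreads - 1) & hashCode);
  }

  // Counts how many keys land in each partition for a given hash function.
  static int[] histogram(long[] keys, int numThreads, LongUnaryOperator hash) {
    int[] counts = new int[numThreads];
    for (long k : keys) {
      counts[partitionFor(hash.applyAsLong(k), numThreads)]++;
    }
    return counts;
  }
}
```

Calling `histogram` with a hash that always returns 0 puts all keys into the first slot, while a hash that varies with the key spreads them across the slots.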






Issue Time Tracking
---

Worklog Id: (was: 595400)
Time Spent: 1h  (was: 50m)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-02-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=558560&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-558560
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 26/Feb/21 13:07
Start Date: 26/Feb/21 13:07
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2004:
URL: https://github.com/apache/hive/pull/2004#discussion_r583062970



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainerDirectAccess.java
##
@@ -21,11 +21,15 @@
 
 import java.io.IOException;
 
+import org.apache.hadoop.hive.ql.metadata.HiveException;
 import org.apache.hadoop.hive.serde2.SerDeException;
+import org.apache.hadoop.io.BytesWritable;
 import org.apache.hadoop.io.Writable;
 
 public interface MapJoinTableContainerDirectAccess {
 
   void put(Writable currentKey, Writable currentValue) throws SerDeException, IOException;
 
+  long calculateLongHashCode(BytesWritable currentKey, BytesWritable currentValue) throws HiveException, IOException, SerDeException;

Review comment:
   Let's use only the key here.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinCommonOperator.java
##
@@ -667,6 +673,8 @@ private void setUpHashTable() {
 VectorMapJoinTableContainer vectorMapJoinTableContainer =
 (VectorMapJoinTableContainer) mapJoinTables[posSingleVectorMapJoinSmallTable];
 vectorMapJoinHashTable = vectorMapJoinTableContainer.vectorMapJoinHashTable();
+vectorMapJoinFastHashTableWrapper = ((VectorMapJoinFastHashTableParallel)vectorMapJoinTableContainer.

Review comment:
   Can we just keep it as a vectorMapJoinHashTable? What is the reason for
the extra interface (VectorMapJoinFastHashTableWrapper)?

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastBytesHashMap.java
##
@@ -48,7 +48,7 @@
 
   private long fullOuterNullKeyRefWord;
 
-  private static class NonMatchedBytesHashMapIterator extends VectorMapJoinFastNonMatchedIterator {
+  public static class NonMatchedBytesHashMapIterator extends VectorMapJoinFastNonMatchedIterator {

Review comment:
   Maybe move this along with NonMatchedLongHashMapIterator to their own 
class? 

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTable.java
##
@@ -64,15 +64,19 @@ public void throwExpandError(int limit, String dataTypeName) {
 
   private static void validateCapacity(long capacity) {
 if (Long.bitCount(capacity) != 1) {
-  throw new AssertionError("Capacity must be a power of two");
+  throw new AssertionError("Capacity must be a power of two " + capacity);

Review comment:
   Nit: a better way to check whether a number is a power of two would be:
   `(capacity & (capacity - 1)) == 0`
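
   For comparison, a minimal sketch of both checks side by side. The class and
   method names are hypothetical and not part of the patch. In Java, `==`
   binds tighter than `&`, so the mask form needs the extra parentheses, and
   on its own it also accepts 0 and Long.MIN_VALUE, so this sketch adds a
   `capacity > 0` guard as an assumption about what the validation should
   reject.

```java
final class PowerOfTwoCheck {
    private PowerOfTwoCheck() {}

    // Population-count form, as in the existing validateCapacity() condition.
    // Rejects 0, but accepts Long.MIN_VALUE (only the sign bit is set).
    static boolean viaBitCount(long capacity) {
        return Long.bitCount(capacity) == 1;
    }

    // Mask form. The inner parentheses are required because == has higher
    // precedence than & in Java. The capacity > 0 guard (an assumption added
    // here) rejects 0 and negative values, which the bare mask test accepts.
    static boolean viaMask(long capacity) {
        return capacity > 0 && (capacity & (capacity - 1)) == 0;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 2L, 3L, 1024L, 1025L, Long.MIN_VALUE};
        for (long c : samples) {
            System.out.println(c + " bitCount=" + viaBitCount(c) + " mask=" + viaMask(c));
        }
    }
}
```

   The two forms agree for every positive long; they differ only on
   Long.MIN_VALUE, which the guarded mask form rejects.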

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java
##
@@ -71,6 +196,79 @@ public void init(ExecMapperContext context, MapredContext mrContext,
 String vertexName = hconf.get(Operator.CONTEXT_NAME_KEY, "");
 String counterName = Utilities.getVertexCounterName(HashTableLoaderCounters.HASHTABLE_LOAD_TIME_MS.name(),
 vertexName);
 this.htLoadCounter = tezContext.getTezProcessorContext().getCounters().findCounter(counterGroup, counterName);
+this.numEntries = 0;
+totalEntries = new AtomicLong(0);
+  }
+
+  public void drainQueueAndLoad(VectorMapJoinFastTableContainer vectorMapJoinFastTableContainer, boolean doMemCheck,
+  String inputName, MemoryMonitorInfo memoryMonitorInfo, long effectiveThreshold, int partitionId,
+  BlockingQueue[] sharedQ)
+  throws InterruptedException, IOException, HiveException, SerDeException {
+LOG.info("Draining thread " + partitionId + " started");
+long entries = 0;
+BlockingQueue[] partitionQ = sharedQ;

Review comment:
   Is this assignment needed?
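
   To illustrate the point, here is a rough, self-contained sketch of the
   general pattern (one queue and one drain thread per partition). It is not
   the code in this pull request: the class name, the DONE marker, the use of
   plain HashMaps and the hash-based routing are all assumptions made for the
   example. Each drain thread is handed its own queue directly, so no local
   alias of the shared array is needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class ParallelLoadSketch {
    // Hypothetical end-of-input marker, compared by identity. Not a Hive constant.
    private static final long[] DONE = new long[0];

    private ParallelLoadSketch() {}

    // Loads {key, value} rows into one HashMap per partition, using one
    // drain thread per partition. totalEntries receives the number of rows drained.
    static List<Map<Long, Long>> load(long[][] rows, int numPartitions, AtomicLong totalEntries) {
        List<BlockingQueue<long[]>> queues = new ArrayList<>();
        List<Map<Long, Long>> tables = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            queues.add(new LinkedBlockingQueue<>());
            tables.add(new HashMap<>());
        }
        List<Thread> drainers = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            final int partitionId = i;
            // Each thread touches only its own queue and its own table.
            Thread t = new Thread(
                () -> drain(queues.get(partitionId), tables.get(partitionId), totalEntries));
            t.start();
            drainers.add(t);
        }
        // Route every row to a partition by key hash; the queues are unbounded,
        // so add() cannot fail for lack of capacity.
        for (long[] row : rows) {
            int p = Math.floorMod(Long.hashCode(row[0]), numPartitions);
            queues.get(p).add(row);
        }
        for (BlockingQueue<long[]> q : queues) {
            q.add(DONE);
        }
        try {
            for (Thread t : drainers) {
                t.join(); // join() makes each thread's writes visible to the caller
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for drain threads", e);
        }
        return tables;
    }

    private static void drain(BlockingQueue<long[]> queue, Map<Long, Long> table, AtomicLong totalEntries) {
        long entries = 0;
        try {
            while (true) {
                long[] row = queue.take();
                if (row == DONE) {
                    break;
                }
                table.put(row[0], row[1]);
                entries++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Publish the per-thread count once, instead of contending on every row.
        totalEntries.addAndGet(entries);
    }
}
```

   Because a key always hashes to the same partition, no two threads ever
   write the same table, which is why ordinary HashMaps are enough in this
   sketch.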

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java
##
@@ -71,6 +196,79 @@ public void init(ExecMapperContext context, MapredContext mrContext,
 String vertexName = hconf.get(Operator.CONTEXT_NAME_KEY, "");
 String counterName = Utilities.getVertexCounterName(HashTableLoaderCounters.HASHTABLE_LOAD_TIME_MS.name(),
 vertexName);
 this.htLoadCounter = tezContext.getTezProcessorContext().getCounters().findCounter(counterGroup, counterName);
+this.numEntries = 0;
+totalEntries = new AtomicLong(0);
+  }
+
+  public void drainQueueAndLoad(VectorMapJoinFastTableContainer vectorMapJoinFastTableContainer, boolean doMemCheck,
+  String inputName, MemoryMonitorInfo memoryMonitorInfo, long effectiveThreshold, int 

[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2021-02-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=556013&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-556013
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 22/Feb/21 20:37
Start Date: 22/Feb/21 20:37
Worklog Time Spent: 10m 
  Work Description: ramesh0201 opened a new pull request #2004:
URL: https://github.com/apache/hive/pull/2004


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





Issue Time Tracking
---

Worklog Id: (was: 556013)
Time Spent: 40m  (was: 0.5h)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2020-10-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=505480&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-505480
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 28/Oct/20 01:01
Start Date: 28/Oct/20 01:01
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #1401:
URL: https://github.com/apache/hive/pull/1401


   





Issue Time Tracking
---

Worklog Id: (was: 505480)
Time Spent: 0.5h  (was: 20m)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2020-10-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=502921&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-502921
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 21/Oct/20 00:57
Start Date: 21/Oct/20 00:57
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] commented on pull request #1401:
URL: https://github.com/apache/hive/pull/1401#issuecomment-713224730


   This pull request has been automatically marked as stale because it has not 
had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in 
need of reviews.





Issue Time Tracking
---

Worklog Id: (was: 502921)
Time Spent: 20m  (was: 10m)

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins





[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=470163&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470163
 ]

ASF GitHub Bot logged work on HIVE-24037:
-

Author: ASF GitHub Bot
Created on: 13/Aug/20 09:46
Start Date: 13/Aug/20 09:46
Worklog Time Spent: 10m 
  Work Description: ramesh0201 opened a new pull request #1401:
URL: https://github.com/apache/hive/pull/1401


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





Issue Time Tracking
---

Worklog Id: (was: 470163)
Remaining Estimate: 0h
Time Spent: 10m

> Parallelize hash table constructions in map joins
> -
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins


