[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=625457=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-625457 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 20/Jul/21 12:12 Start Date: 20/Jul/21 12:12 Worklog Time Spent: 10m Work Description: github-actions[bot] closed pull request #2004: URL: https://github.com/apache/hive/pull/2004 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 625457) Time Spent: 3h (was: 2h 50m) > Parallelize hash table constructions in map joins > - > > Key: HIVE-24037 > URL: https://issues.apache.org/jira/browse/HIVE-24037 > Project: Hive > Issue Type: Improvement >Reporter: Ramesh Kumar Thangarajan >Priority: Major > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > Parallelize hash table constructions in map joins -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=625064=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-625064 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 20/Jul/21 10:01 Start Date: 20/Jul/21 10:01 Worklog Time Spent: 10m Work Description: github-actions[bot] closed pull request #2004: URL: https://github.com/apache/hive/pull/2004 Issue Time Tracking --- Worklog Id: (was: 625064) Time Spent: 2h 50m (was: 2h 40m)
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=624709=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-624709 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 20/Jul/21 00:08 Start Date: 20/Jul/21 00:08 Worklog Time Spent: 10m Work Description: github-actions[bot] closed pull request #2004: URL: https://github.com/apache/hive/pull/2004 Issue Time Tracking --- Worklog Id: (was: 624709) Time Spent: 2h 40m (was: 2.5h)
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=621409=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-621409 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 12/Jul/21 00:09 Start Date: 12/Jul/21 00:09 Worklog Time Spent: 10m Work Description: github-actions[bot] commented on pull request #2004: URL: https://github.com/apache/hive/pull/2004#issuecomment-877884465 This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the d...@hive.apache.org list if the patch is in need of reviews. Issue Time Tracking --- Worklog Id: (was: 621409) Time Spent: 2.5h (was: 2h 20m)
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=600301=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-600301 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 21/May/21 11:40 Start Date: 21/May/21 11:40 Worklog Time Spent: 10m Work Description: pgaref opened a new pull request #2305: URL: https://github.com/apache/hive/pull/2305 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Issue Time Tracking --- Worklog Id: (was: 600301) Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595452=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595452 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 12/May/21 16:23 Start Date: 12/May/21 16:23 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #2004: URL: https://github.com/apache/hive/pull/2004#discussion_r631202108 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableWrapper.java ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast; + +import org.apache.hadoop.hive.ql.metadata.HiveException; +import org.apache.hadoop.io.BytesWritable; + +import java.io.IOException; + +public abstract class VectorMapJoinFastHashTableWrapper { + + public abstract long calculateLongHashCode(long key, BytesWritable currentKey) throws HiveException, IOException; + + public abstract long deserializeToKey(BytesWritable currentKey) throws HiveException, IOException; Review comment: Maybe have a default impl of deserializeToKey() throwing an exception or returning 0 and only have Long implementations to Override? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 595452) Time Spent: 2h 10m (was: 2h) > Parallelize hash table constructions in map joins > - > > Key: HIVE-24037 > URL: https://issues.apache.org/jira/browse/HIVE-24037 > Project: Hive > Issue Type: Improvement >Reporter: Ramesh Kumar Thangarajan >Assignee: Ramesh Kumar Thangarajan >Priority: Major > Labels: pull-request-available > Time Spent: 2h 10m > Remaining Estimate: 0h > > Parallelize hash table constructions in map joins -- This message was sent by Atlassian Jira (v8.3.4#803005)
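The default-implementation idea from the review comment above could look roughly like the sketch below. It is not the PR's code: `HashTableWrapperSketch`, `LongKeySketch`, and `BytesKeySketch` are hypothetical names, `byte[]` stands in for `BytesWritable` so the example has no Hadoop dependency, and the 8-byte big-endian key encoding is an assumption made purely for illustration (Hive's real key serialization differs).

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical stand-in for VectorMapJoinFastHashTableWrapper.
abstract class HashTableWrapperSketch {
    abstract long calculateLongHashCode(long key, byte[] currentKey);

    // Default for tables without a long key: fail loudly, so only
    // long-keyed implementations need to override this method.
    long deserializeToKey(byte[] currentKey) {
        throw new UnsupportedOperationException(
            "deserializeToKey is only supported for long-keyed tables");
    }
}

// Long-keyed variant: the only one that overrides deserializeToKey.
final class LongKeySketch extends HashTableWrapperSketch {
    @Override
    long deserializeToKey(byte[] currentKey) {
        // Assumed encoding: one big-endian 8-byte long.
        return ByteBuffer.wrap(currentKey).getLong();
    }

    @Override
    long calculateLongHashCode(long key, byte[] currentKey) {
        return Long.hashCode(key);
    }
}

// Bytes-keyed variant: inherits the throwing default.
final class BytesKeySketch extends HashTableWrapperSketch {
    @Override
    long calculateLongHashCode(long key, byte[] currentKey) {
        return Arrays.hashCode(currentKey);
    }
}
```

The loader diff elsewhere in this thread calls `deserializeToKey` for every row regardless of key type, so of the two options the reviewer mentions, a default that returns 0 may fit that call site better than one that throws.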
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595425=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595425 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 12/May/21 15:46 Start Date: 12/May/21 15:46 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #2004: URL: https://github.com/apache/hive/pull/2004#discussion_r631163313 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java ## @@ -141,35 +280,64 @@ public void load(MapJoinTableContainer[] mapJoinTables, long keyCount = Math.max(estKeyCount, inputRecords); VectorMapJoinFastTableContainer vectorMapJoinFastTableContainer = -new VectorMapJoinFastTableContainer(desc, hconf, keyCount); +new VectorMapJoinFastTableContainer(desc, hconf, keyCount, numThreads); LOG.info("Loading hash table for input: {} cacheKey: {} tableContainer: {} smallTablePos: {} " + "estKeyCount : {} keyCount : {}", inputName, cacheKey, vectorMapJoinFastTableContainer.getClass().getSimpleName(), pos, estKeyCount, keyCount); vectorMapJoinFastTableContainer.setSerde(null, null); // No SerDes here. 
+ExecutorService executorService = Executors.newFixedThreadPool(numThreads); +BlockingQueue[] sharedQ = new BlockingQueue[numThreads]; +for(int i = 0; i < numThreads; ++i) { + sharedQ[i] = new LinkedBlockingQueue<>(); +} +QueueElementBatch[] batches = new QueueElementBatch[numThreads]; +for (int i = 0; i < numThreads; ++i) { + batches[i] = new QueueElementBatch(); +} +//start the threads +drain(vectorMapJoinFastTableContainer, doMemCheck, inputName, memoryMonitorInfo, +effectiveThreshold, executorService, sharedQ); long startTime = System.currentTimeMillis(); while (kvReader.next()) { - vectorMapJoinFastTableContainer.putRow((BytesWritable)kvReader.getCurrentKey(), - (BytesWritable)kvReader.getCurrentValue()); + BytesWritable currentKey = (BytesWritable) kvReader.getCurrentKey(); + BytesWritable currentValue = (BytesWritable) kvReader.getCurrentValue(); + long key = vectorMapJoinFastTableContainer.deserializeToKey(currentKey); + long hashCode = vectorMapJoinFastTableContainer.calculateLongHashCode(key, currentKey); + int partitionId = (int) ((numThreads - 1) & hashCode); numEntries++; - if (doMemCheck && (numEntries % memoryMonitorInfo.getMemoryCheckInterval() == 0)) { - final long estMemUsage = vectorMapJoinFastTableContainer.getEstimatedMemorySize(); - if (estMemUsage > effectiveThreshold) { -String msg = "Hash table loading exceeded memory limits for input: " + inputName + - " numEntries: " + numEntries + " estimatedMemoryUsage: " + estMemUsage + - " effectiveThreshold: " + effectiveThreshold + " memoryMonitorInfo: " + memoryMonitorInfo; -LOG.error(msg); -throw new MapJoinMemoryExhaustionError(msg); - } else { -if (LOG.isInfoEnabled()) { - LOG.info("Checking hash table loader memory usage for input: {} numEntries: {} " + - "estimatedMemoryUsage: {} effectiveThreshold: {}", inputName, numEntries, estMemUsage, -effectiveThreshold); -} - } + // call getBytes as copy is called later + byte[] valueBytes = currentValue.copyBytes(); + int valueLength = 
currentValue.getLength(); + byte[] keyBytes = currentKey.copyBytes(); + int keyLength = currentKey.getLength(); + HashTableElement h = new HashTableElement(keyBytes, keyLength, valueBytes, valueLength, key, hashCode); + if (batches[partitionId].addElement(h)) { + sharedQ[partitionId].add(batches[partitionId]); + batches[partitionId] = new QueueElementBatch(); } } + +LOG.info("Finished loading the queue for input: {} endTime : {}", inputName, System.currentTimeMillis()); + +// Add sentinel at the end of queue +for (int i=0; i<4; ++i) { + // add sentinel to the q not the batch + sharedQ[i].add(batches[i]); + sharedQ[i].add(sentinel); +} + +executorService.shutdown(); +try { + executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS); +} catch (InterruptedException e) { Review comment: handle exception along with others at the end? -- This is an automated message from the Apache Git Service.
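The queue handling in the diff above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the PR's implementation: `PartitionedLoadSketch`, `load`, and `drain` are hypothetical names, a `long[]` stands in for `QueueElementBatch`, and the workers only count entries instead of building hash tables. It shows three points: the mask `(numThreads - 1) & hashCode` spreads rows over all partitions only when `numThreads` is a power of two, the sentinel loop is driven by the number of queues (the quoted diff loops to the literal 4), and `InterruptedException` is handled by restoring the interrupt flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class PartitionedLoadSketch {
    // Identity-compared end-of-stream marker (hypothetical stand-in for the PR's sentinel).
    private static final long[] SENTINEL = new long[0];

    // Routes each hash code to one of numThreads queues; returns how many entries the workers consumed.
    static long load(long[] hashCodes, int numThreads) {
        if (Integer.bitCount(numThreads) != 1) {
            throw new IllegalArgumentException("numThreads must be a power of two for mask partitioning");
        }
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<BlockingQueue<long[]>> queues = new ArrayList<>();
        AtomicLong total = new AtomicLong();
        for (int i = 0; i < numThreads; i++) {
            BlockingQueue<long[]> q = new LinkedBlockingQueue<>();
            queues.add(q);
            pool.execute(() -> drain(q, total));
        }
        for (long h : hashCodes) {
            int partitionId = (int) ((numThreads - 1) & h);
            queues.get(partitionId).add(new long[] {h});
        }
        // One sentinel per queue, driven by the queue count rather than a literal.
        for (BlockingQueue<long[]> q : queues) {
            q.add(SENTINEL);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return total.get();
    }

    // Worker loop: consume batches until the sentinel arrives.
    private static void drain(BlockingQueue<long[]> q, AtomicLong total) {
        try {
            for (long[] batch = q.take(); batch != SENTINEL; batch = q.take()) {
                total.addAndGet(batch.length);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A bounded wait with `shutdownNow()` on timeout is a design choice of this sketch; the quoted diff waits with `Long.MAX_VALUE` instead.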
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595419=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595419 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 12/May/21 15:38 Start Date: 12/May/21 15:38 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #2004: URL: https://github.com/apache/hive/pull/2004#discussion_r631157229 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java ## @@ -141,35 +280,64 @@ public void load(MapJoinTableContainer[] mapJoinTables, long keyCount = Math.max(estKeyCount, inputRecords); VectorMapJoinFastTableContainer vectorMapJoinFastTableContainer = -new VectorMapJoinFastTableContainer(desc, hconf, keyCount); +new VectorMapJoinFastTableContainer(desc, hconf, keyCount, numThreads); LOG.info("Loading hash table for input: {} cacheKey: {} tableContainer: {} smallTablePos: {} " + "estKeyCount : {} keyCount : {}", inputName, cacheKey, vectorMapJoinFastTableContainer.getClass().getSimpleName(), pos, estKeyCount, keyCount); vectorMapJoinFastTableContainer.setSerde(null, null); // No SerDes here. 
+ExecutorService executorService = Executors.newFixedThreadPool(numThreads); +BlockingQueue[] sharedQ = new BlockingQueue[numThreads]; +for(int i = 0; i < numThreads; ++i) { + sharedQ[i] = new LinkedBlockingQueue<>(); +} +QueueElementBatch[] batches = new QueueElementBatch[numThreads]; +for (int i = 0; i < numThreads; ++i) { + batches[i] = new QueueElementBatch(); +} Review comment: Let's use an init method for these lines.
Issue Time Tracking --- Worklog Id: (was: 595419) Time Spent: 1h 40m (was: 1.5h)
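One way to read the "init method" suggestion is a small helper that builds the per-thread structures in one place. The sketch below is an assumption about what such a helper might look like, not code from the PR: `LoaderInitSketch` and `initPerThread` are hypothetical names, and a `List` replaces the raw arrays used in the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class LoaderInitSketch {
    // Creates numThreads independent instances, one per loader thread.
    static <T> List<T> initPerThread(int numThreads, Supplier<T> factory) {
        List<T> perThread = new ArrayList<>(numThreads);
        for (int i = 0; i < numThreads; i++) {
            perThread.add(factory.get());
        }
        return perThread;
    }
}
```

With this shape, both loops in the diff would reduce to two calls, for example passing `LinkedBlockingQueue::new` for the queues and the batch constructor for the batches.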
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595409=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595409 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 12/May/21 15:18 Start Date: 12/May/21 15:18 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #2004: URL: https://github.com/apache/hive/pull/2004#discussion_r631139021
## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java ##
@@ -54,11 +61,73 @@
   private static final Logger LOG = LoggerFactory.getLogger(VectorMapJoinFastHashTableLoader.class.getName());

+  public static class HashTableElement {
+    byte[] keyBytes;
+    int keyLength;
+    byte[] valueBytes;
+    int valueLength;
+    long deserializeKey;
+    long hashCode;
+
+    public HashTableElement(byte[] keyBytes, int keyLength, byte[] valueBytes, int valueLength, long key, long hashCode) {
Review comment: keyLength and valueLength seem redundant: copyBytes goes up to size() anyway, so each copied array already has exactly the data length.
Issue Time Tracking --- Worklog Id: (was: 595409) Time Spent: 1.5h (was: 1h 20m)
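The reviewer's point about the length fields can be illustrated with a trimmed-down element class. This is a sketch under assumptions: `HashTableElementSketch` and `copyExact` are hypothetical names, and `copyExact` imitates an exact-length copy out of a larger reusable buffer, which is how the comment describes `copyBytes`.

```java
import java.util.Arrays;

final class HashTableElementSketch {
    final byte[] keyBytes;
    final byte[] valueBytes;
    final long deserializedKey;
    final long hashCode;

    HashTableElementSketch(byte[] keyBytes, byte[] valueBytes, long deserializedKey, long hashCode) {
        this.keyBytes = keyBytes;
        this.valueBytes = valueBytes;
        this.deserializedKey = deserializedKey;
        this.hashCode = hashCode;
    }

    // The lengths are derived from the arrays, so no separate fields are needed.
    int keyLength() {
        return keyBytes.length;
    }

    int valueLength() {
        return valueBytes.length;
    }

    // Imitates an exact-length copy from a larger backing buffer (assumed copyBytes behaviour).
    static byte[] copyExact(byte[] backing, int length) {
        return Arrays.copyOf(backing, length);
    }
}
```

If the arrays were instead obtained without copying from a shared, over-allocated buffer, the explicit lengths would still be required; the simplification only holds for exact-length copies.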
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595405=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595405 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 12/May/21 15:09 Start Date: 12/May/21 15:09 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #2004: URL: https://github.com/apache/hive/pull/2004#discussion_r631130583
## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainerDirectAccess.java ##
@@ -21,11 +21,15 @@
 import java.io.IOException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
 import org.apache.hadoop.hive.serde2.SerDeException;
+import org.apache.hadoop.io.BytesWritable;
 import org.apache.hadoop.io.Writable;
 public interface MapJoinTableContainerDirectAccess {
   void put(Writable currentKey, Writable currentValue) throws SerDeException, IOException;
+  long calculateLongHashCode(BytesWritable currentKey) throws HiveException, IOException, SerDeException;
Review comment: Nit. Shall we simplify this to something like getHashCode(BytesWritable currentKey)?
Issue Time Tracking --- Worklog Id: (was: 595405) Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595404=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595404 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 12/May/21 15:06 Start Date: 12/May/21 15:06 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #2004: URL: https://github.com/apache/hive/pull/2004#discussion_r631128300
## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/HybridHashTableContainer.java ##
@@ -806,6 +806,12 @@ public void put(Writable currentKey, Writable currentValue) throws SerDeExceptio
     internalPutRow(directWriteHelper, currentKey, currentValue);
   }
+  @Override
+  public long calculateLongHashCode(BytesWritable currentKey) throws HiveException, IOException, SerDeException {
+    directWriteHelper.setKeyValue(currentKey, null);
+    return (long)directWriteHelper.getHashFromKey();
Review comment: Casting seems redundant here
Issue Time Tracking --- Worklog Id: (was: 595404) Time Spent: 1h 10m (was: 1h)
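The cast is redundant because Java widens an integral return value to `long` implicitly. A minimal sketch, assuming the helper returns an `int` (the names `WideningSketch` and `hashFromKey` are hypothetical):

```java
final class WideningSketch {
    // Stand-in for a helper that returns an int hash (assumed return type).
    static int hashFromKey() {
        return -17;
    }

    // Explicit cast, as in the diff.
    static long withCast() {
        return (long) hashFromKey();
    }

    // Implicit widening conversion: same value, sign-extended to 64 bits.
    static long withoutCast() {
        return hashFromKey();
    }
}
```

Both methods return the same value. Because widening sign-extends, a negative `int` hash stays negative as a `long`, which matters to any caller that later masks or shifts the result.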
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=595400=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595400 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 12/May/21 15:05 Start Date: 12/May/21 15:05 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #2004: URL: https://github.com/apache/hive/pull/2004#discussion_r631127066
## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinBytesTableContainer.java ##
@@ -499,6 +500,13 @@ public void put(Writable currentKey, Writable currentValue) throws SerDeExceptio
     hashMap.put(directWriteHelper, -1);
   }
+  @Override public long calculateLongHashCode(BytesWritable currentKey)
+      throws HiveException, IOException, SerDeException {
+    directWriteHelper.setKeyValue(currentKey, null);
+    directWriteHelper.getHashFromKey();
+    return 0;
Review comment: Maybe return directWriteHelper.getHashFromKey()?
Issue Time Tracking --- Worklog Id: (was: 595400) Time Spent: 1h (was: 50m)
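To see why a constant return value matters, consider a caller that partitions on the hash with the mask used in the loader diff elsewhere in this thread. Whether this container is ever used on that path is not stated here, so treat the following as an illustration only; `PartitionSkewSketch` and `histogram` are hypothetical names.

```java
final class PartitionSkewSketch {
    // Counts how many hash codes land in each partition when using the
    // mask (numPartitions - 1) & hash; numPartitions is assumed to be a power of two.
    static int[] histogram(long[] hashCodes, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (long h : hashCodes) {
            counts[(int) ((numPartitions - 1) & h)]++;
        }
        return counts;
    }
}
```

With every hash equal to 0, all rows fall into partition 0 and the remaining threads stay idle; returning the computed hash, as the reviewer suggests, avoids that skew.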
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=558560=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-558560 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 26/Feb/21 13:07 Start Date: 26/Feb/21 13:07 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #2004: URL: https://github.com/apache/hive/pull/2004#discussion_r583062970 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainerDirectAccess.java ## @@ -21,11 +21,15 @@ import java.io.IOException; +import org.apache.hadoop.hive.ql.metadata.HiveException; import org.apache.hadoop.hive.serde2.SerDeException; +import org.apache.hadoop.io.BytesWritable; import org.apache.hadoop.io.Writable; public interface MapJoinTableContainerDirectAccess { void put(Writable currentKey, Writable currentValue) throws SerDeException, IOException; + long calculateLongHashCode(BytesWritable currentKey, BytesWritable currentValue) throws HiveException, IOException, SerDeException; Review comment: lets use only the Key here ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinCommonOperator.java ## @@ -667,6 +673,8 @@ private void setUpHashTable() { VectorMapJoinTableContainer vectorMapJoinTableContainer = (VectorMapJoinTableContainer) mapJoinTables[posSingleVectorMapJoinSmallTable]; vectorMapJoinHashTable = vectorMapJoinTableContainer.vectorMapJoinHashTable(); +vectorMapJoinFastHashTableWrapper = ((VectorMapJoinFastHashTableParallel)vectorMapJoinTableContainer. Review comment: Can we just keep it as a vectorMapJoinHashTable ? 
What is the reason for the extra interface (VectorMapJoinFastHashTableWrapper)? ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastBytesHashMap.java ## @@ -48,7 +48,7 @@ private long fullOuterNullKeyRefWord; - private static class NonMatchedBytesHashMapIterator extends VectorMapJoinFastNonMatchedIterator { + public static class NonMatchedBytesHashMapIterator extends VectorMapJoinFastNonMatchedIterator { Review comment: Maybe move this along with NonMatchedLongHashMapIterator to their own class? ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTable.java ## @@ -64,15 +64,19 @@ public void throwExpandError(int limit, String dataTypeName) { private static void validateCapacity(long capacity) { if (Long.bitCount(capacity) != 1) { - throw new AssertionError("Capacity must be a power of two"); + throw new AssertionError("Capacity must be a power of two " + capacity); Review comment: Nit. A better way to check whether a number is a power of two would be: `(capacity & (capacity - 1)) == 0` (for a positive capacity) ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java ## @@ -71,6 +196,79 @@ public void init(ExecMapperContext context, MapredContext mrContext, String vertexName = hconf.get(Operator.CONTEXT_NAME_KEY, ""); String counterName = Utilities.getVertexCounterName(HashTableLoaderCounters.HASHTABLE_LOAD_TIME_MS.name(), vertexName); this.htLoadCounter = tezContext.getTezProcessorContext().getCounters().findCounter(counterGroup, counterName); +this.numEntries = 0; +totalEntries = new AtomicLong(0); + } + + public void drainQueueAndLoad(VectorMapJoinFastTableContainer vectorMapJoinFastTableContainer, boolean doMemCheck, + String inputName, MemoryMonitorInfo memoryMonitorInfo, long effectiveThreshold, int partitionId, + BlockingQueue[] sharedQ) + throws InterruptedException, IOException, HiveException, SerDeException { +LOG.info("Draining thread " + 
partitionId + " started"); +long entries = 0; +BlockingQueue[] partitionQ = sharedQ; Review comment: is this assignment needed? ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java ## @@ -71,6 +196,79 @@ public void init(ExecMapperContext context, MapredContext mrContext, String vertexName = hconf.get(Operator.CONTEXT_NAME_KEY, ""); String counterName = Utilities.getVertexCounterName(HashTableLoaderCounters.HASHTABLE_LOAD_TIME_MS.name(), vertexName); this.htLoadCounter = tezContext.getTezProcessorContext().getCounters().findCounter(counterGroup, counterName); +this.numEntries = 0; +totalEntries = new AtomicLong(0); + } + + public void drainQueueAndLoad(VectorMapJoinFastTableContainer vectorMapJoinFastTableContainer, boolean doMemCheck, + String inputName, MemoryMonitorInfo memoryMonitorInfo, long effectiveThreshold, int
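A note on the power-of-two nit above on `validateCapacity`: written exactly as in the comment, `capacity & (capacity -1) == 0` would not compile in Java, because `==` binds more tightly than `&`, so the expression parses as `capacity & ((capacity - 1) == 0)`. The sketch below compares a parenthesized form of the suggestion with the `Long.bitCount` check from the diff. The class name `PowerOfTwoCheck` and both method names are hypothetical and are not part of the Hive patch; the `capacity > 0` guard is an added assumption, since the bare mask test would also accept zero.

```java
// Hypothetical helper for illustration only; not taken from the Hive code base.
final class PowerOfTwoCheck {
    private PowerOfTwoCheck() {}

    /** The check used in the diff: true when exactly one bit is set. */
    static boolean isPowerOfTwoBitCount(long capacity) {
        return Long.bitCount(capacity) == 1;
    }

    /**
     * Parenthesized form of the reviewer's suggestion. The inner parentheses
     * are required because == has higher precedence than & in Java. The
     * capacity > 0 guard rejects zero and negative values.
     */
    static boolean isPowerOfTwoMask(long capacity) {
        return capacity > 0 && (capacity & (capacity - 1)) == 0;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 2L, 3L, 1024L, 1025L, Long.MIN_VALUE};
        for (long c : samples) {
            System.out.println(c + " bitCount=" + isPowerOfTwoBitCount(c)
                + " mask=" + isPowerOfTwoMask(c));
        }
    }
}
```

The two checks agree for every positive value. They differ only for `Long.MIN_VALUE`, which has a single bit set but is negative, so the `bitCount` form accepts it while the guarded mask form rejects it.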
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=556013&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-556013 ]

ASF GitHub Bot logged work on HIVE-24037:
-----------------------------------------

    Author: ASF GitHub Bot
    Created on: 22/Feb/21 20:37
    Start Date: 22/Feb/21 20:37
    Worklog Time Spent: 10m
    Work Description: ramesh0201 opened a new pull request #2004:
    URL: https://github.com/apache/hive/pull/2004

    ### What changes were proposed in this pull request?
    ### Why are the changes needed?
    ### Does this PR introduce _any_ user-facing change?
    ### How was this patch tested?

Issue Time Tracking
-------------------

    Worklog Id: (was: 556013)
    Time Spent: 40m (was: 0.5h)

> Parallelize hash table constructions in map joins
> -------------------------------------------------
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
> Issue Type: Improvement
> Reporter: Ramesh Kumar Thangarajan
> Assignee: Ramesh Kumar Thangarajan
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=505480&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-505480 ]

ASF GitHub Bot logged work on HIVE-24037:
-----------------------------------------

    Author: ASF GitHub Bot
    Created on: 28/Oct/20 01:01
    Start Date: 28/Oct/20 01:01
    Worklog Time Spent: 10m
    Work Description: github-actions[bot] closed pull request #1401:
    URL: https://github.com/apache/hive/pull/1401

Issue Time Tracking
-------------------

    Worklog Id: (was: 505480)
    Time Spent: 0.5h (was: 20m)

> Parallelize hash table constructions in map joins
> -------------------------------------------------
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
> Issue Type: Improvement
> Reporter: Ramesh Kumar Thangarajan
> Assignee: Ramesh Kumar Thangarajan
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=502921&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-502921 ]

ASF GitHub Bot logged work on HIVE-24037:
-----------------------------------------

    Author: ASF GitHub Bot
    Created on: 21/Oct/20 00:57
    Start Date: 21/Oct/20 00:57
    Worklog Time Spent: 10m
    Work Description: github-actions[bot] commented on pull request #1401:
    URL: https://github.com/apache/hive/pull/1401#issuecomment-713224730

    This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
    Feel free to reach out on the d...@hive.apache.org list if the patch is in need of reviews.

Issue Time Tracking
-------------------

    Worklog Id: (was: 502921)
    Time Spent: 20m (was: 10m)

> Parallelize hash table constructions in map joins
> -------------------------------------------------
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
> Issue Type: Improvement
> Reporter: Ramesh Kumar Thangarajan
> Assignee: Ramesh Kumar Thangarajan
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=470163&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470163 ]

ASF GitHub Bot logged work on HIVE-24037:
-----------------------------------------

    Author: ASF GitHub Bot
    Created on: 13/Aug/20 09:46
    Start Date: 13/Aug/20 09:46
    Worklog Time Spent: 10m
    Work Description: ramesh0201 opened a new pull request #1401:
    URL: https://github.com/apache/hive/pull/1401

    ### What changes were proposed in this pull request?
    ### Why are the changes needed?
    ### Does this PR introduce _any_ user-facing change?
    ### How was this patch tested?

Issue Time Tracking
-------------------

    Worklog Id: (was: 470163)
    Remaining Estimate: 0h
    Time Spent: 10m

> Parallelize hash table constructions in map joins
> -------------------------------------------------
>
> Key: HIVE-24037
> URL: https://issues.apache.org/jira/browse/HIVE-24037
> Project: Hive
> Issue Type: Improvement
> Reporter: Ramesh Kumar Thangarajan
> Assignee: Ramesh Kumar Thangarajan
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Parallelize hash table constructions in map joins