[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=611159=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-611159 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 15/Jun/21 06:32
    Start Date: 15/Jun/21 06:32
    Worklog Time Spent: 10m
    Work Description: kasakrisz merged pull request #2264:
URL: https://github.com/apache/hive/pull/2264

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------
    Worklog Id: (was: 611159)
    Time Spent: 4h 50m (was: 4h 40m)

> Enable fetching deleted rows in vectorized mode
> -----------------------------------------------
>
> Key: HIVE-24991
> URL: https://issues.apache.org/jira/browse/HIVE-24991
> Project: Hive
> Issue Type: Improvement
> Components: Vectorization
> Reporter: Krisztian Kasa
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 4h 50m
> Remaining Estimate: 0h
>
> HIVE-24855 enables loading deleted rows from ORC tables when table property
> *acid.fetch.deleted.rows* is true.
> The goal of this jira is to enable this feature in vectorized orc batch
> reader.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608551=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608551 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 08/Jun/21 15:59
    Start Date: 08/Jun/21 15:59
    Worklog Time Spent: 10m
    Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647582327

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -281,16 +285,23 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
 deleteEventReaderOptions.range(0, Long.MAX_VALUE);
 deleteEventReaderOptions.searchArgument(null, null);
 keyInterval = findMinMaxKeys(orcSplit, conf, deleteEventReaderOptions);
+fetchDeletedRows = conf.getBoolean(Constants.ACID_FETCH_DELETED_ROWS, false);
 DeleteEventRegistry der;
 try {
 // See if we can load all the relevant delete events from all the
 // delete deltas in memory...
+ ColumnizedDeleteEventRegistry.OriginalWriteIdLoader writeIdLoader;
+ if (fetchDeletedRows) {
+writeIdLoader = new ColumnizedDeleteEventRegistry.BothWriteIdLoader();

Review comment: done

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -303,6 +314,12 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
 VectorizedRowBatch.DEFAULT_SIZE, null, null, null);
 }
 rowIdProjected = areRowIdsProjected(rbCtx);
+rowIsDeletedProjected = isVirtualColumnProjected(rbCtx, VirtualColumn.ROWISDELETED);

Review comment: done

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -303,6 +314,12 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
 VectorizedRowBatch.DEFAULT_SIZE, null, null, null);
 }
 rowIdProjected = areRowIdsProjected(rbCtx);
+rowIsDeletedProjected = isVirtualColumnProjected(rbCtx, VirtualColumn.ROWISDELETED);
+if (rowIsDeletedProjected) {
+ rowIsDeletedVector = new RowIsDeletedColumnVector();

Review comment: done

Issue Time Tracking
-------------------
    Worklog Id: (was: 608551)
    Time Spent: 4h 20m (was: 4h 10m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608553=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608553 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 08/Jun/21 15:59
    Start Date: 08/Jun/21 15:59
    Worklog Time Spent: 10m
    Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647583124

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -1748,7 +1946,7 @@ public int compareTo(CompressedOwid other) {
 assert shouldReadDeleteDeltasWithLlap(conf, true);
 }
 deleteReaderValue = new DeleteReaderValue(readerData.reader, deleteDeltaFile, readerOptions, bucket,
-validWriteIdList, isBucketedTable, conf, keyInterval, orcSplit, numRows, cacheTag, fileId);
+validWriteIdList, isBucketedTable, conf, keyInterval, orcSplit, numRows, cacheTag, fileId);

Review comment: reverted

Issue Time Tracking
-------------------
    Worklog Id: (was: 608553)
    Time Spent: 4h 40m (was: 4.5h)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608552=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608552 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 08/Jun/21 15:59
    Start Date: 08/Jun/21 15:59
    Worklog Time Spent: 10m
    Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647582736

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -948,7 +978,7 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
 // This loop fills up the selected[] vector with all the index positions that are selected.
 for (int setBitIndex = selectedBitSet.nextSetBit(0), selectedItr = 0;
 setBitIndex >= 0;
- setBitIndex = selectedBitSet.nextSetBit(setBitIndex+1), ++selectedItr) {
+ setBitIndex = selectedBitSet.nextSetBit(setBitIndex + 1), ++selectedItr) {

Review comment: reverted

Issue Time Tracking
-------------------
    Worklog Id: (was: 608552)
    Time Spent: 4.5h (was: 4h 20m)
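The loop quoted in the hunk above turns a BitSet of selected rows into a dense selected[] index array. A minimal standalone sketch of that idea follows; the class SelectedIndexes and its fill method are hypothetical names used only for illustration and are not part of Hive.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical helper for illustration only; it is not a Hive class.
final class SelectedIndexes {
  private SelectedIndexes() {
  }

  // Copies the index of every set bit into selected[] and returns how many
  // entries were written. selected[] must hold at least cardinality() entries.
  static int fill(BitSet selectedBitSet, int[] selected) {
    int selectedItr = 0;
    // nextSetBit returns -1 when no further bit is set, which ends the loop.
    for (int setBitIndex = selectedBitSet.nextSetBit(0);
         setBitIndex >= 0;
         setBitIndex = selectedBitSet.nextSetBit(setBitIndex + 1)) {
      selected[selectedItr++] = setBitIndex;
    }
    return selectedItr;
  }

  public static void main(String[] args) {
    BitSet bits = new BitSet(8);
    bits.set(1);
    bits.set(4);
    bits.set(6);
    int[] selected = new int[8];
    int size = fill(bits, selected);
    // Prints: 3 [1, 4, 6]
    System.out.println(size + " " + Arrays.toString(Arrays.copyOf(selected, size)));
  }
}
```

The returned count plays the role of the batch size after filtering: only the first `size` entries of selected[] are meaningful.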
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608548=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608548 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 08/Jun/21 15:58
    Start Date: 08/Jun/21 15:58
    Worklog Time Spent: 10m
    Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647582032

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -1940,39 +2091,38 @@ public boolean isEmpty() {
 }
 @Override
 public void findDeletedRecords(ColumnVector[] cols, int size, BitSet selectedBitSet) {
- if (rowIds == null || compressedOwids == null) {
+ if (rowIds == null || writeIds == null || writeIds.isEmpty()) {
 return;
 }
 // Iterate through the batch and for each (owid, rowid) in the batch
 // check if it is deleted or not.
 long[] originalWriteIdVector =
- cols[OrcRecordUpdater.ORIGINAL_WRITEID].isRepeating ? null
- : ((LongColumnVector) cols[OrcRecordUpdater.ORIGINAL_WRITEID]).vector;
+ cols[OrcRecordUpdater.ORIGINAL_WRITEID].isRepeating ? null

Review comment: reverted

## File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedOrcAcidRowBatchReader.java ##
@@ -961,26 +966,41 @@ private void testDeleteEventOriginalFiltering2() throws Exception {
 @Test
 public void testVectorizedOrcAcidRowBatchReader() throws Exception {
+setupTestData();
+
+ testVectorizedOrcAcidRowBatchReader(ColumnizedDeleteEventRegistry.class.getName());
+
+// To test the SortMergedDeleteEventRegistry, we need to explicitly set the
+// HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY constant to a smaller value.
+int oldValue = conf.getInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, 100);
+ conf.setInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, 1000);
+ testVectorizedOrcAcidRowBatchReader(SortMergedDeleteEventRegistry.class.getName());
+
+// Restore the old value.
+ conf.setInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, oldValue);
+ }
+
+ private void setupTestData() throws IOException {
 conf.set("bucket_count", "1");
- conf.set(ValidTxnList.VALID_TXNS_KEY,
- new ValidReadTxnList(new long[0], new BitSet(), 1000, Long.MAX_VALUE).writeToString());
+conf.set(ValidTxnList.VALID_TXNS_KEY,
+new ValidReadTxnList(new long[0], new BitSet(), 1000, Long.MAX_VALUE).writeToString());
 int bucket = 0;
 AcidOutputFormat.Options options = new AcidOutputFormat.Options(conf)
-.filesystem(fs)
-.bucket(bucket)
-.writingBase(false)
-.minimumWriteId(1)
-.maximumWriteId(NUM_OWID)
-.inspector(inspector)
-.reporter(Reporter.NULL)
-.recordIdColumn(1)
-.finalDestination(root);
+.filesystem(fs)

Review comment: reverted

Issue Time Tracking
-------------------
    Worklog Id: (was: 608548)
    Time Spent: 4h 10m (was: 4h)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608534=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608534 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 08/Jun/21 15:37
    Start Date: 08/Jun/21 15:37
    Worklog Time Spent: 10m
    Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647563794

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -2039,4 +2189,29 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List

> Enable fetching deleted rows in vectorized mode
> -----------------------------------------------
>
> Key: HIVE-24991
> URL: https://issues.apache.org/jira/browse/HIVE-24991
> Project: Hive
> Issue Type: Improvement
> Components: Vectorization
> Reporter: Krisztian Kasa
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 4h
> Remaining Estimate: 0h
>
> HIVE-24855 enables loading deleted rows from ORC tables when table property
> *acid.fetch.deleted.rows* is true.
> The goal of this jira is to enable this feature in vectorized orc batch
> reader.
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608531=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608531 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 08/Jun/21 15:36
    Start Date: 08/Jun/21 15:36
    Worklog Time Spent: 10m
    Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647562562

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -2039,4 +2189,29 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List

> Enable fetching deleted rows in vectorized mode
> -----------------------------------------------
>
> Key: HIVE-24991
> URL: https://issues.apache.org/jira/browse/HIVE-24991
> Project: Hive
> Issue Type: Improvement
> Components: Vectorization
> Reporter: Krisztian Kasa
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 3h 50m
> Remaining Estimate: 0h
>
> HIVE-24855 enables loading deleted rows from ORC tables when table property
> *acid.fetch.deleted.rows* is true.
> The goal of this jira is to enable this feature in vectorized orc batch
> reader.
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608529=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608529 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 08/Jun/21 15:35
    Start Date: 08/Jun/21 15:35
    Worklog Time Spent: 10m
    Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647561929

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -959,6 +989,20 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
 int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWID);
 value.cols[ix] = recordIdColumnVector;
 }
+if (rowIsDeletedProjected) {
+ if (fetchDeletedRows) {

Review comment: I prefer your first suggestion because the second one requires passing `vectorizedRowBatchBase.size()` to the `set` method which I would like to avoid.

Issue Time Tracking
-------------------
    Worklog Id: (was: 608529)
    Time Spent: 3h 40m (was: 3.5h)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608526=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608526 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 08/Jun/21 15:32
    Start Date: 08/Jun/21 15:32
    Worklog Time Spent: 10m
    Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647559407

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -959,6 +989,20 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
 int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWID);

Review comment: see my previous comment for `VirtualColumn.ROWISDELETED`

Issue Time Tracking
-------------------
    Worklog Id: (was: 608526)
    Time Spent: 3h 20m (was: 3h 10m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608527=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608527 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 08/Jun/21 15:32
    Start Date: 08/Jun/21 15:32
    Worklog Time Spent: 10m
    Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647559557

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -983,7 +1027,7 @@ private void copyFromBase(VectorizedRowBatch value) {
 System.arraycopy(payloadStruct.fields, 0, value.cols, 0, value.getDataColumnCount());
 }
 if (rowIdProjected) {
- recordIdColumnVector.fields[0] = vectorizedRowBatchBase.cols[OrcRecordUpdater.ORIGINAL_WRITEID];
+ recordIdColumnVector.fields[0] = vectorizedRowBatchBase.cols[fetchDeletedRows ? OrcRecordUpdater.CURRENT_WRITEID : OrcRecordUpdater.ORIGINAL_WRITEID];

Review comment: done

Issue Time Tracking
-------------------
    Worklog Id: (was: 608527)
    Time Spent: 3.5h (was: 3h 20m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608522=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608522 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 08/Jun/21 15:31
    Start Date: 08/Jun/21 15:31
    Worklog Time Spent: 10m
    Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647558152

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -892,13 +913,20 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
 } catch (Exception e) {
 throw new IOException("error iterating", e);
 }
-if(!includeAcidColumns) {
+if (!includeAcidColumns) {
 //if here, we don't need to filter anything wrt acid metadata columns
 //in fact, they are not even read from file/llap
 value.size = vectorizedRowBatchBase.size;
 value.selected = vectorizedRowBatchBase.selected;
 value.selectedInUse = vectorizedRowBatchBase.selectedInUse;
 copyFromBase(value);
+
+ if (rowIsDeletedProjected) {
+rowIsDeletedVector.clear();
+int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWISDELETED);

Review comment: I started to work on a solution to manage Virtual Column related information, but it led to a much bigger change. `VectorizedOrcAcidRowBatchReader` can behave in several ways, and each of those behaviors is worth a separate class after extracting the common parts. So I decided to follow the existing logic implemented for RowId.

Issue Time Tracking
-------------------
    Worklog Id: (was: 608522)
    Time Spent: 3h (was: 2h 50m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608524=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608524 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 08/Jun/21 15:31
    Start Date: 08/Jun/21 15:31
    Worklog Time Spent: 10m
    Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647558355

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -932,8 +960,10 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
 }
 // Case 2- find rows which have been deleted.
+BitSet notDeletedBitSet = fetchDeletedRows ? (BitSet) selectedBitSet.clone() : selectedBitSet;

Review comment: done

Issue Time Tracking
-------------------
    Worklog Id: (was: 608524)
    Time Spent: 3h 10m (was: 3h)
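The added line in the hunk above clones selectedBitSet only when deleted rows are being fetched. One plausible reading, sketched below under that assumption, is that the clone lets the reader keep deleted rows in the selection while still recording which rows survived; without the flag, the same BitSet is reused so deleted rows drop out of the selection. DeletedRowTracking, markDeleted and clearDeleted are hypothetical names, and the int[] of deleted row positions stands in for Hive's delete-event lookup.

```java
import java.util.BitSet;

// Hypothetical illustration only; it is not a Hive class.
final class DeletedRowTracking {
  private DeletedRowTracking() {
  }

  // Stand-in for a delete-event lookup: clears the bit of every deleted row.
  static void clearDeleted(BitSet target, int[] deletedRows) {
    for (int row : deletedRows) {
      target.clear(row);
    }
  }

  // Returns the set of rows that are not deleted. When fetchDeletedRows is
  // true, selectedBitSet is left untouched so deleted rows stay selected.
  static BitSet markDeleted(BitSet selectedBitSet, int[] deletedRows, boolean fetchDeletedRows) {
    BitSet notDeletedBitSet = fetchDeletedRows ? (BitSet) selectedBitSet.clone() : selectedBitSet;
    clearDeleted(notDeletedBitSet, deletedRows);
    return notDeletedBitSet;
  }

  public static void main(String[] args) {
    int[] deletedRows = {1, 3};

    BitSet selected = new BitSet();
    selected.set(0, 4);
    BitSet notDeleted = markDeleted(selected, deletedRows, true);
    // Prints: {0, 1, 2, 3} {0, 2}
    System.out.println(selected + " " + notDeleted);

    BitSet filtered = new BitSet();
    filtered.set(0, 4);
    markDeleted(filtered, deletedRows, false);
    // Prints: {0, 2}
    System.out.println(filtered);
  }
}
```

Cloning costs one extra BitSet per batch, which is why the sketch, like the quoted line, avoids it when the flag is off.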
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607874=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607874 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 13:46
    Start Date: 07/Jun/21 13:46
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646602703

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -2039,4 +2189,29 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List

> Enable fetching deleted rows in vectorized mode
> -----------------------------------------------
>
> Key: HIVE-24991
> URL: https://issues.apache.org/jira/browse/HIVE-24991
> Project: Hive
> Issue Type: Improvement
> Components: Vectorization
> Reporter: Krisztian Kasa
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> HIVE-24855 enables loading deleted rows from ORC tables when table property
> *acid.fetch.deleted.rows* is true.
> The goal of this jira is to enable this feature in vectorized orc batch
> reader.
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607870=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607870 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 13:44
    Start Date: 07/Jun/21 13:44
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646601459

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -2039,4 +2189,29 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List

> Enable fetching deleted rows in vectorized mode
> -----------------------------------------------
>
> Key: HIVE-24991
> URL: https://issues.apache.org/jira/browse/HIVE-24991
> Project: Hive
> Issue Type: Improvement
> Components: Vectorization
> Reporter: Krisztian Kasa
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> HIVE-24855 enables loading deleted rows from ORC tables when table property
> *acid.fetch.deleted.rows* is true.
> The goal of this jira is to enable this feature in vectorized orc batch
> reader.
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607869=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607869 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 13:42
    Start Date: 07/Jun/21 13:42
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646598948

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -1748,7 +1946,7 @@ public int compareTo(CompressedOwid other) {
 assert shouldReadDeleteDeltasWithLlap(conf, true);
 }
 deleteReaderValue = new DeleteReaderValue(readerData.reader, deleteDeltaFile, readerOptions, bucket,
-validWriteIdList, isBucketedTable, conf, keyInterval, orcSplit, numRows, cacheTag, fileId);
+validWriteIdList, isBucketedTable, conf, keyInterval, orcSplit, numRows, cacheTag, fileId);

Review comment: unnecessary space

Issue Time Tracking
-------------------
    Worklog Id: (was: 607869)
    Time Spent: 2.5h (was: 2h 20m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607865=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607865 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 13:38
    Start Date: 07/Jun/21 13:38
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646596265

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -959,6 +989,20 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
 int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWID);
 value.cols[ix] = recordIdColumnVector;
 }
+if (rowIsDeletedProjected) {
+ if (fetchDeletedRows) {

Review comment: tbh we could even do the second check as part of the Set method (as we do already for cardinality 0) and simplify the logic here

Issue Time Tracking
-------------------
    Worklog Id: (was: 607865)
    Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607859=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607859 ]

ASF GitHub Bot logged work on HIVE-24991:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 13:34
    Start Date: 07/Jun/21 13:34
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646592844

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -983,7 +1027,7 @@ private void copyFromBase(VectorizedRowBatch value) {
 System.arraycopy(payloadStruct.fields, 0, value.cols, 0, value.getDataColumnCount());
 }
 if (rowIdProjected) {
- recordIdColumnVector.fields[0] = vectorizedRowBatchBase.cols[OrcRecordUpdater.ORIGINAL_WRITEID];
+ recordIdColumnVector.fields[0] = vectorizedRowBatchBase.cols[fetchDeletedRows ? OrcRecordUpdater.CURRENT_WRITEID : OrcRecordUpdater.ORIGINAL_WRITEID];

Review comment: would love a comment about the different WRITEID here

Issue Time Tracking
-------------------
    Worklog Id: (was: 607859)
    Time Spent: 2h 10m (was: 2h)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607858=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607858 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 13:34
    Start Date: 07/Jun/21 13:34
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646592084

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -959,6 +989,20 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
       int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWID);
       value.cols[ix] = recordIdColumnVector;
     }
+    if (rowIsDeletedProjected) {
+      if (fetchDeletedRows) {

Review comment:
if (!fetchDeletedRows || notDeletedBitSet.cardinality() == vectorizedRowBatchBase.size )

Issue Time Tracking
---
    Worklog Id: (was: 607858)
    Time Spent: 2h (was: 1h 50m)
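The condition proposed in the review comment above can be read as a fast path: when deleted rows are not being fetched, or every row in the batch is still marked as not deleted, no per-row work is needed. The following is only a minimal sketch of that idea using java.util.BitSet. The class RowIsDeletedSketch and its method are hypothetical and not part of Hive, and the sketch assumes that every row of the batch was selected before delete events were applied, so a clear bit means "deleted".

```java
import java.util.BitSet;

// Hypothetical illustration of the suggested condition; not the Hive implementation.
final class RowIsDeletedSketch {

  // Returns one flag per row of the batch: true when the row is deleted.
  // Assumption: all rows were selected before delete events cleared bits in notDeleted.
  static boolean[] rowIsDeleted(boolean fetchDeletedRows, BitSet notDeleted, int batchSize) {
    boolean[] isDeleted = new boolean[batchSize]; // Java initializes this to all false
    if (!fetchDeletedRows || notDeleted.cardinality() == batchSize) {
      // Fast path: no deleted row can appear in the output, so skip the per-row loop.
      return isDeleted;
    }
    // Slow path: every clear bit below batchSize marks a deleted row.
    for (int i = notDeleted.nextClearBit(0); i < batchSize; i = notDeleted.nextClearBit(i + 1)) {
      isDeleted[i] = true;
    }
    return isDeleted;
  }
}
```

In a vectorized reader the fast path would more likely mark the column vector as repeating with a single false value than allocate an array; the array is used here only to keep the sketch self-contained.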
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607854=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607854 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 13:32
    Start Date: 07/Jun/21 13:32
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646590439

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -959,6 +989,20 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
       int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWID);

Review comment:
we could probably do the same optimization here

Issue Time Tracking
---
    Worklog Id: (was: 607854)
    Time Spent: 1h 50m (was: 1h 40m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607840=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607840 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 12:53
    Start Date: 07/Jun/21 12:53
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646558980

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -948,7 +978,7 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
     // This loop fills up the selected[] vector with all the index positions that are selected.
     for (int setBitIndex = selectedBitSet.nextSetBit(0), selectedItr = 0;
          setBitIndex >= 0;
-         setBitIndex = selectedBitSet.nextSetBit(setBitIndex+1), ++selectedItr) {
+         setBitIndex = selectedBitSet.nextSetBit(setBitIndex + 1), ++selectedItr) {

Review comment:
change not needed

Issue Time Tracking
---
    Worklog Id: (was: 607840)
    Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607839=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607839 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 12:52
    Start Date: 07/Jun/21 12:52
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646558742

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -932,8 +960,10 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
     }

     // Case 2- find rows which have been deleted.
+    BitSet notDeletedBitSet = fetchDeletedRows ? (BitSet) selectedBitSet.clone() : selectedBitSet;

Review comment:
lets add a comment above saying when/why we clone the BitSet

Issue Time Tracking
---
    Worklog Id: (was: 607839)
    Time Spent: 1.5h (was: 1h 20m)
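As a possible answer to the "when/why we clone" question in the comment above, here is a small self-contained sketch using only java.util.BitSet. It assumes, as the names in the hunk suggest, that delete filtering clears bits in the set it is handed. DeletedRowsSketch, markDeleted and deletedRows are hypothetical names chosen for illustration; none of this is the Hive code.

```java
import java.util.BitSet;

// Hypothetical illustration of why the BitSet is cloned; not the Hive implementation.
final class DeletedRowsSketch {

  // Stand-in for a delete-event lookup: clears the bit of every deleted row in place.
  static void markDeleted(BitSet rows, int... deletedRowIndexes) {
    for (int ix : deletedRowIndexes) {
      rows.clear(ix);
    }
  }

  // Returns the rows that were selected but turned out to be deleted.
  static BitSet deletedRows(BitSet selected, boolean fetchDeletedRows, int... deletedRowIndexes) {
    // Clone only when deleted rows must stay selected in the output batch.
    // Otherwise filter the selection in place and avoid the copy.
    BitSet notDeleted = fetchDeletedRows ? (BitSet) selected.clone() : selected;
    markDeleted(notDeleted, deletedRowIndexes);

    BitSet deleted = (BitSet) selected.clone();
    deleted.andNot(notDeleted); // selected AND NOT notDeleted
    return deleted;
  }
}
```

With fetchDeletedRows set to true the caller's selection is untouched and the returned set holds exactly the deleted rows. With false the selection itself is narrowed and the returned set is empty, because both names refer to the same object.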
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607837=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607837 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 12:49
    Start Date: 07/Jun/21 12:49
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646556329

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -303,6 +314,12 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
           VectorizedRowBatch.DEFAULT_SIZE, null, null, null);
     }
     rowIdProjected = areRowIdsProjected(rbCtx);
+    rowIsDeletedProjected = isVirtualColumnProjected(rbCtx, VirtualColumn.ROWISDELETED);
+    if (rowIsDeletedProjected) {
+      rowIsDeletedVector = new RowIsDeletedColumnVector();

Review comment:
Lets explicitly pass VectorizedRowBatch.DEFAULT_SIZE to make this more obvious

Issue Time Tracking
---
    Worklog Id: (was: 607837)
    Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607833=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607833 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 12:47
    Start Date: 07/Jun/21 12:47
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646554194

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -892,13 +913,20 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
     } catch (Exception e) {
       throw new IOException("error iterating", e);
     }
-    if(!includeAcidColumns) {
+    if (!includeAcidColumns) {
       //if here, we don't need to filter anything wrt acid metadata columns
       //in fact, they are not even read from file/llap
       value.size = vectorizedRowBatchBase.size;
       value.selected = vectorizedRowBatchBase.selected;
       value.selectedInUse = vectorizedRowBatchBase.selectedInUse;
       copyFromBase(value);
+
+      if (rowIsDeletedProjected) {
+        rowIsDeletedVector.clear();
+        int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWISDELETED);

Review comment:
Why do we have to recompute this for every batch? Lets store this along with rowIsDeletedProjected flag

Issue Time Tracking
---
    Worklog Id: (was: 607833)
    Time Spent: 1h (was: 50m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607834=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607834 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 12:47
    Start Date: 07/Jun/21 12:47
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646554194

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -892,13 +913,20 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
     } catch (Exception e) {
       throw new IOException("error iterating", e);
     }
-    if(!includeAcidColumns) {
+    if (!includeAcidColumns) {
       //if here, we don't need to filter anything wrt acid metadata columns
       //in fact, they are not even read from file/llap
       value.size = vectorizedRowBatchBase.size;
       value.selected = vectorizedRowBatchBase.selected;
       value.selectedInUse = vectorizedRowBatchBase.selectedInUse;
       copyFromBase(value);
+
+      if (rowIsDeletedProjected) {
+        rowIsDeletedVector.clear();
+        int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWISDELETED);

Review comment:
Why do we have to recompute **ix** for every batch? Lets store this along with rowIsDeletedProjected flag

Issue Time Tracking
---
    Worklog Id: (was: 607834)
    Time Spent: 1h 10m (was: 1h)
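The suggestion above, resolving the column index once and keeping it next to the projected flag, could look roughly like the sketch below. It is a hedged illustration only: CachedColumnIndexSketch, the string column names and the lookups counter are hypothetical stand-ins, and a plain List lookup replaces the real virtual-column lookup so the example stays self-contained.

```java
import java.util.List;

// Hypothetical illustration of caching a column index; not the Hive implementation.
final class CachedColumnIndexSketch {
  private final boolean rowIsDeletedProjected;
  private final int rowIsDeletedColumnIx; // -1 when the column is not projected
  int lookups = 0;                        // counts index lookups, for demonstration only

  CachedColumnIndexSketch(List<String> projectedVirtualColumns) {
    // Resolve the index once, when the reader is constructed.
    lookups++;
    this.rowIsDeletedColumnIx = projectedVirtualColumns.indexOf("ROWISDELETED");
    this.rowIsDeletedProjected = rowIsDeletedColumnIx >= 0;
  }

  // Called once per batch: reuses the cached index instead of searching again.
  void next(Object[] cols, Object rowIsDeletedVector) {
    if (rowIsDeletedProjected) {
      cols[rowIsDeletedColumnIx] = rowIsDeletedVector;
    }
  }
}
```

Since the set of projected columns does not change between batches of one reader, the per-batch path then only does an array store.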
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607832=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607832 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 12:43
    Start Date: 07/Jun/21 12:43
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646551472

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -303,6 +314,12 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
           VectorizedRowBatch.DEFAULT_SIZE, null, null, null);
     }
     rowIdProjected = areRowIdsProjected(rbCtx);
+    rowIsDeletedProjected = isVirtualColumnProjected(rbCtx, VirtualColumn.ROWISDELETED);

Review comment:
lets move this to a Utility function as areRowIdsProjected() above

Issue Time Tracking
---
    Worklog Id: (was: 607832)
    Time Spent: 50m (was: 40m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607830=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607830 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 12:41
    Start Date: 07/Jun/21 12:41
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646550206

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -281,16 +285,23 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
     deleteEventReaderOptions.range(0, Long.MAX_VALUE);
     deleteEventReaderOptions.searchArgument(null, null);
     keyInterval = findMinMaxKeys(orcSplit, conf, deleteEventReaderOptions);
+    fetchDeletedRows = conf.getBoolean(Constants.ACID_FETCH_DELETED_ROWS, false);
     DeleteEventRegistry der;
     try {
       // See if we can load all the relevant delete events from all the
       // delete deltas in memory...
+      ColumnizedDeleteEventRegistry.OriginalWriteIdLoader writeIdLoader;
+      if (fetchDeletedRows) {
+        writeIdLoader = new ColumnizedDeleteEventRegistry.BothWriteIdLoader();

Review comment:
Maybe rename to something more explicit like OriginalAndCurrentWriteIdLoader? Also lets add some comment above explaining the logic

Issue Time Tracking
---
    Worklog Id: (was: 607830)
    Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607822=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607822 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 12:25
    Start Date: 07/Jun/21 12:25
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646538850

## File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedOrcAcidRowBatchReader.java ##
@@ -961,26 +966,41 @@ private void testDeleteEventOriginalFiltering2() throws Exception {

   @Test
   public void testVectorizedOrcAcidRowBatchReader() throws Exception {
+    setupTestData();
+
+    testVectorizedOrcAcidRowBatchReader(ColumnizedDeleteEventRegistry.class.getName());
+
+    // To test the SortMergedDeleteEventRegistry, we need to explicitly set the
+    // HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY constant to a smaller value.
+    int oldValue = conf.getInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, 100);
+    conf.setInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, 1000);
+    testVectorizedOrcAcidRowBatchReader(SortMergedDeleteEventRegistry.class.getName());
+
+    // Restore the old value.
+    conf.setInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, oldValue);
+  }
+
+  private void setupTestData() throws IOException {
     conf.set("bucket_count", "1");
-      conf.set(ValidTxnList.VALID_TXNS_KEY,
-        new ValidReadTxnList(new long[0], new BitSet(), 1000, Long.MAX_VALUE).writeToString());
+    conf.set(ValidTxnList.VALID_TXNS_KEY,
+        new ValidReadTxnList(new long[0], new BitSet(), 1000, Long.MAX_VALUE).writeToString());

     int bucket = 0;
     AcidOutputFormat.Options options = new AcidOutputFormat.Options(conf)
-        .filesystem(fs)
-        .bucket(bucket)
-        .writingBase(false)
-        .minimumWriteId(1)
-        .maximumWriteId(NUM_OWID)
-        .inspector(inspector)
-        .reporter(Reporter.NULL)
-        .recordIdColumn(1)
-        .finalDestination(root);
+            .filesystem(fs)

Review comment:
nit. revert spaces

Issue Time Tracking
---
    Worklog Id: (was: 607822)
    Time Spent: 0.5h (was: 20m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607819=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607819 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 07/Jun/21 12:22
    Start Date: 07/Jun/21 12:22
    Worklog Time Spent: 10m
    Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646536788

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -1940,39 +2091,38 @@ public boolean isEmpty() {
     }
     @Override
     public void findDeletedRecords(ColumnVector[] cols, int size, BitSet selectedBitSet) {
-      if (rowIds == null || compressedOwids == null) {
+      if (rowIds == null || writeIds == null || writeIds.isEmpty()) {
         return;
       }
       // Iterate through the batch and for each (owid, rowid) in the batch
       // check if it is deleted or not.

       long[] originalWriteIdVector =
-          cols[OrcRecordUpdater.ORIGINAL_WRITEID].isRepeating ? null
-              : ((LongColumnVector) cols[OrcRecordUpdater.ORIGINAL_WRITEID]).vector;
+              cols[OrcRecordUpdater.ORIGINAL_WRITEID].isRepeating ? null

Review comment:
Lets avoid changing the tabs/spaces below

Issue Time Tracking
---
    Worklog Id: (was: 607819)
    Time Spent: 20m (was: 10m)
[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode
[ https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=595270=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595270 ]

ASF GitHub Bot logged work on HIVE-24991:
---
    Author: ASF GitHub Bot
    Created on: 12/May/21 12:39
    Start Date: 12/May/21 12:39
    Worklog Time Spent: 10m
    Work Description: kasakrisz opened a new pull request #2264:
URL: https://github.com/apache/hive/pull/2264

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Issue Time Tracking
---
    Worklog Id: (was: 595270)
    Remaining Estimate: 0h
    Time Spent: 10m