[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=611159&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-611159
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 15/Jun/21 06:32
Start Date: 15/Jun/21 06:32
Worklog Time Spent: 10m 
  Work Description: kasakrisz merged pull request #2264:
URL: https://github.com/apache/hive/pull/2264


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 611159)
Time Spent: 4h 50m  (was: 4h 40m)

> Enable fetching deleted rows in vectorized mode
> ---
>
> Key: HIVE-24991
> URL: https://issues.apache.org/jira/browse/HIVE-24991
> Project: Hive
>  Issue Type: Improvement
>  Components: Vectorization
>Reporter: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> HIVE-24855 enables loading deleted rows from ORC tables when table property 
> *acid.fetch.deleted.rows* is true.
> The goal of this jira is to enable this feature in vectorized orc batch 
> reader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608551&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608551
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 08/Jun/21 15:59
Start Date: 08/Jun/21 15:59
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647582327



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -281,16 +285,23 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, 
OrcSplit orcSplit, Reporte
 deleteEventReaderOptions.range(0, Long.MAX_VALUE);
 deleteEventReaderOptions.searchArgument(null, null);
 keyInterval = findMinMaxKeys(orcSplit, conf, deleteEventReaderOptions);
+fetchDeletedRows = conf.getBoolean(Constants.ACID_FETCH_DELETED_ROWS, 
false);
 DeleteEventRegistry der;
 try {
   // See if we can load all the relevant delete events from all the
   // delete deltas in memory...
+  ColumnizedDeleteEventRegistry.OriginalWriteIdLoader writeIdLoader;
+  if (fetchDeletedRows) {
+writeIdLoader = new ColumnizedDeleteEventRegistry.BothWriteIdLoader();

Review comment:
   done

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -303,6 +314,12 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, 
OrcSplit orcSplit, Reporte
   VectorizedRowBatch.DEFAULT_SIZE, null, null, null);
 }
 rowIdProjected = areRowIdsProjected(rbCtx);
+rowIsDeletedProjected = isVirtualColumnProjected(rbCtx, 
VirtualColumn.ROWISDELETED);

Review comment:
   done

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -303,6 +314,12 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, 
OrcSplit orcSplit, Reporte
   VectorizedRowBatch.DEFAULT_SIZE, null, null, null);
 }
 rowIdProjected = areRowIdsProjected(rbCtx);
+rowIsDeletedProjected = isVirtualColumnProjected(rbCtx, 
VirtualColumn.ROWISDELETED);
+if (rowIsDeletedProjected) {
+  rowIsDeletedVector = new RowIsDeletedColumnVector();

Review comment:
   done






Issue Time Tracking
---

Worklog Id: (was: 608551)
Time Spent: 4h 20m  (was: 4h 10m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608553&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608553
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 08/Jun/21 15:59
Start Date: 08/Jun/21 15:59
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647583124



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -1748,7 +1946,7 @@ public int compareTo(CompressedOwid other) {
   assert shouldReadDeleteDeltasWithLlap(conf, true);
 }
 deleteReaderValue = new DeleteReaderValue(readerData.reader, 
deleteDeltaFile, readerOptions, bucket,
-validWriteIdList, isBucketedTable, conf, keyInterval, 
orcSplit, numRows, cacheTag, fileId);
+validWriteIdList, isBucketedTable, conf, keyInterval, 
orcSplit, numRows, cacheTag, fileId);

Review comment:
   reverted






Issue Time Tracking
---

Worklog Id: (was: 608553)
Time Spent: 4h 40m  (was: 4.5h)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608552&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608552
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 08/Jun/21 15:59
Start Date: 08/Jun/21 15:59
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647582736



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -948,7 +978,7 @@ public boolean next(NullWritable key, VectorizedRowBatch 
value) throws IOExcepti
   // This loop fills up the selected[] vector with all the index positions 
that are selected.
   for (int setBitIndex = selectedBitSet.nextSetBit(0), selectedItr = 0;
setBitIndex >= 0;
-   setBitIndex = selectedBitSet.nextSetBit(setBitIndex+1), 
++selectedItr) {
+   setBitIndex = selectedBitSet.nextSetBit(setBitIndex + 1), 
++selectedItr) {

Review comment:
   reverted






Issue Time Tracking
---

Worklog Id: (was: 608552)
Time Spent: 4.5h  (was: 4h 20m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608548&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608548
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 08/Jun/21 15:58
Start Date: 08/Jun/21 15:58
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647582032



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -1940,39 +2091,38 @@ public boolean isEmpty() {
 }
 @Override
 public void findDeletedRecords(ColumnVector[] cols, int size, BitSet 
selectedBitSet) {
-  if (rowIds == null || compressedOwids == null) {
+  if (rowIds == null || writeIds == null || writeIds.isEmpty()) {
 return;
   }
   // Iterate through the batch and for each (owid, rowid) in the batch
   // check if it is deleted or not.
 
   long[] originalWriteIdVector =
-  cols[OrcRecordUpdater.ORIGINAL_WRITEID].isRepeating ? null
-  : ((LongColumnVector) 
cols[OrcRecordUpdater.ORIGINAL_WRITEID]).vector;
+  cols[OrcRecordUpdater.ORIGINAL_WRITEID].isRepeating ? null

Review comment:
   reverted

##
File path: 
ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedOrcAcidRowBatchReader.java
##
@@ -961,26 +966,41 @@ private void testDeleteEventOriginalFiltering2() throws 
Exception {
 
   @Test
   public void testVectorizedOrcAcidRowBatchReader() throws Exception {
+setupTestData();
+
+
testVectorizedOrcAcidRowBatchReader(ColumnizedDeleteEventRegistry.class.getName());
+
+// To test the SortMergedDeleteEventRegistry, we need to explicitly set the
+// HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY constant to a smaller value.
+int oldValue = 
conf.getInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, 
100);
+
conf.setInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, 
1000);
+
testVectorizedOrcAcidRowBatchReader(SortMergedDeleteEventRegistry.class.getName());
+
+// Restore the old value.
+
conf.setInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, 
oldValue);
+  }
+
+  private void setupTestData() throws IOException {
 conf.set("bucket_count", "1");
-  conf.set(ValidTxnList.VALID_TXNS_KEY,
-  new ValidReadTxnList(new long[0], new BitSet(), 1000, 
Long.MAX_VALUE).writeToString());
+conf.set(ValidTxnList.VALID_TXNS_KEY,
+new ValidReadTxnList(new long[0], new BitSet(), 1000, 
Long.MAX_VALUE).writeToString());
 
 int bucket = 0;
 AcidOutputFormat.Options options = new AcidOutputFormat.Options(conf)
-.filesystem(fs)
-.bucket(bucket)
-.writingBase(false)
-.minimumWriteId(1)
-.maximumWriteId(NUM_OWID)
-.inspector(inspector)
-.reporter(Reporter.NULL)
-.recordIdColumn(1)
-.finalDestination(root);
+.filesystem(fs)

Review comment:
   reverted
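
The test hunk above lowers `HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY`, runs the `SortMergedDeleteEventRegistry` case, and then restores the old value. A minimal sketch of the same save/override/restore pattern follows, written with `try`/`finally` so the restore also runs when the test body throws. It uses `java.util.Properties` as a stand-in for the Hive configuration object; the class name `ConfOverrideSketch` and the helper `withOverride` are hypothetical and not part of the pull request.

```java
import java.util.Properties;

/** Illustrative only: Properties stands in for the configuration used by the test. */
class ConfOverrideSketch {

  /**
   * Runs body with key temporarily set to value, then puts the configuration
   * back into the state it had before the call, even if body throws.
   */
  static void withOverride(Properties conf, String key, String value, Runnable body) {
    String old = conf.getProperty(key); // null when the key was not set at all
    conf.setProperty(key, value);
    try {
      body.run();
    } finally {
      if (old == null) {
        conf.remove(key);               // do not leave a value behind that was never set
      } else {
        conf.setProperty(key, old);
      }
    }
  }
}
```

Unlike reading the old value with a fallback default and writing it back, this variant leaves a previously unset key unset.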






Issue Time Tracking
---

Worklog Id: (was: 608548)
Time Spent: 4h 10m  (was: 4h)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608534&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608534
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 08/Jun/21 15:37
Start Date: 08/Jun/21 15:37
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647563794



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -2039,4 +2189,29 @@ private static IntegerColumnStatistics 
deserializeIntColumnStatistics(List

> Enable fetching deleted rows in vectorized mode
> ---
>
> Key: HIVE-24991
> URL: https://issues.apache.org/jira/browse/HIVE-24991
> Project: Hive
>  Issue Type: Improvement
>  Components: Vectorization
>Reporter: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>


[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608531&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608531
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 08/Jun/21 15:36
Start Date: 08/Jun/21 15:36
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647562562



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -2039,4 +2189,29 @@ private static IntegerColumnStatistics 
deserializeIntColumnStatistics(List

> Enable fetching deleted rows in vectorized mode
> ---
>
> Key: HIVE-24991
> URL: https://issues.apache.org/jira/browse/HIVE-24991
> Project: Hive
>  Issue Type: Improvement
>  Components: Vectorization
>Reporter: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>


[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608529&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608529
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 08/Jun/21 15:35
Start Date: 08/Jun/21 15:35
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647561929



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -959,6 +989,20 @@ public boolean next(NullWritable key, VectorizedRowBatch 
value) throws IOExcepti
   int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWID);
   value.cols[ix] = recordIdColumnVector;
 }
+if (rowIsDeletedProjected) {
+  if (fetchDeletedRows) {

Review comment:
   I prefer your first suggestion because the second one requires passing 
`vectorizedRowBatchBase.size()` to the `set` method which I would like to avoid.
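
One plausible reading of the trade-off in this reply, sketched with a hypothetical flag vector: a `set` method that receives only the bit set of not-deleted rows can detect "every row is deleted" by itself (cardinality 0), but detecting "no row is deleted" needs the batch size as an extra argument. The class `DeletedFlagsSketch`, its fields, and both `set` overloads are assumptions made for illustration; this is not the `RowIsDeletedColumnVector` from the pull request.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical flag vector: 1 means "row is deleted", 0 means "row is live". */
class DeletedFlagsSketch {
  final long[] vector;
  boolean isRepeating; // when true, vector[0] applies to every row in the batch

  DeletedFlagsSketch(int capacity) {
    vector = new long[capacity];
  }

  /** Variant without a size argument: only the "all rows deleted" shortcut is possible. */
  void set(BitSet notDeleted) {
    if (notDeleted.cardinality() == 0) {
      isRepeating = true;
      vector[0] = 1;
      return;
    }
    isRepeating = false;
    Arrays.fill(vector, 1);
    // Assumes every set bit is below the vector capacity; the bound check guards the sketch.
    for (int i = notDeleted.nextSetBit(0); i >= 0 && i < vector.length;
         i = notDeleted.nextSetBit(i + 1)) {
      vector[i] = 0;
    }
  }

  /** Variant with the batch size: can also short-circuit when nothing was deleted. */
  void set(BitSet notDeleted, int size) {
    // Assumes no bits at or above size are set.
    if (notDeleted.cardinality() == size) {
      isRepeating = true;
      vector[0] = 0;
      return;
    }
    set(notDeleted);
  }
}
```

Keeping the size comparison at the call site, as the reply prefers, leaves the single-argument `set` as the only entry point.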






Issue Time Tracking
---

Worklog Id: (was: 608529)
Time Spent: 3h 40m  (was: 3.5h)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608526&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608526
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 08/Jun/21 15:32
Start Date: 08/Jun/21 15:32
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647559407



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -959,6 +989,20 @@ public boolean next(NullWritable key, VectorizedRowBatch 
value) throws IOExcepti
   int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWID);

Review comment:
   see my previous comment for `VirtualColumn.ROWISDELETED`






Issue Time Tracking
---

Worklog Id: (was: 608526)
Time Spent: 3h 20m  (was: 3h 10m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608527&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608527
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 08/Jun/21 15:32
Start Date: 08/Jun/21 15:32
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647559557



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -983,7 +1027,7 @@ private void copyFromBase(VectorizedRowBatch value) {
   System.arraycopy(payloadStruct.fields, 0, value.cols, 0, 
value.getDataColumnCount());
 }
 if (rowIdProjected) {
-  recordIdColumnVector.fields[0] = 
vectorizedRowBatchBase.cols[OrcRecordUpdater.ORIGINAL_WRITEID];
+  recordIdColumnVector.fields[0] = 
vectorizedRowBatchBase.cols[fetchDeletedRows ? OrcRecordUpdater.CURRENT_WRITEID 
: OrcRecordUpdater.ORIGINAL_WRITEID];

Review comment:
   done






Issue Time Tracking
---

Worklog Id: (was: 608527)
Time Spent: 3.5h  (was: 3h 20m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608522&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608522
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 08/Jun/21 15:31
Start Date: 08/Jun/21 15:31
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647558152



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -892,13 +913,20 @@ public boolean next(NullWritable key, VectorizedRowBatch 
value) throws IOExcepti
 } catch (Exception e) {
   throw new IOException("error iterating", e);
 }
-if(!includeAcidColumns) {
+if (!includeAcidColumns) {
   //if here, we don't need to filter anything wrt acid metadata columns
   //in fact, they are not even read from file/llap
   value.size = vectorizedRowBatchBase.size;
   value.selected = vectorizedRowBatchBase.selected;
   value.selectedInUse = vectorizedRowBatchBase.selectedInUse;
   copyFromBase(value);
+
+  if (rowIsDeletedProjected) {
+rowIsDeletedVector.clear();
+int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWISDELETED);

Review comment:
   I started to work on a solution to manage Virtual Column related 
information, but it led to a much bigger change. 
`VectorizedOrcAcidRowBatchReader` can behave in several ways, and each of those 
behaviors is worth a separate class after extracting the common parts.
   So I decided to follow the existing logic implemented for RowId.






Issue Time Tracking
---

Worklog Id: (was: 608522)
Time Spent: 3h  (was: 2h 50m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=608524&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608524
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 08/Jun/21 15:31
Start Date: 08/Jun/21 15:31
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r647558355



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -932,8 +960,10 @@ public boolean next(NullWritable key, VectorizedRowBatch 
value) throws IOExcepti
 }
 
 // Case 2- find rows which have been deleted.
+BitSet notDeletedBitSet = fetchDeletedRows ? (BitSet) 
selectedBitSet.clone() : selectedBitSet;

Review comment:
   done
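
The hunk above clones `selectedBitSet` only when `fetchDeletedRows` is set. A minimal, self-contained sketch of why the copy matters is below. It uses only `java.util.BitSet`; the class name `NotDeletedBitSetSketch`, the helper `markDeleted` (a stand-in for a delete event registry), and the way the deleted rows are derived afterwards are illustrative assumptions, not code from the pull request.

```java
import java.util.BitSet;

/** Illustrative only: mimics the clone-or-alias choice from the diff hunk. */
class NotDeletedBitSetSketch {

  /** Hypothetical stand-in for a delete event registry: clears the bits of deleted rows. */
  static void markDeleted(BitSet bits, int... deletedRows) {
    for (int row : deletedRows) {
      bits.clear(row);
    }
  }

  /**
   * Returns the rows that remain selected for a batch of batchSize rows.
   * With fetchDeletedRows the registry works on a copy, so deleted rows stay selected;
   * without it both names refer to the same BitSet and deleted rows drop out.
   */
  static BitSet selectRows(int batchSize, boolean fetchDeletedRows, int... deletedRows) {
    BitSet selectedBitSet = new BitSet(batchSize);
    selectedBitSet.set(0, batchSize);
    BitSet notDeletedBitSet = fetchDeletedRows ? (BitSet) selectedBitSet.clone() : selectedBitSet;
    markDeleted(notDeletedBitSet, deletedRows);
    return selectedBitSet;
  }

  /** Rows that are selected but no longer present in the not-deleted set. */
  static BitSet deletedAmongSelected(BitSet selected, BitSet notDeleted) {
    BitSet deleted = (BitSet) selected.clone();
    deleted.andNot(notDeleted);
    return deleted;
  }
}
```

In this sketch, the difference between the two bit sets is what a row-is-deleted flag could be computed from when deleted rows are fetched.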






Issue Time Tracking
---

Worklog Id: (was: 608524)
Time Spent: 3h 10m  (was: 3h)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607874&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607874
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 13:46
Start Date: 07/Jun/21 13:46
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646602703



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -2039,4 +2189,29 @@ private static IntegerColumnStatistics 
deserializeIntColumnStatistics(List

> Enable fetching deleted rows in vectorized mode
> ---
>
> Key: HIVE-24991
> URL: https://issues.apache.org/jira/browse/HIVE-24991
> Project: Hive
>  Issue Type: Improvement
>  Components: Vectorization
>Reporter: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>


[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607870&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607870
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 13:44
Start Date: 07/Jun/21 13:44
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646601459



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -2039,4 +2189,29 @@ private static IntegerColumnStatistics 
deserializeIntColumnStatistics(List

> Enable fetching deleted rows in vectorized mode
> ---
>
> Key: HIVE-24991
> URL: https://issues.apache.org/jira/browse/HIVE-24991
> Project: Hive
>  Issue Type: Improvement
>  Components: Vectorization
>Reporter: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>


[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607869&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607869
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 13:42
Start Date: 07/Jun/21 13:42
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646598948



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -1748,7 +1946,7 @@ public int compareTo(CompressedOwid other) {
   assert shouldReadDeleteDeltasWithLlap(conf, true);
 }
 deleteReaderValue = new DeleteReaderValue(readerData.reader, 
deleteDeltaFile, readerOptions, bucket,
-validWriteIdList, isBucketedTable, conf, keyInterval, 
orcSplit, numRows, cacheTag, fileId);
+validWriteIdList, isBucketedTable, conf, keyInterval, 
orcSplit, numRows, cacheTag, fileId);

Review comment:
   unnecessary space






Issue Time Tracking
---

Worklog Id: (was: 607869)
Time Spent: 2.5h  (was: 2h 20m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607865&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607865
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 13:38
Start Date: 07/Jun/21 13:38
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646596265



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -959,6 +989,20 @@ public boolean next(NullWritable key, VectorizedRowBatch 
value) throws IOExcepti
   int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWID);
   value.cols[ix] = recordIdColumnVector;
 }
+if (rowIsDeletedProjected) {
+  if (fetchDeletedRows) {

Review comment:
   To be honest, we could even do the second check inside the Set method (as we 
already do for cardinality 0) and simplify the logic here.






Issue Time Tracking
---

Worklog Id: (was: 607865)
Time Spent: 2h 20m  (was: 2h 10m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607859=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607859
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 13:34
Start Date: 07/Jun/21 13:34
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646592844



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -983,7 +1027,7 @@ private void copyFromBase(VectorizedRowBatch value) {
   System.arraycopy(payloadStruct.fields, 0, value.cols, 0, 
value.getDataColumnCount());
 }
 if (rowIdProjected) {
-  recordIdColumnVector.fields[0] = vectorizedRowBatchBase.cols[OrcRecordUpdater.ORIGINAL_WRITEID];
+  recordIdColumnVector.fields[0] = vectorizedRowBatchBase.cols[fetchDeletedRows ? OrcRecordUpdater.CURRENT_WRITEID : OrcRecordUpdater.ORIGINAL_WRITEID];

Review comment:
   Would love a comment here explaining the different WRITEID.






Issue Time Tracking
---

Worklog Id: (was: 607859)
Time Spent: 2h 10m  (was: 2h)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607858=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607858
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 13:34
Start Date: 07/Jun/21 13:34
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646592084



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -959,6 +989,20 @@ public boolean next(NullWritable key, VectorizedRowBatch 
value) throws IOExcepti
   int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWID);
   value.cols[ix] = recordIdColumnVector;
 }
+if (rowIsDeletedProjected) {
+  if (fetchDeletedRows) {

Review comment:
   if (!fetchDeletedRows || notDeletedBitSet.cardinality() == vectorizedRowBatchBase.size)
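
   A minimal sketch of the guard suggested above, written outside Hive. The class `RowIsDeletedSketch`, the `fill` method, and the plain `boolean[]` standing in for the row-is-deleted column vector are assumptions for illustration; only the `java.util.BitSet` calls are real JDK API.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative only: the names below are placeholders, not Hive classes.
final class RowIsDeletedSketch {

  // Marks which rows of a batch of 'size' rows are deleted.
  // 'selected' holds the rows visible to the reader; 'notDeleted' is the
  // same set with the deleted rows cleared.
  static void fill(boolean[] isDeleted, int size, BitSet selected,
                   BitSet notDeleted, boolean fetchDeletedRows) {
    Arrays.fill(isDeleted, 0, size, false);
    // Fast path from the review comment: nothing to mark when deleted rows
    // are not requested, or when every row in the batch survived.
    if (!fetchDeletedRows || notDeleted.cardinality() == size) {
      return;
    }
    for (int i = selected.nextSetBit(0); i >= 0 && i < size;
         i = selected.nextSetBit(i + 1)) {
      isDeleted[i] = !notDeleted.get(i);
    }
  }
}
```

   With both conditions folded into one test, the per-row loop runs only for batches that actually contain a deleted row.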






Issue Time Tracking
---

Worklog Id: (was: 607858)
Time Spent: 2h  (was: 1h 50m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607854=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607854
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 13:32
Start Date: 07/Jun/21 13:32
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646590439



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -959,6 +989,20 @@ public boolean next(NullWritable key, VectorizedRowBatch 
value) throws IOExcepti
   int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWID);

Review comment:
   we could probably do the same optimization here






Issue Time Tracking
---

Worklog Id: (was: 607854)
Time Spent: 1h 50m  (was: 1h 40m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607840=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607840
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 12:53
Start Date: 07/Jun/21 12:53
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646558980



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -948,7 +978,7 @@ public boolean next(NullWritable key, VectorizedRowBatch value) throws IOExcepti
   // This loop fills up the selected[] vector with all the index positions that are selected.
   for (int setBitIndex = selectedBitSet.nextSetBit(0), selectedItr = 0;
setBitIndex >= 0;
-   setBitIndex = selectedBitSet.nextSetBit(setBitIndex+1), ++selectedItr) {
+   setBitIndex = selectedBitSet.nextSetBit(setBitIndex + 1), ++selectedItr) {

Review comment:
   change not needed
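
   For readers unfamiliar with the loop in this hunk, here is a small standalone sketch of the same `nextSetBit` iteration. The class `SelectedFillSketch` and the method `fillSelected` are hypothetical names; the loop shape follows the diff context above, and the target array is assumed to be at least as long as the number of set bits.

```java
import java.util.BitSet;

// Illustrative only: a standalone version of the selected[] fill loop.
final class SelectedFillSketch {

  // Copies the index of every set bit into 'selected', in ascending order,
  // and returns how many indexes were written.
  static int fillSelected(BitSet selectedBitSet, int[] selected) {
    int selectedItr = 0;
    for (int setBitIndex = selectedBitSet.nextSetBit(0);
         setBitIndex >= 0;
         setBitIndex = selectedBitSet.nextSetBit(setBitIndex + 1)) {
      selected[selectedItr++] = setBitIndex;
    }
    return selectedItr;
  }
}
```

   `nextSetBit` returns -1 once no further bit is set, which is what ends the loop.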






Issue Time Tracking
---

Worklog Id: (was: 607840)
Time Spent: 1h 40m  (was: 1.5h)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607839=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607839
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 12:52
Start Date: 07/Jun/21 12:52
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646558742



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -932,8 +960,10 @@ public boolean next(NullWritable key, VectorizedRowBatch 
value) throws IOExcepti
 }
 
 // Case 2- find rows which have been deleted.
+BitSet notDeletedBitSet = fetchDeletedRows ? (BitSet) selectedBitSet.clone() : selectedBitSet;

Review comment:
   Let's add a comment above saying when and why we clone the BitSet.
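
   A minimal sketch, not taken from the PR, of why the clone matters. It assumes the delete lookup clears bits in place; `BitSetCloneSketch`, `clearDeleted`, and `notDeletedView` are hypothetical names, and only the ternary mirrors the line under review.

```java
import java.util.BitSet;

// Illustrative only: names are placeholders, not the Hive code base.
final class BitSetCloneSketch {

  // Stand-in for a delete-event lookup: clears the bit of every deleted row
  // in place.
  static void clearDeleted(BitSet rows, int... deletedRowIndexes) {
    for (int ix : deletedRowIndexes) {
      rows.clear(ix);
    }
  }

  // Returns the bit set the delete lookup is allowed to mutate.
  // When deleted rows must still be returned, work on a copy so 'selected'
  // keeps every row; otherwise mutate 'selected' directly and skip the
  // allocation.
  static BitSet notDeletedView(BitSet selected, boolean fetchDeletedRows) {
    return fetchDeletedRows ? (BitSet) selected.clone() : selected;
  }

  public static void main(String[] args) {
    BitSet selected = new BitSet(4);
    selected.set(0, 4); // rows 0..3 are selected

    BitSet notDeleted = notDeletedView(selected, true);
    clearDeleted(notDeleted, 2);

    System.out.println(selected);   // still contains all four rows
    System.out.println(notDeleted); // row 2 has been cleared
  }
}
```

   Without the clone, both names would refer to the same object and the deleted row would also drop out of the selection.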






Issue Time Tracking
---

Worklog Id: (was: 607839)
Time Spent: 1.5h  (was: 1h 20m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607837=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607837
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 12:49
Start Date: 07/Jun/21 12:49
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646556329



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -303,6 +314,12 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, 
OrcSplit orcSplit, Reporte
   VectorizedRowBatch.DEFAULT_SIZE, null, null, null);
 }
 rowIdProjected = areRowIdsProjected(rbCtx);
+rowIsDeletedProjected = isVirtualColumnProjected(rbCtx, 
VirtualColumn.ROWISDELETED);
+if (rowIsDeletedProjected) {
+  rowIsDeletedVector = new RowIsDeletedColumnVector();

Review comment:
   Let's explicitly pass VectorizedRowBatch.DEFAULT_SIZE to make this more 
obvious.






Issue Time Tracking
---

Worklog Id: (was: 607837)
Time Spent: 1h 20m  (was: 1h 10m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607833=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607833
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 12:47
Start Date: 07/Jun/21 12:47
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646554194



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -892,13 +913,20 @@ public boolean next(NullWritable key, VectorizedRowBatch 
value) throws IOExcepti
 } catch (Exception e) {
   throw new IOException("error iterating", e);
 }
-if(!includeAcidColumns) {
+if (!includeAcidColumns) {
   //if here, we don't need to filter anything wrt acid metadata columns
   //in fact, they are not even read from file/llap
   value.size = vectorizedRowBatchBase.size;
   value.selected = vectorizedRowBatchBase.selected;
   value.selectedInUse = vectorizedRowBatchBase.selectedInUse;
   copyFromBase(value);
+
+  if (rowIsDeletedProjected) {
+rowIsDeletedVector.clear();
+int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWISDELETED);

Review comment:
   Why do we have to recompute this for every batch? Let's store it along 
with the rowIsDeletedProjected flag.






Issue Time Tracking
---

Worklog Id: (was: 607833)
Time Spent: 1h  (was: 50m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607834=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607834
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 12:47
Start Date: 07/Jun/21 12:47
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646554194



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -892,13 +913,20 @@ public boolean next(NullWritable key, VectorizedRowBatch 
value) throws IOExcepti
 } catch (Exception e) {
   throw new IOException("error iterating", e);
 }
-if(!includeAcidColumns) {
+if (!includeAcidColumns) {
   //if here, we don't need to filter anything wrt acid metadata columns
   //in fact, they are not even read from file/llap
   value.size = vectorizedRowBatchBase.size;
   value.selected = vectorizedRowBatchBase.selected;
   value.selectedInUse = vectorizedRowBatchBase.selectedInUse;
   copyFromBase(value);
+
+  if (rowIsDeletedProjected) {
+rowIsDeletedVector.clear();
+int ix = rbCtx.findVirtualColumnNum(VirtualColumn.ROWISDELETED);

Review comment:
   Why do we have to recompute **ix** for every batch? Let's store it 
along with the rowIsDeletedProjected flag.
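
   One possible shape for this, as a sketch only: look the index up once in the constructor and reuse it for every batch. `CachedColumnIndexSketch`, the `"rowIsDeleted"` column name, and the `lookups` counter are assumptions for illustration and do not reflect the Hive reader's actual fields.

```java
import java.util.List;

// Illustrative only: a hypothetical reader skeleton, not the Hive class.
final class CachedColumnIndexSketch {
  private final int rowIsDeletedIx;            // -1 when the column is not projected
  private final boolean rowIsDeletedProjected;
  private int lookups = 0;                     // counts searches, for demonstration only

  CachedColumnIndexSketch(List<String> projectedColumns) {
    // Search once, at construction time.
    this.rowIsDeletedIx = findColumn(projectedColumns, "rowIsDeleted");
    this.rowIsDeletedProjected = rowIsDeletedIx >= 0;
  }

  private int findColumn(List<String> columns, String name) {
    lookups++;
    return columns.indexOf(name);
  }

  // Called once per batch: no search, only an array store.
  void next(Object[] batchCols, Object rowIsDeletedVector) {
    if (rowIsDeletedProjected) {
      batchCols[rowIsDeletedIx] = rowIsDeletedVector;
    }
  }

  int lookups() {
    return lookups;
  }
}
```

   The flag and the index are derived from the same lookup, so they cannot drift apart.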






Issue Time Tracking
---

Worklog Id: (was: 607834)
Time Spent: 1h 10m  (was: 1h)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607832=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607832
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 12:43
Start Date: 07/Jun/21 12:43
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646551472



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -303,6 +314,12 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, 
OrcSplit orcSplit, Reporte
   VectorizedRowBatch.DEFAULT_SIZE, null, null, null);
 }
 rowIdProjected = areRowIdsProjected(rbCtx);
+rowIsDeletedProjected = isVirtualColumnProjected(rbCtx, 
VirtualColumn.ROWISDELETED);

Review comment:
   Let's move this to a utility function, like areRowIdsProjected() above.






Issue Time Tracking
---

Worklog Id: (was: 607832)
Time Spent: 50m  (was: 40m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607830=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607830
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 12:41
Start Date: 07/Jun/21 12:41
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646550206



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -281,16 +285,23 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, 
OrcSplit orcSplit, Reporte
 deleteEventReaderOptions.range(0, Long.MAX_VALUE);
 deleteEventReaderOptions.searchArgument(null, null);
 keyInterval = findMinMaxKeys(orcSplit, conf, deleteEventReaderOptions);
+fetchDeletedRows = conf.getBoolean(Constants.ACID_FETCH_DELETED_ROWS, 
false);
 DeleteEventRegistry der;
 try {
   // See if we can load all the relevant delete events from all the
   // delete deltas in memory...
+  ColumnizedDeleteEventRegistry.OriginalWriteIdLoader writeIdLoader;
+  if (fetchDeletedRows) {
+writeIdLoader = new ColumnizedDeleteEventRegistry.BothWriteIdLoader();

Review comment:
   Maybe rename to something more explicit, like 
OriginalAndCurrentWriteIdLoader?
   
   Also, let's add a comment above explaining the logic.






Issue Time Tracking
---

Worklog Id: (was: 607830)
Time Spent: 40m  (was: 0.5h)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607822=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607822
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 12:25
Start Date: 07/Jun/21 12:25
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646538850



##
File path: 
ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedOrcAcidRowBatchReader.java
##
@@ -961,26 +966,41 @@ private void testDeleteEventOriginalFiltering2() throws Exception {
 
   @Test
   public void testVectorizedOrcAcidRowBatchReader() throws Exception {
+setupTestData();
+
+testVectorizedOrcAcidRowBatchReader(ColumnizedDeleteEventRegistry.class.getName());
+
+// To test the SortMergedDeleteEventRegistry, we need to explicitly set the
+// HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY constant to a smaller value.
+int oldValue = conf.getInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, 100);
+conf.setInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, 1000);
+testVectorizedOrcAcidRowBatchReader(SortMergedDeleteEventRegistry.class.getName());
+
+// Restore the old value.
+conf.setInt(HiveConf.ConfVars.HIVE_TRANSACTIONAL_NUM_EVENTS_IN_MEMORY.varname, oldValue);
+  }
+
+  private void setupTestData() throws IOException {
 conf.set("bucket_count", "1");
-  conf.set(ValidTxnList.VALID_TXNS_KEY,
-  new ValidReadTxnList(new long[0], new BitSet(), 1000, Long.MAX_VALUE).writeToString());
+conf.set(ValidTxnList.VALID_TXNS_KEY,
+new ValidReadTxnList(new long[0], new BitSet(), 1000, Long.MAX_VALUE).writeToString());
 
 int bucket = 0;
 AcidOutputFormat.Options options = new AcidOutputFormat.Options(conf)
-.filesystem(fs)
-.bucket(bucket)
-.writingBase(false)
-.minimumWriteId(1)
-.maximumWriteId(NUM_OWID)
-.inspector(inspector)
-.reporter(Reporter.NULL)
-.recordIdColumn(1)
-.finalDestination(root);
+.filesystem(fs)

Review comment:
   nit: revert the whitespace changes






Issue Time Tracking
---

Worklog Id: (was: 607822)
Time Spent: 0.5h  (was: 20m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-06-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=607819=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-607819
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 07/Jun/21 12:22
Start Date: 07/Jun/21 12:22
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #2264:
URL: https://github.com/apache/hive/pull/2264#discussion_r646536788



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -1940,39 +2091,38 @@ public boolean isEmpty() {
 }
 @Override
 public void findDeletedRecords(ColumnVector[] cols, int size, BitSet selectedBitSet) {
-  if (rowIds == null || compressedOwids == null) {
+  if (rowIds == null || writeIds == null || writeIds.isEmpty()) {
 return;
   }
   // Iterate through the batch and for each (owid, rowid) in the batch
   // check if it is deleted or not.
 
   long[] originalWriteIdVector =
-  cols[OrcRecordUpdater.ORIGINAL_WRITEID].isRepeating ? null
-  : ((LongColumnVector) cols[OrcRecordUpdater.ORIGINAL_WRITEID]).vector;
+  cols[OrcRecordUpdater.ORIGINAL_WRITEID].isRepeating ? null

Review comment:
   Let's avoid changing the tabs/spaces below.






Issue Time Tracking
---

Worklog Id: (was: 607819)
Time Spent: 20m  (was: 10m)



[jira] [Work logged] (HIVE-24991) Enable fetching deleted rows in vectorized mode

2021-05-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24991?focusedWorklogId=595270=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595270
 ]

ASF GitHub Bot logged work on HIVE-24991:
-

Author: ASF GitHub Bot
Created on: 12/May/21 12:39
Start Date: 12/May/21 12:39
Worklog Time Spent: 10m 
  Work Description: kasakrisz opened a new pull request #2264:
URL: https://github.com/apache/hive/pull/2264


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 595270)
Remaining Estimate: 0h
Time Spent: 10m
