[jira] [Assigned] (HIVE-18045) can VectorizedOrcAcidRowBatchReader be used all the time

2020-10-20 Thread Saurabh Seth (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-18045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth reassigned HIVE-18045:
---

Assignee: (was: Saurabh Seth)

> can VectorizedOrcAcidRowBatchReader be used all the time
> 
>
> Key: HIVE-18045
> URL: https://issues.apache.org/jira/browse/HIVE-18045
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Eugene Koifman
>Priority: Blocker
>
> Can we use VectorizedOrcAcidRowBatchReader for non-vectorized queries?
> It would just need a wrapper on top of it to turn VRBs into rows.
> This would mean there is just 1 acid reader to maintain - not 2.
> Would this be an issue for sorted reader/SMB support?
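
For illustration, such a wrapper could look roughly like this (a hypothetical, self-contained sketch; BatchSource and the Object[] row type are illustrative stand-ins, not Hive's actual reader API):

{noformat}
// Hypothetical sketch of the "wrapper" idea above: adapt a batch-oriented source
// (producing VectorizedRowBatch-like chunks) into a row-at-a-time reader.
// BatchSource and the Object[] row type are illustrative stand-ins, not Hive classes.
interface BatchSource<B> {
  B nextBatch();                 // null when exhausted
  int size(B batch);             // number of rows in the batch
  Object[] row(B batch, int i);  // materialize row i of the batch
}

final class BatchToRowAdapter<B> {
  private final BatchSource<B> source;
  private B current;
  private int pos;

  BatchToRowAdapter(BatchSource<B> source) { this.source = source; }

  /** Returns the next row, or null once the underlying batches are exhausted. */
  Object[] next() {
    while (current == null || pos >= source.size(current)) {
      current = source.nextBatch();
      pos = 0;
      if (current == null) {
        return null;             // no more batches
      }
    }
    return source.row(current, pos++);
  }
}
{noformat}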



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-20730) Do delete event filtering even if hive.acid.index is not there

2018-11-13 Thread Saurabh Seth (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685782#comment-16685782
 ] 

Saurabh Seth commented on HIVE-20730:
-

[~ekoifman] this is ready to be committed now. I've added the test case I was 
working on.

> Do delete event filtering even if hive.acid.index is not there
> --
>
> Key: HIVE-20730
> URL: https://issues.apache.org/jira/browse/HIVE-20730
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20730.2.patch, HIVE-20730.3.patch, HIVE-20730.patch
>
>
> Since HIVE-16812, {{VectorizedOrcAcidRowBatchReader}} filters delete events 
> based on the min/max ROW__ID in the split, which relies on {{hive.acid.index}} 
> being in the ORC footer.
> There is no way to generate {{hive.acid.index}} from a plain query as in 
> HIVE-20699, so we need to make sure that we generate a SARG on 
> delete_delta/bucket_x based on stripe stats even if the index is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20730) Do delete event filtering even if hive.acid.index is not there

2018-11-09 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20730:

Attachment: HIVE-20730.3.patch
Status: Patch Available  (was: In Progress)

I had accidentally uploaded an incorrect patch.
I've also fixed the TestOrcRawRecordMerger.testNewBase test failure and uploaded 
a new patch.

> Do delete event filtering even if hive.acid.index is not there
> --
>
> Key: HIVE-20730
> URL: https://issues.apache.org/jira/browse/HIVE-20730
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20730.2.patch, HIVE-20730.3.patch, HIVE-20730.patch
>
>
> Since HIVE-16812, {{VectorizedOrcAcidRowBatchReader}} filters delete events 
> based on the min/max ROW__ID in the split, which relies on {{hive.acid.index}} 
> being in the ORC footer.
> There is no way to generate {{hive.acid.index}} from a plain query as in 
> HIVE-20699, so we need to make sure that we generate a SARG on 
> delete_delta/bucket_x based on stripe stats even if the index is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20730) Do delete event filtering even if hive.acid.index is not there

2018-11-09 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20730:

Status: In Progress  (was: Patch Available)

> Do delete event filtering even if hive.acid.index is not there
> --
>
> Key: HIVE-20730
> URL: https://issues.apache.org/jira/browse/HIVE-20730
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20730.2.patch, HIVE-20730.patch
>
>
> Since HIVE-16812, {{VectorizedOrcAcidRowBatchReader}} filters delete events 
> based on the min/max ROW__ID in the split, which relies on {{hive.acid.index}} 
> being in the ORC footer.
> There is no way to generate {{hive.acid.index}} from a plain query as in 
> HIVE-20699, so we need to make sure that we generate a SARG on 
> delete_delta/bucket_x based on stripe stats even if the index is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-18045) can VectorizedOrcAcidRowBatchReader be used all the time

2018-11-09 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-18045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth reassigned HIVE-18045:
---

Assignee: Saurabh Seth

> can VectorizedOrcAcidRowBatchReader be used all the time
> 
>
> Key: HIVE-18045
> URL: https://issues.apache.org/jira/browse/HIVE-18045
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Blocker
>
> Can we use VectorizedOrcAcidRowBatchReader for non-vectorized queries?
> It would just need a wrapper on top of it to turn VRBs into rows.
> This would mean there is just 1 acid reader to maintain - not 2.
> Would this be an issue for sorted reader/SMB support?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20730) Do delete event filtering even if hive.acid.index is not there

2018-11-08 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20730:

Status: In Progress  (was: Patch Available)

> Do delete event filtering even if hive.acid.index is not there
> --
>
> Key: HIVE-20730
> URL: https://issues.apache.org/jira/browse/HIVE-20730
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20730.patch
>
>
> Since HIVE-16812, {{VectorizedOrcAcidRowBatchReader}} filters delete events 
> based on the min/max ROW__ID in the split, which relies on {{hive.acid.index}} 
> being in the ORC footer.
> There is no way to generate {{hive.acid.index}} from a plain query as in 
> HIVE-20699, so we need to make sure that we generate a SARG on 
> delete_delta/bucket_x based on stripe stats even if the index is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20730) Do delete event filtering even if hive.acid.index is not there

2018-11-08 Thread Saurabh Seth (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679464#comment-16679464
 ] 

Saurabh Seth commented on HIVE-20730:
-

[~ekoifman] Not quite yet. I'll test the fix for the NPE with a HiveConf 
property for skipping the index generation and then submit another patch soon.

> Do delete event filtering even if hive.acid.index is not there
> --
>
> Key: HIVE-20730
> URL: https://issues.apache.org/jira/browse/HIVE-20730
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20730.patch
>
>
> Since HIVE-16812, {{VectorizedOrcAcidRowBatchReader}} filters delete events 
> based on the min/max ROW__ID in the split, which relies on {{hive.acid.index}} 
> being in the ORC footer.
> There is no way to generate {{hive.acid.index}} from a plain query as in 
> HIVE-20699, so we need to make sure that we generate a SARG on 
> delete_delta/bucket_x based on stripe stats even if the index is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20730) Do delete event filtering even if hive.acid.index is not there

2018-11-08 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20730:

Attachment: HIVE-20730.2.patch
Status: Patch Available  (was: In Progress)

Added a check in {{OrcRecordUpdater.parseKeyIndex}} to return null if 
{{hive.acid.key.index}} is not available. Also added a unit test.
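
Roughly, the new guard looks like this (an illustrative sketch, not the exact patch; the wrapper method name below is hypothetical, while {{hasMetadataValue}} and {{ACID_KEY_INDEX_NAME}} are the existing ORC/Hive names):

{noformat}
// Illustrative sketch only: if the hive.acid.key.index footer metadata is absent
// (e.g. the file was not written by OrcRecordUpdater), return null instead of
// failing, so callers fall back to stripe-stats based delete filtering.
// parseKeyIndexOrNull is a hypothetical name for the guarded parsing path.
static RecordIdentifier[] parseKeyIndexOrNull(Reader reader) {
  if (!reader.hasMetadataValue(OrcRecordUpdater.ACID_KEY_INDEX_NAME)) {
    return null;  // no index in this file
  }
  // existing parsing of the stored per-stripe last keys
  return OrcRecordUpdater.parseKeyIndex(reader);
}
{noformat}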

> Do delete event filtering even if hive.acid.index is not there
> --
>
> Key: HIVE-20730
> URL: https://issues.apache.org/jira/browse/HIVE-20730
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20730.2.patch, HIVE-20730.patch
>
>
> Since HIVE-16812, {{VectorizedOrcAcidRowBatchReader}} filters delete events 
> based on the min/max ROW__ID in the split, which relies on {{hive.acid.index}} 
> being in the ORC footer.
> There is no way to generate {{hive.acid.index}} from a plain query as in 
> HIVE-20699, so we need to make sure that we generate a SARG on 
> delete_delta/bucket_x based on stripe stats even if the index is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20730) Do delete event filtering even if hive.acid.index is not there

2018-11-06 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20730:

Attachment: HIVE-20730.patch
Status: Patch Available  (was: Open)

I have tweaked {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}} to set a SARG 
on the delete_delta based on the stripe stats when {{hive.acid.key.index}} is 
not present.
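
In outline, the fallback works like this (an illustrative sketch; {{keyIntervalFromStripeStats}} and {{keyIntervalFromKeyIndex}} are hypothetical helper names, not the actual patch):

{noformat}
// Illustrative sketch of the fallback: when the hive.acid.key.index footer entry
// is absent, derive the split's min/max ROW__ID from the ORC stripe statistics of
// the first/last stripes covered by the split, and use that interval to build the
// SearchArgument applied to the delete_delta/bucket_x readers.
OrcRawRecordMerger.KeyInterval keyInterval;
if (keyIndex == null) {
  // no footer index: fall back to the column statistics kept per stripe
  keyInterval = keyIntervalFromStripeStats(
      reader.getStripeStatistics(), firstStripeIndex, lastStripeIndex);
} else {
  keyInterval = keyIntervalFromKeyIndex(keyIndex, firstStripeIndex, lastStripeIndex);
}
// the resulting (minKey, maxKey) pair is then turned into a SARG on the ROW__ID
// columns when reading the delete deltas, same as the existing key-index path
{noformat}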

[~ekoifman], I couldn't add a unit test for this because I don't completely 
understand how the query-based compactor will generate such a file 
(OrcRecordUpdater seems to always write the index). I tested this change by 
ignoring the index present in files written using OrcRecordUpdater. If you have 
any suggestions, please let me know.

> Do delete event filtering even if hive.acid.index is not there
> --
>
> Key: HIVE-20730
> URL: https://issues.apache.org/jira/browse/HIVE-20730
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20730.patch
>
>
> Since HIVE-16812, {{VectorizedOrcAcidRowBatchReader}} filters delete events 
> based on the min/max ROW__ID in the split, which relies on {{hive.acid.index}} 
> being in the ORC footer.
> There is no way to generate {{hive.acid.index}} from a plain query as in 
> HIVE-20699, so we need to make sure that we generate a SARG on 
> delete_delta/bucket_x based on stripe stats even if the index is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-20730) Do delete event filtering even if hive.acid.index is not there

2018-11-06 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth reassigned HIVE-20730:
---

Assignee: Saurabh Seth

> Do delete event filtering even if hive.acid.index is not there
> --
>
> Key: HIVE-20730
> URL: https://issues.apache.org/jira/browse/HIVE-20730
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
>
> Since HIVE-16812, {{VectorizedOrcAcidRowBatchReader}} filters delete events 
> based on the min/max ROW__ID in the split, which relies on {{hive.acid.index}} 
> being in the ORC footer.
> There is no way to generate {{hive.acid.index}} from a plain query as in 
> HIVE-20699, so we need to make sure that we generate a SARG on 
> delete_delta/bucket_x based on stripe stats even if the index is missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20723) Allow per table specification of compaction yarn queue

2018-10-11 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20723:

Attachment: HIVE-20723.patch
Status: Patch Available  (was: Open)

Moved the setting of the queue name to after the table-property overrides have 
been applied.
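
The ordering change, roughly (an illustrative sketch with a hypothetical helper name, not the exact CompactorMR code):

{noformat}
// Illustrative sketch: apply the "compactor."-prefixed table-property overrides to
// the job conf first, and only then resolve the queue name, so that a per-table
// queue override is not clobbered by the global hive.compactor.job.queue value.
// applyCompactorTablePropertyOverrides is a hypothetical helper name.
JobConf job = new JobConf(hiveConf);
applyCompactorTablePropertyOverrides(job, table);
String queueName = job.get("hive.compactor.job.queue", "");
if (!queueName.isEmpty()) {
  job.setQueueName(queueName);
}
{noformat}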

> Allow per table specification of compaction yarn queue
> --
>
> Key: HIVE-20723
> URL: https://issues.apache.org/jira/browse/HIVE-20723
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 2.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20723.patch
>
>
> Currently, compactions of full CRUD transactional tables are Map-Reduce jobs 
> submitted to a YARN queue defined by the hive.compactor.job.queue property.
> It would be useful to be able to override this on a per-table basis by putting 
> it into table properties so that compactions for different tables can use 
> different queues.
>  
> There is already the ability to override other compaction-related configs via 
> table props, though this will need additional handling to set the queue name 
> in {{CompactorMR.createBaseJobConf}}:
> [https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-TableProperties]
>  
> See {{CompactorMR.COMPACTOR_PREFIX}} and 
> {{Initiator.COMPACTORTHRESHOLD_PREFIX}}.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20723) Allow per table specification of compaction yarn queue

2018-10-11 Thread Saurabh Seth (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646178#comment-16646178
 ] 

Saurabh Seth commented on HIVE-20723:
-

I can take this up

> Allow per table specification of compaction yarn queue
> --
>
> Key: HIVE-20723
> URL: https://issues.apache.org/jira/browse/HIVE-20723
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 2.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
>
> Currently, compactions of full CRUD transactional tables are Map-Reduce jobs 
> submitted to a YARN queue defined by the hive.compactor.job.queue property.
> It would be useful to be able to override this on a per-table basis by putting 
> it into table properties so that compactions for different tables can use 
> different queues.
>  
> There is already the ability to override other compaction-related configs via 
> table props, though this will need additional handling to set the queue name 
> in {{CompactorMR.createBaseJobConf}}:
> [https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-TableProperties]
>  
> See {{CompactorMR.COMPACTOR_PREFIX}} and 
> {{Initiator.COMPACTORTHRESHOLD_PREFIX}}.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-20723) Allow per table specification of compaction yarn queue

2018-10-11 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth reassigned HIVE-20723:
---

Assignee: Saurabh Seth

> Allow per table specification of compaction yarn queue
> --
>
> Key: HIVE-20723
> URL: https://issues.apache.org/jira/browse/HIVE-20723
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 2.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
>
> Currently, compactions of full CRUD transactional tables are Map-Reduce jobs 
> submitted to a YARN queue defined by the hive.compactor.job.queue property.
> It would be useful to be able to override this on a per-table basis by putting 
> it into table properties so that compactions for different tables can use 
> different queues.
>  
> There is already the ability to override other compaction-related configs via 
> table props, though this will need additional handling to set the queue name 
> in {{CompactorMR.createBaseJobConf}}:
> [https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-TableProperties]
>  
> See {{CompactorMR.COMPACTOR_PREFIX}} and 
> {{Initiator.COMPACTORTHRESHOLD_PREFIX}}.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20694) Additional unit tests for VectorizedOrcAcidRowBatchReader min max key evaluation

2018-10-08 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20694:

Attachment: HIVE-20694.patch
Status: Patch Available  (was: Open)

Added unit tests for the min/max key interval computation for various 
combinations of split and stripe boundaries.

[~ekoifman], could you please review?

> Additional unit tests for VectorizedOrcAcidRowBatchReader min max key 
> evaluation
> 
>
> Key: HIVE-20694
> URL: https://issues.apache.org/jira/browse/HIVE-20694
> Project: Hive
>  Issue Type: Test
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-20694.patch
>
>
> Follow-up to HIVE-20664 and HIVE-20635.
> Additional unit tests for {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}} 
> and {{VectorizedOrcAcidRowBatchReader.findOriginalMinMaxKeys}} related to 
> split and stripe boundaries - particularly the case when a split is 
> completely within an ORC stripe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-06 Thread Saurabh Seth (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640619#comment-16640619
 ] 

Saurabh Seth commented on HIVE-20635:
-

The ASF license warning is in an unrelated file, HiveJdbcImplementor.java. 
Everything else is green.

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20635.2.patch, HIVE-20635.3.patch, HIVE-20635.patch
>
>
> This is a follow-up to HIVE-16812, which adds support for delete event 
> filtering for splits from native ACID files.
> We need to add the same for {{OrcSplit.isOriginal()}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20664) Potential ArrayIndexOutOfBoundsException in VectorizedOrcAcidRowBatchReader.findMinMaxKeys

2018-10-05 Thread Saurabh Seth (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16639808#comment-16639808
 ] 

Saurabh Seth commented on HIVE-20664:
-

Thanks Eugene.

I think it may be better if I open a follow-up JIRA for adding such a test for 
original files in addition to ACID files. I have created HIVE-20694 to track 
this.

> Potential ArrayIndexOutOfBoundsException in 
> VectorizedOrcAcidRowBatchReader.findMinMaxKeys
> --
>
> Key: HIVE-20664
> URL: https://issues.apache.org/jira/browse/HIVE-20664
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-20664.2.patch, HIVE-20664.patch
>
>
> [~ekoifman], could you please confirm if my understanding is correct and if 
> so, review the fix?
> In the method {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}, the code 
> snippet that identifies the first and last stripe indices in the current 
> split could result in an ArrayIndexOutOfBoundsException if a complete split 
> is within the same stripe:
> {noformat}
> for(int i = 0; i < stripes.size(); i++) {
>   StripeInformation stripe = stripes.get(i);
>   long stripeEnd = stripe.getOffset() + stripe.getLength();
>   if(firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
>     firstStripeIndex = i;
>   }
>   if(lastStripeIndex == -1 && splitEnd <= stripeEnd &&
>       stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset() ) {
>     //the last condition is for when both splitStart and splitEnd are in
>     // the same stripe
>     lastStripeIndex = i;
>   }
> }
> {noformat}
> Consider the example where there are 2 stripes - 0-500 and 500-1000 and 
> splitStart is 600 and splitEnd is 800.
> In the first iteration of the loop, stripe.getOffset() is 0 and stripeEnd is 
> 500. In this iteration, neither of the if statement conditions will be met 
> and firstStripeIndex as well as lastStripeIndex remain -1.
> In the second iteration of the loop, stripe.getOffset() is 500 and stripeEnd is 
> 1000. The first if statement condition will not be met in this case because 
> the stripe's offset (500) is not greater than or equal to the splitStart (600). 
> However, in the second if statement, splitEnd (800) is <= stripeEnd(1000) and 
> it will try to compute the last condition 
> {{stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()}}. This 
> will throw an ArrayIndexOutOfBoundsException because firstStripeIndex is 
> still -1.
> I'm not sure if this scenario is possible at all, hence logging this as a low 
> priority issue. Perhaps block based split generation using BISplitStrategy 
> could trigger this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-20694) Additional unit tests for VectorizedOrcAcidRowBatchReader min max key evaluation

2018-10-05 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth reassigned HIVE-20694:
---

Assignee: Saurabh Seth

> Additional unit tests for VectorizedOrcAcidRowBatchReader min max key 
> evaluation
> 
>
> Key: HIVE-20694
> URL: https://issues.apache.org/jira/browse/HIVE-20694
> Project: Hive
>  Issue Type: Test
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
>
> Follow-up to HIVE-20664 and HIVE-20635.
> Additional unit tests for {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}} 
> and {{VectorizedOrcAcidRowBatchReader.findOriginalMinMaxKeys}} related to 
> split and stripe boundaries - particularly the case when a split is 
> completely within an ORC stripe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-05 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20635:

Status: In Progress  (was: Patch Available)

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20635.2.patch, HIVE-20635.patch
>
>
> This is a follow-up to HIVE-16812, which adds support for delete event 
> filtering for splits from native ACID files.
> We need to add the same for {{OrcSplit.isOriginal()}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-05 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20635:

Attachment: HIVE-20635.3.patch
Status: Patch Available  (was: In Progress)

Thanks for the review, Eugene. I made a minor change to the logging so that the 
keyIntervalTmp is always logged.

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20635.2.patch, HIVE-20635.3.patch, HIVE-20635.patch
>
>
> This is a follow-up to HIVE-16812, which adds support for delete event 
> filtering for splits from native ACID files.
> We need to add the same for {{OrcSplit.isOriginal()}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-02 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20635:

Attachment: HIVE-20635.2.patch
Status: Patch Available  (was: In Progress)

Fixed the checkstyle issues.

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20635.2.patch, HIVE-20635.patch
>
>
> This is a follow-up to HIVE-16812, which adds support for delete event 
> filtering for splits from native ACID files.
> We need to add the same for {{OrcSplit.isOriginal()}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-02 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20635:

Status: In Progress  (was: Patch Available)

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20635.patch
>
>
> This is a follow-up to HIVE-16812, which adds support for delete event 
> filtering for splits from native ACID files.
> We need to add the same for {{OrcSplit.isOriginal()}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20664) Potential ArrayIndexOutOfBoundsException in VectorizedOrcAcidRowBatchReader.findMinMaxKeys

2018-10-02 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20664:

Attachment: HIVE-20664.2.patch
Status: Patch Available  (was: In Progress)

I have added a log statement as suggested. Also fixed a couple of issues with 
the patch.

I have tweaked the condition to try to simplify the flow, but 
{{firstStripeIndex > lastStripeIndex}} remains the same. The condition is now:
 {{if (firstStripeIndex > lastStripeIndex || firstStripeIndex == -1)}}

In the example I used in the description, if there's a 3rd stripe in the file, 
1000-1500, the lastStripeIndex will be 1 and the firstStripeIndex will be 2.

So basically the condition is: either the first stripe is found after the last 
stripe (the split is within any of the stripes of the file except the last one) 
or the first stripe is not found at all (the entire split is within the last 
stripe).
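
Roughly, the post-loop check now looks like this (an illustrative sketch reusing the variable names from the snippet in the description, not the exact patch):

{noformat}
// Illustrative sketch: after the loop, this combination of indexes means the
// whole split falls inside a single stripe, so there is no stripe whose min/max
// keys can bound the split; fall back to an unbounded key interval (no delete
// event filtering) instead of dereferencing stripes.get(-1).
if (firstStripeIndex > lastStripeIndex || firstStripeIndex == -1) {
  return new OrcRawRecordMerger.KeyInterval(null, null);
}
{noformat}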

Regarding testing this fix, I tried to create a test case using 
TestInputOutputFormat.MockFileSystem to simulate multiple blocks in a file. We 
could then have a split generated per block, but things get tricky because we 
create an ORC Reader in findMinMaxKeys, which needs an actual file to be present. 
One way is to write an ORC file with multiple stripes, then use MockFileSystem to 
simulate having multiple blocks in it and use that file system to generate the 
splits. I'm sure there is an easier/cleaner way to create such a test case - any 
suggestions?

> Potential ArrayIndexOutOfBoundsException in 
> VectorizedOrcAcidRowBatchReader.findMinMaxKeys
> --
>
> Key: HIVE-20664
> URL: https://issues.apache.org/jira/browse/HIVE-20664
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-20664.2.patch, HIVE-20664.patch
>
>
> [~ekoifman], could you please confirm if my understanding is correct and if 
> so, review the fix?
> In the method {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}, the code 
> snippet that identifies the first and last stripe indices in the current 
> split could result in an ArrayIndexOutOfBoundsException if a complete split 
> is within the same stripe:
> {noformat}
> for(int i = 0; i < stripes.size(); i++) {
>   StripeInformation stripe = stripes.get(i);
>   long stripeEnd = stripe.getOffset() + stripe.getLength();
>   if(firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
>     firstStripeIndex = i;
>   }
>   if(lastStripeIndex == -1 && splitEnd <= stripeEnd &&
>       stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset() ) {
>     //the last condition is for when both splitStart and splitEnd are in
>     // the same stripe
>     lastStripeIndex = i;
>   }
> }
> {noformat}
> Consider the example where there are 2 stripes - 0-500 and 500-1000 and 
> splitStart is 600 and splitEnd is 800.
> In the first iteration of the loop, stripe.getOffset() is 0 and stripeEnd is 
> 500. In this iteration, neither of the if statement conditions will be met 
> and firstStripeIndex as well as lastStripeIndex remain -1.
> In the second iteration of the loop, stripe.getOffset() is 500 and stripeEnd is 
> 1000. The first if statement condition will not be met in this case because 
> the stripe's offset (500) is not greater than or equal to the splitStart (600). 
> However, in the second if statement, splitEnd (800) is <= stripeEnd(1000) and 
> it will try to compute the last condition 
> {{stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()}}. This 
> will throw an ArrayIndexOutOfBoundsException because firstStripeIndex is 
> still -1.
> I'm not sure if this scenario is possible at all, hence logging this as a low 
> priority issue. Perhaps block based split generation using BISplitStrategy 
> could trigger this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20664) Potential ArrayIndexOutOfBoundsException in VectorizedOrcAcidRowBatchReader.findMinMaxKeys

2018-10-02 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20664:

Status: In Progress  (was: Patch Available)

> Potential ArrayIndexOutOfBoundsException in 
> VectorizedOrcAcidRowBatchReader.findMinMaxKeys
> --
>
> Key: HIVE-20664
> URL: https://issues.apache.org/jira/browse/HIVE-20664
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-20664.patch
>
>
> [~ekoifman], could you please confirm if my understanding is correct and if 
> so, review the fix?
> In the method {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}, the code 
> snippet that identifies the first and last stripe indices in the current 
> split could result in an ArrayIndexOutOfBoundsException if a complete split 
> is within the same stripe:
> {noformat}
> for(int i = 0; i < stripes.size(); i++) {
>   StripeInformation stripe = stripes.get(i);
>   long stripeEnd = stripe.getOffset() + stripe.getLength();
>   if(firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
>     firstStripeIndex = i;
>   }
>   if(lastStripeIndex == -1 && splitEnd <= stripeEnd &&
>       stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset() ) {
>     //the last condition is for when both splitStart and splitEnd are in
>     // the same stripe
>     lastStripeIndex = i;
>   }
> }
> {noformat}
> Consider the example where there are 2 stripes - 0-500 and 500-1000 and 
> splitStart is 600 and splitEnd is 800.
> In the first iteration of the loop, stripe.getOffset() is 0 and stripeEnd is 
> 500. In this iteration, neither of the if statement conditions will be met 
> and firstStripeIndex as well as lastStripeIndex remain -1.
> In the second iteration of the loop, stripe.getOffset() is 500 and stripeEnd is 
> 1000. The first if statement condition will not be met in this case because 
> the stripe's offset (500) is not greater than or equal to the splitStart (600). 
> However, in the second if statement, splitEnd (800) is <= stripeEnd(1000) and 
> it will try to compute the last condition 
> {{stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()}}. This 
> will throw an ArrayIndexOutOfBoundsException because firstStripeIndex is 
> still -1.
> I'm not sure if this scenario is possible at all, hence logging this as a low 
> priority issue. Perhaps block based split generation using BISplitStrategy 
> could trigger this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20664) Potential ArrayIndexOutOfBoundsException in VectorizedOrcAcidRowBatchReader.findMinMaxKeys

2018-10-01 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20664:

Attachment: HIVE-20664.patch
Status: Patch Available  (was: Open)

Removed the condition that could cause the exception and added a check after 
the loop to handle the case when a split is completely within an ORC stripe.
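
The reworked selection, in outline (an illustrative sketch based on the snippet in the description, not the exact patch):

{noformat}
// Illustrative sketch: the loop no longer dereferences stripes.get(firstStripeIndex)
// while firstStripeIndex may still be -1; the "split entirely inside one stripe"
// case is detected after the loop instead.
for (int i = 0; i < stripes.size(); i++) {
  StripeInformation stripe = stripes.get(i);
  long stripeEnd = stripe.getOffset() + stripe.getLength();
  if (firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
    firstStripeIndex = i;
  }
  if (lastStripeIndex == -1 && splitEnd <= stripeEnd) {
    lastStripeIndex = i;
  }
}
if (firstStripeIndex == -1 || firstStripeIndex > lastStripeIndex) {
  // the split lies within a single stripe: skip min/max key based filtering
}
{noformat}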

> Potential ArrayIndexOutOfBoundsException in 
> VectorizedOrcAcidRowBatchReader.findMinMaxKeys
> --
>
> Key: HIVE-20664
> URL: https://issues.apache.org/jira/browse/HIVE-20664
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-20664.patch
>
>
> [~ekoifman], could you please confirm if my understanding is correct and if 
> so, review the fix?
> In the method {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}, the code 
> snippet that identifies the first and last stripe indices in the current 
> split could result in an ArrayIndexOutOfBoundsException if a complete split 
> is within the same stripe:
> {noformat}
> for(int i = 0; i < stripes.size(); i++) {
>   StripeInformation stripe = stripes.get(i);
>   long stripeEnd = stripe.getOffset() + stripe.getLength();
>   if(firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
>     firstStripeIndex = i;
>   }
>   if(lastStripeIndex == -1 && splitEnd <= stripeEnd &&
>       stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset() ) {
>     //the last condition is for when both splitStart and splitEnd are in
>     // the same stripe
>     lastStripeIndex = i;
>   }
> }
> {noformat}
> Consider the example where there are 2 stripes - 0-500 and 500-1000 and 
> splitStart is 600 and splitEnd is 800.
> In the first iteration of the loop, stripe.getOffset() is 0 and stripeEnd is 
> 500. In this iteration, neither of the if statement conditions will be met 
> and firstStripeIndex as well as lastStripeIndex remain -1.
> In the second iteration of the loop, stripe.getOffset() is 500 and stripeEnd is 
> 1000. The first if statement condition will not be met in this case because 
> the stripe's offset (500) is not greater than or equal to the splitStart (600). 
> However, in the second if statement, splitEnd (800) is <= stripeEnd(1000) and 
> it will try to compute the last condition 
> {{stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()}}. This 
> will throw an ArrayIndexOutOfBoundsException because firstStripeIndex is 
> still -1.
> I'm not sure if this scenario is possible at all, hence logging this as a low 
> priority issue. Perhaps block based split generation using BISplitStrategy 
> could trigger this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-20664) Potential ArrayIndexOutOfBoundsException in VectorizedOrcAcidRowBatchReader.findMinMaxKeys

2018-10-01 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth reassigned HIVE-20664:
---


> Potential ArrayIndexOutOfBoundsException in 
> VectorizedOrcAcidRowBatchReader.findMinMaxKeys
> --
>
> Key: HIVE-20664
> URL: https://issues.apache.org/jira/browse/HIVE-20664
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
>
> [~ekoifman], could you please confirm if my understanding is correct and if 
> so, review the fix?
> In the method {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}, the code 
> snippet that identifies the first and last stripe indices in the current 
> split could result in an ArrayIndexOutOfBoundsException if a complete split 
> is within the same stripe:
> {noformat}
> for(int i = 0; i < stripes.size(); i++) {
>   StripeInformation stripe = stripes.get(i);
>   long stripeEnd = stripe.getOffset() + stripe.getLength();
>   if(firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
>     firstStripeIndex = i;
>   }
>   if(lastStripeIndex == -1 && splitEnd <= stripeEnd &&
>       stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset() ) {
>     //the last condition is for when both splitStart and splitEnd are in
>     // the same stripe
>     lastStripeIndex = i;
>   }
> }
> {noformat}
> Consider the example where there are 2 stripes - 0-500 and 500-1000 and 
> splitStart is 600 and splitEnd is 800.
> In the first iteration of the loop, stripe.getOffset() is 0 and stripeEnd is 
> 500. In this iteration, neither of the if statement conditions will be met 
> and firstStripeIndex as well as lastStripeIndex remain -1.
> In the second iteration of the loop, stripe.getOffset() is 500 and stripeEnd is 
> 1000. The first if statement condition will not be met in this case because 
> the stripe's offset (500) is not greater than or equal to the splitStart (600). 
> However, in the second if statement, splitEnd (800) is <= stripeEnd(1000) and 
> it will try to compute the last condition 
> {{stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()}}. This 
> will throw an ArrayIndexOutOfBoundsException because firstStripeIndex is 
> still -1.
> I'm not sure if this scenario is possible at all, hence logging this as a low 
> priority issue. Perhaps block based split generation using BISplitStrategy 
> could trigger this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-01 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-20635:

Attachment: HIVE-20635.patch
Status: Patch Available  (was: Open)

Added delete event filtering for original file splits in 
{{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}.

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20635.patch
>
>
> This is a follow-up to HIVE-16812, which adds support for delete event 
> filtering for splits from native ACID files.
> We need to add the same for {{OrcSplit.isOriginal()}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-09-28 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth reassigned HIVE-20635:
---

Assignee: Saurabh Seth

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
>
> This is a follow-up to HIVE-16812, which adds support for delete event 
> filtering for splits from native ACID files.
> We need to add the same for {{OrcSplit.isOriginal()}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17917) VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket optimization

2018-09-27 Thread Saurabh Seth (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630345#comment-16630345
 ] 

Saurabh Seth commented on HIVE-17917:
-

The TestHs2ConnectionMetricsBinary test failure is unrelated, and so are the 
compilation errors, although it seems odd to me that the tests are executed at 
all if the compilation itself is failing.

> VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket optimization
> ---
>
> Key: HIVE-17917
> URL: https://issues.apache.org/jira/browse/HIVE-17917
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-17917.2.patch, HIVE-17917.patch
>
>
> The VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket() computation is 
> currently (after HIVE-17458) done once per split.  It could instead be done 
> once per file (since the result is the same for each split of the same file) 
> and passed along in OrcSplit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17917) VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket optimization

2018-09-26 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-17917:

Attachment: HIVE-17917.2.patch
Status: Patch Available  (was: In Progress)

 

Fixed the checkstyle and findbugs errors. These were pre-existing, but because 
they were on lines where I made minor changes, they were reported as new. They 
were trivial, so I fixed them:

./ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java:266:
 return new OrcSplit.OffsetAndBucketProperty(-1,-1, 
syntheticTxnInfo.syntheticWriteId);:56: warning: ',' is not followed by 
whitespace.

Redundant null check at VectorizedOrcAcidRowBatchReader.java:[line 495]

> VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket optimization
> ---
>
> Key: HIVE-17917
> URL: https://issues.apache.org/jira/browse/HIVE-17917
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-17917.2.patch, HIVE-17917.patch
>
>
> The VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket() computation is 
> currently (after HIVE-17458) done once per split.  It could instead be done 
> once per file (since the result is the same for each split of the same file) 
> and passed along in OrcSplit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17917) VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket optimization

2018-09-26 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-17917:

Status: In Progress  (was: Patch Available)

> VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket optimization
> ---
>
> Key: HIVE-17917
> URL: https://issues.apache.org/jira/browse/HIVE-17917
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-17917.patch
>
>
> The VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket() computation is 
> currently (after HIVE-17458) done once per split.  It could instead be done 
> once per file (since the result is the same for each split of the same file) 
> and passed along in OrcSplit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17917) VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket optimization

2018-09-25 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-17917:

Attachment: HIVE-17917.patch
Status: Patch Available  (was: Open)

Moved the computation of the offset and bucket so that it's done once per file 
when the splits are generated. The result is then passed along in the OrcSplit.

This is done only for vectorized execution mode because the non-vectorized 
readers handle this themselves; perhaps that can be moved here as well.
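
Conceptually, the split-generation side now does something like this (an illustrative sketch; {{computeOffsetAndBucketForFile}}, {{makeOrcSplit}} and {{SplitSpec}} are hypothetical names, while {{OrcSplit.OffsetAndBucketProperty}} is the real carrier type):

{noformat}
// Illustrative sketch: compute the synthetic offset/bucket once per file while
// generating splits, and hand the same OffsetAndBucketProperty to every OrcSplit
// created for that file, instead of recomputing it per split in the reader.
OrcSplit.OffsetAndBucketProperty syntheticProps =
    useVectorizedReader ? computeOffsetAndBucketForFile(fileStatus, conf) : null;
for (SplitSpec spec : splitSpecsForFile) {
  splits.add(makeOrcSplit(spec, syntheticProps));  // reused for every split of the file
}
{noformat}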

> VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket optimization
> ---
>
> Key: HIVE-17917
> URL: https://issues.apache.org/jira/browse/HIVE-17917
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-17917.patch
>
>
> The VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket() computation is 
> currently (after HIVE-17458) done once per split.  It could instead be done 
> once per file (since the result is the same for each split of the same file) 
> and passed along in OrcSplit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-17917) VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket optimization

2018-09-25 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth reassigned HIVE-17917:
---

Assignee: Saurabh Seth

> VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket optimization
> ---
>
> Key: HIVE-17917
> URL: https://issues.apache.org/jira/browse/HIVE-17917
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Minor
>
> The VectorizedOrcAcidRowBatchReader.computeOffsetAndBucket() computation is 
> currently (after HIVE-17458) done once per split.  It could instead be done 
> once per file (since the result is the same for each split of the same file) 
> and passed along in OrcSplit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17921) Aggregation with struct in LLAP produces wrong result

2018-09-04 Thread Saurabh Seth (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603191#comment-16603191
 ] 

Saurabh Seth commented on HIVE-17921:
-

This result is for the query:
{noformat}
select ROW__ID, count(*) from over10k_orc_bucketed group by ROW__ID having 
count(*) > 1;
{noformat}

This table is an ACID table and doesn't actually have any null ROW__IDs. This 
issue was logged because this specific incorrect output was showing up; the 
description has more details as well.

> Aggregation with struct in LLAP produces wrong result
> -
>
> Key: HIVE-17921
> URL: https://issues.apache.org/jira/browse/HIVE-17921
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap, Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Blocker
> Attachments: HIVE-17921.2.patch, HIVE-17921.patch
>
>
> Consider 
> {noformat}
> select ROW__ID, count(*) from over10k_orc_bucketed group by ROW__ID having 
> count(*) > 1;
> {noformat}
>  in acid_vectorization_original.q (available since HIVE-17458).
> When run using TestMiniLlapCliDriver, it produces "NULL, N", where N varies from 
> run to run.
> The right answer is an empty result set, as can be seen by running
> {noformat}
> select ROW__ID, * from over10k_orc_bucketed where ROW__ID is null
> {noformat}
> in the same test.
> This is with 
> {noformat}
> set hive.vectorized.execution.enabled=true;
> set hive.vectorized.row.identifier.enabled=true;
> {noformat}
> It fails with TestMiniLlapCliDriver but not TestMiniTezCliDriver.  See 
> acid_vectorization_original_tez.q, which has an identical query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17921) Aggregation with struct in LLAP produces wrong result

2018-09-04 Thread Saurabh Seth (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603135#comment-16603135
 ] 

Saurabh Seth commented on HIVE-17921:
-

The test failures are unrelated to this patch. This patch fixes an existing 
test result so I haven't added any more tests.

> Aggregation with struct in LLAP produces wrong result
> -
>
> Key: HIVE-17921
> URL: https://issues.apache.org/jira/browse/HIVE-17921
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap, Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Blocker
> Attachments: HIVE-17921.2.patch, HIVE-17921.patch
>
>
> Consider 
> {noformat}
> select ROW__ID, count(*) from over10k_orc_bucketed group by ROW__ID having 
> count(*) > 1;
> {noformat}
>  in acid_vectorization_original.q (available since HIVE-17458).
> When run using TestMiniLlapCliDriver, it produces "NULL, N", where N varies from 
> run to run.
> The right answer is an empty result set, as can be seen by running
> {noformat}
> select ROW__ID, * from over10k_orc_bucketed where ROW__ID is null
> {noformat}
> in the same test.
> This is with 
> {noformat}
> set hive.vectorized.execution.enabled=true;
> set hive.vectorized.row.identifier.enabled=true;
> {noformat}
> It fails with TestMiniLlapCliDriver but not TestMiniTezCliDriver.  See 
> acid_vectorization_original_tez.q, which has an identical query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17921) Aggregation with struct in LLAP produces wrong result

2018-09-04 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-17921:

Attachment: HIVE-17921.2.patch
Status: Patch Available  (was: In Progress)

I have fixed acid_vectorization_original.q.out. I was running the 
acid_vectorization_original.q test with TestMiniLlapCliDriver. I didn't realize 
that the masking is different in TestMiniLlapLocalCliDriver and that both these 
tests use the same results directory.

The TestJdbcWithMiniLlapArrow.testComplexQuery failure seems unrelated to this 
fix. I couldn't reproduce it myself, and the exception and stacktrace in the 
test run log do not seem to be related either.

This patch fixes an existing test result so I haven't added any more tests.

> Aggregation with struct in LLAP produces wrong result
> -
>
> Key: HIVE-17921
> URL: https://issues.apache.org/jira/browse/HIVE-17921
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap, Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Blocker
> Attachments: HIVE-17921.2.patch, HIVE-17921.patch
>
>
> Consider 
> {noformat}
> select ROW__ID, count(*) from over10k_orc_bucketed group by ROW__ID having 
> count(*) > 1;
> {noformat}
>  in acid_vectorization_original.q (available since HIVE-17458).
> When run using TestMiniLlapCliDriver, it produces "NULL, N", where N varies from 
> run to run.
> The right answer is an empty result set, as can be seen by running
> {noformat}
> select ROW__ID, * from over10k_orc_bucketed where ROW__ID is null
> {noformat}
> in the same test.
> This is with 
> {noformat}
> set hive.vectorized.execution.enabled=true;
> set hive.vectorized.row.identifier.enabled=true;
> {noformat}
> It fails with TestMiniLlapCliDriver but not TestMiniTezCliDriver.  See 
> acid_vectorization_original_tez.q, which has an identical query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17921) Aggregation with struct in LLAP produces wrong result

2018-09-04 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-17921:

Status: In Progress  (was: Patch Available)

> Aggregation with struct in LLAP produces wrong result
> -
>
> Key: HIVE-17921
> URL: https://issues.apache.org/jira/browse/HIVE-17921
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap, Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Blocker
> Attachments: HIVE-17921.patch
>
>
> Consider 
> {noformat}
> select ROW__ID, count(*) from over10k_orc_bucketed group by ROW__ID having 
> count(*) > 1;
> {noformat}
>  in acid_vectorization_original.q (available since HIVE-17458).
> When run using TestMiniLlapCliDriver, it produces "NULL, N", where N varies from 
> run to run.
> The right answer is an empty result set, as can be seen by running
> {noformat}
> select ROW__ID, * from over10k_orc_bucketed where ROW__ID is null
> {noformat}
> in the same test.
> This is with 
> {noformat}
> set hive.vectorized.execution.enabled=true;
> set hive.vectorized.row.identifier.enabled=true;
> {noformat}
> It fails with TestMiniLlapCliDriver but not TestMiniTezCliDriver.  See 
> acid_vectorization_original_tez.q, which has an identical query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-17921) Aggregation with struct in LLAP produces wrong result

2018-09-03 Thread Saurabh Seth (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601916#comment-16601916
 ] 

Saurabh Seth edited comment on HIVE-17921 at 9/3/18 9:24 AM:
-

I took a stab at debugging this. OrcStruct.canUseLlapIo thinks vectorization is 
being used when it's not and ends up using LlapRecordReader for the delta 
splits (wrapped in OrcOiBatchToRowReader because LlapInputFormat knows 
vectorization isn't being used). For the original data splits, 
OrcInputFormat.NullKeyRecordReader is used. These 2 RecordReaders create 
OrcStructs with different schemas (through createValue) - OrcOiBatchToRowReader 
adds an extra field for the ROW__ID but NullKeyRecordReader doesn't. Since this 
struct is cached (by MRReaderMapred), depending on which split is first within 
TezGroupedSplit, the cached OrcStruct may or may not have the extra field for 
ROW__ID. In this test case, an original file split is first and hence 
NullKeyRecordReader's OrcStruct is used. When this OrcStruct is given to 
OrcOiBatchToRowReader to fetch values (from the delta splits), it doesn't 
populate the record identifier - neither in the OrcStruct nor in the iocontext 
(in HiveContextAwareRecordReader). So all modified records in the delta splits 
end up having null ROW__IDs.

I have fixed OrcStruct.canUseLlapIo and the patch is attached.

A related question - Should OrcOiBatchToRowReader and NullKeyRecordReader be 
"compatible" and work when they're used from a TezGroupedSplitsRecordReader?


was (Author: saurabh.s...@gmail.com):
I took a stab at debugging this. OrcStruct.canUseLlapIo thinks vectorization is 
being used when it's not and ends up using LlapRecordReader for the delta 
splits (wrapped in OrcOiBatchToRowReader because LlapInputFormat knows 
vectorization isn't being used). For the original data splits, 
OrcInputFormat.NullKeyRecordReader is used. These 2 RecordReaders create 
OrcStructs with different schemas (through createValue) - OrcOiBatchToRowReader 
adds an extra field for the ROW__ID but NullKeyRecordReader doesn't. Since this 
struct is cached (by MRReaderMapred), depending on which split is first within 
TezGroupedSplit, the cached OrcStruct may or may not have the extra field for 
ROW__ID. In this test case, an original file split is first and hence 
NullKeyRecordReader's OrcStruct is used. When this OrcStruct is given to 
OrcOiBatchToRowReader to fetch values (from the delta splits), it doesn't 
populate the record identifier - neither in the OrcStruct nor in the iocontext 
(in HiveContextAwareRecordReader). So all modified records in the delta splits 
end up having null ROW__IDs.

I have fixed OrcStruct.canUseLlapIo and the patch is attached.

A related question - Should OrcOiBatchToRowReader and NullKeyRecordReader be 
"compatible" and work when they're used from a TezGroupedSplitsRecordReader?

> Aggregation with struct in LLAP produces wrong result
> -
>
> Key: HIVE-17921
> URL: https://issues.apache.org/jira/browse/HIVE-17921
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap, Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Blocker
> Attachments: HIVE-17921.patch
>
>
> Consider 
> {noformat}
> select ROW__ID, count(*) from over10k_orc_bucketed group by ROW__ID having 
> count(*) > 1;
> {noformat}
>  in acid_vectorization_original.q (available since HIVE-17458).
> When run using TestMiniLlapCliDriver, it produces "NULL, N", where N varies
> from run to run.
> The right answer is an empty result set, as can be seen by running
> {noformat}
> select ROW__ID, * from over10k_orc_bucketed where ROW__ID is null
> {noformat}
> in the same test.
> This is with 
> {noformat}
> set hive.vectorized.execution.enabled=true;
> set hive.vectorized.row.identifier.enabled=true;
> {noformat}
> It fails with TestMiniLlapCliDriver but not with TestMiniTezCliDriver. See
> acid_vectorization_original_tez.q, which has an identical query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17921) Aggregation with struct in LLAP produces wrong result

2018-09-03 Thread Saurabh Seth (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601923#comment-16601923
 ] 

Saurabh Seth commented on HIVE-17921:
-

I also updated llap/acid_vectorization_original.q.out, which no longer contains
the extra row:

-NULL 6

In addition, the masked patterns seem to have changed since the test results 
were last updated. I have updated those as well so that this test passes.

> Aggregation with struct in LLAP produces wrong result
> -
>
> Key: HIVE-17921
> URL: https://issues.apache.org/jira/browse/HIVE-17921
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap, Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Blocker
> Attachments: HIVE-17921.patch
>
>
> Consider 
> {noformat}
> select ROW__ID, count(*) from over10k_orc_bucketed group by ROW__ID having 
> count(*) > 1;
> {noformat}
>  in acid_vectorization_original.q (available since HIVE-17458).
> When run using TestMiniLlapCliDriver, it produces "NULL, N", where N varies
> from run to run.
> The right answer is an empty result set, as can be seen by running
> {noformat}
> select ROW__ID, * from over10k_orc_bucketed where ROW__ID is null
> {noformat}
> in the same test.
> This is with 
> {noformat}
> set hive.vectorized.execution.enabled=true;
> set hive.vectorized.row.identifier.enabled=true;
> {noformat}
> It fails with TestMiniLlapCliDriver but not with TestMiniTezCliDriver. See
> acid_vectorization_original_tez.q, which has an identical query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17921) Aggregation with struct in LLAP produces wrong result

2018-09-03 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth updated HIVE-17921:

Attachment: HIVE-17921.patch
Status: Patch Available  (was: Open)

I took a stab at debugging this. OrcStruct.canUseLlapIo thinks vectorization is 
being used when it's not and ends up using LlapRecordReader for the delta 
splits (wrapped in OrcOiBatchToRowReader because LlapInputFormat knows 
vectorization isn't being used). For the original data splits, 
OrcInputFormat.NullKeyRecordReader is used. These 2 RecordReaders create 
OrcStructs with different schemas (through createValue) - OrcOiBatchToRowReader 
adds an extra field for the ROW__ID but NullKeyRecordReader doesn't. Since this 
struct is cached (by MRReaderMapred), depending on which split is first within 
TezGroupedSplit, the cached OrcStruct may or may not have the extra field for 
ROW__ID. In this test case, an original file split is first and hence 
NullKeyRecordReader's OrcStruct is used. When this OrcStruct is given to 
OrcOiBatchToRowReader to fetch values (from the delta splits), it doesn't 
populate the record identifier - neither in the OrcStruct nor in the iocontext 
(in HiveContextAwareRecordReader). So all modified records in the delta splits 
end up having null ROW__IDs.

I have fixed OrcStruct.canUseLlapIo and the patch is attached.

A related question - Should OrcOiBatchToRowReader and NullKeyRecordReader be 
"compatible" and work when they're used from a TezGroupedSplitsRecordReader?

> Aggregation with struct in LLAP produces wrong result
> -
>
> Key: HIVE-17921
> URL: https://issues.apache.org/jira/browse/HIVE-17921
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap, Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Blocker
> Attachments: HIVE-17921.patch
>
>
> Consider 
> {noformat}
> select ROW__ID, count(*) from over10k_orc_bucketed group by ROW__ID having 
> count(*) > 1;
> {noformat}
>  in acid_vectorization_original.q (available since HIVE-17458).
> When run using TestMiniLlapCliDriver, it produces "NULL, N", where N varies
> from run to run.
> The right answer is an empty result set, as can be seen by running
> {noformat}
> select ROW__ID, * from over10k_orc_bucketed where ROW__ID is null
> {noformat}
> in the same test.
> This is with 
> {noformat}
> set hive.vectorized.execution.enabled=true;
> set hive.vectorized.row.identifier.enabled=true;
> {noformat}
> It fails with TestMiniLlapCliDriver but not with TestMiniTezCliDriver. See
> acid_vectorization_original_tez.q, which has an identical query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-17921) Aggregation with struct in LLAP produces wrong result

2018-09-03 Thread Saurabh Seth (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Seth reassigned HIVE-17921:
---

Assignee: Saurabh Seth

> Aggregation with struct in LLAP produces wrong result
> -
>
> Key: HIVE-17921
> URL: https://issues.apache.org/jira/browse/HIVE-17921
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap, Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Blocker
>
> Consider 
> {noformat}
> select ROW__ID, count(*) from over10k_orc_bucketed group by ROW__ID having 
> count(*) > 1;
> {noformat}
>  in acid_vectorization_original.q (available since HIVE-17458).
> When run using TestMiniLlapCliDriver, it produces "NULL, N", where N varies
> from run to run.
> The right answer is an empty result set, as can be seen by running
> {noformat}
> select ROW__ID, * from over10k_orc_bucketed where ROW__ID is null
> {noformat}
> in the same test.
> This is with 
> {noformat}
> set hive.vectorized.execution.enabled=true;
> set hive.vectorized.row.identifier.enabled=true;
> {noformat}
> It fails with TestMiniLlapCliDriver but not with TestMiniTezCliDriver. See
> acid_vectorization_original_tez.q, which has an identical query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)