[jira] [Updated] (HIVE-20862) QueryId no longer shows up in the logs

2018-11-08 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20862:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
 Release Note: n/a
   Status: Resolved  (was: Patch Available)

committed to master

thanks Vaibhav for the review

> QueryId no longer shows up in the logs
> --
>
> Key: HIVE-20862
> URL: https://issues.apache.org/jira/browse/HIVE-20862
> Project: Hive
>  Issue Type: Bug
>  Components: Logging
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20862.01.patch
>
>






[jira] [Commented] (HIVE-20887) Tests: openjdk 8 has a bug that prevents surefire from working

2018-11-08 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680188#comment-16680188
 ] 

Eugene Koifman commented on HIVE-20887:
---

Switching to a lower version of surefire, such as 2.19.0, also works.

> Tests: openjdk 8 has a bug that prevents surefire from working
> --
>
> Key: HIVE-20887
> URL: https://issues.apache.org/jira/browse/HIVE-20887
> Project: Hive
>  Issue Type: Bug
>Reporter: Gopal V
>Assignee: Gopal V
>Priority: Major
>
> The problem appears to be 
> https://bugs.openjdk.java.net/browse/JDK-8030046. The failure looks like:
> {code:bash}
> [ERROR] Caused by: 
> org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM 
> terminated without properly saying goodbye. VM crash or System.exit called?
> {code}
> The surefire-reports/*.dumpstream looks like:
> {code:bash}
> Error: Could not find or load main class 
> org.apache.maven.surefire.booter.ForkedBooter
> {code}
>  and we can work around the problem by changing the surefire configuration:
> {code:bash}
> +  <useSystemClassLoader>false</useSystemClassLoader>
> {code}





[jira] [Updated] (HIVE-20823) Make Compactor run in a transaction

2018-11-06 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20823:
--
Attachment: HIVE-20823.04.patch

> Make Compactor run in a transaction
> ---
>
> Key: HIVE-20823
> URL: https://issues.apache.org/jira/browse/HIVE-20823
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-20823.01.patch, HIVE-20823.03.patch, 
> HIVE-20823.04.patch
>
>
> Have compactor open a transaction and run the job in that transaction.
> # make compactor-produced base/delta dirs include this txn id in the folder 
> name, e.g. base_7_c17 where 17 is the txnid (see the sketch after this list)
> # add {{CQ_TXN_ID bigint}} to COMPACTION_QUEUE and COMPLETED_COMPACTIONS to 
> record this txn id
> # make sure {{AcidUtils.getAcidState()}} pays attention to this transaction 
> on read and ignores this dir if this txn id is not committed in the current 
> snapshot
> ## this means that not only validWriteIdList but also ValidTxnIdList should 
> be passed along in the config (if it isn't yet)
> # once this is done, {{CompactorMR.createCompactorMarker()}} can be 
> eliminated and {{AcidUtils.isValidBase}} modified accordingly
> # modify Cleaner so that it doesn't clean old files until the new file is 
> visible to all readers
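A minimal sketch of items 1 and 3 above, with hypothetical names (the real logic would live in {{AcidUtils.getAcidState()}}): a compactor running in txn 17 names its output base_7_c17, and a reader skips that dir unless txn 17 is committed in its snapshot.

{code:java}
import org.apache.hadoop.hive.common.ValidTxnList;

// Hypothetical helper, not actual Hive code: decide whether a
// compactor-produced dir (e.g. base_7_c17) is visible to a reader.
public final class CompactedDirVisibility {
  static boolean isVisible(String dirName, ValidTxnList txnSnapshot) {
    int c = dirName.lastIndexOf("_c");
    if (c < 0) {
      return true; // not compactor-produced; normal write-id rules apply
    }
    long compactorTxnId = Long.parseLong(dirName.substring(c + 2));
    // ignore the dir unless the compactor's txn committed in this snapshot
    return txnSnapshot.isTxnValid(compactorTxnId);
  }
}
{code}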





[jira] [Commented] (HIVE-20581) Eliminate rename() from full CRUD transactional tables

2018-11-06 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677349#comment-16677349
 ] 

Eugene Koifman commented on HIVE-20581:
---

Like Sergey said, this is to make full CRUD behave like Insert-only tables with 
respect to rename().

> Eliminate rename() from full CRUD transactional tables
> --
>
> Key: HIVE-20581
> URL: https://issues.apache.org/jira/browse/HIVE-20581
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Emily Lozano
>Priority: Major
>
> The {{MoveTask}} in a query writing to a full CRUD transactional table still 
> performs a {{FileSystem.rename()}}.  Full CRUD should follow the insert-only 
> transactional table implementation and write directly to delta_x_x in the 
> partition dir (see the sketch below).  If the txn fails, this delta will be 
> marked aborted and will not be read.
> There are several places that rely on this rename.  For example, support for 
> {{Insert ... select ... Union All ... Select }} which creates multiple dirs, 
> one for each leg of the union.
> Others?
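As an illustration of the direct-write layout, here is a hedged sketch (the helper class is hypothetical; the format string mirrors Hive's delta dir naming): each statement writes straight to its own delta_x_x_stmtId under the partition dir, so the legs of a Union All get distinct dirs without any rename.

{code:java}
// Hypothetical sketch: compute the final write path for a statement so data
// lands directly in the partition dir, as insert-only tables already do.
public final class DirectWritePath {
  static String deltaSubdir(long writeId, int stmtId) {
    return String.format("delta_%07d_%07d_%04d", writeId, writeId, stmtId);
  }

  public static void main(String[] args) {
    // two legs of a UNION ALL insert under write id 5 get distinct dirs
    System.out.println(deltaSubdir(5, 0)); // delta_0000005_0000005_0000
    System.out.println(deltaSubdir(5, 1)); // delta_0000005_0000005_0001
  }
}
{code}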





[jira] [Commented] (HIVE-20730) Do delete event filtering even if hive.acid.index is not there

2018-11-06 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677240#comment-16677240
 ] 

Eugene Koifman commented on HIVE-20730:
---

I think {{OrcRecordUpdater.parseKeyIndex(reader)}} will probably NPE if 
{{hive.acid.key.index}} is missing.  It should probably first check whether 
there is anything under the {{hive.acid.key.index}} key (see the sketch below).
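A minimal sketch of the suggested guard, assuming the ORC {{Reader}} metadata API; the wrapper and fallback names are made up, not the actual patch:

{code:java}
import org.apache.hadoop.hive.ql.io.RecordIdentifier;
import org.apache.orc.Reader;

// Hypothetical guard, not the actual fix: return null instead of NPE-ing
// when the writer produced no hive.acid.key.index entry.
public final class KeyIndexGuard {
  static final String ACID_KEY_INDEX_NAME = "hive.acid.key.index";

  static RecordIdentifier[] parseKeyIndexIfPresent(Reader reader) {
    if (!reader.hasMetadataValue(ACID_KEY_INDEX_NAME)) {
      return null; // caller falls back to stripe-stats based filtering
    }
    return parseKeyIndex(reader); // the existing parser, assumed available
  }

  private static RecordIdentifier[] parseKeyIndex(Reader reader) {
    // stand-in for OrcRecordUpdater.parseKeyIndex
    throw new UnsupportedOperationException();
  }
}
{code}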

You could add a property like {{HiveConf.HIVETESTMODEROLLBACKTXN}} that forces 
{{OrcRecordUpdater}} to skip generating the index.  Aside from this, I can't 
think of a good way to test this until the query-based compactor is ready.  The 
query-based compactor won't be using OrcRecordUpdater - that is what is causing 
this issue.

 

otherwise LGTM

> Do delete event filtering even if hive.acid.index is not there
> --
>
> Key: HIVE-20730
> URL: https://issues.apache.org/jira/browse/HIVE-20730
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20730.patch
>
>
> since HIVE-16812, {{VectorizedOrcAcidRowBatchReader}} filters delete events 
> based on min/max ROW__ID in the split, which relies on {{hive.acid.index}} 
> being present in the ORC footer.  
> There is no way to generate {{hive.acid.index}} from a plain query as in 
> HIVE-20699, so we need to make sure that we generate a SARG into 
> delete_delta/bucket_x based on stripe stats even when the index is missing 
> (a sketch follows).
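For illustration, a hedged sketch of the fallback: build a SARG over the split's write-id range so ORC stripe stats filter delete events even without {{hive.acid.index}}. The column name and class shape are assumptions for the sketch, not the attached patch.

{code:java}
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;

// Hypothetical sketch: restrict a delete_delta/bucket_x read to the split's
// ROW__ID range; ORC evaluates this against stripe stats, so delete events
// outside the range are skipped even when hive.acid.index is absent.
public final class DeleteEventSarg {
  static SearchArgument forWriteIdRange(long minWriteId, long maxWriteId) {
    return SearchArgumentFactory.newBuilder()
        .startAnd()
          .between("originalTransaction", PredicateLeaf.Type.LONG,
                   minWriteId, maxWriteId)
        .end()
        .build();
  }
}
{code}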





[jira] [Updated] (HIVE-20874) Add ability to run high-priority compaction

2018-11-06 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20874:
--
Description: 
Currently all compaction requests (submitted via the Alter Table command or 
auto-initiated by {{Initiator.java}}) land in a queue (the {{COMPACTION_QUEUE}} 
metastore DB table) and are executed in order.

If the queue is long and some table/partition needs to be compacted urgently, 
there is no way to move it to the front of the queue.

Need a way to address this.

 

  was:
currently all compaction requests (via Alter Table command or auto initiated 
(\{{Initiator.java}}) land in a queue (\{{COMPACTION_QUEUE}} metastore DB 
table) and are executed in order.

If the queue is long and some table/partition needs to e compacted urgently, 
there is no way to send it to the beginning of the queue.

Need a way to address this.

 


> Add ability to run high-priority compaction
> --
>
> Key: HIVE-20874
> URL: https://issues.apache.org/jira/browse/HIVE-20874
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 1.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> Currently all compaction requests (submitted via the Alter Table command or 
> auto-initiated by {{Initiator.java}}) land in a queue (the 
> {{COMPACTION_QUEUE}} metastore DB table) and are executed in order.
> If the queue is long and some table/partition needs to be compacted urgently, 
> there is no way to move it to the front of the queue.
> Need a way to address this.
>  





[jira] [Updated] (HIVE-20863) remove dead code

2018-11-05 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20863:
--
Status: Open  (was: Patch Available)

> remove dead code
> 
>
> Key: HIVE-20863
> URL: https://issues.apache.org/jira/browse/HIVE-20863
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Minor
> Attachments: HIVE-20863.01.patch
>
>






[jira] [Commented] (HIVE-20862) QueryId no longer shows up in the logs

2018-11-05 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675580#comment-16675580
 ] 

Eugene Koifman commented on HIVE-20862:
---

no related failures

 

> QueryId no longer shows up in the logs
> --
>
> Key: HIVE-20862
> URL: https://issues.apache.org/jira/browse/HIVE-20862
> Project: Hive
>  Issue Type: Bug
>  Components: Logging
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-20862.01.patch
>
>






[jira] [Commented] (HIVE-20863) remove dead code

2018-11-02 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673874#comment-16673874
 ] 

Eugene Koifman commented on HIVE-20863:
---

[~vgumashta] could you review please

> remove dead code
> 
>
> Key: HIVE-20863
> URL: https://issues.apache.org/jira/browse/HIVE-20863
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Minor
> Attachments: HIVE-20863.01.patch
>
>






[jira] [Commented] (HIVE-20862) QueryId no longer shows up in the logs

2018-11-02 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673875#comment-16673875
 ] 

Eugene Koifman commented on HIVE-20862:
---

[~vgumashta] could you review please

> QueryId no longer shows up in the logs
> --
>
> Key: HIVE-20862
> URL: https://issues.apache.org/jira/browse/HIVE-20862
> Project: Hive
>  Issue Type: Bug
>  Components: Logging
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-20862.01.patch
>
>






[jira] [Updated] (HIVE-20823) Make Compactor run in a transaction

2018-11-02 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20823:
--
Attachment: HIVE-20823.03.patch

> Make Compactor run in a transaction
> ---
>
> Key: HIVE-20823
> URL: https://issues.apache.org/jira/browse/HIVE-20823
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-20823.01.patch, HIVE-20823.03.patch
>
>
> Have compactor open a transaction and run the job in that transaction.
> # make compactor-produced base/delta dirs include this txn id in the folder 
> name, e.g. base_7_c17 where 17 is the txnid
> # add {{CQ_TXN_ID bigint}} to COMPACTION_QUEUE and COMPLETED_COMPACTIONS to 
> record this txn id
> # make sure {{AcidUtils.getAcidState()}} pays attention to this transaction 
> on read and ignores this dir if this txn id is not committed in the current 
> snapshot
> ## this means that not only validWriteIdList but also ValidTxnIdList should 
> be passed along in the config (if it isn't yet)
> # once this is done, {{CompactorMR.createCompactorMarker()}} can be 
> eliminated and {{AcidUtils.isValidBase}} modified accordingly
> # modify Cleaner so that it doesn't clean old files until the new file is 
> visible to all readers





[jira] [Updated] (HIVE-20863) remove dead code

2018-11-02 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20863:
--
Attachment: HIVE-20863.01.patch

> remove dead code
> 
>
> Key: HIVE-20863
> URL: https://issues.apache.org/jira/browse/HIVE-20863
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Minor
> Attachments: HIVE-20863.01.patch
>
>






[jira] [Updated] (HIVE-20863) remove dead code

2018-11-02 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20863:
--
Status: Patch Available  (was: Open)

> remove dead code
> 
>
> Key: HIVE-20863
> URL: https://issues.apache.org/jira/browse/HIVE-20863
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Minor
> Attachments: HIVE-20863.01.patch
>
>






[jira] [Updated] (HIVE-20862) QueryId no longer shows up in the logs

2018-11-02 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20862:
--
Status: Patch Available  (was: Open)

> QueryId no longer shows up in the logs
> --
>
> Key: HIVE-20862
> URL: https://issues.apache.org/jira/browse/HIVE-20862
> Project: Hive
>  Issue Type: Bug
>  Components: Logging
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-20862.01.patch
>
>






[jira] [Updated] (HIVE-20862) QueryId no longer shows up in the logs

2018-11-02 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20862:
--
Attachment: HIVE-20862.01.patch

> QueryId no longer shows up in the logs
> --
>
> Key: HIVE-20862
> URL: https://issues.apache.org/jira/browse/HIVE-20862
> Project: Hive
>  Issue Type: Bug
>  Components: Logging
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-20862.01.patch
>
>






[jira] [Assigned] (HIVE-20862) QueryId no longer shows up in the logs

2018-11-02 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-20862:
-


> QueryId no longer shows up in the logs
> --
>
> Key: HIVE-20862
> URL: https://issues.apache.org/jira/browse/HIVE-20862
> Project: Hive
>  Issue Type: Bug
>  Components: Logging
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>






[jira] [Assigned] (HIVE-20863) remove dead code

2018-11-02 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-20863:
-


> remove dead code
> 
>
> Key: HIVE-20863
> URL: https://issues.apache.org/jira/browse/HIVE-20863
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Minor
>






[jira] [Commented] (HIVE-18772) Make Acid Cleaner use MIN_HISTORY_LEVEL

2018-11-01 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672392#comment-16672392
 ] 

Eugene Koifman commented on HIVE-18772:
---

Patch 4 is a rebase of patch 3.

> Make Acid Cleaner use MIN_HISTORY_LEVEL
> ---
>
> Key: HIVE-18772
> URL: https://issues.apache.org/jira/browse/HIVE-18772
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-18772.01.patch, HIVE-18772.02.patch, 
> HIVE-18772.02.patch, HIVE-18772.03.patch, HIVE-18772.04.patch
>
>
> Instead of using Lock Manager state as it currently does.
> This will eliminate possible race conditions
> See this 
> [comment|https://issues.apache.org/jira/browse/HIVE-18192?focusedCommentId=16338208&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16338208]
> Suppose A is the set of all ValidTxnList across all active readers.  Each 
> ValidTxnList has minOpenTxnId.
> MIN_HISTORY_LEVEL allows us to determine X = min(minOpenTxnId) across all 
> currently active readers.
> This means that no active transaction in the system sees any txn with 
> txnid < X as open.
> This means that if we construct a ValidTxnIdList with HWM=X-1 and use that 
> in getAcidState(), any files determined by this call to be 'obsolete' will 
> be seen as obsolete by any existing/future reader, i.e. they can be 
> physically deleted.
> This is also necessary for multi-statement transactions where relying on the 
> state of the Lock Manager is not sufficient.  For example:
> Suppose txn 17 starts at t1 and sees txnid 13 with writeID 13 open.
> 13 commits (via its parent txn) at t2 > t1 (17 is still running).
> Compaction runs at t3 > t2 to produce base_14 (or delta_10_14, for example) 
> on Table1/Part1 (17 is still running).
> Now delta_13 may be cleaned since it can be seen as obsolete and there may 
> be no locks on it, i.e. no one is reading it.
> Now at t4 > t3, txn 17 (a multi-statement txn) may need to read Table1/Part1. 
> It cannot use base_14 as that may have absorbed delete events from 
> delete_delta_14.
> Another use case:
> There are delta_1_1 and delta_2_2 on disk, both created by committed txns.
> T5 starts reading these.  At the same time the compactor creates delta_1_2.
> Now the Cleaner sees delta_1_1 and delta_2_2 as obsolete and may remove them 
> while the read is still in progress.  This is because the Compactor itself 
> is not running in a txn and the files that it produces are visible 
> immediately.  If it ran in a txn, the new files would only be visible once 
> this txn is visible to others (including the Cleaner).
> Using MIN_HISTORY_LEVEL solves this (a sketch follows).
> See the description of HIVE-18747 for more details on MIN_HISTORY_LEVEL.
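The arithmetic above can be sketched as follows; {{minOpenTxnIdAcrossReaders}} is a hypothetical stand-in for the value MIN_HISTORY_LEVEL provides, and the four-argument {{ValidReadTxnList}} constructor is an assumption.

{code:java}
import java.util.BitSet;
import org.apache.hadoop.hive.common.ValidReadTxnList;
import org.apache.hadoop.hive.common.ValidTxnList;

// Sketch of the Cleaner-side reasoning, not actual Cleaner code.
public final class CleanerSnapshot {
  static ValidTxnList snapshotForCleaner(long minOpenTxnIdAcrossReaders) {
    long x = minOpenTxnIdAcrossReaders;
    // HWM = X - 1 with no open/aborted exceptions: every txn < X is treated
    // as committed, so files obsolete under this snapshot are obsolete for
    // all current and future readers and can be physically deleted.
    return new ValidReadTxnList(new long[0], new BitSet(), x - 1, Long.MAX_VALUE);
  }
}
{code}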





[jira] [Updated] (HIVE-18772) Make Acid Cleaner use MIN_HISTORY_LEVEL

2018-11-01 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-18772:
--
Attachment: HIVE-18772.04.patch

> Make Acid Cleaner use MIN_HISTORY_LEVEL
> ---
>
> Key: HIVE-18772
> URL: https://issues.apache.org/jira/browse/HIVE-18772
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-18772.01.patch, HIVE-18772.02.patch, 
> HIVE-18772.02.patch, HIVE-18772.03.patch, HIVE-18772.04.patch
>
>
> Instead of using Lock Manager state as it currently does.
> This will eliminate possible race conditions
> See this 
> [comment|https://issues.apache.org/jira/browse/HIVE-18192?focusedCommentId=16338208&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16338208]
> Suppose A is the set of all ValidTxnList across all active readers.  Each 
> ValidTxnList has minOpenTxnId.
> MIN_HISTORY_LEVEL allows us to determine X = min(minOpenTxnId) across all 
> currently active readers.
> This means that no active transaction in the system sees any txn with 
> txnid < X as open.
> This means that if we construct a ValidTxnIdList with HWM=X-1 and use that 
> in getAcidState(), any files determined by this call to be 'obsolete' will 
> be seen as obsolete by any existing/future reader, i.e. they can be 
> physically deleted.
> This is also necessary for multi-statement transactions where relying on the 
> state of the Lock Manager is not sufficient.  For example:
> Suppose txn 17 starts at t1 and sees txnid 13 with writeID 13 open.
> 13 commits (via its parent txn) at t2 > t1 (17 is still running).
> Compaction runs at t3 > t2 to produce base_14 (or delta_10_14, for example) 
> on Table1/Part1 (17 is still running).
> Now delta_13 may be cleaned since it can be seen as obsolete and there may 
> be no locks on it, i.e. no one is reading it.
> Now at t4 > t3, txn 17 (a multi-statement txn) may need to read Table1/Part1. 
> It cannot use base_14 as that may have absorbed delete events from 
> delete_delta_14.
> Another use case:
> There are delta_1_1 and delta_2_2 on disk, both created by committed txns.
> T5 starts reading these.  At the same time the compactor creates delta_1_2.
> Now the Cleaner sees delta_1_1 and delta_2_2 as obsolete and may remove them 
> while the read is still in progress.  This is because the Compactor itself 
> is not running in a txn and the files that it produces are visible 
> immediately.  If it ran in a txn, the new files would only be visible once 
> this txn is visible to others (including the Cleaner).
> Using MIN_HISTORY_LEVEL solves this.
> See the description of HIVE-18747 for more details on MIN_HISTORY_LEVEL.





[jira] [Updated] (HIVE-20859) clean up invocation of Worker/Cleaner/Initiator in test code

2018-11-01 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20859:
--
Priority: Minor  (was: Major)

> clean up invocation of Worker/Cleaner/Initiator in test code
> 
>
> Key: HIVE-20859
> URL: https://issues.apache.org/jira/browse/HIVE-20859
> Project: Hive
>  Issue Type: Improvement
>  Components: Test, Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Priority: Minor
>
> there are many places like {{CompactorTest}} that use
> {code:java|title=CompactorTest.java}
> AtomicBoolean stop = new AtomicBoolean(true);
> Worker t = new Worker();
> t.setThreadId((int) t.getId());
> t.setConf(hiveConf);
> AtomicBoolean looped = new AtomicBoolean();
> t.init(stop, looped);
> t.run();
> {code}
> these should instead be standardized on {{TestTxnCommands2.runWorker()}} 
> (see the sketch below); same for {{Cleaner}} and {{Initiator}}
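For reference, the standardized helper would look roughly like this (a sketch wrapping exactly the boilerplate quoted above; the class name is made up):

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.txn.compactor.Worker;

// Sketch of the shared helper the ticket proposes standardizing on: run a
// single pass of the compactor Worker against the given conf, then stop.
public final class CompactorTestUtil {
  public static void runWorker(HiveConf hiveConf) throws Exception {
    AtomicBoolean stop = new AtomicBoolean(true); // stop after one loop
    Worker t = new Worker();
    t.setThreadId((int) t.getId());
    t.setConf(hiveConf);
    AtomicBoolean looped = new AtomicBoolean();
    t.init(stop, looped);
    t.run();
  }
}
{code}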





[jira] [Updated] (HIVE-20856) ValidReaderWriteIdList() is not valid in most places

2018-11-01 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20856:
--
Description: 
Most of the time it's something like this:
{code:java|title=VectorizedOrcAcidRowBatchReader.SortMergedDeleteEventRegistry}
String txnString = conf.get(ValidWriteIdList.VALID_WRITEIDS_KEY);
this.validWriteIdList = (txnString == null) ? 
   new ValidReaderWriteIdList() : new ValidReaderWriteIdList(txnString);
{code}
or
{code:java|title=OrcInputFormat.Context}
String value = conf.get(ValidWriteIdList.VALID_WRITEIDS_KEY);
writeIdList = value == null ? new ValidReaderWriteIdList() : new 
ValidReaderWriteIdList(value);
{code}

and many others, but {{ValidReaderWriteIdList()}} (the no-arg c'tor) creates a 
write ID list that considers every base/delta valid - this is unlikely to be 
correct for a general read of acid data.

  was:
Most of the time it's something like this:
{code:java}
String txnString = conf.get(ValidWriteIdList.VALID_WRITEIDS_KEY);
this.validWriteIdList = (txnString == null) ? 
   new ValidReaderWriteIdList() : new ValidReaderWriteIdList(txnString);
{code}

but ValidReaderWriteIdList() (no arg c'tor) creates a write ID list that 
considers every base/delta valid - this unlikely to be the correct for a 
general read of acid data.


> ValidReaderWriteIdList() is not valid in most places
> 
>
> Key: HIVE-20856
> URL: https://issues.apache.org/jira/browse/HIVE-20856
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> Most of the time it's something like this:
> {code:java|title=VectorizedOrcAcidRowBatchReader.SortMergedDeleteEventRegistry}
> String txnString = conf.get(ValidWriteIdList.VALID_WRITEIDS_KEY);
> this.validWriteIdList = (txnString == null) ? 
>new ValidReaderWriteIdList() : new ValidReaderWriteIdList(txnString);
> {code}
> or
> {code:java|title=OrcInputFormat.Context}
> String value = conf.get(ValidWriteIdList.VALID_WRITEIDS_KEY);
> writeIdList = value == null ? new ValidReaderWriteIdList() : new 
> ValidReaderWriteIdList(value);
> {code}
> and many others, but {{ValidReaderWriteIdList()}} (the no-arg c'tor) creates 
> a write ID list that considers every base/delta valid - this is unlikely to 
> be correct for a general read of acid data (see the sketch below).
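A hedged sketch of the direction this implies (the exception choice and helper name are assumptions): fail fast instead of silently treating every base/delta as valid when the conf entry is absent.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.common.ValidReaderWriteIdList;
import org.apache.hadoop.hive.common.ValidWriteIdList;

// Sketch only, not a proposed patch: refuse to fabricate an
// everything-is-valid write ID list for a general read.
public final class WriteIdListGuard {
  static ValidWriteIdList requireWriteIdList(Configuration conf) {
    String ids = conf.get(ValidWriteIdList.VALID_WRITEIDS_KEY);
    if (ids == null) {
      throw new IllegalStateException(ValidWriteIdList.VALID_WRITEIDS_KEY
          + " is not set; refusing to treat all write ids as valid");
    }
    return new ValidReaderWriteIdList(ids);
  }
}
{code}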





[jira] [Commented] (HIVE-20852) Compaction Initiator ignores data inserted by Stream Data Ingest

2018-11-01 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671961#comment-16671961
 ] 

Eugene Koifman commented on HIVE-20852:
---

could you provide a more detailed repro case?

> Compaction Initiator ignores data inserted by Stream Data Ingest
> -
>
> Key: HIVE-20852
> URL: https://issues.apache.org/jira/browse/HIVE-20852
> Project: Hive
>  Issue Type: Bug
>  Components: API, Transactions
>Affects Versions: 3.1.0
>Reporter: Kei Miyauchi
>Priority: Major
>
> Hi,
> Before compaction, the Initiator decides whether a table/partition is a 
> potential compaction target by querying COMPLETED_TXN_COMPONENTS.
> But I found that transactions committed by Stream Data Ingest are not stored 
> in COMPLETED_TXN_COMPONENTS. This is because the statement "insert into 
> COMPLETED_TXN_COMPONENTS (ctc_txnid, ctc_database, ctc_table, ctc_partition, 
> ctc_writeid, ctc_update_delete) select tc_txnid, tc_database, tc_table, 
> tc_partition, tc_writeid, 'N' from TXN_COMPONENTS where tc_txnid = (id)"  
> fails.
> I found that the INSERT statement into TXN_COMPONENTS isn't fired; the 
> select subquery above returns 0 rows.





[jira] [Updated] (HIVE-20852) Compaction Initiator ignores data inserted by Stream Data Ingest

2018-11-01 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20852:
--
Component/s: Transactions

> Compaction Initiator ignores data inserted by Stream Data Ingest
> -
>
> Key: HIVE-20852
> URL: https://issues.apache.org/jira/browse/HIVE-20852
> Project: Hive
>  Issue Type: Bug
>  Components: API, Transactions
>Affects Versions: 3.1.0
>Reporter: Kei Miyauchi
>Priority: Major
>
> Hi,
> Before compaction, the Initiator decides whether a table/partition is a 
> potential compaction target by querying COMPLETED_TXN_COMPONENTS.
> But I found that transactions committed by Stream Data Ingest are not stored 
> in COMPLETED_TXN_COMPONENTS. This is because the statement "insert into 
> COMPLETED_TXN_COMPONENTS (ctc_txnid, ctc_database, ctc_table, ctc_partition, 
> ctc_writeid, ctc_update_delete) select tc_txnid, tc_database, tc_table, 
> tc_partition, tc_writeid, 'N' from TXN_COMPONENTS where tc_txnid = (id)"  
> fails.
> I found that the INSERT statement into TXN_COMPONENTS isn't fired; the 
> select subquery above returns 0 rows.





[jira] [Commented] (HIVE-20460) AcidUtils.Directory.getAbortedDirectories() may be missed for full CRUD tables

2018-10-31 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670775#comment-16670775
 ] 

Eugene Koifman commented on HIVE-20460:
---

The system isn't quite ready for this.  Full CRUD still relies on rename().

> AcidUtils.Directory.getAbortedDirectories() may be missed for full CRUD tables
> --
>
> Key: HIVE-20460
> URL: https://issues.apache.org/jira/browse/HIVE-20460
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> {{Directory.getAbortedDirectories()}} lists deltas where all txns in the 
> range are aborted.
> These are then purged by {{Worker}} ({{CompactorMR}}), but only for 
> insert-only tables.
> Full CRUD tables currently rely on {{FileSystem.rename()}} in {{MoveTask}}, 
> so no reader (or the {{Cleaner}}) should ever see a delta where all data is 
> aborted.
> Once rename() is eliminated for full CRUD (just like insert-only) 
> transactional tables, the Cleaner (or Worker) should take care of these.
>  





[jira] [Updated] (HIVE-20823) Make Compactor run in a transaction

2018-10-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20823:
--
Status: Patch Available  (was: Open)

01.patch - see what breaks

> Make Compactor run in a transaction
> ---
>
> Key: HIVE-20823
> URL: https://issues.apache.org/jira/browse/HIVE-20823
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-20823.01.patch
>
>
> Have compactor open a transaction and run the job in that transaction.
> # make compactor-produced base/delta dirs include this txn id in the folder 
> name, e.g. base_7_c17 where 17 is the txnid
> # add {{CQ_TXN_ID bigint}} to COMPACTION_QUEUE and COMPLETED_COMPACTIONS to 
> record this txn id
> # make sure {{AcidUtils.getAcidState()}} pays attention to this transaction 
> on read and ignores this dir if this txn id is not committed in the current 
> snapshot
> ## this means that not only validWriteIdList but also ValidTxnIdList should 
> be passed along in the config (if it isn't yet)
> # once this is done, {{CompactorMR.createCompactorMarker()}} can be 
> eliminated and {{AcidUtils.isValidBase}} modified accordingly
> # modify Cleaner so that it doesn't clean old files until the new file is 
> visible to all readers





[jira] [Updated] (HIVE-20823) Make Compactor run in a transaction

2018-10-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20823:
--
Attachment: HIVE-20823.01.patch

> Make Compactor run in a transaction
> ---
>
> Key: HIVE-20823
> URL: https://issues.apache.org/jira/browse/HIVE-20823
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-20823.01.patch
>
>
> Have compactor open a transaction and run the job in that transaction.
> # make compactor-produced base/delta dirs include this txn id in the folder 
> name, e.g. base_7_c17 where 17 is the txnid
> # add {{CQ_TXN_ID bigint}} to COMPACTION_QUEUE and COMPLETED_COMPACTIONS to 
> record this txn id
> # make sure {{AcidUtils.getAcidState()}} pays attention to this transaction 
> on read and ignores this dir if this txn id is not committed in the current 
> snapshot
> ## this means that not only validWriteIdList but also ValidTxnIdList should 
> be passed along in the config (if it isn't yet)
> # once this is done, {{CompactorMR.createCompactorMarker()}} can be 
> eliminated and {{AcidUtils.isValidBase}} modified accordingly
> # modify Cleaner so that it doesn't clean old files until the new file is 
> visible to all readers





[jira] [Assigned] (HIVE-20825) Hive ACID Merge generates invalid ORC files (bucket files 0 or 3 bytes in length) causing the "Not a valid ORC file" error

2018-10-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-20825:
-

Assignee: (was: Eugene Koifman)

> Hive ACID Merge generates invalid ORC files (bucket files 0 or 3 bytes in 
> length) causing the "Not a valid ORC file" error
> --
>
> Key: HIVE-20825
> URL: https://issues.apache.org/jira/browse/HIVE-20825
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, ORC, Transactions
>Affects Versions: 2.2.0, 2.3.1, 2.3.2
> Environment: Hive 2.3.x on Amazon EMR 5.8.0 to 5.18.0
>Reporter: Tom Zeng
>Priority: Major
> Attachments: hive-merge-invalid-orc-repro.hql, 
> hive-merge-invalid-orc-repro.log
>
>
> When using Hive ACID Merge (supported with the ORC format) to update/insert 
> data, bucket files of 0 bytes or 3 bytes (the file content is just the three 
> characters "ORC") are generated during MERGE INTO operations, which finish 
> with no errors. Subsequent queries on the base table then fail with the "Not 
> a valid ORC file" error.
>  
> The following script can be used to reproduce the issue (note that with a 
> small amount of data like this, increasing the number of buckets may make 
> the query work, but with a large data set it fails no matter what the bucket 
> count is):
> set hive.auto.convert.join=false;
>  set hive.enforce.bucketing=true;
>  set hive.exec.dynamic.partition.mode = nonstrict;
>  set hive.support.concurrency=true;
>  set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> drop table if exists mergedelta_txt_1;
>  drop table if exists mergedelta_txt_2;
> CREATE TABLE mergedelta_txt_1 (
>  id_str varchar(12), time_key int, value bigint)
>  PARTITIONED BY (date_key int)
>  ROW FORMAT DELIMITED
>  STORED AS TEXTFILE;
> CREATE TABLE mergedelta_txt_2 (
>  id_str varchar(12), time_key int, value bigint)
>  PARTITIONED BY (date_key int)
>  ROW FORMAT DELIMITED
>  STORED AS TEXTFILE;
> INSERT INTO TABLE mergedelta_txt_1
>  partition(date_key=20170103)
>  VALUES
>  ("AB94LIENR0",46700,12345676836978),
>  ("AB94LIENR1",46825,12345676836978),
>  ("AB94LIENS0",46709,12345676836978),
>  ("AB94LIENS1",46834,12345676836978),
>  ("AB94LIENT0",46709,12345676836978),
>  ("AB94LIENT1",46834,12345676836978),
>  ("AB94LIENU0",46718,12345676836978),
>  ("AB94LIENU1",46844,12345676836978),
>  ("AB94LIENV0",46719,12345676836978),
>  ("AB94LIENV1",46844,12345676836978),
>  ("AB94LIENW0",46728,12345676836978),
>  ("AB94LIENW1",46854,12345676836978),
>  ("AB94LIENX0",46728,12345676836978),
>  ("AB94LIENX1",46854,12345676836978),
>  ("AB94LIENY0",46737,12345676836978),
>  ("AB94LIENY1",46863,12345676836978),
>  ("AB94LIENZ0",46738,12345676836978),
>  ("AB94LIENZ1",46863,12345676836978),
>  ("AB94LIERA0",47176,12345676836982),
>  ("AB94LIERA1",47302,12345676836982);
> INSERT INTO TABLE mergedelta_txt_2
>  partition(date_key=20170103)
>  VALUES 
>  ("AB94LIENT1",46834,12345676836978),
>  ("AB94LIENU0",46718,12345676836978),
>  ("AB94LIENU1",46844,12345676836978),
>  ("AB94LIENV0",46719,12345676836978),
>  ("AB94LIENV1",46844,12345676836978),
>  ("AB94LIENW0",46728,12345676836978),
>  ("AB94LIENW1",46854,12345676836978),
>  ("AB94LIENX0",46728,12345676836978),
>  ("AB94LIENX1",46854,12345676836978),
>  ("AB94LIENY0",46737,12345676836978),
>  ("AB94LIENY1",46863,12345676836978),
>  ("AB94LIENZ0",46738,12345676836978),
>  ("AB94LIENZ1",46863,12345676836978),
>  ("AB94LIERA0",47176,12345676836982),
>  ("AB94LIERA1",47302,12345676836982),
>  ("AB94LIERA2",47418,12345676836982),
>  ("AB94LIERB0",47176,12345676836982),
>  ("AB94LIERB1",47302,12345676836982),
>  ("AB94LIERB2",47418,12345676836982),
>  ("AB94LIERC0",47185,12345676836982);
> DROP TABLE IF EXISTS mergebase_1;
>  CREATE TABLE mergebase_1 (
>  id_str varchar(12) , time_key int , value bigint)
>  PARTITIONED BY (date_key int)
>  CLUSTERED BY (id_str,time_key) INTO 4 BUCKETS
>  STORED AS ORC
>  TBLPROPERTIES (
>  'orc.compress'='SNAPPY',
>  'pk_columns'='id_str,date_key,time_key',
>  'NO_AUTO_COMPACTION'='true',
>  'transactional'='true');
> MERGE INTO mergebase_1 AS base
>  USING (SELECT * 
>  FROM (
>  SELECT id_str ,time_key ,value, date_key, rank() OVER (PARTITION BY 
> id_str,date_key,time_key ORDER BY id_str,date_key,time_key) AS rk 
>  FROM mergedelta_txt_1
>  DISTRIBUTE BY date_key
>  ) rankedtbl 
>  WHERE rankedtbl.rk=1
>  ) AS delta
>  ON delta.id_str=base.id_str AND delta.date_key=base.date_key AND 
> delta.time_key=base.time_key
>  WHEN MATCHED THEN UPDATE SET value=delta.value
>  WHEN NOT MATCHED THEN INSERT VALUES ( delta.id_str , delta.time_key , 
> delta.value, delta.date_key);
> MERGE INTO mergebase_1 AS base
>  USING (SELECT * 
>  FROM (
>  SELECT id_str ,time_key ,value, date_key, 

[jira] [Commented] (HIVE-20825) Hive ACID Merge generates invalid ORC files (bucket files 0 or 3 bytes in length) causing the "Not a valid ORC file" error

2018-10-29 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667760#comment-16667760
 ] 

Eugene Koifman commented on HIVE-20825:
---

Perhaps HIVE-14014 is relevant.
I tried these examples on master and branch-2.2 and I don't see the error 
(count returns 25) nor do I see any empty (3-byte) files.  

> Hive ACID Merge generates invalid ORC files (bucket files 0 or 3 bytes in 
> length) causing the "Not a valid ORC file" error
> --
>
> Key: HIVE-20825
> URL: https://issues.apache.org/jira/browse/HIVE-20825
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, ORC, Transactions
>Affects Versions: 2.2.0, 2.3.1, 2.3.2
> Environment: Hive 2.3.x on Amazon EMR 5.8.0 to 5.18.0
>Reporter: Tom Zeng
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: hive-merge-invalid-orc-repro.hql, 
> hive-merge-invalid-orc-repro.log
>
>
> When using Hive ACID Merge (supported with the ORC format) to update/insert 
> data, bucket files of 0 bytes or 3 bytes (the file content is just the three 
> characters "ORC") are generated during MERGE INTO operations, which finish 
> with no errors. Subsequent queries on the base table then fail with the "Not 
> a valid ORC file" error.
>  
> The following script can be used to reproduce the issue (note that with a 
> small amount of data like this, increasing the number of buckets may make 
> the query work, but with a large data set it fails no matter what the bucket 
> count is):
> set hive.auto.convert.join=false;
>  set hive.enforce.bucketing=true;
>  set hive.exec.dynamic.partition.mode = nonstrict;
>  set hive.support.concurrency=true;
>  set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> drop table if exists mergedelta_txt_1;
>  drop table if exists mergedelta_txt_2;
> CREATE TABLE mergedelta_txt_1 (
>  id_str varchar(12), time_key int, value bigint)
>  PARTITIONED BY (date_key int)
>  ROW FORMAT DELIMITED
>  STORED AS TEXTFILE;
> CREATE TABLE mergedelta_txt_2 (
>  id_str varchar(12), time_key int, value bigint)
>  PARTITIONED BY (date_key int)
>  ROW FORMAT DELIMITED
>  STORED AS TEXTFILE;
> INSERT INTO TABLE mergedelta_txt_1
>  partition(date_key=20170103)
>  VALUES
>  ("AB94LIENR0",46700,12345676836978),
>  ("AB94LIENR1",46825,12345676836978),
>  ("AB94LIENS0",46709,12345676836978),
>  ("AB94LIENS1",46834,12345676836978),
>  ("AB94LIENT0",46709,12345676836978),
>  ("AB94LIENT1",46834,12345676836978),
>  ("AB94LIENU0",46718,12345676836978),
>  ("AB94LIENU1",46844,12345676836978),
>  ("AB94LIENV0",46719,12345676836978),
>  ("AB94LIENV1",46844,12345676836978),
>  ("AB94LIENW0",46728,12345676836978),
>  ("AB94LIENW1",46854,12345676836978),
>  ("AB94LIENX0",46728,12345676836978),
>  ("AB94LIENX1",46854,12345676836978),
>  ("AB94LIENY0",46737,12345676836978),
>  ("AB94LIENY1",46863,12345676836978),
>  ("AB94LIENZ0",46738,12345676836978),
>  ("AB94LIENZ1",46863,12345676836978),
>  ("AB94LIERA0",47176,12345676836982),
>  ("AB94LIERA1",47302,12345676836982);
> INSERT INTO TABLE mergedelta_txt_2
>  partition(date_key=20170103)
>  VALUES 
>  ("AB94LIENT1",46834,12345676836978),
>  ("AB94LIENU0",46718,12345676836978),
>  ("AB94LIENU1",46844,12345676836978),
>  ("AB94LIENV0",46719,12345676836978),
>  ("AB94LIENV1",46844,12345676836978),
>  ("AB94LIENW0",46728,12345676836978),
>  ("AB94LIENW1",46854,12345676836978),
>  ("AB94LIENX0",46728,12345676836978),
>  ("AB94LIENX1",46854,12345676836978),
>  ("AB94LIENY0",46737,12345676836978),
>  ("AB94LIENY1",46863,12345676836978),
>  ("AB94LIENZ0",46738,12345676836978),
>  ("AB94LIENZ1",46863,12345676836978),
>  ("AB94LIERA0",47176,12345676836982),
>  ("AB94LIERA1",47302,12345676836982),
>  ("AB94LIERA2",47418,12345676836982),
>  ("AB94LIERB0",47176,12345676836982),
>  ("AB94LIERB1",47302,12345676836982),
>  ("AB94LIERB2",47418,12345676836982),
>  ("AB94LIERC0",47185,12345676836982);
> DROP TABLE IF EXISTS mergebase_1;
>  CREATE TABLE mergebase_1 (
>  id_str varchar(12) , time_key int , value bigint)
>  PARTITIONED BY (date_key int)
>  CLUSTERED BY (id_str,time_key) INTO 4 BUCKETS
>  STORED AS ORC
>  TBLPROPERTIES (
>  'orc.compress'='SNAPPY',
>  'pk_columns'='id_str,date_key,time_key',
>  'NO_AUTO_COMPACTION'='true',
>  'transactional'='true');
> MERGE INTO mergebase_1 AS base
>  USING (SELECT * 
>  FROM (
>  SELECT id_str ,time_key ,value, date_key, rank() OVER (PARTITION BY 
> id_str,date_key,time_key ORDER BY id_str,date_key,time_key) AS rk 
>  FROM mergedelta_txt_1
>  DISTRIBUTE BY date_key
>  ) rankedtbl 
>  WHERE rankedtbl.rk=1
>  ) AS delta
>  ON delta.id_str=base.id_str AND delta.date_key=base.date_key AND 
> delta.time_key=base.time_key
>  WHEN MATCHED THEN UPDATE SET value=delta.valu

[jira] [Assigned] (HIVE-20825) Hive ACID Merge generates invalid ORC files (bucket files 0 or 3 bytes in length) causing the "Not a valid ORC file" error

2018-10-27 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-20825:
-

Assignee: Eugene Koifman

> Hive ACID Merge generates invalid ORC files (bucket files 0 or 3 bytes in 
> length) causing the "Not a valid ORC file" error
> --
>
> Key: HIVE-20825
> URL: https://issues.apache.org/jira/browse/HIVE-20825
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, ORC, Transactions
>Affects Versions: 2.2.0, 2.3.1, 2.3.2
> Environment: Hive 2.3.x on Amazon EMR 5.8.0 to 5.18.0
>Reporter: Tom Zeng
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: hive-merge-invalid-orc-repro.hql, 
> hive-merge-invalid-orc-repro.log
>
>
> When using Hive ACID Merge (supported with the ORC format) to update/insert 
> data, bucket files of 0 bytes or 3 bytes (the file content is just the three 
> characters "ORC") are generated during MERGE INTO operations, which finish 
> with no errors. Subsequent queries on the base table will get the "Not a 
> valid ORC file" error.
>  
> The following script can be used to reproduce the issue:
> set hive.auto.convert.join=false;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode = nonstrict;
> set hive.support.concurrency=true;
> set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> drop table if exists mergedelta_txt_1;
> drop table if exists mergedelta_txt_2;
> CREATE TABLE mergedelta_txt_1 (
> id_str varchar(12), time_key int, value bigint)
> PARTITIONED BY (date_key int)
> ROW FORMAT DELIMITED
> STORED AS TEXTFILE;
> CREATE TABLE mergedelta_txt_2 (
> id_str varchar(12), time_key int, value bigint)
> PARTITIONED BY (date_key int)
> ROW FORMAT DELIMITED
> STORED AS TEXTFILE;
> INSERT INTO TABLE mergedelta_txt_1
> partition(date_key=20170103)
> VALUES
>  ("AB94LIENR0",46700,12345676836978),
>  ("AB94LIENR1",46825,12345676836978),
>  ("AB94LIENS0",46709,12345676836978),
>  ("AB94LIENS1",46834,12345676836978),
>  ("AB94LIENT0",46709,12345676836978),
>  ("AB94LIENT1",46834,12345676836978),
>  ("AB94LIENU0",46718,12345676836978),
>  ("AB94LIENU1",46844,12345676836978),
>  ("AB94LIENV0",46719,12345676836978),
>  ("AB94LIENV1",46844,12345676836978),
>  ("AB94LIENW0",46728,12345676836978),
>  ("AB94LIENW1",46854,12345676836978),
>  ("AB94LIENX0",46728,12345676836978),
>  ("AB94LIENX1",46854,12345676836978),
>  ("AB94LIENY0",46737,12345676836978),
>  ("AB94LIENY1",46863,12345676836978),
>  ("AB94LIENZ0",46738,12345676836978),
>  ("AB94LIENZ1",46863,12345676836978),
>  ("AB94LIERA0",47176,12345676836982),
>  ("AB94LIERA1",47302,12345676836982);
> INSERT INTO TABLE mergedelta_txt_2
> partition(date_key=20170103)
> VALUES 
>  ("AB94LIENT1",46834,12345676836978),
>  ("AB94LIENU0",46718,12345676836978),
>  ("AB94LIENU1",46844,12345676836978),
>  ("AB94LIENV0",46719,12345676836978),
>  ("AB94LIENV1",46844,12345676836978),
>  ("AB94LIENW0",46728,12345676836978),
>  ("AB94LIENW1",46854,12345676836978),
>  ("AB94LIENX0",46728,12345676836978),
>  ("AB94LIENX1",46854,12345676836978),
>  ("AB94LIENY0",46737,12345676836978),
>  ("AB94LIENY1",46863,12345676836978),
>  ("AB94LIENZ0",46738,12345676836978),
>  ("AB94LIENZ1",46863,12345676836978),
>  ("AB94LIERA0",47176,12345676836982),
>  ("AB94LIERA1",47302,12345676836982),
>  ("AB94LIERA2",47418,12345676836982),
>  ("AB94LIERB0",47176,12345676836982),
>  ("AB94LIERB1",47302,12345676836982),
>  ("AB94LIERB2",47418,12345676836982),
>  ("AB94LIERC0",47185,12345676836982);
> DROP TABLE IF EXISTS mergebase_1;
> CREATE TABLE mergebase_1 (
> id_str varchar(12) , time_key int , value bigint)
> PARTITIONED BY (date_key int)
> CLUSTERED BY (id_str,time_key) INTO 32 BUCKETS
> STORED AS ORC
> TBLPROPERTIES (
>  'orc.compress'='SNAPPY',
>  'pk_columns'='id_str,date_key,time_key',
>  'NO_AUTO_COMPACTION'='true',
>  'transactional'='true');
> MERGE INTO mergebase_1 AS base
> USING (SELECT * 
>  FROM (
>  SELECT id_str ,time_key ,value, date_key, rank() OVER (PARTITION BY 
> id_str,date_key,time_key ORDER BY id_str,date_key,time_key) AS rk 
>  FROM mergedelta_txt_1
>  DISTRIBUTE BY date_key
>  ) rankedtbl 
>  WHERE rankedtbl.rk=1
> ) AS delta
> ON delta.id_str=base.id_str AND delta.date_key=base.date_key AND 
> delta.time_key=base.time_key
> WHEN MATCHED THEN UPDATE SET value=delta.value
> WHEN NOT MATCHED THEN INSERT VALUES ( delta.id_str , delta.time_key , 
> delta.value, delta.date_key);
> MERGE INTO mergebase_1 AS base
> USING (SELECT * 
>  FROM (
>  SELECT id_str ,time_key ,value, date_key, rank() OVER (PARTITION BY 
> id_str,date_key,time_key ORDER BY id_str,date_key,time_key) AS rk 
>  FROM mergedelta_txt_2
>  DISTRIBUTE BY date_key
>  ) rankedtbl 
>  WHERE rankedtbl.rk=1
> ) AS d

[jira] [Assigned] (HIVE-20823) Make Compactor run in a transaction

2018-10-26 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-20823:
-


> Make Compactor run in a transaction
> ---
>
> Key: HIVE-20823
> URL: https://issues.apache.org/jira/browse/HIVE-20823
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> Have compactor open a transaction and run the job in that transaction.
> # make compactor-produced base/delta dirs include this txn id in the folder 
> name, e.g. base_7_c17 where 17 is the txnid
> # add {{CQ_TXN_ID bigint}} to COMPACTION_QUEUE and COMPLETED_COMPACTIONS to 
> record this txn id
> # make sure {{AcidUtils.getAcidState()}} pays attention to this transaction 
> on read and ignores this dir if this txn id is not committed in the current 
> snapshot
> ## this means that not only validWriteIdList but also ValidTxnIdList should 
> be passed along in the config (if it isn't yet)
> # once this is done, {{CompactorMR.createCompactorMarker()}} can be 
> eliminated and {{AcidUtils.isValidBase}} modified accordingly
> # modify Cleaner so that it doesn't clean old files until the new file is 
> visible to all readers





[jira] [Updated] (HIVE-14516) OrcInputFormat.SplitGenerator.callInternal() can be optimized

2018-10-25 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-14516:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
   Status: Resolved  (was: Patch Available)

committed to master
thanks Igor for the contribution

> OrcInputFormat.SplitGenerator.callInternal() can be optimized
> -
>
> Key: HIVE-14516
> URL: https://issues.apache.org/jira/browse/HIVE-14516
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 2.2.0
>Reporter: Eugene Koifman
>Assignee: Igor Kryvenko
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-14516.01.patch
>
>
> callInternal() has 
> // We can't eliminate stripes if there are deltas because the
> // deltas may change the rows making them match the predicate.
> but in Acid 2.0, the deltas only have delete events, thus eliminating stripes 
> from the "base" of the split should be safe (see the sketch below).
> cc [~gopalv]
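To make the reasoning concrete, a tiny hypothetical predicate (none of these names come from the patch):

{code:java}
// Pre-Acid-2.0 deltas could contain updates that later make base rows match
// a predicate, so stripe elimination was unsafe whenever deltas existed.
// Acid 2.0 deltas hold only delete events, which can only shrink the result,
// so SARG-based stripe elimination on the base remains correct.
public final class StripeElimination {
  static boolean canEliminateStripes(boolean hasSarg, boolean hasDeltas,
                                     boolean deltasAreDeleteOnly) {
    return hasSarg && (!hasDeltas || deltasAreDeleteOnly);
  }
}
{code}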





[jira] [Commented] (HIVE-14516) OrcInputFormat.SplitGenerator.callInternal() can be optimized

2018-10-25 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664066#comment-16664066
 ] 

Eugene Koifman commented on HIVE-14516:
---

+1

> OrcInputFormat.SplitGenerator.callInternal() can be optimized
> -
>
> Key: HIVE-14516
> URL: https://issues.apache.org/jira/browse/HIVE-14516
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 2.2.0
>Reporter: Eugene Koifman
>Assignee: Igor Kryvenko
>Priority: Major
> Attachments: HIVE-14516.01.patch
>
>
> callInternal() has 
> // We can't eliminate stripes if there are deltas because the
> // deltas may change the rows making them match the predicate.
> but in Acid 2.0, the deltas only have delete events, thus eliminating stripes 
> from the "base" of the split should be safe.
> cc [~gopalv]





[jira] [Assigned] (HIVE-20699) Query based compactor for full CRUD Acid tables

2018-10-24 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-20699:
-

Assignee: Vaibhav Gumashta  (was: Eugene Koifman)

> Query based compactor for full CRUD Acid tables
> ---
>
> Key: HIVE-20699
> URL: https://issues.apache.org/jira/browse/HIVE-20699
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Vaibhav Gumashta
>Priority: Major
>
> Currently the Acid compactor is implemented as a generated MR job 
> ({{CompactorMR.java}}).
> It could also be expressed as a Hive query that reads from a given partition 
> and writes data back to the same partition.  This will merge the deltas and 
> 'apply' the delete events.  The simplest approach would be to just use Insert 
> Overwrite, but that would change all ROW__IDs, which we don't want.
> Need to implement this in a way that preserves ROW__IDs and creates a new 
> {{base_x}} directory to handle Major compaction.
> Minor compaction will be investigated separately.





[jira] [Assigned] (HIVE-14516) OrcInputFormat.SplitGenerator.callInternal() can be optimized

2018-10-24 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-14516:
-

Assignee: Igor Kryvenko  (was: Eugene Koifman)

> OrcInputFormat.SplitGenerator.callInternal() can be optimized
> -
>
> Key: HIVE-14516
> URL: https://issues.apache.org/jira/browse/HIVE-14516
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 2.2.0
>Reporter: Eugene Koifman
>Assignee: Igor Kryvenko
>Priority: Major
>
> callInternal() has 
> // We can't eliminate stripes if there are deltas because the
> // deltas may change the rows making them match the predicate.
> but in Acid 2.0, the deltas only have delete events, thus eliminating stripes 
> from the "base" of the split should be safe.
> cc [~gopalv]





[jira] [Commented] (HIVE-18045) can VectorizedOrcAcidRowBatchReader be used all the time

2018-10-22 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-18045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659680#comment-16659680
 ] 

Eugene Koifman commented on HIVE-18045:
---

Make sure VirtualColumns like INPUT__FILE__NAME continue to work.

> can VectorizedOrcAcidRowBatchReader be used all the time
> 
>
> Key: HIVE-18045
> URL: https://issues.apache.org/jira/browse/HIVE-18045
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Blocker
>
> Can we use VectorizedOrcAcidRowBatchReader for non-vectorized queries?
> It would just need a wrapper on top of it to turn VRBs into rows.
> This would mean there is just 1 acid reader to maintain - not 2.
> Would this be an issue for sorted reader/SMB support?
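A minimal sketch of such a wrapper (class name hypothetical): iterate a {{VectorizedRowBatch}} one row at a time, honoring the selected-rows indirection.

{code:java}
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

// Hypothetical adapter: expose a VectorizedRowBatch row by row so a
// non-vectorized operator pipeline can consume the vectorized acid reader.
public final class BatchRowIterator {
  private final VectorizedRowBatch batch;
  private int next = 0;

  public BatchRowIterator(VectorizedRowBatch batch) { this.batch = batch; }

  public boolean hasNext() { return next < batch.size; }

  /** Returns the logical row index to use with each ColumnVector. */
  public int nextRow() {
    int logical = batch.selectedInUse ? batch.selected[next] : next;
    next++;
    return logical;
  }
}
{code}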





[jira] [Updated] (HIVE-18772) Make Acid Cleaner use MIN_HISTORY_LEVEL

2018-10-19 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-18772:
--
Description: 
Instead of using Lock Manager state as it currently does.
This will eliminate possible race conditions

See this 
[comment|https://issues.apache.org/jira/browse/HIVE-18192?focusedCommentId=16338208&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16338208]

Suppose A is the set of all ValidTxnList across all active readers.  Each 
ValidTxnList has minOpenTxnId.
MIN_HISTORY_LEVEL allows us to determine X = min(minOpenTxnId) across all 
currently active readers

This means that no active transaction in the system sees any txn with txnid < X 
as open.
This means that if we construct a ValidTxnIdList with HWM=X-1 and use that in 
getAcidState(), any files determined by this call to be 'obsolete' will be 
seen as obsolete by any existing/future reader, i.e. they can be physically 
deleted.

This is also necessary for multi-statement transactions where relying on the 
state of the Lock Manager is not sufficient.  For example:

Suppose txn 17 starts at t1 and sees txnid 13 with writeID 13 open.
13 commits (via its parent txn) at t2 > t1 (17 is still running).
Compaction runs at t3 > t2 to produce base_14 (or delta_10_14, for example) on 
Table1/Part1 (17 is still running).
Now delta_13 may be cleaned since it can be seen as obsolete and there may be 
no locks on it, i.e. no one is reading it.
Now at t4 > t3, txn 17 (a multi-statement txn) may need to read Table1/Part1. 
It cannot use base_14 as that may have absorbed delete events from 
delete_delta_14.

Another use case:
There are delta_1_1 and delta_2_2 on disk, both created by committed txns.
T5 starts reading these.  At the same time the compactor creates delta_1_2.
Now the Cleaner sees delta_1_1 and delta_2_2 as obsolete and may remove them 
while the read is still in progress.  This is because the Compactor itself is 
not running in a txn and the files that it produces are visible immediately.  
If it ran in a txn, the new files would only be visible once this txn is 
visible to others (including the Cleaner).

Using MIN_HISTORY_LEVEL solves this.

See the description of HIVE-18747 for more details on MIN_HISTORY_LEVEL.


  was:
Instead of using Lock Manager state as it currently does.
This will eliminate possible race conditions

See this 
[comment|https://issues.apache.org/jira/browse/HIVE-18192?focusedCommentId=16338208&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16338208]

Suppose A is the set of all ValidTxnList across all active readers.  Each 
ValidTxnList has minOpenTxnId.
MIN_HISTORY_LEVEL allows us to determine X = min(minOpenTxnId) across all 
currently active readers

This means that no active transaction in the system sees any txn with txnid < X 
as open.
This means that if we construct a ValidTxnIdList with HWM=X-1 and use it in 
getAcidState(), any files this call determines to be 'obsolete' will be seen 
as obsolete by any existing/future reader, i.e. they can be physically deleted.

This is also necessary for multi-statement transactions where relying on the 
state of Lock Manager is not sufficient.  For example

Suppose txn 17 starts at t1 and sees txnid 13 with writeID 13 open.
13 commits (via its parent txn) at t2 > t1.  (17 is still running).
Compaction runs at t3 >t2 to produce base_14 (or delta_10_14 for example) on 
Table1/Part1 (17 is still running)
Now delta_13 may be cleaned since it can be seen as obsolete and there may be 
no locks on it, i.e. no one is reading it.
Now at t4 > t3, txn 17 (a multi-statement txn) may need to read Table1/Part1. 
It cannot use base_14 as that may have absorbed delete events from delete_delta_14.

Using MIN_HISTORY_LEVEL solves this.

See description of HIVE-18747 for more details on MIN_HISTORY_LEVEL
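
A sketch of the high-water-mark computation described above, assuming the
per-reader minOpenTxnId values come from the MIN_HISTORY_LEVEL table; the
names are illustrative:

{code:java}
import java.util.List;

final class CleanerHwmSketch {
  // X = min(minOpenTxnId) across active readers; a ValidTxnIdList built
  // with HWM = X - 1 marks files obsolete only if every existing and
  // future reader also sees them as obsolete, so they can be deleted.
  static long cleanerHighWaterMark(List<Long> minOpenTxnIds, long nextTxnId) {
    long x = minOpenTxnIds.stream()
        .mapToLong(Long::longValue)
        .min()
        .orElse(nextTxnId); // no active readers: all committed work is visible
    return x - 1;
  }
}
{code}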



> Make Acid Cleaner use MIN_HISTORY_LEVEL
> ---
>
> Key: HIVE-18772
> URL: https://issues.apache.org/jira/browse/HIVE-18772
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-18772.01.patch, HIVE-18772.02.patch, 
> HIVE-18772.02.patch, HIVE-18772.03.patch
>
>
> Instead of using Lock Manager state as it currently does.
> This will eliminate possible race conditions
> See this 
> [comment|https://issues.apache.org/jira/browse/HIVE-18192?focusedCommentId=16338208&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16338208]
> Suppose A is the set of all ValidTxnList across all active readers.  Each 
> ValidTxnList has minOpenTxnId.
> MIN_HISTORY_LEVEL allows us to determine X = min(minOpenTxnId) across all 
> currently active readers
> This means that no active transaction in the system sees any txn

[jira] [Commented] (HIVE-17296) Acid tests with multiple splits

2018-10-17 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654224#comment-16654224
 ] 

Eugene Koifman commented on HIVE-17296:
---

see HIVE-20694 and TestVectorizedOrcAcidRowBatchReader

> Acid tests with multiple splits
> ---
>
> Key: HIVE-17296
> URL: https://issues.apache.org/jira/browse/HIVE-17296
> Project: Hive
>  Issue Type: Test
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> Data files in an Acid table are ORC files which may have multiple stripes.
> Such files in base/ or delta/ (and original files from non-acid to acid 
> conversion) are split by OrcInputFormat into multiple (stripe-sized) chunks.
> There is additional logic in OrcRawRecordMerger 
> (discoverKeyBounds/discoverOriginalKeyBounds) that is not tested by any E2E 
> tests since none of them have enough data to generate multiple stripes in a 
> single file.
> testRecordReaderOldBaseAndDelta/testRecordReaderNewBaseAndDelta/testOriginalReaderPair
> in TestOrcRawRecordMerger has some logic to test this but it really needs e2e 
> tests.
> With ORC-228 it will be possible to write such tests.
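
A sketch of a test helper that should yield a multi-stripe ORC file, assuming
the configurable memory management from ORC-228 honors a tiny stripe size;
verify the behavior against the ORC version in use:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

final class MultiStripeOrcSketch {
  // Write 'rows' longs with a very small stripe size so even a small test
  // file is flushed into several stripes.
  static void write(Path path, long rows) throws IOException {
    Configuration conf = new Configuration();
    TypeDescription schema = TypeDescription.fromString("struct<a:bigint>");
    Writer writer = OrcFile.createWriter(path,
        OrcFile.writerOptions(conf).setSchema(schema).stripeSize(64 * 1024));
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector a = (LongColumnVector) batch.cols[0];
    for (long r = 0; r < rows; r++) {
      a.vector[batch.size++] = r;
      if (batch.size == batch.getMaxSize()) {
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size > 0) {
      writer.addRowBatch(batch);
    }
    writer.close();
  }
}
{code}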



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17296) Acid tests with multiple splits

2018-10-17 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-17296:
--
Priority: Major  (was: Blocker)

> Acid tests with multiple splits
> ---
>
> Key: HIVE-17296
> URL: https://issues.apache.org/jira/browse/HIVE-17296
> Project: Hive
>  Issue Type: Test
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> Data files in an Acid table are ORC files which may have multiple stripes.
> Such files in base/ or delta/ (and original files from non-acid to acid 
> conversion) are split by OrcInputFormat into multiple (stripe-sized) chunks.
> There is additional logic in OrcRawRecordMerger 
> (discoverKeyBounds/discoverOriginalKeyBounds) that is not tested by any E2E 
> tests since none of them have enough data to generate multiple stripes in a 
> single file.
> testRecordReaderOldBaseAndDelta/testRecordReaderNewBaseAndDelta/testOriginalReaderPair
> in TestOrcRawRecordMerger has some logic to test this but it really needs e2e 
> tests.
> With ORC-228 it will be possible to write such tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-17231) ColumnizedDeleteEventRegistry.DeleteReaderValue optimization

2018-10-16 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman resolved HIVE-17231.
---
   Resolution: Fixed
Fix Version/s: 4.0.0
 Release Note: n/a

committed to master
thanks Gopal for the review

> ColumnizedDeleteEventRegistry.DeleteReaderValue optimization
> 
>
> Key: HIVE-17231
> URL: https://issues.apache.org/jira/browse/HIVE-17231
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-17231.01.patch, HIVE-17231.02.patch
>
>
>  For unbucketed tables DeleteReaderValue will currently return all delete 
> events.  Once we trust that the N in bucketN for a "base" split is reliable, 
> all delete events not matching N can be skipped.
> This is useful to protect against extreme cases where someone runs an 
> update/delete on a partition that matches 10 billion rows and thus generates 
> very many delete events.
> Since HIVE-19890 all acid data files must have the bucketid/writerid in the 
> file name match the bucketid/writerid in the ROW__ID in the data.
> {{OrcRawRecordMerger.getDeltaFiles()}} should only return files representing 
> the right {{bucket}}
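
A self-contained sketch of the file-name filter, assuming delete delta bucket
files are named bucket_N; the real change would live in
{{OrcRawRecordMerger.getDeltaFiles()}}:

{code:java}
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class DeleteDeltaBucketFilter {
  private static final Pattern BUCKET = Pattern.compile("bucket_(\\d+).*");

  // Keep only delete delta files whose writer id N matches the split's N;
  // valid since HIVE-19890 ties the file name to the ROW__IDs inside it.
  static List<String> forBucket(List<String> deleteDeltaFiles, int writerId) {
    return deleteDeltaFiles.stream().filter(name -> {
      Matcher m = BUCKET.matcher(name);
      return m.matches() && Integer.parseInt(m.group(1)) == writerId;
    }).collect(Collectors.toList());
  }
}
{code}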



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20753) Derby thread interrupt during ptest

2018-10-16 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652021#comment-16652021
 ] 

Eugene Koifman commented on HIVE-20753:
---

Derby can't handle concurrent threads.  In the past I've mostly seen it get 
wedged.
The heartbeater thread is normally started automatically by the client 
(DbTxnManager) to let the system know the txn is alive.
By default, the txn timeout is 5 min and the 1st heartbeat for a given query 
should happen after 2.5 minutes.
So if your query is running longer than that (which is unusual in UTs), it's 
possible that the heartbeater is the 2nd thread...
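
The timing above in one line, assuming the first heartbeat fires at half of
hive.txn.timeout (default 300s); purely illustrative, not the DbTxnManager
internals:

{code:java}
final class HeartbeatTimingSketch {
  // 300,000 ms / 2 = 150,000 ms = 2.5 minutes for the default timeout.
  static long firstHeartbeatMs(long txnTimeoutMs) {
    return txnTimeoutMs / 2;
  }
}
{code}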

> Derby thread interrupt during ptest
> ---
>
> Key: HIVE-20753
> URL: https://issues.apache.org/jira/browse/HIVE-20753
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Priority: Major
> Attachments: derby.log
>
>
> I've had another failed ptest execution. It seems like Derby caught an 
> unexpected interrupt call, which hung the execution; after that 
> nothing happened for about half an hour - after which the batch timeout 
> happened...
> {code}
> Caused by: ERROR XSDG9: Derby thread received an interrupt during a disk I/O 
> operation, please check your application for the source of the interrupt.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   ... 42 more
> {code}
> full stacktrack:
> {code}
> 2018-10-16T06:47:29,355 ERROR [Heartbeater-3] lockmgr.DbTxnManager: Failed 
> trying to heartbeat queryId=null, currentUser: hiveptest (auth:SIMPLE): null
> java.lang.reflect.UndeclaredThrowableException: null
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1700)
>  ~[hadoop-common-3.1.0.jar:?]
>   at 
> org.apache.hadoop.hive.ql.lockmgr.DbTxnManager$Heartbeater.run(DbTxnManager.java:955)
>  [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [?:1.8.0_102]
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) 
> [?:1.8.0_102]
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  [?:1.8.0_102]
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  [?:1.8.0_102]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [?:1.8.0_102]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [?:1.8.0_102]
>   at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
> Caused by: org.apache.hadoop.hive.ql.lockmgr.LockException: Error 
> communicating with the metastore(txnid:15,lockid:0 queryId=null txnid:0)
>   at 
> org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.heartbeat(DbTxnManager.java:590)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.lockmgr.DbTxnManager$Heartbeater.lambda$run$0(DbTxnManager.java:956)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at java.security.AccessController.doPrivileged(Native Method) 
> ~[?:1.8.0_102]
>   at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_102]
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>  ~[hadoop-common-3.1.0.jar:?]
>   ... 8 more
> Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Unable to 
> select from transaction database java.sql.SQLException: Derby thread received 
> an interrupt during a disk I/O operation, please check your application for 
> the source of the interrupt.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.executeLargeUpdate(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.executeUpdate(Unknown 
> Source)
>   at 
> com.zaxxer.hikari.pool.ProxyStatement.executeUpdate(ProxyStatement.java:11

[jira] [Commented] (HIVE-17231) ColumnizedDeleteEventRegistry.DeleteReaderValue optimization

2018-10-16 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651986#comment-16651986
 ] 

Eugene Koifman commented on HIVE-17231:
---

[~gopalv], ping

> ColumnizedDeleteEventRegistry.DeleteReaderValue optimization
> 
>
> Key: HIVE-17231
> URL: https://issues.apache.org/jira/browse/HIVE-17231
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-17231.01.patch, HIVE-17231.02.patch
>
>
>  For unbucketed tables DeleteReaderValue will currently return all delete 
> events.  Once we trust that the N in bucketN for a "base" split is reliable, 
> all delete events not matching N can be skipped.
> This is useful to protect against extreme cases where someone runs an 
> update/delete on a partition that matches 10 billion rows and thus generates 
> very many delete events.
> Since HIVE-19890 all acid data files must have the bucketid/writerid in the 
> file name match the bucketid/writerid in the ROW__ID in the data.
> {{OrcRawRecordMerger.getDeltaFiles()}} should only return files representing 
> the right {{bucket}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20538) Allow to store a key value together with a transaction.

2018-10-15 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20538:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
   Status: Resolved  (was: Patch Available)

committed to master
thanks Jaume for the contribution

> Allow to store a key value together with a transaction.
> ---
>
> Key: HIVE-20538
> URL: https://issues.apache.org/jira/browse/HIVE-20538
> Project: Hive
>  Issue Type: New Feature
>  Components: Standalone Metastore, Transactions
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20538.1.patch, HIVE-20538.1.patch, 
> HIVE-20538.2.patch, HIVE-20538.3.patch, HIVE-20538.4.patch, 
> HIVE-20538.5.patch, HIVE-20538.6.patch, HIVE-20538.7.patch, HIVE-20538.8.patch
>
>
> This can be useful for example to know if a transaction has already happened.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20538) Allow to store a key value together with a transaction.

2018-10-15 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650537#comment-16650537
 ] 

Eugene Koifman commented on HIVE-20538:
---

+1

> Allow to store a key value together with a transaction.
> ---
>
> Key: HIVE-20538
> URL: https://issues.apache.org/jira/browse/HIVE-20538
> Project: Hive
>  Issue Type: New Feature
>  Components: Standalone Metastore, Transactions
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: HIVE-20538.1.patch, HIVE-20538.1.patch, 
> HIVE-20538.2.patch, HIVE-20538.3.patch, HIVE-20538.4.patch, 
> HIVE-20538.5.patch, HIVE-20538.6.patch, HIVE-20538.7.patch, HIVE-20538.8.patch
>
>
> This can be useful for example to know if a transaction has already happened.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20723) Allow per table specification of compaction yarn queue

2018-10-13 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649159#comment-16649159
 ] 

Eugene Koifman commented on HIVE-20723:
---

committed to master
thanks Saurabh for the contribution

> Allow per table specification of compaction yarn queue
> --
>
> Key: HIVE-20723
> URL: https://issues.apache.org/jira/browse/HIVE-20723
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 2.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20723.patch
>
>
> Currently compactions of full CRUD transactional tables are Map-Reduce jobs 
> submitted to a yarn queue defined by the hive.compactor.job.queue property.
> It would be useful to be able to override this on a per-table basis by 
> putting it into table properties so that compactions for different tables 
> can use different queues.
>  
> There is already the ability to override other compaction-related configs 
> via table props, though this will need additional handling to set the queue 
> name in {{CompactorMR.createBaseJobConf}}
> [https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-TableProperties]
>  
> See {{CompactorMR.COMPACTOR_PREFIX}} and 
> {{Initiator.COMPACTORTHRESHOLD_PREFIX}}
>  
>  
>  
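
A sketch of the intended lookup order, with an illustrative property key; the
real key is whatever {{CompactorMR.createBaseJobConf}} ends up reading via the
compactor table-property prefix:

{code:java}
import java.util.Map;

final class CompactionQueueSketch {
  // Per-table table property wins over the global hive.compactor.job.queue.
  // "compactor.job.queue" is an assumed key for illustration only.
  static String resolveQueue(Map<String, String> tableProps, String globalQueue) {
    String perTable = tableProps.get("compactor.job.queue");
    return perTable != null ? perTable : globalQueue;
  }
}
{code}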



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20723) Allow per table specification of compaction yarn queue

2018-10-13 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20723:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
   Status: Resolved  (was: Patch Available)

> Allow per table specification of compaction yarn queue
> --
>
> Key: HIVE-20723
> URL: https://issues.apache.org/jira/browse/HIVE-20723
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 2.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20723.patch
>
>
> Currently compactions of full CRUD transactional tables are Map-Reduce jobs 
> submitted to a yarn queue defined by the hive.compactor.job.queue property.
> It would be useful to be able to override this on a per-table basis by 
> putting it into table properties so that compactions for different tables 
> can use different queues.
>  
> There is already the ability to override other compaction-related configs 
> via table props, though this will need additional handling to set the queue 
> name in {{CompactorMR.createBaseJobConf}}
> [https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-TableProperties]
>  
> See {{CompactorMR.COMPACTOR_PREFIX}} and 
> {{Initiator.COMPACTORTHRESHOLD_PREFIX}}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20719) SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition optimization and vectorization on

2018-10-13 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20719:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
   Status: Resolved  (was: Patch Available)

> SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition 
> optimization and vectorization on
> ---
>
> Key: HIVE-20719
> URL: https://issues.apache.org/jira/browse/HIVE-20719
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Eugene Koifman
>Priority: Major
>  Labels: Branch3Candidate
> Fix For: 4.0.0
>
> Attachments: HIVE-20719.01.patch
>
>
> *Reproducer*
> {code:sql}
>  set hive.optimize.sort.dynamic.partition=true ;
> create table acid_uap(a int, b varchar(128)) partitioned by (ds string) 
> clustered by (a) into 2 buckets stored as orc TBLPROPERTIES 
> ('transactional'='true');
> insert into table acid_uap partition (ds='tomorrow') values (1, 'bah'),(2, 
> 'yah') ;
> insert into table acid_uap partition (ds='today') values (1, 'bah'),(2, 
> 'yah') ;
> select a,b,ds from acid_uap order by a,b;
> update acid_uap set b = 'fred';
> select a,b,ds from acid_uap order by a,b;
> {code}
> *Error*
> {code:java}
> Status: Failed
> Vertex failed, vertexName=Map 1, vertexId=vertex_1539123809352_0001_5_00, 
> diagnostics=[Task failed, taskId=task_1539123809352_0001_5_00_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1539123809352_0001_5_00_00_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: java.io.IOException: java.io.IOException: 
> Corrupted records with different bucket ids from the containing bucket file 
> found! Expected bucket id 0, however found 
> DeleteRecordKey(2,536936448(1.1.0),0).  (OrcSplit 
> [file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delta_002_002_/bucket_0,
>  start=3, length=361, isOriginal=false, fileLength=798, hasFooter=false, 
> hasBase=true, 
> deltas=2],file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delete_delta_003_003_/bucket_0)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: java.io.IOException: 
> java.io.IOException: Corrupted records with different bucket ids from the 
> containing bucket file found! Expected bucket id 0, however found 
> DeleteRecordKey(2,536936448(1.1.0),0).  (OrcSplit 
> [file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delta_002_002_/bucket_0,
>  start=3, length=361, isOriginal=false, fileLength=798, hasFooter=false, 
> hasBase=true, 
> deltas=2],file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delete_delta_003_003_/bucket_0)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecord

[jira] [Updated] (HIVE-17231) ColumnizedDeleteEventRegistry.DeleteReaderValue optimization

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-17231:
--
Status: Open  (was: Patch Available)

cancel patch since it already ran on patch 2

> ColumnizedDeleteEventRegistry.DeleteReaderValue optimization
> 
>
> Key: HIVE-17231
> URL: https://issues.apache.org/jira/browse/HIVE-17231
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-17231.01.patch, HIVE-17231.02.patch
>
>
>  For unbucketed tables DeleteReaderValue will currently return all delete 
> events.  Once we trust that the N in bucketN for a "base" split is reliable, 
> all delete events not matching N can be skipped.
> This is useful to protect against extreme cases where someone runs an 
> update/delete on a partition that matches 10 billion rows and thus generates 
> very many delete events.
> Since HIVE-19890 all acid data files must have the bucketid/writerid in the 
> file name match the bucketid/writerid in the ROW__ID in the data.
> {{OrcRawRecordMerger.getDeltaFiles()}} should only return files representing 
> the right {{bucket}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17231) ColumnizedDeleteEventRegistry.DeleteReaderValue optimization

2018-10-12 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648583#comment-16648583
 ] 

Eugene Koifman commented on HIVE-17231:
---

[~gopalv] could you review please?

> ColumnizedDeleteEventRegistry.DeleteReaderValue optimization
> 
>
> Key: HIVE-17231
> URL: https://issues.apache.org/jira/browse/HIVE-17231
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-17231.01.patch, HIVE-17231.02.patch
>
>
>  For unbucketed tables DeleteReaderValue will currently return all delete 
> events.  Once we trust that the N in bucketN for a "base" split is reliable, 
> all delete events not matching N can be skipped.
> This is useful to protect against extreme cases where someone runs an 
> update/delete on a partition that matches 10 billion rows and thus generates 
> very many delete events.
> Since HIVE-19890 all acid data files must have the bucketid/writerid in the 
> file name match the bucketid/writerid in the ROW__ID in the data.
> {{OrcRawRecordMerger.getDeltaFiles()}} should only return files representing 
> the right {{bucket}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20699) Query based compactor for full CRUD Acid tables

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20699:
--
Description: 
Currently the Acid compactor is implemented as a generated MR job 
({{CompactorMR.java}}).

It could also be expressed as a Hive query that reads from a given partition 
and writes data back to the same partition.  This will merge the deltas and 
'apply' the delete events.  The simplest would be to just use Insert Overwrite 
but that will change all ROW__IDs which we don't want.

Need to implement this in a way that preserves ROW__IDs and creates a new 
{{base_x}} directory to handle Major compaction.

Minor compaction will be investigated separately.



  was:
Currently the Acid compactor is implemented as a generated MR job 
({{CompactorMR}}).

It could also be expressed as a Hive query that reads from a given partition 
and writes data back to the same partition.  This will merge the deltas and 
'apply' the delete events.  The simplest would be to just use Insert Overwrite 
but that will change all ROW__IDs which we don't want.

Need to implement this in a way that preserves ROW__IDs and creates a new 
{{base_x}} directory to handle Major compaction.

Minor compaction will be investigated separately.
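
A self-contained sketch of what "merge the deltas and apply the delete events
while preserving ROW__IDs" means, using simplified types rather than Hive's
readers; a query-based compactor would express the same thing as a Hive job
with a ROW__ID-preserving writer:

{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MajorCompactionSketch {
  // Simplified stand-in for ROW__ID (originalTransaction, bucket, rowId).
  record RowId(long origTxn, int bucket, long rowId) {}

  // base_x content = insert events (in ROW__ID order) minus deleted ROW__IDs;
  // each surviving row keeps its original ROW__ID instead of getting a new one.
  static Map<RowId, Object[]> compact(List<Map.Entry<RowId, Object[]>> inserts,
                                      Set<RowId> deletes) {
    Map<RowId, Object[]> base = new LinkedHashMap<>();
    for (Map.Entry<RowId, Object[]> e : inserts) {
      if (!deletes.contains(e.getKey())) {
        base.put(e.getKey(), e.getValue());
      }
    }
    return base;
  }
}
{code}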




> Query based compactor for full CRUD Acid tables
> ---
>
> Key: HIVE-20699
> URL: https://issues.apache.org/jira/browse/HIVE-20699
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> Currently the Acid compactor is implemented as a generated MR job 
> ({{CompactorMR.java}}).
> It could also be expressed as a Hive query that reads from a given partition 
> and writes data back to the same partition.  This will merge the deltas and 
> 'apply' the delete events.  The simplest would be to just use Insert 
> Overwrite but that will change all ROW__IDs which we don't want.
> Need to implement this in a way that preserves ROW__IDs and creates a new 
> {{base_x}} directory to handle Major compaction.
> Minor compaction will be investigated separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20699) Query based compactor for full CRUD Acid tables

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20699:
--
Description: 
Currently the Acid compactor is implemented as a generated MR job 
({{CompactorMR}}).

It could also be expressed as a Hive query that reads from a given partition 
and writes data back to the same partition.  This will merge the deltas and 
'apply' the delete events.  The simplest would be to just use Insert Overwrite 
but that will change all ROW__IDs which we don't want.

Need to implement this in a way that preserves ROW__IDs and creates a new 
{{base_x}} directory to handle Major compaction.

Minor compaction will be investigated separately.



> Query based compactor for full CRUD Acid tables
> ---
>
> Key: HIVE-20699
> URL: https://issues.apache.org/jira/browse/HIVE-20699
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> Currently the Acid compactor is implemented as a generated MR job 
> ({{CompactorMR}}).
> It could also be expressed as a Hive query that reads from a given partition 
> and writes data back to the same partition.  This will merge the deltas and 
> 'apply' the delete events.  The simplest would be to just use Insert 
> Overwrite but that will change all ROW__IDs which we don't want.
> Need to implement this in a way that preserves ROW__IDs and creates a new 
> {{base_x}} directory to handle Major compaction.
> Minor compaction will be investigated separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-20579) VectorizedOrcAcidRowBatchReader.checkBucketId() should run for unbucketed tables

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman resolved HIVE-20579.
---
  Resolution: Implemented
Assignee: Eugene Koifman
Release Note: n/a

fixed as part of HIVE-17231

> VectorizedOrcAcidRowBatchReader.checkBucketId() should run for unbucketed 
> tables
> 
>
> Key: HIVE-20579
> URL: https://issues.apache.org/jira/browse/HIVE-20579
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> VectorizedOrcAcidRowBatchReader.checkBucketId() currently bails for 
> unbucketed tables.
> Since HIVE-19890, all BucketCodec.decodeWriterId(ROW__ID.bucketid) values 
> should match the writer ID in the file name (e.g. bucket_1),
> so it should still perform the check.
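
A sketch of the invariant to assert, with an illustrative re-implementation of
the writer-id decoding (not the real BucketCodec):

{code:java}
final class CheckBucketIdSketch {
  // Illustrative: assume the writer id sits in bits 16..27 of the encoded
  // bucket field, which matches values like 536936448 decoding to writer 1.
  static int decodeWriterId(int encodedBucket) {
    return (encodedBucket >>> 16) & 0xFFF;
  }

  // Since HIVE-19890 this must hold for bucketed and unbucketed tables alike.
  static void checkBucketId(int encodedBucket, int bucketIdFromFileName) {
    int writerId = decodeWriterId(encodedBucket);
    if (writerId != bucketIdFromFileName) {
      throw new IllegalStateException("Corrupted records: expected bucket id "
          + bucketIdFromFileName + " but found writer id " + writerId);
    }
  }
}
{code}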



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20694) Additional unit tests for VectorizedOrcAcidRowBatchReader min max key evaluation

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20694:
--
Issue Type: Sub-task  (was: Test)
Parent: HIVE-20738

> Additional unit tests for VectorizedOrcAcidRowBatchReader min max key 
> evaluation
> 
>
> Key: HIVE-20694
> URL: https://issues.apache.org/jira/browse/HIVE-20694
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Fix For: 4.0.0
>
> Attachments: HIVE-20694.patch
>
>
> Follow up to HIVE-20664 and HIVE-20635.
> Additional unit tests for {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}} 
> and {{VectorizedOrcAcidRowBatchReader.findOriginalMinMaxKeys}} related to 
> split and stripe boundaries - particularly the case when a split is 
> completely within an ORC stripe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20664) Potential ArrayIndexOutOfBoundsException in VectorizedOrcAcidRowBatchReader.findMinMaxKeys

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20664:
--
Issue Type: Sub-task  (was: Bug)
Parent: HIVE-20738

> Potential ArrayIndexOutOfBoundsException in 
> VectorizedOrcAcidRowBatchReader.findMinMaxKeys
> --
>
> Key: HIVE-20664
> URL: https://issues.apache.org/jira/browse/HIVE-20664
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Fix For: 4.0.0
>
> Attachments: HIVE-20664.2.patch, HIVE-20664.patch
>
>
> [~ekoifman], could you please confirm if my understanding is correct and if 
> so, review the fix?
> In the method {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}, the code 
> snippet that identifies the first and last stripe indices in the current 
> split could result in an ArrayIndexOutOfBoundsException if a complete split 
> is within the same stripe:
> {noformat}
> for (int i = 0; i < stripes.size(); i++) {
>   StripeInformation stripe = stripes.get(i);
>   long stripeEnd = stripe.getOffset() + stripe.getLength();
>   if (firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
>     firstStripeIndex = i;
>   }
>   if (lastStripeIndex == -1 && splitEnd <= stripeEnd &&
>       stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()) {
>     // the last condition is for when both splitStart and splitEnd are in
>     // the same stripe
>     lastStripeIndex = i;
>   }
> }
> {noformat}
> Consider the example where there are 2 stripes - 0-500 and 500-1000 and 
> splitStart is 600 and splitEnd is 800.
> In the first iteration of the loop, stripe.getOffset() is 0 and stripeEnd is 
> 500. In this iteration, neither of the if statement conditions will be met 
> and firstStripeIndex as well as lastStripeIndex remain -1.
> In the second iteration of the loop stripe.getOffset() is 500, stripeEnd is 
> 1000, The first if statement condition will not be met in this case because 
> stripe's offset (500) is not greater than or equal to the splitStart (600). 
> However, in the second if statement, splitEnd (800) is <= stripeEnd(1000) and 
> it will try to compute the last condition 
> {{stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()}}. This 
> will throw an ArrayIndexOutOfBoundsException because firstStripeIndex is 
> still -1.
> I'm not sure if this scenario is possible at all, hence logging this as a low 
> priority issue. Perhaps block based split generation using BISplitStrategy 
> could trigger this?
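
A sketch of a guarded version of the loop, using plain offset/length arrays
instead of StripeInformation; the key change is picking the first stripe by
overlap, so firstStripeIndex can never be -1 when it is dereferenced:

{code:java}
final class FindStripeRangeSketch {
  // Returns {first, last} stripe indices covering [splitStart, splitEnd),
  // or {-1, -1} if the split touches no stripe.
  static int[] findStripeRange(long[] offsets, long[] lengths,
                               long splitStart, long splitEnd) {
    int first = -1;
    int last = -1;
    for (int i = 0; i < offsets.length; i++) {
      long stripeEnd = offsets[i] + lengths[i];
      if (first == -1 && splitStart < stripeEnd) {
        first = i; // first stripe that overlaps the split
      }
      if (first != -1 && splitEnd <= stripeEnd) {
        last = i;  // also handles splitStart and splitEnd in the same stripe
        break;
      }
    }
    return new int[] {first, last};
  }

  public static void main(String[] args) {
    // Stripes 0-500 and 500-1000; split 600-800 sits inside the second one.
    int[] range = findStripeRange(new long[] {0, 500}, new long[] {500, 500},
        600, 800);
    System.out.println(range[0] + ".." + range[1]); // 1..1, no exception
  }
}
{code}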



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20635:
--
Issue Type: Sub-task  (was: Improvement)
Parent: HIVE-20738

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20635.2.patch, HIVE-20635.3.patch, HIVE-20635.patch
>
>
> This is a follow-up to HIVE-16812, which adds support for delete event 
> filtering for splits from native acid files.
> We need to add the same for {{OrcSplit.isOriginal()}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-16812:
--
Issue Type: Sub-task  (was: Improvement)
Parent: HIVE-20738

> VectorizedOrcAcidRowBatchReader doesn't filter delete events
> 
>
> Key: HIVE-16812
> URL: https://issues.apache.org/jira/browse/HIVE-16812
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 2.3.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Critical
> Fix For: 4.0.0
>
> Attachments: HIVE-16812.02.patch, HIVE-16812.04.patch, 
> HIVE-16812.05.patch, HIVE-16812.06.patch, HIVE-16812.07.patch
>
>
> The c'tor of VectorizedOrcAcidRowBatchReader has
> {noformat}
> // Clone readerOptions for deleteEvents.
> Reader.Options deleteEventReaderOptions = readerOptions.clone();
> // Set the range on the deleteEventReaderOptions to 0 to INTEGER_MAX 
> because
> // we always want to read all the delete delta files.
> deleteEventReaderOptions.range(0, Long.MAX_VALUE);
> {noformat}
> This is suboptimal since base and deltas are sorted by ROW__ID.  So for each 
> split of the base we can find the min/max ROW__ID and only load events from 
> deltas that are in the [min,max] range.  This will reduce the number of 
> delete events we load in memory (to no more than there are in the split).
> When we support sorting on PK, the same should apply but we'd need to make 
> sure to store PKs in ORC index
> See {{OrcRawRecordMerger.discoverKeyBounds()}}
> {{hive.acid.key.index}} in Orc footer has an index of ROW__IDs so we should 
> know min/max easily for any file written by {{OrcRecordUpdater}}
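
A sketch of the intended filtering, assuming min/max keys recovered from the
split (e.g. via the {{hive.acid.key.index}} footer entry); Key is a simplified
stand-in for Hive's RecordIdentifier:

{code:java}
import java.util.List;
import java.util.stream.Collectors;

final class DeleteEventRangeSketch {
  record Key(long origTxn, int bucket, long rowId) implements Comparable<Key> {
    @Override public int compareTo(Key o) {
      int c = Long.compare(origTxn, o.origTxn);
      if (c == 0) c = Integer.compare(bucket, o.bucket);
      if (c == 0) c = Long.compare(rowId, o.rowId);
      return c;
    }
  }

  // Instead of range(0, Long.MAX_VALUE), only keep delete events whose key
  // falls inside the split's [min, max] ROW__ID range.
  static List<Key> inRange(List<Key> deleteEvents, Key min, Key max) {
    return deleteEvents.stream()
        .filter(k -> k.compareTo(min) >= 0 && k.compareTo(max) <= 0)
        .collect(Collectors.toList());
  }
}
{code}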



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20579) VectorizedOrcAcidRowBatchReader.checkBucketId() should run for unbucketed tables

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20579:
--
Issue Type: Sub-task  (was: Improvement)
Parent: HIVE-20738

> VectorizedOrcAcidRowBatchReader.checkBucketId() should run for unbucketed 
> tables
> 
>
> Key: HIVE-20579
> URL: https://issues.apache.org/jira/browse/HIVE-20579
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Priority: Major
>
> VectorizedOrcAcidRowBatchReader.checkBucketId() currently bails for 
> unbucketed tables.
> Since HIVE-19890, all BucketCodec.decodeWriterId(ROW__ID.bucketid) values 
> should match the writer ID in the file name (e.g. bucket_1),
> so it should still perform the check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17231) ColumnizedDeleteEventRegistry.DeleteReaderValue optimization

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-17231:
--
Parent Issue: HIVE-20738  (was: HIVE-17204)

> ColumnizedDeleteEventRegistry.DeleteReaderValue optimization
> 
>
> Key: HIVE-17231
> URL: https://issues.apache.org/jira/browse/HIVE-17231
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-17231.01.patch, HIVE-17231.02.patch
>
>
>  For unbucketed tables DeleteReaderValue will currently return all delete 
> events.  Once we trust that the N in bucketN for a "base" split is reliable, 
> all delete events not matching N can be skipped.
> This is useful to protect against extreme cases where someone runs an 
> update/delete on a partition that matches 10 billion rows and thus generates 
> very many delete events.
> Since HIVE-19890 all acid data files must have the bucketid/writerid in the 
> file name match the bucketid/writerid in the ROW__ID in the data.
> {{OrcRawRecordMerger.getDeltaFiles()}} should only return files representing 
> the right {{bucket}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-20738) Enable Delete Event filtering in VectorizedOrcAcidRowBatchReader

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-20738:
-


> Enable Delete Event filtering in VectorizedOrcAcidRowBatchReader
> 
>
> Key: HIVE-20738
> URL: https://issues.apache.org/jira/browse/HIVE-20738
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> Currently the DeleteEventRegistry loads all delete events, which can take 
> time and use a lot of memory.  We should minimize the number of deletes 
> loaded based on the insert events included in the split.
> This is an umbrella jira for several tasks that make up the work.  See 
> individual tasks for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17231) ColumnizedDeleteEventRegistry.DeleteReaderValue optimization

2018-10-12 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-17231:
--
Attachment: HIVE-17231.02.patch

> ColumnizedDeleteEventRegistry.DeleteReaderValue optimization
> 
>
> Key: HIVE-17231
> URL: https://issues.apache.org/jira/browse/HIVE-17231
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-17231.01.patch, HIVE-17231.02.patch
>
>
>  For unbucketed tables DeleteReaderValue will currently return all delete 
> events.  Once we trust that the N in bucketN for a "base" split is reliable, 
> all delete events not matching N can be skipped.
> This is useful to protect against extreme cases where someone runs an 
> update/delete on a partition that matches 10 billion rows and thus generates 
> very many delete events.
> Since HIVE-19890 all acid data files must have the bucketid/writerid in the 
> file name match the bucketid/writerid in the ROW__ID in the data.
> {{OrcRawRecordMerger.getDeltaFiles()}} should only return files representing 
> the right {{bucket}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20719) SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition optimization and vectorization on

2018-10-11 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20719:
--
Attachment: HIVE-20719.01.patch

> SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition 
> optimization and vectorization on
> ---
>
> Key: HIVE-20719
> URL: https://issues.apache.org/jira/browse/HIVE-20719
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions, Vectorization
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-20719.01.patch
>
>
> *Reproducer*
> {code:sql}
>  set hive.optimize.sort.dynamic.partition=true ;
> create table acid_uap(a int, b varchar(128)) partitioned by (ds string) 
> clustered by (a) into 2 buckets stored as orc TBLPROPERTIES 
> ('transactional'='true');
> insert into table acid_uap partition (ds='tomorrow') values (1, 'bah'),(2, 
> 'yah') ;
> insert into table acid_uap partition (ds='today') values (1, 'bah'),(2, 
> 'yah') ;
> select a,b,ds from acid_uap order by a,b;
> update acid_uap set b = 'fred';
> select a,b,ds from acid_uap order by a,b;
> {code}
> *Error*
> {code:java}
> Status: Failed
> Vertex failed, vertexName=Map 1, vertexId=vertex_1539123809352_0001_5_00, 
> diagnostics=[Task failed, taskId=task_1539123809352_0001_5_00_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1539123809352_0001_5_00_00_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: java.io.IOException: java.io.IOException: 
> Corrupted records with different bucket ids from the containing bucket file 
> found! Expected bucket id 0, however found 
> DeleteRecordKey(2,536936448(1.1.0),0).  (OrcSplit 
> [file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delta_002_002_/bucket_0,
>  start=3, length=361, isOriginal=false, fileLength=798, hasFooter=false, 
> hasBase=true, 
> deltas=2],file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delete_delta_003_003_/bucket_0)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: java.io.IOException: 
> java.io.IOException: Corrupted records with different bucket ids from the 
> containing bucket file found! Expected bucket id 0, however found 
> DeleteRecordKey(2,536936448(1.1.0),0).  (OrcSplit 
> [file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delta_002_002_/bucket_0,
>  start=3, length=361, isOriginal=false, fileLength=798, hasFooter=false, 
> hasBase=true, 
> deltas=2],file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delete_delta_003_003_/bucket_0)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
>   at 
> org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderM

[jira] [Updated] (HIVE-20719) SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition optimization and vectorization on

2018-10-11 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20719:
--
Status: Patch Available  (was: Open)

> SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition 
> optimization and vectorization on
> ---
>
> Key: HIVE-20719
> URL: https://issues.apache.org/jira/browse/HIVE-20719
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions, Vectorization
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-20719.01.patch
>
>
> *Reproducer*
> {code:sql}
>  set hive.optimize.sort.dynamic.partition=true ;
> create table acid_uap(a int, b varchar(128)) partitioned by (ds string) 
> clustered by (a) into 2 buckets stored as orc TBLPROPERTIES 
> ('transactional'='true');
> insert into table acid_uap partition (ds='tomorrow') values (1, 'bah'),(2, 
> 'yah') ;
> insert into table acid_uap partition (ds='today') values (1, 'bah'),(2, 
> 'yah') ;
> select a,b,ds from acid_uap order by a,b;
> update acid_uap set b = 'fred';
> select a,b,ds from acid_uap order by a,b;
> {code}
> *Error*
> {code:java}
> Status: Failed
> Vertex failed, vertexName=Map 1, vertexId=vertex_1539123809352_0001_5_00, 
> diagnostics=[Task failed, taskId=task_1539123809352_0001_5_00_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1539123809352_0001_5_00_00_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: java.io.IOException: java.io.IOException: 
> Corrupted records with different bucket ids from the containing bucket file 
> found! Expected bucket id 0, however found 
> DeleteRecordKey(2,536936448(1.1.0),0).  (OrcSplit 
> [file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delta_002_002_/bucket_0,
>  start=3, length=361, isOriginal=false, fileLength=798, hasFooter=false, 
> hasBase=true, 
> deltas=2],file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delete_delta_003_003_/bucket_0)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: java.io.IOException: 
> java.io.IOException: Corrupted records with different bucket ids from the 
> containing bucket file found! Expected bucket id 0, however found 
> DeleteRecordKey(2,536936448(1.1.0),0).  (OrcSplit 
> [file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delta_002_002_/bucket_0,
>  start=3, length=361, isOriginal=false, fileLength=798, hasFooter=false, 
> hasBase=true, 
> deltas=2],file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delete_delta_003_003_/bucket_0)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
>   at 
> org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRRe

[jira] [Commented] (HIVE-20719) SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition optimization and vectorization on

2018-10-11 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646834#comment-16646834
 ] 

Eugene Koifman commented on HIVE-20719:
---

Disabling vectorization doesn't actually fix it.  The difference is that the 
vectorized acid reader has an assert to make sure the file name and the 
ROW__IDs in the file match, while the non-vectorized one doesn't (it just 
produces bad data by not reading all relevant delete events).

Here is what the state looks like on disk after the update statement.  
Everything from the update stmt is written to bucket_0.  This creates data skew 
for insert events and will ignore delete events that apply to other buckets at 
read time.

{noformat}
acid_uap/ds=tomorrow/delta_001_001_/bucket_1 [length: 702]
{"operation":0,"originalTransaction":1,"bucket":536936448,"rowId":0,"currentTransaction":1,"row":{"a":1,"b":"bah"}}


acid_uap/ds=tomorrow/delta_001_001_/bucket_0 [length: 703]
{"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":1,"row":{"a":2,"b":"yah"}}


acid_uap/ds=today/delta_002_002_/bucket_1 [length: 711]
{"operation":0,"originalTransaction":2,"bucket":536936448,"rowId":0,"currentTransaction":2,"row":{"a":1,"b":"bah"}}


acid_uap/ds=today/delta_002_002_/bucket_0 [length: 695]
{"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":{"a":2,"b":"yah"}}


acid_uap/ds=tomorrow/delta_003_003_/bucket_0 [length: 733]
{"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":{"a":2,"b":"fred"}}
{"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":{"a":1,"b":"fred"}}


acid_uap/ds=today/delta_003_003_/bucket_0 [length: 733]
{"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":{"a":2,"b":"fred"}}
{"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":{"a":1,"b":"fred"}}


acid_uap/ds=tomorrow/delete_delta_003_003_/bucket_0 [length: 
713]
{"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":1,"bucket":536936448,"rowId":0,"currentTransaction":3,"row":null}


acid_uap/ds=today/delete_delta_003_003_/bucket_0 [length: 705]
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":0,"currentTransaction":3,"row":null}
{noformat}

> SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition 
> optimization and vectorization on
> ---
>
> Key: HIVE-20719
> URL: https://issues.apache.org/jira/browse/HIVE-20719
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions, Vectorization
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Eugene Koifman
>Priority: Major
>
> *Reproducer*
> {code:sql}
>  set hive.optimize.sort.dynamic.partition=true ;
> create table acid_uap(a int, b varchar(128)) partitioned by (ds string) 
> clustered by (a) into 2 buckets stored as orc TBLPROPERTIES 
> ('transactional'='true');
> insert into table acid_uap partition (ds='tomorrow') values (1, 'bah'),(2, 
> 'yah') ;
> insert into table acid_uap partition (ds='today') values (1, 'bah'),(2, 
> 'yah') ;
> select a,b,ds from acid_uap order by a,b;
> update acid_uap set b = 'fred';
> select a,b,ds from acid_uap order by a,b;
> {code}
> *Error*
> {code:java}
> Status: Failed
> Vertex failed, vertexName=Map 1, vertexId=vertex_1539123809352_0001_5_00, 
> diagnostics=[Task failed, taskId=task_1539123809352_0001_5_00_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 

[jira] [Assigned] (HIVE-20719) SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition optimization and vectorization on

2018-10-11 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-20719:
-

Assignee: Eugene Koifman

> SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition 
> optimization and vectorization on
> ---
>
> Key: HIVE-20719
> URL: https://issues.apache.org/jira/browse/HIVE-20719
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions, Vectorization
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Assignee: Eugene Koifman
>Priority: Major
>
> *Reproducer*
> {code:sql}
>  set hive.optimize.sort.dynamic.partition=true ;
> create table acid_uap(a int, b varchar(128)) partitioned by (ds string) 
> clustered by (a) into 2 buckets stored as orc TBLPROPERTIES 
> ('transactional'='true');
> insert into table acid_uap partition (ds='tomorrow') values (1, 'bah'),(2, 
> 'yah') ;
> insert into table acid_uap partition (ds='today') values (1, 'bah'),(2, 
> 'yah') ;
> select a,b,ds from acid_uap order by a,b;
> update acid_uap set b = 'fred';
> select a,b,ds from acid_uap order by a,b;
> {code}
> *Error*
> {code:java}
> Status: Failed
> Vertex failed, vertexName=Map 1, vertexId=vertex_1539123809352_0001_5_00, 
> diagnostics=[Task failed, taskId=task_1539123809352_0001_5_00_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1539123809352_0001_5_00_00_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: java.io.IOException: java.io.IOException: 
> Corrupted records with different bucket ids from the containing bucket file 
> found! Expected bucket id 0, however found 
> DeleteRecordKey(2,536936448(1.1.0),0).  (OrcSplit 
> [file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delta_002_002_/bucket_0,
>  start=3, length=361, isOriginal=false, fileLength=798, hasFooter=false, 
> hasBase=true, 
> deltas=2],file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delete_delta_003_003_/bucket_0)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: java.io.IOException: 
> java.io.IOException: Corrupted records with different bucket ids from the 
> containing bucket file found! Expected bucket id 0, however found 
> DeleteRecordKey(2,536936448(1.1.0),0).  (OrcSplit 
> [file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delta_002_002_/bucket_0,
>  start=3, length=361, isOriginal=false, fileLength=798, hasFooter=false, 
> hasBase=true, 
> deltas=2],file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delete_delta_003_003_/bucket_0)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
>   at 
> org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
>   at 
> org.apache.hado

[jira] [Updated] (HIVE-17231) ColumnizedDeleteEventRegistry.DeleteReaderValue optimization

2018-10-10 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-17231:
--
Attachment: HIVE-17231.01.patch

> ColumnizedDeleteEventRegistry.DeleteReaderValue optimization
> 
>
> Key: HIVE-17231
> URL: https://issues.apache.org/jira/browse/HIVE-17231
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-17231.01.patch
>
>
>  For unbucketed tables DeleteReaderValue will currently return all delete 
> events.  Once we trust that
>  the N in bucketN for a "base" split is reliable, all delete events not 
> matching N can be skipped.
> This is useful to protect against extreme cases where someone runs an 
> update/delete on a partition that matches 10 billion rows and thus generates 
> very many delete events.
> Since HIVE-19890 all acid data files must have the bucketid/writerid in the 
> file name match the bucketid/writerid in the ROW__ID in the data.
> {{OrcRawRecordMerger.getDeltaFiles()}} should only return files representing 
> the right {{bucket}}.
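
For illustration, a minimal sketch of the bucket-matching idea in plain Java. The 
{{parseBucketId()}} helper and the use of bare file-name strings are assumptions 
made for the sketch; the real change would live in 
{{OrcRawRecordMerger.getDeltaFiles()}}:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class DeleteDeltaBucketFilter {
  /** Extracts N from a name like "bucket_00001"; returns -1 if unparseable. */
  static int parseBucketId(String fileName) {
    if (!fileName.startsWith("bucket_")) {
      return -1;
    }
    try {
      return Integer.parseInt(fileName.substring("bucket_".length()));
    } catch (NumberFormatException e) {
      return -1;
    }
  }

  /** Keeps only the delete-delta buckets whose N matches the split's bucket. */
  static List<String> filterForBucket(List<String> deleteDeltaFiles, int splitBucketId) {
    List<String> matching = new ArrayList<>();
    for (String file : deleteDeltaFiles) {
      if (parseBucketId(file) == splitBucketId) {
        matching.add(file); // delete events from other buckets can be skipped
      }
    }
    return matching;
  }
}
{code}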



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17231) ColumnizedDeleteEventRegistry.DeleteReaderValue optimization

2018-10-10 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-17231:
--
Status: Patch Available  (was: Open)

> ColumnizedDeleteEventRegistry.DeleteReaderValue optimization
> 
>
> Key: HIVE-17231
> URL: https://issues.apache.org/jira/browse/HIVE-17231
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-17231.01.patch
>
>
>  For unbucketed tables DeleteReaderValue will currently return all delete 
> events.  Once we trust that
>  the N in bucketN for a "base" split is reliable, all delete events not 
> matching N can be skipped.
> This is useful to protect against extreme cases where someone runs an 
> update/delete on a partition that matches 10 billion rows and thus generates 
> very many delete events.
> Since HIVE-19890 all acid data files must have the bucketid/writerid in the 
> file name match the bucketid/writerid in the ROW__ID in the data.
> {{OrcRawRecordMerger.getDeltaFiles()}} should only return files representing 
> the right {{bucket}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17231) ColumnizedDeleteEventRegistry.DeleteReaderValue optimization

2018-10-10 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645598#comment-16645598
 ] 

Eugene Koifman commented on HIVE-17231:
---

todo: roll HIVE-20579 into this

> ColumnizedDeleteEventRegistry.DeleteReaderValue optimization
> 
>
> Key: HIVE-17231
> URL: https://issues.apache.org/jira/browse/HIVE-17231
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Eugene Koifman
>Priority: Major
> Attachments: HIVE-17231.01.patch
>
>
>  For unbucketed tables DeleteReaderValue will currently return all delete 
> events.  Once we trust that
>  the N in bucketN for a "base" split is reliable, all delete events not 
> matching N can be skipped.
> This is useful to protect against extreme cases where someone runs an 
> update/delete on a partition that matches 10 billion rows and thus generates 
> very many delete events.
> Since HIVE-19890 all acid data files must have the bucketid/writerid in the 
> file name match the bucketid/writerid in the ROW__ID in the data.
> {{OrcRawRecordMerger.getDeltaFiles()}} should only return files representing 
> the right {{bucket}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-17231) ColumnizedDeleteEventRegistry.DeleteReaderValue optimization

2018-10-10 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-17231:
-

Assignee: Eugene Koifman

> ColumnizedDeleteEventRegistry.DeleteReaderValue optimization
> 
>
> Key: HIVE-17231
> URL: https://issues.apache.org/jira/browse/HIVE-17231
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-17231.01.patch
>
>
>  For unbucketed tables DeleteReaderValue will currently return all delete 
> events.  Once we trust that
>  the N in bucketN for a "base" split is reliable, all delete events not 
> matching N can be skipped.
> This is useful to protect against extreme cases where someone runs an 
> update/delete on a partition that matches 10 billion rows and thus generates 
> very many delete events.
> Since HIVE-19890 all acid data files must have the bucketid/writerid in the 
> file name match the bucketid/writerid in the ROW__ID in the data.
> {{OrcRawRecordMerger.getDeltaFiles()}} should only return files representing 
> the right {{bucket}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20723) Allow per table specification of compaction yarn queue

2018-10-10 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645420#comment-16645420
 ] 

Eugene Koifman commented on HIVE-20723:
---

FYI, [~ikryvenko], [~saurabhseth]

> Allow per table specification of compaction yarn queue
> --
>
> Key: HIVE-20723
> URL: https://issues.apache.org/jira/browse/HIVE-20723
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 2.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> Currently compactions of full CRUD transactional tables are Map-Reduce jobs 
> submitted to a yarn queue defined by hive.compactor.job.queue property.
> It would be useful to be able to override this on a per-table basis by putting 
> it into table properties so that compactions for different tables can use 
> different queues.
>  
> There is already the ability to override other compaction-related configs via 
> table props, though this will need additional handling in 
> {{CompactorMR.createBaseJobConf}} to set the queue name.
> [https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-TableProperties]
>  
> See {{CompactorMR.COMPACTOR_PREFIX}} and 
> {{Initiator.COMPACTORTHRESHOLD_PREFIX}}
>  
>  
>  
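
A minimal sketch of the per-table override, assuming a hypothetical property key 
built from {{COMPACTOR_PREFIX}}; the real wiring in 
{{CompactorMR.createBaseJobConf}} may look different:

{code:java}
import java.util.Map;

public class CompactorQueueResolver {
  static final String COMPACTOR_PREFIX = "compactor.";
  static final String QUEUE_NAME = "mapred.job.queue.name";

  /**
   * A table property "compactor.mapred.job.queue.name" (hypothetical key)
   * overrides the global hive.compactor.job.queue value when present.
   */
  static String resolveQueue(Map<String, String> tableProps, String globalQueue) {
    String override = tableProps.get(COMPACTOR_PREFIX + QUEUE_NAME);
    return (override != null && !override.isEmpty()) ? override : globalQueue;
  }
}
{code}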



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20723) Allow per table specification of compaction yarn queue

2018-10-10 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20723:
--
Description: 
Currently compactions of full CRUD transactional tables are Map-Reduce jobs 
submitted to a yarn queue defined by hive.compactor.job.queue property.

It would be useful to be able to override this on a per-table basis by putting it 
into table properties so that compactions for different tables can use 
different queues.

 

There is already the ability to override other compaction-related configs via table 
props, though this will need additional handling in 
{{CompactorMR.createBaseJobConf}} to set the queue name.

[https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-TableProperties]

 

See {{CompactorMR.COMPACTOR_PREFIX}} and {{Initiator.COMPACTORTHRESHOLD_PREFIX}}

 

 

 

  was:
Currently compactions of full CRUD transactional tables are Map-Reduce jobs 
submitted to a yarn queue defined by hive.compactor.job.queue property.

It would be useful to be able to override this on a per-table basis by putting it 
into table properties so that compactions for different tables can use 
different queues.

 

There is already the ability to override other compaction-related configs via table 
props, though this will need additional handling.

[https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-TableProperties]

 

See {{CompactorMR.COMPACTOR_PREFIX}} and {{Initiator.COMPACTORTHRESHOLD_PREFIX}}

 

 

 


> Allow per table specification of compaction yarn queue
> --
>
> Key: HIVE-20723
> URL: https://issues.apache.org/jira/browse/HIVE-20723
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 2.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> Currently compactions of full CRUD transactional tables are Map-Reduce jobs 
> submitted to a yarn queue defined by hive.compactor.job.queue property.
> It would be useful to be able to override this on a per-table basis by putting 
> it into table properties so that compactions for different tables can use 
> different queues.
>  
> There is already the ability to override other compaction-related configs via 
> table props, though this will need additional handling in 
> {{CompactorMR.createBaseJobConf}} to set the queue name.
> [https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-TableProperties]
>  
> See {{CompactorMR.COMPACTOR_PREFIX}} and 
> {{Initiator.COMPACTORTHRESHOLD_PREFIX}}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20694) Additional unit tests for VectorizedOrcAcidRowBatchReader min max key evaluation

2018-10-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20694:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
 Release Note: n/a
   Status: Resolved  (was: Patch Available)

committed to master

thanks Saurabh for the contribution

> Additional unit tests for VectorizedOrcAcidRowBatchReader min max key 
> evaluation
> 
>
> Key: HIVE-20694
> URL: https://issues.apache.org/jira/browse/HIVE-20694
> Project: Hive
>  Issue Type: Test
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Fix For: 4.0.0
>
> Attachments: HIVE-20694.patch
>
>
> Follow up to HIVE-20664 and HIVE-20635.
> Additional unit tests for {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}} 
> and {{VectorizedOrcAcidRowBatchReader.findOriginalMinMaxKeys}} related to 
> split and stripe boundaries - particularly the case when a split is 
> completely within an ORC stripe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20694) Additional unit tests for VectorizedOrcAcidRowBatchReader min max key evaluation

2018-10-09 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644208#comment-16644208
 ] 

Eugene Koifman commented on HIVE-20694:
---

+1

> Additional unit tests for VectorizedOrcAcidRowBatchReader min max key 
> evaluation
> 
>
> Key: HIVE-20694
> URL: https://issues.apache.org/jira/browse/HIVE-20694
> Project: Hive
>  Issue Type: Test
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-20694.patch
>
>
> Follow up to HIVE-20664 and HIVE-20635.
> Additional unit tests for {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}} 
> and {{VectorizedOrcAcidRowBatchReader.findOriginalMinMaxKeys}} related to 
> split and stripe boundaries - particularly the case when a split is 
> completely within an ORC stripe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20719) SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition optimization and vectorization on

2018-10-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20719:
--
Component/s: Transactions

> SELECT statement fails after UPDATE with hive.optimize.sort.dynamic.partition 
> optimization and vectorization on
> ---
>
> Key: HIVE-20719
> URL: https://issues.apache.org/jira/browse/HIVE-20719
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions, Vectorization
>Affects Versions: 4.0.0
>Reporter: Vineet Garg
>Priority: Major
>
> *Reproducer*
> {code:sql}
>  set hive.optimize.sort.dynamic.partition=true ;
> create table acid_uap(a int, b varchar(128)) partitioned by (ds string) 
> clustered by (a) into 2 buckets stored as orc TBLPROPERTIES 
> ('transactional'='true');
> insert into table acid_uap partition (ds='today') select cint, cast(cstring1 
> as varchar(128)) as cs from alltypesorc where cint is not null and cint < 0 
> order by cint, cs limit 10;
> insert into table acid_uap partition (ds='tomorrow') select cint, 
> cast(cstring1 as varchar(128)) as cs from alltypesorc where cint is not null 
> and cint > 10 order by cint, cs limit 10;
> select a,b,ds from acid_uap order by a,b;
> update acid_uap set b = 'fred';
> select a,b,ds from acid_uap order by a,b;
> {code}
> *Error*
> {code:java}
> Status: Failed
> Vertex failed, vertexName=Map 1, vertexId=vertex_1539123809352_0001_5_00, 
> diagnostics=[Task failed, taskId=task_1539123809352_0001_5_00_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1539123809352_0001_5_00_00_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: java.io.IOException: java.io.IOException: 
> Corrupted records with different bucket ids from the containing bucket file 
> found! Expected bucket id 0, however found 
> DeleteRecordKey(2,536936448(1.1.0),0).  (OrcSplit 
> [file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delta_002_002_/bucket_0,
>  start=3, length=361, isOriginal=false, fileLength=798, hasFooter=false, 
> hasBase=true, 
> deltas=2],file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delete_delta_003_003_/bucket_0)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: java.io.IOException: 
> java.io.IOException: Corrupted records with different bucket ids from the 
> containing bucket file found! Expected bucket id 0, however found 
> DeleteRecordKey(2,536936448(1.1.0),0).  (OrcSplit 
> [file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delta_002_002_/bucket_0,
>  start=3, length=361, isOriginal=false, fileLength=798, hasFooter=false, 
> hasBase=true, 
> deltas=2],file:/Users/vgarg/hive_temp/vgarg/hive/warehouse/dp_sort.db/acid_uap/ds=today/delete_delta_003_003_/bucket_0)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
>   at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordRead

[jira] [Updated] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-06 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20635:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
 Release Note: n/a
   Status: Resolved  (was: Patch Available)

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20635.2.patch, HIVE-20635.3.patch, HIVE-20635.patch
>
>
> this is a followup to HIVE-16812 which adds support for delete event 
> filtering for splits from native acid files
> need to add the same for {{OrcSplit.isOriginal()}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-06 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640876#comment-16640876
 ] 

Eugene Koifman commented on HIVE-20635:
---

committed patch 3 to master

thanks Saurabh for the contribution

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20635.2.patch, HIVE-20635.3.patch, HIVE-20635.patch
>
>
> this is a followup to HIVE-16812 which adds support for delete event 
> filtering for splits from native acid files
> need to add the same for {{OrcSplit.isOriginal()}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-20699) Query based compactor for full CRUD Acid tables

2018-10-05 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-20699:
-


> Query based compactor for full CRUD Acid tables
> ---
>
> Key: HIVE-20699
> URL: https://issues.apache.org/jira/browse/HIVE-20699
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-19550) Enable TestAcidOnTez#testNonStandardConversion01

2018-10-05 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-19550:
-

Assignee: Eugene Koifman

> Enable TestAcidOnTez#testNonStandardConversion01
> 
>
> Key: HIVE-19550
> URL: https://issues.apache.org/jira/browse/HIVE-19550
> Project: Hive
>  Issue Type: Test
>  Components: Test
>Affects Versions: 3.1.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Eugene Koifman
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-19549) Enable TestAcidOnTez#testCtasTezUnion

2018-10-05 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-19549:
-

Assignee: Eugene Koifman

> Enable TestAcidOnTez#testCtasTezUnion
> -
>
> Key: HIVE-19549
> URL: https://issues.apache.org/jira/browse/HIVE-19549
> Project: Hive
>  Issue Type: Test
>  Components: Test
>Affects Versions: 3.1.0
>Reporter: Jesus Camacho Rodriguez
>Assignee: Eugene Koifman
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-05 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640053#comment-16640053
 ] 

Eugene Koifman commented on HIVE-20635:
---

+1 patch 3 pending tests

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20635.2.patch, HIVE-20635.3.patch, HIVE-20635.patch
>
>
> this is a followup to HIVE-16812 which adds support for delete event 
> filtering for splits from native acid files
> need to add the same for {{OrcSplit.isOriginal()}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20693) Case-sensitivity for column names when reading from ORC

2018-10-05 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640046#comment-16640046
 ] 

Eugene Koifman commented on HIVE-20693:
---

it's probably better to ask on u...@orc.apache.org 

> Case-sensitivity for column names when reading from ORC
> ---
>
> Key: HIVE-20693
> URL: https://issues.apache.org/jira/browse/HIVE-20693
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, ORC
>Affects Versions: 2.3.2
>Reporter: Alexandre Crayssac
>Priority: Major
>
> Hello everyone,
> I observed different behavior between versions 1.2.1 and 2.3.2 (those are the 
> only two versions I've been able to test).
> When creating an external table pointing to ORC files that have upper-cased 
> column names in the ORC file metadata, I'm able to read the data on 1.2.1 but 
> not on 2.3.2.
> I tested with both upper-cased and lower-cased column names in my CREATE 
> TABLE statement and it does not work in either case. That seems expected, since 
> column names are normalized to lower case in Hive.
> So, I would like to know: is this a feature or a bug in Hive 2.3.2?
> In fact, if this is a feature, it would be impossible to have upper-case 
> column names in ORC files with Hive 2.3.2.
> Please let me know if you need more information.
> Kind regards,
> Alexandre



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20664) Potential ArrayIndexOutOfBoundsException in VectorizedOrcAcidRowBatchReader.findMinMaxKeys

2018-10-05 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20664:
--
  Resolution: Fixed
   Fix Version/s: 4.0.0
Release Note: n/a
Target Version/s: 4.0.0
  Status: Resolved  (was: Patch Available)

committed to master

thanks Saurabh for the contribution

> Potential ArrayIndexOutOfBoundsException in 
> VectorizedOrcAcidRowBatchReader.findMinMaxKeys
> --
>
> Key: HIVE-20664
> URL: https://issues.apache.org/jira/browse/HIVE-20664
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Fix For: 4.0.0
>
> Attachments: HIVE-20664.2.patch, HIVE-20664.patch
>
>
> [~ekoifman], could you please confirm if my understanding is correct and if 
> so, review the fix?
> In the method {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}, the code 
> snippet that identifies the first and last stripe indices in the current 
> split could result in an ArrayIndexOutOfBoundsException if a complete split 
> is within the same stripe:
> {noformat}
>     for(int i = 0; i < stripes.size(); i++) {
>   StripeInformation stripe = stripes.get(i);
>   long stripeEnd = stripe.getOffset() + stripe.getLength();
>   if(firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
> firstStripeIndex = i;
>   }
>   if(lastStripeIndex == -1 && splitEnd <= stripeEnd &&
>   stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset() ) {
> //the last condition is for when both splitStart and splitEnd are in
> // the same stripe
> lastStripeIndex = i;
>   }
> }
> {noformat}
> Consider the example where there are 2 stripes - 0-500 and 500-1000 and 
> splitStart is 600 and splitEnd is 800.
> In the first iteration of the loop, stripe.getOffset() is 0 and stripeEnd is 
> 500. In this iteration, neither of the if statement conditions will be met 
> and firstStripeIndex as well as lastStripeIndex remain -1.
> In the second iteration of the loop, stripe.getOffset() is 500 and stripeEnd is 
> 1000. The first if statement condition will not be met in this case because 
> stripe's offset (500) is not greater than or equal to the splitStart (600). 
> However, in the second if statement, splitEnd (800) is <= stripeEnd(1000) and 
> it will try to compute the last condition 
> {{stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()}}. This 
> will throw an ArrayIndexOutOfBoundsException because firstStripeIndex is 
> still -1.
> I'm not sure if this scenario is possible at all, hence logging this as a low 
> priority issue. Perhaps block based split generation using BISplitStrategy 
> could trigger this?
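
A self-contained sketch of the loop with a guard added, so the example above 
returns (-1, -1) instead of throwing. {{StripeRange}} is a stand-in for ORC's 
{{StripeInformation}}; this is an illustration, not the committed patch:

{code:java}
import java.util.Arrays;
import java.util.List;

public class StripeBoundsSketch {
  static class StripeRange {
    final long offset, length;
    StripeRange(long offset, long length) { this.offset = offset; this.length = length; }
  }

  /** Returns {firstStripeIndex, lastStripeIndex}; {-1, -1} when the split
   *  lies strictly inside one stripe and covers no whole stripe. */
  static int[] findStripeBounds(List<StripeRange> stripes, long splitStart, long splitEnd) {
    int first = -1;
    int last = -1;
    for (int i = 0; i < stripes.size(); i++) {
      StripeRange s = stripes.get(i);
      long stripeEnd = s.offset + s.length;
      if (first == -1 && s.offset >= splitStart) {
        first = i;
      }
      // Guarding on first != -1 avoids the ArrayIndexOutOfBoundsException
      // described above when both split ends land inside a single stripe.
      if (last == -1 && splitEnd <= stripeEnd && first != -1
          && stripes.get(first).offset <= s.offset) {
        last = i;
      }
    }
    return new int[] {first, last};
  }

  public static void main(String[] args) {
    List<StripeRange> stripes =
        Arrays.asList(new StripeRange(0, 500), new StripeRange(500, 500));
    // splitStart=600, splitEnd=800: prints [-1, -1] instead of throwing
    System.out.println(Arrays.toString(findStripeBounds(stripes, 600, 800)));
  }
}
{code}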



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-04 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639013#comment-16639013
 ] 

Eugene Koifman edited comment on HIVE-20635 at 10/4/18 11:11 PM:
-

A nit: {{LOG.info("findOriginalMinMaxKeys(): " + keyIntervalTmp);}} is only 
logged {{if (minRowId >= maxRowId) {}}

would be better to log it always

otherwise LGTM


was (Author: ekoifman):
A nit: {{  LOG.info("findOriginalMinMaxKeys(): " + keyIntervalTmp);}} is 
only logged {{if (minRowId >= maxRowId) {}}

would be better to log it always

otherwise LGTM

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20635.2.patch, HIVE-20635.patch
>
>
> this is a followup to HIVE-16812 which adds support for delete event 
> filtering for splits from native acid files
> need to add the same for {{OrcSplit.isOriginal()}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20635) VectorizedOrcAcidRowBatchReader doesn't filter delete events for original files

2018-10-04 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639013#comment-16639013
 ] 

Eugene Koifman commented on HIVE-20635:
---

A nit: {{  LOG.info("findOriginalMinMaxKeys(): " + keyIntervalTmp);}} is 
only logged {{if (minRowId >= maxRowId) {}}

would be better to log it always

otherwise LGTM

> VectorizedOrcAcidRowBatchReader doesn't filter delete events for original 
> files
> ---
>
> Key: HIVE-20635
> URL: https://issues.apache.org/jira/browse/HIVE-20635
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Major
> Attachments: HIVE-20635.2.patch, HIVE-20635.patch
>
>
> this is a followup to HIVE-16812 which adds support for delete event 
> filtering for splits from native acid files
> need to add the same for {{OrcSplit.isOriginal()}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17296) Acid tests with multiple splits

2018-10-04 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638973#comment-16638973
 ] 

Eugene Koifman commented on HIVE-17296:
---

Hive is now on ORC 1.5.3

> Acid tests with multiple splits
> ---
>
> Key: HIVE-17296
> URL: https://issues.apache.org/jira/browse/HIVE-17296
> Project: Hive
>  Issue Type: Test
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Blocker
>
> data files in an Acid table are ORC files which may have multiple stripes
> for such files in base/ or delta/ (and original files with non acid to acid 
> conversion) are split by OrcInputFormat into multiple (stripe sized) chunks.
> There is additional logic in OrcRawRecordMerger 
> (discoverKeyBounds/discoverOriginalKeyBounds) that is not tested by any E2E 
> tests since none of them have enough data to generate multiple stripes in a 
> single file.
> testRecordReaderOldBaseAndDelta/testRecordReaderNewBaseAndDelta/testOriginalReaderPair
> in TestOrcRawRecordMerger has some logic to test this but it really needs e2e 
> tests.
> With ORC-228 it will be possible to write such tests.
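
As an illustration of what this enables, a sketch of writing a test file with a 
tiny stripe size so that a single small file spans multiple stripes. The path and 
schema are made up, and the exact writer options should be checked against the 
ORC 1.5.x API:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class SmallStripeWriterSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    TypeDescription schema = TypeDescription.fromString("struct<a:int,b:string>");
    // A very small stripe size forces multiple stripes even in a tiny file,
    // which is what multi-split acid tests need.
    Writer writer = OrcFile.createWriter(new Path("/tmp/multi_stripe.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(schema)
            .stripeSize(1024));
    writer.close(); // a real test would add enough rows to span stripes
  }
}
{code}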



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20664) Potential ArrayIndexOutOfBoundsException in VectorizedOrcAcidRowBatchReader.findMinMaxKeys

2018-10-04 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638970#comment-16638970
 ] 

Eugene Koifman commented on HIVE-20664:
---

+1 on patch 2

regarding tests, take a look at ORC-228.  The idea was precisely to enable the 
ability to create small stripes for tests.
Hive is now on ORC 1.5.3.

Let me know if you want to do that as part of this ticket, or I can commit this 
and you can do that in a follow-up jira

> Potential ArrayIndexOutOfBoundsException in 
> VectorizedOrcAcidRowBatchReader.findMinMaxKeys
> --
>
> Key: HIVE-20664
> URL: https://issues.apache.org/jira/browse/HIVE-20664
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-20664.2.patch, HIVE-20664.patch
>
>
> [~ekoifman], could you please confirm if my understanding is correct and if 
> so, review the fix?
> In the method {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}, the code 
> snippet that identifies the first and last stripe indices in the current 
> split could result in an ArrayIndexOutOfBoundsException if a complete split 
> is within the same stripe:
> {noformat}
>     for(int i = 0; i < stripes.size(); i++) {
>   StripeInformation stripe = stripes.get(i);
>   long stripeEnd = stripe.getOffset() + stripe.getLength();
>   if(firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
> firstStripeIndex = i;
>   }
>   if(lastStripeIndex == -1 && splitEnd <= stripeEnd &&
>   stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset() ) {
> //the last condition is for when both splitStart and splitEnd are in
> // the same stripe
> lastStripeIndex = i;
>   }
> }
> {noformat}
> Consider the example where there are 2 stripes - 0-500 and 500-1000 and 
> splitStart is 600 and splitEnd is 800.
> In the first iteration of the loop, stripe.getOffset() is 0 and stripeEnd is 
> 500. In this iteration, neither of the if statement conditions will be met 
> and firstStripeIndex as well as lastStripeIndex remain -1.
> In the second iteration of the loop, stripe.getOffset() is 500 and stripeEnd is 
> 1000. The first if statement condition will not be met in this case because 
> stripe's offset (500) is not greater than or equal to the splitStart (600). 
> However, in the second if statement, splitEnd (800) is <= stripeEnd(1000) and 
> it will try to compute the last condition 
> {{stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()}}. This 
> will throw an ArrayIndexOutOfBoundsException because firstStripeIndex is 
> still -1.
> I'm not sure if this scenario is possible at all, hence logging this as a low 
> priority issue. Perhaps block based split generation using BISplitStrategy 
> could trigger this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17296) Acid tests with multiple splits

2018-10-04 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-17296:
--
Priority: Blocker  (was: Critical)

> Acid tests with multiple splits
> ---
>
> Key: HIVE-17296
> URL: https://issues.apache.org/jira/browse/HIVE-17296
> Project: Hive
>  Issue Type: Test
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Blocker
>
> data files in an Acid table are ORC files which may have multiple stripes
> for such files in base/ or delta/ (and original files with non acid to acid 
> conversion) are split by OrcInputFormat into multiple (stripe sized) chunks.
> There is additional logic in OrcRawRecordMerger 
> (discoverKeyBounds/discoverOriginalKeyBounds) that is not tested by any E2E 
> tests since none of them have enough data to generate multiple stripes in a 
> single file.
> testRecordReaderOldBaseAndDelta/testRecordReaderNewBaseAndDelta/testOriginalReaderPair
> in TestOrcRawRecordMerger has some logic to test this but it really needs e2e 
> tests.
> With ORC-228 it will be possible to write such tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20556) Expose an API to retrieve the TBL_ID from TBLS in the metastore tables

2018-10-04 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20556:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
 Release Note: n/a
   Status: Resolved  (was: Patch Available)

committed to master
thanks Jaume for the contribution

> Expose an API to retrieve the TBL_ID from TBLS in the metastore tables
> --
>
> Key: HIVE-20556
> URL: https://issues.apache.org/jira/browse/HIVE-20556
> Project: Hive
>  Issue Type: New Feature
>  Components: Metastore, Standalone Metastore
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20556.1.patch, HIVE-20556.10.patch, 
> HIVE-20556.11.patch, HIVE-20556.12.patch, HIVE-20556.13.patch, 
> HIVE-20556.14.patch, HIVE-20556.15.patch, HIVE-20556.16.patch, 
> HIVE-20556.17.patch, HIVE-20556.18.patch, HIVE-20556.2.patch, 
> HIVE-20556.3.patch, HIVE-20556.4.patch, HIVE-20556.5.patch, 
> HIVE-20556.6.patch, HIVE-20556.7.patch, HIVE-20556.8.patch, HIVE-20556.9.patch
>
>
> We have two options to do this
> 1) Use the current MTable and add a field for this value
> 2) Add an independent API call to the metastore that would return the TBL_ID.
> Option 1 is preferable.
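
A minimal sketch of option 1, with a hypothetical field name; the real MTable in 
the metastore model may expose this differently:

{code:java}
public class MTableSketch {
  private long tableId;      // would map to TBLS.TBL_ID (assumption)
  private String tableName;

  public long getTableId() { return tableId; }
  public void setTableId(long tableId) { this.tableId = tableId; }

  public String getTableName() { return tableName; }
  public void setTableName(String tableName) { this.tableName = tableName; }
}
{code}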



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20556) Expose an API to retrieve the TBL_ID from TBLS in the metastore tables

2018-10-04 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638726#comment-16638726
 ] 

Eugene Koifman commented on HIVE-20556:
---

+1 patch 18

> Expose an API to retrieve the TBL_ID from TBLS in the metastore tables
> --
>
> Key: HIVE-20556
> URL: https://issues.apache.org/jira/browse/HIVE-20556
> Project: Hive
>  Issue Type: New Feature
>  Components: Metastore, Standalone Metastore
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: HIVE-20556.1.patch, HIVE-20556.10.patch, 
> HIVE-20556.11.patch, HIVE-20556.12.patch, HIVE-20556.13.patch, 
> HIVE-20556.14.patch, HIVE-20556.15.patch, HIVE-20556.16.patch, 
> HIVE-20556.17.patch, HIVE-20556.18.patch, HIVE-20556.2.patch, 
> HIVE-20556.3.patch, HIVE-20556.4.patch, HIVE-20556.5.patch, 
> HIVE-20556.6.patch, HIVE-20556.7.patch, HIVE-20556.8.patch, HIVE-20556.9.patch
>
>
> We have two options to do this
> 1) Use the current MTable and add a field for this value
> 2) Add an independent API call to the metastore that would return the TBL_ID.
> Option 1 is preferable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20681) Support custom path filter for ORC tables

2018-10-03 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637099#comment-16637099
 ] 

Eugene Koifman commented on HIVE-20681:
---

Could you give a concrete example of some files on disk and what filter you'd 
like to generate?

> Support custom path filter for ORC tables
> -
>
> Key: HIVE-20681
> URL: https://issues.apache.org/jira/browse/HIVE-20681
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Igor Kryvenko
>Assignee: Igor Kryvenko
>Priority: Minor
>
> Currently, the ORC file input format does not take in path filters set in the 
> property "mapreduce.input.pathfilter.class" or 
> "mapred.input.pathfilter.class", so we cannot use custom filters with ORC 
> files.
> The AcidUtils class has a static filter called "hiddenFilters" which is used by 
> ORC to filter input paths. If we can pass the custom filter classes (set in 
> the property mentioned above) to AcidUtils and replace hiddenFilter with a 
> filter that does an "and" operation over hiddenFilter + customFilters, the 
> filters would work well.
> It would be useful to have the ability to filter out rows based on 
> paths/filenames; current ORC features like bloom filters and indexes are not 
> sufficient to minimize the number of disk read operations.
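
A sketch of the suggested "and" composition, using a local {{PathFilter}} 
interface as a stand-in for {{org.apache.hadoop.fs.PathFilter}}; this is an 
illustration, not Hive's implementation:

{code:java}
import java.util.Arrays;
import java.util.List;

public class AndPathFilterSketch {
  interface PathFilter {
    boolean accept(String path);
  }

  /** Accepts a path only when every underlying filter (hiddenFilter plus
   *  any custom filters) accepts it. */
  static PathFilter and(List<PathFilter> filters) {
    return path -> filters.stream().allMatch(f -> f.accept(path));
  }

  public static void main(String[] args) {
    PathFilter hidden = p -> !p.startsWith("_") && !p.startsWith(".");
    PathFilter custom = p -> p.endsWith(".orc");
    PathFilter combined = and(Arrays.asList(hidden, custom));
    System.out.println(combined.accept("part-0.orc")); // true
    System.out.println(combined.accept("_tmp.orc"));   // false
  }
}
{code}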



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20435) Failed Dynamic Partition Insert into insert only table may loose transaction metadata

2018-10-02 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20435:
--
Summary: Failed Dynamic Partition Insert into insert only table may loose 
transaction metadata  (was: Failed Dynamic Partition Insert into insert only 
table may loose transaction metadat)

> Failed Dynamic Partition Insert into insert only table may loose transaction 
> metadata
> -
>
> Key: HIVE-20435
> URL: https://issues.apache.org/jira/browse/HIVE-20435
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Critical
>
> {{TxnHandler.enqueueLockWithRetry()}} has an optimization where it doesn't 
> write to {{TXN_COMPONENTS}} if the write is a dynamic partition insert because 
> it expects to write to this table from {{addDynamicPartitions()}}.  
> For insert-only, transactional tables, we create the target dir and start 
> writing to it before {{addDynamicPartitions()}} is called.  So if a txn is 
> aborted, we may have a delta dir in the partition but no corresponding entry 
> in {{TXN_COMPONENTS}}.  This means {{TxnStore.cleanEmptyAbortedTxns()}} may 
> clean up {{TXNS}} entry for the aborted transaction before Compactor removes 
> this delta dir, at which point it looks like committed data.
> Full CRUD are currently immune to this since they rely on "move" operation in 
> MoveTask but longer term they should follow the same model as insert-only 
> tables.
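
To make the window concrete, a toy sketch of the sequence; the method names mirror 
the description but the bodies are illustrative only, not the real 
TxnHandler/TxnStore logic:

{code:java}
public class DpAbortRaceSketch {
  static boolean txnComponentsWritten = false;

  /** Dynamic-partition inserts skip the TXN_COMPONENTS write here... */
  static void enqueueLockWithRetry(boolean dynamicPartitionInsert) {
    if (!dynamicPartitionInsert) {
      txnComponentsWritten = true;
    }
  }

  /** ...expecting this to run later; it never runs if the txn aborts first. */
  static void addDynamicPartitions() {
    txnComponentsWritten = true;
  }

  /** cleanEmptyAbortedTxns() treats an aborted txn with no components as
   *  safe to remove, even though delta dirs may already exist on disk. */
  static boolean looksSafeToClean(boolean aborted) {
    return aborted && !txnComponentsWritten;
  }

  public static void main(String[] args) {
    enqueueLockWithRetry(true);   // DP insert starts writing delta dirs
    // the txn aborts here, before addDynamicPartitions() ever runs
    System.out.println(looksSafeToClean(true)); // true: TXNS row vanishes too early
  }
}
{code}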



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-20664) Potential ArrayIndexOutOfBoundsException in VectorizedOrcAcidRowBatchReader.findMinMaxKeys

2018-10-01 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634482#comment-16634482
 ] 

Eugene Koifman edited comment on HIVE-20664 at 10/1/18 6:58 PM:


I agree that this can produce {{ArrayIndexOutOfBounds}}.
In your example, the ORC {{Reader}} should return no rows.  It should always 
read whole stripes.

So I think your patch will cause firstStripeIndex = -1 and lastStripeIndex = 1

did you mean {{if (firstStripeIndex == -1 && lastStripeIndex != -1 || 
firstStripeIndex > lastStripeIndex) {}}
to be {{if (firstStripeIndex == -1 && lastStripeIndex != -1 || lastStripeIndex 
> firstStripeIndex) {}}

Also, I think you need to log something like
LOG.info("findMinMaxKeys(): " + keyInterval +
" stripes(" + firstStripeIndex + "," + lastStripeIndex + ")");

before returning a KeyInterval that will actually cause some Delete events to 
be skipped.


was (Author: ekoifman):
I agree that this can produce {{ArrayIndexOutOfBounds}}.
In your example, the ORC {{Reader}} should return no rows.  It should always 
read whole stripes.

So I think you patch will cause firstStripeIndex = -1 and lastStripIndex = 1

did you mean {{if (firstStripeIndex == -1 && lastStripeIndex != -1 || 
firstStripeIndex > lastStripeIndex) {}}
to be {{if (firstStripeIndex == -1 && lastStripeIndex != -1 lastStripeIndex || 
> firstStripeIndex) {}}

Also, I think you need to log something like
LOG.info("findMinMaxKeys(): " + keyInterval +
" stripes(" + firstStripeIndex + "," + lastStripeIndex + ")");

before returning a KeyInterval that will actually cause some Delete events to 
be skipped.

> Potential ArrayIndexOutOfBoundsException in 
> VectorizedOrcAcidRowBatchReader.findMinMaxKeys
> --
>
> Key: HIVE-20664
> URL: https://issues.apache.org/jira/browse/HIVE-20664
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-20664.patch
>
>
> [~ekoifman], could you please confirm if my understanding is correct and if 
> so, review the fix?
> In the method {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}, the code 
> snippet that identifies the first and last stripe indices in the current 
> split could result in an ArrayIndexOutOfBoundsException if a complete split 
> is within the same stripe:
> {noformat}
>     for(int i = 0; i < stripes.size(); i++) {
>   StripeInformation stripe = stripes.get(i);
>   long stripeEnd = stripe.getOffset() + stripe.getLength();
>   if(firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
> firstStripeIndex = i;
>   }
>   if(lastStripeIndex == -1 && splitEnd <= stripeEnd &&
>   stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset() ) {
> //the last condition is for when both splitStart and splitEnd are in
> // the same stripe
> lastStripeIndex = i;
>   }
> }
> {noformat}
> Consider the example where there are 2 stripes - 0-500 and 500-1000 and 
> splitStart is 600 and splitEnd is 800.
> In the first iteration of the loop, stripe.getOffset() is 0 and stripeEnd is 
> 500. In this iteration, neither of the if statement conditions will be met 
> and firstStripeIndex as well as lastStripeIndex remain -1.
> In the second iteration of the loop, stripe.getOffset() is 500 and stripeEnd is 
> 1000. The first if statement condition will not be met in this case because 
> stripe's offset (500) is not greater than or equal to the splitStart (600). 
> However, in the second if statement, splitEnd (800) is <= stripeEnd(1000) and 
> it will try to compute the last condition 
> {{stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()}}. This 
> will throw an ArrayIndexOutOfBoundsException because firstStripeIndex is 
> still -1.
> I'm not sure if this scenario is possible at all, hence logging this as a low 
> priority issue. Perhaps block based split generation using BISplitStrategy 
> could trigger this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20664) Potential ArrayIndexOutOfBoundsException in VectorizedOrcAcidRowBatchReader.findMinMaxKeys

2018-10-01 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634482#comment-16634482
 ] 

Eugene Koifman commented on HIVE-20664:
---

I agree that this can produce {{ArrayIndexOutOfBounds}}.
In your example, the ORC {{Reader}} should return no rows.  It should always 
read whole stripes.

So I think you patch will cause firstStripeIndex = -1 and lastStripIndex = 1

did you mean {{if (firstStripeIndex == -1 && lastStripeIndex != -1 || 
firstStripeIndex > lastStripeIndex) {}}
to be {{if (firstStripeIndex == -1 && lastStripeIndex != -1 lastStripeIndex || 
> firstStripeIndex) {}}

Also, I think you need to log something like
LOG.info("findMinMaxKeys(): " + keyInterval +
" stripes(" + firstStripeIndex + "," + lastStripeIndex + ")");

before returning a KeyInterval that will actually cause some Delete events to 
be skipped.

> Potential ArrayIndexOutOfBoundsException in 
> VectorizedOrcAcidRowBatchReader.findMinMaxKeys
> --
>
> Key: HIVE-20664
> URL: https://issues.apache.org/jira/browse/HIVE-20664
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Saurabh Seth
>Assignee: Saurabh Seth
>Priority: Minor
> Attachments: HIVE-20664.patch
>
>
> [~ekoifman], could you please confirm if my understanding is correct and if 
> so, review the fix?
> In the method {{VectorizedOrcAcidRowBatchReader.findMinMaxKeys}}, the code 
> snippet that identifies the first and last stripe indices in the current 
> split could result in an ArrayIndexOutOfBoundsException if a complete split 
> is within the same stripe:
> {noformat}
>     for(int i = 0; i < stripes.size(); i++) {
>   StripeInformation stripe = stripes.get(i);
>   long stripeEnd = stripe.getOffset() + stripe.getLength();
>   if(firstStripeIndex == -1 && stripe.getOffset() >= splitStart) {
> firstStripeIndex = i;
>   }
>   if(lastStripeIndex == -1 && splitEnd <= stripeEnd &&
>   stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset() ) {
> //the last condition is for when both splitStart and splitEnd are in
> // the same stripe
> lastStripeIndex = i;
>   }
> }
> {noformat}
> Consider the example where there are 2 stripes - 0-500 and 500-1000 and 
> splitStart is 600 and splitEnd is 800.
> In the first iteration of the loop, stripe.getOffset() is 0 and stripeEnd is 
> 500. In this iteration, neither of the if statement conditions will be met 
> and firstStripeIndex as well as lastStripeIndex remain -1.
> In the second iteration of the loop, stripe.getOffset() is 500 and stripeEnd is 
> 1000. The first if statement condition will not be met in this case because 
> stripe's offset (500) is not greater than or equal to the splitStart (600). 
> However, in the second if statement, splitEnd (800) is <= stripeEnd(1000) and 
> it will try to compute the last condition 
> {{stripes.get(firstStripeIndex).getOffset() <= stripe.getOffset()}}. This 
> will throw an ArrayIndexOutOfBoundsException because firstStripeIndex is 
> still -1.
> I'm not sure if this scenario is possible at all, hence logging this as a low 
> priority issue. Perhaps block based split generation using BISplitStrategy 
> could trigger this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-19985) ACID: Skip decoding the ROW__ID sections for read-only queries

2018-09-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman resolved HIVE-19985.
---
  Resolution: Fixed
   Fix Version/s: 4.0.0
Target Version/s: 4.0.0

committed patch  7 to master
thanks Gopal for the review

> ACID: Skip decoding the ROW__ID sections for read-only queries 
> ---
>
> Key: HIVE-19985
> URL: https://issues.apache.org/jira/browse/HIVE-19985
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Eugene Koifman
>Priority: Major
>  Labels: Branch3Candidate
> Fix For: 4.0.0
>
> Attachments: HIVE-19985.01.patch, HIVE-19985.04.patch, 
> HIVE-19985.05.patch, HIVE-19985.06.patch, HIVE-19985.07.patch
>
>
> For a base_n file there are no aborted transactions within the file and if 
> there are no pending delete deltas, the entire ACID ROW__ID can be skipped 
> for all read-only queries (i.e. SELECT), though it still needs to be projected 
> out for MERGE, UPDATE and DELETE queries.
> This patch tries to entirely ignore the ACID ROW__ID fields for all tables 
> where there are no possible deletes or aborted transactions for an ACID split.
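
The decision this implies, written out as a sketch predicate; the inputs are 
simplified, and the real check in VectorizedOrcAcidRowBatchReader works off the 
split's deltas and transaction metadata:

{code:java}
public class RowIdProjectionSketch {
  /**
   * ROW__ID decoding can be skipped only for read-only scans over splits
   * with no pending delete deltas and no possible aborted transactions
   * (e.g. data wholly inside a base_n file).
   */
  static boolean canSkipRowIdProjection(boolean readOnlyQuery,
                                        boolean hasDeleteDeltas,
                                        boolean mayContainAbortedTxns) {
    return readOnlyQuery && !hasDeleteDeltas && !mayContainAbortedTxns;
  }
}
{code}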



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-19985) ACID: Skip decoding the ROW__ID sections for read-only queries

2018-09-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-19985:
--
Attachment: HIVE-19985.07.patch

> ACID: Skip decoding the ROW__ID sections for read-only queries 
> ---
>
> Key: HIVE-19985
> URL: https://issues.apache.org/jira/browse/HIVE-19985
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Eugene Koifman
>Priority: Major
>  Labels: Branch3Candidate
> Attachments: HIVE-19985.01.patch, HIVE-19985.04.patch, 
> HIVE-19985.05.patch, HIVE-19985.06.patch, HIVE-19985.07.patch
>
>
> For a base_n file there are no aborted transactions within the file and if 
> there are no pending delete deltas, the entire ACID ROW__ID can be skipped 
> for all read-only queries (i.e. SELECT), though it still needs to be projected 
> out for MERGE, UPDATE and DELETE queries.
> This patch tries to entirely ignore the ACID ROW__ID fields for all tables 
> where there are no possible deletes or aborted transactions for an ACID split.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-19985) ACID: Skip decoding the ROW__ID sections for read-only queries

2018-09-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-19985:
--
Attachment: (was: HIVE-19985.07.patch)

> ACID: Skip decoding the ROW__ID sections for read-only queries 
> ---
>
> Key: HIVE-19985
> URL: https://issues.apache.org/jira/browse/HIVE-19985
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Eugene Koifman
>Priority: Major
>  Labels: Branch3Candidate
> Attachments: HIVE-19985.01.patch, HIVE-19985.04.patch, 
> HIVE-19985.05.patch, HIVE-19985.06.patch
>
>
> For a base_n file there are no aborted transactions within the file and if 
> there are no pending delete deltas, the entire ACID ROW__ID can be skipped 
> for all read-only queries (i.e. SELECT), though it still needs to be projected 
> out for MERGE, UPDATE and DELETE queries.
> This patch tries to entirely ignore the ACID ROW__ID fields for all tables 
> where there are no possible deletes or aborted transactions for an ACID split.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-19985) ACID: Skip decoding the ROW__ID sections for read-only queries

2018-09-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-19985:
--
Attachment: (was: HIVE-19985.07.patch)

> ACID: Skip decoding the ROW__ID sections for read-only queries 
> ---
>
> Key: HIVE-19985
> URL: https://issues.apache.org/jira/browse/HIVE-19985
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Eugene Koifman
>Priority: Major
>  Labels: Branch3Candidate
> Attachments: HIVE-19985.01.patch, HIVE-19985.04.patch, 
> HIVE-19985.05.patch, HIVE-19985.06.patch
>
>
> For a base_n file there are no aborted transactions within the file, and if 
> there are no pending delete deltas, the entire ACID ROW__ID can be skipped 
> for all read-only queries (i.e. SELECT), though it still needs to be projected 
> out for MERGE, UPDATE, and DELETE queries.
> This patch tries to entirely ignore the ACID ROW__ID fields for all tables 
> where there are no possible deletes or aborted transactions for an ACID split.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-19985) ACID: Skip decoding the ROW__ID sections for read-only queries

2018-09-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-19985:
--
Attachment: HIVE-19985.07.patch

> ACID: Skip decoding the ROW__ID sections for read-only queries 
> ---
>
> Key: HIVE-19985
> URL: https://issues.apache.org/jira/browse/HIVE-19985
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Eugene Koifman
>Priority: Major
>  Labels: Branch3Candidate
> Attachments: HIVE-19985.01.patch, HIVE-19985.04.patch, 
> HIVE-19985.05.patch, HIVE-19985.06.patch, HIVE-19985.07.patch, 
> HIVE-19985.07.patch
>
>
> For a base_n file there are no aborted transactions within the file, and if 
> there are no pending delete deltas, the entire ACID ROW__ID can be skipped 
> for all read-only queries (i.e. SELECT), though it still needs to be projected 
> out for MERGE, UPDATE, and DELETE queries.
> This patch tries to entirely ignore the ACID ROW__ID fields for all tables 
> where there are no possible deletes or aborted transactions for an ACID split.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-19985) ACID: Skip decoding the ROW__ID sections for read-only queries

2018-09-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-19985:
--
Attachment: HIVE-19985.07.patch

> ACID: Skip decoding the ROW__ID sections for read-only queries 
> ---
>
> Key: HIVE-19985
> URL: https://issues.apache.org/jira/browse/HIVE-19985
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Eugene Koifman
>Priority: Major
>  Labels: Branch3Candidate
> Attachments: HIVE-19985.01.patch, HIVE-19985.04.patch, 
> HIVE-19985.05.patch, HIVE-19985.06.patch, HIVE-19985.07.patch, 
> HIVE-19985.07.patch
>
>
> For a base_n file there are no aborted transactions within the file, and if 
> there are no pending delete deltas, the entire ACID ROW__ID can be skipped 
> for all read-only queries (i.e. SELECT), though it still needs to be projected 
> out for MERGE, UPDATE, and DELETE queries.
> This patch tries to entirely ignore the ACID ROW__ID fields for all tables 
> where there are no possible deletes or aborted transactions for an ACID split.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-19985) ACID: Skip decoding the ROW__ID sections for read-only queries

2018-09-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-19985:
--
Status: Open  (was: Patch Available)

> ACID: Skip decoding the ROW__ID sections for read-only queries 
> ---
>
> Key: HIVE-19985
> URL: https://issues.apache.org/jira/browse/HIVE-19985
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Eugene Koifman
>Priority: Major
>  Labels: Branch3Candidate
> Attachments: HIVE-19985.01.patch, HIVE-19985.04.patch, 
> HIVE-19985.05.patch, HIVE-19985.06.patch
>
>
> For a base_n file there are no aborted transactions within the file, and if 
> there are no pending delete deltas, the entire ACID ROW__ID can be skipped 
> for all read-only queries (i.e. SELECT), though it still needs to be projected 
> out for MERGE, UPDATE, and DELETE queries.
> This patch tries to entirely ignore the ACID ROW__ID fields for all tables 
> where there are no possible deletes or aborted transactions for an ACID split.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20556) Expose an API to retrieve the TBL_ID from TBLS in the metastore tables

2018-09-28 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632248#comment-16632248
 ] 

Eugene Koifman commented on HIVE-20556:
---

I added a few minor comments on RB; otherwise it makes sense

> Expose an API to retrieve the TBL_ID from TBLS in the metastore tables
> --
>
> Key: HIVE-20556
> URL: https://issues.apache.org/jira/browse/HIVE-20556
> Project: Hive
>  Issue Type: New Feature
>  Components: Metastore, Standalone Metastore
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: HIVE-20556.1.patch, HIVE-20556.10.patch, 
> HIVE-20556.11.patch, HIVE-20556.12.patch, HIVE-20556.13.patch, 
> HIVE-20556.14.patch, HIVE-20556.15.patch, HIVE-20556.2.patch, 
> HIVE-20556.3.patch, HIVE-20556.4.patch, HIVE-20556.5.patch, 
> HIVE-20556.6.patch, HIVE-20556.7.patch, HIVE-20556.8.patch, HIVE-20556.9.patch
>
>
> We have two options to do this:
> 1) Use the current MTable and add a field for this value.
> 2) Add an independent API call to the metastore that would return the TBL_ID.
> Option 1 is preferable; a rough sketch of it follows below.
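
A minimal sketch of what option 1 could look like, assuming the
datastore-generated TBLS.TBL_ID can simply be mapped onto the existing
model class; the field and accessor names are illustrative, not the
committed API:

{code:java}
// Sketch of option 1: surface TBLS.TBL_ID on the existing metastore model
// object so a getTable() caller can read it with no extra round trip.
// Field/accessor names here are illustrative only.
public class MTable {
  private long id; // populated from TBLS.TBL_ID by the persistence layer
  // ... existing fields (tableName, database, sd, owner, ...) unchanged

  public long getId() {
    return id;
  }
}

// Hypothetical caller-side usage once the field is threaded through to the
// thrift Table object:
//   long tblId = client.getTable(dbName, tblName).getId();
{code}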



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-18774) ACID: Use the _copy_N files copyNumber as the implicit statement-id

2018-09-28 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-18774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman resolved HIVE-18774.
---
Resolution: Won't Fix

Since HIVE-17917, footers for all copy_N files are read only once per file 
during split generation, so this is no longer needed.

> ACID: Use the _copy_N files copyNumber as the implicit statement-id
> ---
>
> Key: HIVE-18774
> URL: https://issues.apache.org/jira/browse/HIVE-18774
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
> Environment: if this is not done in 3.0, it cannot be done at all
>Reporter: Gopal V
>Assignee: Eugene Koifman
>Priority: Blocker
> Attachments: HIVE-18774.03.wip.patch
>
>
> When upgrading flat ORC files to ACID, use the _copy_N numbering as a 
> statement-id to avoid having to align the row numbering between _copy_1 and 
> _copy_2 files.
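
The mapping itself is mechanical; a sketch under the usual bucket-file
naming (e.g. 000000_0, 000000_0_copy_2), where the regex and class are
illustrative rather than actual Hive code:

{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative mapping from the _copy_N suffix to an implicit statement id,
// so ROW__IDs in 000000_0 and 000000_0_copy_1 could never collide without
// re-aligning row numbers across files. Not actual Hive code.
final class CopyNumberStatementId {
  private static final Pattern COPY_N = Pattern.compile("_copy_([0-9]+)$");

  /** 000000_0 -> 0, 000000_0_copy_1 -> 1, 000000_0_copy_2 -> 2 */
  static int implicitStatementId(String fileName) {
    Matcher m = COPY_N.matcher(fileName);
    return m.find() ? Integer.parseInt(m.group(1)) : 0;
  }
}
{code}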



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

