[jira] [Commented] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2019-04-09 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813942#comment-16813942
 ] 

Eugene Koifman commented on HIVE-20901:
---

I'd suggest that {{msg.append("Skipping minor compaction as");}} should include 
compaction ID and db.table.partition info.
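
For illustration, a minimal sketch of what that message could carry; the holder and
field names below are assumptions for the sketch, not the exact Hive classes:
{noformat}
// Sketch only: compactionId/db/table/partition are assumed to be available from the
// CompactionInfo-style object the Worker already holds; names here are illustrative.
final class SkipMessageSketch {
  static String skipMessage(long compactionId, String db, String table, String partition,
                            String reason) {
    StringBuilder msg = new StringBuilder("Skipping minor compaction as ").append(reason)
        .append(" (compaction id=").append(compactionId)
        .append(", table=").append(db).append('.').append(table);
    if (partition != null) {
      msg.append(", partition=").append(partition);
    }
    return msg.append(')').toString();
  }
}
{noformat}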

> running compactor when there is nothing to do produces duplicate data
> -
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Abhishek Somani
>Priority: Major
> Attachments: HIVE-20901.1.patch, HIVE-20901.2.patch
>
>
> suppose we run minor compaction 2 times, via alter table
> The 2nd request to compaction should have nothing to do but I don't think 
> there is a check for that.  It's visible in the context of HIVE-20823, where 
> each compactor run produces a delta with new visibility suffix so we end up 
> with something like
> {noformat}
> target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/
> ├── delete_delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delete_delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_001_
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> └── delta_002_002_
>     ├── _orc_acid_version
>     └── bucket_0{noformat}
> i.e. 2 deltas with the same write ID range
> this is bad.  Probably happens today as well but new run produces a delta 
> with the same name and clobbers the previous one, which may interfere with 
> writers
>  
> need to investigate
>  
> -The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
> deltas as if they were distinct and it effectively duplicates data.-  There 
> is no data duplication - {{getAcidState()}} will not use 2 deltas with the 
> same {{writeid}} range
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2019-04-04 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810093#comment-16810093
 ] 

Eugene Koifman commented on HIVE-20901:
---

Go ahead.  Looking at the description, there is no data duplication issue here, 
and now that the compactor runs in a transaction the 2 compactor runs will 
output distinct directories.  

> running compactor when there is nothing to do produces duplicate data
> -
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> suppose we run minor compaction 2 times, via alter table
> The 2nd request to compaction should have nothing to do but I don't think 
> there is a check for that.  It's visible in the context of HIVE-20823, where 
> each compactor run produces a delta with new visibility suffix so we end up 
> with something like
> {noformat}
> target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/
> ├── delete_delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delete_delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_001_
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> └── delta_002_002_
>     ├── _orc_acid_version
>     └── bucket_0{noformat}
> i.e. 2 deltas with the same write ID range
> this is bad.  Probably happens today as well but new run produces a delta 
> with the same name and clobbers the previous one, which may interfere with 
> writers
>  
> need to investigate
>  
> -The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
> deltas as if they were distinct and it effectively duplicates data.-  There 
> is no data duplication - {{getAcidState()}} will not use 2 deltas with the 
> same {{writeid}} range
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-13479) Relax sorting requirement in ACID tables

2019-03-20 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797769#comment-16797769
 ] 

Eugene Koifman commented on HIVE-13479:
---

There is no sorting restriction on insert-only ACID tables.
Delete event filtering (HIVE-20738) for full-crud tables relies on the fact 
that data is ordered by ROW__ID.
I don't think there is anything that precludes INSERT INTO T  SORT BY ...  
for full-crud tables.
That should be enough to make min/max in ORC useful for predicate push-down in 
a lot of cases.

IOW (INSERT OVERWRITE) is supported and I think could be used to re-sort the 
table by any column (it will generate new ROW__IDs), but it's currently an 
operation that takes an exclusive (X) lock.  With some work, IOW could run with 
a less strict lock that allows reads but not any other writes.  A compaction 
that does an overwrite would have the same issue, which is likely too 
restrictive.  
IOW (issued directly by the user or by the compactor) is also problematic since 
it will invalidate all result set caches and materialized views.

Incidentally, {{hive.optimize.sort.dynamic.partition=true}} was fixed for ACID 
tables long ago.
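
For readers of the above, a minimal sketch of the (write id, bucket, row id) ordering 
that delete-event filtering assumes; this is an illustration, not the actual 
RecordIdentifier/ROW__ID classes:
{noformat}
import java.util.Comparator;

// Illustration of the ROW__ID sort order that delete-event filtering relies on:
// rows within a bucket file are ordered by (writeId, bucketProperty, rowId).
final class RowIdSketch {
  final long writeId;
  final int bucketProperty;
  final long rowId;

  RowIdSketch(long writeId, int bucketProperty, long rowId) {
    this.writeId = writeId;
    this.bucketProperty = bucketProperty;
    this.rowId = rowId;
  }

  static final Comparator<RowIdSketch> ORDER = Comparator
      .comparingLong((RowIdSketch r) -> r.writeId)
      .thenComparingInt(r -> r.bucketProperty)
      .thenComparingLong(r -> r.rowId);
}
{noformat}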








> Relax sorting requirement in ACID tables
> 
>
> Key: HIVE-13479
> URL: https://issues.apache.org/jira/browse/HIVE-13479
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 1.2.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>   Original Estimate: 160h
>  Remaining Estimate: 160h
>
> Currently ACID tables require data to be sorted according to the internal 
> primary key.  This is so that base + delta files can be efficiently 
> sort/merged to produce the snapshot for the current transaction.
> This prevents the user from sorting the table on any other criteria, which 
> can be useful.  One example is dynamic partition insert (which also occurs 
> for update/delete SQL).  This may create lots of writers 
> (buckets*partitions) and tax cluster resources.
> The usual solution is hive.optimize.sort.dynamic.partition=true, which won't 
> be honored for ACID tables.
> We could rely on a hash-table-based algorithm to merge delta files and then 
> not require any particular sort on Acid tables.  One way to do that is to 
> treat each update event as an Insert (new internal PK) + delete (old PK).  
> Delete events are very small since they just need to contain PKs, so the 
> hash table would just need to hold Delete events and can be reasonably 
> memory efficient.
> This is a significant amount of work but worth doing.
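
A minimal sketch of the hash-table merge described above, assuming delete events carry 
only the row's internal key and insert events can then be streamed in any order (all 
names here are illustrative):
{noformat}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: delete events are key-only and small, so they fit in a hash set;
// insert events are then filtered against it without requiring any sort order.
final class HashMergeSketch {
  static List<long[]> applyDeletes(List<long[]> insertEvents, List<Long> deletedKeys) {
    Set<Long> deleted = new HashSet<>(deletedKeys);
    List<long[]> result = new ArrayList<>();
    for (long[] row : insertEvents) {
      if (!deleted.contains(row[0])) {   // row[0] = the row's internal PK in this sketch
        result.add(row);
      }
    }
    return result;
  }
}
{noformat}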



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21165) ACID: pass query hint to the writers to write hive.acid.key.index

2019-03-18 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795644#comment-16795644
 ] 

Eugene Koifman commented on HIVE-21165:
---

The logic in HIVE-20738 relies on this index.

> ACID: pass query hint to the writers to write hive.acid.key.index
> -
>
> Key: HIVE-21165
> URL: https://issues.apache.org/jira/browse/HIVE-21165
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Vaibhav Gumashta
>Assignee: Vaibhav Gumashta
>Priority: Major
>
> For the query based compactor from HIVE-20699, the compaction runs as a sql 
> query. However, this mechanism skips over writing hive.acid.key.index for 
> each stripe, which is used to skip over stripes that are not supposed to be 
> read. We need a way to pass a query hint to the writer so that it can write 
> this index data, when invoked from a sql query.
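
For context, a rough sketch of how a per-stripe key index like this is typically 
consulted to skip stripes; the index format and the reader plumbing are simplified 
and purely illustrative here:
{noformat}
import java.util.List;

// Sketch: each stripe records the last (writeId, bucket, rowId) key it contains; a reader
// that only needs keys up to maxKey can stop at the first stripe whose recorded last key
// already reaches maxKey, skipping the rest.  Illustrative only.
final class KeyIndexSketch {
  static int lastStripeToRead(List<long[]> lastKeyPerStripe, long[] maxKey) {
    for (int i = 0; i < lastKeyPerStripe.size(); i++) {
      if (compare(lastKeyPerStripe.get(i), maxKey) >= 0) {
        return i;
      }
    }
    return lastKeyPerStripe.size() - 1;
  }

  private static int compare(long[] a, long[] b) {
    for (int i = 0; i < 3; i++) {
      int c = Long.compare(a[i], b[i]);
      if (c != 0) {
        return c;
      }
    }
    return 0;
  }
}
{noformat}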



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20580) OrcInputFormat.isOriginal() should not rely on hive.acid.key.index

2019-03-18 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795638#comment-16795638
 ] 

Eugene Koifman commented on HIVE-20580:
---

isOriginal should mean files without acid metadata columns in them.  You can 
have such files in a transactional table because it started out as a flat table 
and was ALTER TABLE'd to transactional, or because they were added via LOAD 
DATA, for example.  I think you said the wrong version of isOriginal() is not 
used - I'd get rid of it.
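
A minimal sketch of a name-based check along those lines; the column names come from 
the full ACID row schema (operation, originalTransaction, bucket, rowId, 
currentTransaction, row), and the helper itself is illustrative rather than the exact 
Hive code:
{noformat}
import java.util.Arrays;
import java.util.List;

// Sketch: decide "original" (non-ACID layout) from the file's top-level column names
// instead of relying on the hive.acid.key.index footer entry.
final class AcidSchemaCheck {
  private static final List<String> ACID_COLUMNS = Arrays.asList(
      "operation", "originalTransaction", "bucket", "rowId", "currentTransaction", "row");

  static boolean isOriginal(List<String> topLevelFieldNames) {
    // A file is "original" when it lacks the ACID wrapper columns around the row struct.
    return !ACID_COLUMNS.equals(topLevelFieldNames);
  }
}
{noformat}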

> OrcInputFormat.isOriginal() should not rely on hive.acid.key.index
> --
>
> Key: HIVE-20580
> URL: https://issues.apache.org/jira/browse/HIVE-20580
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Peter Vary
>Priority: Major
> Attachments: HIVE-20580.2.patch, HIVE-20580.3.patch, 
> HIVE-20580.4.patch, HIVE-20580.5.patch, HIVE-20580.6.patch, HIVE-20580.patch
>
>
> {{org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.isOriginal()}} is checking 
> for the presence of {{hive.acid.key.index}} in the footer.  This is only 
> created when the file is written by {{OrcRecordUpdater}}.  It should instead 
> check for the presence of Acid metadata columns so that a file can be 
> produced by something other than {{OrcRecordUpdater}}.
> Also, {{hive.acid.key.index}} counts the number of different types of events, 
> which is not really useful for Acid V2 (as of Hive 3) since each file only 
> has 1 type of event.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20580) OrcInputFormat.isOriginal() should not rely on hive.acid.key.index

2019-03-17 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16794511#comment-16794511
 ] 

Eugene Koifman commented on HIVE-20580:
---

note that the query-based compactor doesn't produce hive.acid.key.index, so this 
jira is important once that is enabled. cc [~vgumashta]

> OrcInputFormat.isOriginal() should not rely on hive.acid.key.index
> --
>
> Key: HIVE-20580
> URL: https://issues.apache.org/jira/browse/HIVE-20580
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Peter Vary
>Priority: Major
> Attachments: HIVE-20580.2.patch, HIVE-20580.3.patch, 
> HIVE-20580.4.patch, HIVE-20580.5.patch, HIVE-20580.6.patch, HIVE-20580.patch
>
>
> {{org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.isOriginal()}} is checking 
> for the presence of {{hive.acid.key.index}} in the footer.  This is only 
> created when the file is written by {{OrcRecordUpdater}}.  It should instead 
> check for the presence of Acid metadata columns so that a file can be 
> produced by something other than {{OrcRecordUpdater}}.
> Also, {{hive.acid.key.index}} counts the number of different types of events, 
> which is not really useful for Acid V2 (as of Hive 3) since each file only 
> has 1 type of event.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20436) Lock Manager scalability - linear

2019-02-19 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772344#comment-16772344
 ] 

Eugene Koifman commented on HIVE-20436:
---

{{RewriteSemanticAnalyzer.updateOutputs}} is relevant

> Lock Manager scalability - linear
> -
>
> Key: HIVE-20436
> URL: https://issues.apache.org/jira/browse/HIVE-20436
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> Hive TransactionManager currently has a mix of lock based and optimistic 
> concurrency management techniques (which at times overlap).
> For inserts with Dynamic Partitions that represent update/merge, it acquires 
> locks on each existing partition, which can flood the metastore DB.
> Need to clean up the logical model and the implementation.
> This will be an umbrella Jira for this



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21266) Issue with single delta file

2019-02-13 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21266:
--
Component/s: Transactions

> Issue with single delta file
> 
>
> Key: HIVE-21266
> URL: https://issues.apache.org/jira/browse/HIVE-21266
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Vaibhav Gumashta
>Priority: Major
>
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java#L353-L357]
>  
> {noformat}
> if ((deltaCount + (dir.getBaseDirectory() == null ? 0 : 1)) + origCount <= 1) {
>   LOG.debug("Not compacting {}; current base is {} and there are {} deltas and {} originals",
>       sd.getLocation(), dir.getBaseDirectory(), deltaCount, origCount);
>   return;
> }
>  {noformat}
> Is problematic.
> Suppose you have 1 delta file from streaming ingest: {{delta_11_20}} where 
> {{txnid:13}} was aborted.  The code above will not rewrite the delta (which 
> drops anything that belongs to the aborted txn) and transition the compaction 
> to "ready_for_cleaning" state which will drop the metadata about the aborted 
> txn in {{markCleaned()}}.  Now aborted data will come back as committed.
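
To make the failure mode concrete, a sketch of the shape a safer guard could take; 
{{snapshotHasAbortedWrites}} is a hypothetical stand-in for "the directory snapshot 
still contains aborted write ids that a rewrite would purge", not a real Hive API:
{noformat}
// Hypothetical sketch: "only one file" is not by itself enough to skip compaction; the
// snapshot must also contain no aborted writes.  Otherwise the compaction transitions to
// "ready_for_cleaning", markCleaned() drops the aborted-txn metadata, and the aborted rows
// that are still physically in the delta become readable again.
final class CompactionSkipGuard {
  static boolean safeToSkipCompaction(int deltaCount, boolean hasBase, int origCount,
                                      boolean snapshotHasAbortedWrites) {
    boolean nothingToMerge = (deltaCount + (hasBase ? 1 : 0)) + origCount <= 1;
    return nothingToMerge && !snapshotHasAbortedWrites;
  }
}
{noformat}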



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21254) Pre-upgrade tool should handle exceptions and skip db/tables

2019-02-13 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767735#comment-16767735
 ] 

Eugene Koifman commented on HIVE-21254:
---

+1 patch 5 pending tests

> Pre-upgrade tool should handle exceptions and skip db/tables
> 
>
> Key: HIVE-21254
> URL: https://issues.apache.org/jira/browse/HIVE-21254
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
> Attachments: HIVE-21254.1.patch, HIVE-21254.2.patch, 
> HIVE-21254.3.patch, HIVE-21254.4.patch, HIVE-21254.5.patch
>
>
> When exceptions like AccessControlException are thrown, the pre-upgrade tool 
> fails. If the hive user does not have read access to a database or its tables 
> (some external tables deny read access to hive), the pre-upgrade tool should 
> just assume they are external tables and move on without failing the 
> pre-upgrade process. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-02-13 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21177:
--
Release Note: 
I messed up the commit msg for this.  The Jira number is correct, but the bug 
title is from another issue

{noformat}
commit 07b76f413cb174413f0530a6aae5ae442a301b46
Author: Eugene Koifman 
Date:   Thu Feb 7 09:49:19 2019 -0800

HIVE-21177: ACID: When there are no delete deltas skip finding min max keys 
(Eugene Koifman, reviewed by Prasanth Jayachandran)
{noformat}

  was:n/a


> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-21177.01.patch, HIVE-21177.02.patch, 
> HIVE-21177.03.patch
>
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.
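
The optimization boils down to a cheap name-based guard before touching the file 
system; a simplified sketch (the directory-name parsing here is illustrative, not the 
actual Hive parsing helpers):
{noformat}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: only a delta spanning more than one write id (delta_x_y with x != y) can carry a
// streaming-ingest side file, so skip the FileSystem lookup for base_x and delta_x_x paths.
final class SideFileGuard {
  private static final Pattern DELTA = Pattern.compile("delta_(\\d+)_(\\d+).*");

  static boolean mayHaveSideFile(String dirName) {
    Matcher m = DELTA.matcher(dirName);
    return m.matches() && !m.group(1).equals(m.group(2));
  }
}
{noformat}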



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21254) Pre-upgrade tool should handle exceptions and skip db/tables

2019-02-13 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767572#comment-16767572
 ] 

Eugene Koifman commented on HIVE-21254:
---

I would think it's a security hole if you can set this from the client.
Perhaps the utility should fail if it gets an ACL exception and include this 
prop in the msg...
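
Roughly the shape that could take in the tool; the property named below is a 
placeholder since the comment doesn't spell out which one it means:
{noformat}
import java.io.IOException;
import org.apache.hadoop.security.AccessControlException;

// Sketch: fail fast on ACL problems and tell the operator what to look at.
// "some.relevant.property" is a placeholder, not the real configuration key.
final class PreUpgradeAclHint {
  static void rethrowWithHint(AccessControlException e, String dbTable) throws IOException {
    throw new IOException("Pre-upgrade tool cannot read " + dbTable
        + "; fix permissions (or the relevant property, placeholder: some.relevant.property)"
        + " and re-run", e);
  }
}
{noformat}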

> Pre-upgrade tool should handle exceptions and skip db/tables
> 
>
> Key: HIVE-21254
> URL: https://issues.apache.org/jira/browse/HIVE-21254
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
> Attachments: HIVE-21254.1.patch, HIVE-21254.2.patch, 
> HIVE-21254.3.patch, HIVE-21254.4.patch
>
>
> When exceptions like AccessControlException are thrown, the pre-upgrade tool 
> fails. If the hive user does not have read access to a database or its tables 
> (some external tables deny read access to hive), the pre-upgrade tool should 
> just assume they are external tables and move on without failing the 
> pre-upgrade process. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21266) Issue with single delta file

2019-02-13 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21266:
--
Description: 
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java#L353-L357]

 
{noformat}
if ((deltaCount + (dir.getBaseDirectory() == null ? 0 : 1)) + origCount <= 1) {
  LOG.debug("Not compacting {}; current base is {} and there are {} deltas and {} originals",
      sd.getLocation(), dir.getBaseDirectory(), deltaCount, origCount);
  return;
}
 {noformat}

Is problematic.
Suppose you have 1 delta file from streaming ingest: {{delta_11_20}} where 
{{txnid:13}} was aborted.  The code above will not rewrite the delta (which 
drops anything that belongs to the aborted txn) and transition the compaction 
to "ready_for_cleaning" state which will drop the metadata about the aborted 
txn in {{markCleaned()}}.  Now aborted data will come back as committed.



  was:
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java#L353-L357]

 
{noformat}
if ((deltaCount + (dir.getBaseDirectory() == null ? 0 : 1)) + origCount <= 1) {
  LOG.debug("Not compacting {}; current base is {} and there are {} deltas and {} originals",
      sd.getLocation(), dir.getBaseDirectory(), deltaCount, origCount);
  return;
}
 {noformat}

Is problematic.
Suppose you have 1 delta file from streaming ingest: {{delta_11_20}} where 
{{txnid:13}} was aborted.  The code above will not rewrite the delta (which 
drops anything that belongs to the aborted txn) and transition the compaction 
to "ready_for_cleaning" which will drop the metadata about the aborted txn.  
Now aborted data will come back as committed.




> Issue with single delta file
> 
>
> Key: HIVE-21266
> URL: https://issues.apache.org/jira/browse/HIVE-21266
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Vaibhav Gumashta
>Priority: Major
>
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java#L353-L357]
>  
> {noformat}
> if ((deltaCount + (dir.getBaseDirectory() == null ? 0 : 1)) + origCount <= 1) {
>   LOG.debug("Not compacting {}; current base is {} and there are {} deltas and {} originals",
>       sd.getLocation(), dir.getBaseDirectory(), deltaCount, origCount);
>   return;
> }
>  {noformat}
> Is problematic.
> Suppose you have 1 delta file from streaming ingest: {{delta_11_20}} where 
> {{txnid:13}} was aborted.  The code above will not rewrite the delta (which 
> drops anything that belongs to the aborted txn) and transition the compaction 
> to "ready_for_cleaning" state which will drop the metadata about the aborted 
> txn in {{markCleaned()}}.  Now aborted data will come back as committed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-21266) Issue with single delta file

2019-02-13 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-21266:
-


> Issue with single delta file
> 
>
> Key: HIVE-21266
> URL: https://issues.apache.org/jira/browse/HIVE-21266
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Vaibhav Gumashta
>Priority: Major
>
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java#L353-L357]
>  
> {noformat}
> if ((deltaCount + (dir.getBaseDirectory() == null ? 0 : 1)) + origCount <= 1) {
>   LOG.debug("Not compacting {}; current base is {} and there are {} deltas and {} originals",
>       sd.getLocation(), dir.getBaseDirectory(), deltaCount, origCount);
>   return;
> }
>  {noformat}
> Is problematic.
> Suppose you have 1 delta file from streaming ingest: {{delta_11_20}} where 
> {{txnid:13}} was aborted.  The code above will not rewrite the delta (which 
> drops anything that belongs to the aborted txn) and transition the compaction 
> to "ready_for_cleaning" which will drop the metadata about the aborted txn.  
> Now aborted data will come back as committed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21254) Pre-upgrade tool should handle exceptions and skip db/tables

2019-02-13 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767539#comment-16767539
 ] 

Eugene Koifman commented on HIVE-21254:
---

The compactor has to have a {{Table}} object to do anything.  If it cannot get 
one, it will fail the compaction.

> Pre-upgrade tool should handle exceptions and skip db/tables
> 
>
> Key: HIVE-21254
> URL: https://issues.apache.org/jira/browse/HIVE-21254
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
> Attachments: HIVE-21254.1.patch, HIVE-21254.2.patch, 
> HIVE-21254.3.patch, HIVE-21254.4.patch
>
>
> When exceptions like AccessControlException are thrown, the pre-upgrade tool 
> fails. If the hive user does not have read access to a database or its tables 
> (some external tables deny read access to hive), the pre-upgrade tool should 
> just assume they are external tables and move on without failing the 
> pre-upgrade process. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21254) Pre-upgrade tool should handle exceptions and skip db/tables

2019-02-13 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767507#comment-16767507
 ] 

Eugene Koifman commented on HIVE-21254:
---

This seems dangerous.  If the tool misses any tables that were actually Acid 
and need compacting, all the user sees is a WARN in the log, which is easy to 
miss.  And trying to use an Acid V1 table from Hive 3 will result in data loss 
(and perhaps corruption).

> Pre-upgrade tool should handle exceptions and skip db/tables
> 
>
> Key: HIVE-21254
> URL: https://issues.apache.org/jira/browse/HIVE-21254
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
> Attachments: HIVE-21254.1.patch, HIVE-21254.2.patch, 
> HIVE-21254.3.patch, HIVE-21254.4.patch
>
>
> When exceptions like AccessControlException are thrown, the pre-upgrade tool 
> fails. If the hive user does not have read access to a database or its tables 
> (some external tables deny read access to hive), the pre-upgrade tool should 
> just assume they are external tables and move on without failing the 
> pre-upgrade process. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21058) Make Compactor run in a transaction (Umbrella)

2019-02-13 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767499#comment-16767499
 ] 

Eugene Koifman commented on HIVE-21058:
---

[~asomani] - no concrete plans

> Make Compactor run in a transaction (Umbrella)
> --
>
> Key: HIVE-21058
> URL: https://issues.apache.org/jira/browse/HIVE-21058
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Fix For: 4.0.0
>
>
> Ensure that files produced by the compactor have their visibility controlled 
> via Hive transaction commit like any other write to an ACID table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-9995) ACID compaction tries to compact a single file

2019-02-08 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-9995:
-
Target Version/s: 4.0.0
  Status: Patch Available  (was: Open)

> ACID compaction tries to compact a single file
> --
>
> Key: HIVE-9995
> URL: https://issues.apache.org/jira/browse/HIVE-9995
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 1.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-9995.01.patch, HIVE-9995.WIP.patch
>
>
> Consider TestWorker.minorWithOpenInMiddle()
> since there is an open txnId=23, this doesn't have any meaningful minor 
> compaction work to do.  The system still tries to compact a single delta file 
> for 21-22 id range, and effectively copies the file onto itself.
> This is 1. inefficient and 2. can potentially affect a reader.
> (from a real cluster)
> Suppose we start with 
> {noformat}
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:03 
> /user/hive/warehouse/t/base_016
> -rw-r--r--   1 ekoifman staff602 2016-06-09 16:03 
> /user/hive/warehouse/t/base_016/bucket_0
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:07 
> /user/hive/warehouse/t/base_017
> -rw-r--r--   1 ekoifman staff588 2016-06-09 16:07 
> /user/hive/warehouse/t/base_017/bucket_0
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:07 
> /user/hive/warehouse/t/delta_017_017_
> -rw-r--r--   1 ekoifman staff514 2016-06-09 16:06 
> /user/hive/warehouse/t/delta_017_017_/bucket_0
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:07 
> /user/hive/warehouse/t/delta_018_018_
> -rw-r--r--   1 ekoifman staff612 2016-06-09 16:07 
> /user/hive/warehouse/t/delta_018_018_/bucket_0
> {noformat}
> then do _alter table T compact 'minor';_
> then we end up with 
> {noformat}
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:07 
> /user/hive/warehouse/t/base_017
> -rw-r--r--   1 ekoifman staff588 2016-06-09 16:07 
> /user/hive/warehouse/t/base_017/bucket_0
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:11 
> /user/hive/warehouse/t/delta_018_018
> -rw-r--r--   1 ekoifman staff500 2016-06-09 16:11 
> /user/hive/warehouse/t/delta_018_018/bucket_0
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:07 
> /user/hive/warehouse/t/delta_018_018_
> -rw-r--r--   1 ekoifman staff612 2016-06-09 16:07 
> /user/hive/warehouse/t/delta_018_018_/bucket_0
> {noformat}
> So compaction created a new dir _/user/hive/warehouse/t/delta_018_018_



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2019-02-08 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20901:
--
Description: 
suppose we run minor compaction 2 times, via alter table

The 2nd request to compaction should have nothing to do but I don't think there 
is a check for that.  It's visible in the context of HIVE-20823, where each 
compactor run produces a delta with new visibility suffix so we end up with 
something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delete_delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_001_
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
└── delta_002_002_
    ├── _orc_acid_version
    └── bucket_0{noformat}
i.e. 2 deltas with the same write ID range

this is bad.  Probably happens today as well but new run produces a delta with 
the same name and clobbers the previous one, which may interfere with writers

 

need to investigate

 

-The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
deltas as if they were distinct and it effectively duplicates data.-  There is 
no data duplication - {{getAcidState()}} will not use 2 deltas with the same 
{{writeid}} range

 

 

  was:
suppose we run minor compaction 2 times, via alter table

The 2nd request to compaction should have nothing to do but I don't think there 
is a check for that.  It's visible in the context of HIVE-20823, where each 
compactor run produces a delta with new visibility suffix so we end up with 
something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delete_delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_001_
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
└── delta_002_002_
    ├── _orc_acid_version
    └── bucket_0{noformat}
i.e. 2 deltas with the same write ID range

this is bad.  Probably happens today as well but new run produces a delta with 
the same name and clobbers the previous one, which may interfere with writers

 

need to investigate

 

-The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
deltas as if they were distinct and it effectively duplicates data.-  There is 
no data duplication - {{getAcidState()}} will use 2 deltas with the same 
\{{writeid}} range

 

 


> running compactor when there is nothing to do produces duplicate data
> -
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> suppose we run minor compaction 2 times, via alter table
> The 2nd request to compaction should have nothing to do but I don't think 
> there is a check for that.  It's visible in the context of HIVE-20823, where 
> each compactor run produces a delta with new visibility suffix so we end up 
> with something like
> {noformat}
> target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/
> ├── delete_delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delete_delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_001_
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> └── delta_002_002_
>     ├── _orc_acid_version
>     └── bucket_0{noformat}
> i.e. 2 deltas with the same write ID range
> this is bad.  Probably happens today as well but new run produces a delta 
> with the same name and clobbers the previous one, which may interfere with 
> writers
>  
> need to investigate
>  
> -The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
> deltas as if they were distinct and it effectively duplicates data.-  There 
> is no data duplication - {{getAcidState()}} will not use 2 deltas with the 
> same {{writeid}} range
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HIVE-9995) ACID compaction tries to compact a single file

2019-02-08 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-9995:
-
Attachment: HIVE-9995.01.patch

> ACID compaction tries to compact a single file
> --
>
> Key: HIVE-9995
> URL: https://issues.apache.org/jira/browse/HIVE-9995
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 1.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-9995.01.patch, HIVE-9995.WIP.patch
>
>
> Consider TestWorker.minorWithOpenInMiddle()
> since there is an open txnId=23, this doesn't have any meaningful minor 
> compaction work to do.  The system still tries to compact a single delta file 
> for 21-22 id range, and effectively copies the file onto itself.
> This is 1. inefficient and 2. can potentially affect a reader.
> (from a real cluster)
> Suppose we start with 
> {noformat}
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:03 
> /user/hive/warehouse/t/base_016
> -rw-r--r--   1 ekoifman staff602 2016-06-09 16:03 
> /user/hive/warehouse/t/base_016/bucket_0
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:07 
> /user/hive/warehouse/t/base_017
> -rw-r--r--   1 ekoifman staff588 2016-06-09 16:07 
> /user/hive/warehouse/t/base_017/bucket_0
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:07 
> /user/hive/warehouse/t/delta_017_017_
> -rw-r--r--   1 ekoifman staff514 2016-06-09 16:06 
> /user/hive/warehouse/t/delta_017_017_/bucket_0
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:07 
> /user/hive/warehouse/t/delta_018_018_
> -rw-r--r--   1 ekoifman staff612 2016-06-09 16:07 
> /user/hive/warehouse/t/delta_018_018_/bucket_0
> {noformat}
> then do _alter table T compact 'minor';_
> then we end up with 
> {noformat}
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:07 
> /user/hive/warehouse/t/base_017
> -rw-r--r--   1 ekoifman staff588 2016-06-09 16:07 
> /user/hive/warehouse/t/base_017/bucket_0
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:11 
> /user/hive/warehouse/t/delta_018_018
> -rw-r--r--   1 ekoifman staff500 2016-06-09 16:11 
> /user/hive/warehouse/t/delta_018_018/bucket_0
> drwxr-xr-x   - ekoifman staff  0 2016-06-09 16:07 
> /user/hive/warehouse/t/delta_018_018_
> -rw-r--r--   1 ekoifman staff612 2016-06-09 16:07 
> /user/hive/warehouse/t/delta_018_018_/bucket_0
> {noformat}
> So compaction created a new dir _/user/hive/warehouse/t/delta_018_018_



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2019-02-08 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20901:
--
Description: 
suppose we run minor compaction 2 times, via alter table

The 2nd request to compaction should have nothing to do but I don't think there 
is a check for that.  It's visible in the context of HIVE-20823, where each 
compactor run produces a delta with new visibility suffix so we end up with 
something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delete_delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_001_
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
└── delta_002_002_
    ├── _orc_acid_version
    └── bucket_0{noformat}
i.e. 2 deltas with the same write ID range

this is bad.  Probably happens today as well but new run produces a delta with 
the same name and clobbers the previous one, which may interfere with writers

 

need to investigate

 

-The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
deltas as if they were distinct and it effectively duplicates data.-  There is 
no data duplication - {{getAcidState()}} will use 2 deltas with the same 
\{{writeid}} range

 

 

  was:
suppose we run minor compaction 2 times, via alter table

The 2nd request to compaction should have nothing to do but I don't think there 
is a check for that.  It's visible in the context of HIVE-20823, where each 
compactor run produces a delta with new visibility suffix so we end up with 
something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delete_delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_001_
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
└── delta_002_002_
    ├── _orc_acid_version
    └── bucket_0{noformat}
i.e. 2 deltas with the same write ID range

this is bad.  Probably happens today as well but new run produces a delta with 
the same name and clobbers the previous one, which may interfere with writers

 

need to investigate

 

The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
deltas as if they were distinct and it effectively duplicates data. 


> running compactor when there is nothing to do produces duplicate data
> -
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> suppose we run minor compaction 2 times, via alter table
> The 2nd request to compaction should have nothing to do but I don't think 
> there is a check for that.  It's visible in the context of HIVE-20823, where 
> each compactor run produces a delta with new visibility suffix so we end up 
> with something like
> {noformat}
> target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/
> ├── delete_delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delete_delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_001_
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> └── delta_002_002_
>     ├── _orc_acid_version
>     └── bucket_0{noformat}
> i.e. 2 deltas with the same write ID range
> this is bad.  Probably happens today as well but new run produces a delta 
> with the same name and clobbers the previous one, which may interfere with 
> writers
>  
> need to investigate
>  
> -The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
> deltas as if they were distinct and it effectively duplicates data.-  There 
> is no data duplication - {{getAcidState()}} will use 2 deltas with the same 
> \{{writeid}} range
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21225) ACID: getAcidState() should cache a recursive dir listing locally

2019-02-08 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763911#comment-16763911
 ] 

Eugene Koifman commented on HIVE-21225:
---

Perhaps an easier/better solution is to add another suffix to the base/delta 
dir name to indicate the "type" - i.e. acid or raw.  Then {{isRawFormat}} would 
just look at the dir name.
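
A sketch of what encoding the format in the directory name could look like; the 
{{_raw}}/{{_acid}} suffixes are hypothetical and only illustrate the idea:
{noformat}
// Hypothetical naming sketch: carry the payload format in the base/delta directory name so
// isRawFormat() never has to open a data file.  The suffixes are illustrative only.
final class DirNameFormatSketch {
  static String withFormatSuffix(String baseOrDeltaDirName, boolean rawFormat) {
    return baseOrDeltaDirName + (rawFormat ? "_raw" : "_acid");
  }

  static boolean isRawFormat(String dirName) {
    return dirName.endsWith("_raw");
  }
}
{noformat}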


> ACID: getAcidState() should cache a recursive dir listing locally
> -
>
> Key: HIVE-21225
> URL: https://issues.apache.org/jira/browse/HIVE-21225
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Gopal V
>Priority: Major
>
> Currently getAcidState() makes 3 calls into the FS api which could be 
> answered by making a single recursive listDir call and reusing the same data 
> to check for isRawFormat() and isValidBase().
> All delta operations for a single partition can go against a single listed 
> directory snapshot instead of interacting with the NameNode or ObjectStore 
> within the inner loop.
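
A minimal sketch of the caching idea, built on the standard recursive 
{{FileSystem.listFiles()}} call; the AcidUtils plumbing itself is omitted and the class 
here is illustrative:
{noformat}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Sketch: take one recursive listing of the partition directory up front, then answer later
// questions (files under a given base/delta, isRawFormat, isValidBase) from this local
// snapshot instead of issuing more NameNode/ObjectStore calls in the inner loop.
final class DirSnapshotSketch {
  private final List<LocatedFileStatus> files = new ArrayList<>();

  DirSnapshotSketch(FileSystem fs, Path partitionDir) throws IOException {
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(partitionDir, true); // recursive
    while (it.hasNext()) {
      files.add(it.next());
    }
  }

  List<LocatedFileStatus> filesUnder(Path baseOrDeltaDir) {
    String prefix = baseOrDeltaDir.toUri().getPath() + "/";
    List<LocatedFileStatus> result = new ArrayList<>();
    for (LocatedFileStatus f : files) {
      if (f.getPath().toUri().getPath().startsWith(prefix)) {
        result.add(f);
      }
    }
    return result;
  }
}
{noformat}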



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-02-07 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21177:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
 Release Note: n/a
   Status: Resolved  (was: Patch Available)

> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-21177.01.patch, HIVE-21177.02.patch, 
> HIVE-21177.03.patch
>
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-02-07 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762905#comment-16762905
 ] 

Eugene Koifman commented on HIVE-21177:
---

failures not related
committed to master
thanks Prasanth for the review

> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21177.01.patch, HIVE-21177.02.patch, 
> HIVE-21177.03.patch
>
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21222) ACID: When there are no delete deltas skip finding min max keys

2019-02-06 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762394#comment-16762394
 ] 

Eugene Koifman commented on HIVE-21222:
---

failure not related


> ACID: When there are no delete deltas skip finding min max keys
> ---
>
> Key: HIVE-21222
> URL: https://issues.apache.org/jira/browse/HIVE-21222
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0, 3.2.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
> Attachments: HIVE-21222.1.patch, HIVE-21222.2.patch
>
>
> We create an orc reader in VectorizedOrcAcidRowBatchReader.findMinMaxKeys 
> (which will read the 16K footer) even for cases where delete deltas do not 
> exist.
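
The change amounts to an early return before the reader is built; a simplified sketch 
(signature reduced to keep it self-contained):
{noformat}
// Sketch: with no delete deltas there is nothing to filter against, so skip building the
// ORC reader (and its footer read) that findMinMaxKeys would otherwise need.
final class MinMaxKeySketch {
  static long[][] findMinMaxKeys(int deleteDeltaCount) {
    if (deleteDeltaCount == 0) {
      return null;   // no delete events to apply; caller treats null as "no key bounds needed"
    }
    // ... otherwise open the split's ORC footer and compute min/max ROW__ID keys ...
    throw new UnsupportedOperationException("omitted in this sketch");
  }
}
{noformat}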



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-02-06 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762302#comment-16762302
 ] 

Eugene Koifman commented on HIVE-21177:
---

patch 3 - some more refactoring to use Path rather than FileStatus


> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21177.01.patch, HIVE-21177.02.patch, 
> HIVE-21177.03.patch
>
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-02-06 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21177:
--
Attachment: HIVE-21177.03.patch

> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21177.01.patch, HIVE-21177.02.patch, 
> HIVE-21177.03.patch
>
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21222) ACID: When there are no delete deltas skip finding min max keys

2019-02-05 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21222:
--
Component/s: Transactions

> ACID: When there are no delete deltas skip finding min max keys
> ---
>
> Key: HIVE-21222
> URL: https://issues.apache.org/jira/browse/HIVE-21222
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0, 3.2.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
> Attachments: HIVE-21222.1.patch
>
>
> We create an orc reader in VectorizedOrcAcidRowBatchReader.findMinMaxKeys 
> (which will read the 16K footer) even for cases where delete deltas do not 
> exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21222) ACID: When there are no delete deltas skip finding min max keys

2019-02-05 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761374#comment-16761374
 ] 

Eugene Koifman commented on HIVE-21222:
---

+1

> ACID: When there are no delete deltas skip finding min max keys
> ---
>
> Key: HIVE-21222
> URL: https://issues.apache.org/jira/browse/HIVE-21222
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0, 3.2.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
> Attachments: HIVE-21222.1.patch
>
>
> We create an orc reader in VectorizedOrcAcidRowBatchReader.findMinMaxKeys 
> (which will read the 16K footer) even for cases where delete deltas do not 
> exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20699) Query based compactor for full CRUD Acid tables

2019-02-04 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760381#comment-16760381
 ] 

Eugene Koifman commented on HIVE-20699:
---

+1 patch 11

> Query based compactor for full CRUD Acid tables
> ---
>
> Key: HIVE-20699
> URL: https://issues.apache.org/jira/browse/HIVE-20699
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Vaibhav Gumashta
>Priority: Major
> Attachments: HIVE-20699.1.patch, HIVE-20699.1.patch, 
> HIVE-20699.10.patch, HIVE-20699.11.patch, HIVE-20699.11.patch, 
> HIVE-20699.2.patch, HIVE-20699.3.patch, HIVE-20699.4.patch, 
> HIVE-20699.5.patch, HIVE-20699.6.patch, HIVE-20699.7.patch, 
> HIVE-20699.8.patch, HIVE-20699.9.patch
>
>
> Currently the Acid compactor is implemented as a generated MR job 
> ({{CompactorMR.java}}).
> It could also be expressed as a Hive query that reads from a given partition 
> and writes data back to the same partition.  This will merge the deltas and 
> 'apply' the delete events.  The simplest would be to just use Insert 
> Overwrite but that will change all ROW__IDs which we don't want.
> Need to implement this in a way that preserves ROW__IDs and creates a new 
> {{base_x}} directory to handle Major compaction.
> Minor compaction will be investigated separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20699) Query based compactor for full CRUD Acid tables

2019-02-04 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760259#comment-16760259
 ] 

Eugene Koifman commented on HIVE-20699:
---

There are a few unused imports in SplitGrouper.
HiveSplitGenerator has unused imports, and
{noformat}
if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_IN_TEZ_TEST)) {
  taskResource = Math.max(taskResource, 1);
}
{noformat}
what does this do?


> Query based compactor for full CRUD Acid tables
> ---
>
> Key: HIVE-20699
> URL: https://issues.apache.org/jira/browse/HIVE-20699
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Vaibhav Gumashta
>Priority: Major
> Attachments: HIVE-20699.1.patch, HIVE-20699.1.patch, 
> HIVE-20699.10.patch, HIVE-20699.2.patch, HIVE-20699.3.patch, 
> HIVE-20699.4.patch, HIVE-20699.5.patch, HIVE-20699.6.patch, 
> HIVE-20699.7.patch, HIVE-20699.8.patch, HIVE-20699.9.patch
>
>
> Currently the Acid compactor is implemented as a generated MR job 
> ({{CompactorMR.java}}).
> It could also be expressed as a Hive query that reads from a given partition 
> and writes data back to the same partition.  This will merge the deltas and 
> 'apply' the delete events.  The simplest would be to just use Insert 
> Overwrite but that will change all ROW__IDs which we don't want.
> Need to implement this in a way that preserves ROW__IDs and creates a new 
> {{base_x}} directory to handle Major compaction.
> Minor compaction will be investigated separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21172) DEFAULT keyword handling in MERGE UPDATE clause issues

2019-02-04 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760191#comment-16760191
 ] 

Eugene Koifman commented on HIVE-21172:
---

[~vgarg], HIVE-21159 is in. thank you

> DEFAULT keyword handling in MERGE UPDATE clause issues
> --
>
> Key: HIVE-21172
> URL: https://issues.apache.org/jira/browse/HIVE-21172
> Project: Hive
>  Issue Type: Sub-task
>  Components: SQL, Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> once HIVE-21159 lands, enable {{HiveConf.MERGE_SPLIT_UPDATE}} and run these 
> tests.
> TestMiniLlapLocalCliDriver.testCliDriver[sqlmerge_stats]
>  mvn test -Dtest=TestMiniLlapLocalCliDriver 
> -Dqfile=insert_into_default_keyword.q
> Merge is rewritten as a multi-insert. When the Update clause has DEFAULT, 
> it's not properly replaced with a value in the multi-insert - it's treated as 
> a literal
> {noformat}
> INSERT INTO `default`.`acidTable`-- update clause(insert part)
>  SELECT `t`.`key`, `DEFAULT`, `t`.`value`
>WHERE `t`.`key` = `s`.`key` AND `s`.`key` > 3 AND NOT(`s`.`key` < 3)
> {noformat}
> See {{LOG.info("Going to reparse <" + originalQuery + "> as \n<" + 
> rewrittenQueryStr.toString() + ">");}} in hive.log
> {{MergeSemanticAnalyzer.replaceDefaultKeywordForMerge()}} is only called in 
> {{handleInsert}} but not {{handleUpdate()}}. Why does the issue only show up with 
> {{MERGE_SPLIT_UPDATE}}?
> Once this is fixed, HiveConf.MERGE_SPLIT_UPDATE should be true by default



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21159) Modify Merge statement logic to perform Update split early

2019-02-04 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21159:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
   Status: Resolved  (was: Patch Available)

committed to master
thanks Vaibhav for the review

> Modify Merge statement logic to perform Update split early
> --
>
> Key: HIVE-21159
> URL: https://issues.apache.org/jira/browse/HIVE-21159
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-21159.01.patch, HIVE-21159.02.patch, 
> HIVE-21159.03.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-01-30 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756350#comment-16756350
 ] 

Eugene Koifman commented on HIVE-21177:
---

ParsedDeltaLight pd = ParsedDeltaLight.parse(fs.getFileStatus(baseOrDeltaDir));

{{fs.getFileStatus(baseOrDeltaDir)}} is counted - it wasn't performed before.

I removed some of the comments since they were clearly out of date (even before 
the current patch)


> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21177.01.patch, HIVE-21177.02.patch
>
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-01-29 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755600#comment-16755600
 ] 

Eugene Koifman commented on HIVE-21177:
---

fixed tests - all were test issues
all TestTriggersTezSessionPoolManager pass locally - not sure what the issue is 
- there is some Infra issue where 
https://builds.apache.org/job/PreCommit-HIVE-Build/15828/testReport is blank

[~prasanth_j]/[~gopalv] could you review please

> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21177.01.patch, HIVE-21177.02.patch
>
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-01-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21177:
--
Attachment: HIVE-21177.02.patch

> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21177.01.patch, HIVE-21177.02.patch
>
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-01-29 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755453#comment-16755453
 ] 

Eugene Koifman edited comment on HIVE-21177 at 1/29/19 10:54 PM:
-

I added checks so that we don't look for the side file if we don't have to.
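
To illustrate, a rough sketch of the kind of check this implies (the method below 
is made up for illustration, it is not the actual Hive code): the side file can 
only exist in delta_x_y with x != y, so for base_x and delta_x_x the file system 
lookup can be skipped entirely.
{noformat}
// Sketch only: decide from the directory name whether an ORC ACID side file
// could possibly exist, so the file-system lookup can be skipped otherwise.
static boolean mayHaveSideFile(String dirName) {
  if (dirName.startsWith("base_")) {
    return false;                        // base_x never has a side file
  }
  if (dirName.startsWith("delta_")) {
    // delta_minWriteId_maxWriteId[_stmtId]
    String[] ids = dirName.substring("delta_".length()).split("_");
    return !ids[0].equals(ids[1]);       // delta_x_x never has a side file
  }
  return true;                           // anything else: fall back to the FS check
}
{noformat}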

We have another issue.  Operations like Load Data/Add Partition create 
base/delta and place 'raw' (aka 'original' schema) files there.  Split gen and 
read path need to know what schema to expect in a given file/split.  There is 
nothing in the file path that indicates what it is so it opens one of the data 
files in base/delta to determine that: {{AcidUtils.isRawFormat()}}.

This should be less of an issue, since it does a listing first to choose the 
file, so it should never be looking for a file that is not actually there.  I 
optimized isRawFormat() some but it will do the checks a lot of the time.  It 
could be changed to rely on the file name instead but that's rather fragile.




was (Author: ekoifman):
I added checks so that we don't look for the side file if we don't have to.

We have another issue.  Operations like Load Data/Add Partition, create 
base/delta and place 'raw' (aka 'original' schema) files there.  Split gen and 
read path need to know what schema to expect in a given file/split.  There is 
nothing in the file path that indicates what it is so it opens one of the data 
files in base/delta to determine that: {{AcidUtils.isRawFormat()}}.

This should be less of an issue, since it does a listing first to choose the 
file, so it should never be looking for a file that is not actually there.  I 
optimized isRawFormat() some but it will do the checks a lot of the time.  It 
could be changed to rely of file name instead but that's rather fragile.



> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21177.01.patch
>
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-01-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21177:
--
Status: Patch Available  (was: Open)

I added checks so that we don't look for the side file if we don't have to.

We have another issue.  Operations like Load Data/Add Partition create 
base/delta and place 'raw' (aka 'original' schema) files there.  Split gen and 
read path need to know what schema to expect in a given file/split.  There is 
nothing in the file path that indicates what it is so it opens one of the data 
files in base/delta to determine that: {{AcidUtils.isRawFormat()}}.

This should be less of an issue, since it does a listing first to choose the 
file, so it should never be looking for a file that is not actually there.  I 
optimized isRawFormat() some but it will do the checks a lot of the time.  It 
could be changed to rely on the file name instead but that's rather fragile.



> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21177.01.patch
>
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-01-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21177:
--
Attachment: HIVE-21177.01.patch

> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21177.01.patch
>
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-21177) Optimize AcidUtils.getLogicalLength()

2019-01-29 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-21177:
-


> Optimize AcidUtils.getLogicalLength()
> -
>
> Key: HIVE-21177
> URL: https://issues.apache.org/jira/browse/HIVE-21177
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> {{AcidUtils.getLogicalLength()}} - tries to look for the side file 
> {{OrcAcidUtils.getSideFile()}} on the file system even when the file couldn't 
> possibly be there, e.g. when the path is delta_x_x or base_x.  It could only 
> be there in delta_x_y, x != y.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20699) Query based compactor for full CRUD Acid tables

2019-01-28 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16754507#comment-16754507
 ] 

Eugene Koifman commented on HIVE-20699:
---

I left a few minor comments on RB for patch 9.  Overall LGTM, except for 
{{HIVE_TRANSACTIONAL_TABLE_SCAN}} not being visible in {{HiveSplitGenerator}}.  
While not caused by this patch, I don't think it can properly work in production 
w/o it being fixed.

> Query based compactor for full CRUD Acid tables
> ---
>
> Key: HIVE-20699
> URL: https://issues.apache.org/jira/browse/HIVE-20699
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Vaibhav Gumashta
>Priority: Major
> Attachments: HIVE-20699.1.patch, HIVE-20699.1.patch, 
> HIVE-20699.2.patch, HIVE-20699.3.patch, HIVE-20699.4.patch, 
> HIVE-20699.5.patch, HIVE-20699.6.patch, HIVE-20699.7.patch, 
> HIVE-20699.8.patch, HIVE-20699.9.patch
>
>
> Currently the Acid compactor is implemented as a generated MR job 
> ({{CompactorMR.java}}).
> It could also be expressed as a Hive query that reads from a given partition 
> and writes data back to the same partition.  This will merge the deltas and 
> 'apply' the delete events.  The simplest would be to just use Insert 
> Overwrite but that will change all ROW__IDs which we don't want.
> Need to implement this in a way that preserves ROW__IDs and creates a new 
> {{base_x}} directory to handle Major compaction.
> Minor compaction will be investigated separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21159) Modify Merge statement logic to perform Update split early

2019-01-28 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16754349#comment-16754349
 ] 

Eugene Koifman commented on HIVE-21159:
---

fixed test failures in patch 3. [~vgumashta] could you review please

> Modify Merge statement logic to perform Update split early
> --
>
> Key: HIVE-21159
> URL: https://issues.apache.org/jira/browse/HIVE-21159
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21159.01.patch, HIVE-21159.02.patch, 
> HIVE-21159.03.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21159) Modify Merge statement logic to perform Update split early

2019-01-28 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21159:
--
Attachment: HIVE-21159.03.patch

> Modify Merge statement logic to perform Update split early
> --
>
> Key: HIVE-21159
> URL: https://issues.apache.org/jira/browse/HIVE-21159
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21159.01.patch, HIVE-21159.02.patch, 
> HIVE-21159.03.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21159) Modify Merge statement logic to perform Update split early

2019-01-25 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21159:
--
Attachment: HIVE-21159.02.patch

> Modify Merge statement logic to perform Update split early
> --
>
> Key: HIVE-21159
> URL: https://issues.apache.org/jira/browse/HIVE-21159
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21159.01.patch, HIVE-21159.02.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21158) Perform update split early

2019-01-25 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21158:
--
Description: 
Currently Acid 2.0 does U=D+I in the OrcRecordUpdater. This means that all 
Updates (wide rows) are shuffled AND sorted.
 We could modify the multi-insert statement which results from the Merge 
statement so that instead of having one of the legs represent Update, we create 
2 legs - 1 representing Delete of original row and 1 representing Insert of the 
new version.
 Delete events are very small so sorting them is cheap. The Inserts are written 
to disk in a sorted way by virtue of how ROW__IDs are generated.

Exactly the same idea applies to regular Update statement.

Note that the U=D+I in OrcRecordUpdater needs to be kept to keep [Streaming 
Mutate API 
|https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API]
 working on 2.0.

*This requires that TxnHandler flags 2 Deletes as a conflict - it doesn't 
currently*

Incidentally, 2.0 + early split allows updating all columns including bucketing 
and partition columns

What is lock acquisition based on? Need to make sure that conflict detection 
(write set tracking) still works

So we want to transform
{noformat}
update T set B = 7 where A=1
{noformat}
into
{noformat}
from T
insert into T select ROW__ID where a = 1 SORT BY ROW__ID
insert into T select a, 7 where a = 1
{noformat}
even better to
{noformat}
from T where a = 1
insert into T select ROW__ID SORT BY ROW__ID
insert into T select a, 7
{noformat}
but this won't parse currently.

This is very similar to how MERGE stmt is handled.

Need some thought on how WriteSet tracking works. If we don't allow updating 
partition columns, then even with dynamic partitions 
TxnHandler.addDynamicPartitions() should see 1 entry (in Update type) for each 
partition since both the insert and delete land in the same partition. If part 
cols can be updated, then then we may insert a Delete event into P1 and 
corresponding Insert event into P2 so addDynamicPartitions() should see both 
parts. I guess both need to be recored in Write_Set but with different types. 
The delete as 'delete' and insert as insert so that it can conflict with some 
IOW on the 'new' partition.

  was:
Currently Acid 2.0 does U=D+I in the OrcRecordUpdater.  This means that all 
Updates (wide rows) are shuffled AND sorted.
We could modify the the multi-insert statement which results from Merge 
statement so that instead of having one of the legs represent Update, we create 
2 legs - 1 representing Delete of original row and 1 representing Insert of the 
new version.
Delete events are very small so sorting them is cheap.  The Insert are written 
to disk in a sorted way by virtue of how ROW__IDs are generated.

Exactly the same idea applies to regular Update statement.

Note that the U=D+I in OrcRecordUpdater needs to be kept to keep [Streaming 
Mutate API 
|https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API]
 working on 2.0.

*This requires that TxnHandler flags 2 Deletes as a conflict - it doesn't 
currently*

Incidentally, 2.0 + early split allows updating all columns including bucketing 
and partition columns

What is lock acquisition based on?  Need to make sure that conflict detection 
(write set tracking) still works

So we want to transform
{noformat}
update T set B = 7 where A=1
{noformat}
into 
{noformat}
from T
insert into T select ROW__ID where a = 1 SORT BY ROW__ID
insert into T select a, 7 where a = 1
{noformat}

even better to
{noformat}
from T where a = 1
insert into T select ROW__ID SORT BY ROW__ID
insert into T select a, 7
{noformat}
but this won't parse currently.

This is very similar to how MERGE stmt is handled.

Need some though on on how WriteSet tracking works.  If we don't allow updating 
partition column, then even with dynamic partitions 
TxnHandler.addDynamicPartitions() should see 1 entry (in Update type) for each 
partition since both the insert and delete land in the same partition.  If part 
cols can be updated, then then we may insert a Delete event into P1 and 
corresponding Insert event into P2 so addDynamicPartitions() should see both 
parts.  I guess both need to be recored in Write_Set but with different types.  
The delete as 'update' and insert as insert so that it can conflict with some 
IOW on the 'new' partition.


> Perform update split early
> --
>
> Key: HIVE-21158
> URL: https://issues.apache.org/jira/browse/HIVE-21158
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> Currently Acid 2.0 does U=D+I in the OrcRecordUpdater. This means that all 
> Updates (wide rows) are shuffled AND sorted.
>  We 

[jira] [Updated] (HIVE-21172) DEFAULT keyword handling in MERGE UPDATE clause issues

2019-01-25 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21172:
--
Description: 
once HIVE-21159 lands, enable {{HiveConf.MERGE_SPLIT_UPDATE}} and run these 
tests.

TestMiniLlapLocalCliDriver.testCliDriver[sqlmerge_stats]
 mvn test -Dtest=TestMiniLlapLocalCliDriver 
-Dqfile=insert_into_default_keyword.q

Merge is rewritten as a multi-insert. When the Update clause has DEFAULT, it's not 
properly replaced with a value in the multi-insert - it's treated as a literal
{noformat}
INSERT INTO `default`.`acidTable`-- update clause(insert part)
 SELECT `t`.`key`, `DEFAULT`, `t`.`value`
   WHERE `t`.`key` = `s`.`key` AND `s`.`key` > 3 AND NOT(`s`.`key` < 3)
{noformat}
See {{LOG.info("Going to reparse <" + originalQuery + "> as \n<" + 
rewrittenQueryStr.toString() + ">");}} in hive.log

{{MergeSemanticAnalyzer.replaceDefaultKeywordForMerge()}} is only called in 
{{handleInsert}} but not {{handleUpdate()}}. Why does the issue only show up with 
{{MERGE_SPLIT_UPDATE}}?

Once this is fixed, HiveConf.MERGE_SPLIT_UPDATE should be true by default

  was:
once HIVE-21159 lands, enable {{HiveConf.MERGE_SPLIT_UPDATE}} and run these 
tests.

TestMiniLlapLocalCliDriver.testCliDriver[sqlmerge_stats]
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=insert_into_default_keyword.q

Merge is rewritten as a multi-insert.  When Update clause has DEFAULT, it's not 
properly replaced with a value in the muli-insert - it's treated as a literal
{noformat}
INSERT INTO `default`.`acidTable`-- update clause(insert part)
 SELECT `t`.`key`, `DEFAULT`, `t`.`value`
   WHERE `t`.`key` = `s`.`key` AND `s`.`key` > 3 AND NOT(`s`.`key` < 3)
{noformat}

See {{LOG.info("Going to reparse <" + originalQuery + "> as \n<" + 
rewrittenQueryStr.toString() + ">");}} in hive.log

{{MergeSemanticAnalyzer.replaceDefaultKeywordForMerge()}} is only called in 
{{handleInsert}} but not {{handleUpdate()}}.  Why does issue only show up with 
{{MERGE_SPLIT_UPDATE}}?



> DEFAULT keyword handling in MERGE UPDATE clause issues
> --
>
> Key: HIVE-21172
> URL: https://issues.apache.org/jira/browse/HIVE-21172
> Project: Hive
>  Issue Type: Sub-task
>  Components: SQL, Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> once HIVE-21159 lands, enable {{HiveConf.MERGE_SPLIT_UPDATE}} and run these 
> tests.
> TestMiniLlapLocalCliDriver.testCliDriver[sqlmerge_stats]
>  mvn test -Dtest=TestMiniLlapLocalCliDriver 
> -Dqfile=insert_into_default_keyword.q
> Merge is rewritten as a multi-insert. When the Update clause has DEFAULT, it's 
> not properly replaced with a value in the multi-insert - it's treated as a 
> literal
> {noformat}
> INSERT INTO `default`.`acidTable`-- update clause(insert part)
>  SELECT `t`.`key`, `DEFAULT`, `t`.`value`
>WHERE `t`.`key` = `s`.`key` AND `s`.`key` > 3 AND NOT(`s`.`key` < 3)
> {noformat}
> See {{LOG.info("Going to reparse <" + originalQuery + "> as \n<" + 
> rewrittenQueryStr.toString() + ">");}} in hive.log
> {{MergeSemanticAnalyzer.replaceDefaultKeywordForMerge()}} is only called in 
> {{handleInsert}} but not {{handleUpdate()}}. Why does the issue only show up with 
> {{MERGE_SPLIT_UPDATE}}?
> Once this is fixed, HiveConf.MERGE_SPLIT_UPDATE should be true by default



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-23 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750653#comment-16750653
 ] 

Eugene Koifman commented on HIVE-21052:
---

Suppose you have a p-type clean on table T that is running (i.e. has the Write 
lock) and you have 30 different partition clean requests (in T).  The 30 
per-partition cleans will get blocked but they will tie up every thread in the pool 
while they are blocked, right?  If so, no other clean (on any other table) will 
actually make progress until the p-type clean on T is done.

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, HIVE-21052.8.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and 
> when addPartitions is called remove this entry from TXN_COMPONENTS and add 
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21118) Make sure if dynamic partitions is true only there's only one writeId allocated

2019-01-23 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750636#comment-16750636
 ] 

Eugene Koifman commented on HIVE-21118:
---

The current model in the Acid subsystem is that there is 1 writeId per 
(table, txnid).  {{DbTxnManager.getTableWriteId()}} ensures this but perhaps it 
should also be checked on the {{TxnHandler}} side.

There is also a concept of statement id, so that if there is > 1 write from a 
single txn to the same table, each write creates a new delta with a different 
statement id: delta_writeId_writeId_stmtId.

This is visible in multi-statement transactions (not fully supported yet but 
there are some tests in TestTxnCommands) and in multi-insert statements (especially 
the one that is generated to execute a Merge statement).
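
For illustration, a multi-insert that writes twice to the same table within one 
transaction (for example, the multi-insert generated for a Merge statement) would 
produce a layout along these lines - the write id and statement id values below 
are made up for the example:
{noformat}
warehouse/t/
├── delta_0000005_0000005_0000   <-- first insert leg:  writeId=5, stmtId=0
│   └── bucket_00000
└── delta_0000005_0000005_0001   <-- second insert leg: writeId=5, stmtId=1
    └── bucket_00000
{noformat}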

> Make sure if dynamic partitions is true only there's only one writeId 
> allocated
> ---
>
> Key: HIVE-21118
> URL: https://issues.apache.org/jira/browse/HIVE-21118
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
>
> See 
> https://issues.apache.org/jira/browse/HIVE-21052?focusedCommentId=16740528=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16740528



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21159) Modify Merge statement logic to perform Update split early

2019-01-23 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21159:
--
Attachment: HIVE-21159.01.patch

> Modify Merge statement logic to perform Update split early
> --
>
> Key: HIVE-21159
> URL: https://issues.apache.org/jira/browse/HIVE-21159
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21159.01.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21159) Modify Merge statement logic to perform Update split early

2019-01-23 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21159:
--
Status: Patch Available  (was: Open)

WIP - see what breaks

> Modify Merge statement logic to perform Update split early
> --
>
> Key: HIVE-21159
> URL: https://issues.apache.org/jira/browse/HIVE-21159
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-21159.01.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-21159) Modify Merge statement logic to perform Update split early

2019-01-23 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-21159:
-


> Modify Merge statement logic to perform Update split early
> --
>
> Key: HIVE-21159
> URL: https://issues.apache.org/jira/browse/HIVE-21159
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-21158) Perform update split early

2019-01-23 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-21158:
-


> Perform update split early
> --
>
> Key: HIVE-21158
> URL: https://issues.apache.org/jira/browse/HIVE-21158
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> Currently Acid 2.0 does U=D+I in the OrcRecordUpdater.  This means that all 
> Updates (wide rows) are shuffled AND sorted.
> We could modify the multi-insert statement which results from the Merge 
> statement so that instead of having one of the legs represent Update, we 
> create 2 legs - 1 representing Delete of original row and 1 representing 
> Insert of the new version.
> Delete events are very small so sorting them is cheap.  The Inserts are 
> written to disk in a sorted way by virtue of how ROW__IDs are generated.
> Exactly the same idea applies to regular Update statement.
> Note that the U=D+I in OrcRecordUpdater needs to be kept to keep [Streaming 
> Mutate API 
> |https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API]
>  working on 2.0.
> *This requires that TxnHandler flags 2 Deletes as a conflict - it doesn't 
> currently*
> Incidentally, 2.0 + early split allows updating all columns including 
> bucketing and partition columns
> What is lock acquisition based on?  Need to make sure that conflict detection 
> (write set tracking) still works
> So we want to transform
> {noformat}
> update T set B = 7 where A=1
> {noformat}
> into 
> {noformat}
> from T
> insert into T select ROW__ID where a = 1 SORT BY ROW__ID
> insert into T select a, 7 where a = 1
> {noformat}
> even better to
> {noformat}
> from T where a = 1
> insert into T select ROW__ID SORT BY ROW__ID
> insert into T select a, 7
> {noformat}
> but this won't parse currently.
> This is very similar to how MERGE stmt is handled.
> Need some thought on how WriteSet tracking works.  If we don't allow 
> updating partition column, then even with dynamic partitions 
> TxnHandler.addDynamicPartitions() should see 1 entry (in Update type) for 
> each partition since both the insert and delete land in the same partition.  
> If part cols can be updated, then we may insert a Delete event into P1 
> and corresponding Insert event into P2 so addDynamicPartitions() should see 
> both parts.  I guess both need to be recored in Write_Set but with different 
> types.  The delete as 'update' and insert as insert so that it can conflict 
> with some IOW on the 'new' partition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-20699) Query based compactor for full CRUD Acid tables

2019-01-23 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749370#comment-16749370
 ] 

Eugene Koifman edited comment on HIVE-20699 at 1/23/19 7:41 PM:


left some comments on RB (Diff Revision 7)


was (Author: ekoifman):
left some comments on RB

> Query based compactor for full CRUD Acid tables
> ---
>
> Key: HIVE-20699
> URL: https://issues.apache.org/jira/browse/HIVE-20699
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Vaibhav Gumashta
>Priority: Major
> Attachments: HIVE-20699.1.patch, HIVE-20699.1.patch, 
> HIVE-20699.2.patch, HIVE-20699.3.patch, HIVE-20699.4.patch, 
> HIVE-20699.5.patch, HIVE-20699.6.patch, HIVE-20699.7.patch
>
>
> Currently the Acid compactor is implemented as a generated MR job 
> ({{CompactorMR.java}}).
> It could also be expressed as a Hive query that reads from a given partition 
> and writes data back to the same partition.  This will merge the deltas and 
> 'apply' the delete events.  The simplest would be to just use Insert 
> Overwrite but that will change all ROW__IDs which we don't want.
> Need to implement this in a way that preserves ROW__IDs and creates a new 
> {{base_x}} directory to handle Major compaction.
> Minor compaction will be investigated separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20699) Query based compactor for full CRUD Acid tables

2019-01-22 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749370#comment-16749370
 ] 

Eugene Koifman commented on HIVE-20699:
---

left some comments on RB

> Query based compactor for full CRUD Acid tables
> ---
>
> Key: HIVE-20699
> URL: https://issues.apache.org/jira/browse/HIVE-20699
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 3.1.0
>Reporter: Eugene Koifman
>Assignee: Vaibhav Gumashta
>Priority: Major
> Attachments: HIVE-20699.1.patch, HIVE-20699.1.patch, 
> HIVE-20699.2.patch, HIVE-20699.3.patch, HIVE-20699.4.patch, 
> HIVE-20699.5.patch, HIVE-20699.6.patch, HIVE-20699.7.patch
>
>
> Currently the Acid compactor is implemented as a generated MR job 
> ({{CompactorMR.java}}).
> It could also be expressed as a Hive query that reads from a given partition 
> and writes data back to the same partition.  This will merge the deltas and 
> 'apply' the delete events.  The simplest would be to just use Insert 
> Overwrite but that will change all ROW__IDs which we don't want.
> Need to implement this in a way that preserves ROW__IDs and creates a new 
> {{base_x}} directory to handle Major compaction.
> Minor compaction will be investigated separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-22 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749161#comment-16749161
 ] 

Eugene Koifman edited comment on HIVE-21052 at 1/22/19 9:21 PM:


not sure that this is enough.  Suppose you have 2 p-type records in 
COMPACTION_QUEUE for the same table.  1 {{Cleaner}} picks up the 1st one and 
sets a CLEANING_STATE.  Suppose there is another {{Cleaner}} that can run 
concurrently?  Will it start working on the other p-type request?  But then the 
2 Cleaners (or {{CleanWork}}) will both aggregate TXN_COMPONENTS entries and do 
overlapping work.

I think a simple model is to mutex {{Cleaner}} instances, as they are today, but 
inside the {{Cleaner}} instance maintain a collection of all active 
{{CleanWork}} items by (db/table/partition), for example.  Then if you don't 
wait for the queue (inside Cleaner) to drain, the next time {{findReadyToClean()}} 
is called, it can simply ignore any requests for tables/partitions that are 
already being cleaned.  If it ends up with a non-empty list, it enqueues more 
{{CleanWork}} items, else the outer {{run()}} goes to sleep.  It's probably 
fine to leave this for a followup.

If you do allow concurrent {{Cleaner}} instances, you would have to sync via 
the DB but then it gets more complicated.  For example, what if a cleaner sets 
CLEANING_STATE and dies?  How does this clean ever get completed?
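
A rough sketch of that model (the class and method names below are made up for 
illustration, this is not the actual Cleaner code):
{noformat}
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;

// Sketch only: a single (mutexed) Cleaner that tracks which entities are being
// cleaned and skips them on the next findReadyToClean() pass instead of waiting
// for the whole queue to drain.
class CleanerSketch {
  private final Set<String> inFlight = ConcurrentHashMap.newKeySet(); // "db/table/partition"
  private final ExecutorService pool;

  CleanerSketch(ExecutorService pool) {
    this.pool = pool;
  }

  void runOneCycle(List<CleanRequest> readyToClean) {
    for (CleanRequest req : readyToClean) {
      // add() returns false if this db/table/partition is already being cleaned
      if (inFlight.add(req.key())) {
        pool.submit(() -> {
          try {
            req.clean();                  // the actual deletes / markCleaned()
          } finally {
            inFlight.remove(req.key());   // let a later cycle pick it up again
          }
        });
      }
    }
    // no need to block here; the outer run() can sleep and call this again
  }

  interface CleanRequest {                // stand-in for a CompactionInfo/CleanWork item
    String key();                         // db/table/partition identity
    void clean();
  }
}
{noformat}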





was (Author: ekoifman):
not sure that this is enough.  Suppose you have 2 p-type records in 
compacton_queue for the same table.  1 Cleaner picks up the 1st one and sets a 
CLEANING_STATE.  Suppose there is another Cleaner that can run concurrently?  
Will it start working on the other p-type request?  But then the 2 Cleaners (or 
CleanWork) will both aggregate TXN_COMPONENTS entries and do overlapping 
work

I think a simple model is to mutex Cleaner instances, as they are today but 
inside the Cleaner instance maintain a collection of all active CleanWork items 
by (db/table/partition) for example.  Then if you don't wait for the queue 
(inside Cleaner) to drain, next time findReadToClean() is called, it can simply 
ignore any requests for tables/partition that are already being cleaned.  If it 
ends up with non-empty list, it enqueues more CleanWork items, else the outer 
run() goes to sleep.  It's probably fine to leave for a followup.

If you do allow concurrent Cleaner instances, you would have to synch via the 
DB but then it gets more complicated.  For example, what if cleaner sets 
CLEANING_STATE and dies.  How does this clean ever get completed?




> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and 
> when addPartitions is called remove this entry from TXN_COMPONENTS and add 
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-22 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749161#comment-16749161
 ] 

Eugene Koifman commented on HIVE-21052:
---

not sure that this is enough.  Suppose you have 2 p-type records in 
compacton_queue for the same table.  1 Cleaner picks up the 1st one and sets a 
CLEANING_STATE.  Suppose there is another Cleaner that can run concurrently?  
Will it start working on the other p-type request?  But then the 2 Cleaners (or 
CleanWork) will both aggregate TXN_COMPONENTS entries and do overlapping 
work

I think a simple model is to mutex Cleaner instances, as they are today but 
inside the Cleaner instance maintain a collection of all active CleanWork items 
by (db/table/partition) for example.  Then if you don't wait for the queue 
(inside Cleaner) to drain, next time findReadToClean() is called, it can simply 
ignore any requests for tables/partition that are already being cleaned.  If it 
ends up with non-empty list, it enqueues more CleanWork items, else the outer 
run() goes to sleep.  It's probably fine to leave for a followup.

If you do allow concurrent Cleaner instances, you would have to synch via the 
DB but then it gets more complicated.  For example, what if cleaner sets 
CLEANING_STATE and dies.  How does this clean ever get completed?




> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and 
> when addPartitions is called remove this entry from TXN_COMPONENTS and add 
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-22 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749022#comment-16749022
 ] 

Eugene Koifman commented on HIVE-21052:
---

You are right.  I forgot that {{findReadyToClean()}} de-dups {{CompactionInfo}} 
so if each cycle of the {{Cleaner}} waits until all tasks it forked finish (as 
you have it), these two can't happen.  

It would be worth it to not make the {{Cleaner}} wait for all tasks it started 
to finish and just go back to the DB to find more stuff to do.  In case one of 
the tasks takes a really long time, w/o this the {{Cleaner}} will block on it.  It 
occurred to me that using locks for this is a bad idea since each lock will tie 
up a thread from the pool so this needs more thought...



> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and 
> when addPartitions is called remove this entry from TXN_COMPONENTS and add 
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-18 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746903#comment-16746903
 ] 

Eugene Koifman commented on HIVE-21052:
---

We'd like to prevent 2 concurrent p-type cleans of the same table.

We'd like to prevent 2 concurrent cleans of the same partition (or the same 
unpartitioned table)

It may be ok to have a p-type clean concurrent with a normal partition clean 
(same table) if the markCleaned() method for each clean operation affects disjoint 
sets of TXN_COMPONENTS entries. 

The map contains table level objects and partition level objects.  To work on a 
partition you acquire a shared lock on the parent table and an exclusive lock on the 
partition.  To work on the table as a whole, you acquire a semi-shared lock on the 
table.  Semi-shared is compatible with shared but not with another semi-shared.  
This gives the semantics where it's ok to do a table level clean with a 
partition level clean in parallel but not 2 concurrent table level cleans.  It 
also allows 2 different partitions in the same table to be processed in 
parallel but not the same partition in parallel.

Alternatively, you could acquire an Exclusive lock on the table each time you start a 
table level clean, which would prevent any other table level locks, thus making a 
table level clean block any clean on the table's partitions.
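
For illustration, a minimal sketch of those semantics (the class and method names 
below are made up, this is not actual Hive code): a partition clean takes shared 
on the table plus exclusive on the partition, a table level clean takes 
semi-shared on the table.
{noformat}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: shared is compatible with shared and semi-shared;
// semi-shared is compatible with shared but not with another semi-shared.
class CleanLocks {
  private final Map<String, TableLock> tables = new ConcurrentHashMap<>();

  static final class TableLock {
    boolean semiSharedHeld;                        // a table level clean in flight
    final Set<String> lockedPartitions = ConcurrentHashMap.newKeySet();
  }

  /** Partition clean: shared on the table + exclusive on the partition. */
  boolean tryLockPartition(String table, String partition) {
    TableLock t = tables.computeIfAbsent(table, k -> new TableLock());
    synchronized (t) {
      // shared on the table is compatible with everything here,
      // so only the partition-level exclusivity matters
      return t.lockedPartitions.add(partition);    // false if already being cleaned
    }
  }

  /** Table level clean: semi-shared, blocks only other table level cleans. */
  boolean tryLockTable(String table) {
    TableLock t = tables.computeIfAbsent(table, k -> new TableLock());
    synchronized (t) {
      if (t.semiSharedHeld) {
        return false;                              // one table level clean at a time
      }
      t.semiSharedHeld = true;
      return true;
    }
  }

  void unlockPartition(String table, String partition) {
    tables.get(table).lockedPartitions.remove(partition);
  }

  void unlockTable(String table) {
    TableLock t = tables.get(table);
    synchronized (t) {
      t.semiSharedHeld = false;
    }
  }
}
{noformat}
Under the alternative (an Exclusive table lock), tryLockTable would also have to 
fail while any partition of the table is being cleaned, and tryLockPartition would 
have to fail while the table lock is held.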

 

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and 
> when addPartitions is called remove this entry from TXN_COMPONENTS and add 
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-18 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746897#comment-16746897
 ] 

Eugene Koifman commented on HIVE-21052:
---

I misunderstood which locks you were talking about.  I'll comment on this later.

 

 

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and 
> when addPartitions is called remove this entry from TXN_COMPONENTS and add 
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-18 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21052:
--
Comment: was deleted

(was: Compaction (any part of it) never acquires any locks - it runs completely 
asynchronous from readers/writers.

So it has to ensure not to step on itself.  

 )

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and 
> when addPartitions is called remove this entry from TXN_COMPONENTS and add 
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-18 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746854#comment-16746854
 ] 

Eugene Koifman commented on HIVE-21052:
---

Compaction (any part of it) never acquires any locks - it runs completely 
asynchronously from readers/writers.

So it has to ensure not to step on itself.  

 

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and 
> when addPartitions is called remove this entry from TXN_COMPONENTS and add 
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-17 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745713#comment-16745713
 ] 

Eugene Koifman commented on HIVE-21052:
---

I left some comments on RB.

I think the way Cleaner manages concurrency is not quite right.  
Currently there is 1 Cleaner per HMS.  You can have several HMS instances in 
the cluster for HA.  Each Cleaner run is mutexed via {{handle = 
txnHandler.getMutexAPI().acquireLock(TxnStore.MUTEX_KEY.Cleaner.name());}} so 
only 1 is actually running at a time.

The (new) Cleaner seems to parallelize tasks too early and then has to mutex on 
the HMS access.
I would suggest resolving the paths first and then enqueue parallel tasks into 
the Priority queue to just do the deletes.  I would make sure that 2 clean 
operations of the same partition should not be allowed, nor 2 table level 
cleans.  (I'm not sure if table clean could run concurrently with partition 
level clean of the same table - I suspect yes if {{markCleaned()}} is such that 
the table clean and partition clean remove disjoint sets of TXN_COMPONENTS 
entries.  For 1st pass, I'd disallow it)

You could keep a (Concurrent) Map of locks which is thrown away at the end of 
Cleaner.run().  The locks are either named after Table or Partition.  To 
acquire a Partition level lock you 1st have to acquire the table level lock.  This way 
each {{CleanWork}} item can run separately as long as it's not violating the above 
rules.  In other words, cleans that are guaranteed to work on entities that are 
not the same/related run in parallel - otherwise in sequence.

If {{findReadyToClean()}} returns a very long list, it may be useful to create 
several RawStore connections to do the 'resolve' operations in parallel, but I'd 
say this is pass 2 or later.  This would actually allow these to run in parallel.

Let me know what you think.


 

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and 
> when addPartitions is called remove this entry from TXN_COMPONENTS and add 
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-17 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745627#comment-16745627
 ] 

Eugene Koifman commented on HIVE-21052:
---

there is 1 {{writeId}} per (table, txnid) - HIVE-21118

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and 
> when addPartitions is called remove this entry from TXN_COMPONENTS and add 
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-17 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745593#comment-16745593
 ] 

Eugene Koifman commented on HIVE-21052:
---

[~jmarhuen], I'm not sure I understand your 1st 2 bullet points.  we currently 
only support auto-commit mode and all the locks for a given statement are 
processed in a single call to {{lock(LockRequest rqst)}} so you should see the 
full set of tables and corresponding {{writeID}}.  So in the absence of retries 
(of the HMS call), I'd expect TXN_COMPONENTS to have a single 'p' type row for 
a given (table, txn) combination.  (Implicitly, each table gets only 1 
{{writeID}} within a given txn.)

Are we saying the same thing?

If retries cause multiple p-type entries for (table, txn) that should be 
harmless.  As you say, Initiator would only make 1 {{COMPACTION_QUEUE}} entry 
and {{Cleaner}} will clean data for all aborted txns for a given table based on 
that queue entry.
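
For illustration, with made-up values (and columns abbreviated), the expectation 
above would look roughly like this in TXN_COMPONENTS after a retried HMS call:
{noformat}
 TC_TXNID | TC_DATABASE | TC_TABLE | TC_PARTITION | TC_OPERATION_TYPE
----------+-------------+----------+--------------+------------------
       42 | default     | t        | NULL         | p
       42 | default     | t        | NULL         | p   <-- duplicate from the retry, harmless
{noformat}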



> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, 
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and 
> when addPartitions is called remove this entry from TXN_COMPONENTS and add 
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and it was aborted it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20198) Constant time table drops/renames

2019-01-17 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745538#comment-16745538
 ] 

Eugene Koifman commented on HIVE-20198:
---

FYI, {{TBLS.TBL_ID}} is exposed via Thrift since HIVE-20556.

> Constant time table drops/renames
> -
>
> Key: HIVE-20198
> URL: https://issues.apache.org/jira/browse/HIVE-20198
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 4.0.0
>Reporter: Alexander Kolbasov
>Assignee: Vihang Karajgaonkar
>Priority: Major
>
> Currently table drops and table renames have O(P) performance (where P is the 
> number of partitions). When a managed table is deleted, the implementation 
> deletes table metadata and then deletes all partitions in HDFS. HDFS 
> operations are optimized and only do sequential deletes for partitions 
> outside of the table prefix. This operation is O(P) where P is the number of 
> partitions. 
> Table rename goes through the list of partitions and modifies table name (and 
> potentially db name) in each partition. It also modifies each partition 
> location to match the new db/table name and renames directories (which is a 
> non-atomic and slow operation on S3). This is an O(P) operation where P is 
> the number of partitions.
> Basic idea is to do the following:
> # Assign a unique ID to each table
> # Create the directory name based on the unique ID rather than the name
> # Table rename then becomes a metadata-only operation - there is no need to 
> change any location information.
> # Table drop can become an asynchronous operation where the table is marked 
> as "deleted". Subsequent public metadata APIs should skip such tables. A 
> background cleaner thread may then go and clean up directories.
> Since the table location is unique for each table, new tables will not reuse 
> existing locations. This change isn't compatible with the current behavior 
> where there is an assumption that table location is based on table name. We 
> can get around this by providing an "opt-in" mechanism - a special table 
> property that indicates the table uses the new behavior, so the improvement 
> will initially work for new tables created with this feature enabled. We may 
> later provide some tool to convert existing tables to the new scheme.
> One complication arises when impersonation is enabled - the FS operations 
> should be performed using the client UGI rather than the server's, so the 
> cleaner thread should be able to use client UGIs.
> Initially we can punt on this and do standard table drops when impersonation 
> is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20960) Make MM compactor run in a transaction and remove CompactorMR.createCompactorMarker()

2019-01-11 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20960:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
 Release Note: n/a
   Status: Resolved  (was: Patch Available)

committed to master
thanks Vaibhav for the review

> Make MM compactor run in a transaction and remove 
> CompactorMR.createCompactorMarker()
> -
>
> Key: HIVE-20960
> URL: https://issues.apache.org/jira/browse/HIVE-20960
> Project: Hive
>  Issue Type: Sub-task
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20960.01.patch, HIVE-20960.02.patch, 
> HIVE-20960.03.patch, HIVE-20960.04.patch, HIVE-20960.05.patch, 
> HIVE-20960.06.patch, HIVE-20960.07.patch
>
>
> Now that we have HIVE-20823, we know if a dir is produced by compactor from 
> the name and {{CompactorMR.createCompactorMarker()}} can be removed.
>  
> Also includes a fix for insert-only (MM) ACID tables so that 
> compactor-produced base_X has the base_X_vY format, as for full CRUD tables
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21115) Add support for object versions in metastore

2019-01-11 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740793#comment-16740793
 ] 

Eugene Koifman commented on HIVE-21115:
---

Isn't there a direct SQL path somewhere that modifies HMS objects w/o using 
DataNucleus?
Could this be expressed via some on-update trigger instead?


> Add support for object versions in metastore
> 
>
> Key: HIVE-21115
> URL: https://issues.apache.org/jira/browse/HIVE-21115
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Priority: Major
>
> Currently, metastore objects are identified uniquely by their names (eg. 
> catName, dbName and tblName for a table is unique). Once a table or partition 
> is created it could be altered in many ways. There is no good way currently 
> to identify the version of the object once it is altered. For example, 
> suppose there are two clients (Hive and Impala) using the same metastore. 
> Once some alter operations are performed by a client, another client which 
> wants to do an alter operation has no good way to know if the object which it 
> has is the same as the one stored in metastore. Metastore updates the 
> {{transient_lastDdlTime}} every time there is a DDL operation on the object. 
> However, this value cannot be relied on for all the clients since after 
> HIVE-1768 metastore updates the value only when it is not set in the 
> parameters. It is possible that a client which alters the object state does 
> not remove the {{transient_lastDdlTime}} and metastore will not update it. 
> Secondly, if there is a clock skew between multiple HMS instances when HMS-HA 
> is configured, time values cannot be relied on to find out the sequence of 
> alter operations on a given object.
> This JIRA proposes to use the JDO versioning support in DataNucleus 
> (http://www.datanucleus.org/products/accessplatform_4_2/jdo/versioning.html) to 
> generate an incrementing sequence number every time an object is altered. The 
> value of this version can be set as one of the values in the parameters. The 
> advantage of using DataNucleus is that the versioning can be done across HMS 
> instances as part of the database transaction and it should work for all the 
> supported databases.
> In theory such a version can be used to detect if the client is presenting an 
> object which is "stale" when issuing an alter request. Metastore can choose to 
> reject such an alter request since the client may be caching an old version of 
> the object and any alter operation on such a stale object can potentially 
> overwrite previous operations. However, this can be done in a separate 
> JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20801) ACID: Allow DbTxnManager to ignore non-ACID table locking

2019-01-11 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740788#comment-16740788
 ] 

Eugene Koifman commented on HIVE-20801:
---

[~gopalv] what do you mean locks are advisory?  By default (with Acid on), 
non-transactional tables use standard S/X locks, so assuming things are 
configured properly (and no partial failures), reads should be consistent.  It 
seems that the description of the property is misleading.

Also, if you are disabling all locks for readers, why acquire any locks for 
writers?


> ACID: Allow DbTxnManager to ignore non-ACID table locking
> -
>
> Key: HIVE-20801
> URL: https://issues.apache.org/jira/browse/HIVE-20801
> Project: Hive
>  Issue Type: Bug
>  Components: Locking, Transactions
>Affects Versions: 4.0.0
>Reporter: Gopal V
>Assignee: Gopal V
>Priority: Major
>  Labels: Branch3Candidate, TODOC
> Attachments: HIVE-20801.1.patch, HIVE-20801.2.patch, 
> HIVE-20801.2.patch, HIVE-20801.3.patch
>
>
> Enabling ACIDv1 on a cluster produces a central locking bottleneck for all 
> table types, which is not always the intention.
> The Hive locking for non-acid tables is advisory (i.e. a client can 
> write/read without locking), which means that the implementation does not 
> offer strong consistency despite the lock manager consuming resources 
> centrally.
> Disabling this lock acquisition would improve the performance of non-ACID 
> tables co-existing with a globally configured DbTxnManager implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-11 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21052:
--
Affects Version/s: (was: 3.1.1)
   3.0.0

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and, 
> when addPartitions is called, removing this entry from TXN_COMPONENTS and 
> adding the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and then aborted, it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21118) Make sure if dynamic partitions is true there's only one writeId allocated

2019-01-11 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21118:
--
Component/s: Transactions

> Make sure if dynamic partitions is true there's only one writeId 
> allocated
> ---
>
> Key: HIVE-21118
> URL: https://issues.apache.org/jira/browse/HIVE-21118
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
>
> See 
> https://issues.apache.org/jira/browse/HIVE-21052?focusedCommentId=16740528=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16740528



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-11 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740528#comment-16740528
 ] 

Eugene Koifman commented on HIVE-21052:
---

We should never allocate > 1 writeId per (table, txn).  That is done somewhere 
in DbTxnHandler.getTableWriteId().
(Perhaps it should also be checked in TxnHandler.allocateTableWriteIds(), but 
I'd do it in a separate jira.)

Though the stmt/operation being executed may target > 1 table, so there is 
some minimal processing needed to look at all LockComponent entries to come up 
with a set of unique (table, txnid, writeid) tuples.
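
To make that "minimal processing" concrete, here is a sketch of the dedup step 
(illustrative only, not the actual TxnHandler code; it relies only on the 
LockRequest/LockComponent getters already mentioned in this thread):

{code}
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.hive.metastore.api.LockComponent;
import org.apache.hadoop.hive.metastore.api.LockRequest;

public final class WriteIdTargets {
  // Collect the distinct (db, table) pairs touched by one statement so that at
  // most one writeId is allocated (or looked up) per table for this txn.
  static Set<String> distinctTables(LockRequest rqst) {
    Set<String> tables = new HashSet<>();
    for (LockComponent lc : rqst.getComponent()) {
      if (lc.getTablename() != null) {
        tables.add(lc.getDbname().toLowerCase() + "." + lc.getTablename().toLowerCase());
      }
    }
    return tables;
  }
}
{code}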

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and, 
> when addPartitions is called, removing this entry from TXN_COMPONENTS and 
> adding the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and then aborted, it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21114) Create read-only transactions

2019-01-10 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21114:
--
Description: 
With HIVE-21036 we have a way to indicate that a txn is read only.
We should (at least in auto-commit mode) determine if the single stmt is a read 
and mark the txn accordingly.  
Then we can optimize {{TxnHandler.commitTxn()}} so that it doesn't do any 
checks in write_set etc.

{{TxnHandler.commitTxn()}} already starts with {{lockTransactionRecord(stmt, 
txnid, TXN_OPEN)}} so it can read the txn type in the same SQL stmt.

HiveOperation only has QUERY, which includes Insert and Select, so this 
requires figuring out how to determine if a query is a SELECT.  By the time 
{{Driver.openTransaction();}} is called, we have already parsed the query so 
there should be a way to know if the statement only reads.

For multi-stmt txns (once these are supported) we should allow user to indicate 
that a txn is read-only and then not allow any statements that can make 
modifications in this txn.  This should be a different jira.

cc [~ikryvenko]

  was:
With HIVE-21036 we have a way to indicate that a txn is read only.
We should (at least in auto-commit mode) determine if the single stmt is a read 
and mark the txn accordingly.  
Then we can optimize {{TxnHandler.commitTxn()}} so that it doesn't do any 
checks in write_set etc.
HiveOperation only has QUERY, which includes Insert and Select, so this 
requires figuring out how to determine if a query is a SELECT.  By the time 
{{Driver.openTransaction();}} is called, we have already parsed the query so 
there should be a way to know if the statement only reads.

For multi-stmt txns (once these are supported) we should allow user to indicate 
that a txn is read-only and then not allow any statements that can make 
modifications in this txn.  This should be a different jira.

cc [~ikryvenko]


> Create read-only transactions
> -
>
> Key: HIVE-21114
> URL: https://issues.apache.org/jira/browse/HIVE-21114
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> With HIVE-21036 we have a way to indicate that a txn is read only.
> We should (at least in auto-commit mode) determine if the single stmt is a 
> read and mark the txn accordingly.  
> Then we can optimize {{TxnHandler.commitTxn()}} so that it doesn't do any 
> checks in write_set etc.
> {{TxnHandler.commitTxn()}} already starts with {{lockTransactionRecord(stmt, 
> txnid, TXN_OPEN)}} so it can read the txn type in the same SQL stmt.
> HiveOperation only has QUERY, which includes Insert and Select, so this 
> requires figuring out how to determine if a query is a SELECT.  By the time 
> {{Driver.openTransaction();}} is called, we have already parsed the query so 
> there should be a way to know if the statement only reads.
> For multi-stmt txns (once these are supported) we should allow user to 
> indicate that a txn is read-only and then not allow any statements that can 
> make modifications in this txn.  This should be a different jira.
> cc [~ikryvenko]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21036) extend OpenTxnRequest with transaction type

2019-01-10 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21036:
--
   Resolution: Fixed
Fix Version/s: 4.0.0
 Release Note: n/a
   Status: Resolved  (was: Patch Available)

committed to master
thanks Igor for the contribution

> extend OpenTxnRequest with transaction type
> ---
>
> Key: HIVE-21036
> URL: https://issues.apache.org/jira/browse/HIVE-21036
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Igor Kryvenko
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-21036.01.patch, HIVE-21036.02.patch, 
> HIVE-21036.03.patch, HIVE-21036.04.patch
>
>
> There is a {{TXN_TYPE}} field in {{TXNS}} table.
> There is {{TxnHandler.TxnType}} with legal values.  It would be useful to 
> make TxnType a {{Thrift}} type, add a new {{COMPACTION}} type and allow 
> setting it in {{OpenTxnRequest}}.
> Since HIVE-20823 compactor starts a txn and should set this.
> Down the road we may want to set READ_ONLY either based on parsing of the 
> query or user input which can make {{TxnHandler.commitTxn}} faster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-10 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739823#comment-16739823
 ] 

Eugene Koifman commented on HIVE-21052:
---

It would be useful to add some tests of partitioned tables, with > 1 partition 
column.

AcidUtils.list() - how does this work when there are very many files (which I 
think would be common here)?  Should it use some form of RemoteFileIterator? 
e.g. FileUtils.RemoteIteratorWithFilter
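
For reference, the iterator-based listing pattern looks roughly like the sketch 
below. It uses the standard Hadoop FileSystem API; the wrapper method is 
illustrative and is not AcidUtils code.

{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public final class IterativeListing {
  // Walk a directory tree one file at a time instead of materializing the whole
  // listing, so memory stays bounded even when a partition has very many files.
  static void visit(FileSystem fs, Path dir) throws IOException {
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true /* recursive */);
    while (it.hasNext()) {
      LocatedFileStatus stat = it.next();
      if (stat.isFile()) {
        // process stat.getPath() here (e.g. apply an acid-directory filter)
      }
    }
  }
}
{code}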

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and, 
> when addPartitions is called, removing this entry from TXN_COMPONENTS and 
> adding the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and then aborted, it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-10 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739799#comment-16739799
 ] 

Eugene Koifman edited comment on HIVE-21052 at 1/10/19 9:52 PM:


I don't think testing this manually is sufficient.
See {{HiveConf.HIVETESTMODEROLLBACKTXN}}, 
{{HiveConf.HIVETESTMODEFAILCOMPACTION}} etc
You can add a similar one to inject a fault during a unit test to throw at the 
start of {{TxnHandler.addDynamicPartitions()}} for example

Could you explain your overall strategy?
I would've thought {{TxnHandler.enqueueLockWithRetry()}} has all the relevant 
info about whether it's a DP write or not.

Is it not possible to check if {{lc.getOperationType()}} is I/U/D and 
{{lc.isIsDynamicPartitionWrite()}}, then get the (dbname, table name) from 
LockComponent 
and create the 'p' type entry in TXN_COMPONENTS?  Just make sure there is 1 per 
table.

The same method has logic (via a SQL query) to find the writeID associated 
with this txnId.

Does this work?
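
A sketch of what such a check could look like is below (illustrative only; it 
uses just the LockComponent getters named above, and the actual patch may 
structure this differently):

{code}
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.hive.metastore.api.DataOperationType;
import org.apache.hadoop.hive.metastore.api.LockComponent;
import org.apache.hadoop.hive.metastore.api.LockRequest;

public final class DpWriteMarkers {
  // Collect one entry per table that is written via dynamic partitioning, so a
  // single 'p' marker row per (table, txn) can be recorded in TXN_COMPONENTS
  // at lock time and later replaced by real partition entries in
  // addDynamicPartitions().
  static Set<String> dpWriteTables(LockRequest rqst) {
    Set<String> tables = new HashSet<>();
    for (LockComponent lc : rqst.getComponent()) {
      DataOperationType op = lc.getOperationType();
      boolean isWrite = op == DataOperationType.INSERT
          || op == DataOperationType.UPDATE
          || op == DataOperationType.DELETE;
      if (isWrite && lc.isIsDynamicPartitionWrite()) {
        tables.add(lc.getDbname() + "." + lc.getTablename());
      }
    }
    return tables;
  }
}
{code}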


was (Author: ekoifman):
I don't think testing this manually is sufficient.
See {{HiveConf.HIVETESTMODEROLLBACKTXN}}, 
{{HiveConf.HIVETESTMODEFAILCOMPACTION}} etc
You can add a similar one to inject a fault during a unit test to throw at the 
start of {{TxnHandler.addDynamicPartitions()}} for example

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and, 
> when addPartitions is called, removing this entry from TXN_COMPONENTS and 
> adding the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and then aborted, it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-10 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739799#comment-16739799
 ] 

Eugene Koifman commented on HIVE-21052:
---

I don't think testing this manually is sufficient.
See {{HiveConf.HIVETESTMODEROLLBACKTXN}}, 
{{HiveConf.HIVETESTMODEFAILCOMPACTION}} etc
You can add a similar one to inject a fault during a unit test to throw at the 
start of {{TxnHandler.addDynamicPartitions()}} for example

> Make sure transactions get cleaned if they are aborted before addPartitions 
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and, 
> when addPartitions is called, removing this entry from TXN_COMPONENTS and 
> adding the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and then aborted, it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21036) extend OpenTxnRequest with transaction type

2019-01-10 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739772#comment-16739772
 ] 

Eugene Koifman commented on HIVE-21036:
---

+1 patch 4 pending tests

> extend OpenTxnRequest with transaction type
> ---
>
> Key: HIVE-21036
> URL: https://issues.apache.org/jira/browse/HIVE-21036
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Igor Kryvenko
>Priority: Major
> Attachments: HIVE-21036.01.patch, HIVE-21036.02.patch, 
> HIVE-21036.03.patch, HIVE-21036.04.patch
>
>
> There is a {{TXN_TYPE}} field in {{TXNS}} table.
> There is {{TxnHandler.TxnType}} with legal values.  It would be useful to 
> make TxnType a {{Thrift}} type, add a new {{COMPACTION}} type and allow 
> setting it in {{OpenTxnRequest}}.
> Since HIVE-20823 compactor starts a txn and should set this.
> Down the road we may want to set READ_ONLY either based on parsing of the 
> query or user input which can make {{TxnHandler.commitTxn}} faster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21036) extend OpenTxnRequest with transaction type

2019-01-10 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739586#comment-16739586
 ] 

Eugene Koifman commented on HIVE-21036:
---

I don't understand your explanation about the constructors.

I meant something like "{{optional TxnType txn_type = TxnType.DEFAULT}}" in 
Thrift definition.
Apparently you can't have an item that is both required and has a default.
But this way it's always set to the most general transaction type and places 
like Worker supply a more specific one.

> extend OpenTxnRequest with transaction type
> ---
>
> Key: HIVE-21036
> URL: https://issues.apache.org/jira/browse/HIVE-21036
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Igor Kryvenko
>Priority: Major
> Attachments: HIVE-21036.01.patch, HIVE-21036.02.patch, 
> HIVE-21036.03.patch
>
>
> There is a {{TXN_TYPE}} field in {{TXNS}} table.
> There is {{TxnHandler.TxnType}} with legal values.  It would be useful to 
> make TxnType a {{Thrift}} type, add a new {{COMPACTION}} type and allow 
> setting it in {{OpenTxnRequest}}.
> Since HIVE-20823 compactor starts a txn and should set this.
> Down the road we may want to set READ_ONLY either based on parsing of the 
> query or user input which can make {{TxnHandler.commitTxn}} faster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21052:
--
Attachment: Aborted Txn w_Direct Write.pdf

> Make sure transactions get cleaned if they are aborted before addPartitions is 
> called
> 
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Jaume M
>Assignee: Eugene Koifman
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and, 
> when addPartitions is called, removing this entry from TXN_COMPONENTS and 
> adding the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and then aborted, it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-21052:
-

Assignee: Jaume M  (was: Eugene Koifman)

> Make sure transactions get cleaned if they are aborted before addPartitions is 
> called
> 
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and, 
> when addPartitions is called, removing this entry from TXN_COMPONENTS and 
> adding the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and then aborted, it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-21052:
-

Assignee: Eugene Koifman  (was: Jaume M)

> Make sure transactions get cleaned if they are aborted before addPartitions is 
> called
> 
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.1.1
>Reporter: Jaume M
>Assignee: Eugene Koifman
>Priority: Critical
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, 
> HIVE-21052.2.patch
>
>
> If the transaction is aborted between openTxn and addPartitions and data has 
> been written on the table the transaction manager will think it's an empty 
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and, 
> when addPartitions is called, removing this entry from TXN_COMPONENTS and 
> adding the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that 
> specifies that a transaction was opened and then aborted, it must generate 
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-12451) Orc fast file merging/concatenation should be disabled for ACID tables

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman resolved HIVE-12451.
---
  Resolution: Fixed
   Fix Version/s: 3.0.0
Release Note: Concatenate is now supported on Acid tables
Target Version/s: 3.0.0, 1.3.0  (was: 1.3.0, 3.0.0)

> Orc fast file merging/concatenation should be disabled for ACID tables
> --
>
> Key: HIVE-12451
> URL: https://issues.apache.org/jira/browse/HIVE-12451
> Project: Hive
>  Issue Type: Bug
>  Components: ORC, Transactions
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Eugene Koifman
>Priority: Major
> Fix For: 3.0.0
>
>
> For ACID tables merging of small files should happen only through compaction. 
> We should disable "alter table .. concatenate" for ACID tables. We should 
> also disable ConditionalMergeFileTask if destination is an ACID table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-12451) Orc fast file merging/concatenation should be disabled for ACID tables

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-12451:
-

Assignee: Eugene Koifman  (was: Alan Gates)

> Orc fast file merging/concatenation should be disabled for ACID tables
> --
>
> Key: HIVE-12451
> URL: https://issues.apache.org/jira/browse/HIVE-12451
> Project: Hive
>  Issue Type: Bug
>  Components: ORC, Transactions
>Affects Versions: 1.3.0, 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Eugene Koifman
>Priority: Major
>
> For ACID tables merging of small files should happen only through compaction. 
> We should disable "alter table .. concatenate" for ACID tables. We should 
> also disable ConditionalMergeFileTask if destination is an ACID table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-16258) Suggestion: simplify type 2 SCDs with this non-standard extension to MERGE

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-16258:
--
Issue Type: New Feature  (was: Improvement)

> Suggestion: simplify type 2 SCDs with this non-standard extension to MERGE
> --
>
> Key: HIVE-16258
> URL: https://issues.apache.org/jira/browse/HIVE-16258
> Project: Hive
>  Issue Type: New Feature
>  Components: Transactions
>Affects Versions: 2.2.0
>Reporter: Carter Shanklin
>Priority: Major
>
> Some common data maintenance strategies, especially the Type 2 SCD update, 
> would become substantially easier with a small extension to the SQL standard 
> for MERGE, specifically the ability to say "when matched then insert". Per 
> the standard, matched records can only be updated or deleted.
> In the Type 2 SCD, when a new record comes in you update the old version of 
> the record and insert the new version of the same record. If this extension 
> were supported, sample Type 2 SCD code would look as follows:
> {code}
> merge into customer
> using new_customer_stage stage
> on stage.source_pk = customer.source_pk
> when not matched then insert values /* Insert a net new record */
>   (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, 
> true, null)
> when matched then update set   /* Update an old record to mark it as 
> out-of-date */
>   is_current = false, end_date = current_date()
> when matched then insert values /* Insert a new current record */
>   (stage.source_pk, upper(substr(stage.name, 0, 3)), stage.name, stage.state, 
> true, null);
> {code}
> Without this support, the user needs to devise some sort of workaround. A 
> common approach is to first left join the staging table against the table to 
> be updated, then to join these results to a helper table that will spit out 
> two records for each match and one record for each miss. One of the matching 
> records needs to have a join key that can never occur in the source data so 
> this requires precise knowledge of the source dataset.
> An example of this:
> {code}
> merge into customer
> using (
>   select
> *,
> coalesce(invalid_key, source_pk) as join_key
>   from (
> select
>   stage.source_pk, stage.name, stage.state,
>   case when customer.source_pk is null then 1
>   when stage.name <> customer.name or stage.state <> customer.state then 2
>   else 0 end as scd_row_type
> from
>   new_customer_stage stage
> left join
>   customer
> on (stage.source_pk = customer.source_pk and customer.is_current = true)
>   ) updates
>   join scd_types on scd_types.type = scd_row_type
> ) sub
> on sub.join_key = customer.source_pk
> when matched then update set
>   is_current = false,
>   end_date = current_date()
> when not matched then insert values
>   (sub.source_pk, upper(substr(sub.name, 0, 3)), sub.name, sub.state, true, 
> null);
> select * from customer order by source_pk;
> {code}
> This code is very complicated and will fail if the "invalid" key ever shows 
> up in the source dataset. This simple extension provides a lot of value and 
> likely very little maintenance overhead.
> /cc [~ekoifman]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-17284) remove OrcRecordUpdater.deleteEventIndexBuilder

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman resolved HIVE-17284.
---
Resolution: Won't Fix

hive.acid.key.index is actually needed for delete event filtering (HIVE-20738)

> remove OrcRecordUpdater.deleteEventIndexBuilder
> ---
>
> Key: HIVE-17284
> URL: https://issues.apache.org/jira/browse/HIVE-17284
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Minor
>
> There is no point in it. We know how many rows a delete_delta file has from 
> ORC and they are all the same type - so no need for AcidStats.
>  hive.acid.key.index has no value since delete_delta files are never split 
> and are not likely to have more than 1 stripe since they are very small.
> Also can remove KeyIndexBuilder.acidStats - we only have 1 type of event per 
> file
>  
> If doing this, make sure to fix {{OrcInputFormat.isOriginal(Reader)}} and 
> {{OrcInputFormat.isOriginal(Footer)}} etc.
> There is new KeyIndexBuilder("delete") and new KeyIndexBuilder("insert").  
> The latter is needed in HIVE-16812, the former can be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-18221) test acid default

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-18221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman resolved HIVE-18221.
---
Resolution: Won't Fix

This was just to run all tests.

> test acid default
> -
>
> Key: HIVE-18221
> URL: https://issues.apache.org/jira/browse/HIVE-18221
> Project: Hive
>  Issue Type: Test
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Attachments: HIVE-18221.01.patch, HIVE-18221.02.patch, 
> HIVE-18221.03.patch, HIVE-18221.04.patch, HIVE-18221.07.patch, 
> HIVE-18221.08.patch, HIVE-18221.09.patch, HIVE-18221.10.patch, 
> HIVE-18221.11.patch, HIVE-18221.12.patch, HIVE-18221.13.patch, 
> HIVE-18221.14.patch, HIVE-18221.16.patch, HIVE-18221.18.patch, 
> HIVE-18221.19.patch, HIVE-18221.20.patch, HIVE-18221.21.patch, 
> HIVE-18221.22.patch, HIVE-18221.23.patch, HIVE-18221.24.patch, 
> HIVE-18221.26.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-20435) Failed Dynamic Partition Insert into insert only table may lose transaction metadata

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman resolved HIVE-20435.
---
   Resolution: Won't Fix
Fix Version/s: 4.0.0
 Release Note: n/a

address in HIVE-21052

> Failed Dynamic Partition Insert into insert only table may lose transaction 
> metadata
> -
>
> Key: HIVE-20435
> URL: https://issues.apache.org/jira/browse/HIVE-20435
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Critical
> Fix For: 4.0.0
>
>
> {{TxnHandler.enqueueLockWithRetry()}} has an optimization where it doesn't 
> write to {{TXN_COMPONENTS}} if the write is a dynamic partition insert because 
> it expects to write to this table from {{addDynamicPartitions()}}.
> For insert-only, transactional tables, we create the target dir and start 
> writing to it before {{addDynamicPartitions()}} is called. So if a txn is 
> aborted, we may have a delta dir in the partition but no corresponding entry 
> in {{TXN_COMPONENTS}}. This means {{TxnStore.cleanEmptyAbortedTxns()}} may 
> clean up {{TXNS}} entry for the aborted transaction before Compactor removes 
> this delta dir, at which point it looks like committed data.
> Streaming API V2 with dynamic partition mode also has this problem.
> Full CRUD tables are currently immune to this since they rely on the "move" 
> operation in MoveTask, but longer term they should follow the same model as 
> insert-only tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-19955) ACID: Pre-filter the delete event registry using insert delta ranges

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman resolved HIVE-19955.
---
   Resolution: Won't Fix
Fix Version/s: 4.0.0
 Release Note: n/a

fixed as part of HIVE-20738

> ACID: Pre-filter the delete event registry using insert delta ranges
> 
>
> Key: HIVE-19955
> URL: https://issues.apache.org/jira/browse/HIVE-19955
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Gopal V
>Assignee: Eugene Koifman
>Priority: Major
> Fix For: 4.0.0
>
>
> Since the delete deltas that will be used for the ACID impl are limited to the 
> txn range encoded within the insert deltas, it is not useful to load any 
> delete events for any row outside of the current file's range.
> If the insert delta has "delta_3_3_0", then the "writeid=3" can be applied to 
> the delete delta list while loading it into memory - if the file has 
> "base_3", then the filter becomes "writeid <= 3".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-19955) ACID: Pre-filter the delete event registry using insert delta ranges

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-19955:
-

Assignee: Eugene Koifman

> ACID: Pre-filter the delete event registry using insert delta ranges
> 
>
> Key: HIVE-19955
> URL: https://issues.apache.org/jira/browse/HIVE-19955
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Gopal V
>Assignee: Eugene Koifman
>Priority: Major
>
> Since the delete deltas that will be used for the ACID impl are limited to the 
> txn range encoded within the insert deltas, it is not useful to load any 
> delete events for any row outside of the current file's range.
> If the insert delta has "delta_3_3_0", then the "writeid=3" can be applied to 
> the delete delta list while loading it into memory - if the file has 
> "base_3", then the filter becomes "writeid <= 3".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-20435) Failed Dynamic Partition Insert into insert only table may lose transaction metadata

2019-01-09 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738774#comment-16738774
 ] 

Eugene Koifman edited comment on HIVE-20435 at 1/9/19 11:30 PM:


addressed in HIVE-21052


was (Author: ekoifman):
address in HIVE-21052

> Failed Dynamic Partition Insert into insert only table may lose transaction 
> metadata
> -
>
> Key: HIVE-20435
> URL: https://issues.apache.org/jira/browse/HIVE-20435
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Critical
> Fix For: 4.0.0
>
>
> {{TxnHandler.enqueueLockWithRetry()}} has an optimization where it doesn't 
> write to {{TXN_COMPONENTS}} if the write is a dynamic partition insert because 
> it expects to write to this table from {{addDynamicPartitions()}}.
> For insert-only, transactional tables, we create the target dir and start 
> writing to it before {{addDynamicPartitions()}} is called. So if a txn is 
> aborted, we may have a delta dir in the partition but no corresponding entry 
> in {{TXN_COMPONENTS}}. This means {{TxnStore.cleanEmptyAbortedTxns()}} may 
> clean up {{TXNS}} entry for the aborted transaction before Compactor removes 
> this delta dir, at which point it looks like committed data.
> Streaming API V2 with dynamic partition mode also has this problem.
> Full CRUD tables are currently immune to this since they rely on the "move" 
> operation in MoveTask, but longer term they should follow the same model as 
> insert-only tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20919) Break up UpdateDeleteSemanticAnalyzer

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20919:
--
  Resolution: Fixed
Release Note: n/a
  Status: Resolved  (was: Patch Available)

> Break up UpdateDeleteSemanticAnalyzer
> -
>
> Key: HIVE-20919
> URL: https://issues.apache.org/jira/browse/HIVE-20919
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.2.0
>Reporter: Miklos Gergely
>Assignee: Miklos Gergely
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20919.01.patch, HIVE-20919.02.patch, 
> HIVE-20919.03.patch, HIVE-20919.04.patch, HIVE-20919.05.patch, 
> HIVE-20919.06.patch, HIVE-20919.07.patch
>
>
> UpdateDeleteSemanticAnalyzer handles update, delete, acid export and merge 
> queries by rewriting them to a different form. This is a clear violation of 
> [SRP|https://en.wikipedia.org/wiki/Single_responsibility_principle], and 
> therefore needs to be refactored. An abstract ancestor needs to take the 
> common part, and each of the specific tasks should be handled by a separate 
> class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20919) Break up UpdateDeleteSemanticAnalyzer

2019-01-09 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738767#comment-16738767
 ] 

Eugene Koifman commented on HIVE-20919:
---

committed to master
thanks Miklos for the contribution

> Break up UpdateDeleteSemanticAnalyzer
> -
>
> Key: HIVE-20919
> URL: https://issues.apache.org/jira/browse/HIVE-20919
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.2.0
>Reporter: Miklos Gergely
>Assignee: Miklos Gergely
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20919.01.patch, HIVE-20919.02.patch, 
> HIVE-20919.03.patch, HIVE-20919.04.patch, HIVE-20919.05.patch, 
> HIVE-20919.06.patch, HIVE-20919.07.patch
>
>
> UpdateDeleteSemanticAnalyzer handles update, delete, acid export and merge 
> queries by rewriting them to a different form. This is a clear violation of 
> [SRP|https://en.wikipedia.org/wiki/Single_responsibility_principle], and 
> therefore needs to be refactored. An abstract ancestor needs to take the 
> common part, and each of the specific tasks should be handled by a separate 
> class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20919) Break up UpdateDeleteSemanticAnalyzer

2019-01-09 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20919:
--
Fix Version/s: (was: 3.2.0)
   4.0.0

> Break up UpdateDeleteSemanticAnalyzer
> -
>
> Key: HIVE-20919
> URL: https://issues.apache.org/jira/browse/HIVE-20919
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.2.0
>Reporter: Miklos Gergely
>Assignee: Miklos Gergely
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20919.01.patch, HIVE-20919.02.patch, 
> HIVE-20919.03.patch, HIVE-20919.04.patch, HIVE-20919.05.patch, 
> HIVE-20919.06.patch, HIVE-20919.07.patch
>
>
> UpdateDeleteSemanticAnalyzer handles update, delete, acid export and merge 
> queries by rewriting them to a different form. This is a clear violation of 
> [SRP|https://en.wikipedia.org/wiki/Single_responsibility_principle], and 
> therefore needs to be refactored. An abstract ancestor needs to take the 
> common part, and each of the specific tasks should be handled by a separate 
> class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20919) Break up UpdateDeleteSemanticAnalyzer

2019-01-09 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738740#comment-16738740
 ] 

Eugene Koifman commented on HIVE-20919:
---

+1

> Break up UpdateDeleteSemanticAnalyzer
> -
>
> Key: HIVE-20919
> URL: https://issues.apache.org/jira/browse/HIVE-20919
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.2.0
>Reporter: Miklos Gergely
>Assignee: Miklos Gergely
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HIVE-20919.01.patch, HIVE-20919.02.patch, 
> HIVE-20919.03.patch, HIVE-20919.04.patch, HIVE-20919.05.patch, 
> HIVE-20919.06.patch, HIVE-20919.07.patch
>
>
> UpdateDeleteSemanticAnalyzer handles update, delete, acid export and merge 
> queries by rewriting them to a different form. This is a clear violation of 
> [SRP|https://en.wikipedia.org/wiki/Single_responsibility_principle], and 
> therefore needs to be refactored. An abstract ancestor needs to take the 
> common part, and each of the specific tasks should be handled by a separate 
> class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20919) Break up UpdateDeleteSemanticAnalyzer

2019-01-09 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738671#comment-16738671
 ] 

Eugene Koifman commented on HIVE-20919:
---

[~mgergely] is this complete or does more need to be done?

> Break up UpdateDeleteSemanticAnalyzer
> -
>
> Key: HIVE-20919
> URL: https://issues.apache.org/jira/browse/HIVE-20919
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Affects Versions: 3.2.0
>Reporter: Miklos Gergely
>Assignee: Miklos Gergely
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HIVE-20919.01.patch, HIVE-20919.02.patch, 
> HIVE-20919.03.patch, HIVE-20919.04.patch, HIVE-20919.05.patch, 
> HIVE-20919.06.patch, HIVE-20919.07.patch
>
>
> UpdateDeleteSemanticAnalyzer handles update, delete, acid export and merge 
> queries by rewriting them to a different form. This is a clear violation of 
> [SRP|https://en.wikipedia.org/wiki/Single_responsibility_principle], and 
> therefore needs to be refactored. An abstract ancestor needs to take the 
> common part, and each of the specific tasks should be handled by a separate 
> class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21036) extend OpenTxnRequest with transaction type

2019-01-09 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738653#comment-16738653
 ] 

Eugene Koifman commented on HIVE-21036:
---

n/m the RB.

Could {{optional TxnType txn_type,}} in {{OpenTxnRequest}} be made required 
with a default being TxnType.DEFAULT?  Seems like you wouldn't need 
{{rqst.isSetTxn_type()}} all over in that case.

Nits:
Looks like there are hand-written comments in OpenTxnRequest.java, which is a 
generated class.
Could you restore the individual import stmts in IMetaStoreClient.java?




> extend OpenTxnRequest with transaction type
> ---
>
> Key: HIVE-21036
> URL: https://issues.apache.org/jira/browse/HIVE-21036
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Igor Kryvenko
>Priority: Major
> Attachments: HIVE-21036.01.patch, HIVE-21036.02.patch, 
> HIVE-21036.03.patch
>
>
> There is a {{TXN_TYPE}} field in {{TXNS}} table.
> There is {{TxnHandler.TxnType}} with legal values.  It would be useful to 
> make TxnType a {{Thrift}} type, add a new {{COMPACTION}} type and allow 
> setting it in {{OpenTxnRequest}}.
> Since HIVE-20823 compactor starts a txn and should set this.
> Down the road we may want to set READ_ONLY either based on parsing of the 
> query or user input which can make {{TxnHandler.commitTxn}} faster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21036) extend OpenTxnRequest with transaction type

2019-01-09 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738642#comment-16738642
 ] 

Eugene Koifman commented on HIVE-21036:
---

[~ikryvenko] could you create a ReviewBoard please

> extend OpenTxnRequest with transaction type
> ---
>
> Key: HIVE-21036
> URL: https://issues.apache.org/jira/browse/HIVE-21036
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Eugene Koifman
>Assignee: Igor Kryvenko
>Priority: Major
> Attachments: HIVE-21036.01.patch, HIVE-21036.02.patch, 
> HIVE-21036.03.patch
>
>
> There is a {{TXN_TYPE}} field in {{TXNS}} table.
> There is {{TxnHandler.TxnType}} with legal values.  It would be useful to 
> make TxnType a {{Thrift}} type, add a new {{COMPACTION}} type and allow 
> setting it in {{OpenTxnRequest}}.
> Since HIVE-20823 compactor starts a txn and should set this.
> Down the road we may want to set READ_ONLY either based on parsing of the 
> query or user input which can make {{TxnHandler.commitTxn}} faster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-21106) Potential NPE in VectorizedOrcAcidRowBatchReader.ColumnizedDeleteEventRegistry

2019-01-08 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman resolved HIVE-21106.
---
   Resolution: Won't Fix
Fix Version/s: 4.0.0
 Release Note: n/a

already addressed on master via another commit
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1570-L1575
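
The shape of the guard is simply a null check before reading the stats - a 
sketch inferred from the description below, not a quote of the linked code:

{code}
// Inside the loop over delete_delta bucket files (sketch):
AcidStats acidStats = OrcAcidUtils.parseAcidStats(deleteDeltaReader);
if (acidStats == null || acidStats.deletes == 0) {
  // empty or index-less delete file: nothing to load into the registry
  continue;
}
{code}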

> Potential NPE in VectorizedOrcAcidRowBatchReader.ColumnizedDeleteEventRegistry
> --
>
> Key: HIVE-21106
> URL: https://issues.apache.org/jira/browse/HIVE-21106
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
> Fix For: 4.0.0
>
>
> {{VectorizedOrcAcidRowBatchReader.ColumnizedDeleteEventRegistry()}}
> {noformat}
> AcidStats acidStats = OrcAcidUtils.parseAcidStats(deleteDeltaReader);
> if (acidStats.deletes == 0) {
>  continue; // just a safe check to ensure that we are not reading empty 
> delete files.
> }
> {noformat}
> If the {{delete_delta../bucket_x}} is empty, it may not have a 
> {{hive.acid.index}} and {{OrcAcidUtils.parseAcidStats()}} will return null, 
> which causes an NPE.
> Even though HIVE-20941 ensures that empty files are no longer created, empty 
> files predating that fix may exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-21106) Potential NPE in VectorizedOrcAcidRowBatchReader.ColumnizedDeleteEventRegistry

2019-01-08 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman reassigned HIVE-21106:
-

Assignee: Eugene Koifman

> Potential NPE in VectorizedOrcAcidRowBatchReader.ColumnizedDeleteEventRegistry
> --
>
> Key: HIVE-21106
> URL: https://issues.apache.org/jira/browse/HIVE-21106
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
>Priority: Major
>
> {{VectorizedOrcAcidRowBatchReader.ColumnizedDeleteEventRegistry()}}
> {noformat}
> AcidStats acidStats = OrcAcidUtils.parseAcidStats(deleteDeltaReader);
> if (acidStats.deletes == 0) {
>   continue; // just a safe check to ensure that we are not reading empty delete files.
> }
> {noformat}
> If the {{delete_delta../bucket_x}} is empty, it may not have a 
> {{hive.acid.index}}, so {{OrcAcidUtils.parseAcidStats()}} will return null, 
> which causes an NPE.
> Even though HIVE-20941 will ensure empty files are no longer created, empty 
> files predating that fix may exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21106) Potential NPE in VectorizedOrcAcidRowBatchReader.ColumnizedDeleteEventRegistry

2019-01-08 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-21106:
--
Description: 
{{VectorizedOrcAcidRowBatchReader.ColumnizedDeleteEventRegistry()}}

{noformat}
AcidStats acidStats = OrcAcidUtils.parseAcidStats(deleteDeltaReader);
if (acidStats.deletes == 0) {
  continue; // just a safe check to ensure that we are not reading empty delete files.
}
{noformat}

If the {{delete_delta../bucket_x}} is empty, it may not have a 
{{hive.acid.index}}, so {{OrcAcidUtils.parseAcidStats()}} will return null, 
which causes an NPE.

Even though HIVE-20941 will ensure empty files are no longer created, empty 
files predating that fix may exist.



  was:
{{VectorizedOrcAcidRowBatchReader.ColumnizedDeleteEventRegistry()}}

{noformat}
AcidStats acidStats = OrcAcidUtils.parseAcidStats(deleteDeltaReader);
if (acidStats.deletes == 0) {
 continue; // just a safe check to ensure that we are not reading empty delete files.
}
{noformat}

If the {{delete_delta../bucket_x}} is empty, it may not have a 
{{hive.acid.index}} and {{OrcAcidUtils.parseAcidStats()}} will return null 
which causes NPE.






> Potential NPE in VectorizedOrcAcidRowBatchReader.ColumnizedDeleteEventRegistry
> --
>
> Key: HIVE-21106
> URL: https://issues.apache.org/jira/browse/HIVE-21106
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Priority: Major
>
> {{VectorizedOrcAcidRowBatchReader.ColumnizedDeleteEventRegistry()}}
> {noformat}
> AcidStats acidStats = OrcAcidUtils.parseAcidStats(deleteDeltaReader);
> if (acidStats.deletes == 0) {
>   continue; // just a safe check to ensure that we are not reading empty delete files.
> }
> {noformat}
> If the {{delete_delta../bucket_x}} is empty, it may not have a 
> {{hive.acid.index}}, so {{OrcAcidUtils.parseAcidStats()}} will return null, 
> which causes an NPE.
> Even though HIVE-20941 will ensure empty files are no longer created, empty 
> files predating that fix may exist.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21100) Allow flattening of table subdirectories resulted when using TEZ engine and UNION clause

2019-01-08 Thread Eugene Koifman (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737666#comment-16737666
 ] 

Eugene Koifman commented on HIVE-21100:
---

What is the motivation for this change?

Doing FileSystem.rename() on a system like S3 is expensive.

 

> Allow flattening of table subdirectories resulted when using TEZ engine and 
> UNION clause
> 
>
> Key: HIVE-21100
> URL: https://issues.apache.org/jira/browse/HIVE-21100
> Project: Hive
>  Issue Type: Improvement
>Reporter: George Pachitariu
>Assignee: George Pachitariu
>Priority: Minor
> Attachments: HIVE-21100.patch
>
>
> Right now, when data is written into a table with the Tez engine and UNION ALL 
> is the last step of the query, Hive on Tez creates a 
> subdirectory for each branch of the UNION ALL.
> With this patch the subdirectories are removed, and the files are renamed and 
> moved to the parent directory.
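
For illustration, with a two-branch UNION ALL as the last step of the query the layout changes roughly as follows (directory and file names are indicative only, not taken from the patch):
{noformat}
# before: one subdirectory per UNION ALL branch
warehouse/t/HIVE_UNION_SUBDIR_1/000000_0
warehouse/t/HIVE_UNION_SUBDIR_2/000000_0

# after the proposed flattening: files renamed and moved into the table directory
warehouse/t/000000_0
warehouse/t/000000_1
{noformat}
One effect is that readers which do not recurse into subdirectories see all of the data directly under the table (or partition) directory.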



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


  1   2   3   4   5   6   7   8   9   10   >