[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2020-01-08 Thread Peter Vary (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Vary updated HIVE-20901:
------------------------------
Fix Version/s: 4.0.0
Resolution: Duplicate
Status: Resolved  (was: Patch Available)

[~asomani]: If you do not mind, I will close this Jira, since it was fixed by HIVE-9995. 
Sorry for the confusion; I only found this Jira now :(

> running compactor when there is nothing to do produces duplicate data
> ---------------------------------------------------------------------
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 4.0.0
> Reporter: Eugene Koifman
> Assignee: Abhishek Somani
> Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20901.1.patch, HIVE-20901.2.patch
>
>
> Suppose we run minor compaction twice via ALTER TABLE.
> The second compaction request should have nothing to do, but I don't think 
> there is a check for that.  This is visible in the context of HIVE-20823, where 
> each compactor run produces a delta with a new visibility suffix, so we end up 
> with something like
> {noformat}
> target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/
> ├── delete_delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delete_delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_001_
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> └── delta_002_002_
>     ├── _orc_acid_version
>     └── bucket_0{noformat}
> i.e. 2 deltas with the same write ID range.
> This is bad.  It probably happens today as well, but the new run produces a 
> delta with the same name and clobbers the previous one, which may interfere 
> with writers.
>  
> Need to investigate.
>  
> -The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
> deltas as if they were distinct and it effectively duplicates data.-  There 
> is no data duplication - {{getAcidState()}} will not use 2 deltas with the 
> same {{writeid}} range.
>  
>  
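
The symptom described in the quoted report (two directories covering the same write ID range) can be spotted from a directory listing alone. The following is a minimal Python sketch, not Hive code: the name pattern is an assumption modeled only on the directory names shown in the listing, and `parse_delta_dir` and `duplicate_ranges` are hypothetical helpers introduced for illustration.

```python
import re
from collections import defaultdict

# Assumed naming pattern, modeled only on the listing in the report:
#   <delta|delete_delta>_<minWriteId>_<maxWriteId>[_v<visibilityId> | _<statementId>]
DELTA_DIR = re.compile(
    r"^(?P<kind>delete_delta|delta)_(?P<lo>\d+)_(?P<hi>\d+)"
    r"(?:_v(?P<vis>\d+)|_(?P<stmt>\d*))?$"
)


def parse_delta_dir(name):
    """Return (kind, min_write_id, max_write_id, visibility_id) or None."""
    m = DELTA_DIR.match(name)
    if m is None:
        return None
    vis = int(m.group("vis")) if m.group("vis") else None
    return m.group("kind"), int(m.group("lo")), int(m.group("hi")), vis


def duplicate_ranges(dir_names):
    """Map (kind, min, max) to directory names, keeping only repeated ranges."""
    groups = defaultdict(list)
    for name in dir_names:
        parsed = parse_delta_dir(name)
        if parsed is not None:
            kind, lo, hi, _vis = parsed
            groups[(kind, lo, hi)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Directory names copied from the listing in the report.
listing = [
    "delete_delta_001_002_v019",
    "delete_delta_001_002_v021",
    "delta_001_001_",
    "delta_001_002_v019",
    "delta_001_002_v021",
    "delta_002_002_",
]

for key, names in sorted(duplicate_ranges(listing).items()):
    print(key, names)
```

Grouping by kind as well as range keeps a `delta` and a `delete_delta` with the same range from being reported as duplicates of each other; only directories that differ solely in their visibility suffix are flagged.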



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2019-04-08 Thread Abhishek Somani (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated HIVE-20901:
-----------------------------------
Status: Patch Available  (was: Open)

> running compactor when there is nothing to do produces duplicate data
> ---------------------------------------------------------------------
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 4.0.0
> Reporter: Eugene Koifman
> Assignee: Abhishek Somani
> Priority: Major
> Attachments: HIVE-20901.1.patch, HIVE-20901.2.patch
>
>


[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2019-04-08 Thread Abhishek Somani (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated HIVE-20901:
-----------------------------------
Attachment: HIVE-20901.2.patch

> running compactor when there is nothing to do produces duplicate data
> ---------------------------------------------------------------------
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 4.0.0
> Reporter: Eugene Koifman
> Assignee: Abhishek Somani
> Priority: Major
> Attachments: HIVE-20901.1.patch, HIVE-20901.2.patch
>
>


[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2019-04-08 Thread Abhishek Somani (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated HIVE-20901:
-----------------------------------
Status: Open  (was: Patch Available)

> running compactor when there is nothing to do produces duplicate data
> ---------------------------------------------------------------------
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 4.0.0
> Reporter: Eugene Koifman
> Assignee: Abhishek Somani
> Priority: Major
> Attachments: HIVE-20901.1.patch, HIVE-20901.2.patch
>
>


[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2019-04-04 Thread Abhishek Somani (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated HIVE-20901:
-----------------------------------
Status: Patch Available  (was: Open)

> running compactor when there is nothing to do produces duplicate data
> ---------------------------------------------------------------------
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 4.0.0
> Reporter: Eugene Koifman
> Assignee: Abhishek Somani
> Priority: Major
> Attachments: HIVE-20901.1.patch
>
>


[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2019-04-04 Thread Abhishek Somani (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated HIVE-20901:
-----------------------------------
Attachment: HIVE-20901.1.patch

> running compactor when there is nothing to do produces duplicate data
> ---------------------------------------------------------------------
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 4.0.0
> Reporter: Eugene Koifman
> Assignee: Abhishek Somani
> Priority: Major
> Attachments: HIVE-20901.1.patch
>
>


[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2019-02-08 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20901:
----------------------------------
Description: 
suppose we run minor compaction 2 times, via alter table

The 2nd request to compaction should have nothing to do but I don't think there 
is a check for that.  It's visible in the context of HIVE-20823, where each 
compactor run produces a delta with new visibility suffix so we end up with 
something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delete_delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_001_
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
└── delta_002_002_
    ├── _orc_acid_version
    └── bucket_0{noformat}
i.e. 2 deltas with the same write ID range

this is bad.  Probably happens today as well but new run produces a delta with 
the same name and clobbers the previous one, which may interfere with writers

 

need to investigate

 

-The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
deltas as if they were distinct and it effectively duplicates data.-  There is 
no data duplication - {{getAcidState()}} will not use 2 deltas with the same 
{{writeid}} range

 

 

  was:
suppose we run minor compaction 2 times, via alter table

The 2nd request to compaction should have nothing to do but I don't think there 
is a check for that.  It's visible in the context of HIVE-20823, where each 
compactor run produces a delta with new visibility suffix so we end up with 
something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delete_delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_001_
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
└── delta_002_002_
    ├── _orc_acid_version
    └── bucket_0{noformat}
i.e. 2 deltas with the same write ID range

this is bad.  Probably happens today as well but new run produces a delta with 
the same name and clobbers the previous one, which may interfere with writers

 

need to investigate

 

-The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
deltas as if they were distinct and it effectively duplicates data.-  There is 
no data duplication - {{getAcidState()}} will use 2 deltas with the same 
\{{writeid}} range

 

 


> running compactor when there is nothing to do produces duplicate data
> ---------------------------------------------------------------------
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 4.0.0
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Priority: Major
>

[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2019-02-08 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20901:
----------------------------------
Description: 
suppose we run minor compaction 2 times, via alter table

The 2nd request to compaction should have nothing to do but I don't think there 
is a check for that.  It's visible in the context of HIVE-20823, where each 
compactor run produces a delta with new visibility suffix so we end up with 
something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delete_delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_001_
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
└── delta_002_002_
    ├── _orc_acid_version
    └── bucket_0{noformat}
i.e. 2 deltas with the same write ID range

this is bad.  Probably happens today as well but new run produces a delta with 
the same name and clobbers the previous one, which may interfere with writers

 

need to investigate

 

-The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
deltas as if they were distinct and it effectively duplicates data.-  There is 
no data duplication - {{getAcidState()}} will use 2 deltas with the same 
\{{writeid}} range

 

 

  was:
suppose we run minor compaction 2 times, via alter table

The 2nd request to compaction should have nothing to do but I don't think there 
is a check for that.  It's visible in the context of HIVE-20823, where each 
compactor run produces a delta with new visibility suffix so we end up with 
something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delete_delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_001_
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
└── delta_002_002_
    ├── _orc_acid_version
    └── bucket_0{noformat}
i.e. 2 deltas with the same write ID range

this is bad.  Probably happens today as well but new run produces a delta with 
the same name and clobbers the previous one, which may interfere with writers

 

need to investigate

 

The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
deltas as if they were distinct and it effectively duplicates data. 


> running compactor when there is nothing to do produces duplicate data
> ---------------------------------------------------------------------
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 4.0.0
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Priority: Major
>
> suppose we run minor compaction 2 times, via alter table
> The 2nd request to compaction should have nothing to do but I don't think 
> there is a check for that.  It's visible in the context of HIVE-20823, where 
> each compactor run produces a delta with new visibility suffix so we end up 
> with something like
> {noformat}
> target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/
> ├── delete_delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delete_delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_001_
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> └── delta_002_002_
>     ├── _orc_acid_version
>     └── bucket_0{noformat}
> i.e. 2 deltas with the same write ID range
> this is bad.  Probably happens today as well but new run produces a delta 
> with the same name and clobbers the previous one, which may interfere with 
> writers
>  
> need to investigate
>  
> -The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
> deltas as if they were distinct and it effectively duplicates data.-  There 
> is no data duplication - {{getAcidState()}} will use 2 deltas with the same 
> \{{writeid}} range
>  
>  





[jira] [Updated] (HIVE-20901) running compactor when there is nothing to do produces duplicate data

2018-11-19 Thread Eugene Koifman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-20901:
----------------------------------
Description: 
suppose we run minor compaction 2 times, via alter table

The 2nd request to compaction should have nothing to do but I don't think there 
is a check for that.  It's visible in the context of HIVE-20823, where each 
compactor run produces a delta with new visibility suffix so we end up with 
something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delete_delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_001_
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
└── delta_002_002_
    ├── _orc_acid_version
    └── bucket_0{noformat}
i.e. 2 deltas with the same write ID range

this is bad.  Probably happens today as well but new run produces a delta with 
the same name and clobbers the previous one, which may interfere with writers

 

need to investigate

 

The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
deltas as if they were distinct and it effectively duplicates data. 

  was:
suppose we run minor compaction 2 times, via alter table

The 2nd request to compaction should have nothing to do but I don't think there 
is a check for that.  It's visible in the context of HIVE-20823, where each 
compactor run produces a delta with new visibility suffix so we end up with 
something like
{noformat}
target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/

├── delete_delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delete_delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_001_
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v019
│   ├── _orc_acid_version
│   └── bucket_0
├── delta_001_002_v021
│   ├── _orc_acid_version
│   └── bucket_0
└── delta_002_002_
    ├── _orc_acid_version
    └── bucket_0{noformat}
i.e. 2 deltas with the same write ID range

this is bad.  Probably happens today as well but new run produces a delta with 
the same name and clobbers the previous one, which may interfere with writers

 

need to investigate


> running compactor when there is nothing to do produces duplicate data
> ---------------------------------------------------------------------
>
> Key: HIVE-20901
> URL: https://issues.apache.org/jira/browse/HIVE-20901
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 4.0.0
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Priority: Major
>
> suppose we run minor compaction 2 times, via alter table
> The 2nd request to compaction should have nothing to do but I don't think 
> there is a check for that.  It's visible in the context of HIVE-20823, where 
> each compactor run produces a delta with new visibility suffix so we end up 
> with something like
> {noformat}
> target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands3-1541810844849/warehouse/t/
> ├── delete_delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delete_delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_001_
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v019
> │   ├── _orc_acid_version
> │   └── bucket_0
> ├── delta_001_002_v021
> │   ├── _orc_acid_version
> │   └── bucket_0
> └── delta_002_002_
>     ├── _orc_acid_version
>     └── bucket_0{noformat}
> i.e. 2 deltas with the same write ID range
> this is bad.  Probably happens today as well but new run produces a delta 
> with the same name and clobbers the previous one, which may interfere with 
> writers
>  
> need to investigate
>  
> The issue (I think) is that {{AcidUtils.getAcidState()}} then returns both 
> deltas as if they were distinct and it effectively duplicates data. 
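
The report notes that a second minor compaction request should find nothing to do, but that no check exists for it. One possible shape for such a check is sketched below in Python. This is an illustration under stated assumptions, not the fix applied in Hive: the name pattern is modeled only on the listing in the report, `write_id_range` and `minor_compaction_has_work` are hypothetical names, and the sketch ignores base directories and open or aborted transactions.

```python
import re

# Assumed naming pattern, modeled only on the listing in the report.
DELTA_DIR = re.compile(r"^(?:delete_)?delta_(\d+)_(\d+)(?:_v\d+|_\d*)?$")


def write_id_range(name):
    """Return (min_write_id, max_write_id) for a delta directory, else None."""
    m = DELTA_DIR.match(name)
    return (int(m.group(1)), int(m.group(2))) if m else None


def minor_compaction_has_work(dir_names):
    """Hypothetical pre-check for a minor compaction request.

    Returns False when there are fewer than two delta directories, or when
    some existing directory already spans the full write ID range that a
    minor compaction over these directories would produce.
    """
    ranges = [r for r in map(write_id_range, dir_names) if r is not None]
    if len(ranges) < 2:
        return False
    lo = min(r[0] for r in ranges)
    hi = max(r[1] for r in ranges)
    return (lo, hi) not in set(ranges)


# State before any compaction, then after the first minor compaction.
before = ["delta_001_001_", "delta_002_002_"]
after = before + ["delta_001_002_v019", "delete_delta_001_002_v019"]

print(minor_compaction_has_work(before))
print(minor_compaction_has_work(after))
```

With a check of this kind, the second request in the scenario from the report would be skipped, so no second `delta_001_002_v…` directory would be written.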


