Re: Review Request 72528: ValidTxnManager doesn't consider txns opened and committed between snapshot generation and locking when evaluating ValidTxnListState

2020-06-19 Thread Peter Vary via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72528/#review221036
---


Ship it!




+1 since the remaining issue will be fixed in HIVE-23725

- Peter Vary


On jún. 9, 2020, 8:52 de, Denys Kuzmenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72528/
> ---
> 
> (Updated jún. 9, 2020, 8:52 de)
> 
> 
> Review request for hive, Jesús Camacho Rodríguez, Peter Varga, and Peter Vary.
> 
> 
> Bugs: HIVE-23503
> https://issues.apache.org/jira/browse/HIVE-23503
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> ValidTxnManager doesn't consider txns opened and committed between snapshot 
> generation and locking when evaluating ValidTxnListState. This cause issues 
> like duplicate insert in case of concurrent merge insert & insert.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/Driver.java e70c92eef4 
>   ql/src/java/org/apache/hadoop/hive/ql/DriverContext.java a8c83fc504 
>   ql/src/java/org/apache/hadoop/hive/ql/ValidTxnManager.java 7d49c57dda 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java 71afcbdc68 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DummyTxnManager.java 
> 0383881acc 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveTxnManager.java 
> 600289f837 
>   ql/src/test/org/apache/hadoop/hive/ql/lockmgr/TestDbTxnManager2.java 
> 8a15b7cc5d 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
>  65df9c2ba9 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java
>  887d4303f4 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientPreCatalog.java
>  312936efa8 
>   storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java 
> b8ff03f9c4 
>   storage-api/src/java/org/apache/hadoop/hive/common/ValidTxnList.java 
> d4c3b09730 
> 
> 
> Diff: https://reviews.apache.org/r/72528/diff/1/
> 
> 
> Testing
> ---
> 
> DbTxnManager tests.
> 
> Faulty scenario:
> 1. open and generate snapshot for t1 that merge inserts data from a source 
> table into the target one.
> 2. Open, run and commit t2 that inserts source table data into the target 
> table.
> 3. Run t1 - duplicate date would be inserted into target table as t2 changes 
> won't be visible by t1.
> 
> 
> Thanks,
> 
> Denys Kuzmenko
> 
>



Re: Review Request 72528: ValidTxnManager doesn't consider txns opened and committed between snapshot generation and locking when evaluating ValidTxnListState

2020-06-09 Thread Denys Kuzmenko via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72528/
---

(Updated June 9, 2020, 8:52 a.m.)


Review request for hive, Jesús Camacho Rodríguez, Peter Varga, and Peter Vary.


Bugs: HIVE-23503
https://issues.apache.org/jira/browse/HIVE-23503


Repository: hive-git


Description
---

ValidTxnManager doesn't consider txns opened and committed between snapshot 
generation and locking when evaluating ValidTxnListState. This cause issues 
like duplicate insert in case of concurrent merge insert & insert.


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/Driver.java e70c92eef4 
  ql/src/java/org/apache/hadoop/hive/ql/DriverContext.java a8c83fc504 
  ql/src/java/org/apache/hadoop/hive/ql/ValidTxnManager.java 7d49c57dda 
  ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java 71afcbdc68 
  ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DummyTxnManager.java 0383881acc 
  ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveTxnManager.java 600289f837 
  ql/src/test/org/apache/hadoop/hive/ql/lockmgr/TestDbTxnManager2.java 
8a15b7cc5d 
  
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
 65df9c2ba9 
  
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java
 887d4303f4 
  
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientPreCatalog.java
 312936efa8 
  storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java 
b8ff03f9c4 
  storage-api/src/java/org/apache/hadoop/hive/common/ValidTxnList.java 
d4c3b09730 


Diff: https://reviews.apache.org/r/72528/diff/1/


Testing (updated)
---

DbTxnManager tests.

Faulty scenario:
1. open and generate snapshot for t1 that merge inserts data from a source 
table into the target one.
2. Open, run and commit t2 that inserts source table data into the target table.
3. Run t1 - duplicate date would be inserted into target table as t2 changes 
won't be visible by t1.


Thanks,

Denys Kuzmenko



Re: Review Request 72528: ValidTxnManager doesn't consider txns opened and committed between snapshot generation and locking when evaluating ValidTxnListState

2020-06-08 Thread Jesús Camacho Rodríguez


> On May 20, 2020, 3:16 p.m., Peter Varga wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/Driver.java
> > Lines 686 (patched)
> > 
> >
> > I have concerns here, but I am not sure if they are well founded or 
> > not. I think this will break what the outside world thinks of snapshot 
> > isolation. I might have a hypothetical client that inserts lots of data in 
> > a source table and sometimes issue a merge statement from the source to the 
> > target table. They have some requirement that the target table can not have 
> > partial data regarding some property. Example they inserting sales data, 
> > and the target table can not contain half the data of a day, it can either 
> > have all or none. So what the clients does, it will issue the inserts into 
> > the source table synchronously ordered by the date and when it gets to a 
> > next day it issue a merge statement asynchronously and continues to inserts 
> > the data for the next day synchronously. And it might think that it is save 
> > to do so, since the merge statement has a snapshot it will not see the data 
> > inserted afterwards. But with this change it will break.
> > It might not be the best example, since how would the client know when 
> > the snapshot is actually captured. But I am not familiar enough with the 
> > ecosystem, does anything use the Hive by issuing the compile and run 
> > separately? Because there you could be sure before this change, that the 
> > compilation order also meant snapshot order. So summarized, I don't know 
> > what the outside world excepts of the snapshot isolation.
> 
> Denys Kuzmenko wrote:
> insert into source and merge from source into target won't conflict with 
> each other, they touch different tables. Maybe I missing something here...
> 
> Peter Varga wrote:
> My example was not perfect. I don't mean that it will conflict with the 
> insert into the source table. It can conflict with some other client's 
> transaction. My main point is, after the conflict is noticed and you 
> regenerate the snapshot it will starts to read results from transactions that 
> were opened and committed after the original query was compiled, and I'm just 
> trying to figure out, what kinf of problems can it cause, if any. In my 
> example you start to read records inserted later, but what if somebody added 
> a new partition since the compilation, wouldn't it cause problem?
> 
> Denys Kuzmenko wrote:
> probably there might be an issue as we won't create any locks for the 
> newly created partition, however we'll start reading it.
> instead of rollback & retry on Hive side we might consider to just fail 
> and let the user re-try.
> 
> Denys Kuzmenko wrote:
> however it still leaves the question what happens now in Hive when 
> somebody adds a new partition (insert with dynamic partitioning) since the 
> compilation (merge insert). I'll test this out.
> 
> Peter Varga wrote:
> You could construct an example like this:
> 1. open and compile transaction 1 that merge inserts data from a 
> partitioned source table that has a few partition.
> 2. Open, run and commit transaction 2 that inserts data to an old and a 
> new partition to the source table.
> 3. Open, run and commit transaction 3 that inserts data to the target 
> table of the merge statement, that will retrigger a snapshot generation in 
> transaction 1.
> 4. Run transaction 1, the snapshot will be regenerated, and I think it 
> will read partial data from transaction 2 breaking the ACID properties.
> 
> But we should try the test without your patch with a little different 
> setup.
> Switch the transaction order:
> 1. compile transaction 1 that inserts data to an old and a new partition 
> of the source table.
> 2. compile transaction 2 that insert data to the target table
> 2. compile transaction 3 that merge inserts data from the source table to 
> the target table
> 3. run and commit transaction 1
> 4. run and commit transaction 2
> 5. run transaction 3, since it cointains 1 and 2 in its snaphot the 
> isValidTxnListState will be triggered without your patch and I think it 
> possible that we do a partial read of the transaction 1 for the same reasons.

For the first scenario, afaik we should not read the list of partitions 
incorrectly in item 4. In fact, if a new snapshot needs to be generated, we 
recompile the query, and thus, the list of partitions should be regenerated 
too. If it is not, then we are keeping some state unintentionally (since the 
semantic analyzer is created again in the recompilation, I am not sure where 
that state going be coming from though).


- Jesús


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72528/#review220838

Re: Review Request 72528: ValidTxnManager doesn't consider txns opened and committed between snapshot generation and locking when evaluating ValidTxnListState

2020-06-08 Thread Peter Varga via Review Board


> On May 20, 2020, 3:16 p.m., Peter Varga wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/Driver.java
> > Lines 686 (patched)
> > 
> >
> > I have concerns here, but I am not sure if they are well founded or 
> > not. I think this will break what the outside world thinks of snapshot 
> > isolation. I might have a hypothetical client that inserts lots of data in 
> > a source table and sometimes issue a merge statement from the source to the 
> > target table. They have some requirement that the target table can not have 
> > partial data regarding some property. Example they inserting sales data, 
> > and the target table can not contain half the data of a day, it can either 
> > have all or none. So what the clients does, it will issue the inserts into 
> > the source table synchronously ordered by the date and when it gets to a 
> > next day it issue a merge statement asynchronously and continues to inserts 
> > the data for the next day synchronously. And it might think that it is save 
> > to do so, since the merge statement has a snapshot it will not see the data 
> > inserted afterwards. But with this change it will break.
> > It might not be the best example, since how would the client know when 
> > the snapshot is actually captured. But I am not familiar enough with the 
> > ecosystem, does anything use the Hive by issuing the compile and run 
> > separately? Because there you could be sure before this change, that the 
> > compilation order also meant snapshot order. So summarized, I don't know 
> > what the outside world excepts of the snapshot isolation.
> 
> Denys Kuzmenko wrote:
> insert into source and merge from source into target won't conflict with 
> each other, they touch different tables. Maybe I missing something here...
> 
> Peter Varga wrote:
> My example was not perfect. I don't mean that it will conflict with the 
> insert into the source table. It can conflict with some other client's 
> transaction. My main point is, after the conflict is noticed and you 
> regenerate the snapshot it will starts to read results from transactions that 
> were opened and committed after the original query was compiled, and I'm just 
> trying to figure out, what kinf of problems can it cause, if any. In my 
> example you start to read records inserted later, but what if somebody added 
> a new partition since the compilation, wouldn't it cause problem?
> 
> Denys Kuzmenko wrote:
> probably there might be an issue as we won't create any locks for the 
> newly created partition, however we'll start reading it.
> instead of rollback & retry on Hive side we might consider to just fail 
> and let the user re-try.
> 
> Denys Kuzmenko wrote:
> however it still leaves the question what happens now in Hive when 
> somebody adds a new partition (insert with dynamic partitioning) since the 
> compilation (merge insert). I'll test this out.

You could construct an example like this:
1. open and compile transaction 1 that merge inserts data from a partitioned 
source table that has a few partition.
2. Open, run and commit transaction 2 that inserts data to an old and a new 
partition to the source table.
3. Open, run and commit transaction 3 that inserts data to the target table of 
the merge statement, that will retrigger a snapshot generation in transaction 1.
4. Run transaction 1, the snapshot will be regenerated, and I think it will 
read partial data from transaction 2 breaking the ACID properties.

But we should try the test without your patch with a little different setup.
Switch the transaction order:
1. compile transaction 1 that inserts data to an old and a new partition of the 
source table.
2. compile transaction 2 that insert data to the target table
2. compile transaction 3 that merge inserts data from the source table to the 
target table
3. run and commit transaction 1
4. run and commit transaction 2
5. run transaction 3, since it cointains 1 and 2 in its snaphot the 
isValidTxnListState will be triggered without your patch and I think it 
possible that we do a partial read of the transaction 1 for the same reasons.


- Peter


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72528/#review220838
---


On May 19, 2020, 11:19 a.m., Denys Kuzmenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72528/
> ---
> 
> (Updated May 19, 2020, 11:19 a.m.)
> 
> 
> Review request for hive, Jesús Camacho Rodríguez, Peter Varga, and Peter Vary.
> 
> 
> Bugs: HIVE-23503
> https://issues.apache.org/jira/browse/HIVE-23503
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 

Re: Review Request 72528: ValidTxnManager doesn't consider txns opened and committed between snapshot generation and locking when evaluating ValidTxnListState

2020-05-21 Thread Denys Kuzmenko via Review Board


> On May 20, 2020, 3:16 p.m., Peter Varga wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/Driver.java
> > Lines 686 (patched)
> > 
> >
> > I have concerns here, but I am not sure if they are well founded or 
> > not. I think this will break what the outside world thinks of snapshot 
> > isolation. I might have a hypothetical client that inserts lots of data in 
> > a source table and sometimes issue a merge statement from the source to the 
> > target table. They have some requirement that the target table can not have 
> > partial data regarding some property. Example they inserting sales data, 
> > and the target table can not contain half the data of a day, it can either 
> > have all or none. So what the clients does, it will issue the inserts into 
> > the source table synchronously ordered by the date and when it gets to a 
> > next day it issue a merge statement asynchronously and continues to inserts 
> > the data for the next day synchronously. And it might think that it is save 
> > to do so, since the merge statement has a snapshot it will not see the data 
> > inserted afterwards. But with this change it will break.
> > It might not be the best example, since how would the client know when 
> > the snapshot is actually captured. But I am not familiar enough with the 
> > ecosystem, does anything use the Hive by issuing the compile and run 
> > separately? Because there you could be sure before this change, that the 
> > compilation order also meant snapshot order. So summarized, I don't know 
> > what the outside world excepts of the snapshot isolation.
> 
> Denys Kuzmenko wrote:
> insert into source and merge from source into target won't conflict with 
> each other, they touch different tables. Maybe I missing something here...
> 
> Peter Varga wrote:
> My example was not perfect. I don't mean that it will conflict with the 
> insert into the source table. It can conflict with some other client's 
> transaction. My main point is, after the conflict is noticed and you 
> regenerate the snapshot it will starts to read results from transactions that 
> were opened and committed after the original query was compiled, and I'm just 
> trying to figure out, what kinf of problems can it cause, if any. In my 
> example you start to read records inserted later, but what if somebody added 
> a new partition since the compilation, wouldn't it cause problem?
> 
> Denys Kuzmenko wrote:
> probably there might be an issue as we won't create any locks for the 
> newly created partition, however we'll start reading it.
> instead of rollback & retry on Hive side we might consider to just fail 
> and let the user re-try.

however it still leaves the question what happens now in Hive when somebody 
adds a new partition (insert with dynamic partitioning) since the compilation 
(merge insert). I'll test this out.


- Denys


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72528/#review220838
---


On May 19, 2020, 11:19 a.m., Denys Kuzmenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72528/
> ---
> 
> (Updated May 19, 2020, 11:19 a.m.)
> 
> 
> Review request for hive, Jesús Camacho Rodríguez, Peter Varga, and Peter Vary.
> 
> 
> Bugs: HIVE-23503
> https://issues.apache.org/jira/browse/HIVE-23503
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> ValidTxnManager doesn't consider txns opened and committed between snapshot 
> generation and locking when evaluating ValidTxnListState. This cause issues 
> like duplicate insert in case of concurrent merge insert & insert.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/Driver.java e70c92eef4 
>   ql/src/java/org/apache/hadoop/hive/ql/DriverContext.java a8c83fc504 
>   ql/src/java/org/apache/hadoop/hive/ql/ValidTxnManager.java 7d49c57dda 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java 71afcbdc68 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DummyTxnManager.java 
> 0383881acc 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveTxnManager.java 
> 600289f837 
>   ql/src/test/org/apache/hadoop/hive/ql/lockmgr/TestDbTxnManager2.java 
> 8a15b7cc5d 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
>  65df9c2ba9 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java
>  887d4303f4 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientPreCatalog.java
>  312936efa8 
>   

Re: Review Request 72528: ValidTxnManager doesn't consider txns opened and committed between snapshot generation and locking when evaluating ValidTxnListState

2020-05-21 Thread Denys Kuzmenko via Review Board


> On May 20, 2020, 3:16 p.m., Peter Varga wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/Driver.java
> > Lines 686 (patched)
> > 
> >
> > I have concerns here, but I am not sure if they are well founded or 
> > not. I think this will break what the outside world thinks of snapshot 
> > isolation. I might have a hypothetical client that inserts lots of data in 
> > a source table and sometimes issue a merge statement from the source to the 
> > target table. They have some requirement that the target table can not have 
> > partial data regarding some property. Example they inserting sales data, 
> > and the target table can not contain half the data of a day, it can either 
> > have all or none. So what the clients does, it will issue the inserts into 
> > the source table synchronously ordered by the date and when it gets to a 
> > next day it issue a merge statement asynchronously and continues to inserts 
> > the data for the next day synchronously. And it might think that it is save 
> > to do so, since the merge statement has a snapshot it will not see the data 
> > inserted afterwards. But with this change it will break.
> > It might not be the best example, since how would the client know when 
> > the snapshot is actually captured. But I am not familiar enough with the 
> > ecosystem, does anything use the Hive by issuing the compile and run 
> > separately? Because there you could be sure before this change, that the 
> > compilation order also meant snapshot order. So summarized, I don't know 
> > what the outside world excepts of the snapshot isolation.
> 
> Denys Kuzmenko wrote:
> insert into source and merge from source into target won't conflict with 
> each other, they touch different tables. Maybe I missing something here...
> 
> Peter Varga wrote:
> My example was not perfect. I don't mean that it will conflict with the 
> insert into the source table. It can conflict with some other client's 
> transaction. My main point is, after the conflict is noticed and you 
> regenerate the snapshot it will starts to read results from transactions that 
> were opened and committed after the original query was compiled, and I'm just 
> trying to figure out, what kinf of problems can it cause, if any. In my 
> example you start to read records inserted later, but what if somebody added 
> a new partition since the compilation, wouldn't it cause problem?

probably there might be an issue as we won't create any locks for the newly 
created partition, however we'll start reading it.
instead of rollback & retry on Hive side we might consider to just fail and let 
the user re-try.


- Denys


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72528/#review220838
---


On May 19, 2020, 11:19 a.m., Denys Kuzmenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72528/
> ---
> 
> (Updated May 19, 2020, 11:19 a.m.)
> 
> 
> Review request for hive, Jesús Camacho Rodríguez, Peter Varga, and Peter Vary.
> 
> 
> Bugs: HIVE-23503
> https://issues.apache.org/jira/browse/HIVE-23503
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> ValidTxnManager doesn't consider txns opened and committed between snapshot 
> generation and locking when evaluating ValidTxnListState. This cause issues 
> like duplicate insert in case of concurrent merge insert & insert.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/Driver.java e70c92eef4 
>   ql/src/java/org/apache/hadoop/hive/ql/DriverContext.java a8c83fc504 
>   ql/src/java/org/apache/hadoop/hive/ql/ValidTxnManager.java 7d49c57dda 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java 71afcbdc68 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DummyTxnManager.java 
> 0383881acc 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveTxnManager.java 
> 600289f837 
>   ql/src/test/org/apache/hadoop/hive/ql/lockmgr/TestDbTxnManager2.java 
> 8a15b7cc5d 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
>  65df9c2ba9 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java
>  887d4303f4 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientPreCatalog.java
>  312936efa8 
>   storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java 
> b8ff03f9c4 
>   storage-api/src/java/org/apache/hadoop/hive/common/ValidTxnList.java 
> d4c3b09730 
> 
> 
> Diff: https://reviews.apache.org/r/72528/diff/1/
> 
> 
> Testing
> ---
> 
> 

Re: Review Request 72528: ValidTxnManager doesn't consider txns opened and committed between snapshot generation and locking when evaluating ValidTxnListState

2020-05-21 Thread Peter Varga via Review Board


> On May 20, 2020, 3:16 p.m., Peter Varga wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/Driver.java
> > Lines 686 (patched)
> > 
> >
> > I have concerns here, but I am not sure if they are well founded or 
> > not. I think this will break what the outside world thinks of snapshot 
> > isolation. I might have a hypothetical client that inserts lots of data in 
> > a source table and sometimes issue a merge statement from the source to the 
> > target table. They have some requirement that the target table can not have 
> > partial data regarding some property. Example they inserting sales data, 
> > and the target table can not contain half the data of a day, it can either 
> > have all or none. So what the clients does, it will issue the inserts into 
> > the source table synchronously ordered by the date and when it gets to a 
> > next day it issue a merge statement asynchronously and continues to inserts 
> > the data for the next day synchronously. And it might think that it is save 
> > to do so, since the merge statement has a snapshot it will not see the data 
> > inserted afterwards. But with this change it will break.
> > It might not be the best example, since how would the client know when 
> > the snapshot is actually captured. But I am not familiar enough with the 
> > ecosystem, does anything use the Hive by issuing the compile and run 
> > separately? Because there you could be sure before this change, that the 
> > compilation order also meant snapshot order. So summarized, I don't know 
> > what the outside world excepts of the snapshot isolation.
> 
> Denys Kuzmenko wrote:
> insert into source and merge from source into target won't conflict with 
> each other, they touch different tables. Maybe I missing something here...

My example was not perfect. I don't mean that it will conflict with the insert 
into the source table. It can conflict with some other client's transaction. My 
main point is, after the conflict is noticed and you regenerate the snapshot it 
will starts to read results from transactions that were opened and committed 
after the original query was compiled, and I'm just trying to figure out, what 
kinf of problems can it cause, if any. In my example you start to read records 
inserted later, but what if somebody added a new partition since the 
compilation, wouldn't it cause problem?


- Peter


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72528/#review220838
---


On May 19, 2020, 11:19 a.m., Denys Kuzmenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72528/
> ---
> 
> (Updated May 19, 2020, 11:19 a.m.)
> 
> 
> Review request for hive, Jesús Camacho Rodríguez, Peter Varga, and Peter Vary.
> 
> 
> Bugs: HIVE-23503
> https://issues.apache.org/jira/browse/HIVE-23503
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> ValidTxnManager doesn't consider txns opened and committed between snapshot 
> generation and locking when evaluating ValidTxnListState. This cause issues 
> like duplicate insert in case of concurrent merge insert & insert.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/Driver.java e70c92eef4 
>   ql/src/java/org/apache/hadoop/hive/ql/DriverContext.java a8c83fc504 
>   ql/src/java/org/apache/hadoop/hive/ql/ValidTxnManager.java 7d49c57dda 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java 71afcbdc68 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DummyTxnManager.java 
> 0383881acc 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveTxnManager.java 
> 600289f837 
>   ql/src/test/org/apache/hadoop/hive/ql/lockmgr/TestDbTxnManager2.java 
> 8a15b7cc5d 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
>  65df9c2ba9 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java
>  887d4303f4 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientPreCatalog.java
>  312936efa8 
>   storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java 
> b8ff03f9c4 
>   storage-api/src/java/org/apache/hadoop/hive/common/ValidTxnList.java 
> d4c3b09730 
> 
> 
> Diff: https://reviews.apache.org/r/72528/diff/1/
> 
> 
> Testing
> ---
> 
> DbTxnManager tests.
> 
> 
> Thanks,
> 
> Denys Kuzmenko
> 
>



Re: Review Request 72528: ValidTxnManager doesn't consider txns opened and committed between snapshot generation and locking when evaluating ValidTxnListState

2020-05-20 Thread Denys Kuzmenko via Review Board


> On May 20, 2020, 3:16 p.m., Peter Varga wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/Driver.java
> > Lines 686 (patched)
> > 
> >
> > I have concerns here, but I am not sure if they are well founded or 
> > not. I think this will break what the outside world thinks of snapshot 
> > isolation. I might have a hypothetical client that inserts lots of data in 
> > a source table and sometimes issue a merge statement from the source to the 
> > target table. They have some requirement that the target table can not have 
> > partial data regarding some property. Example they inserting sales data, 
> > and the target table can not contain half the data of a day, it can either 
> > have all or none. So what the clients does, it will issue the inserts into 
> > the source table synchronously ordered by the date and when it gets to a 
> > next day it issue a merge statement asynchronously and continues to inserts 
> > the data for the next day synchronously. And it might think that it is save 
> > to do so, since the merge statement has a snapshot it will not see the data 
> > inserted afterwards. But with this change it will break.
> > It might not be the best example, since how would the client know when 
> > the snapshot is actually captured. But I am not familiar enough with the 
> > ecosystem, does anything use the Hive by issuing the compile and run 
> > separately? Because there you could be sure before this change, that the 
> > compilation order also meant snapshot order. So summarized, I don't know 
> > what the outside world excepts of the snapshot isolation.

insert into source and merge from source into target won't conflict with each 
other, they touch different tables. Maybe I missing something here...


- Denys


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72528/#review220838
---


On May 19, 2020, 11:19 a.m., Denys Kuzmenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72528/
> ---
> 
> (Updated May 19, 2020, 11:19 a.m.)
> 
> 
> Review request for hive, Jesús Camacho Rodríguez, Peter Varga, and Peter Vary.
> 
> 
> Bugs: HIVE-23503
> https://issues.apache.org/jira/browse/HIVE-23503
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> ValidTxnManager doesn't consider txns opened and committed between snapshot 
> generation and locking when evaluating ValidTxnListState. This cause issues 
> like duplicate insert in case of concurrent merge insert & insert.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/Driver.java e70c92eef4 
>   ql/src/java/org/apache/hadoop/hive/ql/DriverContext.java a8c83fc504 
>   ql/src/java/org/apache/hadoop/hive/ql/ValidTxnManager.java 7d49c57dda 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java 71afcbdc68 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DummyTxnManager.java 
> 0383881acc 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveTxnManager.java 
> 600289f837 
>   ql/src/test/org/apache/hadoop/hive/ql/lockmgr/TestDbTxnManager2.java 
> 8a15b7cc5d 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
>  65df9c2ba9 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java
>  887d4303f4 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientPreCatalog.java
>  312936efa8 
>   storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java 
> b8ff03f9c4 
>   storage-api/src/java/org/apache/hadoop/hive/common/ValidTxnList.java 
> d4c3b09730 
> 
> 
> Diff: https://reviews.apache.org/r/72528/diff/1/
> 
> 
> Testing
> ---
> 
> DbTxnManager tests.
> 
> 
> Thanks,
> 
> Denys Kuzmenko
> 
>



Re: Review Request 72528: ValidTxnManager doesn't consider txns opened and committed between snapshot generation and locking when evaluating ValidTxnListState

2020-05-20 Thread Peter Varga via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72528/#review220838
---



The patch looks great, but I have concers. See below.


ql/src/java/org/apache/hadoop/hive/ql/Driver.java
Lines 686 (patched)


I have concerns here, but I am not sure if they are well founded or not. I 
think this will break what the outside world thinks of snapshot isolation. I 
might have a hypothetical client that inserts lots of data in a source table 
and sometimes issue a merge statement from the source to the target table. They 
have some requirement that the target table can not have partial data regarding 
some property. Example they inserting sales data, and the target table can not 
contain half the data of a day, it can either have all or none. So what the 
clients does, it will issue the inserts into the source table synchronously 
ordered by the date and when it gets to a next day it issue a merge statement 
asynchronously and continues to inserts the data for the next day 
synchronously. And it might think that it is save to do so, since the merge 
statement has a snapshot it will not see the data inserted afterwards. But with 
this change it will break.
It might not be the best example, since how would the client know when the 
snapshot is actually captured. But I am not familiar enough with the ecosystem, 
does anything use the Hive by issuing the compile and run separately? Because 
there you could be sure before this change, that the compilation order also 
meant snapshot order. So summarized, I don't know what the outside world 
excepts of the snapshot isolation.


- Peter Varga


On May 19, 2020, 11:19 a.m., Denys Kuzmenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72528/
> ---
> 
> (Updated May 19, 2020, 11:19 a.m.)
> 
> 
> Review request for hive, Jesús Camacho Rodríguez, Peter Varga, and Peter Vary.
> 
> 
> Bugs: HIVE-23503
> https://issues.apache.org/jira/browse/HIVE-23503
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> ValidTxnManager doesn't consider txns opened and committed between snapshot 
> generation and locking when evaluating ValidTxnListState. This cause issues 
> like duplicate insert in case of concurrent merge insert & insert.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/Driver.java e70c92eef4 
>   ql/src/java/org/apache/hadoop/hive/ql/DriverContext.java a8c83fc504 
>   ql/src/java/org/apache/hadoop/hive/ql/ValidTxnManager.java 7d49c57dda 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java 71afcbdc68 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DummyTxnManager.java 
> 0383881acc 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveTxnManager.java 
> 600289f837 
>   ql/src/test/org/apache/hadoop/hive/ql/lockmgr/TestDbTxnManager2.java 
> 8a15b7cc5d 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
>  65df9c2ba9 
>   
> standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java
>  887d4303f4 
>   
> standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientPreCatalog.java
>  312936efa8 
>   storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java 
> b8ff03f9c4 
>   storage-api/src/java/org/apache/hadoop/hive/common/ValidTxnList.java 
> d4c3b09730 
> 
> 
> Diff: https://reviews.apache.org/r/72528/diff/1/
> 
> 
> Testing
> ---
> 
> DbTxnManager tests.
> 
> 
> Thanks,
> 
> Denys Kuzmenko
> 
>



Review Request 72528: ValidTxnManager doesn't consider txns opened and committed between snapshot generation and locking when evaluating ValidTxnListState

2020-05-19 Thread Denys Kuzmenko via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72528/
---

Review request for hive, Peter Varga and Peter Vary.


Bugs: HIVE-23503
https://issues.apache.org/jira/browse/HIVE-23503


Repository: hive-git


Description
---

ValidTxnManager doesn't consider txns opened and committed between snapshot 
generation and locking when evaluating ValidTxnListState. This cause issues 
like duplicate insert in case of concurrent merge insert & insert.


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/Driver.java e70c92eef4 
  ql/src/java/org/apache/hadoop/hive/ql/DriverContext.java a8c83fc504 
  ql/src/java/org/apache/hadoop/hive/ql/ValidTxnManager.java 7d49c57dda 
  ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DbTxnManager.java 71afcbdc68 
  ql/src/java/org/apache/hadoop/hive/ql/lockmgr/DummyTxnManager.java 0383881acc 
  ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveTxnManager.java 600289f837 
  ql/src/test/org/apache/hadoop/hive/ql/lockmgr/TestDbTxnManager2.java 
8a15b7cc5d 
  
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
 65df9c2ba9 
  
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java
 887d4303f4 
  
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClientPreCatalog.java
 312936efa8 
  storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java 
b8ff03f9c4 
  storage-api/src/java/org/apache/hadoop/hive/common/ValidTxnList.java 
d4c3b09730 


Diff: https://reviews.apache.org/r/72528/diff/1/


Testing
---

DbTxnManager tests.


Thanks,

Denys Kuzmenko