[jira] [Comment Edited] (HBASE-20431) Store commit transaction for filesystems that do not support an atomic rename

2018-04-26 Thread Sean Mackrory (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454836#comment-16454836
 ] 

Sean Mackrory edited comment on HBASE-20431 at 4/26/18 8:39 PM:


{quote}we can, but it'd restrict the code to requiring a DDB, which would make 
the WDC and Ceph groups sad. I think Andrew could get by without it, if a 
single file is all that's needed for the commit.{quote}

It could be an extension to the MetadataStore interface that can be ported to 
any implementation, but yeah - I only bring this up for completeness of 
discussion, since I've thought about the problem a lot. If different 
applications depend on different characteristics of a true atomic rename, you'd 
need different rename strategies to support all of them; so if the application 
can avoid the problem entirely by depending on different semantics, that's 
probably better.

{quote}I think S3Guard may still be needed on AWS to ensure that once an object 
has become visible it remains visible, right? When enumerating a bucket we need 
to get back a list of committed objects aka "files" that always includes 
everything that has been committed.{quote}

S3Guard should give you a reliable listing of objects, and Get-After-Put is a 
documented guarantee in Amazon S3 these days, but be aware that if you request 
an object before it exists, it might still appear not to exist after it's 
written (I think of this as Get-After-Put-After-Get consistency). There's 
nothing S3Guard could do about that without actually duplicating the data 
itself. The best we can do is block requests for objects that S3Guard doesn't 
know exist (which would require what we call "authoritative mode" to be 
implemented, plus some other work). Alternatively, if you never request 
anything you didn't discover by listing, you don't have this problem.
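The hazard described above can be illustrated with a toy in-memory model (this 
is NOT real S3 or S3Guard client code; the negative cache is an assumption 
standing in for whatever internal state makes a pre-existence GET sticky):

```python
class NegativeCachingStore:
    """Toy model of a store whose front end remembers "not found" results,
    illustrating Get-After-Put-After-Get: a GET issued before an object
    exists can make it appear absent even after it is written, while a
    plain Get-After-Put stays consistent."""

    def __init__(self):
        self._objects = {}
        self._negative_cache = set()  # keys (possibly stale-ly) known absent

    def get(self, key):
        if key in self._negative_cache:
            return None  # stale miss served from the negative cache
        value = self._objects.get(key)
        if value is None:
            self._negative_cache.add(key)  # remember the miss
        return value

    def put(self, key, value):
        # The PUT lands, but nothing invalidates the cached miss.
        self._objects[key] = value


# GET before the object exists poisons later reads of it:
store = NegativeCachingStore()
store.get("marker")                  # miss; negatively cached
store.put("marker", b"done")
stale = store.get("marker")          # still appears absent

# Get-After-Put without the prior GET is consistent:
store2 = NegativeCachingStore()
store2.put("marker", b"done")
fresh = store2.get("marker")
```

This is only a model of the failure mode, not a claim about S3 internals; the 
point is that the application can sidestep it by never probing for objects it 
hasn't discovered by listing.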



was (Author: mackrorysd):
{quote}we can, but it'd restrict the code to requiring a DDB, which would make 
the WDC and Ceph groups sad. I think Andrew could get by without it, if a 
single file is all that's needed for the commit.{quote}

It could be an extension to the MetadataStore interface that can be ported to 
any implementation, but yeah - I only bring this up for completeness of 
discussion, since I've thought about the problem a lot. If different 
applications depend on different characteristics of a true atomic rename, you'd 
need different rename strategies to support all of them, so if the application 
can avoid the problem entirely by depending on different semantics that's 
probably better.

{quote}I think S3Guard may still be needed on AWS to ensure that once an object 
has become visible it remains visible, right? When enumerating a bucket we need 
to get back a list of committed objects aka "files" that always includes 
everything that has been committed.{quote}

S3Guard should give you a reliable listing of objects, and Get-After-Put is a 
documented guarantee in Amazon S3 these days, but do be aware that if you 
request an object before it exists, it might still appear to not exist after 
it's written. There's nothing S3Guard could do about that without actually 
duplicating the data itself - the best we can do is block requests for objects 
that S3Guard doesn't know exist (which would require what we call 
"authoritative mode" to be implemented and some other work) or if you don't 
request anything you don't discover by listing then you don't have that problem.



[jira] [Comment Edited] (HBASE-20431) Store commit transaction for filesystems that do not support an atomic rename

2018-04-26 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454775#comment-16454775
 ] 

Andrew Purtell edited comment on HBASE-20431 at 4/26/18 8:03 PM:
-

I think S3Guard is still needed on AWS to ensure that once an object has become 
visible it remains visible, right? When enumerating a bucket we need to get 
back a list of committed objects aka "files" that always includes everything 
that has been committed.

Sounds like Ceph RGW has saner semantics. Since it is more likely we'd run 
HBase on Ceph than on AWS (although both are in design scope), this is 
heartening.



was (Author: apurtell):
I think S3Guard is still needed on AWS to ensure that once an object has become 
visible it remains visible, right? 

Sounds like Ceph RGW has saner semantics and since it is more likely we'd run 
HBase on Ceph than HBase on AWS, although both are in design scope, this is 
heartening.


> Store commit transaction for filesystems that do not support an atomic rename
> -
>
> Key: HBASE-20431
> URL: https://issues.apache.org/jira/browse/HBASE-20431
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Andrew Purtell
>Priority: Major
>
> HBase expects the Hadoop filesystem implementation to support an atomic 
> rename() operation. HDFS does. The S3 backed filesystems do not. The 
> fundamental issue is the non-atomic and eventually consistent nature of the 
> S3 service. A S3 bucket is not a filesystem. S3 is not always immediately 
> read-your-writes. Object metadata can be temporarily inconsistent just after 
> new objects are stored. There can be a settling period to ride over. 
> Renaming/moving objects from one path to another are copy operations with 
> O(file) complexity and O(data) time followed by a series of deletes with 
> O(file) complexity. Failures at any point prior to completion will leave the 
> operation in an inconsistent state. The missing atomic rename semantic opens 
> opportunities for corruption and data loss, which may or may not be 
> repairable with HBCK.
> Handling this at the HBase level could be done with a new multi-step 
> filesystem transaction framework. Call it StoreCommitTransaction. 
> SplitTransaction and MergeTransaction are well established cases where even 
> on HDFS we have non-atomic filesystem changes and are our implementation 
> template for the new work. In this new StoreCommitTransaction we'd be moving 
> flush and compaction temporaries out of the temporary directory into the 
> region store directory. On HDFS the implementation would be easy. We can rely 
> on the filesystem's atomic rename semantics. On S3 it would be work: First we 
> would build the list of objects to move, then copy each object into the 
> destination, and then finally delete all objects at the original path. We 
> must handle transient errors with retry strategies appropriate for the action 
> at hand. We must handle serious or permanent errors where the RS doesn't need 
> to be aborted with a rollback that cleans it all up. Finally, we must handle 
> permanent errors where the RS must be aborted with a rollback during region 
> open/recovery. Note that after all objects have been copied and we are 
> deleting obsolete source objects we must roll forward, not back. To support 
> recovery after an abort we must utilize the WAL to track transaction 
> progress. Put markers in for StoreCommitTransaction start and completion 
> state, with details of the store file(s) involved, so it can be rolled back 
> during region recovery at open. This will be significant work in HFile, 
> HStore, flusher, compactor, and HRegion. Wherever we use HDFS's rename now we 
> would substitute the running of this new multi-step filesystem transaction.
> We need to determine this for certain, but I believe on S3 the PUT or 
> multipart upload of an object must complete before the object is visible, so 
> we don't have to worry about the case where an object is visible before fully 
> uploaded as part of normal operations. So an individual object copy will 
> either happen entirely and the target will then become visible, or it won't 
> and the target won't exist.
> S3 has an optimization, PUT COPY 
> (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html), which 
> the AmazonClient embedded in S3A utilizes for moves. When designing the 
> StoreCommitTransaction be sure to allow for filesystem implementations that 
> leverage a server side copy operation. Doing a get-then-put should be 
> optional. (Not sure Hadoop has an interface that advertises this capability 
> yet; we can add one if not.)
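The multi-step commit described in the quoted issue text (list sources, copy 
each object, write WAL markers, then delete sources, rolling forward rather 
than back once every copy is visible) could be sketched roughly as below. This 
is an illustrative sketch, not the HBase implementation; `MemStore`, `MemWAL`, 
and all names are assumptions standing in for the real filesystem and WAL APIs:

```python
from enum import Enum


class TxnState(Enum):
    START = 1      # WAL marker: transaction began, source objects listed
    COPIED = 2     # WAL marker: every destination object is now visible
    COMPLETE = 3   # WAL marker: sources deleted, transaction finished


class MemStore:
    """Stand-in for the object store; only the operations the sketch needs."""
    def __init__(self):
        self.objects = {}

    def list(self, prefix):
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.objects[dst] = self.objects[src]  # models a server-side PUT COPY

    def delete(self, key):
        del self.objects[key]

    def delete_if_exists(self, key):
        self.objects.pop(key, None)


class MemWAL:
    """Stand-in for WAL marker records."""
    def __init__(self):
        self.entries = []

    def append(self, state, keys):
        self.entries.append((state, list(keys)))


class StoreCommitTransaction:
    """Drive a non-atomic "rename" to completion using WAL markers.
    Failures before the COPIED marker roll back (delete partial
    destinations); failures after it must roll forward (finish deletes)."""

    def __init__(self, store, wal, src_prefix, dst_prefix):
        self.store, self.wal = store, wal
        self.src_prefix, self.dst_prefix = src_prefix, dst_prefix

    def _dst(self, key):
        return self.dst_prefix + key[len(self.src_prefix):]

    def run(self):
        sources = self.store.list(self.src_prefix)
        self.wal.append(TxnState.START, sources)
        try:
            for key in sources:
                self.store.copy(key, self._dst(key))  # retry transients here
        except Exception:
            for key in sources:                       # roll back
                self.store.delete_if_exists(self._dst(key))
            raise
        self.wal.append(TxnState.COPIED, sources)     # point of no return
        for key in sources:
            self.store.delete(key)                    # roll forward only
        self.wal.append(TxnState.COMPLETE, sources)


# Move two flush temporaries into a (hypothetical) region store directory:
store, wal = MemStore(), MemWAL()
store.objects[".tmp/hfile1"] = b"a"
store.objects[".tmp/hfile2"] = b"b"
StoreCommitTransaction(store, wal, ".tmp/", "region/cf/").run()
```

On recovery at region open, a WAL showing START without COPIED would mean 
rollback; COPIED without COMPLETE would mean re-running the delete phase.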



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-20431) Store commit transaction for filesystems that do not support an atomic rename

2018-04-26 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454775#comment-16454775
 ] 

Andrew Purtell edited comment on HBASE-20431 at 4/26/18 8:03 PM:
-

I think S3Guard may still be needed on AWS to ensure that once an object has 
become visible it remains visible, right? When enumerating a bucket we need to 
get back a list of committed objects aka "files" that always includes 
everything that has been committed.

Sounds like Ceph RGW has saner semantics. Since it is more likely we'd run 
HBase on Ceph than on AWS (although both are in design scope), this is 
heartening.



was (Author: apurtell):
I think S3Guard is still needed on AWS to ensure that once an object has become 
visible it remains visible, right? When enumerating a bucket we need to get 
back a list of committed objects aka "files" that always includes everything 
that has been committed.

Sounds like Ceph RGW has saner semantics and since it is more likely we'd run 
HBase on Ceph than HBase on AWS, although both are in design scope, this is 
heartening.



[jira] [Comment Edited] (HBASE-20431) Store commit transaction for filesystems that do not support an atomic rename

2018-04-24 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451429#comment-16451429
 ] 

Andrew Purtell edited comment on HBASE-20431 at 4/24/18 11:56 PM:
--

What this issue proposes is a store commit transaction, like SplitTransaction, 
where we have to compose something with atomic semantics out of individual 
actions which cannot natively be combined into an atomic operation, because the 
underlying store can't do it. So we won't need an atomic directory rename, 
because we can't get that from S3 anyway; all we require from S3 or the S3 
analogue is an atomic single-file commit, either via PUT or PUT-COPY. The rest 
will be up to the proposed framework to manage (roll back, roll forward, etc.).

S3Guard is still important, because we need a consistent view of the bucket. 
Inconsistencies in object listings, for example, could result in incorrect 
results, effectively data loss.
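The "atomic single file commit" idea can be sketched with a commit-marker 
pattern (purely illustrative; the in-memory store and the `_COMMITTED` marker 
name are assumptions, not S3 or HBase APIs). The only atomicity required of 
the store is that a single-object PUT either fully happens or doesn't:

```python
class MarkerCommit:
    """Make a multi-object write atomic via one marker PUT: data objects
    are uploaded first, and readers ignore everything under a transaction
    until its single commit marker object exists."""

    def __init__(self):
        self.objects = {}

    def put(self, key, value):
        self.objects[key] = value              # stands in for an atomic PUT

    def commit(self, txn_id, files):
        for name, data in files.items():
            self.put(f"{txn_id}/{name}", data)
        self.put(f"{txn_id}/_COMMITTED", b"")  # the atomic commit point

    def visible_files(self, txn_id):
        if f"{txn_id}/_COMMITTED" not in self.objects:
            return []                          # uncommitted data is invisible
        return sorted(k for k in self.objects
                      if k.startswith(txn_id + "/")
                      and not k.endswith("/_COMMITTED"))


s = MarkerCommit()
s.put("txn1/hfile", b"data")             # uploaded but never committed
uncommitted = s.visible_files("txn1")    # readers see nothing
s.commit("txn2", {"hfile": b"data"})
committed = s.visible_files("txn2")      # readers see the whole set
```

Note this pattern still depends on the consistent listing S3Guard provides: if 
enumeration under the transaction prefix can miss objects, the marker doesn't 
help.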


was (Author: apurtell):
What this issue proposes is a store commit transaction, like SplitTransaction, 
where we have to compose something with atomic semantics out of individual 
actions which cannot natively be combined into an atomic operation, because the 
underlying store can't do it. So, we won't need an atomic directory rename, 
because we can't get this from S3 anyway, and all we require from S3 or the S3 
analogue is atomic single file commit, either via PUT or PUT-COPY. The rest 
will be up to the proposed framework to manage (roll back, roll forward, etc.)


[jira] [Comment Edited] (HBASE-20431) Store commit transaction for filesystems that do not support an atomic rename

2018-04-24 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451429#comment-16451429
 ] 

Andrew Purtell edited comment on HBASE-20431 at 4/24/18 11:55 PM:
--

What this issue proposes is a store commit transaction, like SplitTransaction, 
where we have to compose something with atomic semantics out of individual 
actions which cannot natively be combined into an atomic operation, because the 
underlying store can't do it. So, we won't need an atomic directory rename, 
because we can't get this from S3 anyway, and all we require from S3 or the S3 
analogue is atomic single file commit, either via PUT or PUT-COPY. The rest 
will be up to the proposed framework to manage (roll back, roll forward, etc.)


was (Author: apurtell):
What this issue proposes is a store commit transaction, like SplitTransaction, 
where we have to compose something with atomic semantics out of individual 
actions which cannot be combined into an atomic operation. So, we won't need an 
atomic directory rename, because we can't get this from S3 anyway, and all we 
require from S3 or the S3 analogue is atomic single file commit, either via PUT 
or PUT-COPY. The rest will be up to the proposed framework to manage (roll 
back, roll forward, etc.)


[jira] [Comment Edited] (HBASE-20431) Store commit transaction for filesystems that do not support an atomic rename

2018-04-24 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451429#comment-16451429
 ] 

Andrew Purtell edited comment on HBASE-20431 at 4/24/18 11:54 PM:
--

What this issue proposes is a store commit transaction, like SplitTransaction, 
where we have to compose something with atomic semantics out of individual 
actions which cannot be combined into an atomic operation. So, we won't need an 
atomic directory rename, because we can't get this from S3 anyway, and all we 
require from S3 or the S3 analogue is atomic single file commit, either via PUT 
or PUT-COPY. The rest will be up to the proposed framework to manage (roll 
back, roll forward, etc.)


was (Author: apurtell):
What this issue proposes is a store commit transaction, like SplitTransaction, 
where we have to compose something with atomic semantics out of individual 
actions which cannot be combined into an atomic operation. So, all we require 
from S3 or the S3 analogue is atomic single file commit, either via PUT or 
PUT-COPY. The rest will be up to the proposed framework to manage (roll back, 
roll forward, etc.)


[jira] [Comment Edited] (HBASE-20431) Store commit transaction for filesystems that do not support an atomic rename

2018-04-24 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451429#comment-16451429
 ] 

Andrew Purtell edited comment on HBASE-20431 at 4/24/18 11:52 PM:
--

What this issue proposes is a store commit transaction, like SplitTransaction, 
where we have to compose something with atomic semantics out of individual 
actions which cannot be combined into an atomic operation. So, all we require 
from S3 or the S3 analogue is atomic single file commit, either via PUT or 
PUT-COPY. The rest will be up to the proposed framework to manage (roll back, 
roll forward, etc.)


was (Author: apurtell):
What this issue proposes is a store commit transaction, like SplitTransaction, 
where we have to compose something with atomic semantics out of individual 
actions which cannot be combined into an atomic operation. So, all we require 
from S3 or the S3 analogue is atomic single file move. The rest will be up to 
the proposed framework to manage (roll back, roll forward, etc.)



