[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)

2020-01-27 Thread ramkrishna.s.vasudevan (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024186#comment-17024186
 ] 

ramkrishna.s.vasudevan commented on HBASE-23730:


> This route would work for s3 and hdfs
I think that if the WAL marker is used effectively in replays (and for read 
replicas, where I think it matters even more), then even on HDFS we can 
avoid the rename-style algorithm. Good point. 

> Optimize Memstore Flush for Hbase on S3(Object Store)
> -
>
> Key: HBASE-23730
> URL: https://issues.apache.org/jira/browse/HBASE-23730
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver
>Reporter: Jarred Li
>Priority: Major
>
> The current Memstore Flush Process is divided into 2 stages:
>  # Flushcache: In this stage, a “.tmp” region file is written to S3/HDFS for 
> the memstore;
>  # Commit: In this stage, the “.tmp” file created in stage 1 is renamed 
> to the final destination of the HBase region file.
> The above design (flush and commit) is OK for HDFS because “rename” is a light 
> operation (metadata only). However, for storage like S3 or other 
> object stores, rename is a “copy” followed by a “delete”.
> We can follow the same pattern as V2 of “FileOutputCommitter” in 
> MapReduce. That is, we can write the hfile directly to the S3 destination 
> directory without the “copy” and “delete”, so that we issue fewer S3 
> operations and the HBase memstore flush is more efficient.
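
To make the cost difference in the description concrete, here is a minimal sketch using a toy in-memory store. Everything in it (ToyObjectStore, the key names, the counters) is hypothetical and is not HBase or S3 client code; it only assumes, as the description states, that an object-store rename is a copy followed by a delete.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory object store (hypothetical) that counts requests and bytes handled. */
class ToyObjectStore {
    private final Map<String, byte[]> objects = new HashMap<>();
    long bytesWrittenOrCopied = 0; // bytes stored by put() plus bytes duplicated by copy()
    int requests = 0;              // one per put, copy, or delete

    void put(String key, byte[] data) {
        objects.put(key, data.clone());
        bytesWrittenOrCopied += data.length;
        requests++;
    }

    void copy(String from, String to) {
        byte[] data = objects.get(from);
        if (data == null) {
            throw new IllegalStateException("no such object: " + from);
        }
        objects.put(to, data.clone());
        bytesWrittenOrCopied += data.length;
        requests++;
    }

    void delete(String key) {
        objects.remove(key);
        requests++;
    }

    boolean exists(String key) {
        return objects.containsKey(key);
    }

    /** Rename as an object store has to emulate it: copy, then delete the source. */
    void rename(String from, String to) {
        copy(from, to);
        delete(from);
    }

    /** Current two-stage flush: write a .tmp file, then rename it into place. */
    static ToyObjectStore flushViaTmp(byte[] hfile) {
        ToyObjectStore store = new ToyObjectStore();
        store.put("region/.tmp/hfile", hfile);
        store.rename("region/.tmp/hfile", "region/cf/hfile");
        return store;
    }

    /** Proposed flush: write the hfile straight to its final key. */
    static ToyObjectStore flushDirect(byte[] hfile) {
        ToyObjectStore store = new ToyObjectStore();
        store.put("region/cf/hfile", hfile);
        return store;
    }
}
```

In this model the two-stage flush issues three requests and handles the hfile bytes twice, while the direct write issues one request and handles them once.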



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)

2020-01-24 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023377#comment-17023377
 ] 

Michael Stack commented on HBASE-23730:
---

See WALUtil#writeFlushMarker. IIRC, we write a marker into the WAL when we 
start a flush, with detail like the flush file name. I believe we do the same, 
writing a close marker, when done. On recovery of the Region, should it crash 
during flush, we'll replay the WAL. If there is a 'start' marker w/o a 'close' 
marker, then we can safely delete the file as not complete. See how compaction 
does similar. This way you wouldn't have to maintain a side file that may or 
may not be there when you most need it on the eventually consistent S3. This 
route would work for s3 and hdfs (would be an improvement over the current 
mechanism, doing away w/ the namenode interaction for the rename).
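
A rough sketch of the replay check described above, assuming the WAL has already been reduced to an ordered list of flush markers. The Marker class and the START/DONE action names are hypothetical stand-ins for illustration, not the actual HBase marker types.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of flush markers as seen during WAL replay. */
class FlushMarkerReplay {
    enum Action { START, DONE }

    static final class Marker {
        final Action action;
        final String fileName;

        Marker(Action action, String fileName) {
            this.action = action;
            this.fileName = fileName;
        }
    }

    /**
     * Returns the flush files that have a START marker but no later DONE
     * marker, in WAL order. These are the files that are safe to delete
     * as incomplete.
     */
    static List<String> incompleteFlushFiles(List<Marker> walMarkers) {
        Set<String> open = new LinkedHashSet<>();
        for (Marker marker : walMarkers) {
            if (marker.action == Action.START) {
                open.add(marker.fileName);
            } else {
                open.remove(marker.fileName);
            }
        }
        return new ArrayList<>(open);
    }
}
```

Recovery would then delete every file this method returns before opening the region's store files; no side file has to be listed or read from the store.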



[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)

2020-01-24 Thread Jarred Li (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023371#comment-17023371
 ] 

Jarred Li commented on HBASE-23730:
---

Hi Sean,

Thanks for your comments. I agree with Stack and you that we should have a 
single code path for future maintenance. I am wondering whether we can use the 
"template" design pattern so that the main flow stays the same, and put the 
different implementations for HDFS and object stores in subclasses?
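
A minimal sketch of what that "template" idea could look like. All class and method names here are hypothetical, and the steps are only recorded as strings rather than performed against a real filesystem.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical template: the main flush flow is fixed, the two steps vary per store. */
abstract class FlushTemplate {
    protected final List<String> steps = new ArrayList<>();

    /** Main flow shared by every store type: write the file, then commit it. */
    final List<String> flush(String hfileName) {
        String writtenPath = writeStoreFile(hfileName);
        commit(writtenPath, hfileName);
        return steps;
    }

    protected abstract String writeStoreFile(String hfileName);

    protected abstract void commit(String writtenPath, String hfileName);
}

/** HDFS-style variant: write under .tmp, then rename into the final location. */
class RenameCommitFlush extends FlushTemplate {
    @Override
    protected String writeStoreFile(String hfileName) {
        String tmpPath = ".tmp/" + hfileName;
        steps.add("write " + tmpPath);
        return tmpPath;
    }

    @Override
    protected void commit(String writtenPath, String hfileName) {
        steps.add("rename " + writtenPath + " -> cf/" + hfileName);
    }
}

/** Object-store variant: write directly to the final location, no rename. */
class DirectWriteFlush extends FlushTemplate {
    @Override
    protected String writeStoreFile(String hfileName) {
        String finalPath = "cf/" + hfileName;
        steps.add("write " + finalPath);
        return finalPath;
    }

    @Override
    protected void commit(String writtenPath, String hfileName) {
        // How completion is recorded is left open in this sketch.
        steps.add("mark complete " + writtenPath);
    }
}
```

Calling new DirectWriteFlush().flush("76ce2875960") runs the same two-step flow as the rename variant, with only the step bodies differing.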

Both of you are experts, I'd like to hear more suggestions from you. Thanks.



[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)

2020-01-24 Thread Sean Busbey (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023357#comment-17023357
 ] 

Sean Busbey commented on HBASE-23730:
-

I second Stack's request that we implement this in a way that there's a single 
code path used both for HDFS and object stores. It'll be much easier to 
maintain.



[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)

2020-01-24 Thread Jarred Li (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023349#comment-17023349
 ] 

Jarred Li commented on HBASE-23730:
---

Hi Michael, thank you very much for your comments.

I think we can use a flag file to indicate the completion of a flush. We 
generate two files during the "Flushcache" stage, let's say "76ce2875960" and 
"76ce2875960.inprogress". The "76ce2875960.inprogress" file is the flag file 
that indicates the flush is in progress. During the "Commit" stage, we don't 
call "rename" on the region file (which is a copy and delete on an object 
store); we call "delete" on the flag file to complete the flush. Since the flag 
file is just an empty file and no data is moved, the performance is better. 
Moreover, "rename" on an object store is not an atomic operation, so it is 
error-prone. Deleting an empty file is relatively easy.

The above solution is only for object stores such as S3. For HDFS, I think we 
should keep it as is, since rename on HDFS is an atomic operation.
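
A small sketch of the flag-file flow, using a local directory via java.nio.file as a stand-in for the object store. The class and method names are hypothetical, creating the flag before the data file is an assumption of this sketch, and object-store consistency behaviour is not modelled.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical flag-file flush against a local directory standing in for the store. */
class FlagFileFlush {
    static final String SUFFIX = ".inprogress";

    /** "Flushcache": create the empty flag, then write the data file in its final place. */
    static Path flushCache(Path dir, String name, byte[] data) {
        try {
            Files.createFile(dir.resolve(name + SUFFIX));
            return Files.write(dir.resolve(name), data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** "Commit": delete the flag file; the data file is never renamed. */
    static void commit(Path dir, String name) {
        try {
            Files.delete(dir.resolve(name + SUFFIX));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Recovery: delete every data file whose flag still exists, then the flag itself. */
    static int cleanUpIncomplete(Path dir) {
        try {
            // Collect the flags first so nothing is deleted while the directory is being listed.
            List<Path> flags = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*" + SUFFIX)) {
                for (Path flag : stream) {
                    flags.add(flag);
                }
            }
            for (Path flag : flags) {
                String flagName = flag.getFileName().toString();
                String dataName = flagName.substring(0, flagName.length() - SUFFIX.length());
                Files.deleteIfExists(dir.resolve(dataName));
                Files.delete(flag);
            }
            return flags.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A flush that reaches commit leaves only the data file behind; a flush interrupted before commit leaves the flag, and cleanUpIncomplete removes both files on the next open.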

 



[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)

2020-01-23 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022700#comment-17022700
 ] 

Michael Stack commented on HBASE-23730:
---

What if we crash during flush? On reopen of the region, how do we know a file 
is only half-done rather than just corrupt? Will WAL replay clean up half-done 
files?

Otherwise, sounds good. Make an implementation that makes it so we can use it 
against hdfs too -- skipping the .tmp file and rename. Thanks.
