[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)
[ https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024186#comment-17024186 ]

ramkrishna.s.vasudevan commented on HBASE-23730:
------------------------------------------------

bq. This route would work for s3 and hdfs

I think if the WAL marker is used effectively in replays (and during read replicas, where I think it matters more), then even on HDFS we can avoid the rename type of algorithm. Good point.

> Optimize Memstore Flush for Hbase on S3(Object Store)
> -----------------------------------------------------
>
>                 Key: HBASE-23730
>                 URL: https://issues.apache.org/jira/browse/HBASE-23730
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Jarred Li
>            Priority: Major
>
> The current Memstore Flush Process is divided into 2 stages:
> # Flushcache: In this stage, a “.tmp” region file is written in S3/HDFS for the memstore;
> # Commit: In this stage, the “.tmp” file created in stage 1 is renamed to the final destination of the HBase region file.
> The above design (flush and commit) is OK for HDFS because “rename” is a light operation (metadata only). However, for storage like S3 or other object stores, rename is a “copy” and “delete” operation.
> We can follow the same pattern as V2 of “FileOutputCommitter” in MapReduce. That means we can write the hfile directly to the S3 destination directory without “copy” and “delete”, so that we have fewer S3 operations and the HBase memstore flush is more efficient.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)
[ https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023377#comment-17023377 ]

Michael Stack commented on HBASE-23730:
---------------------------------------

See WALUtil#writeFlushMarker. IIRC, we write a marker into the WAL when we start a flush, with details such as the flush file name. I believe we do the same when done, writing a close marker. On recovery of the Region, should it crash during flush, we'll replay the WAL. If there is a 'start' marker w/o a 'close' marker, then we can safely delete the file as incomplete. See how compaction does something similar. This way you wouldn't have to maintain a side file that may or may not be there when you most need it on the eventually consistent S3.

This route would work for s3 and hdfs (and would be an improvement over the current mechanism, doing away w/ the namenode interaction for the rename).
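A rough sketch of the recovery rule described in the comment above, for illustration only. It is not HBase code: the `FlushMarkerReplay` class, the `Marker` type and the `START`/`CLOSE` names are hypothetical stand-ins for whatever the WAL flush markers actually carry. It assumes the markers are read back in WAL order and that each one names the flush file.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of marker-based cleanup; none of these names are HBase APIs. */
final class FlushMarkerReplay {
    /** Simplified marker kinds: flush started, flush finished. */
    enum Kind { START, CLOSE }

    /** Stand-in for a flush marker read back from the WAL during replay. */
    static final class Marker {
        final Kind kind;
        final String fileName;

        Marker(Kind kind, String fileName) {
            this.kind = kind;
            this.fileName = fileName;
        }
    }

    /**
     * Returns the flush files that have a START marker but no later CLOSE marker,
     * in the order their flushes started. These are the candidates for deletion.
     */
    static List<String> incompleteFlushFiles(List<Marker> walEntries) {
        Set<String> open = new LinkedHashSet<>();
        for (Marker m : walEntries) {
            if (m.kind == Kind.START) {
                open.add(m.fileName);      // flush began, outcome unknown so far
            } else {
                open.remove(m.fileName);   // flush finished, file is safe to keep
            }
        }
        return new ArrayList<>(open);
    }
}
```

With entries START(hfile-a), CLOSE(hfile-a), START(hfile-b), only hfile-b is reported, since its flush never reached the close marker.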
[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)
[ https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023371#comment-17023371 ]

Jarred Li commented on HBASE-23730:
-----------------------------------

Hi Sean,

Thanks for your comments. I agree with Stack and you that we should have a single code path for future maintenance. I am wondering whether we can use the "template" design pattern so that the main flow stays the same, and put the different implementations for HDFS and object store in subclasses?

Both of you are experts, so I'd like to hear more suggestions from you. Thanks.
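For readers unfamiliar with the "template" (template method) pattern mentioned above, here is a minimal sketch of the shape being proposed. Everything in it is hypothetical: the class names, the method names and the string log are illustrative only, and the steps are recorded rather than executed against any filesystem.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: the shared flush flow lives here, storage details in subclasses. */
abstract class FlushTemplate {
    private final List<String> log = new ArrayList<>();

    /** The main flow, identical for every backend: write, then commit. */
    final List<String> flush(String hfileName) {
        String written = write(hfileName);
        commit(written, hfileName);
        return log;
    }

    /** Records a step so the sketch can be inspected without touching storage. */
    protected void record(String step) {
        log.add(step);
    }

    /** Writes the memstore contents and returns the path that was written. */
    protected abstract String write(String hfileName);

    /** Makes the written file visible as the final region file. */
    protected abstract void commit(String written, String hfileName);
}

/** Current behaviour as described in the issue: write a .tmp file, then rename it. */
final class RenameCommitFlush extends FlushTemplate {
    @Override
    protected String write(String hfileName) {
        String tmp = ".tmp/" + hfileName;
        record("write " + tmp);
        return tmp;
    }

    @Override
    protected void commit(String written, String hfileName) {
        record("rename " + written + " -> " + hfileName);
    }
}

/** Proposed behaviour for object stores: write in place, commit without moving data. */
final class DirectWriteFlush extends FlushTemplate {
    @Override
    protected String write(String hfileName) {
        record("write " + hfileName);
        return hfileName;
    }

    @Override
    protected void commit(String written, String hfileName) {
        record("mark " + written + " complete");
    }
}
```

Note that this keeps one main flow but still leaves two commit implementations to maintain, which is the concern raised about a single code path.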
[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)
[ https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023357#comment-17023357 ]

Sean Busbey commented on HBASE-23730:
-------------------------------------

I second Stack's request that we implement this in a way that there's a single code path used both for HDFS and object stores. It'll be much easier to maintain.
[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)
[ https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023349#comment-17023349 ]

Jarred Li commented on HBASE-23730:
-----------------------------------

Hi Michael, thank you very much for your comments.

I think we can use a flag file to indicate the completion of the flush. We generate two files during the "Flushcache" stage, let's say "76ce2875960" and "76ce2875960.inprogress". The "76ce2875960.inprogress" file is the flag file that indicates the flush is in progress. During the "Commit" stage, we don't call "rename" on the region file (which is a copy and delete for an object store); we call "delete" on the flag file to complete the whole flush process.

Since the flag file is just an empty file and no data is moved, the performance is better. Meanwhile, object store "rename" is not an atomic operation, so it is error prone. Deleting an empty file is relatively easy.

The above solution is only for object stores such as S3. For HDFS, I think we should keep it as is, since rename on HDFS is an atomic operation.
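The flag-file idea above could look roughly like the following. This is only a sketch under stated assumptions: it uses the local filesystem through `java.nio.file` as a stand-in for an object store, the `FlagFileFlush` class and its method names are made up for illustration, and creating the flag before the data file is an assumption (the comment does not specify an order). It does not address the concern raised elsewhere in this thread that a side file may not be reliably visible on an eventually consistent store.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of the ".inprogress" flag-file protocol; not HBase code. */
final class FlagFileFlush {
    static final String FLAG_SUFFIX = ".inprogress";

    /** "Flushcache" stage: create the empty flag, then write data at the final location. */
    static void writeData(Path dir, String name, byte[] data) {
        try {
            Files.createFile(dir.resolve(name + FLAG_SUFFIX)); // assumption: flag goes first
            Files.write(dir.resolve(name), data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** "Commit" stage: no rename, only the empty flag file is deleted. */
    static void commit(Path dir, String name) {
        try {
            Files.delete(dir.resolve(name + FLAG_SUFFIX));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** A file counts as complete when it exists and its flag is gone. */
    static boolean isComplete(Path dir, String name) {
        return Files.exists(dir.resolve(name))
            && !Files.exists(dir.resolve(name + FLAG_SUFFIX));
    }
}
```

A crash between `writeData` and `commit` leaves the flag behind, so `isComplete` returns false for that file and a reopening region could treat it as half-done.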
[jira] [Commented] (HBASE-23730) Optimize Memstore Flush for Hbase on S3(Object Store)
[ https://issues.apache.org/jira/browse/HBASE-23730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022700#comment-17022700 ]

Michael Stack commented on HBASE-23730:
---------------------------------------

What if we crash during flush? On reopen of the region, how do we know the file is only half-done rather than just corrupt? Will WAL replay clean up half-done files?

Otherwise, sounds good. Make an implementation that we can use against hdfs too, skipping the .tmp file and rename.

Thanks.