[jira] [Created] (HBASE-22990) Parallelize listing phase from beginning of LoadIncrementalHFiles
Lavinia-Stefania Sirbu created HBASE-22990:

Summary: Parallelize listing phase from beginning of LoadIncrementalHFiles
Key: HBASE-22990
URL: https://issues.apache.org/jira/browse/HBASE-22990
Project: HBase
Issue Type: Improvement
Reporter: Lavinia-Stefania Sirbu
Assignee: Lavinia-Stefania Sirbu

--
This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748725#comment-16748725 ]

Lavinia-Stefania Sirbu commented on HBASE-21285:

[~yuzhih...@gmail.com] Could you please take a look and give me your opinion? Thank you!

> Enhanced TableSnapshotInputFormat to allow a size based splitting
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
> Issue Type: Improvement
> Components: snapshots
> Affects Versions: 1.4.0
> Reporter: Lavinia-Stefania Sirbu
> Assignee: Lavinia-Stefania Sirbu
> Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch,
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch,
> HBASE-21285.branch-1.4.004.patch, HBASE-21285.master.001.patch
>
> Currently, all the splits generated by a snapshot are having length 0. Right
> now, we have a configuration for the number of splits per region, but it's a
> general one and not very helpful when the sizes for regions are really
> different. The modification must be done in TableSnapshotInputFormatImpl
> where the length must be computed.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696592#comment-16696592 ]

Lavinia-Stefania Sirbu commented on HBASE-21285:

[~yuzhih...@gmail.com] In my first approach, I used the getters because my idea was that anyone who wants a size-based TableSnapshotInputFormat needs access to all the members in order to create new input split objects. I have added a new patch (004) with a new approach, where the additional splitting based on size is done inside the getSplits method. Can you take a look? Thank you!

> Enhanced TableSnapshotInputFormat to allow a size based splitting
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
> Issue Type: Improvement
> Components: snapshots
> Affects Versions: 1.4.0
> Reporter: Lavinia-Stefania Sirbu
> Assignee: Lavinia-Stefania Sirbu
> Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch,
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch,
> HBASE-21285.branch-1.4.004.patch, HBASE-21285.master.001.patch
>
> Currently, all the splits generated by a snapshot have length 0. Right
> now, we have a configuration for the number of splits per region, but it's a
> general one and not very helpful when the sizes of the regions are very
> different. The modification must be done in TableSnapshotInputFormatImpl,
> where the length must be computed.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
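To illustrate the idea of size-based splitting discussed in this thread, here is a minimal Java sketch. It is not the code from the attached patches and does not use any HBase classes. The class `SizeBasedSplitCount`, the method `splitsPerRegion`, and the `targetSplitBytes` parameter are hypothetical names, and the rule used (one split per `targetSplitBytes` of region data, with at least one split per region) is an assumption made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of HBase: decides how many splits each
// region should get based on its size instead of a fixed per-region count.
final class SizeBasedSplitCount {
    private SizeBasedSplitCount() {}

    // regionBytes: region name -> total size in bytes (assumed to be known).
    // targetSplitBytes: desired amount of data per split (assumed parameter).
    static Map<String, Integer> splitsPerRegion(Map<String, Long> regionBytes,
                                                long targetSplitBytes) {
        if (targetSplitBytes <= 0) {
            throw new IllegalArgumentException("targetSplitBytes must be positive");
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : regionBytes.entrySet()) {
            long size = Math.max(0L, entry.getValue());
            // Ceiling division without floating point (avoids overflow of size + target).
            long count = size / targetSplitBytes + (size % targetSplitBytes == 0 ? 0 : 1);
            // Even an empty region keeps one split.
            count = Math.max(1L, count);
            result.put(entry.getKey(), (int) Math.min(Integer.MAX_VALUE, count));
        }
        return result;
    }
}
```

With a target of 256 bytes, a 1000-byte region gets 4 splits while a 10-byte region gets 1, which is the kind of imbalance a single splits-per-region setting cannot express.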
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21285: --- Attachment: HBASE-21285.branch-1.4.004.patch Status: Patch Available (was: Open) > Enhanced TableSnapshotInputFormat to allow a size based splitting > - > > Key: HBASE-21285 > URL: https://issues.apache.org/jira/browse/HBASE-21285 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Assignee: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21285.branch-1.4.001.patch, > HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch, > HBASE-21285.branch-1.4.004.patch, HBASE-21285.master.001.patch > > > Currently, all the splits generated by a snapshot are having length 0. Right > now, we have a configuration for the number of splits per region, but it's a > general one and not very helpful when the sizes for regions are really > different. The modification must be done in TableSnapshotInputFormatImpl > where the length must be computed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21285: --- Status: Open (was: Patch Available) > Enhanced TableSnapshotInputFormat to allow a size based splitting > - > > Key: HBASE-21285 > URL: https://issues.apache.org/jira/browse/HBASE-21285 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Assignee: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21285.branch-1.4.001.patch, > HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch, > HBASE-21285.master.001.patch > > > Currently, all the splits generated by a snapshot are having length 0. Right > now, we have a configuration for the number of splits per region, but it's a > general one and not very helpful when the sizes for regions are really > different. The modification must be done in TableSnapshotInputFormatImpl > where the length must be computed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21285: --- Attachment: HBASE-21285.master.001.patch Status: Patch Available (was: Open) > Enhanced TableSnapshotInputFormat to allow a size based splitting > - > > Key: HBASE-21285 > URL: https://issues.apache.org/jira/browse/HBASE-21285 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Assignee: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21285.branch-1.4.001.patch, > HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch, > HBASE-21285.master.001.patch > > > Currently, all the splits generated by a snapshot are having length 0. Right > now, we have a configuration for the number of splits per region, but it's a > general one and not very helpful when the sizes for regions are really > different. The modification must be done in TableSnapshotInputFormatImpl > where the length must be computed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21285: --- Status: Open (was: Patch Available) > Enhanced TableSnapshotInputFormat to allow a size based splitting > - > > Key: HBASE-21285 > URL: https://issues.apache.org/jira/browse/HBASE-21285 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Assignee: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21285.branch-1.4.001.patch, > HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch, > HBASE-21285.master.001.patch > > > Currently, all the splits generated by a snapshot are having length 0. Right > now, we have a configuration for the number of splits per region, but it's a > general one and not very helpful when the sizes for regions are really > different. The modification must be done in TableSnapshotInputFormatImpl > where the length must be computed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651949#comment-16651949 ]

Lavinia-Stefania Sirbu commented on HBASE-21285:

[~yuzhih...@gmail.com] I am going to attach a patch for the master branch later today. Regarding the failed unit tests, I am not able to reproduce them locally (as you also saw when applying the patch). Do you have any suggestions on how I should approach this problem? I have also observed this behaviour for https://issues.apache.org/jira/browse/HBASE-21286: a unit test failed and then passed without any modifications (the 3rd patch was just to fix some coding style problems).

> Enhanced TableSnapshotInputFormat to allow a size based splitting
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
> Issue Type: Improvement
> Components: snapshots
> Affects Versions: 1.4.0
> Reporter: Lavinia-Stefania Sirbu
> Assignee: Lavinia-Stefania Sirbu
> Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch,
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch
>
> Currently, all the splits generated by a snapshot have length 0. Right
> now, we have a configuration for the number of splits per region, but it's a
> general one and not very helpful when the sizes of the regions are very
> different. The modification must be done in TableSnapshotInputFormatImpl,
> where the length must be computed.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot
[ https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21286: --- Attachment: HBASE-21286.branch-1.4.003.patch Status: Patch Available (was: Open) > Parallelize computeHDFSBlocksDistribution when getting splits of a > HBaseSnapshot > > > Key: HBASE-21286 > URL: https://issues.apache.org/jira/browse/HBASE-21286 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Assignee: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21286.branch-1.4.001.patch, > HBASE-21286.branch-1.4.002.patch, HBASE-21286.branch-1.4.003.patch > > > Even if this step is called computeHDFSBlocksDistribution, this is executed > no matter the file system of the snapshot. For example, we have observed an > important slowness when we have a snapshot in s3 (~26k regions, 5column > families, 2 files per column family) the getsplits time is ~40min due to the > calls in s3 for listing the files to get the best locations. > Parallelizing this operation can reduce the overall setup time. The thread > pool should be configurable and a good choice could be > "hbase.snapshot.thread.pool.max" that is also used in RestoreSnapshotHelper. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot
[ https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21286: --- Status: Open (was: Patch Available) > Parallelize computeHDFSBlocksDistribution when getting splits of a > HBaseSnapshot > > > Key: HBASE-21286 > URL: https://issues.apache.org/jira/browse/HBASE-21286 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Assignee: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21286.branch-1.4.001.patch, > HBASE-21286.branch-1.4.002.patch > > > Even if this step is called computeHDFSBlocksDistribution, this is executed > no matter the file system of the snapshot. For example, we have observed an > important slowness when we have a snapshot in s3 (~26k regions, 5column > families, 2 files per column family) the getsplits time is ~40min due to the > calls in s3 for listing the files to get the best locations. > Parallelizing this operation can reduce the overall setup time. The thread > pool should be configurable and a good choice could be > "hbase.snapshot.thread.pool.max" that is also used in RestoreSnapshotHelper. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot
[ https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646614#comment-16646614 ]

Lavinia-Stefania Sirbu commented on HBASE-21286:

[~yuzhih...@gmail.com] From my experiments, the general "formula" is:

total time ~= (nr of regions * nr of column families * nr of hfiles for a region) * latency / nr of threads

Unfortunately, I do not have results with the improvements for the big snapshot from s3 mentioned in the description (~26k regions, 5 column families, 2 files per column family), but I have tested with smaller ones, and the measured times decreased linearly with the number of threads used. Another test was done with a big snapshot (~26k regions, 5 column families, 2 files per column family) exported to a custom file system over s3 (with a known latency), and the results were also consistent with the formula.

> Parallelize computeHDFSBlocksDistribution when getting splits of a
> HBaseSnapshot
>
> Key: HBASE-21286
> URL: https://issues.apache.org/jira/browse/HBASE-21286
> Project: HBase
> Issue Type: Improvement
> Components: snapshots
> Affects Versions: 1.4.0
> Reporter: Lavinia-Stefania Sirbu
> Priority: Minor
> Attachments: HBASE-21286.branch-1.4.001.patch,
> HBASE-21286.branch-1.4.002.patch
>
> Even though this step is called computeHDFSBlocksDistribution, it is executed
> regardless of the file system of the snapshot. For example, we have observed
> significant slowness with a snapshot in s3 (~26k regions, 5 column
> families, 2 files per column family): the getSplits time is ~40min due to the
> calls to s3 for listing the files to get the best locations.
> Parallelizing this operation can reduce the overall setup time. The thread
> pool size should be configurable, and a good choice could be
> "hbase.snapshot.thread.pool.max", which is also used in RestoreSnapshotHelper.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
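The "formula" in the comment above can be written out as a small Java sketch. The class `GetSplitsTimeEstimate` and the method `estimateSeconds` are hypothetical names. The sketch reads "nr of hfiles for a region" as the number of files per column family, following the "2 files per column family" wording of the issue description, which is an interpretation rather than something the comment states. The latency value used in the note below is made up for illustration and is not a measurement from the thread.

```java
// Hypothetical helper, not part of HBase: evaluates the estimate
//   total time ~= (regions * column families * hfiles) * latency / threads
final class GetSplitsTimeEstimate {
    private GetSplitsTimeEstimate() {}

    static double estimateSeconds(long regions,
                                  int columnFamilies,
                                  int hfilesPerFamily,   // assumed interpretation
                                  double latencySeconds, // per file-system call
                                  int threads) {
        if (threads <= 0) {
            throw new IllegalArgumentException("threads must be positive");
        }
        // One file-system call per hfile is assumed.
        long calls = regions * columnFamilies * hfilesPerFamily;
        return calls * latencySeconds / threads;
    }
}
```

For a snapshot shaped like the one in the description (26,000 regions, 5 column families, 2 files each), this gives 260,000 calls. With an illustrative 10 ms per call, the estimate is about 2,600 s on one thread and about 325 s on eight.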
[jira] [Updated] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot
[ https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21286: --- Attachment: HBASE-21286.branch-1.4.002.patch Status: Patch Available (was: Open) > Parallelize computeHDFSBlocksDistribution when getting splits of a > HBaseSnapshot > > > Key: HBASE-21286 > URL: https://issues.apache.org/jira/browse/HBASE-21286 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21286.branch-1.4.001.patch, > HBASE-21286.branch-1.4.002.patch > > > Even if this step is called computeHDFSBlocksDistribution, this is executed > no matter the file system of the snapshot. For example, we have observed an > important slowness when we have a snapshot in s3 (~26k regions, 5column > families, 2 files per column family) the getsplits time is ~40min due to the > calls in s3 for listing the files to get the best locations. > Parallelizing this operation can reduce the overall setup time. The thread > pool should be configurable and a good choice could be > "hbase.snapshot.thread.pool.max" that is also used in RestoreSnapshotHelper. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot
[ https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21286: --- Status: Open (was: Patch Available) > Parallelize computeHDFSBlocksDistribution when getting splits of a > HBaseSnapshot > > > Key: HBASE-21286 > URL: https://issues.apache.org/jira/browse/HBASE-21286 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21286.branch-1.4.001.patch > > > Even if this step is called computeHDFSBlocksDistribution, this is executed > no matter the file system of the snapshot. For example, we have observed an > important slowness when we have a snapshot in s3 (~26k regions, 5column > families, 2 files per column family) the getsplits time is ~40min due to the > calls in s3 for listing the files to get the best locations. > Parallelizing this operation can reduce the overall setup time. The thread > pool should be configurable and a good choice could be > "hbase.snapshot.thread.pool.max" that is also used in RestoreSnapshotHelper. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21285: --- Status: Open (was: Patch Available) > Enhanced TableSnapshotInputFormat to allow a size based splitting > - > > Key: HBASE-21285 > URL: https://issues.apache.org/jira/browse/HBASE-21285 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21285.branch-1.4.001.patch, > HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch > > > Currently, all the splits generated by a snapshot are having length 0. Right > now, we have a configuration for the number of splits per region, but it's a > general one and not very helpful when the sizes for regions are really > different. The modification must be done in TableSnapshotInputFormatImpl > where the length must be computed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21285: --- Attachment: HBASE-21285.branch-1.4.003.patch Status: Patch Available (was: Open) > Enhanced TableSnapshotInputFormat to allow a size based splitting > - > > Key: HBASE-21285 > URL: https://issues.apache.org/jira/browse/HBASE-21285 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21285.branch-1.4.001.patch, > HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch > > > Currently, all the splits generated by a snapshot are having length 0. Right > now, we have a configuration for the number of splits per region, but it's a > general one and not very helpful when the sizes for regions are really > different. The modification must be done in TableSnapshotInputFormatImpl > where the length must be computed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645694#comment-16645694 ]

Lavinia-Stefania Sirbu commented on HBASE-21285:

Sure [~yuzhih...@gmail.com], the results of my experiments are:
* Number of regions = 20 (~200 hfiles) => getRegionSizes ~25ms
* Number of regions = 500 (~9000 hfiles) => getRegionSizes ~100ms
* Number of regions = 23000 (~216000 hfiles) => getRegionSizes ~3.5s

> Enhanced TableSnapshotInputFormat to allow a size based splitting
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
> Issue Type: Improvement
> Components: snapshots
> Affects Versions: 1.4.0
> Reporter: Lavinia-Stefania Sirbu
> Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch,
> HBASE-21285.branch-1.4.002.patch
>
> Currently, all the splits generated by a snapshot are having length 0. Right
> now, we have a configuration for the number of splits per region, but it's a
> general one and not very helpful when the sizes for regions are really
> different. The modification must be done in TableSnapshotInputFormatImpl
> where the length must be computed.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21285: --- Status: Open (was: Patch Available) > Enhanced TableSnapshotInputFormat to allow a size based splitting > - > > Key: HBASE-21285 > URL: https://issues.apache.org/jira/browse/HBASE-21285 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21285.branch-1.4.001.patch, > HBASE-21285.branch-1.4.002.patch > > > Currently, all the splits generated by a snapshot are having length 0. Right > now, we have a configuration for the number of splits per region, but it's a > general one and not very helpful when the sizes for regions are really > different. The modification must be done in TableSnapshotInputFormatImpl > where the length must be computed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21285: --- Attachment: HBASE-21285.branch-1.4.002.patch Status: Patch Available (was: Open) > Enhanced TableSnapshotInputFormat to allow a size based splitting > - > > Key: HBASE-21285 > URL: https://issues.apache.org/jira/browse/HBASE-21285 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21285.branch-1.4.001.patch, > HBASE-21285.branch-1.4.002.patch > > > Currently, all the splits generated by a snapshot are having length 0. Right > now, we have a configuration for the number of splits per region, but it's a > general one and not very helpful when the sizes for regions are really > different. The modification must be done in TableSnapshotInputFormatImpl > where the length must be computed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot
[ https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21286: --- Attachment: HBASE-21286.branch-1.4.001.patch Status: Patch Available (was: Open) > Parallelize computeHDFSBlocksDistribution when getting splits of a > HBaseSnapshot > > > Key: HBASE-21286 > URL: https://issues.apache.org/jira/browse/HBASE-21286 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21286.branch-1.4.001.patch > > > Even if this step is called computeHDFSBlocksDistribution, this is executed > no matter the file system of the snapshot. For example, we have observed an > important slowness when we have a snapshot in s3 (~26k regions, 5column > families, 2 files per column family) the getsplits time is ~40min due to the > calls in s3 for listing the files to get the best locations. > Parallelizing this operation can reduce the overall setup time. The thread > pool should be configurable and a good choice could be > "hbase.snapshot.thread.pool.max" that is also used in RestoreSnapshotHelper. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21285: --- Attachment: (was: HBASE-21285.branch-1.4.001.patch) > Enhanced TableSnapshotInputFormat to allow a size based splitting > - > > Key: HBASE-21285 > URL: https://issues.apache.org/jira/browse/HBASE-21285 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21285.branch-1.4.001.patch > > > Currently, all the splits generated by a snapshot are having length 0. Right > now, we have a configuration for the number of splits per region, but it's a > general one and not very helpful when the sizes for regions are really > different. The modification must be done in TableSnapshotInputFormatImpl > where the length must be computed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21285: --- Attachment: HBASE-21285.branch-1.4.001.patch Status: Patch Available (was: Open) > Enhanced TableSnapshotInputFormat to allow a size based splitting > - > > Key: HBASE-21285 > URL: https://issues.apache.org/jira/browse/HBASE-21285 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21285.branch-1.4.001.patch > > > Currently, all the splits generated by a snapshot are having length 0. Right > now, we have a configuration for the number of splits per region, but it's a > general one and not very helpful when the sizes for regions are really > different. The modification must be done in TableSnapshotInputFormatImpl > where the length must be computed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lavinia-Stefania Sirbu updated HBASE-21285: --- Attachment: HBASE-21285.branch-1.4.001.patch > Enhanced TableSnapshotInputFormat to allow a size based splitting > - > > Key: HBASE-21285 > URL: https://issues.apache.org/jira/browse/HBASE-21285 > Project: HBase > Issue Type: Improvement > Components: snapshots >Affects Versions: 1.4.0 >Reporter: Lavinia-Stefania Sirbu >Priority: Minor > Attachments: HBASE-21285.branch-1.4.001.patch > > > Currently, all the splits generated by a snapshot are having length 0. Right > now, we have a configuration for the number of splits per region, but it's a > general one and not very helpful when the sizes for regions are really > different. The modification must be done in TableSnapshotInputFormatImpl > where the length must be computed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot
Lavinia-Stefania Sirbu created HBASE-21286:

Summary: Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot
Key: HBASE-21286
URL: https://issues.apache.org/jira/browse/HBASE-21286
Project: HBase
Issue Type: Improvement
Components: snapshots
Affects Versions: 1.4.0
Reporter: Lavinia-Stefania Sirbu

Even though this step is called computeHDFSBlocksDistribution, it is executed regardless of the file system of the snapshot. For example, we have observed significant slowness with a snapshot in s3 (~26k regions, 5 column families, 2 files per column family): the getSplits time is ~40min due to the calls to s3 for listing the files to get the best locations.

Parallelizing this operation can reduce the overall setup time. The thread pool size should be configurable, and a good choice could be "hbase.snapshot.thread.pool.max", which is also used in RestoreSnapshotHelper.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
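The parallelization proposed in this issue can be sketched with plain `java.util.concurrent`. This is not the code from the attached patches and uses no HBase or Hadoop classes. The class `ParallelRegionLocations`, the method `computeAll`, the `perRegion` function (standing in for whatever per-region listing work is needed), the use of a `Map` in place of a Hadoop `Configuration`, and the fallback of 8 threads are all hypothetical. Only the key name "hbase.snapshot.thread.pool.max" comes from the issue text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of HBase: runs a per-region computation
// on a fixed-size thread pool and returns results in region order.
final class ParallelRegionLocations {
    static final String POOL_KEY = "hbase.snapshot.thread.pool.max";
    static final int FALLBACK_THREADS = 8; // assumed value, for illustration only

    private ParallelRegionLocations() {}

    static <R> List<R> computeAll(List<String> regions,
                                  Function<String, R> perRegion,
                                  Map<String, String> conf) {
        int threads = Integer.parseInt(
            conf.getOrDefault(POOL_KEY, String.valueOf(FALLBACK_THREADS)));
        if (threads < 1) {
            throw new IllegalArgumentException(POOL_KEY + " must be at least 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per region; the slow calls overlap across threads.
            List<Future<R>> futures = new ArrayList<>(regions.size());
            for (String region : regions) {
                Callable<R> task = () -> perRegion.apply(region);
                futures.add(pool.submit(task));
            }
            // Collect in submission order so output order matches input order.
            List<R> results = new ArrayList<>(futures.size());
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for regions", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("per-region task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting the futures in submission order keeps the results aligned with the input list, so a caller that builds one split per region would not need to re-sort anything.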
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lavinia-Stefania Sirbu updated HBASE-21285:

Description: Currently, all the splits generated by a snapshot are having length 0. Right now, we have a configuration for the number of splits per region, but it's a general one and not very helpful when the sizes for regions are really different. The modification must be done in TableSnapshotInputFormatImpl where the length must be computed.

(was: Currently, all the splits generated by a snapshot are having length 0. Fixing this would give the opportunity to have a better control of the splits by redoing a sized based splitting (right now, we have a configuration for the number of splits per region, but it's a general one and not very helpful when the sizes for regions are really different).)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
> Issue Type: Improvement
> Components: snapshots
> Affects Versions: 1.4.0
> Reporter: Lavinia-Stefania Sirbu
> Priority: Minor
>
> Currently, all the splits generated by a snapshot are having length 0. Right
> now, we have a configuration for the number of splits per region, but it's a
> general one and not very helpful when the sizes for regions are really
> different. The modification must be done in TableSnapshotInputFormatImpl
> where the length must be computed.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting
[ https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lavinia-Stefania Sirbu updated HBASE-21285:

Summary: Enhanced TableSnapshotInputFormat to allow a size based splitting (was: Compute TableSnapshotInputFormatImpl.InputSplit length)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
> Issue Type: Improvement
> Components: snapshots
> Affects Versions: 1.4.0
> Reporter: Lavinia-Stefania Sirbu
> Priority: Minor
>
> Currently, all the splits generated by a snapshot are having length 0. Fixing
> this would give the opportunity to have a better control of the splits by
> redoing a sized based splitting (right now, we have a configuration for the
> number of splits per region, but it's a general one and not very helpful when
> the sizes for regions are really different).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21285) Compute TableSnapshotInputFormatImpl.InputSplit length
Lavinia-Stefania Sirbu created HBASE-21285:

Summary: Compute TableSnapshotInputFormatImpl.InputSplit length
Key: HBASE-21285
URL: https://issues.apache.org/jira/browse/HBASE-21285
Project: HBase
Issue Type: Improvement
Components: snapshots
Affects Versions: 1.4.0
Reporter: Lavinia-Stefania Sirbu

Currently, all the splits generated by a snapshot have length 0. Fixing this would allow better control of the splits through size-based splitting (right now, we have a configuration for the number of splits per region, but it's a general one and not very helpful when the sizes of the regions are very different).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)