[jira] [Created] (HBASE-22990) Parallelize listing phase from beginning of LoadIncrementalHFiles

2019-09-08 Thread Lavinia-Stefania Sirbu (Jira)
Lavinia-Stefania Sirbu created HBASE-22990:
--

 Summary: Parallelize listing phase from beginning of 
LoadIncrementalHFiles
 Key: HBASE-22990
 URL: https://issues.apache.org/jira/browse/HBASE-22990
 Project: HBase
  Issue Type: Improvement
Reporter: Lavinia-Stefania Sirbu
Assignee: Lavinia-Stefania Sirbu






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2019-01-22 Thread Lavinia-Stefania Sirbu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748725#comment-16748725
 ] 

Lavinia-Stefania Sirbu commented on HBASE-21285:


[~yuzhih...@gmail.com] Could you please take a look and give me your opinion? 
Thank you!

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Assignee: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch, 
> HBASE-21285.branch-1.4.004.patch, HBASE-21285.master.001.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-11-23 Thread Lavinia-Stefania Sirbu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696592#comment-16696592
 ] 

Lavinia-Stefania Sirbu commented on HBASE-21285:


[~yuzhih...@gmail.com] In my first approach, I exposed getters because anyone
who wants to build a size-based TableSnapshotInputFormat needs access to all
the members in order to create new input split objects.

I have added a new patch (004) with a new approach, where the additional
size-based splitting is done inside the getSplits method. Can you take a
look? Thank you!
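
To illustrate what size-based splitting inside getSplits could mean, here is a minimal sketch. It is not taken from any of the attached patches: the class name, the method name, and the idea of a configurable target split size are all assumptions made for the example.

```java
// Hypothetical helper, NOT the code from the HBASE-21285 patches.
final class SizeBasedSplits {
    private SizeBasedSplits() {}

    /**
     * Returns how many input splits a region should get.
     *
     * @param regionBytes      total size of the region's store files, in bytes
     * @param targetSplitBytes desired maximum size of one split (assumed setting)
     */
    static int splitsForRegion(long regionBytes, long targetSplitBytes) {
        if (targetSplitBytes <= 0) {
            throw new IllegalArgumentException("targetSplitBytes must be positive");
        }
        if (regionBytes <= 0) {
            // An empty region still gets one split so that it is scanned.
            return 1;
        }
        // Ceiling division written so that the addition cannot overflow.
        long n = regionBytes / targetSplitBytes
                + (regionBytes % targetSplitBytes == 0 ? 0 : 1);
        return (int) Math.min(n, Integer.MAX_VALUE);
    }
}
```

With this rule a small region keeps a single split, while a region several times larger than the target is cut into proportionally more splits, which is the behaviour a single global "splits per region" setting cannot give.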

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Assignee: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch, 
> HBASE-21285.branch-1.4.004.patch, HBASE-21285.master.001.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-11-22 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Attachment: HBASE-21285.branch-1.4.004.patch
Status: Patch Available  (was: Open)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Assignee: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch, 
> HBASE-21285.branch-1.4.004.patch, HBASE-21285.master.001.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-11-22 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Status: Open  (was: Patch Available)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Assignee: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch, 
> HBASE-21285.master.001.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-16 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Attachment: HBASE-21285.master.001.patch
Status: Patch Available  (was: Open)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Assignee: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch, 
> HBASE-21285.master.001.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-16 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Status: Open  (was: Patch Available)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Assignee: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch, 
> HBASE-21285.master.001.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Commented] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-16 Thread Lavinia-Stefania Sirbu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651949#comment-16651949
 ] 

Lavinia-Stefania Sirbu commented on HBASE-21285:


[~yuzhih...@gmail.com] I am going to attach a patch for the master branch later
today.

Regarding the failed unit tests, I am not able to reproduce them locally (as
you also saw when applying the patch). Do you have any suggestions on how I
should approach this problem? I have also observed this behaviour for
https://issues.apache.org/jira/browse/HBASE-21286: a unit test failed and then
passed without any modifications (the 3rd patch only fixed some coding style
problems).

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Assignee: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot

2018-10-14 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21286:
---
Attachment: HBASE-21286.branch-1.4.003.patch
Status: Patch Available  (was: Open)

> Parallelize computeHDFSBlocksDistribution when getting splits of a 
> HBaseSnapshot
> 
>
> Key: HBASE-21286
> URL: https://issues.apache.org/jira/browse/HBASE-21286
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Assignee: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21286.branch-1.4.001.patch, 
> HBASE-21286.branch-1.4.002.patch, HBASE-21286.branch-1.4.003.patch
>
>
> Although this step is called computeHDFSBlocksDistribution, it is executed
> regardless of the snapshot's file system. For example, we observed a
> significant slowdown with a snapshot in S3 (~26k regions, 5 column families,
> 2 files per column family): the getSplits time is ~40 min because of the S3
> calls that list the files to find the best locations.
> Parallelizing this operation can reduce the overall setup time. The thread
> pool size should be configurable; a good choice could be
> "hbase.snapshot.thread.pool.max", which is also used in RestoreSnapshotHelper.





[jira] [Updated] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot

2018-10-14 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21286:
---
Status: Open  (was: Patch Available)

> Parallelize computeHDFSBlocksDistribution when getting splits of a 
> HBaseSnapshot
> 
>
> Key: HBASE-21286
> URL: https://issues.apache.org/jira/browse/HBASE-21286
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Assignee: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21286.branch-1.4.001.patch, 
> HBASE-21286.branch-1.4.002.patch
>
>
> Although this step is called computeHDFSBlocksDistribution, it is executed
> regardless of the snapshot's file system. For example, we observed a
> significant slowdown with a snapshot in S3 (~26k regions, 5 column families,
> 2 files per column family): the getSplits time is ~40 min because of the S3
> calls that list the files to find the best locations.
> Parallelizing this operation can reduce the overall setup time. The thread
> pool size should be configurable; a good choice could be
> "hbase.snapshot.thread.pool.max", which is also used in RestoreSnapshotHelper.





[jira] [Commented] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot

2018-10-11 Thread Lavinia-Stefania Sirbu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646614#comment-16646614
 ] 

Lavinia-Stefania Sirbu commented on HBASE-21286:


[~yuzhih...@gmail.com]
From my experiments, the general "formula" is total time ~= (nr of regions *
nr of column families * nr of hfiles for a region) * latency / nr of threads.
Unfortunately, I do not have results with the improvements for the big snapshot
from s3 mentioned in the description (~26k regions, 5 column families, 2 files
per column family), but I have tested with smaller ones, and the time
decreased linearly with the number of threads used.
Another test was done with a big snapshot (~26k regions, 5 column families, 2
files per column family) exported to a custom file system over s3 (with a known
latency), and the results were also consistent with the formula.
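
The formula above can be written out as a small estimator. This is only a sketch: the class and parameter names are hypothetical, and it assumes that "nr of hfiles" means files per column family (as in "2 files per column family") and that every file costs one listing call with the same latency.

```java
// Hypothetical estimator for the "formula" in this comment; the names and
// the one-call-per-file assumption are illustrative only.
final class GetSplitsEstimate {
    private GetSplitsEstimate() {}

    /**
     * total time ~= (regions * families * hfiles) * latency / threads
     *
     * @param regions         number of regions in the snapshot
     * @param families        column families per region
     * @param hfilesPerFamily hfiles per column family (assumed interpretation)
     * @param latencySeconds  latency of one file-system call, in seconds
     * @param threads         number of worker threads
     */
    static double estimateSeconds(long regions, int families, int hfilesPerFamily,
                                  double latencySeconds, int threads) {
        if (threads <= 0) {
            throw new IllegalArgumentException("threads must be positive");
        }
        long calls = regions * families * hfilesPerFamily;
        return calls * latencySeconds / threads;
    }
}
```

For the snapshot in the description (~26k regions, 5 column families, 2 files per column family) this gives 260,000 calls. With a purely illustrative latency of 10 ms per call (not a measured value), the estimate is about 2,600 s on one thread and about 260 s on ten threads.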

> Parallelize computeHDFSBlocksDistribution when getting splits of a 
> HBaseSnapshot
> 
>
> Key: HBASE-21286
> URL: https://issues.apache.org/jira/browse/HBASE-21286
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21286.branch-1.4.001.patch, 
> HBASE-21286.branch-1.4.002.patch
>
>
> Although this step is called computeHDFSBlocksDistribution, it is executed
> regardless of the snapshot's file system. For example, we observed a
> significant slowdown with a snapshot in S3 (~26k regions, 5 column families,
> 2 files per column family): the getSplits time is ~40 min because of the S3
> calls that list the files to find the best locations.
> Parallelizing this operation can reduce the overall setup time. The thread
> pool size should be configurable; a good choice could be
> "hbase.snapshot.thread.pool.max", which is also used in RestoreSnapshotHelper.





[jira] [Updated] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot

2018-10-11 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21286:
---
Attachment: HBASE-21286.branch-1.4.002.patch
Status: Patch Available  (was: Open)

> Parallelize computeHDFSBlocksDistribution when getting splits of a 
> HBaseSnapshot
> 
>
> Key: HBASE-21286
> URL: https://issues.apache.org/jira/browse/HBASE-21286
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21286.branch-1.4.001.patch, 
> HBASE-21286.branch-1.4.002.patch
>
>
> Although this step is called computeHDFSBlocksDistribution, it is executed
> regardless of the snapshot's file system. For example, we observed a
> significant slowdown with a snapshot in S3 (~26k regions, 5 column families,
> 2 files per column family): the getSplits time is ~40 min because of the S3
> calls that list the files to find the best locations.
> Parallelizing this operation can reduce the overall setup time. The thread
> pool size should be configurable; a good choice could be
> "hbase.snapshot.thread.pool.max", which is also used in RestoreSnapshotHelper.





[jira] [Updated] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot

2018-10-11 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21286:
---
Status: Open  (was: Patch Available)

> Parallelize computeHDFSBlocksDistribution when getting splits of a 
> HBaseSnapshot
> 
>
> Key: HBASE-21286
> URL: https://issues.apache.org/jira/browse/HBASE-21286
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21286.branch-1.4.001.patch
>
>
> Although this step is called computeHDFSBlocksDistribution, it is executed
> regardless of the snapshot's file system. For example, we observed a
> significant slowdown with a snapshot in S3 (~26k regions, 5 column families,
> 2 files per column family): the getSplits time is ~40 min because of the S3
> calls that list the files to find the best locations.
> Parallelizing this operation can reduce the overall setup time. The thread
> pool size should be configurable; a good choice could be
> "hbase.snapshot.thread.pool.max", which is also used in RestoreSnapshotHelper.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-11 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Status: Open  (was: Patch Available)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-11 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Attachment: HBASE-21285.branch-1.4.003.patch
Status: Patch Available  (was: Open)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch, HBASE-21285.branch-1.4.003.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Commented] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-10 Thread Lavinia-Stefania Sirbu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645694#comment-16645694
 ] 

Lavinia-Stefania Sirbu commented on HBASE-21285:


Sure [~yuzhih...@gmail.com], the results of my experiments are:
* Number of regions = 20 (~200 hfiles) => getRegionSizes ~25ms
* Number of regions = 500 (~9000 hfiles) => getRegionSizes ~100ms
* Number of regions = 23000 (~216000 hfiles) => getRegionSizes ~3.5s

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-10 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Status: Open  (was: Patch Available)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-10 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Attachment: HBASE-21285.branch-1.4.002.patch
Status: Patch Available  (was: Open)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch, 
> HBASE-21285.branch-1.4.002.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot

2018-10-10 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21286:
---
Attachment: HBASE-21286.branch-1.4.001.patch
Status: Patch Available  (was: Open)

> Parallelize computeHDFSBlocksDistribution when getting splits of a 
> HBaseSnapshot
> 
>
> Key: HBASE-21286
> URL: https://issues.apache.org/jira/browse/HBASE-21286
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21286.branch-1.4.001.patch
>
>
> Although this step is called computeHDFSBlocksDistribution, it is executed
> regardless of the snapshot's file system. For example, we observed a
> significant slowdown with a snapshot in S3 (~26k regions, 5 column families,
> 2 files per column family): the getSplits time is ~40 min because of the S3
> calls that list the files to find the best locations.
> Parallelizing this operation can reduce the overall setup time. The thread
> pool size should be configurable; a good choice could be
> "hbase.snapshot.thread.pool.max", which is also used in RestoreSnapshotHelper.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-10 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Attachment: (was: HBASE-21285.branch-1.4.001.patch)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-10 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Attachment: HBASE-21285.branch-1.4.001.patch
Status: Patch Available  (was: Open)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-10 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Attachment: HBASE-21285.branch-1.4.001.patch

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
> Attachments: HBASE-21285.branch-1.4.001.patch
>
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21286) Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot

2018-10-10 Thread Lavinia-Stefania Sirbu (JIRA)
Lavinia-Stefania Sirbu created HBASE-21286:
--

 Summary: Parallelize computeHDFSBlocksDistribution when getting 
splits of a HBaseSnapshot
 Key: HBASE-21286
 URL: https://issues.apache.org/jira/browse/HBASE-21286
 Project: HBase
  Issue Type: Improvement
  Components: snapshots
Affects Versions: 1.4.0
Reporter: Lavinia-Stefania Sirbu


Although this step is called computeHDFSBlocksDistribution, it is executed
regardless of the snapshot's file system. For example, we observed a
significant slowdown with a snapshot in S3 (~26k regions, 5 column families,
2 files per column family): the getSplits time is ~40 min because of the S3
calls that list the files to find the best locations.

Parallelizing this operation can reduce the overall setup time. The thread pool
size should be configurable; a good choice could be
"hbase.snapshot.thread.pool.max", which is also used in RestoreSnapshotHelper.

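The parallelization proposed in this issue could look roughly like the sketch below. It is not the attached patch: the class and method names are hypothetical, the per-region work is passed in as a plain function, and the pool size is a parameter that a caller would presumably read from a setting such as "hbase.snapshot.thread.pool.max".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, NOT the code from the HBASE-21286 patches.
final class ParallelRegionWork {
    private ParallelRegionWork() {}

    /**
     * Applies perRegion to every region on a fixed-size thread pool and
     * returns the results in the same order as the input list.
     *
     * @param maxThreads pool size; assumed to come from configuration
     */
    static <R, T> List<T> computeAll(List<R> regions, Function<R, T> perRegion,
                                     int maxThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, maxThreads));
        try {
            // Submit one task per region so slow listing calls overlap.
            List<Future<T>> futures = new ArrayList<>(regions.size());
            for (R region : regions) {
                Callable<T> task = () -> perRegion.apply(region);
                futures.add(pool.submit(task));
            }
            // Collect in submission order; get() rethrows any task failure.
            List<T> results = new ArrayList<>(regions.size());
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each task is dominated by file-system latency rather than CPU work, a pool larger than the number of cores can still help, which is one reason to keep the size configurable.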




[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-10 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Description: Currently, all the splits generated by a snapshot have length 0.
Right now, we have a configuration for the number of splits per region, but it
is a general one and not very helpful when region sizes differ widely. The
modification must be done in TableSnapshotInputFormatImpl, where the length
must be computed.  (was: Currently, all the splits generated by a snapshot have
length 0. Fixing this would allow better control of the splits through
size-based splitting (right now, we have a configuration for the number of
splits per region, but it is a general one and not very helpful when region
sizes differ widely).)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
>
> Currently, all the splits generated by a snapshot have length 0. Right now,
> we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely. The
> modification must be done in TableSnapshotInputFormatImpl, where the length
> must be computed.





[jira] [Updated] (HBASE-21285) Enhanced TableSnapshotInputFormat to allow a size based splitting

2018-10-10 Thread Lavinia-Stefania Sirbu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lavinia-Stefania Sirbu updated HBASE-21285:
---
Summary: Enhanced TableSnapshotInputFormat to allow a size based splitting  
(was: Compute TableSnapshotInputFormatImpl.InputSplit length)

> Enhanced TableSnapshotInputFormat to allow a size based splitting
> -
>
> Key: HBASE-21285
> URL: https://issues.apache.org/jira/browse/HBASE-21285
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
>Reporter: Lavinia-Stefania Sirbu
>Priority: Minor
>
> Currently, all the splits generated by a snapshot have length 0. Fixing this
> would allow better control of the splits through size-based splitting (right
> now, we have a configuration for the number of splits per region, but it is a
> general one and not very helpful when region sizes differ widely).





[jira] [Created] (HBASE-21285) Compute TableSnapshotInputFormatImpl.InputSplit length

2018-10-10 Thread Lavinia-Stefania Sirbu (JIRA)
Lavinia-Stefania Sirbu created HBASE-21285:
--

 Summary: Compute TableSnapshotInputFormatImpl.InputSplit length
 Key: HBASE-21285
 URL: https://issues.apache.org/jira/browse/HBASE-21285
 Project: HBase
  Issue Type: Improvement
  Components: snapshots
Affects Versions: 1.4.0
Reporter: Lavinia-Stefania Sirbu


Currently, all the splits generated by a snapshot have length 0. Fixing this
would allow better control of the splits through size-based splitting (right
now, we have a configuration for the number of splits per region, but it is a
general one and not very helpful when region sizes differ widely).
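
One way to obtain a non-zero split length is to sum the sizes of the store files behind the split. The sketch below only shows that arithmetic; the class name is hypothetical, and how the file sizes are obtained from the snapshot is left out on purpose.

```java
import java.util.Collection;

// Hypothetical helper; obtaining the store file sizes is outside this sketch.
final class RegionLength {
    private RegionLength() {}

    /** Sums store file sizes in bytes; throws ArithmeticException on overflow. */
    static long totalBytes(Collection<Long> storeFileSizes) {
        long total = 0L;
        for (long size : storeFileSizes) {
            total = Math.addExact(total, size);
        }
        return total;
    }
}
```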


