Lavinia-Stefania Sirbu created HBASE-21286:
----------------------------------------------

             Summary: Parallelize computeHDFSBlocksDistribution when getting 
splits of a HBaseSnapshot
                 Key: HBASE-21286
                 URL: https://issues.apache.org/jira/browse/HBASE-21286
             Project: HBase
          Issue Type: Improvement
          Components: snapshots
    Affects Versions: 1.4.0
            Reporter: Lavinia-Stefania Sirbu


Although this step is called computeHDFSBlocksDistribution, it is executed 
regardless of the snapshot's file system. For example, we have observed 
significant slowness with a snapshot stored in S3 (~26k regions, 5 column 
families, 2 files per column family): the getSplits time is ~40 min because of 
the S3 calls that list the files in order to find the best locations.

Parallelizing this operation can reduce the overall setup time. The thread pool 
size should be configurable, and a good choice for the setting could be 
"hbase.snapshot.thread.pool.max", which is also used in RestoreSnapshotHelper.

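As a rough illustration of the idea (not the actual patch), the per-region 
work could be submitted to a bounded thread pool and the results collected in 
input order. The class name ParallelLocalitySketch, the method computeAll and 
the computeOne callback are hypothetical placeholders for the real per-region 
locality computation. The sketch uses Java 8 syntax and assumes poolSize would 
be read from the configuration key mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of HBase: runs a per-region computation
// on a fixed-size thread pool instead of one region at a time.
final class ParallelLocalitySketch {

  // Applies computeOne to every region in parallel and returns the results
  // in the same order as the input list. poolSize is assumed to come from
  // a setting such as "hbase.snapshot.thread.pool.max".
  static <R, D> List<D> computeAll(List<R> regions,
      Function<R, D> computeOne, int poolSize) {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, poolSize));
    try {
      // Submit one task per region; each task stands in for the slow
      // file-listing call made for that region.
      List<Future<D>> futures = new ArrayList<>(regions.size());
      for (R region : regions) {
        Callable<D> task = () -> computeOne.apply(region);
        futures.add(pool.submit(task));
      }
      // Collect in submission order so result i matches region i.
      List<D> results = new ArrayList<>(regions.size());
      for (Future<D> future : futures) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while computing locality", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Locality computation failed", e.getCause());
    } finally {
      // Stop the worker threads whether or not all tasks succeeded.
      pool.shutdownNow();
    }
  }
}
```

With ~26k regions and a listing latency dominated by remote calls, the wall 
time of such a loop would be expected to shrink roughly with the pool size, up 
to whatever request rate the object store tolerates.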


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
