If you have a patch to parallelize that, feel free to post it and it will
probably be integrated. The idea was to replace the multiple empty files
with a few small manifests (HBASE-7987), but that work is still in
progress. So, feel free to post a patch with the fix.
Thanks!
Matteo

On Fri, Nov 8, 2013 at 11:19 AM, Bryan Beaudreault <[email protected]> wrote:

> Ok, thanks. I guess I am paying the cost of too many regions, which when
> multiplied by store files results in many thousands of small files. Is
> there any reason I couldn't modify this to parallelize it a little?
>
>
> On Fri, Nov 8, 2013 at 2:06 PM, Matteo Bertozzi <[email protected]> wrote:
>
> > The first copy doesn't resolve the links, so you're copying empty files.
> > The data copy only happens in "step 2", with the MR job.
> >
> > Matteo
> >
> >
> > On Fri, Nov 8, 2013 at 10:54 AM, Bryan Beaudreault <
> > [email protected]> wrote:
> >
> > > Hello all. I'm trying out the ExportSnapshot tool and it is extremely
> > > slow. I took a look at the code and I think I know why.
> > >
> > > https://github.com/cloudera/hbase/blob/cdh4-0.94.6_4.4.0/src/main/java/org/apache/hadoop/hbase/snapshot/ExportSnapshot.java#L635
> > >
> > > In step 1 it is, for some reason, copying from fs1 to fs2. This
> > > basically means that in a single-threaded process we are copying an
> > > entire HBase table to another cluster. I can understand wanting to
> > > copy from fs1 to fs1 (i.e. a different path on the same fs), so as to
> > > dereference all the soft links of the snapshots. But why between
> > > filesystems?
> > >
> > > In step 2 you finally run the MR job, which makes much more sense, but
> > > as far as I can tell all of the files would already exist, since
> > > FileUtil.copy just does a recursive copy of all paths in a tree.
> > >
> > > Am I missing something? I appreciate any input.
> > >
> > > - Bryan
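To illustrate the kind of patch being discussed: the slow step is a
single-threaded loop copying many thousands of small (reference) files, and
the obvious fix is to fan the per-file copies out over a thread pool. The
sketch below is mine, not ExportSnapshot's actual code; it uses plain
`java.nio` local-file copies instead of Hadoop's `FileSystem` API so it is
self-contained, and the class and method names (`ParallelCopySketch`,
`copyAll`) are hypothetical.

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCopySketch {

    // Copy every source file into destDir using a fixed-size thread pool,
    // then wait for all copies to finish. Any IOException from a worker is
    // surfaced to the caller via Future.get().
    static void copyAll(List<Path> sources, Path destDir, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Path src : sources) {
                pending.add(pool.submit(() -> {
                    Files.copy(src, destDir.resolve(src.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // propagate the first copy failure, if any
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a snapshot directory full of small files.
        Path srcDir = Files.createTempDirectory("snapshot-src");
        Path dstDir = Files.createTempDirectory("snapshot-dst");
        List<Path> files = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            files.add(Files.write(srcDir.resolve("region-" + i),
                    ("data" + i).getBytes()));
        }

        copyAll(files, dstDir, 2);

        int copied = 0;
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dstDir)) {
            for (Path ignored : ds) {
                copied++;
            }
        }
        System.out.println("copied " + copied + " files");
    }
}
```

A real patch against ExportSnapshot would instead wrap the per-file
`FileUtil.copy` (or equivalent) calls of step 1 in such a pool; the key
point is only that independent small-file copies parallelize trivially,
since no file depends on another.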
