OK, thanks. I guess I'm paying the cost of having too many regions, which, multiplied by the number of store files, results in many thousands of small files. Is there any reason I couldn't modify the first copy to parallelize it a little?
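
For concreteness, here is a rough sketch of the kind of change I have in mind. It is not the ExportSnapshot code: it uses plain java.nio.file on local paths as a stand-in for the Hadoop FileSystem API, and the class name ParallelTreeCopy, the method copyTree, and the threads parameter are all made up for illustration. The idea is just to list the tree once and then hand each small file to a fixed-size thread pool instead of copying them one by one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of HBase: copies a directory tree with a thread pool.
final class ParallelTreeCopy {

    /**
     * Copies every regular file under src to the same relative path under dst,
     * using a fixed pool of `threads` workers. Returns the number of files copied.
     * Empty directories are not recreated in this sketch.
     */
    static int copyTree(Path src, Path dst, int threads)
            throws IOException, InterruptedException, ExecutionException {
        // Listing is done once, up front, on the calling thread.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(src)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (Path file : files) {
                // One task per file; each task creates its parent dirs and copies.
                pending.add(pool.submit(() -> {
                    Path target = dst.resolve(src.relativize(file));
                    Files.createDirectories(target.getParent());
                    return Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
                }));
            }
            // Wait for all copies; get() rethrows the first failure it reaches.
            for (Future<Path> f : pending) {
                f.get();
            }
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }
}
```

With thousands of tiny files the per-file overhead dominates, so even a small pool should help, though how much would depend on the filesystems involved.
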

On Fri, Nov 8, 2013 at 2:06 PM, Matteo Bertozzi <[email protected]> wrote:

> The first copy doesn't resolve the links, so you're copying empty files.
> The data copy is only on "step 2" with the MR job
>
> Matteo
>
>
> On Fri, Nov 8, 2013 at 10:54 AM, Bryan Beaudreault <[email protected]> wrote:
>
> > Hello all. I'm trying out the ExportSnapshot tool and it is extremely
> > slow. I took a look at the code and I think I know why.
> >
> > https://github.com/cloudera/hbase/blob/cdh4-0.94.6_4.4.0/src/main/java/org/apache/hadoop/hbase/snapshot/ExportSnapshot.java#L635
> >
> > In step 1 it is for some reason copying from fs1 to fs2. This basically
> > means in a single threaded process we are copying an entire hbase table to
> > another cluster. I can understand wanting to copy from fs1 to fs1 (i.e.
> > different path on same fs), so as to dereference all the soft links of the
> > snapshots. But why between filesystems?
> >
> > In step 2 you finally do the MR job, which makes much more sense, but as
> > far as I can tell all of the files would already exist, as FileUtils.copy
> > just does a recursive copy of all paths in a tree.
> >
> > Am I missing something? I appreciate any input.
> >
> > - Bryan
