On Wed, Aug 25, 2010 at 8:05 AM, Matthew LeMieux <[email protected]> wrote:
> For those who are curious, using rack awareness to speed up the process of
> adding and removing nodes did not work in my experiment.
>
> Once the extra rack was no longer needed, HDFS was using up time and
> bandwidth to duplicate the data on the primary rack over to the transient
> rack, rather than replicating the data on the transient rack over to the
> primary rack. With the alternative method of putting all machines in one
> rack, the data on the permanent machines does not need to be copied over to
> the transient machines, making replication much faster.
>
Would adding a custom script for rack awareness help here?

> But, once I've copied the files over to the M/R cluster, is there a way to
> read in the files, i.e., is there an HFileInputFormat equivalent?

No. We don't have such a beastie at the moment. It'd be a little tricky to
write, in that it would in essence need to keep the ordering and merge the
content of all the files in the same way as is done inside the HRegion class
that floats in a running HRegionServer (maybe what's needed is an
HRegionInputFormat?). You'll need to bring up an hbase instance on the M/R
cluster if you want to run M/R against the hbase content. Or just run M/R
across racks?

St.Ack
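For anyone curious what such a custom script looks like: the rack-awareness
topology script (pointed to by topology.script.file.name in core-site.xml) is
just an executable that reads node addresses as arguments and prints one rack
path per address. A minimal sketch, with made-up IP ranges and rack names:

```shell
#!/bin/sh
# Hypothetical topology mapper: the IP ranges and rack paths below are
# examples only -- substitute your own primary/transient node addresses.
resolve_rack() {
  for host in "$@"; do
    case "$host" in
      10.0.1.*) echo "/dc1/rack-primary" ;;   # permanent nodes
      10.0.2.*) echo "/dc1/rack-transient" ;; # transient nodes
      *)        echo "/default-rack" ;;       # anything unrecognized
    esac
  done
}

# Hadoop invokes the script with one or more addresses at a time.
resolve_rack "$@"
```

Note that the NameNode caches these mappings, so whether a script like this
helps with the transient-rack replication problem described above is an open
question.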
