Hi folks!
I have just read about the HDFS RAID feature that was added in Hadoop 0.21 or 0.22, and I am curious to know whether people use it, what they use it for, and what they think about its effect on Map/Reduce data locality.

The first big user of this technology is Facebook, which claims to save many PB with it (see slides 4 and 5 of http://www.slideshare.net/ydn/hdfs-raid-facebook).

I see the following advantages of HDFS RAID:
- You can save space
- The system tolerates more missing blocks

Nonetheless, one drawback I see is M/R data locality. As far as I understand, the advantage of having 3 replicas of each block is not only safety if a server fails or a block is corrupted, but also the possibility of having up to 3 tasktrackers that can execute the map task with local data. If you consider the 4th slide of the Facebook presentation, such an infrastructure reduces this possibility to only 1 tasktracker.

That means that if this tasktracker is very busy executing other tasks, you have the following choices:
- Wait for this tasktracker to finish executing (part of) its current tasks (freeing map slots, for instance)
- Execute the map task for this block on another tasktracker, transferring the block's data over the network

In both cases, you'll get an M/R penalty (please tell me if I am wrong). Has anybody measured this penalty, or does anybody have benchmarks to share with us?

One scenario I can think of to take advantage of HDFS RAID without suffering this penalty is:
- Using normal HDFS with the default replication=3 for my fresh data
- Using HDFS RAID for my historical data (which is barely used by M/R)

And you, what are you using HDFS RAID for?

Regards,
Sourygna Luangsay
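P.S. To make the locality concern a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It rests on a simplifying assumption of mine (not something taken from the Facebook slides): each tasktracker that holds a copy of the block has no free map slot with the same probability p_busy, independently of the others. The function name prob_local_slot and the sample p_busy values are hypothetical, chosen only for illustration.

```python
def prob_local_slot(replicas: int, p_busy: float) -> float:
    """Probability that at least one replica holder has a free map slot.

    Assumes every tasktracker holding a copy of the block is busy
    independently with probability p_busy (a simplification, not a
    measured value).
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    if not 0.0 <= p_busy <= 1.0:
        raise ValueError("p_busy must be in [0, 1]")
    # All replica holders are busy with probability p_busy ** replicas,
    # so at least one is free with the complementary probability.
    return 1.0 - p_busy ** replicas


if __name__ == "__main__":
    # Hypothetical load levels, only to compare 3 copies against 1 copy.
    for p_busy in (0.5, 0.8, 0.95):
        three = prob_local_slot(3, p_busy)
        one = prob_local_slot(1, p_busy)
        print(f"p_busy={p_busy:.2f}  3 copies: {three:.3f}  1 copy: {one:.3f}")
```

Under this toy model, the chance of a data-local map task with a single copy is simply 1 - p_busy, while three copies give 1 - p_busy^3. Real schedulers, rack locality, and correlated load are not modelled, so this only illustrates the direction of the effect, not its size.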
