is HDFS RAID "data locality" efficient?

Sourygna Luangsay Wed, 08 Aug 2012 09:46:34 -0700

Hi folks!


I have just read about the HDFS RAID feature that was added to Hadoop 0.21
or 0.22, and I am quite curious to know whether people use it, what they use
it for, and what they think about its effect on Map/Reduce data locality.

 

The first big user of this technology is Facebook, which claims to save many
PB with it (see http://www.slideshare.net/ydn/hdfs-raid-facebook, slides 4
and 5).

 

I understand the following advantages with HDFS RAID:

-          You can save space

-          The system tolerates more missing blocks
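
To put a rough number on the space saving, here is a minimal Python sketch.
The stripe parameters used in the example calls (10 data blocks per stripe,
XOR with 1 parity block at replication 2, Reed-Solomon with 4 parity blocks
at replication 1) are illustrative assumptions, not measured values, and the
function names are made up for this example.

```python
def effective_replication(data_blocks, parity_blocks, data_repl, parity_repl):
    """Blocks physically stored per block of user data, for one stripe.

    All four parameters are assumptions supplied by the caller; they do not
    come from any cluster configuration.
    """
    if data_blocks <= 0:
        raise ValueError("data_blocks must be positive")
    stored = data_blocks * data_repl + parity_blocks * parity_repl
    return stored / data_blocks


def saved_fraction(baseline_repl, raided_repl):
    """Fraction of raw disk space saved compared with plain replication."""
    return 1 - raided_repl / baseline_repl


# Hypothetical layouts, compared with plain replication=3.
xor_layout = effective_replication(10, 1, data_repl=2, parity_repl=2)
rs_layout = effective_replication(10, 4, data_repl=1, parity_repl=1)

for name, repl in [("XOR", xor_layout), ("Reed-Solomon", rs_layout)]:
    print(f"{name}: {repl:.2f}x stored, {saved_fraction(3, repl):.0%} saved")
```

Note that the lower the data replication in the layout, the fewer nodes hold
a local copy of each block, which is where the locality question comes from.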

 

Nonetheless, one drawback I see concerns M/R data locality.

As far as I understand, the advantage of having 3 replicas of each block is
not only safety if a server fails or a block is corrupted, but also the
possibility of having up to 3 tasktrackers that can execute the map task on
local data.

Judging from the 4th slide of the Facebook presentation, such an
infrastructure reduces this possibility to only 1 tasktracker.

That means that if this tasktracker is very busy executing other tasks, you
have the following choices:

-          Waiting for this tasktracker to finish executing (part of) its
current tasks (freeing map slots, for instance)

-          Executing the map task for this block on another tasktracker,
transferring the block's data over the network

In both cases, you'll incur an M/R penalty (please tell me if I am wrong).
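
A back-of-the-envelope way to see the size of the effect is the sketch below.
It assumes, purely for illustration, that each tasktracker holding a copy of
the block has all its map slots busy independently with probability p_busy;
real schedulers and workloads are not this simple, and the function name and
the 0.8 figure are hypothetical.

```python
def p_no_local_slot(p_busy, local_copies):
    """Probability that every node holding a copy of the block is busy.

    Simplifying assumption: nodes are busy independently of each other,
    each with the same probability p_busy.
    """
    if not 0.0 <= p_busy <= 1.0:
        raise ValueError("p_busy must be between 0 and 1")
    if local_copies < 1:
        raise ValueError("local_copies must be at least 1")
    return p_busy ** local_copies


# Hypothetical load: each candidate node is busy 80% of the time.
p_busy = 0.8
for copies in (3, 2, 1):
    miss = p_no_local_slot(p_busy, copies)
    print(f"{copies} local copies: no local slot with probability {miss:.3f}")
```

Under this model, going from 3 copies to 1 raises the chance of waiting or
reading over the network from p_busy cubed to p_busy itself, so the penalty
grows with cluster load.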

 

Has anybody measured this penalty, or does anybody have benchmarks to share
with us?

 

One scenario I can think of for taking advantage of HDFS RAID without
suffering this penalty is:

-          Using normal HDFS with default replication=3 for my fresh data

-          Using HDFS RAID for my historical data (which is rarely read by
M/R)

 

And what are you using HDFS RAID for?

 

Regards,

 

Sourygna Luangsay
