It will be difficult to make a SAN work well for Hadoop, but not impossible.  I 
have done direct comparisons (but not published them yet).  Direct local 
storage is likely to have much more capacity and more total bandwidth.  But you 
can do pretty well with a SAN if you stuff it with the highest-capacity disks 
and provide an independent 8 Gb FC or 10 GbE connection for every host.  
Watch out for overall SAN bandwidth limits (which may well be much less than 
the sum of the bandwidths of the links connected to it).  There will definitely 
be a hard limit to how many hosts you connect to a single SAN.  Scaling to 
larger clusters will require multiple SANs.
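
To make the bandwidth point concrete, here is a rough back-of-the-envelope 
sketch in Python.  The numbers (8 Gb links, a 64 Gb aggregate SAN limit) and 
the function names are made up for illustration and are not measurements; the 
model simply assumes every host reads at once and the SAN shares its aggregate 
bandwidth evenly.

```python
def per_host_bandwidth_gbps(num_hosts, link_gbps, san_aggregate_gbps):
    """Best-case bandwidth per host when all hosts hit the SAN at once.

    Each host is capped by its own link and by an even share of the
    SAN's aggregate limit, whichever is smaller (a simplifying assumption).
    """
    fair_share = san_aggregate_gbps / num_hosts
    return min(link_gbps, fair_share)


def max_hosts_at_full_link_speed(link_gbps, san_aggregate_gbps):
    """Largest host count for which every link can still run at full speed."""
    return int(san_aggregate_gbps // link_gbps)


if __name__ == "__main__":
    LINK_GBPS = 8.0            # hypothetical: one 8 Gb FC link per host
    SAN_AGGREGATE_GBPS = 64.0  # hypothetical aggregate limit of the SAN

    for hosts in (4, 8, 16, 32):
        bw = per_host_bandwidth_gbps(hosts, LINK_GBPS, SAN_AGGREGATE_GBPS)
        print(f"{hosts:3d} hosts: {bw:.1f} Gb/s per host")

    limit = max_hosts_at_full_link_speed(LINK_GBPS, SAN_AGGREGATE_GBPS)
    print(f"Links stay saturated only up to {limit} hosts")
```

Past the point where the fair share drops below the link speed, adding hosts 
to the same SAN only divides the same aggregate bandwidth further, which is 
why larger clusters end up needing more than one SAN.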

Locality is an issue.  Even though each host has direct physical access to 
all the data, a "remote" access in HDFS will still have to go over the network 
to the host that owns the data.  "Local" access is fine with the constraints 
above.

RAID is not good for Hadoop performance on either local or SAN storage, so 
you'll want to configure one LUN for each physical disk in the SAN.  If you do 
have mirroring or RAID on the SAN, you may be tempted to use that to replace 
Hadoop replication.  But while the data is protected, access to the data is 
lost if the datanode goes down.  You can get around that by running the 
datanode in a VM which is stored on the SAN and using VMware HA to 
automatically restart the VM on another host in case of a failure.  Hortonworks 
has demonstrated this use case, but the strategy is a bit bleeding-edge.
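
For the one-LUN-per-disk layout mentioned above, each LUN would typically be 
mounted separately and listed as its own datanode data directory.  The sketch 
below just builds that comma-separated setting.  The mount points and the 
"hdfs/data" subdirectory are hypothetical; the property is dfs.data.dir in 
Hadoop 1.x and dfs.datanode.data.dir in 2.x, so check which one your version 
expects.

```python
from xml.sax.saxutils import escape


def data_dir_value(mount_points, subdir="hdfs/data"):
    """Join one data directory per LUN mount into a comma-separated list."""
    return ",".join(f"{m.rstrip('/')}/{subdir}" for m in mount_points)


def data_dir_property(mount_points, name="dfs.datanode.data.dir"):
    """Render the <property> element to paste into hdfs-site.xml."""
    value = escape(data_dir_value(mount_points))
    return (
        "<property>\n"
        f"  <name>{name}</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # Hypothetical mount points, one per single-disk LUN presented by the SAN.
    luns = ["/mnt/lun0", "/mnt/lun1", "/mnt/lun2", "/mnt/lun3"]
    print(data_dir_property(luns))
```

With one directory per LUN, the datanode spreads blocks across the disks 
itself, which is the behavior you give up when the SAN stripes them into a 
single RAID volume.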

Jeff

From: Pamecha, Abhishek [mailto:[email protected]]
Sent: Tuesday, October 16, 2012 11:28 AM
To: [email protected]
Subject: HDFS using SAN

Hi

I have read scattered documentation across the net, most of which says HDFS 
doesn't go well with a SAN being used to store data, while some say it is an 
emerging trend. I would love to know if any tests have been performed 
that hint at the aspects in which direct storage excels over or falls behind a SAN.

We are investigating whether a direct storage option is better than a SAN 
storage for a modest cluster with data in 100 TBs in steady state. The SAN of 
course can support an order of magnitude more IOPS than we care about for now, but 
given it is a shared infrastructure and we may expand our data size, it may not 
be an advantage in the future.

Another thing I am interested in: for MR jobs, where data locality is the key 
driver, how does that pan out when using a SAN instead of direct storage?

And of course, on the subjective topics of availability and reliability when using 
a SAN for data storage in HDFS, I would love to receive your views.

Thanks,
Abhishek
