And of course IBM has supported our GPFS and SONAS customers for a couple of years already.
---------------------------------------
Sent from my BlackBerry, so please excuse typing and spelling errors.

----- Original Message -----
From: "Kevin O'Dell" [[email protected]]
Sent: 10/17/2012 09:25 AM AST
To: [email protected]
Subject: Re: HDFS using SAN

You may want to take a look at the NetApp white paper on this. They have a SAN solution as their Hadoop offering.
http://www.netapp.com/templates/mediaView?m=tr-3969.pdf&cc=us&wid=130618138&mid=56872393

On Tue, Oct 16, 2012 at 7:28 PM, Pamecha, Abhishek <[email protected]> wrote:

Yes, for MR my impression is that network utilization is typically next to none during the map and reduce tasks and jumps during the shuffle. With a SAN I would assume there is no such separation: there will be network activity across the whole job's time window, with the shuffle probably doing more than it should.

Moreover, I hear that SANs typically split data across different physical disks by default (even without RAID), so contiguity is lost. I have no idea whether that is good or bad. It looks bad on the surface, but it probably depends on how efficiently a SAN can parallelize fetches from multiple physical disks. Any comments on this aspect?

And yes, when the dataset volume grows and one needs to do the equivalent of a full table scan, I am assuming the network has to support moving that entire dataset from the SAN to the datanodes, in parallel, to the different mappers.

So what I am gathering is that although storing data on a SAN is possible for a Hadoop installation, map-shuffle-reduce may not be the best way to process data in that environment. Is this conclusion correct?

The 3-way replication and RAID suggestions are great.

Thanks,
Abhishek

From: lohit [mailto:[email protected]]
Sent: Tuesday, October 16, 2012 3:26 PM
To: [email protected]
Subject: Re: HDFS using SAN

Adding to this: locality is very important for MapReduce applications. You might not see much of a difference between direct-attached storage and a SAN for small MapReduce jobs, but when your cluster grows, or for jobs that are heavy on I/O, you will see quite a bit of difference. Another obvious factor is cost; the argument there has been that SAN storage is much more reliable, so you do not need the default 3-way replication you would use on direct-attached storage.

2012/10/16 Jeffrey Buell <[email protected]>

It will be difficult to make a SAN work well for Hadoop, but not impossible. I have done direct comparisons (but not published them yet). Direct local storage is likely to have much more capacity and more total bandwidth, but you can do pretty well with a SAN if you stuff it with the highest-capacity disks and provide an independent 8 Gb FC or 10 GbE connection for every host. Watch out for overall SAN bandwidth limits, which may well be much less than the sum of the capacities of the wires connected to it. There will definitely be a hard limit on how many hosts you can connect to a single SAN; scaling to larger clusters will require multiple SANs.

Locality is an issue. Even though each host has direct physical access to all the data, a "remote" access in HDFS still has to go over the network to the host that owns the data. "Local" access is fine within the constraints above.

RAID is not good for Hadoop performance, on either local or SAN storage, so you will want to configure one LUN for each physical disk in the SAN. If you do have mirroring or RAID on the SAN, you may be tempted to use it to replace Hadoop replication. But while the data is protected, access to the data is lost if the datanode goes down. You can get around that by running the datanode in a VM stored on the SAN and using VMware HA to restart the VM automatically on another host in case of a failure. Hortonworks has demonstrated this use case, but the strategy is a bit bleeding-edge.

Jeff
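For reference, the two suggestions above (one LUN per physical disk exposed to the datanode, and rethinking the replication factor) map onto standard hdfs-site.xml properties. The sketch below is illustrative only: the mount points and the replication value of 2 are assumptions rather than recommendations, and on Hadoop 2.x the data-directory property is named dfs.datanode.data.dir instead of dfs.data.dir.

  <!-- hdfs-site.xml (sketch; paths and values are illustrative) -->
  <configuration>
    <!-- One data directory per LUN / physical disk, no RAID striping underneath.
         Mount points are assumed; list whatever your LUNs are mounted as. -->
    <property>
      <name>dfs.data.dir</name>
      <value>/data/disk1/dfs,/data/disk2/dfs,/data/disk3/dfs,/data/disk4/dfs</value>
    </property>

    <!-- Default is 3. Lowering it to lean on SAN-level redundancy saves space,
         but as noted above, replication also buys availability and locality,
         not just durability. -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
  </configuration>

Replication can also be changed per file after the fact (hadoop fs -setrep -w 2 /path), so the cluster-wide default does not have to be decided up front.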
From: Pamecha, Abhishek [mailto:[email protected]]
Sent: Tuesday, October 16, 2012 11:28 AM
To: [email protected]
Subject: HDFS using SAN

Hi,

I have read scattered documentation across the net, most of which says HDFS doesn't go well with a SAN for data storage, while some of it calls the combination an emerging trend. I would love to know whether any tests have been performed that hint at the aspects in which direct-attached storage excels or falls behind a SAN.

We are investigating whether direct-attached storage is a better option than SAN storage for a modest cluster with data in the 100 TB range at steady state. The SAN can of course support an order of magnitude more IOPS than we care about for now, but given that it is shared infrastructure and we may expand our data size, that may not remain an advantage in the future.

Another thing I am interested in: for MR jobs, where data locality is the key driver, how does that play out when using a SAN instead of direct storage?

And of course, on the more subjective topics of availability and reliability when using a SAN for data storage in HDFS, I would love to hear your views.

Thanks,
Abhishek

--
Have a Nice Day!
Lohit

--
Kevin O'Dell
Customer Operations Engineer, Cloudera
