And of course IBM has supported our GPFS and SONAS customers for a couple of years already.
---------------------------------------
Sent from my BlackBerry, so please excuse typing and spelling errors.

----- Original Message -----
From: "Kevin O'Dell" [[email protected]]
Sent: 10/17/2012 09:25 AM AST
To: [email protected]
Subject: Re: HDFS using SAN

You may want to take a look at the NetApp white paper on this. They have a SAN solution as their Hadoop offering.
http://www.netapp.com/templates/mediaView?m=tr-3969.pdf&cc=us&wid=130618138&mid=56872393

On Tue, Oct 16, 2012 at 7:28 PM, Pamecha, Abhishek <[email protected]> wrote:

Yes, for MR my impression is that network utilization is typically next to none during the map and reduce tasks and jumps during the shuffle. With a SAN I would assume there is no such separation: there will be network activity across the whole job's time window, with the shuffle probably doing more than it should.

Moreover, I hear that SANs typically split data across different physical disks by default (even without RAID), so contiguity is lost. I have no idea whether that is good or bad. It looks bad on the surface, but it probably depends on how efficiently a SAN can parallelize fetches from multiple physical disks. Any comments on this aspect?

And yes, when the dataset volume grows and one needs to do the equivalent of a full table scan, I am assuming the network has to support moving that entire dataset from the SAN to the datanodes, in parallel, to the different mappers.

So what I am gathering is that although storing data on a SAN is possible for a Hadoop installation, map-shuffle-reduce may not be the best way to process data in that environment. Is this conclusion correct?

The 3-way replication and RAID suggestions are great.

Thanks,
Abhishek

From: lohit [mailto:[email protected]]
Sent: Tuesday, October 16, 2012 3:26 PM
To: [email protected]
Subject: Re: HDFS using SAN

Adding to this: locality is very important for MapReduce applications. You might not see much of a difference between direct-attached storage and a SAN for small MapReduce jobs, but when your cluster grows, or for jobs that are heavy on I/O, you will see quite a bit of difference. Another obvious factor is cost; the argument there has been that SAN storage is much more reliable, so you do not need the default 3-way replication you would use on direct-attached storage.

2012/10/16 Jeffrey Buell <[email protected]>

It will be difficult to make a SAN work well for Hadoop, but not impossible. I have done direct comparisons (but not published them yet). Direct local storage is likely to have much more capacity and more total bandwidth, but you can do pretty well with a SAN if you stuff it with the highest-capacity disks and provide an independent 8 Gb FC or 10 GbE connection for every host. Watch out for overall SAN bandwidth limits, which may well be much less than the sum of the capacities of the wires connected to it. There will definitely be a hard limit on how many hosts you can connect to a single SAN; scaling to larger clusters will require multiple SANs.

Locality is an issue. Even though each host has direct physical access to all the data, a "remote" access in HDFS still has to go over the network to the host that owns the data. "Local" access is fine within the constraints above.

RAID is not good for Hadoop performance, on either local or SAN storage, so you will want to configure one LUN for each physical disk in the SAN. If you do have mirroring or RAID on the SAN, you may be tempted to use it to replace Hadoop replication. But while the data is protected, access to the data is lost if the datanode goes down. You can get around that by running the datanode in a VM stored on the SAN and using VMware HA to restart the VM automatically on another host in case of a failure. Hortonworks has demonstrated this use case, but the strategy is a bit bleeding-edge.

Jeff
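For reference, the two suggestions above (one LUN per physical disk exposed to the datanode, and rethinking the replication factor) map onto standard hdfs-site.xml properties. The sketch below is illustrative only: the mount points and the replication value of 2 are assumptions rather than recommendations, and on Hadoop 2.x the data-directory property is named dfs.datanode.data.dir instead of dfs.data.dir.

  <!-- hdfs-site.xml (sketch; paths and values are illustrative) -->
  <configuration>
    <!-- One data directory per LUN / physical disk, no RAID striping underneath.
         Mount points are assumed; list whatever your LUNs are mounted as. -->
    <property>
      <name>dfs.data.dir</name>
      <value>/data/disk1/dfs,/data/disk2/dfs,/data/disk3/dfs,/data/disk4/dfs</value>
    </property>

    <!-- Default is 3. Lowering it to lean on SAN-level redundancy saves space,
         but as noted above, replication also buys availability and locality,
         not just durability. -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
  </configuration>

Replication can also be changed per file after the fact (hadoop fs -setrep -w 2 /path), so the cluster-wide default does not have to be decided up front.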
From: Pamecha, Abhishek [mailto:[email protected]]
Sent: Tuesday, October 16, 2012 11:28 AM
To: [email protected]
Subject: HDFS using SAN

Hi,

I have read scattered documentation across the net, most of which says HDFS doesn't go well with a SAN for data storage, while some of it calls the combination an emerging trend. I would love to know whether any tests have been performed that hint at the aspects in which direct-attached storage excels or falls behind a SAN.

We are investigating whether direct-attached storage is a better option than SAN storage for a modest cluster with data in the 100 TB range at steady state. The SAN can of course support an order of magnitude more IOPS than we care about for now, but given that it is shared infrastructure and we may expand our data size, that may not remain an advantage in the future.

Another thing I am interested in: for MR jobs, where data locality is the key driver, how does that play out when using a SAN instead of direct storage?

And of course, on the more subjective topics of availability and reliability when using a SAN for data storage in HDFS, I would love to hear your views.

Thanks,
Abhishek

--
Have a Nice Day!
Lohit

--
Kevin O'Dell
Customer Operations Engineer, Cloudera
