Back up your data!

On Oct 17, 2012, at 3:25 PM, Kevin O'Dell wrote:
> You may want to take a look at the NetApp white paper on this. They have a
> SAN solution as their Hadoop offering.
>
> http://www.netapp.com/templates/mediaView?m=tr-3969.pdf&cc=us&wid=130618138&mid=56872393
>
> On Tue, Oct 16, 2012 at 7:28 PM, Pamecha, Abhishek <[email protected]> wrote:
>
> Yes, for MR, my impression is that network utilization is typically next to
> none during the map and reduce tasks but jumps during the shuffle. With a
> SAN, I would assume there is no such separation: there will be network
> activity over the job's entire time window, with the shuffle probably doing
> more than it should.
>
> Moreover, I hear that SANs typically split data across different physical
> disks by default [even without RAID], so contiguity is lost. I have no idea
> whether that is a good or a bad thing. It looks bad on the surface, but it
> probably depends on how efficiently a SAN can parallelize data fetches from
> multiple physical disks. Any comments on this aspect?
>
> And yes, when the dataset volume increases and one needs to do the
> equivalent of full table scans, I am assuming the network has to support
> moving that entire dataset from the SAN to the datanodes, in parallel, to
> the different mappers.
>
> So what I am gathering is that although storing data on a SAN is possible
> for a Hadoop installation, map-shuffle-reduce may not be the best way to
> process data in that environment. Is this conclusion correct?
>
> The 3-way replication and RAID suggestions are great.
>
> Thanks,
> Abhishek
>
> From: lohit [mailto:[email protected]]
> Sent: Tuesday, October 16, 2012 3:26 PM
> To: [email protected]
> Subject: Re: HDFS using SAN
>
> Adding to this: locality is very important for MapReduce applications. One
> might not see much of a difference for small MapReduce jobs running on
> direct-attached storage vs. a SAN, but when your cluster grows, or for jobs
> that are heavy on IO, you will see quite a bit of difference. Another
> obvious factor is cost. The argument there has been that SAN storage is
> much more reliable, so you do not need the default 3-way replication factor
> you would use on direct-attached storage.
>
> 2012/10/16 Jeffrey Buell <[email protected]>
>
> It will be difficult to make a SAN work well for Hadoop, but not
> impossible. I have done direct comparisons (but not published them yet).
> Direct local storage is likely to have much more capacity and more total
> bandwidth. But you can do pretty well with a SAN if you stuff it with the
> highest-capacity disks and provide an independent 8 Gb FC or 10 GbE
> connection for every host. Watch out for overall SAN bandwidth limits,
> which may well be much less than the sum of the capacities of the wires
> connected to it (for example, sixteen hosts with dedicated 10 GbE links can
> demand 160 Gb/s in aggregate). There will definitely be a hard limit on how
> many hosts you can connect to a single SAN; scaling to larger clusters will
> require multiple SANs.
>
> Locality is an issue. Even though each host has direct physical access to
> all the data, a "remote" access in HDFS will still have to go over the
> network to the host that owns the data. "Local" access is fine within the
> constraints above.
>
> RAID is not good for Hadoop performance on either local or SAN storage, so
> you'll want to configure one LUN for each physical disk in the SAN (see the
> configuration sketch at the end of this thread). If you do have mirroring
> or RAID on the SAN, you may be tempted to use it to replace Hadoop
> replication. But while the data is protected, access to the data is lost if
> the datanode goes down.
> You can get around that by running the datanode in a VM that is stored on
> the SAN and using VMware HA to automatically restart the VM on another host
> in case of a failure. Hortonworks has demonstrated this use case, but the
> strategy is a bit bleeding-edge.
>
> Jeff
>
> From: Pamecha, Abhishek [mailto:[email protected]]
> Sent: Tuesday, October 16, 2012 11:28 AM
> To: [email protected]
> Subject: HDFS using SAN
>
> Hi,
>
> I have read scattered documentation across the net, most of which says that
> HDFS doesn't go well with a SAN used to store its data, while some of it
> calls this an emerging trend. I would love to know whether any tests have
> been performed that hint at the aspects in which direct storage excels or
> falls behind a SAN.
>
> We are investigating whether direct storage is a better option than SAN
> storage for a modest cluster with data in the 100 TB range at steady state.
> The SAN can of course support an order of magnitude more IOPS than we care
> about for now, but given that it is shared infrastructure and we may expand
> our data size, that may not be an advantage in the future.
>
> Another thing I am interested in: for MR jobs, where data locality is the
> key driver, how does that play out when using a SAN instead of direct
> storage?
>
> And of course, on the subjective topics of availability and reliability
> when using a SAN for data storage in HDFS, I would love to hear your views.
>
> Thanks,
> Abhishek
>
> --
> Have a Nice Day!
> Lohit
>
> --
> Kevin O'Dell
> Customer Operations Engineer, Cloudera
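Pulling together two concrete suggestions from this thread (lohit's note
that reliable SAN storage may not need the default 3-way replication, and
Jeff's advice to expose one LUN per physical SAN disk), here is a minimal
hdfs-site.xml sketch. The four mount points and the replication factor of 2
are hypothetical illustrations, not values anyone in the thread recommends:

<?xml version="1.0"?>
<!-- Minimal hdfs-site.xml sketch, assuming four SAN LUNs (one per
     physical disk, per Jeff's advice) mounted at the hypothetical
     paths /mnt/san/lun0 .. /mnt/san/lun3. -->
<configuration>
  <property>
    <!-- dfs.data.dir in Hadoop 1.x; dfs.datanode.data.dir in 2.x. -->
    <name>dfs.data.dir</name>
    <!-- One directory per LUN; the datanode rotates new blocks across
         all of the listed directories. -->
    <value>/mnt/san/lun0/dfs/data,/mnt/san/lun1/dfs/data,/mnt/san/lun2/dfs/data,/mnt/san/lun3/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <!-- Default is 3. Lowering it leans on SAN-level RAID/mirroring for
         data protection, but as Jeff points out, the data becomes
         inaccessible while the datanodes holding it are down. -->
    <value>2</value>
  </property>
</configuration>

Note that replication also feeds locality: with fewer replicas, the
scheduler has fewer candidate nodes on which a map task can read its split
locally, which compounds the locality concerns raised above.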
