Yes, I have been reaching the same conclusions here. Tom, would you care to spell out the 'obvious' I/O considerations? I would like to see if there are any that differ from mine.
My three observations have been:

1. For full-table-scan MR jobs, the SAN approach transports the entire dataset over the network to the data nodes. Not good.
2. The shuffle actually involves more network transfers, when it could have been just an intra-SAN transfer. Disadvantage.
3. SAN controller caches (an additional stop in the data path, as opposed to DAS) may not be utilized as effectively because they are shared by multiple data nodes (frequent eviction).

So overall, my conclusion is that MR is not the best-suited data processing method when the data is stored in a SAN.

Btw, I thought a SAN would do block-level transfer, with the file system on top being your choice. I was surprised to see GPFS 'as' the SAN. Could you please clarify? Any way you can share your cluster size?

Thanks
Abhishek

Sent from my iPad with iMstakes

On Oct 18, 2012, at 7:41, "Tom Deutsch" <[email protected]> wrote:

Agreed Luca, we do this to support existing customers that have requested it, and it works fine within obvious I/O considerations. But it is not a recommended way to do a green-field deployment.
------------------------------------------------
Tom Deutsch
Program Director
Information Management Big Data Technologies
IBM
3565 Harbor Blvd
Costa Mesa, CA 92626-1420
[email protected]
Twitter: @thomasdeutsch
Data Management Blog: http://ibmdatamag.com/author/tdeutsch/
LinkedIn: http://www.linkedin.com/profile/view?id=833160
Quora: http://www.quora.com/Tom-Deutsch
Smarter Computing Blog: http://www.smartercomputingblog.com/contributorsprofile/?user_id=223
IBM Big Data Hub Blog: http://www.ibmbigdatahub.com/blog/author/tom-deutsch
Big Data for Business Executives Group: http://www.linkedin.com/groups?gid=4455695

From: Luca Pireddu <[email protected]>
To: [email protected]
Date: 10/18/2012 05:33 AM
Subject: Re: HDFS using SAN
________________________________

On 10/18/2012 02:21 AM, Pamecha, Abhishek wrote:
> Tom
>
> Do you mean you are using GPFS instead of HDFS? Also, if you can share,
> are you deploying it as DAS set up or a SAN?
>
> Thanks,
>
> Abhishek
>

Though I don't think I'd buy a SAN for a new Hadoop cluster, we have a SAN and are using it *instead of HDFS* with a small/medium Hadoop MapReduce cluster (up to 100 nodes or so, depending on our need). We still use the local node disks for intermediate data (mapred local storage). Although this set-up does limit our ability to scale to a large number of nodes, that's not a concern for us. On the plus side, we gain the flexibility to share our cluster with non-Hadoop users at our centre.

--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
09010 Pula (CA), Italy
Tel: +39 0709250452
