On 8 Mar 2016, at 16:34, Eddie Esquivel <eduardo.esqui...@gmail.com> wrote:
> Hello All,
>
> In the Spark documentation under "Hardware Requirements" it very clearly states:
>
>   We recommend having 4-8 disks per node, configured without RAID (just as separate mount points)
>
> My question is: why not RAID? What is the argument/reason for not using RAID?

RAID uses some form of erasure coding to keep data durable in the presence of single-disk failures, on a single machine. It relies on the ability to recreate a lost disk fast (getting harder with big disks), and it assumes that the failure mode is the HDD, not the interconnect, the software stack, or the server itself.

Cross-machine replication lets you deal with all of that: it gives you resilience to entire machine failures, more hosts where the data is local, and more bandwidth.

Some theory on Hadoop cluster data integrity and durability:
http://www.slideshare.net/steve_l/did-you-reallywantthatdata

As for RAID-0, which does offer bandwidth, it has the weakest reliability guarantees:
http://hortonworks.com/blog/why-not-raid-0-its-about-time-and-snowflakes/

Hadoop 3 is adding erasure coding to HDFS, where you get better storage efficiency (~1.6-2x raw data, vs 3x today) in exchange for a performance cost: the notion of "local" data is weakened, and your bandwidth drops. I think it'll be used primarily for cold data, though I'm personally curious about the combination of EC + SSD on a fast network: is it worth the network cost in exchange for keeping more data on SSD?

There's a special case: a single-node machine with lots of cores + RAM and the disks hanging off it. There I'd use RAID, and think of some backup strategy for the data you really care about.
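On the configuration side, the "separate mount points" recommendation maps to listing each disk in Spark's `spark.local.dir` as a comma-separated list, so shuffle and spill I/O is spread across the independent disks. A minimal sketch; the `/data1`..`/data4` mount point names are illustrative, not from the docs:

```
# spark-defaults.conf -- one directory per physical disk, no RAID underneath
# (mount point names /data1../data4 are examples only)
spark.local.dir    /data1/spark,/data2/spark,/data3/spark,/data4/spark
```

On YARN the node manager's local-dirs setting plays the same role and overrides this property.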
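The RAID-0 point can be checked with back-of-the-envelope arithmetic: striping across n disks means losing any one disk loses the whole volume, so the array only survives if every disk does. The failure-rate number below is an illustrative assumption, not a figure from this thread:

```python
# RAID-0 survival: all n disks must survive, so the array's survival
# probability is p ** n for per-disk survival probability p.
def raid0_survival(p_disk: float, n_disks: int) -> float:
    return p_disk ** n_disks

# Illustrative assumption: 3% annual failure rate per disk (97% survival).
p = 0.97
for n in (1, 4, 8):
    print(f"{n} disk(s): {raid0_survival(p, n):.3f} annual survival")
```

With those assumed numbers, an 8-disk stripe set is several times more likely to lose data in a year than a single disk, which is the sense in which RAID-0's guarantees are the weakest.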
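The ~1.6-2x vs 3x overhead comparison falls out of simple arithmetic: 3-way replication stores three full copies, while a Reed-Solomon scheme RS(k, m) stores k data blocks plus m parity blocks, i.e. (k + m) / k times the raw data. The RS schemes below are HDFS-style examples of my choosing, not something the thread specifies:

```python
# Storage overhead: replication vs Reed-Solomon erasure coding.
def replication_overhead(copies: int) -> float:
    # n-way replication stores n full copies of the data.
    return float(copies)

def rs_overhead(data_blocks: int, parity_blocks: int) -> float:
    # RS(k, m) stores k data + m parity blocks per stripe.
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead(3))   # 3-way replication -> 3.0x raw data
print(rs_overhead(3, 2))         # RS(3,2) -> ~1.67x
print(rs_overhead(6, 3))         # RS(6,3) -> 1.5x
```

So the quoted ~1.6-2x range corresponds to the wider-parity end of the RS schemes, with the tradeoff being that a block's stripe is spread across many machines, which is exactly why "local" reads weaken.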