On 8 Mar 2016, at 16:34, Eddie Esquivel <eduardo.esqui...@gmail.com> wrote:
> Hello All,
>
> In the Spark documentation under "Hardware Requirements" it very clearly states:
>
>   We recommend having 4-8 disks per node, configured without RAID (just as separate mount points)
>
> My question is: why not RAID? What is the argument/reason for not using RAID?

RAID uses some form of erasure coding to keep data durable in the presence of single-disk failures, on a single machine. It relies on the ability to recreate a lost disk fast (getting harder with big disks), and it assumes that the failure mode is the HDD, not the interconnect, the software stack, or the server itself.

Cross-machine replication lets you deal with all of that: it gives you resilience to entire machine failures, more hosts where the data is local, and more bandwidth.

Some theory on Hadoop cluster data integrity and durability:
http://www.slideshare.net/steve_l/did-you-reallywantthatdata

As for RAID-0, which does offer bandwidth, it has the weakest reliability guarantees:
http://hortonworks.com/blog/why-not-raid-0-its-about-time-and-snowflakes/

Hadoop 3 is adding erasure coding to HDFS, where you get better storage efficiency (~1.6-2x raw data, vs 3x today) in exchange for a performance cost: the notion of "local" data is weakened, and your bandwidth drops. I think it'll be used primarily for cold data, though I'm personally curious about the combination of EC + SSD on a fast network: is it worth the network cost in exchange for keeping more data on SSD?

There's a special case: a single-node machine with lots of cores + RAM and the disks hanging off it. There I'd use RAID, and think of some backup strategy for the data you really care about.
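On the configuration side, the "separate mount points" recommendation maps to listing each disk in Spark's `spark.local.dir` as a comma-separated list, so shuffle and spill I/O is spread across the independent disks. A minimal sketch; the `/data1`..`/data4` mount point names are illustrative, not from the docs:

```
# spark-defaults.conf -- one directory per physical disk, no RAID underneath
# (mount point names /data1../data4 are examples only)
spark.local.dir    /data1/spark,/data2/spark,/data3/spark,/data4/spark
```

On YARN the node manager's local-dirs setting plays the same role and overrides this property.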
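The RAID-0 point can be checked with back-of-the-envelope arithmetic: striping across n disks means losing any one disk loses the whole volume, so the array only survives if every disk does. The failure-rate number below is an illustrative assumption, not a figure from this thread:

```python
# RAID-0 survival: all n disks must survive, so the array's survival
# probability is p ** n for per-disk survival probability p.
def raid0_survival(p_disk: float, n_disks: int) -> float:
    return p_disk ** n_disks

# Illustrative assumption: 3% annual failure rate per disk (97% survival).
p = 0.97
for n in (1, 4, 8):
    print(f"{n} disk(s): {raid0_survival(p, n):.3f} annual survival")
```

With those assumed numbers, an 8-disk stripe set is several times more likely to lose data in a year than a single disk, which is the sense in which RAID-0's guarantees are the weakest.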
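The ~1.6-2x vs 3x overhead comparison falls out of simple arithmetic: 3-way replication stores three full copies, while a Reed-Solomon scheme RS(k, m) stores k data blocks plus m parity blocks, i.e. (k + m) / k times the raw data. The RS schemes below are HDFS-style examples of my choosing, not something the thread specifies:

```python
# Storage overhead: replication vs Reed-Solomon erasure coding.
def replication_overhead(copies: int) -> float:
    # n-way replication stores n full copies of the data.
    return float(copies)

def rs_overhead(data_blocks: int, parity_blocks: int) -> float:
    # RS(k, m) stores k data + m parity blocks per stripe.
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead(3))   # 3-way replication -> 3.0x raw data
print(rs_overhead(3, 2))         # RS(3,2) -> ~1.67x
print(rs_overhead(6, 3))         # RS(6,3) -> 1.5x
```

So the quoted ~1.6-2x range corresponds to the wider-parity end of the RS schemes, with the tradeoff being that a block's stripe is spread across many machines, which is exactly why "local" reads weaken.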