Re: Spark on RAID

2016-03-09 Thread Steve Loughran

On 8 Mar 2016, at 16:34, Eddie Esquivel wrote:

Hello All,
In the Spark documentation under "Hardware Requirements" it very clearly states:

We recommend having 4-8 disks per node, configured without RAID (just as 
separate mount points)

My question is why not RAID? What is the argument/reason for not using RAID?



RAID uses some form of erasure coding to keep data durable in the presence of 
single-disk failures, on a single machine. It relies on the ability to recreate 
a lost disk fast (getting harder with big disks), and assumes that the 
failure mode is the HDD, not the interconnect, the software stack or the server 
itself.

Cross-machine replication handles that and adds resilience to entire-machine 
failures; it also gives you more hosts where the data is local, and more 
bandwidth.

Some theory on Hadoop cluster data integrity and durability:
http://www.slideshare.net/steve_l/did-you-reallywantthatdata


As for RAID-0, which does offer bandwidth, it has the weakest reliability 
guarantees:

http://hortonworks.com/blog/why-not-raid-0-its-about-time-and-snowflakes/
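A back-of-the-envelope sketch of that weakness (the per-disk failure rate is a hypothetical figure, purely for illustration):

```python
# Sketch: why RAID-0 is the weakest option for durability.
# Assumes disks fail independently with a hypothetical annual
# failure probability p -- an illustrative number, not a vendor figure.

def p_any_disk_fails(p: float, n: int) -> float:
    """Probability that at least one of n disks fails within the period."""
    return 1 - (1 - p) ** n

p = 0.03  # hypothetical 3% annual failure rate per disk
n = 8     # upper end of the 4-8 disks/node recommendation

# RAID-0 stripes every block across all disks, so a single disk
# failure destroys the entire array's contents.
raid0_loss = p_any_disk_fails(p, n)

print(f"RAID-0: P(lose all local data in a year) ~ {raid0_loss:.3f}")
# With separate mount points, the same event costs only one disk's
# worth of data (~1/n of it), and HDFS re-replicates those blocks anyway.
print(f"JBOD:   P(lose any single disk) = {p:.3f}, losing ~1/{n} of the data")
```

With these illustrative numbers, striping 8 disks turns a 3% per-disk risk into roughly a 22% chance of losing everything on the node.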


Hadoop 3 is adding erasure coding to HDFS, where you get better storage 
efficiency (~1.6 to 2x raw data, vs 3x today) in exchange for a performance 
cost: the notion of "local" data is weakened and your bandwidth drops. I think 
it'll be used primarily for cold data, though I'm personally curious about the 
combination of EC + SSD on a fast network: is it worth the network cost in 
exchange for keeping more data on SSD?
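The arithmetic behind those overhead figures, sketched with Reed-Solomon layouts (the specific RS schemes here are illustrative assumptions, not a statement of what HDFS-EC ships with):

```python
# Sketch: raw storage per byte of user data, 3x replication vs
# Reed-Solomon erasure coding. RS(6,3) and RS(10,4) are schemes
# commonly discussed for HDFS erasure coding; treat them as illustrative.

def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw units stored per unit of user data."""
    return (data_units + parity_units) / data_units

print(f"3x replication: {storage_overhead(1, 2):.1f}x raw")   # three full copies
print(f"RS(6,3)  EC:    {storage_overhead(6, 3):.1f}x raw")   # survives 3 lost units
print(f"RS(10,4) EC:    {storage_overhead(10, 4):.1f}x raw")  # survives 4 lost units
```

Wide stripes land at 1.4-1.5x, the optimistic end of the "~1.6 to 2x" range above; narrower stripes (e.g. RS(3,2), at about 1.67x) fall inside it.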

There's a special case: a single-node machine with lots of cores and RAM, with 
the disks hanging off it. There I'd use RAID, and think about a backup strategy 
for data you really care about.


Re: Spark on RAID

2016-03-08 Thread Mark Hamstra
One issue is that RAID levels providing data replication are not necessary
since HDFS already replicates blocks on multiple nodes.
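That replication lives in standard HDFS configuration; a minimal sketch (3 is the usual default value):

```xml
<!-- hdfs-site.xml: block replication across nodes, which is what
     makes per-node RAID redundancy largely unnecessary. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```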

On Tue, Mar 8, 2016 at 8:45 AM, Alex Kozlov wrote:

> Parallel disk IO?  But the effect should be less noticeable compared to
> Hadoop which reads/writes a lot.  Much depends on how often Spark persists
> on disk.  Depends on the specifics of the RAID controller as well.
>
> If you write to HDFS as opposed to local file system this may be a big
> factor as well.
>
> On Tue, Mar 8, 2016 at 8:34 AM, Eddie Esquivel wrote:
>
>> Hello All,
>> In the Spark documentation under "Hardware Requirements" it very clearly
>> states:
>>
>> We recommend having *4-8 disks* per node, configured *without* RAID
>> (just as separate mount points)
>>
>> My question is why not RAID? What is the argument/reason for not using
>> RAID?
>>
>> Thanks!
>> -Eddie
>>
>
> --
> Alex Kozlov
>


Re: Spark on RAID

2016-03-08 Thread Alex Kozlov
Parallel disk IO?  But the effect should be less noticeable compared to
Hadoop, which reads/writes a lot.  Much depends on how often Spark persists
to disk, and on the specifics of the RAID controller as well.

If you write to HDFS as opposed to the local file system, this may be a big
factor as well.
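Concretely, the "separate mount points" recommendation maps to listing each disk in `spark.local.dir`, so Spark itself stripes shuffle and spill IO across them (a sketch; the mount paths are hypothetical):

```properties
# spark-defaults.conf -- one entry per physical disk (hypothetical paths).
# Spark spreads temporary/shuffle files across these directories,
# giving parallel disk IO without a RAID layer.
spark.local.dir  /mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark,/mnt/disk4/spark
```

On YARN, the NodeManager's `yarn.nodemanager.local-dirs` setting plays the equivalent role.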

On Tue, Mar 8, 2016 at 8:34 AM, Eddie Esquivel wrote:

> Hello All,
> In the Spark documentation under "Hardware Requirements" it very clearly
> states:
>
> We recommend having *4-8 disks* per node, configured *without* RAID (just
> as separate mount points)
>
> My question is why not RAID? What is the argument/reason for not using
> RAID?
>
> Thanks!
> -Eddie
>

--
Alex Kozlov


Spark on RAID

2016-03-08 Thread Eddie Esquivel
Hello All,
In the Spark documentation under "Hardware Requirements" it very clearly
states:

We recommend having *4-8 disks* per node, configured *without* RAID (just
as separate mount points)

My question is why not RAID? What is the argument/reason for not using RAID?

Thanks!
-Eddie