Re: Hadoop and RAID 5

Ulul Sun, 05 Oct 2014 14:18:28 -0700

Hi Travis

Thank you for your detailed answer and for honoring my question with a blog entry :-)

I will look into bus quiescing with the admins, but I'm under the impression that nothing special is done: the HW RAID controller takes care of everything, and the HP documentation states that inserting a hot-pluggable disk induces a pause of one or two seconds in disk activity. I'll check whether this is handled through the controller cache and/or done outside business hours for safety.

I'll ask for internal benchmarking, hoping it will convince everyone to accept the JBOD model and automate what's necessary for it not to disrupt operations.


Thanks again
Ulul

On 02/10/2014 00:25, Travis wrote:

On Wed, Oct 1, 2014 at 4:01 PM, Ulul <[email protected]<mailto:[email protected]>> wrote:
    Dear hadoopers,

    Has anyone been confronted with deploying a cluster in a traditional
    IT shop whose admins handle thousands of servers?
    They traditionally use SAN or NAS storage for app data, rely on
    RAID 1 for system disks and in the few cases where internal disks
    are used, they configure them with RAID 5 provided by the internal
    HW controller.


Yes.  I've been on both sides of this discussion.
The key is to help them understand that you don't need redundancy within a system because Hadoop provides redundancy across the entire cluster via replication. This then leaves the problem as a performance one, in which case you show them benchmarks on the hardware they provide in both RAID (RAID0, RAID1, and RAID5) and JBOD modes.
    Using a JBOD setup, as advised in each and every Hadoop doc I
    ever laid my hands on, means that each HDD failure will imply, on
    top of the physical replacement of the drive, that an admin
    performs at least an mkfs.
    Added to the fact that these operations will become more frequent
    since more internal disks will be used, it can be perceived as an
    annoying disruption in industrial handling of numerous servers.
I fail to see how this is really any different than the process of having to deal with a failed drive in an array. Depending on your array type, you may still have to do things to quiesce the bus before doing any drive operation, such as adding or removing the drive, you may still have to trigger the rebuild yourself, and so on.
I have a few thousand disks in my cluster. We lose about 3-5 a quarter. I don't find it any more work to re-mkfs the drive after it's been swapped out and have built tools around the process to make sure it's consistently done by our DC staff (and yes, I did it before the DC staff was asked to). If you're concerned about the high-touch aspect of swapping disks out, then you can always configure the datanode to be tolerant of multiple disk failures (something you cannot do with RAID5) and then just take the whole machine out of the cluster to do swaps when you've reached a particular threshold of bad disks.
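As a rough illustration of that threshold approach (a sketch, not the tooling mentioned above): HDFS exposes the tolerance as the dfs.datanode.failed.volumes.tolerated property in hdfs-site.xml, and a swap policy can sit on top of it. The function name, the action labels, and the idea of a separate swap threshold are hypothetical and only meant to show the decision.

```python
def maintenance_action(failed_disks: int, swap_threshold: int, tolerated: int) -> str:
    """Decide what to do with a datanode given its count of failed data disks.

    tolerated      -- assumed value of dfs.datanode.failed.volumes.tolerated
    swap_threshold -- hypothetical site policy: pull the node for a swap once
                      this many disks are bad (must not exceed `tolerated`)
    """
    if swap_threshold > tolerated:
        raise ValueError("swap_threshold must not exceed tolerated")
    if failed_disks > tolerated:
        # More failed volumes than HDFS tolerates: the datanode stops serving.
        return "datanode-stopped"
    if failed_disks >= swap_threshold:
        # Still serving, but enough bad disks to justify one maintenance visit.
        return "decommission-and-swap"
    return "keep-running"


if __name__ == "__main__":
    for failed in range(5):
        print(failed, maintenance_action(failed, swap_threshold=2, tolerated=3))
```

Keeping the swap threshold at or below the tolerated count means the machine is pulled deliberately rather than dropping out of the cluster on its own.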
    In Tom White's guide there is a discussion of RAID 0, stating that
    Yahoo benchmarks showed a 10% loss in performance so we can expect
    even worse perf with RAID 5 but I found no figures.
I had to re-read that section for reference. My apologies if the following is a little long-winded and rambling.
I'm going to assume that Tom is not talking about single-disk RAID0 volumes, which is a common way of doing JBOD with a RAID controller that doesn't have JBOD support.
In general, performance is going to depend upon how many active streams of I/O you have going on the system.
With JBOD, as Tom discusses, every spindle is its own unique snowflake, and if your drive controller can keep up, you can write as fast as that drive can handle reading off the bus. Performance is going to depend upon how many active reading/writing streams you have accessing each spindle in the system.
If I had one stream, I would only get the performance of one spindle in the JBOD. If I had twelve spindles, I'm going to get maximum performance with at least twelve streams. With RAID0, you're taking your one stream, cutting it up into multiple parts and either reading it or writing it to all disks, taking advantage of the performance of all spindles.
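The JBOD half of this reasoning can be written down as a very simplified model. This is only a sketch: it assumes streams are spread one per spindle, that the controller never becomes the bottleneck, and that each disk sustains a fixed rate. The function name and the 140 MB/s figure are illustrative assumptions, not measurements.

```python
def jbod_throughput_mb_s(streams: int, spindles: int, per_disk_mb_s: int) -> int:
    """Idealized aggregate JBOD throughput.

    Each stream keeps at most one spindle busy, so adding streams beyond
    the number of spindles does not add throughput in this model.
    """
    busy_spindles = min(streams, spindles)
    return busy_spindles * per_disk_mb_s


if __name__ == "__main__":
    # per_disk_mb_s=140 is a hypothetical per-disk streaming rate.
    for streams in (1, 6, 12, 36):
        print(streams, jbod_throughput_mb_s(streams, spindles=12, per_disk_mb_s=140))
```

With one stream the model gives the rate of a single disk; it only reaches the full twelve-disk rate at twelve or more streams.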
The problem arises when you start adding more streams in parallel to the RAID0 environment. Each parallel I/O operation begins competing with each other from the controller's standpoint. Sometimes things start to stack up as the controller has to wait for competing I/O operations on a single spindle. For example, having to wait for a write to complete before a read can be done.
At a certain point, the performance of RAID0 begins to hit a knee as the number of I/O requests goes up because the controller becomes the bottleneck. RAID0 is going to be the closest to JBOD in performance, but with the risk that if you lose a single disk, you lose the entire RAID. With JBOD, if you lose a single disk, you only lose the data on that disk.
Now, with RAID5, you're going to have even worse performance because you're dealing with not only the parity calculation, but also with the fact that you incur a performance penalty during reads and writes due to how the data is laid out across all disks in the RAID. You can read more about this here: http://theithollow.com/2012/03/understanding-raid-penalty/
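A back-of-the-envelope version of the write penalty uses the commonly cited back-end I/O counts per front-end write (1 for RAID0, 2 for RAID1, 4 for RAID5). The sketch below applies them to a mixed workload. It models small random I/O only, not the streaming case, and the raw IOPS figure in the usage example is a made-up assumption rather than a benchmark.

```python
# Commonly cited back-end I/Os needed per front-end write, by RAID level.
WRITE_PENALTY = {"RAID0": 1, "RAID1": 2, "RAID5": 4}


def effective_iops(raw_iops: float, write_fraction: float, level: str) -> float:
    """Front-end IOPS an array can deliver for a given read/write mix.

    Reads cost one back-end I/O each; writes cost WRITE_PENALTY[level].
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    penalty = WRITE_PENALTY[level]
    read_part = raw_iops * (1.0 - write_fraction)
    write_part = raw_iops * write_fraction / penalty
    return read_part + write_part


if __name__ == "__main__":
    # raw_iops=1200 is hypothetical (e.g. 12 disks at about 100 IOPS each).
    for level in ("RAID0", "RAID1", "RAID5"):
        print(level, effective_iops(1200, write_fraction=0.5, level=level))
```

The more write-heavy the workload, the larger the gap between RAID5 and the other layouts in this model.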
To put this in perspective, I use 12 7200rpm NLSAS disks in a system connected to an LSI9207 SAS controller. This is configured for JBOD. I have benchmarked streaming reads and writes in this environment to be between 1.6 and 1.8GBytes/sec using 1 i/o stream per spindle for a total of 12 i/o streams occurring on the system. Btw, this benchmark has held stable at this rate for at least 3 i/o streams per spindle; I haven't tested higher yet.
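Dividing those aggregate figures by the twelve spindles gives the implied per-disk rate, roughly 133 to 150 MB/s. A small sketch of the arithmetic, assuming decimal units (1 GByte = 1000 MBytes); the function name is just for illustration:

```python
def per_spindle_mb_s(aggregate_gb_s: float, spindles: int) -> float:
    """Average per-disk rate implied by an aggregate streaming rate.

    Assumes 1 GByte = 1000 MBytes and an even spread across spindles.
    """
    return aggregate_gb_s * 1000 / spindles


if __name__ == "__main__":
    for aggregate in (1.6, 1.8):
        print(aggregate, round(per_spindle_mb_s(aggregate, spindles=12), 1))
```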
Now, I might get this performance with RAID0, but why should I tolerate the risk of losing all data on the system vs just the data on a single drive? Going with RAID0 means that not only do I have to replace the disk, but now I have to have Hadoop rebalance/redistribute data to the entire system, not just dealing with the small amount of data missing from one spindle. And since Hadoop is already handling my redundancy via replication of data, why should I tolerate the performance penalty associated with RAID5? I don't need redundancy in a *single* system, I need redundancy across the entire cluster.
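To put a number on the difference in blast radius, here is a minimal sketch of how much data Hadoop has to re-replicate after a single disk failure under each layout. The 3 TB of used space per disk is a hypothetical figure, and the function name is illustrative.

```python
def tb_lost_on_disk_failure(layout: str, disks: int, used_tb_per_disk: float) -> float:
    """Data (in TB) that must be re-replicated after one disk fails.

    JBOD:  only the failed disk's blocks are gone.
    RAID0: the stripe set is destroyed, so every disk's data is gone.
    """
    if layout == "JBOD":
        return used_tb_per_disk
    if layout == "RAID0":
        return disks * used_tb_per_disk
    raise ValueError(f"unsupported layout: {layout}")


if __name__ == "__main__":
    # used_tb_per_disk=3.0 is an assumed fill level, not a figure from the thread.
    for layout in ("JBOD", "RAID0"):
        print(layout, tb_lost_on_disk_failure(layout, disks=12, used_tb_per_disk=3.0))
```

Under these assumptions a RAID0 node costs twelve times as much re-replication traffic as a JBOD node for the same single-disk failure.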
    I also found a Hortonworks interview of StackIQ who provides
    software to automate such failure fix up. But it would be rather
    painful to go straight to another solution, contract and so on
    while starting with Hadoop.

    Please share your experiences around RAID for redundancy (1, 5 or
    other) in Hadoop conf.
I can't see any situation in which we would use RAID for the data drives in our Hadoop cluster. We only use RAID1 for the OS drives, simply because we want to reduce the recovery period associated with a system failure. No reason to re-install a system and have to replicate data back onto it if we don't have to.
Cheers,
Travis
--
Travis Campbell
[email protected] <mailto:[email protected]>
