Re: Raid over 48 disks ... for real now
Quoting Norman Elton [EMAIL PROTECTED]:

> I posed the question a few weeks ago about how to best accommodate
> software RAID over an array of 48 disks (a Sun X4500 server, a.k.a.
> Thumper). I appreciate all the suggestions.
>
> Well, the hardware is here. It is indeed six Marvell 88SX6081 SATA
> controllers, each with eight 1TB drives, for a total raw storage of
> 48TB. I must admit, it's quite impressive. And loud. More information
> about the hardware is available online...
>
> http://www.sun.com/servers/x64/x4500/arch-wp.pdf
>
> It came loaded with Solaris, configured with ZFS. Things seemed to
> work fine. I did not do any benchmarks, but I can revert to that
> configuration if necessary.
>
> Now I've loaded RHEL onto the box. For a first shot, I've created one
> RAID-5 array (+ 1 spare) on each of the controllers, then used LVM to
> create a VolGroup across the arrays.
>
> So now I'm trying to figure out what to do with this space. So far,
> I've tested mke2fs on a 1TB and a 5TB LogVol. I wish RHEL would
> support XFS/ZFS, but for now, I'm stuck with ext3. Am I better off
> sticking with relatively small partitions (2-5 TB), or should I crank
> up the block size and go for one big partition?

Impressive system. I'm curious what the storage drives look like and how they attach to the server with that many disks. Sounds like you have some time to play around before shoving it into production. I wonder how long it would take to run an fsck on one large filesystem?

Cheers,
Mike

- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raid over 48 disks ... for real now
> I wonder how long it would take to run an fsck on one large
> filesystem? :)

I would imagine you'd have time to order a new system, build it, and restore the backups before the fsck was done!
Re: Raid over 48 disks ... for real now
It is quite a box. There's a picture of the box with the cover removed on Sun's website:

http://www.sun.com/images/k3/k3_sunfirex4500_4.jpg

From the X4500 homepage, there's a gallery of additional pictures. The drives drop in from the top. Massive fans channel air in the small gaps between the drives. It doesn't look like there's much room between the disks, but a lot of cold air gets sucked in the front, and a lot of hot air comes out the back. So it must be doing its job :).

I have not tried a fsck on it yet. I'll probably set up a lot of 2TB partitions rather than a single large partition, then write the software to handle storing data across many partitions.

Norman

On 1/18/08, [EMAIL PROTECTED] wrote:

> Impressive system. I'm curious what the storage drives look like and
> how they attach to the server with that many disks? Sounds like you
> have some time to play around before shoving it into production. I
> wonder how long it would take to run an fsck on one large filesystem?
Re: Raid over 48 disks ... for real now
On Thu, 17 Jan 2008, Janek Kozicki wrote:

> > I wish RHEL would support XFS/ZFS, but for now, I'm stuck with ext3.
>
> There is ext4 (or ext4dev) - it's ext3 modified to support a 1024 PB
> (1048576 TB) filesystem size. You could check whether it's feasible.
> Personally I'd always stick with ext2/ext3/ext4, since it is the most
> widely used and thus has the best recovery tools.

Something else to keep in mind: XFS filesystem repair tools require large amounts of memory. If you were to create one or a few really huge filesystems on this array, you might end up with filesystems which can't be repaired because you don't have, or even can't get, a machine with enough RAM for the job... not to mention the amount of time it would take.

-- 
Jon Lewis              | I route
Senior Network Engineer | therefore you are
Atlantic Net            |
http://www.lewis.org/~jlewis/pgp for PGP public key
Raid over 48 disks ... for real now
I posed the question a few weeks ago about how to best accommodate software RAID over an array of 48 disks (a Sun X4500 server, a.k.a. Thumper). I appreciate all the suggestions.

Well, the hardware is here. It is indeed six Marvell 88SX6081 SATA controllers, each with eight 1TB drives, for a total raw storage of 48TB. I must admit, it's quite impressive. And loud. More information about the hardware is available online:

http://www.sun.com/servers/x64/x4500/arch-wp.pdf

It came loaded with Solaris, configured with ZFS. Things seemed to work fine. I did not do any benchmarks, but I can revert to that configuration if necessary.

Now I've loaded RHEL onto the box. For a first shot, I've created one RAID-5 array (+ 1 spare) on each of the controllers, then used LVM to create a VolGroup across the arrays.

So now I'm trying to figure out what to do with this space. So far, I've tested mke2fs on a 1TB and a 5TB LogVol. I wish RHEL would support XFS/ZFS, but for now, I'm stuck with ext3. Am I better off sticking with relatively small partitions (2-5 TB), or should I crank up the block size and go for one big partition?

Thoughts?

Norman Elton
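For concreteness, here is a hedged sketch of the per-controller layout described above: one RAID-5 of seven drives plus a hot spare on each of the six controllers, tied together with LVM. The device paths (/dev/c0d0 ...) and the volume-group and logical-volume names are illustrative assumptions, not the real Thumper device names; the script only prints the commands it would run, it does not touch any disks.

```shell
#!/bin/sh
# Sketch: one 7-drive RAID-5 + 1 spare per 8-port controller, then LVM
# across the six md arrays.  All device names are hypothetical.
cmds=""
for ctrl in 0 1 2 3 4 5; do
  # eight drives per controller: /dev/cNd0 .. /dev/cNd7 (assumed names)
  devs=$(for d in 0 1 2 3 4 5 6 7; do printf '/dev/c%sd%s ' "$ctrl" "$d"; done)
  cmds="$cmds
mdadm --create /dev/md$ctrl --level=5 --raid-devices=7 --spare-devices=1 $devs"
done
cmds="$cmds
pvcreate /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
vgcreate vg_thumper /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
lvcreate -L 2T -n lv_data vg_thumper"
# Print the plan rather than executing it.
echo "$cmds"
```

Each mdadm line names eight devices (7 active + 1 spare), matching the "+1 spare" per controller described in the post.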
Re: Raid over 48 disks ... for real now
> Hi, sounds like a monster server. I am interested in how you will make
> the space useful to remote machines - iSCSI? This is what I am
> researching currently.

Yes, it's a honker of a box. It will be collecting data from various collector servers. The plan right now is to collect the data into binary files using a daemon (already running on a smaller box), then make the last 30/60/90/?? days available in a database that is populated from these files. If we need to gather older data, then the individual files must be consulted locally.

So, in production, I would probably set up the database partition on its own set of 6 disks, then dedicate the rest to handling/archiving the raw binary files. These files are small (a few MB each), as they get rotated every five minutes.

Hope this makes sense, and provides a little background info on what we're trying to do.

Norman
Re: Raid over 48 disks
On Wed, 19 Dec 2007 07:28:20 +1100, Neil Brown [EMAIL PROTECTED] said:

[ ... what to do with 48 drive Sun Thumpers ... ]

neilb> I wouldn't create a raid5 or raid6 on all 48 devices. RAID5 only
neilb> survives a single device failure, and with that many devices, the
neilb> chance of a second failure before you recover becomes appreciable.

That's just one of many problems; others are:

* If a drive fails, rebuild traffic is going to hit hard, with reading in parallel 47 blocks to compute a new 48th.

* With a parity strip length of 48 it will be that much harder to avoid read-modify-write, as it will be avoidable only for writes of at least 48 blocks aligned on 48-block boundaries. And reading 47 blocks to write one is going to be quite painful.

[ ... ]

neilb> RAID10 would be a good option if you are happy with 24 drives'
neilb> worth of space.

[ ... ]

That sounds like the only feasible option (except for the 3-drive case in most cases). Parity RAID does not scale much beyond 3-4 drives.

neilb> Alternately, 8 6-drive RAID5s or 6 8-drive RAID6s, and use RAID0
neilb> to combine them together. This would give you adequate
neilb> reliability and performance and still a large amount of storage
neilb> space.

That sounds optimistic to me: the reason to do a RAID50 of 8x(5+1) can only be to have a single filesystem; otherwise one could have 8 distinct filesystems, each with a subtree of the whole. With a single filesystem, the failure of any one of the 8 RAID5 components of the RAID0 will cause the loss of the whole lot.

So in the 47+1 case a loss of any two drives would lead to complete loss; in the 8x(5+1) case only a loss of two drives in the same RAID5 will. It does not sound like a great improvement to me (especially considering the thoroughly inane practice of building arrays out of disks of the same make and model taken out of the same box).

There are also modest improvements in the RMW strip size and in the cost of a rebuild after a single drive loss. Probably the reduction in the RMW strip size is the best improvement.

Anyhow, let's assume 0.5TB drives; with 47+1 we get a single 23.5TB filesystem, and with 8x(5+1) we get a 20TB filesystem. With current filesystem technology either size is worrying, for example as to the time needed for an 'fsck'.

In practice RAID5 beyond 3-4 drives seems only useful for almost read-only filesystems where restoring from backups is quick and easy, never mind the 47+1 case or the 8x(5+1) one, and I think that giving some credit even to the latter arrangement is not quite right...
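The capacity figures above can be checked with a little shell arithmetic (0.5 TB drives, as assumed in the text): 47 data drives behind one parity drive versus eight independent 5+1 sets.

```shell
#!/bin/sh
# Back-of-envelope check of the 47+1 vs. 8x(5+1) usable capacities.
drive_tb=0.5
single=$(awk -v d=$drive_tb 'BEGIN { print 47 * d }')     # one big RAID-5
raid50=$(awk -v d=$drive_tb 'BEGIN { print 8 * 5 * d }')  # RAID-0 over 8 RAID-5s
echo "47+1 usable:    $single TB"
echo "8x(5+1) usable: $raid50 TB"
# The parity-strip width also drops from 48 blocks to 6, so a full-stripe
# write (the only way to avoid read-modify-write) needs 5 aligned data
# blocks instead of 47.
```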
Re: Raid over 48 disks
Peter Grandi wrote:

> So in the 47+1 case a loss of any two drives would lead to complete
> loss; in the 8x(5+1) case only a loss of two drives in the same RAID5
> will. It does not sound like a great improvement to me (especially
> considering the thoroughly inane practice of building arrays out of
> disks of the same make and model taken out of the same box).

Quality control just isn't that good that "same box" makes a big difference, assuming that you have an appropriate number of hot spares online. Note that I said big difference - is there some clustering of failures? Some, but damn little.

A few years ago I was working with multiple 6TB machines and 20+ 1TB machines, all using small, fast drives in RAID5E. I can't remember a case where a drive failed before rebuild was complete, and only one or two where there was a failure to degraded mode before the hot spare was replaced. That said, RAID5E typically can rebuild a lot faster than a typical hot spare as a unit drive, at least for any given impact on performance. This undoubtedly reduced our exposure time.

> There are also modest improvements in the RMW strip size and in the
> cost of a rebuild after a single drive loss. Probably the reduction in
> the RMW strip size is the best improvement. Anyhow, let's assume 0.5TB
> drives; with 47+1 we get a single 23.5TB filesystem, and with 8x(5+1)
> we get a 20TB filesystem. With current filesystem technology either
> size is worrying, for example as to time needed for an 'fsck'.

Given that someone is putting a typical filesystem full of small files on a big raid, I agree. But fsck with large files is pretty fast on a given filesystem (200GB files on a 6TB ext3, for instance), due to the small number of inodes in play. While the bitmap resolution is a factor, it's pretty linear; fsck with lots of files gets really slow. And let's face it, the objective of raid is to avoid doing that fsck in the first place ;-)

-- 
Bill Davidsen [EMAIL PROTECTED]
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
Re: Raid over 48 disks
Norman Elton [EMAIL PROTECTED] writes:

> We're investigating the possibility of running Linux (RHEL) on top of
> Sun's X4500 Thumper box: http://www.sun.com/servers/x64/x4500/

I think BNL's evaluation of Solaris/ZFS vs. Linux/MD on a thumper might be of interest:

http://hepix.caspur.it/storage/hep_pdf/2007/Spring/Petkus_HEPiX_Spring06.storageeval.pdf

-- 
Leif Nixon - Systems expert
National Supercomputer Centre - Linkoping University
Re: Raid over 48 disks
Mattias Wadenstein [EMAIL PROTECTED] writes:

> There are those that have run Linux MD RAID on thumpers before. I
> vaguely recall some driver issues (unrelated to MD) that made it less
> suitable than Solaris, but that might be fixed in recent kernels.

I think that was mainly an issue for people trying to squeeze Scientific Linux 3 onto their thumpers.

-- 
Leif Nixon - Systems expert
National Supercomputer Centre - Linkoping University
Re: Raid over 48 disks
Bill Davidsen wrote:

> >              16k read           64k write
> > chunk size   RAID 5   RAID 6   RAID 5   RAID 6
> > 128k            492      497      268      270
> > 256k            615      530      288      270
> > 512k            625      607      230      174
> > 1024k           650      620      170       75
>
> What is your stripe cache size?

I didn't fiddle with the default when I did these tests. Now (with 256k chunk size) I had

# cat stripe_cache_size
256

but increasing that to 1024 didn't show a noticeable improvement for reading. Still around 550MB/s.

Kind regards,
Thiemo
Re: Raid over 48 disks
Thiemo Nagel wrote:

> Bill Davidsen wrote:
> > [chunk-size benchmark table trimmed]
> >
> > What is your stripe cache size?
>
> I didn't fiddle with the default when I did these tests. Now (with
> 256k chunk size) I had
>
> # cat stripe_cache_size
> 256
>
> but increasing that to 1024 didn't show a noticeable improvement for
> reading. Still around 550MB/s.

You can use blockdev to raise the readahead, either on the drives or the array. That may make a difference; I use 4-8MB on the drives, more on the array depending on how I use it.

-- 
Bill Davidsen [EMAIL PROTECTED]
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
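A sketch of the readahead tuning suggested here: blockdev --setra takes a count of 512-byte sectors, so 4 MiB of readahead corresponds to 8192 sectors. The device name is an illustrative assumption, and the command is printed rather than executed.

```shell
#!/bin/sh
# Convert a MiB readahead target into the sector count blockdev expects.
ra_mib=4
sectors=$((ra_mib * 1024 * 1024 / 512))
# On a real system this would be run as root against the md device
# (or each member drive); /dev/md3 is a hypothetical name here.
echo "blockdev --setra $sectors /dev/md3   # $ra_mib MiB readahead"
```

You can confirm the current setting afterwards with `blockdev --getra /dev/md3`.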
Re: Raid over 48 disks
On Wed, 19 Dec 2007, Neil Brown wrote:

> On Tuesday December 18, [EMAIL PROTECTED] wrote:
> > We're investigating the possibility of running Linux (RHEL) on top
> > of Sun's X4500 Thumper box: http://www.sun.com/servers/x64/x4500/
> >
> > Basically, it's a server with 48 SATA hard drives. No hardware RAID.
> > It's designed for Sun's ZFS filesystem. So... we're curious how
> > Linux will handle such a beast. Has anyone run MD software RAID over
> > so many disks? Then piled LVM/ext3 on top of that? Any suggestions?

There are those that have run Linux MD RAID on thumpers before. I vaguely recall some driver issues (unrelated to MD) that made it less suitable than Solaris, but that might be fixed in recent kernels.

> Alternately, 8 6-drive RAID5s or 6 8-drive RAID6s, and use RAID0 to
> combine them together. This would give you adequate reliability and
> performance and still a large amount of storage space.

My personal suggestion would be 5 9-disk raid6s, one raid1 root mirror, and one hot spare. Then raid0, lvm, or separate filesystems on those 5 raidsets for data, depending on your needs. You get almost as much data space as with the 6 8-disk raid6s, and have a separate pair of disks for all the small updates (logging, metadata, etc), so this makes a lot of sense if most of the data is bulk file access.

/Mattias Wadenstein
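A quick sanity check of the disk accounting in this proposal, using the fact that RAID-6 spends two disks per set on parity (48 bays total on the Thumper):

```shell
#!/bin/sh
# 5 x 9-disk RAID-6 + 2-disk RAID-1 root + 1 hot spare should fill the box.
used=$((5 * 9 + 2 + 1))
data_5x9=$((5 * (9 - 2)))   # 9-disk RAID-6 -> 7 data disks per set
data_6x8=$((6 * (8 - 2)))   # the 6 x 8-disk RAID-6 alternative
echo "bays used: $used / 48"
echo "data disks: 5x9 layout -> $data_5x9, 6x8 layout -> $data_6x8"
```

So the 5x9 layout gives up one disk's worth of data space (35 vs. 36) in exchange for a dedicated root mirror and a spare, as the post says.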
Re: Raid over 48 disks
Guy Watkins wrote:

> } From: Brendan Conoboy
> } Sent: Tuesday, December 18, 2007 3:36 PM
> }
> } Norman Elton wrote:
> } > We're investigating the possibility of running Linux (RHEL) on top
> } > of Sun's X4500 Thumper box: http://www.sun.com/servers/x64/x4500/
> }
> } Neat - 6 8-port SATA controllers! It'll be worth checking to be sure
> } each controller has equal bandwidth. If some controllers are on
> } slower buses than others you may want to consider that and balance
> } the md device layout.
>
> Assuming the 6 controllers are equal, I would make 3 16-disk RAID6
> arrays using 2 disks from each controller. That way any 1 controller
> can fail and your system will still be running. 6 disks will be used
> for redundancy.
>
> Or 6 8-disk RAID6 arrays using 1 disk from each controller. That way
> any 2 controllers can fail and your system will still be running. 12
> disks will be used for redundancy. Might be too excessive!
>
> Combine them into a RAID0 array.
>
> Guy

Sounds interesting! Just out of interest, what's stopping you from using Solaris? Though, I'm curious how md will compare to ZFS performance-wise. There is some interesting configuration info / advice for Solaris here, esp. for the X4500:

http://www.solarisinternals.com/wiki/index.php/ZFS_Configuration_Guide

Russell
Re: Raid over 48 disks
Thiemo Nagel wrote:

> > > Performance of the raw device is fair:
> > >
> > > # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
> > > 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
> > >
> > > Somewhat less through ext3 (created with -E stride=64):
> > >
> > > # dd if=largetestfile of=/dev/zero bs=128k count=64k
> > > 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
> >
> > Quite slow? 10 disks (raptors) raid 5 on regular sata controllers:
> >
> > # dd if=/dev/md3 of=/dev/zero bs=128k count=64k
> > 8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
> >
> > # dd if=bigfile of=/dev/zero bs=128k count=64k
> > 3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s
>
> Interesting. Any ideas what could be the reason? How much do you get
> from a single drive? The Samsung HD501LJ that I'm using gives ~84MB/s
> when reading from the beginning of the disk. With RAID 5 I'm getting
> slightly better results (though I really wonder why, since naively I
> would expect identical read performance) but that does only account
> for a small part of the difference:
>
>              16k read           64k write
> chunk size   RAID 5   RAID 6   RAID 5   RAID 6
> 128k            492      497      268      270
> 256k            615      530      288      270
> 512k            625      607      230      174
> 1024k           650      620      170       75

What is your stripe cache size?

-- 
Bill Davidsen [EMAIL PROTECTED]
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
Re: Raid over 48 disks
On Wed, 19 Dec 2007, Bill Davidsen wrote:

> Thiemo Nagel wrote:
> > [dd benchmarks and chunk-size table trimmed]
>
> What is your stripe cache size?

# Set stripe_cache_size for RAID5.
echo "Setting stripe_cache_size to 16 MiB for /dev/md3"
echo 16384 > /sys/block/md3/md/stripe_cache_size

Justin.
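One caveat when copying the setting above: stripe_cache_size counts stripe-cache entries of one 4 KiB page per member device, so large values cost real memory. A rough estimate, assuming the 10-disk array mentioned earlier in the thread:

```shell
#!/bin/sh
# Approximate memory consumed by the md stripe cache:
#   stripe_cache_size * PAGE_SIZE * number of member devices
scs=16384       # the value echoed into stripe_cache_size above
disks=10        # assumption: Justin's 10-disk raptor array
page=4096       # 4 KiB pages on x86
mib=$((scs * page * disks / 1024 / 1024))
echo "stripe_cache_size=$scs on $disks disks ~ $mib MiB of stripe cache"
```

So a 16384-entry cache on a 10-disk array pins on the order of 640 MiB, which matters on a box with only 1GB of RAM like Thiemo's.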
Re: Raid over 48 disks
Mattias Wadenstein wrote:

> My personal suggestion would be 5 9-disk raid6s, one raid1 root
> mirror, and one hot spare. Then raid0, lvm, or separate filesystems on
> those 5 raidsets for data, depending on your needs.

Other than thinking raid-10 better than raid-1 for performance, I like it.

> You get almost as much data space as with the 6 8-disk raid6s, and
> have a separate pair of disks for all the small updates (logging,
> metadata, etc), so this makes a lot of sense if most of the data is
> bulk file access.

-- 
Bill Davidsen [EMAIL PROTECTED]
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
Raid over 48 disks
We're investigating the possibility of running Linux (RHEL) on top of Sun's X4500 Thumper box:

http://www.sun.com/servers/x64/x4500/

Basically, it's a server with 48 SATA hard drives. No hardware RAID. It's designed for Sun's ZFS filesystem.

So... we're curious how Linux will handle such a beast. Has anyone run MD software RAID over so many disks? Then piled LVM/ext3 on top of that? Any suggestions?

Are we crazy to think this is even possible?

Thanks!

Norman Elton
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Norman Elton wrote:

> We're investigating the possibility of running Linux (RHEL) on top of
> Sun's X4500 Thumper box: http://www.sun.com/servers/x64/x4500/
>
> Basically, it's a server with 48 SATA hard drives. No hardware RAID.
> It's designed for Sun's ZFS filesystem. So... we're curious how Linux
> will handle such a beast. Has anyone run MD software RAID over so many
> disks? Then piled LVM/ext3 on top of that? Any suggestions? Are we
> crazy to think this is even possible?

It sounds VERY fun and exciting if you ask me! The most disks I've used when testing SW RAID was 10, with various raid settings. With that many drives you'd want RAID6 or RAID10 for sure, in case more than one failed at the same time, and definitely XFS/JFS/EXT4(?), as EXT3 is capped to 8TB. I'd be curious what kind of aggregate bandwidth you can get off of it with that many drives.

Justin.
Re: Raid over 48 disks
On Tue Dec 18, 2007 at 12:29:27PM -0500, Norman Elton wrote:

> We're investigating the possibility of running Linux (RHEL) on top of
> Sun's X4500 Thumper box: http://www.sun.com/servers/x64/x4500/
>
> Basically, it's a server with 48 SATA hard drives. No hardware RAID.
> It's designed for Sun's ZFS filesystem. So... we're curious how Linux
> will handle such a beast. Has anyone run MD software RAID over so many
> disks? Then piled LVM/ext3 on top of that? Any suggestions? Are we
> crazy to think this is even possible?

The most I've done is 28 drives in RAID-10 (SCSI drives, with the array formatted as XFS). That keeps failing one drive, but I've not had time to give the drive a full test yet to confirm it's a drive issue. It's been running quite happily (under pretty heavy database load) on 27 disks for a couple of months now, though.

Cheers,
Robin
-- 
Robin Hill [EMAIL PROTECTED]
Re: Raid over 48 disks
Dear Norman,

> So... we're curious how Linux will handle such a beast. Has anyone run
> MD software RAID over so many disks? Then piled LVM/ext3 on top of
> that? Any suggestions? Are we crazy to think this is even possible?

I'm running 22x 500GB disks attached to RocketRaid2340 and NFORCE-MCP55 onboard controllers on an Athlon DC 5000+ with 1GB RAM:

9746150400 blocks super 1.2 level 6, 256k chunk, algorithm 2 [22/22]

Performance of the raw device is fair:

# dd if=/dev/md2 of=/dev/zero bs=128k count=64k
65536+0 records in
65536+0 records out
8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s

Somewhat less through ext3 (created with -E stride=64):

# dd if=largetestfile of=/dev/zero bs=128k count=64k
65536+0 records in
65536+0 records out
8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s

There were no problems up to now. (mkfs.ext3 wants -F to create a filesystem larger than 8TB. The hard maximum is 16TB, so you will need to create partitions if your drives are larger than 350GB...)

Kind regards,
Thiemo Nagel
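As an aside, the "-E stride=64" above follows directly from the array geometry: stride is the RAID chunk size divided by the filesystem block size, here 256 KiB chunks over 4 KiB ext3 blocks.

```shell
#!/bin/sh
# Derive the mke2fs stride value from the md chunk size.
chunk_kib=256   # chunk size of the 22-disk RAID-6 above
block_kib=4     # default ext3 block size
stride=$((chunk_kib / block_kib))
# The mkfs invocation is printed, not run; /dev/md2 matches the array above.
echo "mkfs.ext3 -E stride=$stride /dev/md2"
```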
Re: Raid over 48 disks
Thiemo --

I'm not familiar with RocketRaid. Is it handling the RAID for you, or are you using MD?

Thanks, all, for your feedback! I'm still surprised nobody has tried this on one of these Sun boxes yet. I've signed up for some demo hardware. I'll post what I find.

Norman

On Dec 18, 2007, at 2:34 PM, Thiemo Nagel wrote:

> I'm running 22x 500GB disks attached to RocketRaid2340 and
> NFORCE-MCP55 onboard controllers on an Athlon DC 5000+ with 1GB RAM:
>
> 9746150400 blocks super 1.2 level 6, 256k chunk, algorithm 2 [22/22]
>
> [benchmark details trimmed]
Re: Raid over 48 disks
Dear Norman,

> I'm not familiar with RocketRaid. Is it handling the RAID for you, or
> are you using MD?

I'm using md. The controller is in a mode that exports all drives individually.

Kind regards,
Thiemo
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Thiemo Nagel wrote:

> Performance of the raw device is fair:
>
> # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
> 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
>
> Somewhat less through ext3 (created with -E stride=64):
>
> # dd if=largetestfile of=/dev/zero bs=128k count=64k
> 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s

Quite slow? 10 disks (raptors) raid 5 on regular sata controllers:

# dd if=/dev/md3 of=/dev/zero bs=128k count=64k
65536+0 records in
65536+0 records out
8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s

# dd if=bigfile of=/dev/zero bs=128k count=64k
27773+1 records in
27773+1 records out
3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s
Re: Raid over 48 disks
Norman Elton wrote:

> We're investigating the possibility of running Linux (RHEL) on top of
> Sun's X4500 Thumper box:
>
> http://www.sun.com/servers/x64/x4500/

Neat- 6 8-port SATA controllers! It'll be worth checking to be sure each controller has equal bandwidth. If some controllers are on slower buses than others, you may want to consider that and balance the md device layout.

> So... we're curious how Linux will handle such a beast. Has anyone run
> MD software RAID over so many disks? Then piled LVM/ext3 on top of
> that? Any suggestions?

There used to be a maximum number of devices allowed in a single md device. Not sure if that is still the case. With this many drives you would be well advised to make smaller raid devices, then combine them into a larger md device (or via lvm, etc). Consider a write on a 48-device raid5: the system may need to read blocks from all those drives before a single write!

If it were my system, and all ports were equally well connected, I'd create three 16-drive RAID5s with 1 hot spare, then combine them via raid 0 or lvm. That's just my usage scenario, though (modest reliability, excellent read speed, modest write speed). If you put ext3 on it, remember to use the stride option when making the filesystem.

> Are we crazy to think this is even possible?

Crazy, possible, and fun!

-- Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]
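One way this layering could look with mdadm; a sketch only, with placeholder device names, reading the suggestion as three 16-disk sets of 15 array drives plus one hot spare each so that the three sets consume all 48 slots:

```shell
# Three 16-disk sets: a 15-drive RAID5 + 1 hot spare each, striped by RAID0.
# Device names are placeholders for the Thumper's 48 data drives.
mdadm --create /dev/md10 --level=5 --raid-devices=15 --spare-devices=1 \
      /dev/sd[b-q]                    # disks 1-16
mdadm --create /dev/md11 --level=5 --raid-devices=15 --spare-devices=1 \
      /dev/sd[r-z] /dev/sda[a-g]      # disks 17-32
mdadm --create /dev/md12 --level=5 --raid-devices=15 --spare-devices=1 \
      /dev/sda[h-w]                   # disks 33-48
# Stripe the three RAID5 arrays together:
mdadm --create /dev/md13 --level=0 --raid-devices=3 /dev/md10 /dev/md11 /dev/md12
```

(Combining via LVM instead of RAID0 would work the same way, using the three md devices as physical volumes.)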
Re: Raid over 48 disks
On Tuesday December 18, [EMAIL PROTECTED] wrote:

> We're investigating the possibility of running Linux (RHEL) on top of
> Sun's X4500 Thumper box:
>
> http://www.sun.com/servers/x64/x4500/
>
> Basically, it's a server with 48 SATA hard drives. No hardware RAID.
> It's designed for Sun's ZFS filesystem. So... we're curious how Linux
> will handle such a beast. Has anyone run MD software RAID over so many
> disks? Then piled LVM/ext3 on top of that? Any suggestions? Are we
> crazy to think this is even possible?

Certainly possible. The default metadata is limited to 28 devices, but with --metadata=1 you can easily use all 48 drives or more in the one array. I'm not sure you would want to, though.

If you just wanted an enormous scratch space and were happy to lose all your data on a drive failure, then you could make a raid0 across all the drives, which should work perfectly and give you lots of space. But that probably isn't what you want.

I wouldn't create a raid5 or raid6 on all 48 devices. RAID5 only survives a single device failure, and with that many devices, the chance of a second failure before you recover becomes appreciable. RAID6 would be much more reliable, but probably much slower. RAID6 always needs to read or write every block in a stripe (i.e. it always uses reconstruct-write to generate the P and Q blocks; it never does a read-modify-write like raid5 does). This means that every write touches every device, so you have less possibility for parallelism among your many drives. It might be instructive to try it out, though.

RAID10 would be a good option if you are happy with 24 drives' worth of space. I would probably choose a largish chunk size (256K) and use the 'offset' layout.

Alternately, make eight 6-drive RAID5s or six 8-drive RAID6s, and use RAID0 to combine them together. This would give you adequate reliability and performance and still a large amount of storage space.

Have fun!!!
NeilBrown
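Neil's RAID10 suggestion could be sketched like this (a hedged example; device names are placeholders, and --layout=o2 selects md's 'offset' layout with two copies):

```shell
# 48 drives, RAID10 'offset' layout, 256K chunk: 24 drives' worth of space.
mdadm --create /dev/md0 --level=10 --layout=o2 --chunk=256 \
      --raid-devices=48 /dev/sd[b-z] /dev/sda[a-w]
```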
Re: Raid over 48 disks
>> Performance of the raw device is fair:
>>
>> # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
>> 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
>>
>> Somewhat less through ext3 (created with -E stride=64):
>>
>> # dd if=largetestfile of=/dev/zero bs=128k count=64k
>> 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
>
> Quite slow? 10 disks (raptors) raid 5 on regular sata controllers:
>
> # dd if=/dev/md3 of=/dev/zero bs=128k count=64k
> 8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
>
> # dd if=bigfile of=/dev/zero bs=128k count=64k
> 3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s

Interesting. Any ideas what could be the reason? How much do you get from a single drive? The Samsung HD501LJ that I'm using gives ~84MB/s when reading from the beginning of the disk.

With RAID 5 I'm getting slightly better results (though I really wonder why, since naively I would expect identical read performance), but that only accounts for a small part of the difference:

              16k read           64k write
chunk size   RAID 5   RAID 6    RAID 5   RAID 6
   128k        492      497       268      270
   256k        615      530       288      270
   512k        625      607       230      174
  1024k        650      620       170       75

Kind regards,
Thiemo
Re: Raid over 48 disks
On 12/18/07, Thiemo Nagel [EMAIL PROTECTED] wrote:

> Interesting. Any ideas what could be the reason? How much do you get
> from a single drive? The Samsung HD501LJ that I'm using gives ~84MB/s
> when reading from the beginning of the disk. With RAID 5 I'm getting
> slightly better results (though I really wonder why, since naively I
> would expect identical read performance) but that does only account
> for a small part of the difference:
>
> [benchmark table snipped]

It strikes me that these numbers are meaningless without knowing whether that is actual data-to-disk or data-to-memcache-and-some-to-disk-too. Later versions of 'dd' offer 'conv=fdatasync', which is really handy (it calls fdatasync on the output file, syncing JUST the one file, right before close). Otherwise, oflag=direct will (try to) bypass the page/block cache.

I can get really impressive numbers, too (over 200MB/s on a single disk capable of 70MB/s) when I (mis)use dd without fdatasync, et al. The variation in reported performance can be really huge without understanding that you aren't actually testing the DISK I/O but *some* disk I/O and *some* memory caching.
-- Jon
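Jon's point is easy to see with a small experiment (file name and sizes are arbitrary; for real numbers, write well beyond the machine's RAM):

```shell
# The same write, three ways: cached, flushed, and direct.
dd if=/dev/zero of=ddtest bs=1M count=64                 # may report page-cache speed
dd if=/dev/zero of=ddtest bs=1M count=64 conv=fdatasync  # fdatasync before close
dd if=/dev/zero of=ddtest bs=1M count=64 oflag=direct || true  # O_DIRECT; unsupported on some filesystems
rm -f ddtest
```

The first invocation can report far more than the disk can sustain, because dd exits before the page cache is flushed; conv=fdatasync folds the flush time into the reported rate.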
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Thiemo Nagel wrote:

> Interesting. Any ideas what could be the reason? How much do you get
> from a single drive? The Samsung HD501LJ that I'm using gives ~84MB/s
> when reading from the beginning of the disk.

# dd if=/dev/sdc of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 13.8108 seconds, 77.7 MB/s

With more than 2x the drives I'd think you'd have faster speed; perhaps the controller is the problem? I am using ICH8R (but the RAID within Linux) and 2-port SATA cards, each of which has its own dedicated bandwidth via the PCI-e bus. I have also tried with 10 disks (on 3ware controllers exporting as JBOD etc., sw RAID5); I saw similar performance with read but not write.

Justin.
Re: Raid over 48 disks
> It strikes me that these numbers are meaningless without knowing if
> that is actual data-to-disk or data-to-memcache-and-some-to-disk-too.
> [...] The variation in reported performance can be really huge without
> understanding that you aren't actually testing the DISK I/O but *some*
> disk I/O and *some* memory caching.

I did these benchmarks with 32GB of data on a machine with 1GB of RAM; therefore the memory cache contribution should be small.

Kind regards,
Thiemo
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Jon Nelson wrote:

> It strikes me that these numbers are meaningless without knowing if
> that is actual data-to-disk or data-to-memcache-and-some-to-disk-too.
> [...] The variation in reported performance can be really huge without
> understanding that you aren't actually testing the DISK I/O but *some*
> disk I/O and *some* memory caching.
Ok -- how's this for caching: a dd over the entire RAID device:

$ /usr/bin/time dd if=/dev/zero of=file bs=1M
dd: writing `file': No space left on device
1070704+0 records in
1070703+0 records out
1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s
RE: Raid over 48 disks
} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-[EMAIL PROTECTED]] On Behalf
} Of Brendan Conoboy
} Sent: Tuesday, December 18, 2007 3:36 PM
} To: Norman Elton
} Cc: linux-raid@vger.kernel.org
} Subject: Re: Raid over 48 disks
}
} Norman Elton wrote:
}  We're investigating the possibility of running Linux (RHEL) on top of
}  Sun's X4500 Thumper box:
}
}  http://www.sun.com/servers/x64/x4500/
}
} Neat- 6 8 port SATA controllers! It'll be worth checking to be sure
} each controller has equal bandwidth. If some controllers are on slower
} buses than others you may want to consider that and balance the md
} device layout.

Assuming the 6 controllers are equal, I would make 3 16-disk RAID6 arrays using 2 disks from each controller. That way any 1 controller can fail and your system will still be running. 6 disks will be used for redundancy.

Or 6 8-disk RAID6 arrays using 1 disk from each controller. That way any 2 controllers can fail and your system will still be running. 12 disks will be used for redundancy. Might be too excessive!

Combine them into a RAID0 array.

Guy
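The arithmetic behind the two layouts above (assuming 1TB drives and no hot spares; RAID6 spends two disks per array on parity):

```shell
# Usable capacity for the two proposed layouts:
echo "3 x 16-disk RAID6: $((3 * (16 - 2))) TB usable, 6 parity disks"    # 42 TB
echo "6 x  8-disk RAID6: $((6 * (8 - 2))) TB usable, 12 parity disks"    # 36 TB
```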
RE: Raid over 48 disks
On Tue, 18 Dec 2007, Guy Watkins wrote:

> Assuming the 6 controllers are equal, I would make 3 16-disk RAID6
> arrays using 2 disks from each controller. That way any 1 controller
> can fail and your system will still be running. 6 disks will be used
> for redundancy.
>
> Or 6 8-disk RAID6 arrays using 1 disk from each controller. That way
> any 2 controllers can fail and your system will still be running. 12
> disks will be used for redundancy. Might be too excessive!
>
> Combine them into a RAID0 array.

I'd be curious what the maximum aggregate bandwidth would be with RAID 0 of 48 disks on that controller..
RE: Raid over 48 disks
On Tue, 18 Dec 2007, Justin Piszcz wrote:

> I'd be curious what the maximum aggregate bandwidth would be with
> RAID 0 of 48 disks on that controller..

A RAID 0 over all of the controllers rather, if possible..