Re: [PATCH] Use new sb type
David Greaves wrote:
> Jan Engelhardt wrote:
>>> Feel free to argue that the manpage is clear on this - but as we know, not everyone reads the manpages in depth...
>> That is indeed suboptimal (but I would not care since I know the implications of an SB at the front)
> Neil cares even less and probably doesn't even need mdadm - heck, he probably just echos the raw superblock into place via dd...
> http://xkcd.com/378/

I don't know why this makes me think of APL...

--
Bill Davidsen <[EMAIL PROTECTED]>
"Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck

To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Use new sb type
Jan Engelhardt wrote:
> On Feb 10 2008 12:27, David Greaves wrote:
> I do not see anything wrong with specifying the SB location as a metadata version. Why should location not be an element of the raid type? It's fine the way it is IMHO. (Just the default is not :)
>> There was quite a discussion about it. For me the main argument is that most people seeing superblock versions (even the manpage terminology is version and subversion) will correlate incremental versions with improvement. They will therefore see v1.2 as 'the latest and best'. Feel free to argue that the manpage is clear on this - but as we know, not everyone reads the manpages in depth...
> That is indeed suboptimal (but I would not care since I know the implications of an SB at the front); naming it "[EMAIL PROTECTED]" / "[EMAIL PROTECTED]" / "[EMAIL PROTECTED]" or so would address this.

We have already discussed names, and Neil has expressed satisfaction with my earlier suggestion. Since "@" is sort of a semi-special character to the shell, I suspect we are better off avoiding it.
Re: Any inexpensive hardware recommendations for PCI interface cards?
Steve Fairbairn wrote:
> Can anyone see any issues with what I'm trying to do?

No.

> Are there any known issues with IT8212 cards (they worked as straight disks on Linux fine)?

No idea; I don't have that card.

> Is anyone using an array with disks on PCI interface cards?

Works. I've mixed PATA, SATA, onboard, PCI, and firewire (lack of controllers is the mother of invention). As long as the device under the raid works, the raid should work.

> Is there an issue with mixing motherboard interfaces and PCI card based ones?

Not that I've found.

> Does anyone recommend any inexpensive (probably SATA-II) PCI interface cards?

Not I. Large drives have cured me of FrankenRAID setups recently, other than building little arrays out of USB devices for backup.
Re: Deleting mdadm RAID arrays
Marcin Krol wrote:
> Thursday 07 February 2008 22:35:45 Bill Davidsen wrote:
>>> As you may remember, I have configured udev to associate /dev/d_* devices with serial numbers (to keep them from changing depending on boot module loading sequence).
>> Why do you care?
> Because /dev/sd* devices get swapped randomly depending on boot module insertion sequence, as I explained earlier.

So there's no functional problem, just cosmetic?

>> If you are using UUID for all the arrays and mounts, does this buy you anything?
> This is exactly what is not clear to me: what is it that identifies a drive/partition as part of the array? The /dev/sd name? The UUID in the superblock? /dev/d_n? If it's the UUID, I should be safe regardless of /dev/sd* designation? Yes or no?

Yes, absolutely.

>> And more to the point, the first time a drive fails and you replace it, will it cause you a problem? Require maintaining the serial to name data manually?
> That's not the problem. I just want my array to be intact.

>> I miss the benefit of forcing this instead of just building the information at boot time and dropping it in a file.
> I would prefer that, too - if it worked. I was getting both arrays messed up randomly on boot. "Messed up" in the sense of arrays being composed of different /dev/sd devices.

Different devices? Or just different names for the same devices? I assume just the names change, and I still don't see why you care... subtle beyond my understanding.

>>> And I made *damn* sure I zeroed all the superblocks before reassembling the arrays. Yet it still shows the old partitions on those arrays!
>> As I noted before, you said you had these on whole devices before; did you zero the superblocks on the whole devices or the partitions? From what I read, it was the partitions.
> I tried it both ways actually (rebuilt arrays a few times, just udev didn't want to associate WD-serialnumber-part1 as /dev/d_1p1 as it was told, it still claimed it was /dev/d_1).

I'm not talking about building the array, but zeroing the superblocks. Did you use the partition name, /dev/sdb1, when you ran mdadm with --zero-superblock, or did you zero the whole device, /dev/sdb, which is what you were using when you first built the array with whole devices? If you didn't zero the superblock for the whole device, it may explain why a superblock is still found.
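The check Bill is describing can be sketched as below. This is only a sketch: the device names /dev/md0 and /dev/sdb are examples, it must run as root, and the array should be stopped first.

```shell
# Stop the array so its members can be opened for write
mdadm --stop /dev/md0

# See whether mdadm finds a superblock on the whole device vs. the partition;
# an array once built on /dev/sdb leaves its superblock there, not on /dev/sdb1
mdadm --examine /dev/sdb  || echo "no superblock on whole device"
mdadm --examine /dev/sdb1 || echo "no superblock on partition"

# Zero both locations to be sure nothing stale remains
mdadm --zero-superblock /dev/sdb
mdadm --zero-superblock /dev/sdb1
```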
Re: using update-initramfs: how to get new mdadm.conf into the /boot? Or is it XFS?
Bill Davidsen wrote:
> Moshe Yudkowsky wrote:
>> maximilian attems wrote:
>>> error 15 is a *grub* error. grub is known for its dislike of xfs, so with this whole setup use ext3, rerun grub-install, and you should be fine.
>> I should mention that something *did* change. When attempting to use XFS, grub would give me a note about "18 partitions used" (I forget the exact language). This was different than I'd remembered; when I switched back to using reiserfs, grub reports using 19 partitions. So there's something definitely interesting about XFS and booting. As an additional note, if I use the grub boot-time commands to edit root to read, e.g., root=/dev/sda2 or root=/dev/sdb2, I get the same Error 15 error message. It may be that grub is complaining about grub and reiserfs, but I suspect that it has a true complaint about the file system and what's on the partitions.
> I think you have two choices: convert /boot to ext2 and be sure you are going down the best-tested code path, or fight and debug, read code, learn grub source, play with the init parts of the boot sequence, and then convert /boot to ext2 anyway. No matter how "better" something else might be, /boot has nothing I use except at boot; I don't need features or performance, I just want it to work. Unless you are so frustrated you have entered "I am going to make this *work* if it takes forever" mode, I would try the easy solution first. Just my take on it.

Or you can get lucky and someone will have seen this before and hand you a solution... ;-)
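The "easy solution" of converting /boot to ext2 might look roughly like the following. This is a sketch under stated assumptions, not a recipe: it assumes /boot is its own partition on /dev/md0 and that grub goes on /dev/sda, and reformatting a boot path is destructive, so have a rescue CD handy.

```shell
# Keep a copy of /boot on the root filesystem first
cp -a /boot /boot.bak

# Reformat the /boot device as ext2 and restore the contents
umount /boot
mkfs.ext2 /dev/md0
mount /dev/md0 /boot
cp -a /boot.bak/. /boot/

# Reinstall the boot loader so it picks up the new filesystem
grub-install /dev/sda
```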
Re: Deleting mdadm RAID arrays
Marcin Krol wrote:
> Thursday 07 February 2008 03:36:31 Neil Brown wrote:
>>>   8     0  390711384 sda
>>>   8     1  390708801 sda1
>>>   8    16  390711384 sdb
>>>   8    17  390708801 sdb1
>>>   8    32  390711384 sdc
>>>   8    33  390708801 sdc1
>>>   8    48  390710327 sdd
>>>   8    49  390708801 sdd1
>>>   8    64  390711384 sde
>>>   8    65  390708801 sde1
>>>   8    80  390711384 sdf
>>>   8    81  390708801 sdf1
>>>   3    64   78150744 hdb
>>>   3    65    1951866 hdb1
>>>   3    66    7815622 hdb2
>>>   3    67    4883760 hdb3
>>>   3    68          1 hdb4
>>>   3    69     979933 hdb5
>>>   3    70     979933 hdb6
>>>   3    71   61536951 hdb7
>>>   9     1  781417472 md1
>>>   9     0  781417472 md0
>> So all the expected partitions are known to the kernel - good.
> It's not good really!! I can't trust /dev/sd* devices - they get swapped randomly depending on the sequence of module loading!! I have two drivers, ahci for the onboard SATA controllers and sata_sil for the additional controller. Sometimes the system boots ahci first and sata_sil later, sometimes in the reverse sequence. Then sda becomes sdc, sdb becomes sdd, etc. It is exactly the problem that I cannot rely on the kernel's information about which physical drive is which logical drive!
>> Then "mdadm /dev/md0 -f /dev/d_1" will fail d_1, abort the recovery, and release d_1. Then "mdadm --zero-superblock /dev/d_1" should work.
> Thanks, though I managed to fail the drives, remove them, zero superblocks and reassemble the arrays anyway. The problem I have now is that mdadm seems to be of 'two minds' when it comes to where it gets the info on which disk is what part of the array. As you may remember, I have configured udev to associate /dev/d_* devices with serial numbers (to keep them from changing depending on boot module loading sequence).

Why do you care? If you are using UUID for all the arrays and mounts, does this buy you anything? And more to the point, the first time a drive fails and you replace it, will it cause you a problem? Require maintaining the serial to name data manually? I miss the benefit of forcing this instead of just building the information at boot time and dropping it in a file.

> Now, when I swap two (random) drives in order to test whether it keeps device names associated with serial numbers, I get the following effect:
>
> 1. mdadm -Q --detail /dev/md* gives correct results before *and* after the swapping:
>
> % mdadm -Q --detail /dev/md0
> /dev/md0:
> [...]
>    Number   Major   Minor   RaidDevice State
>       0       8        1        0      active sync   /dev/d_1
>       1       8       17        1      active sync   /dev/d_2
>       2       8       81        2      active sync   /dev/d_3
>
> % mdadm -Q --detail /dev/md1
> /dev/md1:
> [...]
>    Number   Major   Minor   RaidDevice State
>       0       8       49        0      active sync   /dev/d_4
>       1       8       65        1      active sync   /dev/d_5
>       2       8       33        2      active sync   /dev/d_6
>
> 2. However, cat /proc/mdstat shows a different layout of the arrays! BEFORE the swap:
>
> % cat mdstat-16_51
> Personalities : [raid6] [raid5] [raid4]
> md1 : active raid5 sdb1[2] sdf1[0] sda1[1]
>       781417472 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
> md0 : active raid5 sde1[2] sdc1[0] sdd1[1]
>       781417472 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
> unused devices:
>
> AFTER the swap:
>
> % cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md1 : active(auto-read-only) raid5 sdd1[0] sdc1[2] sde1[1]
>       781417472 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
> md0 : active(auto-read-only) raid5 sda1[0] sdf1[2] sdb1[1]
>       781417472 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
> unused devices:
>
> I have no idea now if the array is functioning (it keeps the drives according to /dev/d_* devices and superblock info is unimportant) or if my arrays fell apart because of that swapping. And I made *damn* sure I zeroed all the superblocks before reassembling the arrays. Yet it still shows the old partitions on those arrays!

As I noted before, you said you had these on whole devices before; did you zero the superblocks on the whole devices or the partitions? From what I read, it was the partitions.
Re: using update-initramfs: how to get new mdadm.conf into the /boot? Or is it XFS?
Moshe Yudkowsky wrote:
> maximilian attems wrote:
>> error 15 is a *grub* error. grub is known for its dislike of xfs, so with this whole setup use ext3, rerun grub-install, and you should be fine.
> I should mention that something *did* change. When attempting to use XFS, grub would give me a note about "18 partitions used" (I forget the exact language). This was different than I'd remembered; when I switched back to using reiserfs, grub reports using 19 partitions. So there's something definitely interesting about XFS and booting. As an additional note, if I use the grub boot-time commands to edit root to read, e.g., root=/dev/sda2 or root=/dev/sdb2, I get the same Error 15 error message. It may be that grub is complaining about grub and reiserfs, but I suspect that it has a true complaint about the file system and what's on the partitions.

I think you have two choices: convert /boot to ext2 and be sure you are going down the best-tested code path, or fight and debug, read code, learn grub source, play with the init parts of the boot sequence, and then convert /boot to ext2 anyway. No matter how "better" something else might be, /boot has nothing I use except at boot; I don't need features or performance, I just want it to work. Unless you are so frustrated you have entered "I am going to make this *work* if it takes forever" mode, I would try the easy solution first. Just my take on it.
Re: mdadm 2.6.4 : How i can check out current status of reshaping ?
Andreas-Sokov wrote:
> Hello, Neil.
>> Possible you have bad memory, or a bad CPU, or you are overclocking the CPU, or it is getting hot, or something.
> As it seems to me, all my problems started after I updated mdadm. This server worked normally (though not as soft-raid) for more than 2-3 years. For the last 6 months it has worked as soft-raid. All was normal; I even successfully added a 4th hdd into the raid5 (when it started there were 3 hdds), and that reshape passed fine. Yesterday I ran memtest86 on this server and 10 passes completed WITHOUT any errors. The temperature of the server is about 25 degrees Celsius. No overclocking, everything set to default.

What did you find when you loaded the module with gdb as Neil suggested? If the code in the module doesn't match the code in memory you have a hardware error. memtest86 is a useful tool, but it is not a definitive test, because it doesn't use all CPUs and do i/o at the same time to load the memory bus.

> Really I do not know what to do, because we need to grow our storage and we cannot. Unfortunately, at this moment mdadm does not help us with this, though we very much want it to.

I would pull out half my memory and retest. If it still fails I would swap to the other half of the memory. If that didn't show a change I would check that the code in the module is what Neil showed in his last message (I assume you already have), and then reseat all of the cables, etc. I agree with Neil:

> But you clearly have a hardware error.
> NeilBrown
Re: recommendations for stripe/chunk size
Wolfgang Denk wrote:
> In message <[EMAIL PROTECTED]> you wrote:
>>> I actually think the kernel should operate with block sizes like this and not with 4 kiB blocks. It is the readahead and the elevator algorithms that save us from randomly reading 4 kB at a time.
>> Exactly, and nothing saves you from a R-A-RW cycle if the write is a partial chunk.
> Indeed kernel page size is an important factor in such optimizations. But you have to keep in mind that this is mostly efficient for (very) large strictly sequential I/O operations only - actual file system traffic may be *very* different.

That was actually what I meant by page size, that of the file system rather than the memory, i.e. the "block size" typically used for writes. Or multiples thereof, obviously.

> We implemented the option to select kernel page sizes of 4, 16, 64 and 256 kB for some PowerPC systems (440SPe, to be precise). A nice graphic of the effect can be found here:
> https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

I started that online and pulled a download to print; very neat stuff. Thanks for the link.

> Best regards, Wolfgang Denk
Re: raid10 on three discs - few questions.
Jon Nelson wrote:
> On Feb 6, 2008 12:43 PM, Bill Davidsen <[EMAIL PROTECTED]> wrote:
>> Can you create a raid10 with one drive "missing" and add it later? I know, I should try it when I get a machine free... but I'm being lazy today.
> Yes you can. With 3 drives, however, performance will be awful (at least with layout far, 2 copies).

Well, the question didn't include being fast. ;-) But if he really wants to create the array now and be able to add to it later, it might still be useful, particularly if "later" is a small time like "when my other drive ships." Thanks for the input; I thought that was possible, but reading code isn't the same as testing.

> IMO raid10,f2 is a great balance of speed and redundancy. It's faster than raid5 for reading, about the same for writing. It's even potentially faster than raid0 for reading, actually. With 3 disks one should be able to get 3.0 times the speed of one disk, or slightly more, and each stripe involves only *one* disk instead of 2 as it does with raid5.

I have used raid10 swap on 3 or more drives fairly often. Other than the Fedora rescue CD not using the space until I start it manually, I find it really fast, and helpful for huge image work.
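The three-disk raid10,f2 being discussed can be created like this. A sketch only - the partition names are examples, and it must run as root on empty devices.

```shell
# Three-device raid10 with the "far 2" layout: every block is stored twice,
# on different devices, so usable capacity is 1.5x one device (3 devices / 2 copies)
mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=3 \
      /dev/sda1 /dev/sdb1 /dev/sdc1

# Confirm the layout and the resulting array size
mdadm --detail /dev/md0 | grep -E 'Layout|Array Size'
```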
Re: recommendations for stripe/chunk size
Keld Jørn Simonsen wrote:
> Hi. I am looking at revising our howto. I see a number of places where a chunk size of 32 kiB is recommended, and even recommendations on maybe using sizes of 4 kiB.

Depending on the raid level, a write smaller than the chunk size causes the chunk to be read, altered, and rewritten, vs. just written if the write is a multiple of chunk size. Many filesystems by default use a 4k page size and writes. I believe this is the reasoning behind the suggestion of small chunk sizes. Sequential vs. random and raid level are important here; there's no one size that works best in all cases.

> My own take on that is that this really hurts performance. Normal disks have a rotation speed of between 5400 (laptop), 7200 (ide/sata) and 10000 (SCSI) rounds per minute, giving a spinning time for one round of 6 to 12 ms, and an average latency of half this, that is 3 to 6 ms. Then you need to add head movement, which is something like 2 to 20 ms - in total an average seek time of 5 to 26 ms, averaging around 13-17 ms. Having a write that is not some multiple of chunk size would seem to require a read-alter-wait_for_disk_rotation-write,

and for large sustained sequential i/o using multiple drives helps transfer. For small random i/o small chunks are good; I find little benefit to chunks over 256 or maybe 1024k.

> In about 15 ms you can read on current SATA-II (300 MB/s) or ATA/133 something like between 600 to 1200 kB, at actual transfer rates of 80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the buck, and transfer some data, you should have something like 256/512 kiB chunks. With a transfer rate of 50 MB/s and chunk sizes of 256 kiB, giving a time of about 20 ms per transaction, you should be able with random reads to transfer 12 MB/s - my actual figure is about 30 MB/s, which is possibly because of the elevator effect of the file system driver. With a size of 4 kB per chunk you should have a time of 15 ms per transaction, or 66 transactions per second, or a transfer rate of about 250 kB/s. So 256 kB vs 4 kB speeds up the transfer by a factor of 50.

If you actually see anything like this, your write caching and readahead aren't doing what they should!

> I actually think the kernel should operate with block sizes like this and not with 4 kiB blocks. It is the readahead and the elevator algorithms that save us from randomly reading 4 kB at a time.

Exactly, and nothing saves you from a R-A-RW cycle if the write is a partial chunk.

> I also see that there are some memory constraints on this. Having maybe 1000 processes reading, as for my mirror service, 256 kiB buffers would be acceptable, occupying 256 MB RAM. That is reasonable, and I could even tolerate 512 MB RAM used. But going to 1 MiB buffers would be overdoing it for my configuration. What would be the recommended chunk size for today's equipment?

I think usage is more important than hardware. My opinion only.

> Best regards, Keld
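The arithmetic behind Keld's estimates can be checked with a quick one-liner. The numbers (15 ms seek+latency, 50 MB/s media rate) are the post's examples, not measurements:

```shell
# Service-time model: time per random read = seek + chunk/rate,
# throughput = chunk / time. Prints the 4, 32 and 256 kB cases.
awk 'BEGIN {
  seek_ms = 15; rate = 50                       # 50 MB/s == 50 kB per ms
  for (chunk = 4; chunk <= 256; chunk *= 8) {
    t = seek_ms + chunk / rate                  # ms per random transaction
    printf "%4d kB chunk: %5.2f ms/op, %8.0f kB/s\n", chunk, t, chunk * 1000 / t
  }
}'
```

At 256 kB this gives about 20 ms per transaction and roughly 12.7 MB/s, matching the 12 MB/s figure in the post; at 4 kB it gives roughly 265 kB/s, close to the post's 250 kB/s.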
Re: Deleting mdadm RAID arrays
Neil Brown wrote:
> On Tuesday February 5, [EMAIL PROTECTED] wrote:
>> % mdadm --zero-superblock /dev/sdb1
>> mdadm: Couldn't open /dev/sdb1 for write - not zeroing
> That's weird. Why can't it open it?

I suspect that (a) he's not root and has read-only access to the device (I have group read for certain groups, too). And since he had the arrays on raw devices, shouldn't he zero the superblocks using the whole device as well? Depending on what type of superblock it might not be found otherwise. It sure can't hurt to zero all the superblocks of the whole devices and then check the partitions to see if they are present, then create the array again with --force and be really sure the superblock is present and sane.

> Maybe you aren't running as root (the '%' prompt is suspicious). Maybe the kernel has been told to forget about the partitions of /dev/sdb. mdadm will sometimes tell it to do that, but only if you try to assemble arrays out of whole components. If that is the problem, then "blockdev --rereadpt /dev/sdb" will fix it.
> NeilBrown
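Neil's recovery path can be sketched as follows. Device names are examples and this must run as root:

```shell
# If the kernel has been told to forget /dev/sdb's partitions,
# re-read the partition table
blockdev --rereadpt /dev/sdb

# The partitions should reappear here
grep sdb /proc/partitions

# With the partition device back, zeroing its superblock should
# now be able to open it for write
mdadm --zero-superblock /dev/sdb1
```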
Re: raid1 or raid10 for /boot
Keld Jørn Simonsen wrote:
> I understand that lilo and grub can only boot partitions that look like a normal single-drive partition. And then I understand that a plain raid10 has a layout which is equivalent to raid1. Can such a raid10 partition be used with grub or lilo for booting? And would there be any advantages in this, for example better disk utilization in the raid10 driver compared with raid1?

I don't know about you, but my /boot gets zero use between boots; efficiency and performance improvements strike me as a distinction without a difference, while adding complexity without benefit is always a bad idea. I suggest that you avoid having a "learning experience" and stick with raid1.
Re: raid10 on three discs - few questions.
Neil Brown wrote:
> On Sunday February 3, [EMAIL PROTECTED] wrote:
>> Hi, maybe I'll buy three HDDs to put a raid10 on them, and get the total capacity of 1.5 of a disc. 'man 4 md' indicates that this is possible and should work. I'm wondering - how is a single disc failure handled in such a configuration?
>> 1. Does the array continue to work in a degraded state?
> Yes.
>> 2. After the failure, can I disconnect the faulty drive, connect a new one, start the computer, add the disc to the array, and it will sync automatically?
> Yes.
>> Question seems a bit obvious, but the configuration is, at least for me, a bit unusual. This is why I'm asking. Anybody here tested such a configuration, has some experience?
>> 3. Another thing - would raid10,far=2 work when three drives are used? Would it increase the read performance?
> Yes.
>> 4. Would it be possible to later '--grow' the array to use 4 discs in raid10? Even with far=2?
> No. Well, if by "later" you mean "in five years", then maybe. But the code doesn't currently exist.

That's a reason to avoid raid10 for certain applications, then, and go with a more manual 1+0 or similar. Can you create a raid10 with one drive "missing" and add it later? I know, I should try it when I get a machine free... but I'm being lazy today.
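The experiment Bill describes would look something like this - a sketch with example device names, run as root. mdadm accepts the keyword "missing" in place of a member device at create time:

```shell
# Create a 3-device raid10 with one member absent; the array starts degraded
mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=3 \
      /dev/sda1 /dev/sdb1 missing

# Later, when the third drive arrives, add it and md resyncs onto it
mdadm --add /dev/md0 /dev/sdc1

# Watch the recovery progress
cat /proc/mdstat
```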
Re: raid1 and raid 10 always writes all data to all disks?
Keld Jørn Simonsen wrote:
> On Sun, Feb 03, 2008 at 10:56:01AM -0500, Bill Davidsen wrote:
>> Keld Jørn Simonsen wrote:
>>> I found a sentence in the HOWTO: "raid1 and raid 10 always writes all data to all disks". I think this is wrong for raid10. E.g. a raid10,f2 of 4 disks only writes to two of the disks - not all 4 disks. Is that true?
>> I suspect that really should have read "all mirror copies," in the raid10 case.
> OK, I changed the text to:
> "raid1 always writes all data to all disks."

Just to be really pedantic, you might say "devices" instead of disks, since many or most arrays are on partitions. Otherwise I like this, it's much clearer.

> "raid10 always writes all data to the number of copies that the raid holds. For example on a raid10,f2 or raid10,o2 of 6 disks, the data will only be written 2 times."
> Best regards, Keld
Re: draft howto on making raids for surviving a disk crash
Keld Jørn Simonsen wrote:
> On Sun, Feb 03, 2008 at 10:53:51AM -0500, Bill Davidsen wrote:
>> Keld Jørn Simonsen wrote:
>>> This is intended for the linux raid howto. Please give comments. It is not fully ready. /keld
>>> Howto prepare for a failing disk
>>> 6. /etc/mdadm.conf
>>> Something here on /etc/mdadm.conf. What would be safe, allowing a system to boot even if a disk has crashed?
>> Recommend PARTITIONS be used
> Thanks Bill for your suggestions, which I have incorporated in the text. However, I do not understand what to do with the remark above. Please explain.

The mdadm.conf file should contain the "DEVICE partitions" statement to identify all possible partitions regardless of name changes. See "man mdadm.conf" for more discussion. This protects against udev doing something innovative in device naming.
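A minimal /etc/mdadm.conf along these lines might look like the fragment below. The UUID shown is a placeholder, not a real value - the real lines come from `mdadm --detail --scan`:

```
# Scan every partition the kernel knows about, regardless of device names
DEVICE partitions

# Identify arrays by UUID, not by member names
# (placeholder UUID - substitute the output of `mdadm --detail --scan`)
ARRAY /dev/md0 UUID=00000000:00000000:00000000:00000000
```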
Re: raid1 and raid 10 always writes all data to all disks?
Keld Jørn Simonsen wrote:
> I found a sentence in the HOWTO: "raid1 and raid 10 always writes all data to all disks". I think this is wrong for raid10. E.g. a raid10,f2 of 4 disks only writes to two of the disks - not all 4 disks. Is that true?

I suspect that really should have read "all mirror copies," in the raid10 case.
Re: draft howto on making raids for surviving a disk crash
> ...s for the system, or vital jobs on the system. You can prevent the failing of the processes by having the swap partitions on a raid. The swap area needed is normally relatively small compared to the overall disk space available, so we recommend the faster raid types over the more space economic. The raid10,f2 type seems to be the fastest here; other relevant raid types could be raid10,o2 or raid1. Given that you have created a raid array, you can just make the swap partition directly on it:
>
> mdadm --create /dev/md2 --chunk=256 -R -l 10 -n 2 -p f2 /dev/sda3 /dev/sdb3
> sfdisk -c /dev/md 2 82
> mkswap /dev/md2

WARNING: some "recovery" CDs will not use raid10 as swap. This may be a problem on small memory systems, and the swap may need to be started and enabled manually.

> Maybe something on /var and /tmp could go here.
>
> 5. The rest of the file systems.
>
> Other file systems can also be protected against one failing disk. Which technique to recommend depends on your purpose with the disk space. You may mix the different raid types if you have different types of use on the same server, e.g. a database and serving of large files from the same server. (This is one of the advantages of software raid over hardware raid: you may have different types of raids on a disk with a software raid, where a hardware raid may only take one type for the whole disk.)
>
> Is disk capacity the main priority, and you have more than 2 drives, then raid5 is recommended. Raid5 only uses 1 drive for securing the data, while raid1 and raid10 use at least half the capacity. For example, with 4 drives raid5 provides 75% of the total disk space as usable, while raid1 and raid10 at most (dependent on the number of copies) give 50% usability of the disk space. This becomes even better for raid5 with more disks; with 10 disks you only use 10% for security.
>
> Is speed your main priority, then raid10,f2, raid10,o2 or raid1 would give you the most speed during normal operation. This even works if you only have 2 drives. Is speed with a failed disk a concern, then raid10,o2 could be the choice, as raid10,f2 is somewhat slower in operation when a disk has failed. Examples:
>
> mdadm --create /dev/md3 --chunk=256 -R -l 10 -n 2 -p f2 /dev/sda5 /dev/sdb5
> mdadm --create /dev/md3 --chunk=256 -R -l 10 -n 2 -p o2 /dev/sd[ab]5
> mdadm --create /dev/md3 --chunk=256 -R -l 5 -n 4 /dev/sd[abcd]5
>
> 6. /etc/mdadm.conf
>
> Something here on /etc/mdadm.conf. What would be safe, allowing a system to boot even if a disk has crashed?

Recommend PARTITIONS be used

> 7. Recommendation for the setup of larger servers.
>
> Given a larger server setup, with more disks, it is possible to survive more than one disk crash. The raid6 array type can be used to be able to survive 2 disk crashes, at the expense of the space of 2 disks. The /boot, root and swap partitions can be set up with more disks, e.g. a /boot partition made up from a raid1 of 3 disks, and root and swap partitions made up from raid10,f3 arrays. Given that raid6 cannot survive more than the chashes of 2 disks,

TYPO: s/chashes/crashes/ and "failures" would be better

> the system disks need not be prepared for more than 2 craches either,

TYPO: s/craches/crashes/ or "disk failures"

> and you can use the rest of the disk IO capacity to speed up the system.
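The raid6 setup suggested for larger servers could be sketched like this, following the style of the draft's other examples. The partition numbers and device names are assumptions:

```shell
# raid6 over six devices: survives any two disk failures,
# at the cost of two devices' worth of space (usable capacity = 4 of 6)
mdadm --create /dev/md4 --chunk=256 -R -l 6 -n 6 /dev/sd[abcdef]6
```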
Re: RAID 1 and grub
Richard Scobie wrote: A followup for the archives: I found this document very useful: http://lists.us.dell.com/pipermail/linux-poweredge/2003-July/008898.html After modifying my grub.conf to refer to (hd0,0), reinstalling grub on hdc with: grub> device (hd0) /dev/hdc grub> root (hd0,0) grub> setup (hd0) and rebooting with the bios set to boot off hdc, everything burst back into life. I shall now be checking all my Fedora/Centos RAID1 installs for grub installed on both drives. Have you actually tested this by removing the first hd and booting? Depending on the BIOS, I believe that the fallback drive will be called hdc by the BIOS but will be hdd in the system. That was with RHEL3, but worth testing. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
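For the archives, a sketch of the whole procedure: installing grub on both halves of the mirror so either disk can boot. This assumes /dev/hda and /dev/hdc are the RAID1 members and /boot lives on their first partitions; adjust for your own layout.

```shell
# Map (hd0) to each member disk in turn and run setup, so the MBR
# on every drive knows how to find /boot on its own disk.
grub --batch <<'EOF'
device (hd0) /dev/hda
root (hd0,0)
setup (hd0)
device (hd0) /dev/hdc
root (hd0,0)
setup (hd0)
quit
EOF
```

With this, whichever disk the BIOS falls back to is mapped to (hd0) at boot time and can find its own copy of /boot.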
Re: raid problem: after every reboot /dev/sdb1 is removed?
Berni wrote: Hi I created the raid arrays during install with the text-installer CD. So first the raid array was created and then the system was installed on it. I don't have an extra /boot partition; it's on the root (/) partition, and the root is the md0 in the raid. Every partition for ubuntu (also swap) is in the raid. What exactly does rerunning grub mean? (to put both hdd into the mbr)? I can't find "mkinitrd" in ubuntu. I did an update-initramfs but it didn't help. I think you need some ubuntu guru to help; I always create a small raid1 for /boot and then use other arrays for whatever the system is doing. I don't know if ubuntu uses mkinitrd or what, but it clearly didn't get it right without a little help from you. thanks How about some input, ubuntu users (or Debian, isn't ubuntu really Debian?). On Sat, 02 Feb 2008 14:47:50 -0500 Bill Davidsen <[EMAIL PROTECTED]> wrote: Berni wrote: Hi! I have the following problem with my softraid (raid 1). I'm running Ubuntu 7.10 64bit with kernel 2.6.22-14-generic. After every reboot my first boot partition in md0 is not in sync. One of the disks (the sdb1) is removed. After a resync every partition is in sync. But after a reboot the state is "removed". The disks are new, both seagate 250gb with exactly the same partition table. Did you create the raid arrays and then install on them? Or add them after the fact? I have seen this type of problem when the initrd doesn't start the array before pivotroot, usually because the raid capabilities aren't in the boot image. In that case rerunning grub and mkinitrd may help. I run raid on Redhat distributions, and some Slackware, so I can't speak for Ubuntu from great experience, but that's what it sounds like. When you boot, is /boot mounted on a degraded array or on the raw partition? -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." 
Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
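For the archives, the manual recovery being discussed would look something like this on Ubuntu (a sketch; the device names are taken from Berni's /proc/mdstat, and update-initramfs is the Ubuntu counterpart of mkinitrd):

```shell
# Put the dropped mirror half back; md resyncs it automatically.
mdadm /dev/md0 --add /dev/sdb1
# Watch until md0 shows [UU] again.
cat /proc/mdstat
# Rebuild the initramfs so the raid modules and array configuration
# are available before the root filesystem is mounted.
update-initramfs -u -k all
```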
Re: In this partition scheme, grub does not find md information?
Bill Davidsen wrote: Moshe Yudkowsky wrote: Michael Tokarev wrote: To return to that performance question, since I have to create at least 2 md drives using different partitions, I wonder if it's smarter to create multiple md drives for better performance. /dev/sd[abcd]1 -- RAID1, the /boot, /dev, /bin, /sbin /dev/sd[abcd]2 -- RAID5, most of the rest of the file system /dev/sd[abcd]3 -- RAID10 o2, a drive that does a lot of downloading (writes) I think the speed of downloads is so far below the capacity of an array that you won't notice, and hopefully you will use things you download more than once, so you still get more reads than writes. For typical filesystem usage, raid5 works well for both reads and (cached, delayed) writes. It's workloads like databases where raid5 performs badly. Ah, very interesting. Is this true even for (dare I say it?) bittorrent downloads? What do you have for bandwidth? Probably not more than a T3 (145Mbit) which will max out at ~15MB/s, far below the write performance of a single drive, much less an array (even raid5). It has been pointed out that I have a double typo there, I meant OC3 not T3, and 155Mbit. Still, the most someone is likely to have, even in a large company. Still not a large chance of being faster than the disk in raid-10 mode. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
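The arithmetic behind those numbers is just bits to bytes (a sketch; the helper name is mine, not from any tool): divide the line rate by 8 for the best case, and expect somewhat less once protocol overhead is counted.

```shell
# Line rate in Mbit/s -> best-case payload in MB/s (integer math).
mbit_to_MBps() { echo $(( $1 / 8 )); }

mbit_to_MBps 155   # OC3: about 19 MB/s before overhead
mbit_to_MBps 145   # the "T3" figure quoted above: about 18 MB/s
```

Either way the result is well under the sequential write speed of a single modern drive, which is the point being made.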
Re: raid problem: after every reboot /dev/sdb1 is removed?
Berni wrote: Hi! I have the following problem with my softraid (raid 1). I'm running Ubuntu 7.10 64bit with kernel 2.6.22-14-generic. After every reboot my first boot partition in md0 is not in sync. One of the disks (the sdb1) is removed. After a resync every partition is in sync. But after a reboot the state is "removed". The disks are new, both seagate 250gb with exactly the same partition table. Did you create the raid arrays and then install on them? Or add them after the fact? I have seen this type of problem when the initrd doesn't start the array before pivotroot, usually because the raid capabilities aren't in the boot image. In that case rerunning grub and mkinitrd may help. I run raid on Redhat distributions, and some Slackware, so I can't speak for Ubuntu from great experience, but that's what it sounds like. When you boot, is /boot mounted on a degraded array or on the raw partition? Here are some config files:

#cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 sda6[0] sdb6[1]
      117185984 blocks [2/2] [UU]
md1 : active raid1 sda5[0] sdb5[1]
      1951744 blocks [2/2] [UU]
md0 : active raid1 sda1[0]
      19534912 blocks [2/1] [U_]   <<<<<<<< this is the problem: looks like U_ after reboot
unused devices:

#fdisk /dev/sda
   Device Boot   Start     End     Blocks  Id  System
/dev/sda1             1    2432   19535008+ fd  Linux raid autodetect
/dev/sda2          2433   17264  119138040   5  Extended
/dev/sda3  *      17265   20451   25599577+  7  HPFS/NTFS
/dev/sda4         20452   30400   79915342+  7  HPFS/NTFS
/dev/sda5          2433    2675    1951866  fd  Linux raid autodetect
/dev/sda6          2676   17264  117186111  fd  Linux raid autodetect

#fdisk /dev/sdb
   Device Boot   Start     End     Blocks  Id  System
/dev/sdb1             1    2432   19535008+ fd  Linux raid autodetect
/dev/sdb2          2433   17264  119138040   5  Extended
/dev/sdb3         17265   30400  105514920   7  HPFS/NTFS
/dev/sdb5          2433    2675    1951866  fd  Linux raid autodetect
/dev/sdb6          2676   17264  117186111  fd  Linux raid autodetect

# mount
/dev/md0 on / type reiserfs (rw,notail)
proc on /proc type 
proc (rw,noexec,nosuid,nodev)
/sys on /sys type sysfs (rw,noexec,nosuid,nodev)
varrun on /var/run type tmpfs (rw,noexec,nosuid,nodev,mode=0755)
varlock on /var/lock type tmpfs (rw,noexec,nosuid,nodev,mode=1777)
udev on /dev type tmpfs (rw,mode=0755)
devshm on /dev/shm type tmpfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
lrm on /lib/modules/2.6.22-14-generic/volatile type tmpfs (rw)
/dev/md2 on /home type reiserfs (rw)
securityfs on /sys/kernel/security type securityfs (rw)

Could anyone help me to solve this problem? thanks greets Berni -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
Janek Kozicki wrote: Hello, Yes, I know that some levels give faster reading and slower writing, etc. I want to talk here about typical workstation usage: compiling stuff (like the kernel), editing openoffice docs, browsing the web, reading email (email: I have a webdir format, and in the boost mailing list directory I have 14000 files (posts); opening this directory takes circa 10 seconds in sylpheed). Moreover, opening .pdf files, more compiling of C++ stuff, etc... In other words, like most systems, more reads than writes. And while writes can be (and usually are) cached and buffered, when you need the next bit of data the program waits for it, which is far more user-visible. If this suggests tuning for acceptable write and max read speed, and setting the readahead higher than the default, then you have reached the same conclusion as I did. I have a remote backup system configured (with rsnapshot), which does backups two times a day. So I'm not afraid of losing all my data due to disc failure. I want absolute speed. Currently I have Raid-0, because I was thinking that this one is fastest. But I also don't need twice the capacity. I could use Raid-1 as well, if it were faster. Due to the recent discussion about Raid-10,f2 I'm getting worried that Raid-0 is not the fastest solution, but instead Raid-10,f2 is faster. So how is it really: which level gives maximum overall speed? I would like to make a benchmark, but currently, technically, I'm not able to. I'll be able to do it next month, and then - as a result of this discussion - I will switch to another level and post benchmark results here. How does overall performance change with the number of available drives? Perhaps Raid-0 is best for 2 drives, while Raid-10 is best for 3, 4 and more drives? best regards -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." 
Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
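The read-ahead tuning mentioned above is done per block device with blockdev (a sketch; 4096 sectors, i.e. 2 MiB, is just an example value to benchmark against the default):

```shell
# Current read-ahead, in 512-byte sectors.
blockdev --getra /dev/md0
# Raise it, then re-run your read benchmark to compare.
blockdev --setra 4096 /dev/md0
```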
Re: In this partition scheme, grub does not find md information?
Moshe Yudkowsky wrote: Michael Tokarev wrote: You only write to root (including /bin and /lib and so on) during software (re)install and during some configuration work (writing /etc/passwd and the like). The first is very infrequent, and both need only a few writes -- so write speed isn't important. Thanks, but I didn't make myself clear. The performance problem I'm concerned about was having different md drives accessing different partitions. For example, I can partition the drives as follows: /dev/sd[abcd]1 -- RAID1, /boot /dev/sd[abcd]2 -- RAID5, the rest of the file system I originally had asked, way back when, if having different md drives on different partitions of the *same* disk was a problem for performance -- or if, for some reason (e.g., threading) it was actually smarter to do it that way. The answer I received was from Iustin Pop, who said: Iustin Pop wrote: md code works better if it's only one array per physical drive, because it keeps statistics per array (like last accessed sector, etc.) and if you combine two arrays on the same drive these statistics are not exactly true anymore So if I use /boot on its own drive and it's only accessed at startup, the /boot will only be accessed that one time and afterwards won't cause problems for the drive statistics. However, if I put /boot, /bin, and /sbin on this RAID1 drive, it will always be accessed and it might create a performance issue. I always put /boot on a separate partition, just to run raid1 which I don't use elsewhere. To return to that performance question, since I have to create at least 2 md drives using different partitions, I wonder if it's smarter to create multiple md drives for better performance. 
/dev/sd[abcd]1 -- RAID1, the /boot, /dev, /bin, /sbin /dev/sd[abcd]2 -- RAID5, most of the rest of the file system /dev/sd[abcd]3 -- RAID10 o2, a drive that does a lot of downloading (writes) I think the speed of downloads is so far below the capacity of an array that you won't notice, and hopefully you will use things you download more than once, so you still get more reads than writes. For typical filesystem usage, raid5 works well for both reads and (cached, delayed) writes. It's workloads like databases where raid5 performs badly. Ah, very interesting. Is this true even for (dare I say it?) bittorrent downloads? What do you have for bandwidth? Probably not more than a T3 (145Mbit) which will max out at ~15MB/s, far below the write performance of a single drive, much less an array (even raid5). What you do care about is your data integrity. It's not really interesting to reinstall a system or lose your data in case something goes wrong, and it's best to have recovery tools as easily available as possible. Plus, the amount of space you need. Sure, I understand. And backing up in case someone steals your server. But did you have something specific in mind when you wrote this? Don't all these configurations (RAID5 vs. RAID10) have equal recovery tools? Or were you referring to the file system? Reiserfs and XFS both seem to have decent recovery tools. LVM is a little tempting because it allows for snapshots, but on the other hand I wonder if I'd find it useful. If you are worried about performance, perhaps some reading of comments on LVM would be in order. I personally view it as a trade-off of performance for flexibility. Also, placing /dev on a tmpfs helps a lot to minimize the number of writes to the root fs. Another interesting idea. I'm not familiar with using tmpfs (no need, until now); but I wonder how you create the devices you need when you're doing a rescue. When you start udev, your /dev will be on tmpfs. 
Sure, that's what mount shows me right now -- using a standard Debian install. What did you suggest I change? -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: In this partition scheme, grub does not find md information?
Moshe Yudkowsky wrote: Keld Jørn Simonsen wrote: On Tue, Jan 29, 2008 at 06:32:54PM -0600, Moshe Yudkowsky wrote: Hmm, why would you put swap on a raid10? I would in a production environment always put it on separate swap partitions, possibly a number, given that a number of drives are available. In a production server, however, I'd use swap on RAID in order to prevent server downtime if a disk fails -- a suddenly bad swap can easily (will absolutely?) cause the server to crash (even though you can boot the server up again afterwards on the surviving swap partitions). I see. Which file system type would be good for this? I normally use XFS but maybe another FS is better, given that swap is used very randomly (read/write). Will a bad swap crash the system? Well, Peter says it will, and that's good enough for me. :-) I've done unplanned research into this: it will crash the system, and if you're unlucky some part of what's needed for a graceful crash will be swapped out :-( As for which file system: I would use fdisk to partition the md disk and then use mkswap on the partition to make it into a swap partition. It's a naive approach but I suspect it's almost certainly the correct one. I generally dedicate a partition of each drive to swap, but the type is "raid array." Then I create a raid10 on that set of partitions and mkswap on the md device. While raid10 is fast and reliable, raid[56] have similar reliability and more usable space from any given configuration. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
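Bill's recipe in shell form, for the archives (a sketch; the partition names and the 2-drive f2 layout are illustrative, not from his message):

```shell
# One "raid array" type partition per drive, raid10 across them...
mdadm --create /dev/md3 -R -l 10 -n 2 -p f2 /dev/sda2 /dev/sdb2
# ...then simply mkswap the md device and turn it on.
mkswap /dev/md3
swapon /dev/md3
# Matching /etc/fstab entry:
#   /dev/md3  none  swap  sw  0  0
```

This way a failed disk degrades the swap array instead of yanking live swap out from under the kernel.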
Re: In this partition scheme, grub does not find md information?
Peter Rabbitson wrote: Keld Jørn Simonsen wrote: On Tue, Jan 29, 2008 at 06:44:20PM -0500, Bill Davidsen wrote: Depending on near/far choices, raid10 should be faster than raid5; with the far layout reads should be quite a bit faster. You can't boot off raid10, and if you put your swap on it many recovery CDs won't use it. But for general use and swap on a normally booted system it is quite fast. Hmm, why would you put swap on a raid10? I would in a production environment always put it on separate swap partitions, possibly a number, given that a number of drives are available. Because you want some redundancy for the swap as well. A swap partition/file becoming inaccessible is equivalent to yanking a stick of memory out of your motherboard. I can't say it better. Losing a swap area will make the system fail in one way or the other, on my systems typically expressed as a crash of varying severity. I use raid10 because it is the fastest reliable level I've found. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: In this partition scheme, grub does not find md information?
Moshe Yudkowsky wrote: Bill Davidsen wrote: According to man md(4), the o2 is likely to offer the best combination of read and write performance. Why would you consider f2 instead? f2 is faster for read, most systems spend more time reading than writing. According to md(4), offset "should give similar read characteristics to 'far' if a suitably large chunk size is used, but without as much seeking for writes." Is the man page not correct, conditionally true, or simply not understood by me (most likely case)? I wonder what "suitably large" is... My personal experience is that as chunk gets larger random write gets slower, sequential gets faster. I don't have numbers any more, but 20-30% is sort of the limit of what I saw for any chunk size I consider reasonable. f2 is faster for sequential reading, tune your system to annoy you least. ;-) -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: In this partition scheme, grub does not find md information?
Moshe Yudkowsky wrote: Keld Jørn Simonsen wrote: Based on your reports of better performance on RAID10 -- which are more significant than I'd expected -- I'll just go with RAID10. The only question now is if LVM is worth the performance hit or not. I would be interested if you would experiment with this wrt boot time, for example the difference between /root on a raid5, raid10,f2 and raid10,o2. According to man md(4), the o2 is likely to offer the best combination of read and write performance. Why would you consider f2 instead? f2 is faster for read; most systems spend more time reading than writing. I'm unlikely to do any testing beyond running bonnie++ or something similar once it's installed. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: In this partition scheme, grub does not find md information?
Moshe Yudkowsky wrote: I'd like to thank everyone who wrote in with comments and explanations. And in particular it's nice to see that I'm not the only one who's confused. I'm going to convert back to the RAID 1 setup I had before for /boot, 2 hot and 2 spare across four drives. No, that's wrong: 4 hot makes the most sense. And given that RAID 10 doesn't seem to confer (for me, as far as I can tell) advantages in speed or reliability -- or the ability to mount just one surviving disk of a mirrored pair -- over RAID 5, I think I'll convert back to RAID 5, put in a hot spare, and do regular backups (as always). Oh, and use reiserfs with data=journal. Depending on near/far choices, raid10 should be faster than raid5; with the far layout reads should be quite a bit faster. You can't boot off raid10, and if you put your swap on it many recovery CDs won't use it. But for general use and swap on a normally booted system it is quite fast. Comments back: Peter Rabbitson wrote: Maybe you are, depending on your settings, but this is beside the point. No matter what 1+0 you have (linux, classic, or otherwise) you can not boot from it, as there is no way to see the underlying filesystem without the RAID layer. Sir, thank you for this unequivocal comment. This comment clears up all my confusion. I had a wrong mental model of how file system maps work. With the current state of affairs (available mainstream bootloaders) the rule is: Block devices containing the kernel/initrd image _must_ be either: * a regular block device (/sda1, /hda, /fd0, etc.) * or a linux RAID 1 with the superblock at the end of the device (0.9 or 1.0) Thanks even more: 1.0 it is. This is how you find the actual raid version: mdadm -D /dev/md[X] | grep Version This will return a string of the form XX.YY.ZZ. Your superblock version is XX.YY. Ah hah! Mr. 
Tokarev wrote: By the way, on all our systems I use a small (256Mb for small-software systems, sometimes 512M, but 1G should be sufficient) partition for a root filesystem (/etc, /bin, /sbin, /lib, and /boot), and put it on a raid1 on all... ... doing [it] this way, you always have all the tools necessary to repair a damaged system, even in case your raid didn't start, or you forgot where your root disk is, etc etc. An excellent idea. I was going to put just /boot on the RAID 1, but there's no reason why I can't add a bit more room and put them all there. (Because I was having so much fun on the install, I'm using 4GB that I was going to use for swap space to mount a base install, and I'm working from there to build the RAID. Same idea.) Hmmm... I wonder if this more expansive /bin, /sbin, and /lib causes hits on the RAID1 drive which ultimately degrade overall performance? /lib is hit only at boot time to load the kernel, I'll guess, but /bin includes such common tools as bash and grep. Also, placing /dev on a tmpfs helps a lot to minimize the number of writes to the root fs. Another interesting idea. I'm not familiar with using tmpfs (no need, until now); but I wonder how you create the devices you need when you're doing a rescue. Again, my thanks to everyone who responded and clarified. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Use new sb type
David Greaves wrote: Jan Engelhardt wrote: This makes 1.0 the default sb type for new arrays. IIRC there was a discussion a while back on renaming mdadm options (google "Time to deprecate old RAID formats?") and the superblocks to emphasise the location and data structure. Would it be good to introduce the new names at the same time as changing the default format/on-disk-location? Yes, I suggested some layout names, as did a few other people, and a few changes to separate metadata type and position were discussed. BUT, changing the default layout, no matter how "better" it seems, is trumped by "breaks existing setups and user practice." For all of the reasons something else is preferable, 1.0 *works*. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.24-rc6 reproducible raid5 hang
Carlos Carvalho wrote: Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29: >Subtitle: Patch to mainline yet? > >Hi > >I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand >on my server. I applied all 4 pending patches to .24. It's been better than .22 and .23... Unfortunately the bitmap and raid1 patches don't go in .22.16. Neil, have these been sent up against 24-stable and 23-stable? -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: striping of a 4 drive raid10
Keld Jørn Simonsen wrote: On Mon, Jan 28, 2008 at 07:13:30AM +1100, Neil Brown wrote: On Sunday January 27, [EMAIL PROTECTED] wrote: Hi I have tried to make a striping raid out of my new 4 x 1 TB SATA-2 disks. I tried raid10,f2 in several ways: 1: md0 = raid10,f2 of sda1+sdb1, md1 = raid10,f2 of sdc1+sdd1, md2 = raid0 of md0+md1 2: md0 = raid0 of sda1+sdb1, md1 = raid0 of sdc1+sdd1, md2 = raid10,f2 of md0+md1 3: md0 = raid10,f2 of sda1+sdb1, md1 = raid10,f2 of sdc1+sdd1, chunksize of md0 = md1 = 128 KB, md2 = raid0 of md0+md1, chunksize = 256 KB 4: md0 = raid0 of sda1+sdb1, md1 = raid0 of sdc1+sdd1, chunksize of md0 = md1 = 128 KB, md2 = raid10,f2 of md0+md1, chunksize = 256 KB 5: md0 = raid10,f4 of sda1+sdb1+sdc1+sdd1 Try 6: md0 = raid10,f2 of sda1+sdb1+sdc1+sdd1 That I already tried (and I wrongly stated that I used f4 instead of f2). I twice had a throughput of about 300 MB/s, but since then I could not reproduce the behaviour. Are there errors in this that have been corrected in newer kernels? Also try raid10,o2 with a largeish chunk size (256KB is probably big enough). I tried that too, but my mdadm did not allow me to use the o flag. My kernel is 2.6.12 and mdadm is v1.12.0 - 14 June 2005. Can I upgrade mdadm alone to a newer version, and if so which is recommendable? I doubt that updating mdadm is going to help; the kernel is old and lacks a number of improvements from the last few years. I don't think you will see any major improvements without a kernel upgrade. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: idle array consuming cpu ??!!
Carlos Carvalho wrote: Bill Davidsen ([EMAIL PROTECTED]) wrote on 22 January 2008 17:53: >Carlos Carvalho wrote: >> Neil Brown ([EMAIL PROTECTED]) wrote on 21 January 2008 12:15: >> >On Sunday January 20, [EMAIL PROTECTED] wrote: >> >> A raid6 array with a spare and bitmap is idle: not mounted and with no >> >> IO to it or any of its disks (obviously), as shown by iostat. However >> >> it's consuming cpu: since reboot it used about 11min in 24h, which is quite >> >> a lot even for a busy array (the cpus are fast). The array was cleanly >> >> shutdown so there's been no reconstruction/check or anything else. >> >> >> >> How can this be? Kernel is 2.6.22.16 with the two patches for the >> >> deadlock ("[PATCH 004 of 4] md: Fix an occasional deadlock in raid5 - >> >> FIX") and the previous one. >> > >> >Maybe the bitmap code is waking up regularly to do nothing. >> > >> >Would you be happy to experiment? Remove the bitmap with >> > mdadm --grow /dev/mdX --bitmap=none >> > >> >and see how that affects cpu usage? >> >> Confirmed, removing the bitmap stopped cpu consumption. > >Looks like quite a bit of CPU going into idle arrays here, too. I don't mind the cpu time (in the machines where we use it here), what worries me is that it shouldn't happen when the disks are completely idle. Looks like there's a bug somewhere. That's my feeling, I have one array with an internal bitmap and one with no bitmap, and the internal bitmap uses CPU even when the machine is idle. I have *not* tried an external bitmap. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: idle array consuming cpu ??!!
Carlos Carvalho wrote: Neil Brown ([EMAIL PROTECTED]) wrote on 21 January 2008 12:15: >On Sunday January 20, [EMAIL PROTECTED] wrote: >> A raid6 array with a spare and bitmap is idle: not mounted and with no >> IO to it or any of its disks (obviously), as shown by iostat. However >> it's consuming cpu: since reboot it used about 11min in 24h, which is quite >> a lot even for a busy array (the cpus are fast). The array was cleanly >> shutdown so there's been no reconstruction/check or anything else. >> >> How can this be? Kernel is 2.6.22.16 with the two patches for the >> deadlock ("[PATCH 004 of 4] md: Fix an occasional deadlock in raid5 - >> FIX") and the previous one. > >Maybe the bitmap code is waking up regularly to do nothing. > >Would you be happy to experiment? Remove the bitmap with > mdadm --grow /dev/mdX --bitmap=none > >and see how that affects cpu usage? Confirmed, removing the bitmap stopped cpu consumption. Looks like quite a bit of CPU going into idle arrays here, too. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: One Large md or Many Smaller md for Better Performance?
Moshe Yudkowsky wrote: Carlos Carvalho wrote: I use reiser3 and xfs. reiser3 is very good with many small files. A simple test shows interactively perceptible results: removing large files is faster with xfs, removing large directories (ex. the kernel tree) is faster with reiser3. My current main concern about XFS and reiser3 is writebacks. The default mode for ext3 is "journal," which in case of power failure is more robust than the writeback modes of XFS, reiser3, or JFS -- or so I'm given to understand. On the other hand, I have a UPS and it should shut down gracefully regardless if there's a power failure. I wonder if I'm being too cautious? No. If you haven't actually *tested* the UPS failover code to be sure your system is talking to the UPS properly, and that the UPS is able to hold up power long enough for a shutdown after the system detects the problem, then you don't know if you actually have protection. Even then, if you don't proactively replace batteries on schedule, etc, then you aren't as protected as you might wish to be. And CPU fans fail, capacitors pop, power supplies fail, etc. These are things which have happened here in the last ten years. I also had a charging circuit in a UPS half-fail (from full wave rectifier to half). So the UPS would discharge until it ran out of power, then the system would fail hard. By the time I got on site and rebooted the UPS had trickle charged and would run the system. After replacing two "intermittent power supplies" in the system, the UPS was swapped on general principles and the real problem was isolated. Shit happens, don't rely on graceful shutdowns (or recovery, have backups). -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." 
Otto von Bismarck
Re: One Large md or Many Smaller md for Better Peformance?
Moshe Yudkowsky wrote: Question: with the same number of physical drives, do I get better performance with one large md-based drive, or do I get better performance if I have several smaller md-based drives? Situation: dual CPU, 4 drives (which I will set up as RAID-1 after being terrorized by the anti-RAID-5 polemics included in the Debian distro of mdadm). I've two choices: 1. Allocate all the drive space into a single large partition, place into a single RAID array (either 10 or 1 + LVM, a separate question). One partitionable RAID-10, perhaps, then partition as needed. Read the discussion here about performance of LVM and RAID. I personally don't do LVM unless I know I will have to have great flexibility of configuration and can give up performance to get it. Others report different results, so make up your own mind. 2. Allocate each drive into several smaller partitions. Make each set of smaller partitions into a separate RAID 1 array and use separate RAID md drives for the various file systems. Example use case: While working on other problems, I download a large torrent in the background. The torrent writes to its own, separate file system called /foo. If /foo is mounted on its own RAID 10 or 1-LVM array, will that help or hinder overall system responsiveness? It would seem a "no brainer" that giving each major filesystem its own array would allow for better threading and responsiveness, but I'm picking up hints in various pieces of documentation that the performance can be counter-intuitive. I've even considered the possibility of giving /var and /usr separate RAID arrays (data vs. executables). If an expert could chime in, I'd appreciate it a great deal. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." 
Otto von Bismarck
Re: how to create a degraded raid1 with only 1 of 2 drives ??
Mitchell Laks wrote: Hi mdadm raid gurus, I wanted to make a raid1 array, but at the moment I have only 1 drive available. The other disk is in the mail. I wanted to make a raid1 that I will use as a backup. But I need to do the backup now, before the second drive comes. So I did this: formatted /dev/sda, creating /dev/sda1 with type fd, then I tried to run mdadm -C /dev/md0 --level=1 --raid-devices=1 /dev/sda1 but I got an error message mdadm: 1 is an unusual number of drives for an array so it is probably a mistake. If you really mean it you will need to specify --force before setting the number of drives so then I tried mdadm -C /dev/md0 --level=1 --force --raid-devices=1 /dev/sda1 mdadm: /dev/sda1 is too small: 0K mdadm: create aborted now what does that mean? fdisk -l /dev/sda shows Device Boot Start End Blocks Id System /dev/sda1 1 60801 488384001 fd Linux raid autodetect so what do I do? I need to back up my data. If I simply format /dev/sda1 as an ext3 file system then I can't "add" the second drive later on. How can I set it up as a `degraded` raid1 array so I can later on add in the second drive and sync? Specify two drives and one as missing. -- Bill Davidsen <[EMAIL PROTECTED]>
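Bill's answer, spelled out as commands (the second partition name is an example for the drive still in the mail):

```shell
# Create the RAID1 with two slots, one deliberately empty ("missing");
# the array comes up degraded but is fully usable for the backup.
mdadm -C /dev/md0 --level=1 --raid-devices=2 /dev/sda1 missing

# Later, partition the new drive identically and add it; md then resyncs.
mdadm /dev/md0 --add /dev/sdb1
```

Watching cat /proc/mdstat after the --add will show the rebuild onto the new member.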
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
Justin Piszcz wrote: On Thu, 17 Jan 2008, Al Boldi wrote: Justin Piszcz wrote: On Wed, 16 Jan 2008, Al Boldi wrote: Also, can you retest using dd with different block-sizes? I can do this, moment.. I know about oflag=direct but I choose to use dd with sync and measure the total time it takes. /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync' So I was asked on the mailing list to test dd with various chunk sizes, here is the length of time it took to write 10 GiB and sync per each chunk size: 4=chunk.txt:0:25.46 8=chunk.txt:0:25.63 16=chunk.txt:0:25.26 32=chunk.txt:0:25.08 64=chunk.txt:0:25.55 128=chunk.txt:0:25.26 256=chunk.txt:0:24.72 512=chunk.txt:0:24.71 1024=chunk.txt:0:25.40 2048=chunk.txt:0:25.71 4096=chunk.txt:0:27.18 8192=chunk.txt:0:29.00 16384=chunk.txt:0:31.43 32768=chunk.txt:0:50.11 65536=chunk.txt:2:20.80 What do you get with bs=512,1k,2k,4k,8k,16k... Thanks! -- Al root 4621 0.0 0.0 12404 760 pts/2 D+ 17:53 0:00 mdadm -S /dev/md3 root 4664 0.0 0.0 4264 728 pts/5 S+ 17:54 0:00 grep D Tried to stop it when it was re-syncing, DEADLOCK :( [ 305.464904] md: md3 still in use. [ 314.595281] md: md_do_sync() got signal ... exiting Anyhow, done testing, time to move data back on if I can kill the resync process w/out deadlock. So does that indicate that there is still a deadlock issue, or that you don't have the latest patches installed? -- Bill Davidsen <[EMAIL PROTECTED]>
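The block-size sweep Al is asking for can be scripted along these lines (the output path and the per-pass size are placeholders; the original test wrote 10 GiB to the array mount point):

```shell
#!/bin/sh
# Write the same total amount with several dd block sizes and time each
# pass, mirroring the /usr/bin/time invocation quoted above.
TOTAL=$((16 * 1024 * 1024))     # small for a demo; use 10 GiB for a real run
OUT=/tmp/bigfile                # point at the array, e.g. /r1/bigfile
for bs in 512 1024 2048 4096 8192 16384 1048576; do
    /usr/bin/time -f "bs=$bs: %E" sh -c \
        "dd if=/dev/zero of=$OUT bs=$bs count=$((TOTAL / bs)) 2>/dev/null; sync"
done
```

Each pass writes exactly TOTAL bytes, so the elapsed times are directly comparable across block sizes.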
Re: md rotates RAID5 spare at boot
Sense: 00 3a 00 00 [ 40.949396] sd 7:0:1:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 40.949519] sd 7:0:1:0: [sdg] 488397168 512-byte hardware sectors (250059 MB) [ 40.949588] sd 7:0:1:0: [sdg] Write Protect is off [ 40.949640] sd 7:0:1:0: [sdg] Mode Sense: 00 3a 00 00 [ 40.949668] sd 7:0:1:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 40.949734] sdg: sdg1 sdg2 [ 40.969827] sd 7:0:1:0: [sdg] Attached SCSI disk [ 40.969926] sd 7:0:1:0: Attached scsi generic sg7 type 0 [ 41.206078] md: md0 stopped. [ 41.206137] md: unbind [ 41.206187] md: export_rdev(sdb1) [ 41.206253] md: unbind [ 41.206302] md: export_rdev(sdc1) [ 41.206360] md: unbind [ 41.206408] md: export_rdev(sda1) [ 41.247389] md: bind [ 41.247584] md: bind [ 41.247787] md: bind [ 41.247971] md: bind [ 41.248151] md: bind [ 41.248325] md: bind [ 41.256718] raid5: device sde1 operational as raid disk 0 [ 41.256771] raid5: device sdc1 operational as raid disk 4 [ 41.256821] raid5: device sda1 operational as raid disk 3 [ 41.256870] raid5: device sdb1 operational as raid disk 2 [ 41.256919] raid5: device sdf1 operational as raid disk 1 [ 41.257426] raid5: allocated 5245kB for md0 [ 41.257476] raid5: raid level 5 set md0 active with 5 out of 5 devices, algorithm 2 [ 41.257538] RAID5 conf printout: [ 41.257584] --- rd:5 wd:5 [ 41.257631] disk 0, o:1, dev:sde1 [ 41.257677] disk 1, o:1, dev:sdf1 [ 41.257724] disk 2, o:1, dev:sdb1 [ 41.257771] disk 3, o:1, dev:sda1 [ 41.257817] disk 4, o:1, dev:sdc1 [ 41.257952] md: md1 stopped. [ 41.258009] md: unbind [ 41.258060] md: export_rdev(sdb2) [ 41.258128] md: unbind [ 41.258179] md: export_rdev(sda2) [ 41.258248] md: unbind [ 41.258306] md: export_rdev(sdc2) [ 41.283067] md: bind [ 41.283297] md: bind [ 41.285235] md: bind [ 41.306753] md: md1 stopped. 
[ 41.306818] md: unbind [ 41.306878] md: export_rdev(sdb2) [ 41.306956] md: unbind [ 41.307007] md: export_rdev(sda2) [ 41.307075] md: unbind [ 41.307130] md: export_rdev(sdc2) [ 41.312250] md: bind [ 41.312476] md: bind [ 41.312711] md: bind [ 41.312922] md: bind [ 41.313138] md: bind [ 41.313343] md: bind [ 41.313452] md: md1: raid array is not clean -- starting background reconstruction [ 41.322189] raid5: device sde2 operational as raid disk 0 [ 41.322243] raid5: device sdc2 operational as raid disk 4 [ 41.322292] raid5: device sdg2 operational as raid disk 3 [ 41.322342] raid5: device sdb2 operational as raid disk 2 [ 41.322391] raid5: device sdf2 operational as raid disk 1 [ 41.322823] raid5: allocated 5245kB for md1 [ 41.322872] raid5: raid level 5 set md1 active with 5 out of 5 devices, algorithm 2 [ 41.322934] RAID5 conf printout: [ 41.322980] --- rd:5 wd:5 [ 41.323026] disk 0, o:1, dev:sde2 [ 41.323073] disk 1, o:1, dev:sdf2 [ 41.323119] disk 2, o:1, dev:sdb2 [ 41.323165] disk 3, o:1, dev:sdg2 [ 41.323212] disk 4, o:1, dev:sdc2 [ 41.323316] md: resync of RAID array md1 [ 41.323364] md: minimum _guaranteed_ speed: 1000 KB/sec/disk. [ 41.323415] md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for resync. [ 41.323492] md: using 128k window, over a total of 231978496 blocks. -- Bill Davidsen <[EMAIL PROTECTED]>
Re: 2.6.24-rc6 reproducible raid5 hang
Neil Brown wrote: On Wednesday January 9, [EMAIL PROTECTED] wrote: On Jan 9, 2008 5:09 PM, Neil Brown <[EMAIL PROTECTED]> wrote: On Wednesday January 9, [EMAIL PROTECTED] wrote: Can you test it please? This passes my failure case. Thanks! Does it seem reasonable? What do you think about limiting the number of stripes the submitting thread handles to be equal to what it submitted? If I'm a thread that only submits 1 stripe worth of work should I get stuck handling the rest of the cache? Dunno. Someone has to do the work, and leaving it all to raid5d means that it all gets done on one CPU. I expect that most of the time the queue of ready stripes is empty so make_request will mostly only handle its own stripes anyway. The times that it handles other threads' stripes will probably balance out with the times that other threads handle this thread's stripes. So I'm inclined to leave it as "do as much work as is available to be done" as that is simplest. But I can probably be talked out of it with a convincing argument. How about "it will perform better (defined as faster) during conditions of unusual i/o activity?" Is that a convincing argument to use your solution as offered? How about "complexity and maintainability are a zero-sum problem?" -- Bill Davidsen <[EMAIL PROTECTED]>
Re: Raid 1, can't get the second disk added back in.
Neil Brown wrote: On Monday January 7, [EMAIL PROTECTED] wrote: Problem is not raid, or at least not obviously raid related. The problem is that the whole disk, /dev/hdb is unavailable. Maybe check /sys/block/hdb/holders ? lsof /dev/hdb ? good luck :-) losetup -a may help, lsof doesn't seem to show files used in loop mounts. Yes, long shot... -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Last ditch plea on remote double raid5 disk failure
Michael Tokarev wrote: Neil Brown wrote: On Monday December 31, [EMAIL PROTECTED] wrote: I'm hoping that if I can get raid5 to continue despite the errors, I can bring back up enough of the server to continue, a bit like the remount-ro option in ext2/ext3. If not, oh well... Sorry, but it is "oh well". And another thought around all this. Linux sw raid definitely needs a way to proactively replace a (probably failing) drive, without removing it from the array first. Something like, mdadm --add /dev/md0 /dev/sdNEW --inplace /dev/sdFAILING so that sdNEW will be a mirror of sdFAILING, and once the "recovery" procedure finishes (which may use data from other drives in case of I/O error reading sdFAILING - unlike the described scenario of making a superblock-less mirror of sdNEW and sdFAILING), mdadm --remove /dev/md0 /dev/sdFAILING, which does not involve any further reconstruction. I really like that idea, it addresses the same problem as the various posts regarding creating little raid1 arrays of the old and new drive, etc. I would like an option to keep a drive with bad sectors in an array if removing the drive would prevent the array from running (or starting). I don't think that should be default, but there are times when some data is way better than none. I would think the options are fail the drive, set the array r/o, and return an error and keep going. -- Bill Davidsen <[EMAIL PROTECTED]>
Re: New XFS benchmarks using David Chinner's recommendations for XFS-based optimizations.
Justin Piszcz wrote: Dave's original e-mail: # mkfs.xfs -f -l lazy-count=1,version=2,size=128m -i attr=2 -d agcount=4 # mount -o logbsize=256k And if you don't care about filesystem corruption on power loss: # mount -o logbsize=256k,nobarrier Those mkfs values (except for log size) will be the defaults in the next release of xfsprogs. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - I used his mkfs.xfs options verbatim but I use my own mount options: noatime,nodiratime,logbufs=8,logbsize=262144 Here are the results of 3 bonnie++ runs averaged together for each test: http://home.comcast.net/~jpiszcz/xfs1/result.html Thanks Dave, this looks nice--the more optimizations the better! --- I also find it rather peculiar that in some of my (other) benchmarks my RAID 5 is just as fast as RAID 0 for extracting large (uncompressed) files: RAID 5 (1024k CHUNK) 26.95user 6.72system 0:37.89elapsed 88%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (6major+526minor)pagefaults 0swaps Compare with RAID 0 for the same operation: (as with RAID5, it appears 256k-1024k, possibly up to 2048k, is the sweet spot). Why does mdadm still use 64k for the default chunk size? Write performance with small files, I would think. There is some information in old posts, but I don't seem to find them as quickly as I would like. And another quick question, would there be any benefit to using (if it were possible) a block size of > 4096 bytes with XFS (I assume only IA64/similar arch can support it), e.g. x86_64 cannot because the page_size is 4096. [ 8265.407137] XFS: only pagesize (4096) or less will currently work. -- Bill Davidsen <[EMAIL PROTECTED]>
Re: On the subject of RAID-6 corruption recovery
H. Peter Anvin wrote: I got a private email a while ago from Thiemo Nagel claiming that some of the conclusions in my RAID-6 paper were incorrect. This was combined with a "proof" which was plain wrong, and could easily be disproven using basic entropy accounting (i.e. how much information is around to play with.) However, it did cause me to clarify the text portion a little bit. In particular, *in practice* it may be possible to *probabilistically* detect multidisk corruption. Probabilistic detection means that the detection is not guaranteed, but it can be taken advantage of opportunistically. If this means that there can be no false positives for multidisk corruption but may be false negatives, fine. If it means something else, please restate one more time. -- Bill Davidsen <[EMAIL PROTECTED]>
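For readers without the paper at hand, the arithmetic behind that restatement is roughly as follows (my paraphrase, with data disks D_0 ... D_{n-1} and g a generator of GF(2^8)):

```latex
P = D_0 \oplus D_1 \oplus \cdots \oplus D_{n-1}, \qquad
Q = g^0 D_0 \oplus g^1 D_1 \oplus \cdots \oplus g^{n-1} D_{n-1}
```

If exactly one data disk $z$ is corrupted by an error $e \ne 0$, then $P \oplus P' = e$ and $Q \oplus Q' = g^z e$, so $z = \log_g\bigl((Q \oplus Q')/(P \oplus P')\bigr)$ always decodes to a valid drive index $0 \le z < n$. With multi-disk corruption, that quotient may or may not be such a power of $g$: when it is not, multi-disk corruption has been positively detected (no false positive); when it is, the corruption masquerades as a single-disk error (a false negative) -- which matches Bill's reading above.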
Re: raid5 grow reshaping speed is unchangeable
Cody Yellan wrote: I forgot the version information: mdadm - v2.5.4 - 13 October 2006 kernel 2.6.18-53.el5 #1 SMP Would anyone consider it unsafe to upgrade to the latest version of mdadm on a production machine using Neil Brown's srpm? I wouldn't expect any problems, although I don't think there will be a benefit, either. The problem with RHEL is that while it's stable in terms of bug fixes, it also doesn't get any performance benefits. Why not wait until Neil gets back from the holiday and see if he has any words of advice? -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid5 grow reshaping speed is unchangeable
Cody Yellan wrote: I had a 4x500GB SATA2 array, md0. I added one 500GB drive and reshaping began at ~2500K/sec. Changing /proc/sys/dev/raid/speed_limit_m{in,ax} or /sys/block/md0/md/sync_speed_m{in,ax} had no effect. I shut down all unnecessary services and the array is offline (not mounted). I have read that the throttling code is "fragile" (esp. with regard to raid5) but does this make sense? I will wait (in)patiently for it to finish, but I do wonder why the configuration parameters have no effect. This is a dual quad 2GHz Xeon machine with 8GB of memory running RHEL5. Is this the maximum speed? Something else going on, I do better than that adding drives on USB! I don't have a clue what the issue is, and I don't see anything in your information which looks unusual. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
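For reference, these are the knobs the poster tried (the numbers are examples, in KB/sec per disk); on a healthy setup either path should visibly change the speed= field in /proc/mdstat:

```shell
# Global floor and ceiling for resync/reshape throughput
echo 50000  > /proc/sys/dev/raid/speed_limit_min
echo 200000 > /proc/sys/dev/raid/speed_limit_max

# Or per array
echo 50000 > /sys/block/md0/md/sync_speed_min

watch cat /proc/mdstat    # the "speed=" figure should respond
```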
Re: raid10 performance question
Peter Grandi wrote: On Tue, 25 Dec 2007 19:08:15 +, [EMAIL PROTECTED] (Peter Grandi) said: [ ... ] It's the raid10,f2 *read* performance in degraded mode that is strange - I get almost exactly 50% of the non-degraded mode read performance. Why is that? [ ... ] the mirror blocks have to be read from the inner cylinders of the next disk, which are usually a lot slower than the outer ones. [ ... ] Just to be complete there is of course the other issue that affect sustained writes too, which is extra seeks. If disk B fails the situation becomes: DISK A X C D 1 X 3 4 . . . . . . . . . . . . --- 4 X 2 3 . . . . . . . . . . . . Not only must block 2 be read from an inner cylinder, but to read block 3 there must be a seek to an outer cylinder on the same disk. Which is the same well known issue when doing sustained writes with RAID10 'f2'. I have often wondered why the elevator code doesn't do better on this sustained load, grouping the writes at the drive extremities so there would be lots of writes to nearby cylinders then a big seek and lots of writes near the next position. I tried bumping the stripe_cache, changing to alternate elevators, and just increasing the physical memory, and never saw any serious improvement beyond the speed with default settings. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raid over 48 disks
Peter Grandi wrote: On Wed, 19 Dec 2007 07:28:20 +1100, Neil Brown <[EMAIL PROTECTED]> said: [ ... what to do with 48 drive Sun Thumpers ... ] neilb> I wouldn't create a raid5 or raid6 on all 48 devices. neilb> RAID5 only survives a single device failure and with that neilb> many devices, the chance of a second failure before you neilb> recover becomes appreciable. That's just one of the many problems, others are: * If a drive fails, rebuild traffic is going to hit hard, with reading in parallel 47 blocks to compute a new 48th. * With a parity strip length of 48 it will be that much harder to avoid read-modify before write, as it will be avoidable only for writes of at least 48 blocks aligned on 48 block boundaries. And reading 47 blocks to write one is going to be quite painful. [ ... ] neilb> RAID10 would be a good option if you are happy with 24 neilb> drives worth of space. [ ... ] That sounds like the only feasible option (except for the 3 drive case in most cases). Parity RAID does not scale much beyond 3-4 drives. neilb> Alternately, 8 6-drive RAID5s or 6 8-drive RAID6s, and use neilb> RAID0 to combine them together. This would give you neilb> adequate reliability and performance and still a large neilb> amount of storage space. That sounds optimistic to me: the reason to do a RAID50 of 8x(5+1) can only be to have a single filesystem, else one could have 8 distinct filesystems each with a subtree of the whole. With a single filesystem the failure of any one of the 8 RAID5 components of the RAID0 will cause the loss of the whole lot. So in the 47+1 case a loss of any two drives would lead to complete loss; in the 8x(5+1) case only a loss of two drives in the same RAID5 will. It does not sound like a great improvement to me (especially considering the thoroughly inane practice of building arrays out of disks of the same make and model taken out of the same box). 
Quality control just isn't that good that "same box" makes a big difference, assuming that you have an appropriate number of hot spares online. Note that I said "big difference"; is there some clustering of failures? Some, but damn little. A few years ago I was working with multiple 6TB machines and 20+ 1TB machines, all using small, fast drives in RAID5E. I can't remember a case where a drive failed before rebuild was complete, and only one or two where there was a failure to degraded mode before the hot spare was replaced. That said, RAID5E typically can rebuild a lot faster than a typical hot spare as a unit drive, at least for any given impact on performance. This undoubtedly reduced our exposure time. There are also modest improvements in the RMW strip size and in the cost of a rebuild after a single drive loss. Probably the reduction in the RMW strip size is the best improvement. Anyhow, let's assume 0.5TB drives; with a 47+1 we get a single 23.5TB filesystem, and with 8*(5+1) we get a 20TB filesystem. With current filesystem technology either size is worrying, for example as to the time needed for an 'fsck'. Given that someone is putting a typical filesystem full of small files on a big raid, I agree. But fsck with large files is pretty fast on a given filesystem (200GB files on a 6TB ext3, for instance), due to the small number of inodes in play. While the bitmap resolution is a factor, it's pretty linear; fsck with lots of files gets really slow. And let's face it, the objective of raid is to avoid doing that fsck in the first place ;-) -- Bill Davidsen <[EMAIL PROTECTED]>
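The capacity and exposure numbers above can be put in rough figures (a sketch assuming 0.5 TB drives; "fatal fraction" is, given one failed drive, the share of the other 47 whose loss during rebuild kills the array):

```shell
awk 'BEGIN {
    drive = 0.5                              # TB per drive
    # 47+1: one RAID5 across all 48 drives
    printf "47+1 RAID5 : %.1f TB usable, fatal fraction %.3f\n",
           47 * drive, 47 / 47
    # 8x(5+1): RAID0 over eight 6-drive RAID5 groups
    printf "8x(5+1)    : %.1f TB usable, fatal fraction %.3f\n",
           8 * 5 * drive, 5 / 47
}'
```

This prints 23.5 TB with fatal fraction 1.000 against 20.0 TB with fatal fraction 0.106 -- giving up 3.5 TB buys roughly a 9x reduction in second-failure exposure, ignoring rebuild-time differences.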
Re: raid10: unfair disk load?
Richard Scobie wrote: Jon Nelson wrote: My own tests on identical hardware (same mobo, disks, partitions, everything) and same software, with the only difference being how mdadm is invoked (the only changes here being level and possibly layout) show that raid0 is about 15% faster on reads than the very fast raid10, f2 layout. raid10,f2 is approx. 50% of the write speed of raid0. This more or less matches my testing. Have you tested a stacked RAID 10 made up of 2 drive RAID1 arrays, striped together into a RAID0. That is not raid10, that's raid1+0. See man md. I have found this configuration to offer very good performance, at the cost of slightly more complexity. It does, raid0 can be striped over many configurations, raid[156] being most common. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
Robin Hill wrote: On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote: The (up to) 30% percent figure is mentioned here: http://insights.oetiker.ch/linux/raidoptimization.html That looks to be referring to partitioning a RAID device - this'll only apply to hardware RAID or partitionable software RAID, not to the normal use case. When you're creating an array out of standard partitions then you know the array stripe size will align with the disks (there's no way it cannot), and you can set the filesystem stripe size to align as well (XFS will do this automatically). I've actually done tests on this with hardware RAID to try to find the correct partition offset, but wasn't able to see any difference (using bonnie++ and moving the partition start by one sector at a time). # fdisk -l /dev/sdc Disk /dev/sdc: 150.0 GB, 150039945216 bytes 255 heads, 63 sectors/track, 18241 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Disk identifier: 0x5667c24a Device Boot Start End Blocks Id System /dev/sdc1 1 18241 146520801 fd Linux raid autodetect This looks to be a normal disk - the partition offsets shouldn't be relevant here (barring any knowledge of the actual physical disk layout anyway, and block remapping may well make that rather irrelevant). The issue I'm thinking about is hardware sector size, which on modern drives may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle when writing a 512b block. With larger writes, if the alignment is poor and the write size is some multiple of 512, it's possible to have an RAR at each end of the write. The only way to have a hope of controlling the alignment is to write a raw device or use a filesystem which can be configured to have blocks which are a multiple of the sector size and to do all i/o in block size starting each file on a block boundary. That may be possible with ext[234] set up properly. 
Why this is important: the physical layout of the drive is useful, but for a large write the drive will have to make some number of steps from one cylinder to another. By carefully choosing the starting point, the best improvement will be to eliminate 2 track-to-track seek times, one at the start and one at the end. If the writes are small only one t2t saving is possible. Now consider a RAR process. The drive is spinning typically at 7200 rpm, or 8.333 ms/rev. A read might take .5 rev on average, and a RAR will take 1.5 rev, because it takes a full revolution after the original data is read before the altered data can be rewritten. Larger sectors give more capacity, but reduced performance for writes. And doing small writes can result in paying the RAR penalty on every write. So there may be a measurable benefit to getting that alignment right at the drive level. -- Bill Davidsen <[EMAIL PROTECTED]>
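The rotational arithmetic in the paragraph above, made explicit (7200 rpm; average read latency is half a revolution, a RAR pays roughly a read plus a full extra revolution):

```shell
awk 'BEGIN {
    rev = 60000 / 7200                       # ms per revolution: 8.333
    printf "avg rotational latency : %.2f ms\n", rev / 2
    printf "RAR penalty            : %.2f ms\n", rev * 1.5
}'
```

That is about 4.17 ms for a plain read against 12.50 ms for a read-alter-rewrite, i.e. roughly a 3x rotational cost for every misaligned 512-byte write on a large-sector drive.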
Re: Raid over 48 disks
Thiemo Nagel wrote: Bill Davidsen wrote:

              16k read          64k write
 chunk size   RAID 5   RAID 6   RAID 5   RAID 6
      128k      492      497      268      270
      256k      615      530      288      270
      512k      625      607      230      174
     1024k      650      620      170       75

What is your stripe cache size? I didn't fiddle with the default when I did these tests. Now (with 256k chunk size) I had # cat stripe_cache_size 256 but increasing that to 1024 didn't show a noticeable improvement for reading. Still around 550MB/s. You can use blockdev to raise the readahead, either on the drives or the array. That may make a difference, I use 4-8MB on the drive, more on the array depending on how I use it. Kind regards, Thiemo -- Bill Davidsen <[EMAIL PROTECTED]>
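Bill's readahead suggestion, sketched (device names are examples; --setra counts 512-byte sectors, so 8192 sectors = 4 MiB):

```shell
blockdev --setra 8192  /dev/sd[a-f]   # ~4 MiB readahead per member drive
blockdev --setra 65536 /dev/md0       # larger readahead on the array itself
blockdev --getra /dev/md0             # verify the setting took
```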
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
Justin Piszcz wrote: On Wed, 19 Dec 2007, Bill Davidsen wrote: I'm going to try another approach, I'll describe it when I get results (or not). http://home.comcast.net/~jpiszcz/align_vs_noalign/ Hardly any difference whatsoever; only on the per-char read/write is it any faster..? Am I misreading what you are doing here... you have the underlying data on the actual hardware devices 64k aligned by using either the whole device or starting a partition on a 64k boundary? I'm dubious that you will see a difference any other way, after all the translations take place. I'm trying to create a raid array using loop devices created with the "offset" parameter, but I suspect that I will wind up doing a test after just repartitioning the drives, painful as that will be. Average of 3 runs taken: $ cat align/*log|grep , p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:10:16/64,1334210,10,330,2,120,1,3978,10,312,2 p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:10:16/64,1252548,6,296,1,115,1,7927,20,373,2 p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:10:16/64,1242445,6,303,1,117,1,6767,17,359,2 $ cat noalign/*log|grep , p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:10:16/64,1353180,8,314,1,117,1,8684,22,283,2 p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:10:16/64,12211519,29,297,1,113,1,3218,8,325,2 p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:10:16/64,1243229,8,307,1,120,1,4247,11,313,2 -- Bill Davidsen <[EMAIL PROTECTED]>
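The loop-device experiment mentioned above can be set up like this (file paths, sizes and the offsets are examples):

```shell
# Back two loop devices with image files whose data starts at different
# offsets, so an array built on them is aligned or deliberately shifted
# by the classic 63-sector partition offset.
dd if=/dev/zero of=/tmp/d0.img bs=1M count=256
dd if=/dev/zero of=/tmp/d1.img bs=1M count=256
losetup -o 0             /dev/loop0 /tmp/d0.img   # aligned start
losetup -o $((63 * 512)) /dev/loop1 /tmp/d1.img   # 63-sector start
# then e.g.: mdadm -C /dev/md9 --level=5 ... /dev/loop0 /dev/loop1 ...
```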
Re: help diagnosing bad disk
Jon Sabo wrote: I think I got it now. Thanks for your help! Now just make our holiday cheer complete by waiting until the resync is complete and rebooting to be sure that everything is *really* back as it should be. ;-)

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:14 2007
     Raid Level : raid1
     Array Size : 1951744 (1906.32 MiB 1998.59 MB)
    Device Size : 1951744 (1906.32 MiB 1998.59 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 14:15:31 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 157f716c:0e7aebca:c20741f6:bb6099c9
         Events : 0.48

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:47 2007
     Raid Level : raid1
     Array Size : 974808064 (929.65 GiB 998.20 GB)
    Device Size : 974808064 (929.65 GiB 998.20 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 14:19:06 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 156a030e:9a6f8eb3:9b0c439e:d718e744
         Events : 0.1498998

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       18        1      active sync   /dev/sdb2

[EMAIL PROTECTED]:/home/illsci# mdadm /dev/md0 -a /dev/sdb1
mdadm: re-added /dev/sdb1
[EMAIL PROTECTED]:/home/illsci# mdadm /dev/md1 -a /dev/sda2
mdadm: re-added /dev/sda2
[EMAIL PROTECTED]:/home/illsci# cat /proc/mdstat
Personalities : [multipath] [raid1]
md1 : active raid1 sda2[2] sdb2[1]
      974808064 blocks [2/1] [_U]
        resync=DELAYED
md0 : active raid1 sdb1[2] sda1[0]
      1951744 blocks [2/1] [U_]
      [=>...]  recovery = 86.6% (1693504/1951744) finish=0.0min speed=80643K/sec
unused devices: <none>
[EMAIL PROTECTED]:/home/illsci# cat /proc/mdstat
Personalities : [multipath] [raid1]
md1 : active raid1 sda2[2] sdb2[1]
      974808064 blocks [2/1] [_U]
      [>]  recovery = 0.0% (86848/974808064) finish=186.9min speed=86848K/sec
md0 : active raid1 sdb1[1] sda1[0]
      1951744 blocks [2/2] [UU]
unused devices: <none>
-- Bill Davidsen <[EMAIL PROTECTED]>
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
Justin Piszcz wrote:
On Wed, 19 Dec 2007, Bill Davidsen wrote:
Justin Piszcz wrote:
On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
On Wed, 19 Dec 2007, Justin Piszcz wrote:

-- Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       18241   146520801   fd  Linux raid autodetect

--- If I use a 10-disk RAID5 with a 1024 KiB stripe, what would be the correct start and end size if I wanted to make sure the RAID5 was stripe aligned? Or is there a better way to do this; does parted handle this situation better?

From that setup it seems simple: scrap the partition table and use the disk device for raid. This is what we do for all data storage disks (hw raid) and sw raid members.

/Mattias Wadenstein

Is there any downside to doing that? I remember when I had to take my machine apart for a BIOS downgrade; when I plugged in the sata devices again I did not plug them back in the same order. Everything worked of course, but when I ran LILO it said it was not part of the RAID set, because /dev/sda had become /dev/sdg, and it overwrote the MBR on the disk. If I had not used partitions here, I'd have lost one (or more) of the drives due to a bad LILO run?

As other posts have detailed, putting the partition on a 64k aligned boundary can address the performance problems. However, a poor choice of chunk size, cache_buffer size, or just random i/o in small sizes can eat up a lot of the benefit. I don't think you need to give up your partitions to get the benefit of alignment.

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark

Hrmm.. I am doing a benchmark now with 6 x 400GB (SATA) / 256 KiB stripe, unaligned vs. aligned raid setup.

unaligned: just fdisk /dev/sdc, make the partition, type fd raid.
aligned: fdisk, expert mode, start at 512 as the offset.

Per a Microsoft KB, example of alignment calculations in kilobytes for a 256-KB stripe unit size:

 (63 * .5) / 256 = 0.123046875
 (64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1

These examples show that the partition is not aligned correctly for a 256-KB stripe unit size until the partition is created by using an offset of 512 sectors (512 bytes per sector). So I should start at 512 for a 256k chunk size.

I ran bonnie++ three consecutive times and took the average for the unaligned case; I am rebuilding the RAID5 now, and then I will re-execute the test 3 additional times and take the average of that.

I'm going to try another approach, I'll describe it when I get results (or not).

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
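The KB arithmetic above boils down to a divisibility check: the partition's byte offset must be an exact multiple of the chunk (stripe unit) size. A small shell sketch of that check, using only the thread's example values (the legacy 63-sector start vs. a 512-sector start, with a 256 KiB chunk):

```shell
#!/bin/sh
# Check whether a partition start (in 512-byte sectors) is aligned to a
# RAID chunk size (in KiB). Values are the examples from the thread.
chunk_kib=256

aligned() {
    start_sectors=$1
    # bytes into a chunk at which the partition starts; 0 means aligned
    echo $(( (start_sectors * 512) % (chunk_kib * 1024) ))
}

echo "offset 63:  $(aligned 63) bytes into a chunk"    # 32256 -> misaligned
echo "offset 512: $(aligned 512) bytes into a chunk"   # 0 -> aligned
```

The nonzero remainder for the traditional 63-sector start is why full-stripe writes straddle chunk boundaries on an unaligned partition.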
Re: help diagnosing bad disk
Jon Sabo wrote:
So I was trying to copy over some Indiana Jones wav files and it wasn't going my way. I noticed that my software raid device showed:

/dev/md1 on / type ext3 (rw,errors=remount-ro)

Is this saying that it was remounted read-only because it found a problem with the md1 meta device? That's what it looks like it's saying, but I can still write to /. mdadm --detail showed:

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:14 2007
     Raid Level : raid1
     Array Size : 1951744 (1906.32 MiB 1998.59 MB)
    Device Size : 1951744 (1906.32 MiB 1998.59 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 12:59:56 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 157f716c:0e7aebca:c20741f6:bb6099c9
         Events : 0.28

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:47 2007
     Raid Level : raid1
     Array Size : 974808064 (929.65 GiB 998.20 GB)
    Device Size : 974808064 (929.65 GiB 998.20 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 13:14:53 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 156a030e:9a6f8eb3:9b0c439e:d718e744
         Events : 0.1990

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       0        0        1      removed

I have two 1 terabyte sata drives in this box. From what I was reading, wouldn't it show an F for the failed drive? I thought I would see that /dev/sdb1 and /dev/sdb2 were failed and it would show an F. What is this saying, and how do you know that it's /dev/sdb and not some other drive? It shows removed, and that the state is clean, degraded. Is that something you can recover from without returning this disk and putting in a new one to add to the raid1 array?

You can try adding the partitions back to your array, but I suspect something bad has happened to your sdb drive, since it's failed out of both arrays. You can use dmesg to look for any additional information. Justin gave you the rest of the info you need to investigate, I'll not repeat it. ;-)

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
Justin Piszcz wrote:
On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
On Wed, 19 Dec 2007, Justin Piszcz wrote:

-- Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       18241   146520801   fd  Linux raid autodetect

--- If I use a 10-disk RAID5 with a 1024 KiB stripe, what would be the correct start and end size if I wanted to make sure the RAID5 was stripe aligned? Or is there a better way to do this; does parted handle this situation better?

From that setup it seems simple: scrap the partition table and use the disk device for raid. This is what we do for all data storage disks (hw raid) and sw raid members.

/Mattias Wadenstein

Is there any downside to doing that? I remember when I had to take my machine apart for a BIOS downgrade; when I plugged in the sata devices again I did not plug them back in the same order. Everything worked of course, but when I ran LILO it said it was not part of the RAID set, because /dev/sda had become /dev/sdg, and it overwrote the MBR on the disk. If I had not used partitions here, I'd have lost one (or more) of the drives due to a bad LILO run?

As other posts have detailed, putting the partition on a 64k aligned boundary can address the performance problems. However, a poor choice of chunk size, cache_buffer size, or just random i/o in small sizes can eat up a lot of the benefit. I don't think you need to give up your partitions to get the benefit of alignment.

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
Re: [ERROR] scsi.c: In function 'scsi_get_serial_number_page'
Thierry Iceta wrote:
Hi, I would like to use raidtools-1.00.3 on the RHEL 5 distribution, but I got this error. Could you tell me if a new version is available, or if a patch exists to use raidtools on RHEL 5?

raidtools is old and unmaintained. Use mdadm.

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
Re: Raid over 48 disks
Mattias Wadenstein wrote:
On Wed, 19 Dec 2007, Neil Brown wrote:
On Tuesday December 18, [EMAIL PROTECTED] wrote:

We're investigating the possibility of running Linux (RHEL) on top of Sun's X4500 Thumper box: http://www.sun.com/servers/x64/x4500/ Basically, it's a server with 48 SATA hard drives. No hardware RAID. It's designed for Sun's ZFS filesystem. So... we're curious how Linux will handle such a beast. Has anyone run MD software RAID over so many disks? Then piled LVM/ext3 on top of that? Any suggestions?

There are those that have run Linux MD RAID on thumpers before. I vaguely recall some driver issues (unrelated to MD) that made it less suitable than solaris, but that might be fixed in recent kernels.

Alternately, use eight 6-drive RAID5s or six 8-drive RAID6s, and use RAID0 to combine them together. This would give you adequate reliability and performance and still a large amount of storage space.

My personal suggestion would be five 9-disk raid6s, one raid1 root mirror and one hot spare. Then raid0, lvm, or separate filesystems on those 5 raidsets for data, depending on your needs.

Other than thinking raid-10 better than raid-1 for performance, I like it. You get almost as much data space as with the six 8-disk raid6s, and have a separate pair of disks for all the small updates (logging, metadata, etc), so this makes a lot of sense if most of the data is bulk file access.

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
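The "almost as much data space" claim above is easy to check with shell arithmetic, counting data-bearing disks only (RAID6 spends two disks per set on parity; superblock overhead is ignored):

```shell
#!/bin/sh
# Data-bearing disks for the two layouts discussed, out of 48 slots.

six_by_eight=$(( 6 * (8 - 2) ))   # six 8-disk raid6 sets, striped together
five_by_nine=$(( 5 * (9 - 2) ))   # five 9-disk raid6 sets
                                  # + 2-disk raid1 root + 1 hot spare = 48 slots

echo "six 8-disk raid6:  $six_by_eight data disks"
echo "five 9-disk raid6: $five_by_nine data disks (plus mirrored root and a spare)"
```

So the dedicated root mirror and hot spare together cost only one data disk of capacity versus the six-by-eight layout.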
Re: Raid over 48 disks
Thiemo Nagel wrote:
Performance of the raw device is fair:

# dd if=/dev/md2 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s

Somewhat less through ext3 (created with -E stride=64):

# dd if=largetestfile of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s

Quite slow? 10 disks (raptors) raid 5 on regular sata controllers:

# dd if=/dev/md3 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
# dd if=bigfile of=/dev/zero bs=128k count=64k
3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s

Interesting. Any ideas what could be the reason? How much do you get from a single drive?

The Samsung HD501LJ that I'm using gives ~84MB/s when reading from the beginning of the disk. With RAID 5 I'm getting slightly better results (though I really wonder why, since naively I would expect identical read performance), but that does only account for a small part of the difference:

              16k read        64k write
chunk size  RAID 5  RAID 6  RAID 5  RAID 6
 128k         492     497     268     270
 256k         615     530     288     270
 512k         625     607     230     174
1024k         650     620     170      75

What is your stripe cache size?

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
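The stripe cache question at the end matters because the raid5/6 stripe cache strongly affects sequential write throughput. A sketch of checking and raising it through the standard md sysfs knob (the array name md3 and the 10-disk member count are placeholders from this thread, not a recommendation):

```shell
#!/bin/sh
# Inspect and raise the raid5/6 stripe cache (needs root on a real system).
md=md3        # placeholder array name
ndisks=10     # placeholder member count
size=8192     # candidate stripe_cache_size, in cache entries

if [ -e "/sys/block/$md/md/stripe_cache_size" ]; then
    cat "/sys/block/$md/md/stripe_cache_size"            # default is 256
    echo "$size" > "/sys/block/$md/md/stripe_cache_size"
fi

# Each cache entry costs one 4 KiB page per member disk, so the memory
# bill for a given setting is easy to estimate:
echo "approx $(( size * 4 * ndisks / 1024 )) MiB of stripe cache"
```

Larger values trade RAM for fewer read-modify-write cycles; the arithmetic above shows why very large settings are not free.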
Re: mdadm break / restore soft mirror
Brett Maton wrote:
Hi, Question for you guys. A brief history: RHEL 4 AS. I have a partition with way too many small files on it (usually around a couple of million) that needs to be backed up; standard methods mean that a restore is impossibly slow due to the sheer volume of files. Solution: raw backup/restore of the device. However, the partition is permanently being accessed. Proposed solution is to use a software raid mirror. Before backup starts, break the soft mirror, unmount and back up the partition, restore the soft mirror and let it resync/rebuild itself. Would the above intentional break/fix of the mirror cause any problems?

Probably. If by "accessed" you mean read-only, you can do this, but if the data is changing you have a serious problem: the split between data on the disk and data queued in memory may leave the part on the disk as an inconsistent data set. If there is a means of backing up a set of files which are changing, other than stopping the updates in a known valid state, it's not something which I've seen really work in all cases. DM has some snapshot capabilities, but in fact they have the same limitation: the data on a partition can be backed up, but unless you can ensure that the data is in a consistent state when it's frozen, your backup will have some small possibility of failure. Database programs have ways to freeze the data to do backups, but if an application doesn't have a means to force the data on the disk valid, it will only be a "pretty good" backup. I suggest looking at things like rsync, which will not solve the changing-data problem, but may do the backup quickly enough to be as useful as what you propose. Of course a full backup is likely to take a long time however you do it.

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..."
Otto von Bismark
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
Tejun Heo wrote: Bill Davidsen wrote: Jan Engelhardt wrote: On Dec 1 2007 06:26, Justin Piszcz wrote: I ran the following: dd if=/dev/zero of=/dev/sdc dd if=/dev/zero of=/dev/sdd dd if=/dev/zero of=/dev/sde (as it is always a very good idea to do this with any new disk) Why would you care about what's on the disk? fdisk, mkfs and the day-to-day operation will overwrite it _anyway_. (If you think the disk is not empty, you should look at it and copy off all usable warez beforehand :-) Do you not test your drive for minimum functionality before using them? I personally don't. Also, if you have the tools to check for relocated sectors before and after doing this, that's a good idea as well. S.M.A.R.T is your friend. And when writing /dev/zero to a drive, if it craps out you have less emotional attachment to the data. Writing all zero isn't too useful tho. Drive failing reallocation on write is catastrophic failure. It means that the drive wanna relocate but can't because it used up all its extra space which usually indicates something else is seriously wrong with the drive. The drive will have to go to the trash can. This is all serious and bad but the catch is that in such cases the problem usually stands like a sore thumb so either vendor doesn't ship such drive or you'll find the failure very early. I personally haven't seen any such failure yet. Maybe I'm lucky. The problem is usually not with what the vendor ships, but what the carrier delivers. Bad handling does happen, "drop ship" can have several meanings, and I have received shipments with the "G sensor" in the case triggered. Zero is a handy source of data, but the important thing is to look at the relocated sector count before and after the write. If there are a lot of bad sectors initially, the drive is probably a poor choice for anything critical. 
Most data loss occurs when the drive fails to read what it thought it wrote successfully, and the opposite: reading and dumping the whole disk to /dev/null periodically is probably much better than writing zeros, as it allows the drive to find out about deteriorating sectors early, while they're still readable, and relocate. But then again I think it's overkill. Writing zeros to sectors is more useful as cure rather than prevention. If your drive fails to read a sector, write whatever value to the sector. The drive will forget about the data on the damaged sector and reallocate and write new data to it. Of course, you lose the data which was originally on the sector. I personally think it's enough to just throw in an extra disk and make it RAID0 or 5 and rebuild the array if read fails on one of the disks. If write fails or read failure continues, replace the disk. Of course, if you wanna be extra cautious, good for you. :-)

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
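The "check relocated sectors before and after" advice above can be scripted. smartctl is from smartmontools, and /dev/sdc is a placeholder; the attribute parsing is demonstrated against a captured sample line so the logic is visible without a real drive:

```shell
#!/bin/sh
# Compare the reallocated-sector count before and after zero-filling a
# new drive. Column 10 of the smartctl -A attribute table is the raw value.

realloc_count() {
    awk '/Reallocated_Sector_Ct/ { print $10 }'
}

# On a real system (placeholder device, needs root):
#   before=$(smartctl -A /dev/sdc | realloc_count)
#   dd if=/dev/zero of=/dev/sdc bs=1M
#   after=$(smartctl -A /dev/sdc | realloc_count)
#   ...compare $before and $after; a jump means the drive is suspect.

# Demonstration with a sample smartctl -A line:
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0'
echo "$sample" | realloc_count
```

A count that is already high out of the box, or that grows during the zero-fill, is exactly the "poor choice for anything critical" signal discussed above.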
Re: unable to remove failed drive
Jeff Breidenbach wrote:
... and all access to the array hangs indefinitely, resulting in unkillable zombie processes. Have to hard reboot the machine. Any thoughts on the matter?
===
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sde1[6](F) sdg1[1] sdb1[4] sdd1[3] sdc1[2]
      488383936 blocks [6/4] [_UUUU_]
unused devices: <none>
# mdadm --fail /dev/md1 /dev/sde1
mdadm: set /dev/sde1 faulty in /dev/md1
# mdadm --remove /dev/md1 /dev/sde1
mdadm: hot remove failed for /dev/sde1: Device or resource busy
# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Sun Dec 25 16:12:34 2005
     Raid Level : raid1
     Array Size : 488383936 (465.76 GiB 500.11 GB)
    Device Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 6
  Total Devices : 5
Preferred Minor : 1
    Persistence : Superblock is persistent
    Update Time : Fri Dec 7 11:37:46 2007
          State : active, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0
           UUID : f3ee6aa3:2f1d5767:f443dfc0:c23e80af
         Events : 0.22331500

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       8       97       1      active sync   /dev/sdg1
       2       8       33       2      active sync   /dev/sdc1
       3       8       49       3      active sync   /dev/sdd1
       4       8       17       4      active sync   /dev/sdb1
       5       0        0        -      removed
       6       8       65        0     faulty   /dev/sde1

This is without doubt really messed up! You have four active devices, four working devices, five total devices, and six(!) raid devices. And at the end of the output seven(!!) devices: four active, two removed, and one faulty. I wouldn't even be able to make a guess how you got to this point, but I would guess that some system administration was involved. If this is an array you can live without and still have a working system, I do have a thought, however. If you can unmount everything on this device and then stop it, you may be able to assemble (-A) it with just the four working drives. If that succeeds you may be able to remove sde1, although I suspect that the two removed drives shown are really caused by partial removal of sde1 in the past. Either that, or you have a serious problem with reliability... I'm sure others will have some ideas on this; if it were mine, a backup would be my first order of business.

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
Re: Reading takes 100% precedence over writes for mdadm+raid5?
Justin Piszcz wrote:

root      2206     1  4 Dec02 ?        00:10:37 dd if=/dev/zero of=1.out bs=1M
root      2207     1  4 Dec02 ?        00:10:38 dd if=/dev/zero of=2.out bs=1M
root      2208     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=3.out bs=1M
root      2209     1  4 Dec02 ?        00:10:45 dd if=/dev/zero of=4.out bs=1M
root      2210     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=5.out bs=1M
root      2211     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=6.out bs=1M
root      2212     1  4 Dec02 ?        00:10:30 dd if=/dev/zero of=7.out bs=1M
root      2213     1  4 Dec02 ?        00:10:42 dd if=/dev/zero of=8.out bs=1M
root      2214     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=9.out bs=1M
root      2215     1  4 Dec02 ?        00:10:37 dd if=/dev/zero of=10.out bs=1M
root      3080  24.6  0.0  10356  1672 ?  D  01:22  5:51 dd if=/dev/md3 of=/dev/null bs=1M

I was curious: when running 10 dd's (which are writing to the RAID 5) everything is fine, no issues, but then they suddenly all go into D-state and give the read 100% priority? Is this normal?

I'm jumping back to the start of this thread, because after reading all the discussion I noticed that you are mixing apples and oranges here. Your write programs are going to files in the filesystem, and your read is going against the raw device. That may explain why you see something I haven't noticed doing all filesystem i/o. I am going to do a large rsync to another filesystem in the next two days; I will turn on some measurements when I do. But if you are just investigating this behavior, perhaps you could retry with a single read from a file rather than the device. [...snip...]

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
Re: raid6 check/repair
Peter Grandi wrote: [ ... on RAID1, ... RAID6 error recovery ... ] tn> The use case for the proposed 'repair' would be occasional, tn> low-frequency corruption, for which many sources can be tn> imagined: tn> Any piece of hardware has a certain failure rate, which may tn> depend on things like age, temperature, stability of tn> operating voltage, cosmic rays, etc. but also on variations tn> in the production process. Therefore, hardware may suffer tn> from infrequent glitches, which are seldom enough, to be tn> impossible to trace back to a particular piece of equipment. tn> It would be nice to recover gracefully from that. What has this got to do with RAID6 or RAID in general? I have been following this discussion with a sense of bewilderment as I have started to suspect that parts of it are based on a very large misunderstanding. tn> Kernel bugs or just plain administrator mistakes are another tn> thing. The biggest administrator mistakes are lack of end-to-end checking and backups. Those that don't have them wish their storage systems could detect and recover from arbitrary and otherwise undetected errors (but see below for bad news on silent corruptions). tn> But also the case of power-loss during writing that you have tn> mentioned could profit from that 'repair': With heterogeneous tn> hardware, blocks may be written in unpredictable order, so tn> that in more cases graceful recovery would be possible with tn> 'repair' compared to just recalculating parity. Redundant RAID levels are designed to recover only from _reported_ errors that identify precisely where the error is. Recovering from random block writing is something that seems to me to be quite outside the scope of a low level virtual storage device layer. ms> I just want to give another suggestion. It may or may not be ms> possible to repair inconsistent arrays but in either way some ms> code there MUST at least warn the administrator that ms> something (may) went wrong. tn> Agreed. 
That sounds instead quite extraordinary to me because it is not clear how to define ''inconsistency'' in the general case never mind detect it reliably, and never mind knowing when it is found how to determine which are the good data bits and which are the bad. Now I am starting to think that this discussion is based on the curious assumption that storage subsystems should solve the so called ''byzantine generals'' problem, that is to operate reliably in the presence of unreliable communications and storage. I had missed that. In fact, after rereading most of the thread I *still* miss that, so perhaps it's not there. What the OP proposed was that in the case where there is incorrect data on exactly one chunk in a raid-6 slice that the incorrect chunk be identified and rewritten with correct data. This is based on the assumptions that (a) this case can be identified, (b) the correct data value for the chunk can be calculated, (c) this only adds processing or i/o overhead when an error condition is identified by the existing code, and (d) this can be done without significant additional i/o other than rewriting the corrected data. Given these assumptions the reasons for not adding this logic would seem to be (a) one of the assumptions is wrong, (b) it would take a huge effort to code or maintain, or (c) it's wrong for raid to fix errors other than hardware, even if it could do so. Although I've looked at the logic in metadata form, and the code for doing the check now, I realize that the assumptions could be wrong, and invite enlightenment. But Thiemo posted metacode which I find appears correct, so I don't think it's a huge job to code, and since it is in a code path which currently always hides an error, it's hard to understand how added code could make things worse than they are. 
I can actually see the philosophical argument about doing only disk errors in raid code, but at least it should be a clear decision made for that reason, and not hidden by arguments that this happens rarely. Given the state of current hardware, I think virtually all errors happen rarely; the problem is that all problems happen occasionally (ref. Murphy's Law). We have a tool (check) which finds these problems, why not a tool to fix them? BTW: if this can be done in a user program, mdadm, rather than by code in the kernel, that might well make everyone happy. Okay, realistically "less unhappy."

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
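For reference, the "check" and "repair" passes discussed here are driven through the md sysfs interface; a sketch (md0 is a placeholder, and the sysfs writes need root), with the mismatch count read back afterwards:

```shell
#!/bin/sh
# Kick off an md consistency check and read back the result.
md=md0                       # placeholder array name
sys="/sys/block/$md/md"

if [ -d "$sys" ]; then
    echo check > "$sys/sync_action"   # read all stripes, compare parity, fix nothing
    # ... wait for sync_action to report 'idle' again ...
    cat "$sys/mismatch_cnt"           # nonzero: inconsistencies were found
    # 'echo repair' would instead rewrite P and Q from the data blocks --
    # exactly the "hide the symptom" behavior criticized above
fi

# Whatever runs the scrub, acting on the result is a plain numeric test:
mismatches=128               # example value read from mismatch_cnt
if [ "$mismatches" -gt 0 ]; then
    echo "array reported $mismatches mismatches; investigate before repairing"
fi
```

This is the gap the thread is arguing about: the kernel reports the count, but nothing in the stock path tells the operator which copy of the data is the bad one.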
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
Jan Engelhardt wrote:
On Dec 1 2007 06:26, Justin Piszcz wrote:

I ran the following:
dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde
(as it is always a very good idea to do this with any new disk)

Why would you care about what's on the disk? fdisk, mkfs and the day-to-day operation will overwrite it _anyway_. (If you think the disk is not empty, you should look at it and copy off all usable warez beforehand :-)

Do you not test your drives for minimum functionality before using them? Also, if you have the tools to check for relocated sectors before and after doing this, that's a good idea as well. S.M.A.R.T. is your friend. And when writing /dev/zero to a drive, if it craps out you have less emotional attachment to the data.

-- Bill Davidsen <[EMAIL PROTECTED]> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot
Re: Abysmal write performance on HW RAID5
changes and rerun? So, the RAID5 device has a huge queue of write requests with an average wait time of more than 2 seconds @ 100% utilization? Or is this a bug in iostat? At this point, I'm all ears... I don't even know where to start. Is ext2 not a good format for volumes of this size? Then how to explain the block device xfer rate being so bad, too? Is it that I have one drive in the array that's a different brand? Or that it has a different cache size? Anyone have any ideas?

You will get more and maybe better suggestions, but this is a start, just to see if the problem responds to obvious changes.

-- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
Re: raid6 check/repair
Neil Brown wrote:
On Thursday November 22, [EMAIL PROTECTED] wrote:

Dear Neil, thank you very much for your detailed answer.

Neil Brown wrote:
While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 datablocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.

If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases: a) exactly zero bad blocks, b) exactly one bad block, c) more than one bad block. Of course, it is only possible to recover from b), but one *can* tell whether the situation is a) or b) or c) and act accordingly.

It would seem that either you or Peter Anvin is mistaken. On page 9 of http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf at the end of section 4 it says:

Finally, as a word of caution it should be noted that RAID-6 by itself cannot even detect, never mind recover from, dual-disk corruption. If two disks are corrupt in the same byte positions, the above algorithm will in general introduce additional data corruption by corrupting a third drive.

The point that I'm trying to make is that there does exist a specific case in which recovery is possible, and that implementing recovery for that case will not hurt in any way.

Assuming that is true (maybe hpa got it wrong), what specific conditions would lead to one drive having corrupt data, and would correcting it on an occasional 'repair' pass be an appropriate response? Does the value justify the cost of extra code complexity?

RAID is not designed to protect against bad RAM, bad cables, chipset bugs, driver bugs, etc. It is only designed to protect against drive failure, where the drive failure is apparent, i.e. a read must return either the same data that was last written, or a failure indication. Anything else is beyond the design parameters for RAID.
I'm taking a more pragmatic approach here. In my opinion, RAID should "just protect my data": against drive failure, yes, of course, but if it can help me in case of occasional data corruption, I'd happily take that, too, especially if it doesn't cost extra... ;-)

Everything costs extra. Code uses bytes of memory, requires maintenance, and possibly introduces new bugs. I'm not convinced the failure mode that you are considering actually happens with a meaningful frequency.

People accept the hardware and performance costs of raid-6 in return for the better security of their data. If I run a check and find that I have an error, right now I have to treat that the same way as an unrecoverable failure, because the "repair" function doesn't fix the data, it just makes the symptom go away by redoing the p and q values. This makes the naive user think the problem is solved, when in fact it's now worse: he has corrupt data with no indication of a problem. The fact that (most) people who read this list are advanced enough to understand the issue does not protect the majority of users from their ignorance. If that sounds elitist, many of the people on this list are the elite, and even knowing that you need to learn and understand more is a big plus in my book. It's the people who run repair and assume the problem is fixed who get hurt by the current behavior.

If you won't fix the recoverable case by recovering, then maybe for raid-6 you could print an error message like "can't recover data, fix parity and hide the problem (y/N)?" or require a --force flag, and at least give a heads-up to the people who just picked the "most reliable raid level" because they're trying to do it right, but need a clue that they have a real and serious problem which just a "repair" can't fix. Recovering a filesystem full of "just files" is pretty easy, that's what backups with CRC are for, but a large database recovery often takes hours to restore and run journal files.
I personally consider it the job of the kernel to do recovery when it is possible; absent that, I would like the tools to tell me clearly that I have a problem and what it is. -- Bill Davidsen <[EMAIL PROTECTED]> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid6 check/repair
Thiemo Nagel wrote: Dear Neil, thank you very much for your detailed answer. Neil Brown wrote: While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong. If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases: a) exactly zero bad blocks, b) exactly one bad block, c) more than one bad block. Of course, it is only possible to recover from b), but one *can* tell whether the situation is a), b) or c) and act accordingly. I was waiting for a response before saying "me too," but that's exactly the case: there is a class of failures, other than power failure or total device failure, which results in just the "one identifiable bad sector" outcome. Given that the data needs to be read to realize that it is bad, why not go the extra inch and fix it properly instead of redoing the P+Q, which just makes the problem invisible rather than fixing it. Obviously this is a subset of all the things which can go wrong, but I suspect it's a sizable subset.
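Thiemo's three-way distinction can be made concrete with the per-byte syndrome arithmetic from hpa's raid6 paper. The sketch below is illustrative Python, not kernel code; it uses the 0x11d polynomial that the Linux raid6 implementation uses, and for simplicity ignores the case where the P or Q block itself is the corrupt one (such cases show up here as a zero syndrome and are lumped in with 'multi').

```python
# Arithmetic in GF(2^8) with polynomial 0x11d, generator g = 2.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]        # wraparound so EXP[a + b] never overflows

def syndromes(data, p, q):
    """P and Q syndromes for one byte position across the data blocks."""
    s_p, s_q = p, q
    for i, d in enumerate(data):
        s_p ^= d
        s_q ^= EXP[LOG[d] + i] if d else 0   # g^i * d
    return s_p, s_q

def locate_single_error(data, p, q):
    """Index of the single bad data byte, None if clean, or 'multi'
    if the syndromes cannot be explained by one bad data block."""
    s_p, s_q = syndromes(data, p, q)
    if s_p == 0 and s_q == 0:
        return None
    if s_p == 0 or s_q == 0:
        return 'multi'     # not a single bad data block (could be P or Q)
    z = (LOG[s_q] - LOG[s_p]) % 255          # Q_syn = g^z * P_syn
    return z if z < len(data) else 'multi'
```

Note that hpa's caution still applies: two simultaneously corrupt blocks can produce syndromes that look exactly like a single error on a third block, so an in-range `z` is evidence, not proof; only some multi-error patterns are caught by the 'multi' verdict.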
Re: HELP! New disks being dropped from RAID 6 array on every reboot
Joshua Johnson wrote: Greetings, long time listener, first time caller. I recently replaced a disk in my existing 8-disk RAID 6 array. Previously, all disks were PATA drives connected to the motherboard IDE and three Promise Ultra 100/133 controllers. I replaced one of the Promise controllers with a VIA 64xx-based controller, which has 2 SATA ports and one PATA port. I connected a new SATA drive to the new card, partitioned the drive and added it to the array. After 5 or 6 hours the resyncing process finished and the array showed up complete. Upon rebooting I discovered that the new drive had not been added to the array when it was assembled on boot. I resynced it and tried again -- still would not persist after a reboot. I moved one of the existing PATA drives to the new controller (so I could have the slot for network), rebooted and rebuilt the array. Now when I reboot BOTH disks are missing from the array (sda and sdb). Upon examining the disks it appears they think they are part of the array, but for some reason they are not being added when the array is being assembled.
For example, this is a disk on the new controller which was not added to the array after rebooting:

# mdadm --examine /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 00.90.03
           UUID : 63ee7d14:a0ac6a6e:aef6fe14:50e047a5
  Creation Time : Thu Sep 21 23:52:19 2006
     Raid Level : raid6
    Device Size : 191157248 (182.30 GiB 195.75 GB)
     Array Size : 1146943488 (1093.81 GiB 1174.47 GB)
   Raid Devices : 8
  Total Devices : 8
Preferred Minor : 0
    Update Time : Fri Nov 23 10:22:57 2007
          State : clean
 Active Devices : 8
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 50df590e - correct
         Events : 0.96419878
     Chunk Size : 256K

      Number   Major   Minor   RaidDevice   State
this     6       8        1        6        active sync   /dev/sda1

   0     0       3        2        0        active sync   /dev/hda2
   1     1      57        2        1        active sync   /dev/hdk2
   2     2      33        2        2        active sync   /dev/hde2
   3     3      34        2        3        active sync   /dev/hdg2
   4     4      22        2        4        active sync   /dev/hdc2
   5     5      56        2        5        active sync   /dev/hdi2
   6     6       8        1        6        active sync   /dev/sda1
   7     7       8       17        7        active sync   /dev/sdb1

Everything there seems to be correct and current up to the last shutdown. But the disk is not being added on boot.
Examining a disk that is currently running in the array shows:

# mdadm --examine /dev/hdc2
/dev/hdc2:
          Magic : a92b4efc
        Version : 00.90.03
           UUID : 63ee7d14:a0ac6a6e:aef6fe14:50e047a5
  Creation Time : Thu Sep 21 23:52:19 2006
     Raid Level : raid6
    Device Size : 191157248 (182.30 GiB 195.75 GB)
     Array Size : 1146943488 (1093.81 GiB 1174.47 GB)
   Raid Devices : 8
  Total Devices : 6
Preferred Minor : 0
    Update Time : Fri Nov 23 10:23:52 2007
          State : clean
 Active Devices : 6
Working Devices : 6
 Failed Devices : 2
  Spare Devices : 0
       Checksum : 50df5934 - correct
         Events : 0.96419880
     Chunk Size : 256K

      Number   Major   Minor   RaidDevice   State
this     4      22        2        4        active sync   /dev/hdc2

   0     0       3        2        0        active sync   /dev/hda2
   1     1      57        2        1        active sync   /dev/hdk2
   2     2      33        2        2        active sync   /dev/hde2
   3     3      34        2        3        active sync   /dev/hdg2
   4     4      22        2        4        active sync   /dev/hdc2
   5     5      56        2        5        active sync   /dev/hdi2
   6     6       0        0        6        faulty removed
   7     7       0        0        7        faulty removed

Here is my /etc/mdadm/mdadm.conf:

DEVICE partitions
PROGRAM /bin/echo
MAILADDR
ARRAY /dev/md0 level=raid6 num-devices=8 UUID=63ee7d14:a0ac6a6e:aef6fe14:50e047a5

Can anyone see anything that is glaringly wrong here? Has anybody experienced similar behavior? I am running Debian using kernel 2.6.23.8. All partitions are set to type 0xFD and it appears the superblocks on the sd* disks were written, so why wouldn't they be added to the array on boot? Any help is greatly appreciated! Does that match what's in the init files used at boot? By any chance does the information there explicitly list partitions by name? If you change to "PARTITIONS" in /etc/mdadm.conf, it won't bite you until you change the detected partitions so they no longer match what was correct at install time. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
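Since assembly at boot hinges on the superblock UUID matching the ARRAY line in mdadm.conf, one mechanical sanity check is to compare the two fields directly. A toy sketch (the --examine excerpt below is abbreviated from the listing above; the parsing is illustrative, not how mdadm itself works):

```python
import re

EXAMINE = """\
/dev/sda1:
          Magic : a92b4efc
        Version : 00.90.03
           UUID : 63ee7d14:a0ac6a6e:aef6fe14:50e047a5
          State : clean
"""

CONF_LINE = ("ARRAY /dev/md0 level=raid6 num-devices=8 "
             "UUID=63ee7d14:a0ac6a6e:aef6fe14:50e047a5")

def examine_field(text, name):
    """Pull one 'Name : value' field out of mdadm --examine output."""
    m = re.search(r'^\s*%s : (.+)$' % re.escape(name), text, re.M)
    return m.group(1).strip() if m else None

sb_uuid = examine_field(EXAMINE, 'UUID')
conf_uuid = re.search(r'UUID=(\S+)', CONF_LINE).group(1)
print(sb_uuid == conf_uuid)   # if False, the boot-time scan will skip the disk
```

In the case above both UUIDs do match, which points the suspicion at the initramfs/init scripts (a stale copy of mdadm.conf or an explicit device list baked in at install time) rather than at the superblocks themselves.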
Re: md RAID 10 on Linux 2.6.20?
[EMAIL PROTECTED] wrote: Hi all, I am running a home-grown Linux 2.6.20.11 SMP 64-bit build, and I am wondering if there is indeed a RAID 10 "personality" defined in md that can be implemented using mdadm. If so, is it available in 2.6.20.11, or is it in a later kernel version? In the past, to create RAID 10, I created RAID 1's and a RAID 0, so an 8 drive RAID 10 would actually consist of 5 md devices (four RAID 1's and one RAID 0). But if I could just use RAID 10 natively, and simply create one RAID 10, that would of course be better both in terms of management and probably performance I would guess. Is this possible? Yes, and you are correct on the performance. Read the man page section on "near" and "far" copies of data carefully, and some back posts to this list. Most of us find that using far copies generates slightly slower write performance and significantly better read performance.
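The near/far distinction from the man page can be sketched with a toy layout model. This is a simplified textbook model of two-copy layouts (what mdadm calls n2 and f2), not a byte-accurate reimplementation of md's arithmetic; the `far_base` parameter merely stands in for the start of each disk's second half:

```python
def copies(chunk, n, layout, far_base=1000):
    """Return [(disk, offset_in_chunks), ...] for the two copies of
    `chunk` on an n-disk array, in a simplified model of md raid10.
    'n2': copies sit side by side on adjacent disks at nearby offsets.
    'f2': one full raid0-like rotation in the front half of the disks,
    the mirror rotated by one disk in the back half."""
    if layout == 'n2':
        slots = [2 * chunk, 2 * chunk + 1]           # interleave copies
        return [(s % n, s // n) for s in slots]
    if layout == 'f2':
        return [(chunk % n, chunk // n),
                ((chunk + 1) % n, far_base + chunk // n)]
    raise ValueError(layout)

# Sequential chunks 0..3 on 4 disks: with 'f2' the primary copies form a
# plain stripe across the fast outer part of every disk, which is the
# usual explanation for the better sequential read performance.
for c in range(4):
    print(c, copies(c, 4, 'f2'))
```

The write-side cost of far copies also falls out of the model: every write must touch both a front-half and a back-half offset on two disks, so the heads seek across half the platter, whereas near copies keep both writes in the same region.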
Re: raid5 failure + libata irq: nobody cared
Vincze, Tamas wrote: Hi, Last night a drive failed in my RAID5 array and it was kicked out of the array, continuing with 3 drives as expected. However a few minutes later this was logged: irq 18: nobody cared (try booting with the "irqpoll" option) Call Trace: {__report_bad_irq+48} {note_interrupt+433} {__do_IRQ+191} IRQ 18 belongs to the SATA controller where all 4 drives are connected. The troubling thing is that the controller was still in use, and there should have been handling for the "nobody cared" interrupt. It sounds as if the failed drive didn't get marked to be ignored, or logged and then ignored. I'd love to know what generated the IRQ in the first place. Nothing more was logged, probably because the interrupt got disabled, making it impossible to talk to the drives anymore. It's bad because I ended up with a dirty degraded array for the second time this year. How would a RAID-6 handle a crash when a drive is missing? Would that also lead to possible silent corruption? Or is a battery-backed hardware controller the only way to avoid silent corruption? Kernel is 2.6.16-1.2133_FC5 There have been a lot of improvements in raid since then.
Here's the full log:

Nov 16 00:43:10 p4 kernel: ata1: command 0xea timeout, stat 0xd0 host_stat 0x0
Nov 16 00:43:10 p4 kernel: ata1: status=0xd0 { Busy }
Nov 16 00:43:10 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
Nov 16 00:43:10 p4 last message repeated 2 times
Nov 16 01:30:06 p4 kernel: ata1: command 0xea timeout, stat 0xd0 host_stat 0x0
Nov 16 01:30:06 p4 kernel: ata1: status=0xd0 { Busy }
Nov 16 01:30:06 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
Nov 16 01:30:06 p4 last message repeated 2 times
Nov 16 01:34:13 p4 kernel: ata1: command 0xea timeout, stat 0xd0 host_stat 0x0
Nov 16 01:34:13 p4 kernel: ata1: status=0xd0 { Busy }
Nov 16 01:34:13 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
Nov 16 01:34:13 p4 last message repeated 2 times
Nov 16 01:35:13 p4 kernel: ata1: command 0x35 timeout, stat 0xd0 host_stat 0x61
Nov 16 01:35:13 p4 kernel: ata1: status=0xd0 { Busy }
Nov 16 01:35:13 p4 kernel: sd 0:0:0:0: SCSI error: return code = 0x802
Nov 16 01:35:13 p4 kernel: sda: Current: sense key: Aborted Command
Nov 16 01:35:13 p4 kernel: Additional sense: Scsi parity error
Nov 16 01:35:13 p4 kernel: end_request: I/O error, dev sda, sector 781015848
Nov 16 01:35:43 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
Nov 16 01:35:44 p4 last message repeated 2 times
Nov 16 01:35:44 p4 kernel: ata1: command 0xea timeout, stat 0xd0 host_stat 0x0
Nov 16 01:35:44 p4 kernel: ata1: status=0xd0 { Busy }
Nov 16 01:35:44 p4 kernel: raid5: Disk failure on sda3, disabling device.
Operation continuing on 3 devices
Nov 16 01:35:44 p4 kernel: ATA: abnormal status 0xD0 on port 0xE407
Nov 16 01:35:44 p4 kernel: RAID5 conf printout:
Nov 16 01:35:44 p4 kernel:  --- rd:4 wd:3 fd:1
Nov 16 01:35:44 p4 kernel:  disk 0, o:0, dev:sda3
Nov 16 01:35:44 p4 kernel:  disk 1, o:1, dev:sdc3
Nov 16 01:35:44 p4 kernel:  disk 2, o:1, dev:sdb3
Nov 16 01:35:44 p4 kernel:  disk 3, o:1, dev:sdd3
Nov 16 01:35:44 p4 kernel: RAID5 conf printout:
Nov 16 01:35:44 p4 kernel:  --- rd:4 wd:3 fd:1
Nov 16 01:35:44 p4 kernel:  disk 1, o:1, dev:sdc3
Nov 16 01:35:44 p4 kernel:  disk 2, o:1, dev:sdb3
Nov 16 01:35:44 p4 kernel:  disk 3, o:1, dev:sdd3
Nov 16 01:37:36 p4 kernel: irq 18: nobody cared (try booting with the "irqpoll" option)
Nov 16 01:37:36 p4 kernel: Call Trace: {__report_bad_irq+48}
Nov 16 01:37:36 p4 kernel: {note_interrupt+433} {__do_IRQ+191}

-Tamas
Re: PROBLEM: raid5 hangs
Justin Piszcz wrote: This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the RAID5 bio* patches are applied. Note below he's running 2.6.22.3, which doesn't have the bug unless -STABLE added it. So it should not really be in 2.6.22.anything. I assume you're talking about the endless write or bio issue? Justin. On Wed, 14 Nov 2007, Peter Magnusson wrote: Hey. [1.] One line summary of the problem: raid5 hangs and uses 100% cpu [2.] Full description of the problem/report: I had used 2.6.18 for 284 days or so until my power supply died, with no problems whatsoever during that time. After that forced reboot I made these changes: put in 2 GB more memory so I have 3 GB instead of 1 GB; two disks in the raid5 got bad blocks so I didn't trust them anymore and bought new disks (I managed to save the raid5). I have 6x300 GB in a raid5. Two of them are now 320 GB, so I created a small raid1 also. That raid5 is encrypted with aes-cbc-plain. The raid1 is encrypted with aes-cbc-essiv:sha256. I compiled linux-2.6.22.3 and started to use that. I used the same .config as in default FC5, I think I just selected P4 cpu and preemptive kernel type. After 11 or 12 days the computer froze; I wasn't home when it happened and couldn't fix it for about 3 days. All I could do was reboot it, as it wasn't possible to log in remotely or on the console. It did respond to ping, however. After reboot it rebuilt the raid5. Then it happened again after approximately the same time, 11 or 12 days. I noticed that the process md1_raid5 used 100% cpu all the time. After reboot it rebuilt the raid5. I compiled linux-2.6.23. And then... it happened again... after about the same time as before. md1_raid5 used 100% cpu. I also noticed that I wasn't able to save anything in my homedir; it froze during save. I could read from it, however. My homedir isn't on the raid5, but it is encrypted. It's not on any disk that has to do with raid. This problem didn't happen when I used 2.6.18. Currently I use 2.6.18, as I kinda need the computer stable.
After reboot it rebuilt the raid5.
Re: Proposal: non-striping RAID4
James Lee wrote: From a quick search through this mailing list, it looks like I can answer my own question regarding RAID1 --> RAID5 conversion. Instead of creating a RAID1 array for the partitions on the two biggest drives, it should just create a 2-drive RAID5 (which is identical, but can be expanded as with any other RAID5 array). So it looks like this should work, I guess. I believe what you want to create might be a three-drive raid-5 with one failed drive. That way you can just add a drive when you want:

mdadm -C -c32 -l5 -n3 -amd /dev/md7 /dev/loop[12] missing

Then you can add another drive:

mdadm --add /dev/md7 /dev/loop3

The output is at the end of this message. But in general I think it would be really great to be able to have a format which would do raid-5 or raid-6 over all the available parts of multiple drives, and since there's some similar logic for raid-10 over a selection of drives it is clearly possible. But in terms of the benefit to be gained, unless it falls out of the code and someone feels the desire to do it, I can't see much joy in ever having such a thing. The feature I would really like to have is raid5e, a distributed spare so head motion is spread over all drives. I don't have time to look at that one either, but it really helps performance under load with small arrays.
Re: Building a new raid6 with bitmap does not clear bits during resync
Neil Brown wrote: On Monday November 12, [EMAIL PROTECTED] wrote: Neil Brown wrote: However there is value in regularly updating the bitmap, so add code to periodically pause while all pending sync requests complete, then update the bitmap. Doing this only every few seconds (the same as the bitmap update time) does not noticeably affect resync performance. I wonder if a minimum time and minimum number of stripes would be better. If a resync is going slowly because it's going over a slow link to iSCSI, nbd, or a box of cheap drives fed off a single USB port, just writing the updated bitmap may represent as much data as has been resynced in the time slice. Not a suggestion, but a request for your thoughts on that. Thanks for your thoughts. Choosing how often to update the bitmap during a sync is certainly not trivial. In different situations, different requirements might rule. I chose to base it on time, and particularly on the time we already have for "how soon to write back clean bits to the bitmap", because it is fairly easy for users to understand the implications (if I set the time to 30 seconds, then I might have to repeat 30 seconds of resync) and it is already configurable (via the "--delay" option to --create --bitmap). Sounds right, that part of it is pretty user friendly. Presumably if someone has a very slow system and wanted to use bitmaps, they would set --delay relatively large to reduce the cost and still provide significant benefits. This would affect both normal clean-bit writeback and during-resync clean-bit writeback. Hope that clarifies my approach. Easy to implement and understand is always a strong point, and a user can make an informed decision. Thanks for the discussion.
Re: Building a new raid6 with bitmap does not clear bits during resync
Neil Brown wrote: On Thursday November 8, [EMAIL PROTECTED] wrote: Hi, I have created a new raid6:

md0 : active raid6 sdb1[0] sdl1[5] sdj1[4] sdh1[3] sdf1[2] sdd1[1]
      6834868224 blocks level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      [====>................]  resync = 21.5% (368216964/1708717056) finish=448.5min speed=49808K/sec
      bitmap: 204/204 pages [816KB], 4096KB chunk

The raid is totally idle, not mounted and nothing. So why does the "bitmap: 204/204" not sink? I would expect it to clear bits as it resyncs, so it should count slowly down to 0. As a side effect of the bitmap being all dirty, the resync will restart from the beginning when the system is hard reset. As you can imagine, that is pretty annoying. On the other hand, on a clean shutdown it seems the bitmap gets updated before stopping the array:

md3 : active raid6 sdc1[0] sdm1[5] sdk1[4] sdi1[3] sdg1[2] sde1[1]
      6834868224 blocks level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      [=======>.............]  resync = 38.4% (656155264/1708717056) finish=17846.4min speed=982K/sec
      bitmap: 187/204 pages [748KB], 4096KB chunk

Consequently the rebuild did restart and is already further along. Thanks for the report. Any ideas why that is so? Yes. The following patch should explain (a bit tersely) why this was so, and should also fix it so it will no longer be so. Test reports always welcome. NeilBrown Status: ok Update md bitmap during resync. Currently an md array with a write-intent bitmap does not update that bitmap to reflect successful partial resync. Rather, the entire bitmap is updated when the resync completes. This is because there is no guarantee that resync requests will complete in order, and tracking each request individually is unnecessarily burdensome. However there is value in regularly updating the bitmap, so add code to periodically pause while all pending sync requests complete, then update the bitmap. Doing this only every few seconds (the same as the bitmap update time) does not noticeably affect resync performance.
I wonder if a minimum time and minimum number of stripes would be better. If a resync is going slowly because it's going over a slow link to iSCSI, nbd, or a box of cheap drives fed off a single USB port, just writing the updated bitmap may represent as much data as has been resynced in the time slice. Not a suggestion, but a request for your thoughts on that.
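The slow-link concern above is easy to quantify with back-of-envelope arithmetic. A small sketch, using the 816KB bitmap from the report and hypothetical throughput numbers (illustrative only: real costs depend on seek patterns, not just bytes moved):

```python
def bitmap_tradeoff(resync_rate_mb_s, bitmap_kb, delay_s):
    """Rough numbers for updating the write-intent bitmap every
    `delay_s` seconds during a resync: what fraction of the I/O goes
    to the bitmap, and how much resync is repeated after a crash."""
    synced_mb = resync_rate_mb_s * delay_s          # data resynced per interval
    bitmap_mb = bitmap_kb / 1024.0                  # data written per bitmap update
    overhead = bitmap_mb / (synced_mb + bitmap_mb)  # fraction spent on bitmap writes
    replay_mb = synced_mb                           # worst case redone after a crash
    return overhead, replay_mb

# ~50 MB/s local resync vs ~1 MB/s over a slow nbd link, 816 KB bitmap, 5 s delay
print(bitmap_tradeoff(50, 816, 5))
print(bitmap_tradeoff(1, 816, 5))
```

At 50 MB/s the bitmap update is noise (well under 1% of the traffic), but at 1 MB/s it approaches a seventh of all bytes written, which is exactly the regime where basing the interval purely on time starts to hurt and a minimum-progress threshold would help.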
Re: Proposal: non-striping RAID4
cost of maintaining it is justified by the benefit, but not my decision. If you were to set up such a thing using FUSE, keeping it out of the kernel but still providing the functionality, it might be worth doing. On the other hand, setting up the partitions and creating the arrays could probably be done by a perl script which would take only a few hours to write. PS: In case it wasn't clear, the attached code is simply the code the author has released under GPL - it's intended just for reference, not as proposed code for review. Much as I generally like adding functionality, I *really* can't see much in this idea. It seems to me to be in the "clever but not useful" category.
Re: Raid5 assemble after dual sata port failure
David Greaves wrote: Chris Eddington wrote: Yes, there is some kind of media error message in dmesg, below. It is not random, it happens at exactly the same moments in each xfs_repair -n run.

Nov 11 09:48:25 altair kernel: [37043.300691] res 51/40:00:01:00:00/00:00:00:00:00/e1 Emask 0x9 (media error)
Nov 11 09:48:25 altair kernel: [37043.304326] ata4.00: ata_hpa_resize 1: sectors = 976773168, hpa_sectors = 976773168
Nov 11 09:48:25 altair kernel: [37043.307672] ata4.00: ata_hpa_resize 1: sectors = 976773168, hpa_sectors = 976773168

I'm not sure what an ata_hpa_resize error is... HPA = Host Protected Area. By any chance is this disk partitioned such that the partition size includes the HPA? If it does, this sounds at least familiar; this mailing list post may get you started: http://osdir.com/ml/linux.ataraid/2005-09/msg2.html In any case, run "fdisk -l" and look at the claimed total disk size and the end point of the last partition. The HPA is not included in the "disk size", so nothing should be trying to use it. It probably explains the problems you've been having with the raid not 'just recovering', though. I saw this: http://www.linuxquestions.org/questions/linux-kernel-70/sata-issues-568894/ May be the same thing. Let us know what fdisk reports. What does smartctl say about your drive? IMO the spare drive is no longer useful for data recovery - you may want to use ddrescue to try and copy this drive to the spare drive. David PS Don't get the ddrescue parameters the wrong way round if you go that route...
Re: Software raid - controller options
Lyle Schlueter wrote: Hello, I just started looking into software raid with linux a few weeks ago. I am outgrowing the commercial NAS product that I bought a while back. I've been learning as much as I can, subscribing to this mailing list, reading man pages, and experimenting with loopback devices, setting up and expanding test arrays. I have a few questions now that I'm sure someone here will be able to enlighten me about. First, I want to run a 12-drive raid 6; honestly, would I be better off going with true hardware raid like the Areca ARC-1231ML vs software raid? I would prefer software raid just for the sheer cost savings. But what kind of processing power would it take to match or exceed a mid- to high-level hardware controller? I haven't seen much, if any, discussion of this, but how many drives are people putting into software arrays? And how are you going about it? Motherboards seem to max out around 6-8 SATA ports. Do you just add SATA controllers? Looking around on newegg (and some googling), 2-port SATA controllers are pretty easy to find, but once you get to 4 ports the cards all seem to include some sort of built-in *raid* functionality. Are there any 4+ port PCI-e SATA controller cards? Depending on your needs for transfer rate vs. capacity, newegg has at least one external enclosure which holds (from memory) 8 drives and brings the data down on a single SATA connector. If you need lots of data online but not a high transfer rate, this might be useful. I was offered the enclosure as a "deal" with a DVD burner, don't know what they were thinking there. Ordered the DVD Tues, arrived Wed, but I don't need the hot swap case. Are there any specific chipsets/brands of motherboards or controller cards that you software raid veterans prefer? Thank you for your time and any info you are able to give me!
Lyle
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Jeff Lessem wrote: Dan Williams wrote: > The following patch, also attached, cleans up cases where the code looks > at sh->ops.pending when it should be looking at the consistent > stack-based snapshot of the operations flags. I tried this patch (against a stock 2.6.23), and it did not work for me. Not only did I/O to the affected RAID5 & XFS partition stop, but also I/O to all other disks. I was not able to capture any debugging information, but I should be able to do that tomorrow when I can hook a serial console to the machine. That can't be good! This is worrisome because Joel is giddy with joy because it fixes his iSCSI problems. I was going to try it with nbd, but perhaps I'll wait a week or so and see if others have more information. Applying patches before a holiday weekend is a good way to avoid time off. :-( I'm not sure if my problem is identical to these others, as mine only seems to manifest with RAID5+XFS. The RAID rebuilds with no problem, and I've not had any problems with RAID5+ext3. Hopefully it's not the raid which is the issue.
Re: Was: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
Goswin von Brederlow wrote: Konstantin Sharlaimov <[EMAIL PROTECTED]> writes: On Wed, 2007-11-07 at 10:15 +0100, Goswin von Brederlow wrote: I wonder if there shouldn't be a way to turn this off (or if there already is one). Or more generally, an option to say what is "near". Specifically, I would like to teach the raid1 layer that I have 2 external raid boxes with a 16k chunk size. So read/write within a 16k chunk will be the same disk, but the next 16k are a different disk and "near" doesn't apply anymore. Currently there is no way to turn this feature off (this is only a "request for comments" patch), but I'm planning to make it configurable via sysfs and module parameters. Thanks for the suggestion for the "near" definition. What do you think about adding a "chunk_size" parameter (with the default value of 1 chunk = 1 sector)? Setting it to 32 will make all reads within a 16k chunk be considered "near" (with zero distance), so they will go to the same disk. Max distance will also be configurable (after this distance the "read" operation is considered "far" and will go to a randomly chosen disk). Regards, Konstantin Maybe you need more parameters:

chunk_size - size of a contiguous chunk on the (multi disk) device
stripe_size - size of a stripe of chunks spanning all disks
rotation_size - size of multiple stripes before parity rotates to a new disk (sign gives direction of rotation)
near_size - size that is considered to be near on a disk

I would give all sizes in blocks of 512 bytes or bytes. Why? Would there ever be a time when there would be a significant (or any) benefit from a size other than a multiple of chunk size? If you give the rest of the sizes in multiples of chunk size it invites less human math. Default would be: Default would be "zero" to indicate that the raid system should figure out what to use, allowing the value of "one" to actually mean what it says.
It also invites use of zero for the rest of the calculated sizes, indicating the raid subsystem should select values. With coming SSD hardware you may actually want one to mean one.

chunk_size = 1 (block)
stripe_size = 1 (block)
rotation_size = 0 (no rotation)
near_size = 256

That would reflect that you have all chunks contiguous on a normal disk and read/writes are done in 128K chunks. For raid 1 on raid 0:

chunk_size = raid chunk size
stripe_size = num disks * chunk_size
rotation_size = 0
near_size = 256

For raid 1 on raid 5:

chunk_size = raid chunk size
stripe_size = (num disks - 1) * chunk_size
rotation_size = (num disks - 1) * chunk_size (?)
near_size = 256

and so on. MfG Goswin
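The role the proposed near_size knob would play in read balancing can be sketched in a few lines. This is an illustration of the rule under discussion, not the dm-raid1 or md raid1 code; the function name and defaults are hypothetical:

```python
def pick_mirror(sector, last_pos, n_mirrors, near=256):
    """Choose which mirror serves a read. If any mirror's last head
    position is within `near` sectors of the request, treat the
    distance as zero and stay on that disk (preserving sequential
    streams); otherwise fall back to the mirror whose head is closest.
    `near` plays the role of the proposed near_size parameter."""
    best, best_dist = 0, None
    for m in range(n_mirrors):
        dist = abs(sector - last_pos[m])
        if dist < near:
            return m          # "near" read: no seek worth counting
        if best_dist is None or dist < best_dist:
            best, best_dist = m, dist
    return best

last = [1000, 500000]          # last serviced sector per mirror
print(pick_mirror(1100, last, 2))    # sequential-ish read stays on mirror 0
print(pick_mirror(300000, last, 2))  # otherwise the closest head wins
```

Goswin's external-raid-box example is then just a matter of what "near" means: with a 16k chunk on the backing boxes, a distance that crosses a chunk boundary lands on a different physical spindle anyway, so counting it as near buys nothing, which is why he wants the threshold configurable.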
Re: Was: [RFC PATCH 2.6.23.1] md: add dm-raid1 read balancing
Rik van Riel wrote: On Thu, 08 Nov 2007 17:28:37 +0100 Goswin von Brederlow <[EMAIL PROTECTED]> wrote: Maybe you need more parameter: Generally a bad idea, unless you can come up with sane defaults (which do not need tuning 99% of the time) or you can derive these parameters automatically from the RAID configuration (unlikely with RAID 1?). Before turning Konstantin's nice and simple code into something complex, it would be good to see benchmark results show that the complexity is actually needed. I was about to post a question about the benefit of all this logic, I'll trim before posting...
Re: switching root fs '/' to boot from RAID1 with grub
H. Peter Anvin wrote: Bill Davidsen wrote: Depends how "bad" the drive is. Just to align the thread on this - If the boot sector is bad, the bios on newer boxes will skip to the next one. But if it is "good", and you boot into garbage - could be Windows... does it crash? Right, if the drive is dead almost every BIOS will fail over; if the read gets a CRC or similar, most recent BIOS will fail over; but if an error-free read returns bad data, how can the BIOS know? Unfortunately the Linux boot format doesn't contain any sort of integrity check. Otherwise the bootloader could catch this kind of error and throw a failure, letting the next disk boot (or another kernel). I don't understand your point; unless there's a Linux bootloader in the BIOS, it will boot whatever 512 bytes are in sector 0. So if that's crap, it doesn't matter what it would do if it was valid; some other bytes came off the drive instead. Maybe Windows, since there seems to be an option in Windows to check the boot sector on boot and rewrite it if it isn't the WinXP one. One of my offspring has that problem: dual-boot system, and every time he boots Windows he has to boot from rescue and reinstall grub. I think he could install grub in the partition, make that the active partition, and the boot would work, but he tried and only partitions of type FAT or VFAT seem to boot, active or not.
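The "whatever 512 bytes are in sector 0" point is easy to demonstrate: the only thing a classic BIOS checks before jumping into a boot sector is the two-byte 0x55 0xAA signature at the end. A small sketch (the function name is made up for illustration):

```python
def looks_bootable(sector0: bytes) -> bool:
    """The entire integrity 'check' a legacy BIOS applies to an MBR:
    512 bytes ending in 0x55 0xAA. Anything passing this gets executed,
    valid bootloader or not, which is why an error-free read of garbage
    still 'boots' and why no failover to the next disk happens."""
    return (len(sector0) == 512
            and sector0[510] == 0x55
            and sector0[511] == 0xAA)

# 510 zero bytes of "garbage" with the right signature still qualify
junk = bytes(510) + b'\x55\xaa'
print(looks_bootable(junk))   # the BIOS would happily jump into this
```

Any real integrity check (a checksum over the loader, as the thread wishes the Linux boot format had) would have to live in the boot code itself or in the next stage, because the BIOS contract stops at those two bytes.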
Re: switching root fs '/' to boot from RAID1 with grub
berk walker wrote: H. Peter Anvin wrote: Doug Ledford wrote: device /dev/sda (hd0) root (hd0,0) install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst device /dev/hdc (hd0) root (hd0,0) install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst That will install grub on the master boot record of hdc and sda, and in both cases grub will look to whatever drive it is running on for the files to boot instead of going to a specific drive. No, it won't... it'll look for the first drive in the system (BIOS drive 80h). This means that if the BIOS can see the bad drive, but it doesn't work, you're still screwed. -hpa Depends how "bad" the drive is. Just to align the thread on this - If the boot sector is bad - the bios on newer boxes will skip to the next one. But if it is "good", and you boot into garbage - - could be Windows.. does it crash? Right, if the drive is dead almost every BIOS will fail over, if the read gets a CRC or similar most recent BIOS will fail over, but if an error-free read returns bad data, how can the BIOS know? -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Superblocks
Greg Cormier wrote: Any reason 0.9 is the default? Should I be worried about using 1.0 superblocks? And can I "upgrade" my array from 0.9 to 1.0 superblocks? Do understand that Neil may have other reasons... but mainly the 0.9 format is the default because it is most widely supported and allows you to use new mdadm versions on old distributions (I still have one FC1 machine!). As for changing metadata on an existing array, I really can't offer any help. Thanks, Greg On 11/1/07, Neil Brown <[EMAIL PROTECTED]> wrote: On Tuesday October 30, [EMAIL PROTECTED] wrote: Which is the default type of superblock? 0.90 or 1.0? The default default is 0.90. However a local default can be set in mdadm.conf with e.g. CREATE metadata=1.0 NeilBrown -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
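For reference, the override Neil mentions is a one-line mdadm.conf entry; a minimal sketch (the 1.0 choice is illustrative, pick whichever format suits your boot loader):

```
# /etc/mdadm.conf
# Make newly created arrays default to version 1.0 superblocks
# instead of the built-in 0.90 default.
CREATE metadata=1.0
```

This only affects arrays created after the line is added; it does not touch existing arrays.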
Re: does mdadm try to use fastest HDD ?
Janek Kozicki wrote: Hello, My three HDDs have following speeds: hda - speed 70 MB/sec hdc - speed 27 MB/sec sda - speed 60 MB/sec They create a raid1 /dev/md0 and raid5 /dev/md1 arrays. I wanted to ask if mdadm is trying to pick the fastest HDD during operation? Maybe I can "tell" which HDD is preferred? If you are doing raid-1 between hdc and some faster drive, you could try using write-mostly and see how that works for you. For raid-5, it's faster to read the data off the slow drive than reconstruct it with multiple reads to multiple other faster drives. This came to my mind when I saw this: # mdadm --query --detail /dev/md1 | grep Prefer Preferred Minor : 1 And also in the manual: -W, --write-mostly [...] "can be useful if mirroring over a slow link." many thanks for all your help! I have two thoughts on this: 1 - if performance is critical, replace the slow drive 2 - for most things you do, I would expect seek to be more important than transfer rate -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
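The write-mostly suggestion is set per device when the mirror is created; an illustrative sketch using the device names from Janek's mail (raid1 only — the flag has no effect on raid5; not meant to be run verbatim):

```sh
# Mirror the fast hda against the slow hdc, reading hdc only as a last
# resort. --write-mostly applies to the devices listed after it.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/hda1 --write-mostly /dev/hdc1
```

With a write-intent bitmap present, --write-behind can additionally let writes to the slow member lag, which is the "slow link" case the manpage mentions.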
Re: stride / stripe alignment on LVM ?
Neil Brown wrote: On Thursday November 1, [EMAIL PROTECTED] wrote: Hello, I have raid5 /dev/md1, --chunk=128 --metadata=1.1. On it I have created LVM volume called 'raid5', and finally a logical volume 'backup'. Then I formatted it with command: mkfs.ext3 -b 4096 -E stride=32 -E resize=550292480 /dev/raid5/backup And because LVM is putting its own metadata on /dev/md1, the ext3 partition is shifted by some (unknown for me) amount of bytes from the beginning of /dev/md1. I was wondering, how big is the shift, and would it hurt the performance/safety if the `ext3 stride=32` didn't align perfectly with the physical stripes on HDD? It is probably better to ask this question on an ext3 list as people there might know exactly what 'stride' does. I *think* it causes the inode tables to be offset in different block-groups so that they are not all on the same drive. If that is the case, then an offset caused by LVM isn't going to make any difference at all. Actually, I think that all of the performance evil Doug was mentioning will apply to LVM as well. So if things are poorly aligned, they will be poorly handled: a stripe-sized write will not go in a stripe, but will overlap chunks and cause all the data from all chunks to be read back for a new raid-5 calculation. So I would expect this to make a very large performance difference; even if it worked, it would do so slowly. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
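For reference, the stride value in Janek's mkfs.ext3 command is just the RAID chunk size divided by the filesystem block size, and the full data stripe multiplies by the number of data disks; a quick sanity check (the raid5 is assumed to be the 3-drive array from his earlier mail):

```python
def ext3_stride(chunk_kib: int, block_kib: int) -> int:
    """Filesystem blocks per RAID chunk (what -E stride= expects)."""
    return chunk_kib // block_kib

def stripe_width(stride: int, ndisks: int, parity_disks: int = 1) -> int:
    """Blocks per full stripe of data; raid5 loses one disk's worth to parity."""
    return stride * (ndisks - parity_disks)

s = ext3_stride(128, 4)        # --chunk=128 with -b 4096 (4 KiB blocks)
print(s, stripe_width(s, 3))   # 32 64
```

The 32 matches the stride=32 in his command; 64 blocks (256 KiB) is the full-stripe write size that misalignment would turn into read-modify-write cycles.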
Re: Implementing low level timeouts within MD
Alberto Alonso wrote: On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote: Not in the older kernel versions you were running, no. These "old versions" (especially the RHEL) are supposed to be the official versions supported by Redhat and the hardware vendors, as they were very specific as to what versions of Linux were supported. So the vendors of the failing drives claimed that these kernels were supported? That's great, most vendors don't even consider Linux supported. What response did you get when you reported the problem to Redhat on your RHEL support contract? Did they agree that this hardware, and its use for software raid, was supported and intended? Of all people, I would think you would appreciate that. Sorry if I sound frustrated and upset, but it is clearly a result of what "supported and tested" really means in this case. I don't want to go into a discussion of commercial distros, which are "supported", as this is neither the time nor the place, but I don't want to open the door to the excuse of "it's an old kernel"; it wasn't when it got installed. The problem is in the time travel module. It didn't properly cope with future hardware, and since you have very long uptimes, I'm reasonably sure you haven't updated the kernel to get fixes installed. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Superblocks
Neil Brown wrote: On Tuesday October 30, [EMAIL PROTECTED] wrote: Which is the default type of superblock? 0.90 or 1.0? The default default is 0.90. However a local default can be set in mdadm.conf with e.g. CREATE metadata=1.0 If you change to 1.start, 1.end, 1.4k names for clarity, they need to be accepted here, as well. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Time to deprecate old RAID formats?
Neil Brown wrote: On Friday October 26, [EMAIL PROTECTED] wrote: Perhaps you could have called them 1.start, 1.end, and 1.4k in the beginning? Isn't hindsight wonderful? Those names seem good to me. I wonder if it is safe to generate them in "-Eb" output If you agree that they are better, using them in the obvious places would be better now than later. Are you going to put them in the metadata options as well? Let me know; reviewing the documentation is on my list for next week, and I could include some text. Maybe the key confusion here is between "version" numbers and "revision" numbers. When you have multiple versions, there is no implicit assumption that one is better than another. "Here is my version of what happened, now let's hear yours". When you have multiple revisions, you do assume ongoing improvement. v1.0, v1.1 and v1.2 are different versions of the v1 superblock, which itself is a revision of the v0... Like kernel releases, people assume that the first number means *big* changes, the second incremental change. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Requesting migrate device options for raid5/6
Goswin von Brederlow wrote: Hi, I would welcome if someone could work on a new feature for raid5/6 that would allow replacing a disk in a raid5/6 with a new one without having to degrade the array. Consider the following situation: raid5 md0 : sda sdb sdc Now sda gives a "SMART - failure imminent" warning and you want to replace it with sdd. % mdadm --fail /dev/md0 /dev/sda % mdadm --remove /dev/md0 /dev/sda % mdadm --add /dev/md0 /dev/sdd Further consider that drive sdb will give an I/O error during resync of the array or fail completely. The array is in degraded mode so you experience data loss. That's a two drive failure, so you will lose data. But that is completely avoidable and some hardware raids support disk migration too. Loosely speaking the kernel should do the following: No, it's not "completely avoidable" because you have described sda as ready to fail and sdb as "will give an I/O error", so if both happen at once you will lose data because you have no valid copy. That said, some of what you describe below is possible to *reduce* the probability of failure. But if sdb is going to have i/o errors, you really need to replace two drives :-( See below for some thoughts. raid5 md0 : sda sdb sdc -> create internal raid1 or dm-mirror raid1 mdT : sda raid5 md0 : mdT sdb sdc -> hot add sdd to mdT raid1 mdT : sda sdd raid5 md0 : mdT sdb sdc -> resync and then drop sda raid1 mdT : sdd raid5 md0 : mdT sdb sdc -> remove internal mirror raid5 md0 : sdd sdb sdc Thoughts? If there were a "migrate" option, it might work something like this: Given a migrate from sda to sdd, as you noted, a raid1 between sda and sdd needs to be created, and obviously all chunks of sdd need to be marked as needing rebuild, but in addition sda needs to be made read-only, to minimize the i/o and to prevent any errors which might come from a failed write, like failed sector relocates, etc. Also, if valid data for a chunk is on sdd, no read would be done to sda. 
I think there's relevant code in the "write-mostly" bits to keep i/o to sda to a minimum: no writes, and only mandatory reads when no valid chunk is on sdd yet. This is similar to recovery to a spare, save that most data will be valid on the failing drive and doesn't need to be recreated; only unreadable data must be done the slow way. Care is needed for sda as well, so that if sdd fails during migrate, a last chance attempt to bring sda back to useful content can be made; I'm paranoid that way. Assuming the migrate works correctly, sda is removed from the array, and the superblock should be marked to reflect that. Now sdd is a part of the array, and assemble, at least using UUID, should work. I personally think that a migrate capability would be vastly useful, both for handling failing drives and just moving data to a better place. As you point out, the user commands are not *quite* as robust as an internal implementation could be, and are complex enough to invite user error. I certainly always write down steps before doing a migrate, and if possible do it with the system booted from rescue media. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Implementing low level timeouts within MD
Alberto Alonso wrote: On Mon, 2007-10-29 at 13:22 -0400, Doug Ledford wrote: What kernels were these under? Yes, these 3 were all SATA. The kernels (in the same order as above) are: * 2.4.21-4.ELsmp #1 (Basically RHEL v3) * 2.6.18-4-686 #1 SMP on a Fedora Core release 2 * 2.6.17.13 (compiled from vanilla sources) *Old* kernels. If you are going to build your own kernel, get a new one! The RocketRAID was configured for all drives as legacy/normal and software RAID5 across all drives. I wasn't using hardware raid on the last described system when it crashed. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Implementing low level timeouts within MD
Alberto Alonso wrote: On Tue, 2007-10-30 at 13:39 -0400, Doug Ledford wrote: Really, you've only been bitten by three so far. Serverworks PATA (which I tend to agree with the other person, I would probably chalk 3 types of bugs is too many, it basically affected all my customers with multi-terabyte arrays. Heck, we can also oversimplify things and say that it is really just one type and define everything as kernel type problems (or as some other kernel used to say... general protection error). I am sorry for not having hundreds of RAID servers from which to draw statistical analysis. As I have clearly stated in the past I am trying to come up with a list of known combinations that work. I think my data points are worth something to some people, especially those considering SATA drives and software RAID for their file servers. If you don't consider them important for you that's fine, but please don't belittle them just because they don't match your needs. this up to Serverworks, not PATA), USB storage, and SATA (the SATA stack is arranged similar to the SCSI stack with a core library that all the drivers use, and then hardware dependent driver modules...I suspect that since you got bit on three different hardware versions that you were in fact hitting a core library bug, but that's just a suspicion and I could well be wrong). What you haven't tried is any of the SCSI/SAS/FC stuff, and generally that's what I've always used and had good things to say about. I've only used SATA for my home systems or workstations, not any production servers. The USB array was never meant to be a full production system, just to buy some time until the budget was allocated to buy a real array. Having said that, the raid code is written to withstand the USB disks getting disconnected as long as the driver reports it properly. Since it doesn't, I consider it another case that shows when not to use software RAID thinking that it will work. 
As for SCSI I think it is a greatly proved and reliable technology; I've dealt with it extensively and have always had great results. I now deal with it mostly on non Linux based systems. But I don't think it is affordable to most SMBs that need multi-terabyte arrays. Actually, SCSI can fail as well. Until recently I was running servers with multi-TB arrays, and regularly, several times a year, a drive would fail and glitch the SCSI bus such that the next i/o to another drive would fail. And I've had SATA drives fail cleanly on small machines, so neither is an "always" config. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Raid-10 mount at startup always has problem
Daniel L. Miller wrote: Doug Ledford wrote: Nah. Even if we had concluded that udev was to blame here, I'm not entirely certain that we hadn't left Daniel with the impression that we suspected it versus blamed it, so reiterating it doesn't hurt. And I'm sure no one has given him a fix for the problem (although Neil did request a change that will give debug output, but not solve the problem), so not dropping it entirely would seem appropriate as well. I've opened a bug report on Ubuntu's Launchpad.net. Scott James Remnant asked me to cc him on Neil's incremental reference - we'll see what happens from here. Thanks for the help guys. At the moment, I've changed my mdadm.conf to explicitly list the drives, instead of the auto=partition parameter. We'll see what happens on the next reboot. I don't know if it means anything, but I'm using a self-compiled 2.6.22 kernel - with initrd. At least I THINK I'm using initrd - I have an image, but I don't see an initrd line in my grub config. Hmm. I'm going to add a stanza that includes the initrd and see what happens also. What did that do? -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Raid-10 mount at startup always has problem
Luca Berra wrote: On Sun, Oct 28, 2007 at 08:21:34PM -0400, Bill Davidsen wrote: Because you didn't stripe align the partition, your bad. Align to /what/ stripe? Hardware (CHS is fiction), software (of the RAID the real stripe (track) size of the storage, you must read the manual and/or bug technical support for that info. That's my point, there *is* no "real stripe (track) size of the storage" because modern drives use zone bit recording, and sectors per track depends on track, and changes within a partition. See http://www.dewassoc.com/kbase/hard_drives/hard_disk_sector_structures.htm http://www.storagereview.com/guide2000/ref/hdd/op/mediaTracks.html you're about to create), or ??? I don't notice my FC6 or FC7 install programs using any special partition location to start, I have only run (tried to run) FC8-test3 for the live CD, so I can't say what it might do. CentOS4 didn't do anything obvious, either, so unless I really misunderstand your position at redhat, that would be your bad. ;-) If you mean start a partition on a pseudo-CHS boundary, fdisk seems to use what it thinks are cylinders for that. Yes, fdisk will create partition at sector 63 (due to CHS being braindead, other than fictional: 63 sectors-per-track) most arrays use 64 or 128 spt, and array cache are aligned accordingly. So 63 is almost always the wrong choice. As the above links show, there's no right choice. for the default choice you must consider what spt your array uses, iirc (this is from memory, so double check these figures) IBM 64 spt (i think) EMC DMX 64 EMC CX 128??? HDS (and HP XP) except OPEN-V 96 HDS (and HP XP) OPEN-V 128 HP EVA 4/6/8 with XCS 5.x state that no alignment is needed even if i never found a technical explanation about that. previous HP EVA versions did (maybe 64). you might then want to consider how data is laid out on the storage, but i believe the storage cache is enough to deal with that issue. Please note that "0" is always well aligned. 
Note to people who are now wondering WTH I am talking about: consider a storage with 64 spt, an io size of 4k and a partition starting at sector 63. The first io request will require two ios from the storage (1 for sector 63, and one for sectors 64 to 70), the next 7 io (71-78,79-86,87-94,95-102,103-110,111-118,119-126) will be on the same track, the 8th will again require to be split, and so on. This causes the storage to do 1 unnecessary io every 8. YMMV. No one makes drives with fixed spt any more. Your assumptions are a decade out of date. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
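Luca's arithmetic can be checked mechanically; a small sketch of the same example (64 sectors per track on the hypothetical array, 4 KiB requests, partition at the classic fdisk sector 63):

```python
SPT = 64        # sectors per track in Luca's example
START = 63      # classic fdisk default: first partition at sector 63
IO_SECTORS = 8  # one 4 KiB request = 8 x 512-byte sectors

def is_split(n: int) -> bool:
    """Does the n-th sequential 4 KiB request cross a track boundary?"""
    first = START + n * IO_SECTORS
    last = first + IO_SECTORS - 1
    return first // SPT != last // SPT

print([n for n in range(17) if is_split(n)])  # [0, 8, 16]: 1 split every 8
```

Starting the partition at sector 0 or any multiple of SPT makes `first // SPT` and `last // SPT` always equal, which is why "0 is always well aligned".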
Re: Time to deprecate old RAID formats?
Doug Ledford wrote: On Fri, 2007-10-26 at 14:41 -0400, Doug Ledford wrote: Actually, after doing some research, here's what I've found: I should note that both the lvm code and raid code are simplistic at the moment. For example, the raid5 mapping only supports the default raid5 layout. If you use any other layout, game over. Getting it to work with version 1.1 or 1.2 superblocks probably wouldn't be that hard, but getting it to the point where it handles all the relevant setups properly would require a reasonable amount of coding. My first thought is that after the /boot partition is read (assuming you use one) restrictions go away. Performance of /boot is not much of an issue, for me at least, but more complex setups are sometimes needed for the rest of the system. Thanks for the research. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Time to deprecate old RAID formats?
Doug Ledford wrote: On Sat, 2007-10-27 at 11:20 -0400, Bill Davidsen wrote: * When using lilo to boot from a raid device, it automatically installs itself to the mbr, not to the partition. This can not be changed. Only 0.90 and 1.0 superblock types are supported because lilo doesn't understand the offset to the beginning of the fs otherwise. I'm reasonably sure that's wrong, I used to set up dual boot machines by putting LILO in the partition and making that the boot partition, by changing the active partition flag I could just have the machine boot Windows, to keep people from getting confused. Yeah, someone else pointed this out too. The original patch to lilo *did* do as I suggest, so they must have improved on the patch later. * When using grub to boot from a raid device, only 0.90 and 1.0 superblocks are supported[1] (because grub is ignorant of the raid and it requires the fs to start at the start of the partition). You can use either MBR or partition based installs of grub. However, partition based installs require that all bootable partitions be in exactly the same logical block address across all devices. This limitation can be an extremely hazardous limitation in the event a drive dies and you have to replace it with a new drive as newer drives may not share the older drive's geometry and will require starting your boot partition in an odd location to make the logical block addresses match. * When using grub2, there is supposedly already support for raid/lvm devices. However, I do not know if this includes version 1.0, 1.1, or 1.2 superblocks. I intend to find that out today. If you tell grub2 to install to an md device, it searches out all constituent devices and installs to the MBR on each device[2]. This can't be changed (at least right now, probably not ever though). That sounds like a good reason to avoid grub2, frankly. Software which decides that it knows what to do better than the user isn't my preference. 
If I wanted software which forces me to do things "their way" I'd be running Windows. It's not really all that unreasonable of a restriction. Most people aren't aware that when you put a boot sector at the beginning of a partition, you only have 512 bytes of space, so the boot loader that you put there is basically nothing more than code to read the remainder of the boot loader from the file system space. Now, traditionally, most boot loaders have had to hard code the block addresses of certain key components into these second stage boot loaders. If a user isn't aware of the fact that the boot loader does this at install time (or at kernel selection update time in the case of lilo), then they aren't aware that the files must reside at exactly the same logical block address on all devices. Without that knowledge, they can easily create an unbootable setup by having the various boot partitions in slightly different locations on the disks. And intelligent partition editors like parted can compound the problem because as they insulate the user from having to pick which partition number is used for what partition, etc., they can end up placing the various boot partitions in different areas of different drives. The requirement above is a means of making sure that users aren't surprised by a non-working setup. The whole element of least surprise thing. Of course, if they keep that requirement, then I would expect it to be well documented so that people know this going into putting the boot loader in place, but I would argue that this is at least better than finding out when a drive dies that your system isn't bootable. So, given the above situations, really, superblock format 1.2 is likely to never be needed. None of the shipping boot loaders work with 1.2 regardless, and the boot loader under development won't install to the partition in the event of an md device and therefore doesn't need that 4k buffer that 1.2 provides. 
Sounds right, although it may have other uses for clever people. [1] Grub won't work with either 1.1 or 1.2 superblocks at the moment. A person could probably hack it to work, but since grub development has stopped in preference to the still under development grub2, they won't take the patches upstream unless they are bug fixes, not new features. If the patches were available, "doesn't work with existing raid formats" would probably qualify as a bug. Possibly. I'm a bit overbooked on other work at the moment, but I may try to squeeze in some work on grub/grub2 to support version 1.1 or 1.2 superblocks. [2] There are two ways to install to a master boot record. The first is to use the first 512 bytes *only* and hardcode the location of the remainder of the boot loader into those 512 bytes. The second way is to use the free space between the
Re: Raid-10 mount at startup always has problem
Doug Ledford wrote: On Fri, 2007-10-26 at 11:15 +0200, Luca Berra wrote: On Thu, Oct 25, 2007 at 02:40:06AM -0400, Doug Ledford wrote: The partition table is the single, (mostly) universally recognized arbiter of what possible data might be on the disk. Having a partition table may not make mdadm recognize the md superblock any better, but it keeps all that other stuff from even trying to access data that it doesn't have a need to access and prevents random luck from turning your day bad. on a pc maybe, but that is 20 years old design. So? Unix is 35+ year old design, I suppose you want to switch to Vista then? partition table design is limited because it is still based on C/H/S, which do not exist anymore. Put a partition table on a big storage, say a DMX, and enjoy a 20% performance decrease. Because you didn't stripe align the partition, your bad. Align to /what/ stripe? Hardware (CHS is fiction), software (of the RAID you're about to create), or ??? I don't notice my FC6 or FC7 install programs using any special partition location to start, I have only run (tried to run) FC8-test3 for the live CD, so I can't say what it might do. CentOS4 didn't do anything obvious, either, so unless I really misunderstand your position at redhat, that would be your bad. ;-) If you mean start a partition on a pseudo-CHS boundary, fdisk seems to use what it thinks are cylinders for that. Please clarify what alignment provides a performance benefit. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Software RAID when it works and when it doesn't
Alberto Alonso wrote: On Fri, 2007-10-26 at 18:12 +0200, Goswin von Brederlow wrote: Depending on the hardware you can still access a different disk while another one is resetting. But since there is no timeout in md it won't try to use any other disk while one is stuck. That is exactly what I miss. MfG Goswin - That is exactly what I've been talking about. Can md implement timeouts and not just leave it to the drivers? I can't believe it but last night another array hit the dust when 1 of the 12 drives went bad. This year is just a nightmare for me. It brought all the network down until I was able to mark it failed and reboot to remove it from the array. I'm not sure what kind of drives and drivers you use, but I certainly have drives go bad and they get marked as failed. Both on old PATA drives and newer SATA. All the SCSI I currently use is on IBM hardware RAID (ServeRAID), so I can only assume that failure would be noted. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Time to deprecate old RAID formats?
Doug Ledford wrote: On Fri, 2007-10-26 at 10:18 -0400, Bill Davidsen wrote: [___snip___] Actually, after doing some research, here's what I've found: * When using lilo to boot from a raid device, it automatically installs itself to the mbr, not to the partition. This can not be changed. Only 0.90 and 1.0 superblock types are supported because lilo doesn't understand the offset to the beginning of the fs otherwise. I'm reasonably sure that's wrong, I used to set up dual boot machines by putting LILO in the partition and making that the boot partition, by changing the active partition flag I could just have the machine boot Windows, to keep people from getting confused. * When using grub to boot from a raid device, only 0.90 and 1.0 superblocks are supported[1] (because grub is ignorant of the raid and it requires the fs to start at the start of the partition). You can use either MBR or partition based installs of grub. However, partition based installs require that all bootable partitions be in exactly the same logical block address across all devices. This limitation can be an extremely hazardous limitation in the event a drive dies and you have to replace it with a new drive as newer drives may not share the older drive's geometry and will require starting your boot partition in an odd location to make the logical block addresses match. * When using grub2, there is supposedly already support for raid/lvm devices. However, I do not know if this includes version 1.0, 1.1, or 1.2 superblocks. I intend to find that out today. If you tell grub2 to install to an md device, it searches out all constituent devices and installs to the MBR on each device[2]. This can't be changed (at least right now, probably not ever though). That sounds like a good reason to avoid grub2, frankly. Software which decides that it knows what to do better than the user isn't my preference. If I wanted software which forces me to do things "their way" I'd be running Windows. 
So, given the above situations, really, superblock format 1.2 is likely never to be needed. None of the shipping boot loaders work with 1.2 regardless, and the boot loader under development won't install to the partition in the event of an md device, and therefore doesn't need the 4k buffer that 1.2 provides.

Sounds right, although it may have other uses for clever people.

[1] Grub won't work with either 1.1 or 1.2 superblocks at the moment. A person could probably hack it to work, but since grub development has stopped in favor of the still-in-development grub2, they won't take the patches upstream unless they are bug fixes rather than new features.

If the patches were available, "doesn't work with existing raid formats" would probably qualify as a bug.

[2] There are two ways to install to a master boot record. The first is to use the first 512 bytes *only* and hardcode the location of the remainder of the boot loader into those 512 bytes. The second is to use the free space between the MBR and the start of the first partition to embed the remainder of the boot loader. When you point grub2 at an md device, it automatically uses only the second method of boot loader installation. This gives it the freedom to modify the second-stage boot loader on a disk-by-disk basis. The downside is that it needs lots of room after the MBR and before the first partition in order to put its core.img file in place. I *think*, and I'll know for sure later today, that the core.img file is generated during grub install from the list of optional modules you specify during setup. E.g., the pc module gives partition table support, the lvm module lvm support, etc. You list the modules you need, and grub then builds a core.img out of all those modules. The normal amount of space between the MBR and the first partition is (sectors_per_track - 1). For standard disk geometries, that basically leaves 254 sectors, or 127k of space.
This might not be enough for your particular needs if you have a complex boot environment. In that case, you would need to bump at least the starting track of your first partition to make room for your boot loader. Unfortunately, how is a person to know how much room their setup needs until after they've installed and it's too late to bump the partition table start? They can't. So, that's another thing I think I will check out today: what the maximum size of grub2 might be with all modules included, and what a common size might be.

Based on your description, it sounds as if the grub2 authors may not have given adequate thought to what users other than themselves might need (that may be a premature conclusion). I have multiple installs on several of my machines, and I assume that the grub2 for 32 and 64 bit will be different. Thanks for the research.
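The 254-sector / 127k figure above follows directly from the (sectors_per_track - 1) formula Doug gives; a quick sketch of the arithmetic (function names are mine, and the 254-sector result assumes the 255 sectors per track his numbers imply — the classic 63-sector geometry leaves only 62 sectors, about 31 KiB):

```python
SECTOR_BYTES = 512

def embed_area_sectors(sectors_per_track: int) -> int:
    # Free space between the MBR (sector 0) and a first partition that
    # begins on the next track: (sectors_per_track - 1) sectors.
    return sectors_per_track - 1

def embed_area_kib(sectors_per_track: int) -> float:
    # The same area in KiB, for comparison with the "127k" figure.
    return embed_area_sectors(sectors_per_track) * SECTOR_BYTES / 1024

print(embed_area_sectors(255), embed_area_kib(255))  # the post's geometry
print(embed_area_sectors(63), embed_area_kib(63))    # classic 63-sector tracks
```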
Re: Time to deprecate old RAID formats?
Neil Brown wrote: On Thursday October 25, [EMAIL PROTECTED] wrote:

I didn't get a reply to my suggestion of separating the data and location...

No. Sorry.

ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data format (0.9 vs 1.0) and a location (end, start, offset4k)? This would certainly make things a lot clearer to new (and old!) users:

  mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
or
  mdadm --create /dev/md0 --metadata 1.0 --meta-location start
or
  mdadm --create /dev/md0 --metadata 1.0 --meta-location end

I'm happy to support synonyms. How about

  --metadata 1-end
  --metadata 1-start

?? Offset? Do you like "1-offset4k" or maybe "1-start4k" or even "1-start+4k" for that? The last is most intuitive, but I don't know how you feel about the + in there.

resulting in:

  mdadm --detail /dev/md0
  /dev/md0:
          Version : 01.0
    Metadata-locn : End-of-device

It already lists the superblock location as a sector offset, but I don't have a problem with reporting:

  Version : 1.0 (metadata at end of device)
  Version : 1.1 (metadata at start of device)

Would that help?

Same comments on the reporting, "metadata at block 4k" or something.

    Creation Time : Fri Aug  4 23:05:02 2006
       Raid Level : raid0

You provide rational defaults for mortals, and this approach allows people like Doug to do wacky HA things explicitly. I'm not sure you need any changes to the kernel code - probably just the docs and mdadm.

True. It is conceivable that I could change the default, though that would require a decision as to what the new default would be.

I think it would have to be 1.0 or it would cause too much confusion. A newer default would be nice.

I also suspect that a *lot* of people will assume that the highest superblock version is the best and should be used for new installs etc.

Grumble... why can't people expect what I want them to expect?
I confess that I thought 1.x was a series of solutions reflecting your evolving opinion on what was best, so maybe in retrospect you made a non-intuitive choice of nomenclature. Or bluntly, you picked confusing names for this and confused people. If 1.0 meant start, 1.1 meant 4k, and 1.2 meant end, at least it would be easy to remember for people who only create a new array a few times a year, or once in the lifetime of a new computer.

So if you make 1.0 the default, then how many users will try 'the bleeding edge' and use 1.2? So then you have 1.3 which is the same as 1.0? Huh? So, to quote from an old Soap: "Confused, you will be..."

Perhaps you could have called them 1.start, 1.end, and 1.4k in the beginning? Isn't hindsight wonderful?

-- 
bill davidsen <[EMAIL PROTECTED]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
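Bill's point about memorable names is easy to make concrete: the format/location split proposed earlier in the thread reduces to a trivial name parser. A purely hypothetical sketch — none of these "<format>-<location>" names exist in mdadm; they are the synonyms discussed above:

```python
# Illustrative parser for the *proposed* "<format>-<location>" metadata
# names from this thread ("1-end", "1-start", "1-start+4k", etc.).
def parse_metadata_name(name: str):
    fmt, sep, loc = name.partition("-")
    if not sep:
        raise ValueError(f"expected <format>-<location>, got {name!r}")
    # The synonyms Neil floated for the 4K-offset location.
    synonyms = {
        "end": "end",
        "start": "start",
        "offset4k": "4k",
        "start4k": "4k",
        "start+4k": "4k",
    }
    if loc not in synonyms:
        raise ValueError(f"unknown location {loc!r}")
    return fmt, synonyms[loc]

print(parse_metadata_name("1-end"))       # data format "1", metadata at end
print(parse_metadata_name("1-start+4k"))  # data format "1", metadata 4K in
```

With names like these, the location is spelled out rather than encoded in a minor version number, which is exactly the confusion the thread complains about.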