Re: transferring RAID-1 drives via sneakernet
Jeff Breidenbach wrote:
>> It's not a RAID issue, but make sure you don't have any duplicate volume names. According to Murphy's Law, if there are two "/" volumes, the wrong one will be chosen upon your next reboot.
> Thanks for the tip. Since I'm not using volumes or LVM at all, I should be safe from this particular problem.

"Volumes" is being used as a generic term here.

You would be safest if, for the disks/partitions you are transferring, you made the partition type 0x83 (Linux) instead of 0xfd, to prevent the kernel autodetecting them. Otherwise there is a risk that /dev/md0 and /dev/md1 will be transposed. Having done that, you can manually assemble the array and then configure mdadm.conf to associate the UUID with the correct md device.

David

- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
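To make the mdadm.conf step concrete, here is a minimal sketch. The UUID and the md device number are placeholders (read the real UUID with "mdadm --examine" or "mdadm --detail"), and it writes to a local sample file rather than the real /etc/mdadm.conf:

```shell
# Pin the transplanted array's UUID to a fixed md device so the kernel
# cannot transpose it with an array already present in the new machine.
# Placeholder UUID - read the real one with "mdadm --examine /dev/sdX1".
UUID="f0a1b2c3:d4e5f607:18293a4b:5c6d7e8f"
echo "ARRAY /dev/md1 UUID=$UUID" > mdadm.conf.sample
cat mdadm.conf.sample
```

Once an entry like that is in the real mdadm.conf, "mdadm --assemble --scan" should bring the array up on the right device regardless of detection order.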
Re: transferring RAID-1 drives via sneakernet
Jeff Breidenbach wrote:
> I'm planning to take some RAID-1 drives out of an old machine and plop them into a new machine. Hoping that mdadm assemble will magically work. There's no reason it shouldn't work. Right?
> old [ mdadm v1.9.0 / kernel 2.6.17 / Debian Etch / x86-64 ]
> new [ mdadm v2.6.2 / kernel 2.6.22 / Ubuntu 7.10 server ]

I've done it several times. Does the new machine have a RAID array already?

David
Re: [PATCH] Use new sb type
Jan Engelhardt wrote:
>> Feel free to argue that the manpage is clear on this - but as we know, not everyone reads the manpages in depth...
> That is indeed suboptimal (but I would not care since I know the implications of an SB at the front)

Neil cares even less and probably doesn't even need mdadm - heck, he probably just echoes the raw superblock into place via dd... http://xkcd.com/378/ :D

David
Re: howto and faq
Keld Jørn Simonsen wrote:
> I am trying to get some order to linux raid info. Help appreciated :)
> The list description at http://vger.kernel.org/vger-lists.html#linux-raid does list a FAQ, http://www.linuxdoc.org/FAQ/

Yes, that should be amended. Drop them a line about the FAQ too.

> So our FAQ info is pretty out of date. I think it would be nice to have a wiki like we have for the Howto. This would mean that we have much better means to let new people make their mark, and avoid the problem that we have today with really outdated info.

There seems to be no point in having separate wikis for the FAQ and HOWTO elements of documentation - especially since a lot of FAQs are "How do I...?"; by definition the answer is a HOWTO.

> So can we put up a wiki somewhere for this, or should we just extend the wiki howto pages to also include a faq section?

So just extend the existing wiki.

> For the howto, I have asked the VGER people to add info to our list description, that we have a wiki howto at http://linux-raid.osdl.org/

Ta. I set the wiki up at OSDL to ensure that if a bus hit me then Neil or others would have a rational and responsive organisation to go to to change ownership.

I've been writing to some of the other FAQ/doc organisations sporadically for over a year now and have had no response from any of them. It's a very poor aspect of OSS...

David
Re: [PATCH] Use new sb type
Jan Engelhardt wrote:
> On Jan 29 2008 18:08, Bill Davidsen wrote:
>>> IIRC there was a discussion a while back on renaming mdadm options (google "Time to deprecate old RAID formats?") and the superblocks to emphasise the location and data structure. Would it be good to introduce the new names at the same time as changing the default format/on-disk-location?
>> Yes, I suggested some layout names, as did a few other people, and a few changes to separate metadata type and position were discussed. BUT, changing the default layout, no matter how much better it seems, is trumped by "breaks existing setups and user practice".
> Layout names are a different matter from what the default sb type should be.

Indeed they are. Or rather, should be. However, the current default sb includes a layout element. If the default sb is changed then it seems like an opportunity to detach the data format from the on-disk location.

David
Re: when is a disk non-fresh?
Dexter Filmore wrote:
> On Friday 08 February 2008 00:22:36 Neil Brown wrote:
>> On Thursday February 7, [EMAIL PROTECTED] wrote:
>>> On Tuesday 05 February 2008 03:02:00 Neil Brown wrote:
>>>> On Monday February 4, [EMAIL PROTECTED] wrote:
>>>>> Seems the other topic wasn't quite clear...
>>>> Not necessarily. Sometimes it helps to repeat your question. There is a lot of noise on the internet and sometimes important things get missed... :-)
>>>>> Occasionally a disk is kicked for being non-fresh - what does this mean and what causes it?
>>>> The 'event' count is too small. Every event that happens on an array causes the event count to be incremented.
>>> An 'event' here is any atomic action? Like "write byte there" or "calc XOR"?
>> An 'event' is:
>> - switch from clean to dirty
>> - switch from dirty to clean
>> - a device fails
>> - a spare finishes recovery
>> things like that.
> Is there a glossary that explains dirty and such in detail?

Not yet. http://linux-raid.osdl.org/index.php?title=Glossary

David
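The event counter Neil describes is visible per member device. A small sketch of pulling it out of "mdadm --examine" style output - the Events line below is a hard-coded example with a made-up value, since on a live system it would come from running mdadm against a real member:

```shell
# Extract the event counter from an "mdadm --examine"-style line.
# The sample line is hard-coded so the parsing can be shown without
# assuming a live array; the value 0.1432 is made up.
examine_line="         Events : 0.1432"
echo "$examine_line" | awk -F: '{gsub(/ /, "", $2); print $2}'
```

A member whose counter is lower than the rest of the array's is the one that gets kicked as non-fresh.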
Re: [PATCH] Use new sb type
Jan Engelhardt wrote:
> On Feb 10 2008 10:34, David Greaves wrote:
>> Jan Engelhardt wrote:
>>> On Jan 29 2008 18:08, Bill Davidsen wrote:
>>>>> IIRC there was a discussion a while back on renaming mdadm options (google "Time to deprecate old RAID formats?") and the superblocks to emphasise the location and data structure. Would it be good to introduce the new names at the same time as changing the default format/on-disk-location?
>>>> Yes, I suggested some layout names, as did a few other people, and a few changes to separate metadata type and position were discussed. BUT, changing the default layout, no matter how much better it seems, is trumped by "breaks existing setups and user practice".
>>> Layout names are a different matter from what the default sb type should be.
>> Indeed they are. Or rather, should be. However, the current default sb includes a layout element. If the default sb is changed then it seems like an opportunity to detach the data format from the on-disk location.
> I do not see anything wrong with specifying the SB location as a metadata version. Why should location not be an element of the raid type? It's fine the way it is, IMHO. (Just the default is not :)

There was quite a discussion about it. For me the main argument is that most people seeing superblock versions (even the manpage terminology is "version" and "subversion") will correlate incremental versions with improvement. They will therefore see v1.2 as 'the latest and best'. We had our first 'in the wild' example just a few days ago.

Feel free to argue that the manpage is clear on this - but as we know, not everyone reads the manpages in depth...

It's misleading, and I would submit that *if* Neil decides to change the default, then changing the terminology at the same time would mean a single change that ushers in broader benefit.

I acknowledge that I am only talking semantics - OTOH I think semantics can be a very important aspect of communication.
David

PS I would love to send a patch in to mdadm - I am currently being heavily nagged to sort out our house electrics and get lunch. It may happen though :)
Re: howto and faq
Keld Jørn Simonsen wrote:
> I would then like that to be reflected in the main page. I would rather that this be called "Howto and FAQ - Linux raid" than "Main Page - Linux Raid". Is that possible?

Just like C has a main(), wikis have a Main Page :) I guess it could be changed, but I think it involves editing the Mediawiki config - maybe next time I'm in there...

> And then, how do we structure the pages? I think we need a new section for the FAQ.

By all means create an FAQ page and link to answers or other relevant sections of the wiki. Bear in mind that this is a reference work, and whilst it may contain tutorials, the idea is that it contains (reasonably) authoritative information about the linux raid subsystem (linking to the source, kernel docs or man pages if that's more appropriate).

> And then I would like a clearer statement on the relation between the linux-raid mailing list and the pages, right at the top of the main page.

The relationship is loose - the statement as it stands describes the current state of affairs. If Neil feels that he could or would like to help the case by declaring a more official relationship then that's his call. To be fair, I work on these pages on and off as the mood takes me :) If I was Neil I'd be keeping an eye on it and waiting for the right level of community involvement.

> I have had a look at other search engines, yahoo and msn. Our pages do show up within the 10 first hits for "linux raid". So that is not that bad. Still, Google has the http://linux-raid.osdl.org/ page as number 127. That is very bad. Maybe something about it being referenced from wikipedia?

I'm not an expert at gaming the search engines - more than happy to do rational things like linking from Wikipedia and other reference sites. I am sad that I've had such a poor response from the other linux documentation sites... maybe a Slashdot article, not so much about doc-rot but about the difficulty of combating doc-rot, would help...
Maybe they'd take more notice if I said "the linux raid subsystem maintainer says..." - dunno.

David
Re: Deleting mdadm RAID arrays
Marcin Krol wrote:
> Hello everyone, I have had a problem with a RAID array (udev messed up disk names; I've had RAID on disks only, without raid partitions)

Do you mean that you originally used /dev/sdb for the RAID array? And now you are using /dev/sdb1? Given the system seems confused, I wonder if this may be relevant?

David
Re: RAID 1 and grub
Richard Scobie wrote:
> David Rees wrote:
>> FWIW, this step is clearly marked in the Software-RAID HOWTO under "Booting on RAID": http://tldp.org/HOWTO/Software-RAID-HOWTO-7.html#ss7.3
> The one place I didn't look...

Good - I hope you'll both look here instead:
http://linux-raid.osdl.org/index.php/Tweaking%2C_tuning_and_troubleshooting#Booting_on_RAID

David
Re: In this partition scheme, grub does not find md information?
Peter Rabbitson wrote:
> I guess I will sit down tonight and craft some patches to the existing md* man pages. Some things are indeed left unsaid.

If you want to be more verbose than a man page allows then there's always the wiki/FAQ... http://linux-raid.osdl.org/

Keld Jørn Simonsen wrote:
> Is there an official web page for mdadm? And maybe the raid faq could be updated?

That *is* the linux-raid FAQ, brought up to date (with the consent of the original authors). Of course, being a wiki means it is now a shared, community responsibility - and to all present and future readers: that means you too ;)

David
Re: In this partition scheme, grub does not find md information?
On 26 Oct 2007, Neil Brown wrote:
> On Thursday October 25, [EMAIL PROTECTED] wrote:
>> I also suspect that a *lot* of people will assume that the highest superblock version is the best and should be used for new installs etc.
> Grumble... why can't people expect what I want them to expect?

Moshe Yudkowsky wrote:
> I expect it's because I used 1.2 superblocks (why not use the latest, I said, foolishly...) and therefore the RAID10 --

Aha - an 'in the wild' example of why we should deprecate '0.9, 1.0, 1.1, 1.2' and rename the superblocks to data-version + on-disk-location :)

David
Re: linux raid faq
Keld Jørn Simonsen wrote:
> Hmm, I read the Linux raid faq on http://www.faqs.org/contrib/linux-raid/x37.html
> It looks pretty outdated, referring to how to patch 2.2 kernels and not mentioning new mdadm, nor raid10. It was not dated. It seemed to be related to the linux-raid list, telling where to find archives of the list. Maybe time for an update? Or is this not the right place to write stuff?

http://linux-raid.osdl.org/index.php/Main_Page

I have written to faqs.org but got no reply. I'll try again...

> If I searched on google for "raid faq", the first say 5-7 items did not mention raid10.

Until people link to and use the new wiki, Google won't find it.

> Maybe wikipedia is the way to go? I did contribute myself a little there. The software raid howto is dated v. 1.1, 3rd of June 2004, http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO.html - also pretty old.

FYI http://linux-raid.osdl.org/index.php/Credits

David
Re: WRONG INFO (was Re: In this partition scheme, grub does not find md information?)
Peter Rabbitson wrote:
> Moshe Yudkowsky wrote:
>> over the other. For example, I've now learned that if I want to set up a RAID1 /boot, it must actually be 1.2 or grub won't be able to read it. (I would therefore argue that if the new version ever becomes default, then the default sub-version ought to be 1.2.)
> In the discussion yesterday I myself made a serious typo, that should not spread. The only superblock version that will work with current GRUB is 1.0, _not_ 1.2.

Ah, the joys of consolidated and yet editable documentation - like a wiki.

David
Re: [PATCH] Use new sb type
Bill Davidsen wrote:
> David Greaves wrote:
>> Jan Engelhardt wrote:
>>> This makes 1.0 the default sb type for new arrays.
>> IIRC there was a discussion a while back on renaming mdadm options (google "Time to deprecate old RAID formats?") and the superblocks to emphasise the location and data structure. Would it be good to introduce the new names at the same time as changing the default format/on-disk-location?
> Yes, I suggested some layout names, as did a few other people, and a few changes to separate metadata type and position were discussed. BUT, changing the default layout, no matter how much better it seems, is trumped by "breaks existing setups and user practice". For all of the reasons something else is preferable, 1.0 *works*.

It wasn't my intention to change anything other than the naming. If the default layout was being updated to 1.0 then I thought it would be a good time to introduce 1-start, 1-4k and 1-end names, and actually announce a default of 1-end and not 1.0.

Although I still prefer a full separation:

  mdadm --create /dev/md0 --metadata 1 --meta-location start

David
Re: RAID 1 and grub
On Jan 30, 2008 2:06 PM, Richard Scobie [EMAIL PROTECTED] wrote:
> hda has failed, and after spending some time with a rescue disk mounting hdc's /boot partition (hdc1) and changing the grub.conf device parameters, I have had no success in booting off it. I then set them back to the original (hd0,0) and moved hdc into hda's position. Booting from there brings up the message: "GRUB hard disk error".

Have you tried re-running grub-install after booting from a rescue disk?

-Dave
Re: RAID 1 and grub
On Jan 30, 2008 6:33 PM, Richard Scobie [EMAIL PROTECTED] wrote:
> I found this document very useful:
> http://lists.us.dell.com/pipermail/linux-poweredge/2003-July/008898.html
> After modifying my grub.conf to refer to (hd0,0), reinstalling grub on hdc with:
>   grub> device (hd0) /dev/hdc
>   grub> root (hd0,0)
>   grub> (hd0)
> and rebooting with the bios set to boot off hdc, everything burst back into life.

FWIW, this step is clearly marked in the Software-RAID HOWTO under "Booting on RAID": http://tldp.org/HOWTO/Software-RAID-HOWTO-7.html#ss7.3

If it appears that Fedora isn't doing this when installing on a Software RAID 1 boot device, I suggest you open a bug.

BTW, I suspect you are missing the command "setup" from your 3rd command above. It should be:

  # grub
  grub> device (hd0) /dev/hdc
  grub> root (hd0,0)
  grub> setup (hd0)

> I shall now be checking all my Fedora/Centos RAID1 installs for grub installed on both drives.

Good idea. Whenever setting up a RAID1 device to boot from, I perform the above 3 steps. I also suggest using labels to identify partitions, and testing the two failure modes - that you are able to boot with either drive disconnected.

-Dave
Re: [PATCH] Use new sb type
Jan Engelhardt wrote:
> This makes 1.0 the default sb type for new arrays.

IIRC there was a discussion a while back on renaming mdadm options (google "Time to deprecate old RAID formats?") and the superblocks to emphasise the location and data structure. Would it be good to introduce the new names at the same time as changing the default format/on-disk-location?

David
Re: [PATCH] Use new sb type
Peter Rabbitson wrote:
> David Greaves wrote:
>> Jan Engelhardt wrote:
>>> This makes 1.0 the default sb type for new arrays.
>> IIRC there was a discussion a while back on renaming mdadm options (google "Time to deprecate old RAID formats?") and the superblocks to emphasise the location and data structure. Would it be good to introduce the new names at the same time as changing the default format/on-disk-location?
> Also, wasn't the concession to make 1.1 the default instead of 1.0?

IIRC Doug Ledford did some digging wrt lilo + grub and found that 1.1 and 1.2 wouldn't work with them. I'd have to review the thread though...

David
Re: identifying failed disk/s in an array.
Tomasz Chmielewski wrote:
> Michael Harris schrieb:
>> If I have a disk fail, say HDC for example, I won't know which disk HDC is, as it could be any of the 5 disks in the PC. Is there any way to make it easier to identify which disk is which?
> If the drives have any LEDs, the most reliable way would be:
>   dd if=/dev/drive of=/dev/null
> Then look which LED is the one which blinks the most.

And/or use smartctl to look up the make/model/serial number and look at the drive label. I always do this to make sure I'm pulling the right drive (it's also useful for RMAing the drive).

David
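A sketch of how that might look for a five-drive box. The device names are examples only, and the loop just prints the command pairs so they can be run by hand, one drive at a time:

```shell
# For each candidate drive, print the LED-blink command and the
# serial-number lookup. Run them one drive at a time and match the
# serial against the sticker on the drive itself.
for d in /dev/hda /dev/hdc /dev/hde /dev/hdg /dev/sda; do
    echo "dd if=$d of=/dev/null bs=1M count=1024  # watch which LED goes busy"
    echo "smartctl -i $d | grep -i serial         # compare with the label"
done
```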
Re: how to create a degraded raid1 with only 1 of 2 drives ??
Mitchell Laks wrote:
> I think my error was that maybe I did not write the fdisk changes to the drive with fdisk "w".

No - your problem was that you needed to use the literal word "missing", like you did this time:

  mdadm -C /dev/md0 --level=2 -n2 /dev/sda1 missing

[However, this time you also asked for a RAID2 (--level=2), which I doubt would work.]

David
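For the archives, a hedged sketch of the corrected command: level 1 rather than the --level=2 typo, with "missing" as a literal word. Device names follow the thread's example; the snippet builds the command as a string and prints it for review instead of running it:

```shell
# Degraded two-disk RAID1: the literal word "missing" holds the second
# slot; the real second disk is added later. Printed for review only.
CMD="mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 missing"
echo "$CMD"
echo "mdadm /dev/md0 --add /dev/sdb1  # later, to add the second disk"
```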
Re: Last ditch plea on remote double raid5 disk failure
On Dec 31, 2007 2:39 AM, Marc MERLIN [EMAIL PROTECTED] wrote:
> new years eve :( I was wondering if I can tell the kernel not to kick a drive out of an array if it sees a block error, and just return the block error upstream but continue otherwise (all my partitions are on a raid5 array, with lvm on top, so even if I were to lose a partition, I would still be likely to get the other ones back up if I can stop the auto kicking-out and killing-the-md-array feature).

Best bet is to get a new drive into the machine that is at least the same size as the bad-sector disk, and use dd_rescue[1] to copy as much of the bad-sector disk to the new one as possible. Remove the bad-sector disk, reboot, and hopefully you'll have a functioning raid array with a bit of bad data on it somewhere. I'm probably missing a step somewhere, but you get the general idea...

-Dave

[1] http://www.garloff.de/kurt/linux/ddrescue/
Re: Few questions
Guy Watkins wrote:
> man md
> man mdadm

and http://linux-raid.osdl.org/index.php/Main_Page :)
/proc/mdstat docs (was Re: Few questions)
Michael Makuch wrote:
> So my questions are: ... Is this a.o.k for a raid5 array?

So I realised that /proc/mdstat isn't documented too well anywhere...

http://linux-raid.osdl.org/index.php/Mdstat

Comments welcome...

David
Re: Reading takes 100% precedence over writes for mdadm+raid5?
On Dec 6, 2007 1:06 AM, Justin Piszcz [EMAIL PROTECTED] wrote:
> On Wed, 5 Dec 2007, Jon Nelson wrote:
>> I saw something really similar while moving some very large (300MB to 4GB) files. I was really surprised to see actual disk I/O (as measured by dstat) be really horrible.
> Any work-arounds, or just don't perform heavy reads at the same time as writes?

What kernel are you using? (Did I miss it in your OP?) The per-device write throttling in 2.6.24 should help significantly - have you tried the latest -rc and compared it to your current kernel?

-Dave
Re: assemble vs create an array.......
On Thu, Dec 06, 2007 at 07:39:28PM +0300, Michael Tokarev wrote:
> What to do is to give repairfs a try for each permutation, but again without letting it actually fix anything. Just run it in read-only mode and see which combination of drives gives fewer errors, or no fatal errors (there may be several similar combinations, with the same order of drives but with a different drive missing).
> Ugggh. It's sad that xfs refuses to mount when "structure needs cleaning" - the best way here is to actually mount it and see how it looks, instead of trying repair tools.

It's self-protection - if you try to write to a corrupted filesystem, you'll only make the corruption worse. Mounting involves log recovery, which writes to the filesystem.

> Is there some option to force-mount it still (in readonly mode, knowing it may OOPs the kernel etc)?

Sure you can:

  mount -o ro,norecovery dev mtpt

But if you hit corruption it will still shut down on you. If the machine oopses then that is a bug.

> This thread prompted me to think. If I can't force-mount it (or browse it using other ways) as I can almost always do with (somewhat?) broken ext[23], just to examine things, maybe I'm trying it before it's mature enough? ;)

Hehe ;) For maximum uber-XFS-guru points, learn to browse your filesystem with xfs_db. :P

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: assemble vs create an array.......
Dragos wrote:
> Thank you for your very fast answers. First I tried 'fsck -n' on the existing array. The answer was that if I wanted to check an XFS partition I should use 'xfs_check'. That seems to say that my array was formatted with xfs, not reiserfs. Am I correct?
> Then I tried the different permutations:
>   mdadm --create /dev/md0 --raid-devices=3 --level=5 missing /dev/sda1 /dev/sdb1
>   mount /dev/md0 temp
>   mdadm --stop --scan
>   mdadm --create /dev/md0 --raid-devices=3 --level=5 /dev/sda1 missing /dev/sdb1
>   mount /dev/md0 temp
>   mdadm --stop --scan
>   [etc]
> With some arrays mount reported:
>   mount: you must specify the filesystem type
> and with others:
>   mount: Structure needs cleaning
> No choice seems to have been successful.

OK, not as good as you could have hoped for. Make sure you have the latest xfs tools. You may want to try xfs_repair, and you can use the -n (I think - check the man page) option. You may need to force it to ignore the log.

David
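A sketch of the two repair invocations being discussed (the md device name follows the thread; the snippet only prints the commands so the dry run can be tried first):

```shell
# xfs_repair -n reports what it *would* fix without writing anything;
# forcing it to ignore the log is destructive, so keep that for last.
echo "xfs_repair -n /dev/md0   # dry run, makes no modifications"
echo "xfs_repair -L /dev/md0   # zeroes the log - last resort only"
```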
Re: assemble vs create an array.......
Neil Brown wrote:
> On Thursday November 29, [EMAIL PROTECTED] wrote:
>> 2. Do you know of any way to recover from this mistake? Or at least what filesystem it was formatted with?
> It may not have been lost - yet. If you created the same array with the same devices and layout etc, the data will still be there, untouched. Try to assemble the array and use fsck on it.

To be safe I'd use fsck -n (check the man page, as this is odd for reiserfs).

When you create a RAID5 array, all that is changed is the metadata (at the end of the device), and one drive is changed to be the xor of all the others. In other words, one of your 3 drives has just been erased. Unless you know the *exact* command you used and have the dmesg output to hand, we won't know which one.

Now what you need to do is to try all the permutations of creating a degraded array using 2 of the drives, specifying the 3rd as 'missing'. So something like:

  mdadm --create /dev/md0 --raid-devices=3 --level=5 missing /dev/sdb1 /dev/sdc1
  mdadm --create /dev/md0 --raid-devices=3 --level=5 /dev/sdb1 missing /dev/sdc1
  mdadm --create /dev/md0 --raid-devices=3 --level=5 /dev/sdb1 /dev/sdc1 missing
  mdadm --create /dev/md0 --raid-devices=3 --level=5 missing /dev/sdb1 /dev/sdd1
  mdadm --create /dev/md0 --raid-devices=3 --level=5 /dev/sdb1 missing /dev/sdd1
  mdadm --create /dev/md0 --raid-devices=3 --level=5 /dev/sdb1 /dev/sdd1 missing
  etc. etc.

It is important to create the array using a 'missing' device so the xor data isn't written.

There is a program here that may help:
http://linux-raid.osdl.org/index.php/Permute_array.pl

David
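The "etc. etc." above can be enumerated rather than typed. A sketch that only prints the candidate commands (device names follow the thread's examples) - review the list, then run one candidate at a time with a read-only fsck between attempts:

```shell
# For 3 member partitions, one of which was overwritten, print every
# "create degraded" permutation: each drive excluded in turn, both
# orders of the remaining pair, and "missing" in each of the 3 slots.
# That gives 3 x 2 x 3 = 18 candidate commands.
DEVS="/dev/sdb1 /dev/sdc1 /dev/sdd1"
for excluded in $DEVS; do
    # the two devices that stay in the array for this round
    rest=$(echo "$DEVS" | tr ' ' '\n' | grep -v "^$excluded\$" | tr '\n' ' ')
    set -- $rest
    for order in "$1 $2" "$2 $1"; do
        set -- $order
        echo "mdadm --create /dev/md0 --raid-devices=3 --level=5 missing $1 $2"
        echo "mdadm --create /dev/md0 --raid-devices=3 --level=5 $1 missing $2"
        echo "mdadm --create /dev/md0 --raid-devices=3 --level=5 $1 $2 missing"
    done
done
```

The Permute_array.pl script on the wiki automates the whole try-and-check cycle; this is only the brute-force enumeration of the create commands.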
Re: RAID5 Recovery
> localhost kernel: [17179584.18] md: md driver hdc is kicked too

(again)

> Nov 13 07:30:24 localhost kernel: [17179584.184000] md: raid5 personality registered as nr 4

Another reboot...

> Nov 13 07:30:24 localhost kernel: [17179585.068000] md: syncing RAID array md0

Now (I guess) hdg is being restored using hdc data:

> Nov 13 07:30:24 localhost kernel: [17179684.16] ReiserFS: md0: warning: sh-2021: reiserfs_fill_super: can not find reiserfs on md0

But Reiser is confused.

> Nov 13 08:57:11 localhost kernel: [17184895.816000] md: md0: sync done.

hdg is back up to speed.

So hdc looks faulty. Your only hope (IMO) is to use the reiserfs recovery tools. You may want to replace hdc, to avoid an hdc failure interrupting any rebuild.

I think what happened is that hdg failed prior to 2am and you didn't notice (mdadm --monitor is your friend). Then hdc had a real failure - at that point you had data loss (not enough good disks). I don't know why md rebuilt using hdc - I would have expected it to find hdc and hdg stale. If this is a newish kernel then maybe Neil should take a look...

David
Re: Fwd: RAID5 Recovery
Neil Cavan wrote:
> Thanks for taking a look, David.

No problem.

> Kernel: 2.6.15-27-k7, stock for Ubuntu 6.06 LTS
> mdadm: mdadm - v1.12.0 - 14 June 2005

OK - fairly old then. Not really worth trying to figure out why hdc got re-added when things had gone wrong.

> You're right, earlier in /var/log/messages there's a notice that hdg dropped, I missed it before. I use mdadm --monitor, but I recently changed the target email address - I guess it didn't take properly.
> As for replacing hdc, thanks for the diagnosis but it won't help: the drive is actually fine, as is hdg. I've replaced hdc before, only to have the brand new hdc show the same behaviour, and SMART says the drive is A-OK. There's something flaky about these PCI IDE controllers. I think it's new-system time.

Any excuse, eh? :)

> Reiserfs recovery-wise: any suggestions? A simple fsck doesn't find a filesystem superblock. Is --rebuild-sb the way to go here?

No idea, sorry. I only ever tried Reiser once and it failed. It was very hard to get recovered, so I swapped back to XFS.

Good luck on the fscking.

David
Re: Raid5 assemble after dual sata port failure
Chris Eddington wrote:
> Hi, Thanks for the pointer on xfs_repair -n - it actually tells me something (some listed below) but I'm not sure what it means; there seems to be a lot of data loss. One complication is I see an error message on ata6, so I moved the disks around thinking it was a flaky sata port, but I see the error again on ata4, so it seems to follow the disk. But it happens at exactly the same time during the xfs_repair sequence, so I don't think it is a flaky disk.

Does dmesg have any info/sata errors? xfs_repair will have problems if the disk is bad. You may want to image the disk (possibly onto the 'spare'?) if it is bad.

> I'll go to the xfs mailing list on this.

Very good idea :)

> Is there a way to be sure the disk order is right?

The order looks right to me. xfs_repair wouldn't recognise it as well as it does if the order was wrong.

> not way out of whack since I'm seeing so much from xfs_repair. Also, since I've been moving the disks around, I want to be sure I have the right order.

Bear in mind that -n stops the repair fixing a problem. Then, as the 'repair' proceeds, it becomes very confused by problems that should have been fixed. This is evident in the superblock issue (which also probably explains the failed mount).

> Is there a way to try restoring using the other disk?

No - the event count was very out of date.
Re: Raid5 assemble after dual sata port failure
Chris Eddington wrote: Yes, there is some kind of media error message in dmesg, below. It is not random, it happens at exactly the same moments in each xfs_repair -n run. Nov 11 09:48:25 altair kernel: [37043.300691] res 51/40:00:01:00:00/00:00:00:00:00/e1 Emask 0x9 (media error) Nov 11 09:48:25 altair kernel: [37043.304326] ata4.00: ata_hpa_resize 1: sectors = 976773168, hpa_sectors = 976773168 Nov 11 09:48:25 altair kernel: [37043.307672] ata4.00: ata_hpa_resize 1: sectors = 976773168, hpa_sectors = 976773168 I'm not sure what an ata_hpa_resize error is... It probably explains the problems you've been having with the raid not 'just recovering' though. I saw this: http://www.linuxquestions.org/questions/linux-kernel-70/sata-issues-568894/ What does smartctl say about your drive? IMO the spare drive is no longer useful for data recovery - you may want to use ddrescue to try and copy this drive to the spare drive. David PS Don't get the ddrescue parameters the wrong way round if you go that route... - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
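David's PS about parameter order is worth taking literally. Here's a tiny illustration with plain dd on scratch files (ddrescue takes its input first and its output second in just the same way; the file names are made up stand-ins for the real devices):

```shell
# Stand-ins for the real devices: failing.img plays the failing disk,
# spare.img plays the blank spare you'd copy onto.
printf 'precious data' > /tmp/failing.img
: > /tmp/spare.img

# Right way round: read FROM the failing disk, write TO the spare.
# With the arguments swapped, this same line would overwrite the
# failing disk's remaining data with blank sectors from the spare.
dd if=/tmp/failing.img of=/tmp/spare.img 2>/dev/null
cat /tmp/spare.img   # -> precious data
```

With the real drives you would of course double-check which device node is which (e.g. via serial numbers in smartctl) before running anything destructive.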
Re: Raid5 assemble after dual sata port failure
Ok - it looks like the raid array is up. There will have been an event count mismatch which is why you needed --force. This may well have caused some (hopefully minor) corruption. FWIW, xfs_check is almost never worth running :) (It runs out of memory easily). xfs_repair -n is much better. What does the end of dmesg say after trying to mount the fs? Also try: xfs_repair -n -L I think you then have 2 options: * xfs_repair -L This may well lose data that was being written as the drives crashed. * contact the xfs mailing list David Chris Eddington wrote: Hi David, I ran xfs_check and get this: ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_check. If you are unable to mount the filesystem, then use the xfs_repair -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this. After mounting (which fails) and re-running xfs_check it gives the same message. The array info details are below and seems it is running correctly ?? I interpret the message above as actually a good sign - seems that xfs_check sees the filesystem but the log file and maybe the most currently written data is corrupted or will be lost. But I'd like to hear some advice/guidance before doing anything permanent with xfs_repair. I also would like to confirm somehow that the array is in the right order, etc. Appreciate your feedback. 
Thks, Chris

cat /etc/mdadm/mdadm.conf
DEVICE /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
ARRAY /dev/md0 level=raid5 num-devices=4 UUID=bc74c21c:9655c1c6:ba6cc37a:df870496
MAILADDR root

cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sda1[0] sdd1[2] sdb1[1]
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
unused devices: <none>

mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sun Nov  5 14:25:01 2006
     Raid Level : raid5
     Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
    Device Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Fri Nov  9 16:26:31 2007
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 64K
           UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
         Events : 0.4880384

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       49        2      active sync   /dev/sdd1
       3       0        0        3      removed

Chris Eddington wrote: Thanks David. I've had cable/port failures in the past, and after re-adding the drive the order changed - I'm not sure why. I noticed it some time ago but don't remember the exact order. On my initial attempt to assemble, it came up with only two drives in the array. Then I tried assembling with --force and that brought up 3 of the drives. At that point I thought I was good, so I tried mount /dev/md0 and it failed. Would that have written to the disk? I'm using XFS. After that, I tried assembling with different drive orders on the command line, i.e. mdadm -Av --force /dev/md0 /dev/sda1, ... thinking that the order might not be right. At the moment I can't access the machine, but I'll try fsck -n and send you the other info later this evening.
Many thanks, Chris
Re: Raid5 assemble after dual sata port failure
Chris Eddington wrote: Hi, Hi While on vacation I had one SATA port/cable fail, and then four hours later a second one fail. After fixing/moving the SATA ports, I can reboot and all drives seem to be OK now, but when assembled it won't recognize the filesystem. That's unusual - if the array comes back then you should be OK. In general if two devices fail then there is a real data loss risk. However if the drives are good and there was just a cable glitch, then unless you're unlucky it's usually fsck-fixable. I see mdadm: /dev/md0 has been started with 3 drives (out of 4). which means it's now up and running. And:

sda1 Events : 0.4880374
sdb1 Events : 0.4880374
sdc1 Events : 0.4857597
sdd1 Events : 0.4880374

so sdc1 is way out of date... we'll add/resync that when everything else is working. But: After futzing around with assemble options like --force and disk order I couldn't get it to work. Let me check... what commands did you use? Just 'assemble' - which doesn't care about disk order - or did you try to re-'create' the array - which does care about disk order and leads us down a different path... err, scratch that: Creation Time : Sun Nov 5 14:25:01 2006 OK, it was created a year ago... so you did use assemble. It is slightly odd to see that the drive order is: /dev/mapper/sda1 /dev/mapper/sdb1 /dev/mapper/sdd1 /dev/mapper/sdc1 Usually people just create them in order. Have you done any fscks that involve a write? What filesystem are you running? What does your 'fsck -n' (read-only) report? Also, please report the results of:

cat /proc/mdstat
mdadm -D /dev/md0
cat /etc/mdadm.conf

David
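The event counts above are how you spot the stale member at a glance. A small parsing sketch (sample numbers copied from this thread) that picks out the device furthest behind:

```shell
# Sample 'Events' lines as mdadm -E reports them for each member.
cat > /tmp/events.txt <<'EOF'
sda1 Events : 0.4880374
sdb1 Events : 0.4880374
sdc1 Events : 0.4857597
sdd1 Events : 0.4880374
EOF

# Strip the '0.' prefix, sort numerically, and the first entry is the
# member with the oldest event count - the one to re-add and resync.
awk '{ sub(/^0\./, "", $4); print $4, $1 }' /tmp/events.txt | sort -n | head -n1
```

Here that prints "4857597 sdc1", matching the observation that sdc1 is way out of date.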
Re: Kernel Module - Raid
Paul VanGundy wrote: All, Hello. I don't know if this is the right place to post this issue but it does deal with RAID so I thought I would try. It deals primarily with linux *software* raid. But stick with it - you may end up doing that... What hardware/distro etc are you using? Is this an expensive (hundreds of £) card? Or an onboard/motherboard chipset? Once you answer this then it may be worth suggesting using sw-raid (in which case we can help out) or pointing you elsewhere... I successfully built a new kernel and am able to boot from it. However, I need to incorporate a specific RAID driver (adpahci.ko) so we can use the on-board RAID. I think this is the adaptec proprietary code - in which case you may need a very specific kernel to run it. You may find others on here who can help but you'll probably need an Adaptec forum/list. I have the adpahci.ko and am unable to get it to compile against any other kernel because I don't have the original kernel module (adpahci.c I assume is what I need). Is there any way I can view the adpahci.ko and copy the contents to make a adpahci.c? No Is it possible to get the kernel object to compile with another kernel only using the adpahci.ko? No Am I making sense? :) Yes That's one of the big reasons proprietary drivers suck on linux. David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel Module - Raid
Paul VanGundy wrote: Thanks for the prompt reply, David. Below are the answers to your questions: What hardware/distro etc are you using? Is this an expensive (hundreds of £) card? Or an onboard/motherboard chipset? The distro is Suse 10.1. As a bit of trivia, Neil (who wrote and maintains linux RAID) works for Suse. It is an onboard chipset. In which case it's not likely to be hardware RAID. See: http://linux-ata.org/faq-sata-raid.html Once you answer this then it may be worth suggesting using sw-raid (in which case we can help out) or pointing you elsewhere... You should probably configure the BIOS to use plain (non-RAID) SATA mode. That's one of the big reasons proprietary drivers suck on linux. Ok. So this chipset has the ability to use an Intel-based RAID. Would that be better? mmm, see the link above... In almost any case where you are considering 'onboard' raid, linux software raid (using md and mdadm) is a better choice. Start here: http://linux-raid.osdl.org/index.php/Main_Page (feel free to correct it or ask here for clarification) Also essential reading is the mdadm man page. David
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Michael Tokarev wrote: Justin Piszcz wrote: On Sun, 4 Nov 2007, Michael Tokarev wrote: [] The next time you come across something like that, do a SysRq-T dump and post that. It shows a stack trace of all processes - and in particular, where exactly each task is stuck. Yes, I got it before I rebooted - I ran that and then saved the dmesg output. Here it is: [1172609.665902] 80747dc0 80747dc0 80747dc0 80744d80 [1172609.668768] 80747dc0 81015c3aa918 810091c899b4 810091c899a8 That's only a partial list. All the kernel threads - which are most important in this context - aren't shown. You ran out of dmesg buffer, and the most interesting entries were at the beginning. If your /var/log partition is working, the stuff should be in /var/log/kern.log or equivalent. If it's not working, there is a way to capture the info still, by stopping syslogd, cat'ing /proc/kmsg to some tmpfs file and scp'ing it elsewhere. Or netconsole - it's actually pretty easy and incredibly useful in this kind of situation, even if there's no disk at all :) David
Re: Implementing low level timeouts within MD
Alberto Alonso wrote: On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote: Not in the older kernel versions you were running, no. These old versions (especially the RHEL ones) are supposed to be the official versions supported by Redhat and the hardware vendors, as they were very specific as to what versions of Linux were supported. Of all people, I would think you would appreciate that. Sorry if I sound frustrated and upset, but it is clearly a result of what 'supported' and 'tested' really mean in this case. I don't want to go into a discussion of commercial distros and which are supported, as this is neither the time nor the place, but I don't want to open the door to the excuse of 'it's an old kernel' - it wasn't when it got installed. It may be worth noting that the context of this email is the upstream linux-raid list. In my time watching the list it is mainly focused on 'current' code and development (but hugely supportive of older environments). In general discussions in this context will have a certain mindset - and it's not going to be the same as that which you'd find in an enterprise product support list. Outside of the rejected suggestion, I just want to figure out when software raid works and when it doesn't. With SATA, my experience is that it doesn't. SATA, or more precisely, error handling in SATA has recently been significantly overhauled by Tejun Heo (IIRC). We're talking post 2.6.18 though (again IIRC) - so as far as SATA EH goes, older kernels bear no relation to the new ones. And the initial SATA EH code was, of course, beta :) David PS I can't really contribute to your list - I'm only using cheap desktop hardware.
Re: Time to deprecate old RAID formats?
Jeff Garzik wrote: Neil Brown wrote: As for where the metadata should be placed, it is interesting to observe that the SNIA's DDFv1.2 puts it at the end of the device. And as DDF is an industry standard sponsored by multiple companies it must be .. Sorry. I had intended to say correct, but when it came to it, my fingers refused to type that word in that context. For the record, I have no intention of deprecating any of the metadata formats, not even 0.90. strongly agreed I didn't get a reply to my suggestion of separating the data format and the location... ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data format (0.9 vs 1.0) and a location (end, start, offset4k)? This would certainly make things a lot clearer to new (and old!) users: mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k or mdadm --create /dev/md0 --metadata 1.0 --meta-location start or mdadm --create /dev/md0 --metadata 1.0 --meta-location end resulting in: mdadm --detail /dev/md0 /dev/md0: Version : 01.0 Metadata-locn : End-of-device Creation Time : Fri Aug 4 23:05:02 2006 Raid Level : raid0 You provide rational defaults for mortals, and this approach allows people like Doug to do wacky HA things explicitly. I'm not sure you need any changes to the kernel code - probably just the docs and mdadm. It is conceivable that I could change the default, though that would require a decision as to what the new default would be. I think it would have to be 1.0 or it would cause too much confusion. A newer default would be nice. I also suspect that a *lot* of people will assume that the highest superblock version is the best and should be used for new installs etc. So if you make 1.0 the default then how many users will try 'the bleeding edge' and use 1.2? So then you have a 1.3 which is the same as 1.0? Huh? So to quote from an old Soap: Confused, you will be... 
David
Re: deleting mdadm array?
Janek Kozicki wrote: Hello, I just created a new array /dev/md1 like this:

mdadm --create --verbose /dev/md1 --chunk=64 --level=raid5 \
  --metadata=1.1 --bitmap=internal \
  --raid-devices=3 /dev/hdc2 /dev/sda2 missing

But later I changed my mind, and I wanted to use chunk 128. Do I need to delete this array somehow first, or can I just create an array again (overwriting the current one)? How much later? This will, of course, destroy any data on the array (!) and you'll need to mkfs again... To answer the question though: just run mdadm again to create a new array with new parameters. I think the only time you need to 'delete' an array before creating a new one is if you change the superblock version: since it quietly writes different superblocks to different disk locations, you may end up with 2 superblocks on the disk, and then you get confusion :) (I'm not sure if mdadm is clever about this though...) Also, if you don't mind me asking: why did you choose version 1.1 for the metadata/superblock version? David
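To make the 'different disk locations' concrete, here's a rough sketch of where each v1.x superblock lands on a device. The v1.0 end-of-device arithmetic is my reading of the convention (roughly 8K back from the end, rounded down to a 4K boundary), so treat it as an assumption rather than gospel:

```shell
# Pretend device size; any byte count works.
dev_bytes=$((500 * 1024 * 1024))

sb_v11=0          # v1.1: at the very start of the device
sb_v12=4096       # v1.2: 4K from the start
# v1.0: near the end - assumed here to be 8K back, 4K-aligned.
sb_v10=$(( (dev_bytes - 8192) / 4096 * 4096 ))

echo "v1.1 at $sb_v11, v1.2 at $sb_v12, v1.0 at $sb_v10"
```

The point being: recreating an array with a different metadata version writes its superblock somewhere the old one isn't, which is exactly how two stale superblocks can end up coexisting on one disk.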
Re: Time to deprecate old RAID formats?
Bill Davidsen wrote: Neil Brown wrote: I certainly accept that the documentation is probably less than perfect (by a large margin). I am more than happy to accept patches or concrete suggestions on how to improve that. I always think it is best if a non-developer writes documentation (and a developer reviews it), as then it is more likely to address the issues that a non-developer will want to read about, and in a way that will make sense to a non-developer. (i.e. I'm too close to the subject to write good doco). Patches against what's in 2.6.4 I assume? I can't promise to write anything which pleases even me, but I will take a look at it. The man page is a great place for describing, eg, the superblock location; but don't forget we have http://linux-raid.osdl.org/index.php/Main_Page which is probably a better place for *discussions* (or essays) about the superblock location (eg the LVM / v1.1 comment Janek picked up on). In fact I was going to take some of the writings from this thread and put them up there. David
Re: Time to deprecate old RAID formats?
Doug Ledford wrote: On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote: I don't agree completely. I think the superblock location is a key issue, because if you have a superblock location which moves depending on the filesystem or LVM you use to look at the partition (or full disk), then you need to be even more careful about how to poke at things. This is the heart of the matter. When you consider that each file system and each volume management stack has a superblock, and some store their superblocks at the end of devices and some at the beginning, and they can be stacked, then it becomes next to impossible to make sure a stacked setup is never recognized incorrectly under any circumstance. I wonder if we should not really be talking about superblock versions 1.0, 1.1, 1.2 etc but a data format (0.9 vs 1.0) and a location (end, start, offset4k)? This would certainly make things a lot clearer to new users: mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k mdadm --detail /dev/md0 /dev/md0: Version : 01.0 Metadata-locn : End-of-device Creation Time : Fri Aug 4 23:05:02 2006 Raid Level : raid0 And there you have the deprecation... only two superblock versions and no real changes to code etc David
Re: [BUG] Raid1/5 over iSCSI trouble
From: Dan Williams [EMAIL PROTECTED] Date: Wed, 24 Oct 2007 16:49:28 -0700 Hopefully it is as painless to run on sparc as it is on IA: opcontrol --start --vmlinux=/path/to/vmlinux wait opcontrol --stop opreport --image-path=/lib/modules/`uname -r` -l It is painless, I use it all the time. The only caveat is to make sure the /path/to/vmlinux is the pre-stripped kernel image. The images installed under /boot/ are usually stripped and thus not suitable for profiling.
Re: very degraded RAID5, or increasing capacity by adding discs
On Tue, Oct 09, 2007 at 01:48:50PM +0400, Michael Tokarev wrote: There still is - at least for ext[23]. Even offline resizers can't do resizes from any size to any size; extfs developers recommend recreating the filesystem anyway if the size changes significantly. I'm too lazy to find a reference now; it has been mentioned here on linux-raid at least this year. It's sorta like fat (yea, that ms-dog filesystem) - when you resize it from, say, 501Mb to 999Mb, everything is ok, but if you want to go from 501Mb to 1Gb+1, you have to recreate almost all data structures because the sizes of all internal fields change - and here it's much safer to just re-create it from scratch than to try to modify it in place. Sure it's much better for extfs, but the point is still the same. I'll just mention that I once resized a multi-terabyte ext3 filesystem and it took 8 hours+; a comparable XFS online resize lasted all of 10 seconds!
flaky controller or disk error?
Hi, [using kernel 2.6.23 and mdadm 2.6.3+20070929] I have a rather flaky sata controller with which I am trying to resync a raid5 array. It usually starts failing after 40% of the resync is done. Short of changing the controller (which I will do later this week), is there a way to have mdadm resume the resync where it left off at reboot time? Here is the error I am seeing in the syslog. Can this actually be a disk error?

Oct 18 11:54:34 sylla kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x1 action 0x2 frozen
Oct 18 11:54:34 sylla kernel: ata1.00: irq_stat 0x0040, PHY RDY changed
Oct 18 11:54:34 sylla kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Oct 18 11:54:34 sylla kernel: res 40/00:00:19:26:33/00:00:3a:00:00/40 Emask 0x10 (ATA bus error)
Oct 18 11:54:35 sylla kernel: ata1: soft resetting port
Oct 18 11:54:40 sylla kernel: ata1: failed to reset engine (errno=-95)
Oct 18 11:54:40 sylla kernel: ata1: port is slow to respond, please be patient (Status 0xd0)
Oct 18 11:54:45 sylla kernel: ata1: softreset failed (device not ready)
Oct 18 11:54:45 sylla kernel: ata1: hard resetting port
Oct 18 11:54:46 sylla kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Oct 18 11:54:46 sylla kernel: ata1.00: configured for UDMA/133
Oct 18 11:54:46 sylla kernel: ata1: EH complete
Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Write Protect is off
Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Thanks,
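On the resume question: one thing you can at least do is record how far the resync gets before the controller falls over, straight out of /proc/mdstat. A parsing sketch against a made-up mdstat excerpt (the numbers below are invented for illustration):

```shell
# Hypothetical /proc/mdstat excerpt - the 'resync =' line is what the
# md driver prints while a rebuild is in progress.
cat > /tmp/mdstat.sample <<'EOF'
md0 : active raid5 sda1[0] sdb1[1] sdc1[2] sdd1[3]
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [=======>.............]  resync = 40.1% (195858944/488383936) finish=123.4min speed=39456K/sec
EOF

# Pull out just the percentage complete.
grep -o 'resync = [0-9.]*%' /tmp/mdstat.sample
```

Against the real file that would be `grep -o 'resync = [0-9.]*%' /proc/mdstat`, run from cron or a watch loop so you have a record of where each attempt died.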
Re: RAID 5 performance issue.
On 10/3/07, Andrew Clayton [EMAIL PROTECTED] wrote: On Wed, 3 Oct 2007 12:43:24 -0400 (EDT), Justin Piszcz wrote: Have you checked fragmentation? You know, that never even occurred to me. I've gotten into the mindset that it's generally not a problem under Linux. It's probably not the root cause, but it certainly doesn't help things. At least with XFS you have an easy way to defrag the filesystem without even taking it offline. # xfs_db -c frag -f /dev/md0 actual 1828276, ideal 1708782, fragmentation factor 6.54% Good or bad? Not bad, but not that good, either. Try running xfs_fsr from a nightly cronjob. By default, it will defrag mounted xfs filesystems for up to 2 hours. Typically this is enough to keep fragmentation well below 1%. -Dave
Re: reducing the number of disks a RAID1 expects
Neil Brown wrote: 2.6.12 does support reducing the number of drives in a raid1, but it will only remove drives from the end of the list. e.g. if the state was 58604992 blocks [3/2] [UU_] then it would work. But as it is 58604992 blocks [3/2] [_UU] it won't. You could fail the last drive (hdc8) and then add it back in again. This would move it to the first slot, but it would cause a full resync which is a bit of a waste. Thanks for your help! That's the route I took. It worked ([2/2] [UU]). The only hiccup was that when I rebooted, hdd2 was back in the first slot by itself ([3/1] [U__]). I guess there was some contention in discovery. But all I had to do was physically remove hdd and the remaining two were back to [2/2] [UU]. Since commit 6ea9c07c6c6d1c14d9757dd8470dc4c85bbe9f28 (about 2.6.13-rc4) raid1 will repack the devices to the start of the list when trying to change the number of devices. I couldn't find a newer kernel RPM for FC3, and I was nervous about building a new kernel myself and screwing up my system, so I went the slot rotate route instead. It only took about 20 minutes to resync (a lot faster than trying to build a new kernel). My main concern was that it would discover an unreadable sector while resyncing from the last remaining drive and I would lose the whole array. (That didn't happen, though.) I looked for some mdadm command to check the remaining drive before I failed the last one, to help avoid that worst case scenario, but couldn't find any. Is there some way to do that, for future reference? Cheers, 11011011 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
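On the 'check the remaining drive before failing the last one' question: the simplest test is a full read pass over the partition, and on newer kernels writing 'check' to /sys/block/mdX/md/sync_action does the equivalent array-wide scrub. A sketch below, using a scratch file in place of the real partition (reading the real /dev/hdc8 would need root):

```shell
# /tmp/fakedisk stands in for something like /dev/hdc8; for the real
# partition the read pass would be: dd if=/dev/hdc8 of=/dev/null bs=1M
dd if=/dev/zero of=/tmp/fakedisk bs=1M count=4 2>/dev/null

# A clean end-to-end read means no unreadable sectors were hit.
if dd if=/tmp/fakedisk of=/dev/null bs=1M 2>/dev/null; then
    echo "read pass OK"
else
    echo "read errors - do not degrade the array"
fi
```

Note the sysfs 'check' action only exists on kernels well after the 2.6.12 in this thread, so for that machine the plain dd read pass is the applicable option.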
Re: RAID6 mdadm --grow bug?
Neil, On RHEL5 the kernel is 2.6.18-8.1.8. On Ubuntu 7.04 the kernel is 2.6.20-16. Someone on the Arstechnica forums wrote they see the same thing in Debian etch running kernel 2.6.18. Below is a messages log from the RHEL5 system. I have only included the section for creating the RAID6, adding a spare and trying to grow it. There is a one-line error when I do the mdadm --grow command. It is md: couldn't update array info. -22.

md: bind<loop1>
md: bind<loop2>
md: bind<loop3>
md: bind<loop4>
md: md0: raid array is not clean -- starting background reconstruction
raid5: device loop4 operational as raid disk 3
raid5: device loop3 operational as raid disk 2
raid5: device loop2 operational as raid disk 1
raid5: device loop1 operational as raid disk 0
raid5: allocated 4204kB for md0
raid5: raid level 6 set md0 active with 4 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:4 fd:0
 disk 0, o:1, dev:loop1
 disk 1, o:1, dev:loop2
 disk 2, o:1, dev:loop3
 disk 3, o:1, dev:loop4
md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 102336 blocks.
md: md0: sync done.
RAID5 conf printout:
 --- rd:4 wd:4 fd:0
 disk 0, o:1, dev:loop1
 disk 1, o:1, dev:loop2
 disk 2, o:1, dev:loop3
 disk 3, o:1, dev:loop4
md: bind<loop5>
md: couldn't update array info. -22

David. On Sep 13, 2007, at 3:52 AM, Neil Brown wrote: On Wednesday September 12, [EMAIL PROTECTED] wrote: Problem: The mdadm --grow command fails when trying to add a disk to a RAID6. .. So far I have replicated this problem on RHEL5 and Ubuntu 7.04 running the latest official updates and patches. I have even tried it with the latest version of mdadm, 2.6.3, under RHEL5. RHEL5 uses version 2.5.4. You don't say what kernel version you are using (as I don't use RHEL5 or Ubuntu, I don't know what 'latest' means). 
If it is 2.6.23-rcX, then it is a known problem that should be fixed in the next -rc. If it is something else... I need details. Also, any kernel message (run 'dmesg') might be helpful. NeilBrown
RAID6 mdadm --grow bug?
Problem: The mdadm --grow command fails when trying to add a disk to a RAID6. The man page says it can do this. GROW MODE The GROW mode is used for changing the size or shape of an active array. For this to work, the kernel must support the necessary change. Various types of growth are being added during 2.6 development, including restructuring a raid5 array to have more active devices. Currently the only support available is to

· change the size attribute for RAID1, RAID5 and RAID6.
· increase the raid-disks attribute of RAID1, RAID5, and RAID6.
· add a write-intent bitmap to any array which supports these bitmaps, or remove a write-intent bitmap from such an array.

So far I have replicated this problem on RHEL5 and Ubuntu 7.04 running the latest official updates and patches. I have even tried it with the latest version of mdadm, 2.6.3, under RHEL5. RHEL5 uses version 2.5.4. How to replicate the problem: You can either use real physical disks or use the loopback device to create fake disks. Here are the steps using the loopback method as root:

cd /tmp
dd if=/dev/zero of=rd1 bs=10240 count=10240
cp rd1 rd2; cp rd1 rd3; cp rd1 rd4; cp rd1 rd5
losetup /dev/loop1 rd1; losetup /dev/loop2 rd2; losetup /dev/loop3 rd3; losetup /dev/loop4 rd4; losetup /dev/loop5 rd5
mdadm --create --verbose /dev/md0 --level=6 --raid-devices=4 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4

At this point wait a minute while the raid is being built.

mdadm --add /dev/md0 /dev/loop5
mdadm --grow /dev/md0 --raid-devices=5

You should get the following error:

mdadm: Need to backup 384K of critical section..
mdadm: Cannot set device size/shape for /dev/md0: Invalid argument

How to clean up:

mdadm --stop /dev/md0
mdadm --remove /dev/md0
losetup -d /dev/loop1; losetup -d /dev/loop2; losetup -d /dev/loop3; losetup -d /dev/loop4; losetup -d /dev/loop5
rm rd1 rd2 rd3 rd4 rd5

David. 
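The backing-file setup in the recipe above can be tightened into a loop. Only the file creation is shown runnable here, since losetup and mdadm need root; the files are created sparse (a seek instead of writing zeros), which is an assumption that the grow test doesn't care about actual allocation:

```shell
cd /tmp
# Same five 100MB backing files (10240 * 10240 bytes each) as the
# dd-then-cp sequence, created sparse so this is instant.
for i in 1 2 3 4 5; do
    dd if=/dev/zero of=rd$i bs=10240 count=0 seek=10240 2>/dev/null
done
stat -c %s rd1   # -> 104857600
```

From there the losetup/mdadm steps are as in the recipe.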
Re: reducing the number of disks a RAID1 expects
Richard Scobie wrote: Have a look at the Grow Mode section of the mdadm man page. Thanks! I overlooked that, although I did look at the man page before posting. It looks as though you should just need to use the same command you used to grow it to 3 drives, except specify only 2 this time. I think I hot-added it. Anyway, --grow looks like what I need, but I'm having some difficulty with it. The man page says, Change the size or shape of an active array. But I got: [EMAIL PROTECTED] ~]# mdadm --grow /dev/md5 -n2 mdadm: Cannot set device size/shape for /dev/md5: Device or resource busy [EMAIL PROTECTED] ~]# umount /dev/md5 [EMAIL PROTECTED] ~]# mdadm --grow /dev/md5 -n2 mdadm: Cannot set device size/shape for /dev/md5: Device or resource busy So I tried stopping it, but got: [EMAIL PROTECTED] ~]# mdadm --stop /dev/md5 [EMAIL PROTECTED] ~]# mdadm --grow /dev/md5 -n2 mdadm: Cannot get array information for /dev/md5: No such device [EMAIL PROTECTED] ~]# mdadm --query /dev/md5 --scan /dev/md5: is an md device which is not active /dev/md5: is too small to be an md component. [EMAIL PROTECTED] ~]# mdadm --grow /dev/md5 --scan -n2 mdadm: option s not valid in grow mode Am I trying the right thing, but running into some limitation of my version of mdadm or the kernel? Or am I overlooking something fundamental yet again? md5 looked like this in /proc/mdstat before I stopped it: md5 : active raid1 hdc8[2] hdg8[1] 58604992 blocks [3/2] [_UU] For -n the man page says, This number can only be changed using --grow for RAID1 arrays, and only on kernels which provide necessary support. Grow mode says, Various types of growth may be added during 2.6 development, possibly including restructuring a raid5 array to have more active devices. Currently the only support available is to change the size attribute for arrays with redundancy, and the raid-disks attribute of RAID1 arrays. ... 
When reducing the number of devices in a RAID1 array, the slots which are to be removed from the array must already be vacant. That is, the devices that were in those slots must be failed and removed. I don't know how I overlooked all that the first time, but I can't see what I'm overlooking now. mdadm - v1.6.0 - 4 June 2004 Linux 2.6.12-1.1381_FC3 #1 Fri Oct 21 03:46:55 EDT 2005 i686 athlon i386 GNU/Linux Cheers, 11011011
reducing the number of disks a RAID1 expects
My /dev/hdd started failing its SMART check, so I removed it from a RAID1: # mdadm /dev/md5 -f /dev/hdd2 -r /dev/hdd2 Now when I boot it looks like this in /proc/mdstat: md5 : active raid1 hdc8[2] hdg8[1] 58604992 blocks [3/2] [_UU] and I get a DegradedArray event on /dev/md5 email on every boot from mdadm monitoring. I only need 2 disks in md5 now. How can I stop it from being considered degraded? I added a 3rd disk a while ago just because I got a new disk with plenty of space, and little /dev/hdd was getting old. mdadm - v1.6.0 - 4 June 2004 Linux 2.6.12-1.1381_FC3 #1 Fri Oct 21 03:46:55 EDT 2005 i686 athlon i386 GNU/Linux Cheers, 11011011 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
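Pulling the replies together: on a kernel and mdadm new enough to support changing -n on RAID1, the fix would look roughly like the sketch below (device names as in the post; --grow must be run against the *active* array, with the slot to be dropped already vacant). This is a hedged reconstruction of the procedure, not a transcript from the thread:

```shell
# the failing member was already failed and removed earlier:
#   mdadm /dev/md5 -f /dev/hdd2 -r /dev/hdd2
# with the third slot vacant, shrink the expected device count on the live array:
mdadm --grow /dev/md5 --raid-devices=2
# /proc/mdstat should then show [2/2] [UU] and the DegradedArray mails stop
cat /proc/mdstat
```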
Re: Patch for boot-time assembly of v1.x-metadata-based soft (MD) arrays: reasoning and future plans
Dan Williams wrote: On 8/26/07, Abe Skolnik [EMAIL PROTECTED] wrote: Because you can rely on the configuration file to be certain about which disks to pull in and which to ignore. Without the config file the auto-detect routine may not always do the right thing because it will need to make assumptions. But kernel parameters can provide the same data, no? After all, it is Yes, you can get a similar effect of the config file by adding parameters to the kernel command line. My only point is that if the initramfs update tools were as simple as: mkinitrd root=/dev/md0 md=0,v1,/dev/sda1,/dev/sdb1,/dev/sdc1 ...then using an initramfs becomes the same amount of work as editing /etc/grub.conf. I'm not sure if you're aware of this so in the spirit of being helpful I thought I'd point out that this has been discussed quite extensively in the past. You may want to take a look at the list archives. It's quite complex and whilst I too would like autodetect to continue to work (I've been 'bitten' by the new superblocks not autodetecting) I accept the arguments of those wiser than me. I agree that the problem is one of old dogs and the new (!) initram trick. Look at it as an opportunity to learn more about a cool capability. David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4 Port eSATA RAID5/JBOD PCI-E 8x Controller
Richard Scobie wrote: This looks like a potentially good, cheap candidate for md use. Although Linux support is not explicitly mentioned, SiI 3124 is used. http://www.addonics.com/products/host_controller/ADSA3GPX8-4e.asp Thanks Richard. FWIW I find this kind of info useful. David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWAP file on a RAID-10 array possible?
Tomas France wrote: Hi everyone, I apologize for asking such a fundamental question on the Linux-RAID list but the answers I found elsewhere have been contradicting one another. So, is it possible to have a swap file on a RAID-10 array? yes. mkswap /dev/mdX swapon /dev/mdX Should you use RAID-10 for swap? That's philosophy :) David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
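To make swap on an md array persistent across boots, an /etc/fstab entry is all that's needed on top of the commands above; a minimal sketch, assuming the array is /dev/md2 (device name is an example) and mkswap has already been run:

```
# /etc/fstab — swap on a RAID-10 md array
/dev/md2   none   swap   sw   0   0
```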
Re: SWAP file on a RAID-10 array possible?
Tomas France wrote: Thanks for the answer, David! you're welcome By the way, does anyone know if there is a comprehensive how-to on software RAID with mdadm available somewhere? I mean a website where I could get answers to questions like How to convert your system from no RAID to RAID-1, from RAID-1 to RAID-5/10, how to setup LILO/GRUB to boot from a RAID-1 array etc. Don't get me wrong, I have done my homework and found a lot of info on the topic but a lot of it is several years old and many things have changed since then. And it's quite scattered too... Yes, I got to thinking about that. About a year ago I got a copy of the old raid FAQ from the authors and have started to modify it when I have time and it bubbles up my list. Try http://linux-raid.osdl.org/ It's a community wiki - welcome to the community :) Please feel free to edit... Also feel free to post questions/suggestions here if you're not sure. David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Moving RAID distro
Richard Grundy wrote: Hello, I was just wondering if it's possible to move my RAID5 array to another distro, same machine just a different flavor of Linux. Yes. The only problem will be if it is the root filesystem (unlikely). Would it just be a case of running: sudo mdadm --create --verbose /dev/md0 --level=5 --raid-devices=5 /dev/sdb /dev/sdc etc? No. That's kinda dangerous since it overwrites the superblocks (it prompts first). If you make the slightest error in, eg, the order of the devices (which may be different under a new kernel since modules may load in a different order etc) then you'll end up back here asking about recovering from a broken array (which you can do but you don't want to.) Or do I need to run some other command? Yes. You want to use mdadm --assemble It's quite possible that your distro will include mdadm support scripts that will find and assemble the array automatically. David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
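The distro support scripts mentioned above generally drive mdadm --assemble from /etc/mdadm.conf; a hedged sketch of what carrying the array over looks like (the ARRAY line is illustrative output and the UUID is a placeholder, not real values):

```
# on the new distro, record the existing superblocks:
#   mdadm --examine --scan >> /etc/mdadm.conf
# which appends something like:
DEVICE partitions
ARRAY /dev/md0 level=raid5 num-devices=5 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
# after which the array assembles non-destructively with:
#   mdadm --assemble --scan
```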
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
Paul Clements wrote: Well, if people would like to see a timeout option, I actually coded up a patch a couple of years ago to do just that, but I never got it into mainline because you can do almost as well by doing a check at user-level (I basically ping the nbd connection periodically and if it fails, I kill -9 the nbd-client). Yes please. David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
On Mon, 13 Aug 2007, David Greaves wrote: [EMAIL PROTECTED] wrote: per the message below MD (or DM) would need to be modified to work reasonably well with one of the disk components being over an unreliable link (like a network link) are the MD/DM maintainers interested in extending their code in this direction? or would they prefer to keep it simpler by being able to continue to assume that the raid components are connected over a highly reliable connection? if they are interested in adding (and maintaining) this functionality then there is a real possibility that NBD+MD/DM could eliminate the need for DRBD. however if they are not interested in adding all the code to deal with the network type issues, then the argument that DRBD should not be merged because you can do the same thing with MD/DM + NBD is invalid and can be dropped/ignored David Lang As a user I'd like to see md/nbd be extended to cope with unreliable links. I think md could be better in handling link exceptions. My unreliable memory recalls sporadic issues with hot-plug leaving md hanging and certain lower level errors (or even very high latency) causing unsatisfactory behaviour in what is supposed to be a fault 'tolerant' subsystem. Would this just be relevant to network devices or would it improve support for jostled usb and sata hot-plugging I wonder? good question, I suspect that some of the error handling would be similar (for devices that are unreachable not hanging the system for example), but a lot of the rest would be different (do you really want to try to auto-resync to a drive that you _think_ just reappeared, what if it's a different drive? how can you be sure?) the error rate of a network is going to be significantly higher than for USB or SATA drives (although I suppose iscsi would be similar) David Lang - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
[EMAIL PROTECTED] wrote: Would this just be relevant to network devices or would it improve support for jostled usb and sata hot-plugging I wonder? good question, I suspect that some of the error handling would be similar (for devices that are unreachable not hanging the system for example), but a lot of the rest would be different (do you really want to try to auto-resync to a drive that you _think_ just reappeared, Well, omit 'think' and the answer may be yes. A lot of systems are quite simple and RAID is common on the desktop now. If jostled USB fits into this category - then yes. what if it's a different drive? how can you be sure? And that's the key, isn't it? We have the RAID device UUID and the superblock info. Isn't that enough? If not then given the work involved an extended superblock wouldn't be unreasonable. And I suspect the capability of devices would need recording in the superblock too? eg 'retry-on-fail' I can see how md would fail a device but may now periodically retry it. If a retry shows that it's back then it would validate it (UUID) and then resync it. ) the error rate of a network is going to be significantly higher than for USB or SATA drives (although I suppose iscsi would be similar) I do agree - I was looking for value-add for the existing subsystem. If this benefits existing RAID users then it's more likely to be attractive. David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
On Sun, 12 Aug 2007, Jan Engelhardt wrote: On Aug 12 2007 13:35, Al Boldi wrote: Lars Ellenberg wrote: meanwhile, please, anyone interested, the drbd paper for LinuxConf Eu 2007 is finalized. http://www.drbd.org/fileadmin/drbd/publications/drbd8.linux-conf.eu.2007.pdf but it does give a good overview about what DRBD actually is, what exact problems it tries to solve, and what developments to expect in the near future. so you can make up your mind about Do we need it?, and Why DRBD? Why not NBD + MD-RAID? I may have made a mistake when asking for how it compares to NBD+MD. Let me retry: what's the functional difference between GFS2 on a DRBD vs. GFS2 on a DAS SAN? GFS is a distributed filesystem, DRBD is a replicated block device. you wouldn't do GFS on top of DRBD, you would do ext2/3, XFS, etc DRBD is much closer to the NBD+MD option. now, I am not an expert on either option, but there are a couple of things that I would question about the NBD+MD option 1. when the remote machine is down, how does MD deal with it for reads and writes? 2. MD over local drive will alternate reads between mirrors (or so I've been told), doing so over the network is wrong. 3. when writing, will MD wait for the network I/O to get the data saved on the backup before returning from the syscall? or can it sync the data out lazily Now, shared remote block access should theoretically be handled, as does DRBD, by a block layer driver, but realistically it may be more appropriate to let it be handled by the combining end user, like OCFS or GFS. there are times when you want to replicate at the block layer, and there are times when you want to have a filesystem do the work. don't force a filesystem on use-cases where a block device is the right answer. David Lang - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
per the message below MD (or DM) would need to be modified to work reasonably well with one of the disk components being over an unreliable link (like a network link) are the MD/DM maintainers interested in extending their code in this direction? or would they prefer to keep it simpler by being able to continue to assume that the raid components are connected over a highly reliable connection? if they are interested in adding (and maintaining) this functionality then there is a real possibility that NBD+MD/DM could eliminate the need for DRBD. however if they are not interested in adding all the code to deal with the network type issues, then the argument that DRBD should not be merged because you can do the same thing with MD/DM + NBD is invalid and can be dropped/ignored David Lang On Sun, 12 Aug 2007, Paul Clements wrote: Iustin Pop wrote: On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote: On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote: now, I am not an expert on either option, but there are a couple of things that I would question about the NBD+MD option 1. when the remote machine is down, how does MD deal with it for reads and writes? I suppose it kicks the drive and you'd have to re-add it by hand unless done by a cronjob. Yes, and with a bitmap configured on the raid1, you just resync the blocks that have been written while the connection was down. From my tests, since NBD doesn't have a timeout option, MD hangs in the write to that mirror indefinitely, somewhat like when dealing with a broken IDE driver/chipset/disk. Well, if people would like to see a timeout option, I actually coded up a patch a couple of years ago to do just that, but I never got it into mainline because you can do almost as well by doing a check at user-level (I basically ping the nbd connection periodically and if it fails, I kill -9 the nbd-client). 2. MD over local drive will alternate reads between mirrors (or so I've been told), doing so over the network is wrong. 
Certainly. In which case you set write_mostly (or even write_only, not sure of its name) on the raid component that is nbd. 3. when writing, will MD wait for the network I/O to get the data saved on the backup before returning from the syscall? or can it sync the data out lazily Can't answer this one - ask Neil :) MD has the write-mostly/write-behind options - which help in this case but only up to a certain amount. You can configure write_behind (aka, asynchronous writes) to buffer as much data as you have RAM to hold. At a certain point, presumably, you'd want to just break the mirror and take the hit of doing a resync once your network leg falls too far behind. -- Paul - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
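Paul's write-mostly/write-behind suggestion translates into mdadm options along these lines (a command sketch with assumed device names; note that --write-behind only takes effect with a write-intent bitmap, and --write-mostly applies to the devices listed after it):

```shell
# RAID1 with a local disk and an nbd leg; reads avoid the network leg,
# and writes to it may lag by up to 256 outstanding requests
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      --bitmap=internal --write-behind=256 \
      /dev/sda1 --write-mostly /dev/nbd0
```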
Re: Raid array is not automatically detected.
dean gaudet wrote: On Mon, 16 Jul 2007, David Greaves wrote: Bryan Christ wrote: I do have the type set to 0xfd. Others have said that auto-assemble only works on RAID 0 and 1, but just as Justin mentioned, I too have another box with RAID5 that gets auto assembled by the kernel (also no initrd). I expected the same behavior when I built this array--again using mdadm instead of raidtools. Any md arrays with partition type 0xfd using a 0.9 superblock should be auto-assembled by a standard kernel. no... debian (and probably ubuntu) do not build md into the kernel, they build it as a module, and the module does not auto-detect 0xfd. i don't know anything about slackware, but i just felt it worth commenting that a standard kernel is not really descriptive enough. Good point - I should have mentioned the non-module bit! http://linux-raid.osdl.org/index.php/Autodetect David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Possible data corruption sata_sil24?
On Wed, Jul 18, 2007 at 05:53:39PM +0900, Tejun Heo wrote: David Shaw wrote: It fails whether I use a raw /dev/sdd or partition it into one large /dev/sdd1, or partition into multiple partitions. sata_sil24 seems to work by itself, as does dm, but as soon as I mix sata_sil24+dm, I get corruption. Hmmm... Can you reproduce the corruption by accessing both devices simultaneously without using dm? Considering ich5 does fine, it looks like hardware and/or driver problem and I really wanna rule out dm. I think I wasn't clear enough before. The corruption happens when I use dm to create two dm mappings that both reside on the same real device. Using two different devices, or two different partitions on the same physical device works properly. ich5 does fine with these 3 tests, but sata_sil24 fails: * /dev/sdd, create 2 dm linear mappings on it, mke2fs and use those dm devices == corruption * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, mke2fs and use those partitions == no corruption * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, create 2 dm linear mappings on /dev/sdd1, mke2fs and use those dm devices == corruption I'm not sure whether this is problem of sata_sil24 or dm layer. Cc'ing linux-raid for help. How much memory do you have? One big difference between ata_piix and sata_sil24 is that sil24 can handle 64bit DMA. Maybe dma mapping or something interacts weirdly with dm there? The machine has 640 megs of RAM. FWIW, I tried this with 512 megs of RAM with the same results. Running Memtest86+ shows the memory is good. David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raid array is not automatically detected.
Bryan Christ wrote: I'm now very confused... It's all that top-posting... When I run mdadm --examine /dev/md0 I get the error message: No superblock detected on /dev/md0 However, when I run mdadm -D /dev/md0 the report clearly states Superblock is persistent David Greaves wrote: * are the superblocks version 0.9? (mdadm --examine /dev/component) See where it says 'component' ? :) I wish mdadm --detail and --examine were just aliases and the output varied according to whether you looked at a component (eg /dev/sda1) or an md device (/dev/md0) I get that wrong *all* the time... David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3ware 9650 tips
On Mon, Jul 16, 2007 at 12:41:15PM +1000, David Chinner wrote: On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote: On Fri, 13 Jul 2007, Jon Collette wrote: Wouldn't Raid 6 be slower than Raid 5 because of the extra fault tolerance? http://www.enterprisenetworksandservers.com/monthly/art.php?1754 - 20% drop according to this article His 500GB WD drives are 7200RPM compared to the Raptors 10K. So his numbers will be slower. Justin what file system do you have running on the Raptors? I think that's an interesting point made by Joshua. I use XFS: When it comes to bandwidth, there is good reason for that. Trying to stick with a supported config as much as possible, I need to run ext3. As per usual, though, initial ext3 numbers are less than impressive. Using bonnie++ to get a baseline, I get (after doing 'blockdev --setra 65536' on the device): Write: 136MB/s Read: 384MB/s Proving it's not the hardware, with XFS the numbers look like: Write: 333MB/s Read: 465MB/s Those are pretty typical numbers. In my experience, ext3 is limited to about 250MB/s buffered write speed. It's not disk limited, it's design limited. e.g. on a disk subsystem where XFS was getting 4-5GB/s buffered write, ext3 was doing 250MB/s. http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf If you've got any sort of serious disk array, ext3 is not the filesystem to use To show what the difference is, I used blktrace and Chris Mason's seekwatcher script on a simple, single threaded dd command on a 12 disk dm RAID0 stripe: # dd if=/dev/zero of=/mnt/scratch/fred bs=1024k count=10k; sync http://oss.sgi.com/~dgc/writes/ext3_write.png http://oss.sgi.com/~dgc/writes/xfs_write.png You can see from the ext3 graph that it comes to a screeching halt every 5s (probably when pdflush runs) and at all other times the seek rate is 10,000 seeks/s. That's pretty bad for a brand new, empty filesystem and the only way it is sustained is the fact that the disks have their write caches turned on. 
ext4 will probably show better results, but I haven't got any of the tools installed to be able to test it The XFS pattern shows consistently an order of magnitude less seeks and consistent throughput above 600MB/s. To put the number of seeks in context, XFS is doing 512k I/Os at about 1200-1300 per second. The number of seeks? A bit above 10^3 per second or roughly 1 seek per I/O which is pretty much optimal. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raid array is not automatically detected.
Bryan Christ wrote: I do have the type set to 0xfd. Others have said that auto-assemble only works on RAID 0 and 1, but just as Justin mentioned, I too have another box with RAID5 that gets auto assembled by the kernel (also no initrd). I expected the same behavior when I built this array--again using mdadm instead of raidtools. Any md arrays with partition type 0xfd using a 0.9 superblock should be auto-assembled by a standard kernel. If you want to boot from them you must ensure the kernel image is on a partition that the bootloader can read - i.e. RAID 1, where each mirror looks like a plain filesystem. This is nothing to do with auto-assembly. So some questions: * are the partitions 0xfd ? yes. * is the kernel standard? * are the superblocks version 0.9? (mdadm --examine /dev/component) David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3ware 9650 tips
On Mon, Jul 16, 2007 at 10:50:34AM -0500, Eric Sandeen wrote: David Chinner wrote: On Mon, Jul 16, 2007 at 12:41:15PM +1000, David Chinner wrote: On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote: ... If you've got any sort of serious disk array, ext3 is not the filesystem to use To show what the difference is, I used blktrace and Chris Mason's seekwatcher script on a simple, single threaded dd command on a 12 disk dm RAID0 stripe: # dd if=/dev/zero of=/mnt/scratch/fred bs=1024k count=10k; sync http://oss.sgi.com/~dgc/writes/ext3_write.png http://oss.sgi.com/~dgc/writes/xfs_write.png Were those all with default mkfs mount options? ext3 in writeback mode might be an interesting comparison too. Defaults. i.e. # mkfs.ext3 /dev/mapper/dm0 # mkfs.xfs /dev/mapper/dm0 The mkfs.xfs picked up sunit/swidth correctly from the dm volume. Last time I checked, writeback made little difference to ext3 throughput; maybe 5-10% at most. I'll run it again later today... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3ware 9650 tips
On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote: On Fri, 13 Jul 2007, Jon Collette wrote: Wouldn't Raid 6 be slower than Raid 5 because of the extra fault tolerance? http://www.enterprisenetworksandservers.com/monthly/art.php?1754 - 20% drop according to this article His 500GB WD drives are 7200RPM compared to the Raptors 10K. So his numbers will be slower. Justin what file system do you have running on the Raptors? I think that's an interesting point made by Joshua. I use XFS: When it comes to bandwidth, there is good reason for that. Trying to stick with a supported config as much as possible, I need to run ext3. As per usual, though, initial ext3 numbers are less than impressive. Using bonnie++ to get a baseline, I get (after doing 'blockdev --setra 65536' on the device): Write: 136MB/s Read: 384MB/s Proving it's not the hardware, with XFS the numbers look like: Write: 333MB/s Read: 465MB/s Those are pretty typical numbers. In my experience, ext3 is limited to about 250MB/s buffered write speed. It's not disk limited, it's design limited. e.g. on a disk subsystem where XFS was getting 4-5GB/s buffered write, ext3 was doing 250MB/s. http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf If you've got any sort of serious disk array, ext3 is not the filesystem to use How many folks are using these? Any tuning tips? Make sure you tell XFS the correct sunit/swidth. For hardware raid5/6, sunit = per-disk chunksize, swidth = number of *data* disks in array. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
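Dave's rule (sunit = per-disk chunk size, swidth = number of *data* disks) is easy to get backwards, so here is a small illustrative helper that builds the mkfs.xfs -d geometry string from it. This is a hypothetical sketch, not part of any real tool; the level-to-parity mapping is my own shorthand for the rule above:

```python
def xfs_geometry(chunk_kib, total_disks, level):
    """mkfs.xfs -d su/sw options per the rule above:
    su = per-disk chunk size, sw = *data* disks (parity excluded)."""
    parity = {0: 0, 5: 1, 6: 2}[level]  # parity disks consumed by each RAID level
    return "su=%dk,sw=%d" % (chunk_kib, total_disks - parity)

# e.g. a 6-disk hardware RAID6 with a 64KiB per-disk chunk:
print("mkfs.xfs -d " + xfs_geometry(64, 6, 6))  # -> su=64k,sw=4
```

On a dm/md device, as noted above, recent mkfs.xfs usually picks sunit/swidth up automatically; the helper matters mainly for hardware RAID, where the controller hides the geometry.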
Re: mdadm create to existing raid5
Guy Watkins wrote: } [EMAIL PROTECTED] On Behalf Of Jon Collette } I wasn't thinking and did a mdadm --create to my existing raid5 instead } of --assemble. The syncing process ran and now its not mountable. Is } there anyway to recover from this? Maybe. Not really sure. But don't do anything until someone that really knows answers! I agree - Yes, maybe. What I think... If you did a create with the exact same parameters the data should not have changed. But you can't mount so you must have used different parameters. I'd agree. Only 1 disk was written to during the create. Yep. Only that disk was changed. Yep. If you remove the 1 disk and do another create with the original parameters and put missing for the 1 disk your array will be back to normal, but degraded. Once you confirm this you can add back the 1 disk. Yep. **WARNING** **WARNING** **WARNING** At this point you are relatively safe (!) but as soon as you do an 'add' and initiate another resync then if you got it wrong you will have toasted your data completely!!! **WARNING** **WARNING** **WARNING** You must be able to determine which disk was written to. I don't know how to do that unless you have the output from mdadm -D during the create/syncing. Do you know the *exact* command you issued when you did the initial --create? Do you know the *exact* command you issued when you did the bogus --create? And what version of mdadm you are using? Neil said that it's mdadm, not the kernel, that determines which device is initially degraded during a create. We can look at the code and your command line and guess which device mdadm chose. (Getting this wrong won't matter but it may make recovery quicker.) 
assuming you have a 4 device raid using /dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sdd1 you'll then do something like: mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 missing try a mount mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sda1 missing /dev/sdc1 /dev/sdb1 try a mount mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 /dev/sda1 /dev/sdc1 missing try a mount mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sdc1 /dev/sdb1 /dev/sda1 missing try a mount etc etc. So you'll still need to do a trial and error assemble For a simple 4 device array there are 24 permutations - doable by hand, if you have 5 devices then it's 120, 6 is 720 - getting tricky ;) I'm bored so I'm going to write a script based on something like this: http://www.unix.org.ua/orelly/perl/cookbook/ch04_20.htm Feel free to beat me to it ... The critical thing is that you *must* use 'missing' when doing these trial --create calls. If we've not explained something very well and you don't understand then please ask before trying it out... David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mdadm create to existing raid5
David Greaves wrote: For a simple 4 device array there are 24 permutations - doable by hand, if you have 5 devices then it's 120, 6 is 720 - getting tricky ;) Oh, wait, for 4 devices there are 24 permutations - and you need to do it 4 times, substituting 'missing' for each device - so 96 trials. 4320 trials for a 6 device array. Hmm. I've got a 7 device raid 6 - I think I'll go and make a note of how it's put together... grin Have a look at this section and the linked script. I can't test it until later http://linux-raid.osdl.org/index.php/RAID_Recovery http://linux-raid.osdl.org/index.php/Permute_array.pl David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
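The counting above (96 trials for a 4-device array) can be checked, and the trial command lines generated mechanically, with a few lines of Python; a sketch in the spirit of the Permute_array.pl script linked above, using placeholder device names:

```python
from itertools import permutations

def candidate_creates(devices, md="/dev/md0", level=5):
    """Yield every trial 'mdadm --create' line: each ordering of
    all-but-one device, with 'missing' in each possible slot."""
    n = len(devices)
    for miss in range(n):                       # slot that gets 'missing'
        for perm in permutations(devices, n - 1):
            slots = list(perm)
            slots.insert(miss, "missing")
            yield ("mdadm --create --verbose %s --level=%d "
                   "--raid-devices=%d %s" % (md, level, n, " ".join(slots)))

trials = list(candidate_creates(["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]))
print(len(trials))  # 4 slots for 'missing' x P(4,3) orderings = 96 trials
```

Each generated line keeps the critical property stressed above: exactly one 'missing' slot, so the trial create never resyncs and never overwrites data.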
Re: Proposed enhancement to mdadm: Allow --write-behind= to be done in grow mode.
Ian Dall wrote: There doesn't seem to be any designated place to send bug reports and feature requests for mdadm, so I hope I am doing the right thing by sending it here. I have a small patch to mdadm which allows the write-behind amount to be set at array grow time (instead of only at build or create time). I have tested this fairly extensively on some arrays built out of loop back devices, and once on a real live array. I haven't lost any data and it seems to work OK, though it is possible I am missing something. Sounds like a useful feature... Did you test the bitmap cases you mentioned? David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
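For context, the usage the patch enables would presumably look like the sketch below (hypothetical syntax, since the patch is not in mainline mdadm; write-behind only has an effect on arrays with a write-intent bitmap and write-mostly members):

```shell
# change the write-behind limit on a live array (proposed, not stock mdadm)
mdadm --grow /dev/md0 --write-behind=512
```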
Re: [linux-pm] Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume
David Chinner wrote: On Fri, Jun 29, 2007 at 12:16:44AM +0200, Rafael J. Wysocki wrote: There are two solutions possible, IMO. One would be to make these workqueues freezable, which is possible, but hacky and Oleg didn't like that very much. The second would be to freeze XFS from within the hibernation code path, using freeze_bdev(). The second is much more likely to work reliably. If freezing the filesystem leaves something in an inconsistent state, then it's something I can reproduce and debug without needing to suspend/resume. FWIW, don't forget you need to thaw the filesystem on resume. I've been a little distracted recently - sorry. I'll re-read the thread and see if there are any test actions I need to complete. I do know that the corruption problems I've been having: a) only happen after hibernate/resume b) only ever happen on one of 2 XFS filesystems c) happen even when the script does xfs_freeze;sync;hibernate;xfs_thaw What happens if a filesystem is frozen and I hibernate? Will it be thawed when I resume? David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k
David Chinner wrote: On Wed, Jun 27, 2007 at 07:20:42PM -0400, Justin Piszcz wrote: For drives with 16MB of cache (in this case, raptors). That's four (4) drives, right? I'm pretty sure he's using 10 - email a few days back... Justin Piszcz wrote: Running test with 10 RAPTOR 150 hard drives, expect it to take awhile until I get the results, avg them etc. :) If so, how do you get a block read rate of 578MB/s from 4 drives? That's 145MB/s per drive Which gives a far more reasonable 60MB/s per drive... David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mdadm usage: creating arrays with helpful names?
(back on list for google's benefit ;) and because there are some good questions and I don't know all the answers... ) Oh, and Neil 'cos there may be a bug ... Richard Michael wrote: On Wed, Jun 27, 2007 at 08:49:22AM +0100, David Greaves wrote: http://linux-raid.osdl.org/index.php/Partitionable Thanks. I didn't know this site existed (Googling even just 'mdadm' doesn't yield it in the first 100 results), and it's helpful. Good ... I got permission to wikify the 'official' linux raid FAQ but it takes time (and motivation!) to update it :) Hopefully it will snowball as people who use it then contribute back, hint ;) As it becomes more valuable to people, more links will be created and Google will notice... What if I don't want a partitioned array? I simply want the name to be nicer than the /dev/mdX or /dev/md/XX style. (p1 still gives me /dev/nicename /dev/nicename0, as your page indicates.) --auto md
mdadm --create /dev/strawberry --auto md ...
[EMAIL PROTECTED]:/tmp # mdadm --detail /dev/strawberry
/dev/strawberry:
        Version : 00.90.03
  Creation Time : Thu Jun 28 08:25:06 2007
     Raid Level : raid4
Also, when I use --create /dev/nicename --auto=p1 (for example), I also see /dev/md_d126 created. Why? There is then a /sys/block/md_d126 entry (presumably created by the md driver), but no /sys/block/nicename entry. Why? Not sure who creates this, mdadm or udev. The code isn't that hard to read and you sound like you'd follow it if you fancied a skim-read... I too would expect that there should be a /sys/block/nicename - is this a bug, Neil? These options don't see a lot of use - I recently came across a bug in the --auto pX option... Finally, --stop /dev/nicename doesn't remove any of the aforementioned /dev or /sys entries. I don't suppose that it should, but an mdadm command to do this would be helpful. So, how do I remove the oddly named /sys entries? (I removed the /dev entries with rm.) 
man mdadm indicates --stop releases all resources, but it doesn't (and probably shouldn't). '--stop' with mdadm does release the 'resources', i.e. the components you used. It doesn't remove the array. There is no delete - I guess since an rm is just as effective unless you use a nicename... [I think there should be a symmetry to the mdadm options --create/--delete and --start/--stop. It's *convenient* that --create also starts the array, but this conflates the issue a bit..] I want to stop and completely remove all trace of the array. (Especially as I'm experimenting with this over loopback, and stuff hanging around irritates the lo driver.) You're possibly mixing two things up here... Releasing the resources with a --stop would let you re-use a lo device in another array. You don't _need_ --delete (or rm). However, md does write superblocks to the components and *mdadm* warns you that the loopback has a valid superblock:
mdadm: /dev/loop1 appears to be part of a raid array: level=raid4 devices=6 ctime=Thu Jun 21 09:46:27 2007
[hmm, I can see why you may think it's part of an 'active' array] You could do mdadm --zero-superblock to clean the component, or just say yes when mdadm asks you to continue. See:
# mdadm --create /dev/strawberry --auto md --level=4 -n 6 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5 /dev/loop6
mdadm: /dev/loop1 appears to be part of a raid array: level=raid4 devices=6 ctime=Thu Jun 28 08:25:06 2007 blah
Continue creating array? yes
mdadm: array /dev/strawberry started.
# mdadm --stop /dev/strawberry
mdadm: stopped /dev/strawberry
# mdadm --create /dev/strawberry --auto md --level=4 -n 6 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5 /dev/loop6
mdadm: /dev/loop1 appears to be part of a raid array: level=raid4 devices=6 ctime=Thu Jun 28 09:07:29 2007 blah
Continue creating array? yes
mdadm: array /dev/strawberry started. 
David
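To really remove all trace of such a loopback test array, the sequence sketched below should do it (the device names are the hypothetical ones from this thread; --stop and --zero-superblock are standard mdadm options, the rest is plain shell - run as root):

```shell
# Stop the array, releasing its component devices
mdadm --stop /dev/strawberry

# Wipe the md superblock from each component, so nothing
# autodetects them or warns about them later
for dev in /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5 /dev/loop6; do
    mdadm --zero-superblock "$dev"
done

# The device node itself is just a name; remove it by hand
rm /dev/strawberry
```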
Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k
On Thu, Jun 28, 2007 at 04:27:15AM -0400, Justin Piszcz wrote: On Thu, 28 Jun 2007, Peter Rabbitson wrote: Justin Piszcz wrote:
mdadm --create \
  --verbose /dev/md3 \
  --level=5 \
  --raid-devices=10 \
  --chunk=1024 \
  --force \
  --run /dev/sd[cdefghijkl]1
Justin. Interesting, I came up with the same results (1M chunk being superior) with a completely different raid set with XFS on top:
mdadm --create \
  --level=10 \
  --chunk=1024 \
  --raid-devices=4 \
  --layout=f3 \
  ...
Could it be attributed to XFS itself? More likely it's related to the I/O size being sent to the disks. The larger the chunk size, the larger the I/O hitting each disk. I think the maximum I/O size is 512k ATM on x86(_64), so a chunk of 1MB will guarantee that there are maximally sized I/Os being sent to the disk. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
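Dave's 512k figure makes the arithmetic simple; a sketch (treat the 512k per-request ceiling as an assumption - the exact limit depends on the kernel and driver):

```shell
chunk_kb=1024    # the --chunk=1024 used in the mdadm commands above
max_io_kb=512    # assumed maximum request size on x86(_64)
# how many maximally sized requests service one chunk
echo "each ${chunk_kb}k chunk is serviced by $((chunk_kb / max_io_kb)) maximally sized requests"
```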
Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume
On Wed, Jun 27, 2007 at 08:49:24PM, Pavel Machek wrote: Hi! FWIW, I'm on record stating that sync is not sufficient to quiesce an XFS filesystem for a suspend/resume to work safely and have argued that the only Hmm, so XFS writes to disk even when its threads are frozen? They issue async I/O before they sleep and expect processing to be done on I/O completion via workqueues. safe thing to do is freeze the filesystem before suspend and thaw it after resume. This is why I originally asked you to test that with the other problem Could you add that to the XFS threads if it is really required? They do know that they are being frozen for suspend. We don't suspend the threads on a filesystem freeze - they continue to run. A filesystem freeze guarantees the filesystem is clean and that the in-memory state matches what is on disk. It is not possible for the filesystem to issue I/O or have outstanding I/O when it is in the frozen state, so the state of the threads and/or workqueues does not matter because they will be idle. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [linux-pm] Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume
On Fri, Jun 29, 2007 at 12:16:44AM +0200, Rafael J. Wysocki wrote: There are two solutions possible, IMO. One would be to make these workqueues freezable, which is possible, but hacky and Oleg didn't like that very much. The second would be to freeze XFS from within the hibernation code path, using freeze_bdev(). The second is much more likely to work reliably. If freezing the filesystem leaves something in an inconsistent state, then it's something I can reproduce and debug without needing to suspend/resume. FWIW, don't forget you need to thaw the filesystem on resume. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: mdadm usage: creating arrays with helpful names?
Richard Michael wrote: How do I create an array with a helpful name? i.e. /dev/md/storage? The mdadm man page hints at this in the discussion of the --auto option in the ASSEMBLE MODE section, but doesn't clearly indicate how it's done. Must I create the device nodes by hand first using MAKEDEV? Does this help? http://linux-raid.osdl.org/index.php/Partitionable David
Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k
On Wed, Jun 27, 2007 at 07:20:42PM -0400, Justin Piszcz wrote: For drives with 16MB of cache (in this case, raptors). That's four (4) drives, right? If so, how do you get a block read rate of 578MB/s from 4 drives? That's 145MB/s per drive. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: limits on raid
On Fri, 22 Jun 2007, David Greaves wrote: That's not a bad thing - until you look at the complexity it brings - and then consider the impact and exceptions when you do, eg hardware acceleration? md information fed up to the fs layer for xfs? simple long term maintenance? Often these problems are well worth the benefits of the feature. I _wonder_ if this is one where the right thing is to just say no :) In this case I think the advantages of a higher level system knowing what efficient block sizes to do writes/reads in can potentially be a HUGE advantage. If the upper levels know that you have a 6 disk raid 6 array with a 64K chunk size, then reads and writes in 256K chunks (aligned) should be able to be done at basically the speed of a 4 disk raid 0 array. What's even more impressive is that this could be done even if the array is degraded (if you know the drives have failed you don't even try to read from them, and you only have to reconstruct the missing info once per stripe). The current approach doesn't give the upper levels any chance to operate in this mode; they just don't have enough information to do so. The part about wanting to know raid 0 chunk size, so that the upper layers can be sure that data that's supposed to be redundant is on separate drives, is also possible. Storage technology is headed in the direction of having the system do more and more of the layout decisions, and re-striping the array as conditions change (similar to what md can already do with enlarging raid5/6 arrays), but unless you want to eventually put all that decision logic into the md layer you should make it possible for other layers to make queries to find out what's what, and then they can give directions for what they want to have happen.
So for several reasons I don't see this as something that's deserving of an automatic 'no'. David Lang
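The 256K figure in the example above falls straight out of the array geometry; a sketch with the numbers from the post (6-disk raid6, 64K chunk):

```shell
disks=6
parity=2                        # raid6 stores two parity blocks per stripe
chunk_kb=64
data_disks=$((disks - parity))  # spindles actually carrying data
echo "aligned full-stripe write: $((data_disks * chunk_kb))K across $data_disks data disks"
```

A write of exactly that size, stripe-aligned, touches every data disk once and needs no read-modify-write cycle - which is why it runs at roughly 4-disk raid0 speed.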
Re: limits on raid
Bill Davidsen wrote: David Greaves wrote: [EMAIL PROTECTED] wrote: On Fri, 22 Jun 2007, David Greaves wrote: If you end up 'fiddling' in md because someone specified --assume-clean on a raid5 [in this case just to save a few minutes *testing time* on a system with a heavily choked bus!] then that adds *even more* complexity and exception cases into all the stuff you described. A few minutes? Are you reading the times people are seeing with multi-TB arrays? Let's see, 5TB at a rebuild rate of 20MB/s... three days. Yes. But we are talking initial creation here. And as soon as you believe that the array is actually usable you cut that rebuild rate, perhaps in half, and get dog-slow performance from the array. It's usable in the sense that reads and writes work, but for useful work it's pretty painful. You either fail to understand the magnitude of the problem or wish to trivialize it for some reason. I do understand the problem and I'm not trying to trivialise it :) I _suggested_ that it's worth thinking about things rather than jumping in to say "oh, we can code up a clever algorithm that keeps track of which stripes have valid parity and which don't, and we can optimise the read/copy/write for valid stripes and use the raid6-type read-all/write-all for invalid stripes, and then we can write a bit extra in the check code to set the bitmaps". Phew - and that lets us run the array at semi-degraded performance (raid6-like) for 3 days rather than either waiting before we put it into production or running it very slowly. Now we run this system for 3 years and we saved 3 days - hmmm, IS IT WORTH IT? What happens in those 3 years when we have a disk fail? The solution doesn't apply then - it's 3 days to rebuild - like it or not. By delaying parity computation until the first write to a stripe, only the growth of a filesystem is slowed, and all data are protected without waiting for the lengthy check. 
The rebuild speed can be set very low, because on-demand rebuild will do most of the work. I am not saying you are wrong. I ask merely if the balance of benefit outweighs the balance of complexity. If the benefit were 24x7 then sure - eg using hardware assist in the raid calcs - very useful indeed. I'm very much for the fs layer reading the lower block structure so I don't have to fiddle with arcane tuning parameters - yes, *please* help make xfs self-tuning! Keeping life as straightforward as possible low down makes the upwards interface more manageable and that goal more realistic... Those two paragraphs are mutually exclusive. The fs can be simple because it rests on a simple device, even if the simple device is provided by LVM or md. And LVM and md can stay simple because they rest on simple devices, even if they are provided by PATA, SATA, nbd, etc. Independent layers make each layer more robust. If you want to compromise the layer separation, some approach like ZFS with full integration would seem to be promising. Note that layers allow specialized features at each point, trading integration for flexibility. That's a simplistic summary. You *can* loosely couple the layers. But you can enrich the interface and tightly couple them too - XFS is capable (I guess) of understanding md more fully than say ext2. XFS would still work on a less 'talkative' block device where performance wasn't as important (USB flash maybe, dunno). My feeling is that full integration and independent layers each have benefits, as you connect the layers to expose operational details you need to handle changes in those details, which would seem to make layers more complex. Agreed. What I'm looking for here is better performance in one particular layer, the md RAID5 layer. I like to avoid unnecessary complexity, but I feel that the current performance suggests room for improvement. I agree there is room for improvement. 
I suggest that it may be more fruitful to write a tool called raid5prepare that writes zeroes/ones as appropriate to all component devices and then you can use --assume-clean without concern. That could look to see if the devices are scsi or whatever and take advantage of the hyperfast block writes that can be done. David
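Bill's three-day figure is easy to reproduce; a back-of-envelope sketch with his numbers (5TB at 20MB/s, decimal units):

```shell
size_mb=$((5 * 1000 * 1000))   # 5TB expressed in MB
rate_mb_s=20                   # rebuild rate, MB/s
secs=$((size_mb / rate_mb_s))
echo "$secs seconds, i.e. about $((secs / 3600)) hours"
```

69 hours is just shy of three days, as stated.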
Re: limits on raid
On Fri, 22 Jun 2007, Bill Davidsen wrote: By delaying parity computation until the first write to a stripe, only the growth of a filesystem is slowed, and all data are protected without waiting for the lengthy check. The rebuild speed can be set very low, because on-demand rebuild will do most of the work. I'm very much for the fs layer reading the lower block structure so I don't have to fiddle with arcane tuning parameters - yes, *please* help make xfs self-tuning! Keeping life as straightforward as possible low down makes the upwards interface more manageable and that goal more realistic... Those two paragraphs are mutually exclusive. The fs can be simple because it rests on a simple device, even if the simple device is provided by LVM or md. And LVM and md can stay simple because they rest on simple devices, even if they are provided by PATA, SATA, nbd, etc. Independent layers make each layer more robust. If you want to compromise the layer separation, some approach like ZFS with full integration would seem to be promising. Note that layers allow specialized features at each point, trading integration for flexibility. My feeling is that full integration and independent layers each have benefits, as you connect the layers to expose operational details you need to handle changes in those details, which would seem to make layers more complex. What I'm looking for here is better performance in one particular layer, the md RAID5 layer. I like to avoid unnecessary complexity, but I feel that the current performance suggests room for improvement. 
They both have benefits, but it shouldn't have to be either-or. If you build the separate layers and provide ways for the upper layers to query the lower layers to find what's efficient, then you can have some upper layers that don't care about this and treat the lower layer as a simple block device, while other upper layers find out what sort of things are more efficient to do and use the same lower layer in a more complex manner. The alternative is to duplicate effort (and code) to have two codebases that try to do the same thing, one stand-alone and one as part of an integrated solution (and it gets even worse if there end up being multiple integrated solutions). David Lang
Re: limits on raid
Neil Brown wrote: This isn't quite right. Thanks :) Firstly, it is mdadm which decided to make one drive a 'spare' for raid5, not the kernel. Secondly, it only applies to raid5, not raid6 or raid1 or raid10. For raid6, the initial resync (just like the resync after an unclean shutdown) reads all the data blocks, and writes all the P and Q blocks. raid5 can do that, but it is faster to read all but one disk, and write to that one disk. How about this: Initial Creation When mdadm asks the kernel to create a raid array the most noticeable activity is what's called the initial resync. Raid level 0 doesn't have any redundancy so there is no initial resync. For raid levels 1, 4, 6 and 10, mdadm creates the array and starts a resync. The raid algorithm then reads the data blocks and writes the appropriate parity/mirror (P+Q) blocks across all the relevant disks. There is some sample output in a section below... For raid5 there is an optimisation: mdadm takes one of the disks and marks it as 'spare'; it then creates the array in degraded mode. The kernel marks the spare disk as 'rebuilding' and starts to read from the 'good' disks, calculates the parity, determines what should be on the spare disk, and then just writes to it. Once all this is done the array is clean and all disks are active. This can take quite a time and the array is not fully resilient whilst this is happening (it is however fully usable). Also, is raid4 like raid5 or raid6 in this respect?
Re: raid5 recover after a 2 disk failure
Frank Jenkins wrote: So here's the /proc/mdstat prior to the array failure: I'll take a look through this and see if I can see any problems Frank. Bit busy now - give me a few minutes. David
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
David Greaves wrote: I'm going to have to do some more testing... done David Chinner wrote: On Mon, Jun 18, 2007 at 08:49:34AM +0100, David Greaves wrote: David Greaves wrote: So doing:
xfs_freeze -f /scratch
sync
echo platform > /sys/power/disk
echo disk > /sys/power/state
# resume
xfs_freeze -u /scratch
Works (for now - more usage testing tonight) Verrry interesting. Good :) Now, not so good :) What you were seeing was an XFS shutdown occurring because the free space btree was corrupted. IOWs, the process of suspend/resume has resulted in either bad data being written to disk, the correct data not being written to disk or the cached block being corrupted in memory. That's the kind of thing I was suspecting, yes. If you run xfs_check on the filesystem after it has shut down after a resume, can you tell us if it reports on-disk corruption? Note: do not run xfs_repair to check this - it does not check the free space btrees; instead it simply rebuilds them from scratch. If xfs_check reports an error, then run xfs_repair to fix it up. OK, I can try this tonight... This is on 2.6.22-rc5 So I hibernated last night and resumed this morning. Before hibernating I froze and sync'ed. After resume I thawed it. (Sorry Dave) Here are some photos of the screen during resume. This is not 100% reproducible - it seems to occur only if the system is shut down for 30 mins or so. Tejun, I wonder if error handling during resume is problematic? I got the same errors in 2.6.21. I have never seen these (or any other libata) errors other than during resume. http://www.dgreaves.com/pub/2.6.22-rc5-resume-failure.jpg (hard to read; here's one from 2.6.21: http://www.dgreaves.com/pub/2.6.21-resume-failure.jpg) I _think_ I've only seen the xfs problem when a resume shows these errors. Ok, to try and cause a problem I ran a make and got this back at once:
make: stat: Makefile: Input/output error
make: stat: clean: Input/output error
make: *** No rule to make target `clean'.  Stop.
make: stat: GNUmakefile: Input/output error
make: stat: makefile: Input/output error
I caught the first dmesg this time:
Filesystem dm-0: XFS internal error xfs_btree_check_sblock at line 334 of file fs/xfs/xfs_btree.c. Caller 0xc01b58e1
 [c0104f6a] show_trace_log_lvl+0x1a/0x30
 [c0105c52] show_trace+0x12/0x20
 [c0105d15] dump_stack+0x15/0x20
 [c01daddf] xfs_error_report+0x4f/0x60
 [c01cd736] xfs_btree_check_sblock+0x56/0xd0
 [c01b58e1] xfs_alloc_lookup+0x181/0x390
 [c01b5b06] xfs_alloc_lookup_le+0x16/0x20
 [c01b30c1] xfs_free_ag_extent+0x51/0x690
 [c01b4ea4] xfs_free_extent+0xa4/0xc0
 [c01bf739] xfs_bmap_finish+0x119/0x170
 [c01e3f4a] xfs_itruncate_finish+0x23a/0x3a0
 [c02046a2] xfs_inactive+0x482/0x500
 [c0210ad4] xfs_fs_clear_inode+0x34/0xa0
 [c017d777] clear_inode+0x57/0xe0
 [c017d8e5] generic_delete_inode+0xe5/0x110
 [c017da77] generic_drop_inode+0x167/0x1b0
 [c017cedf] iput+0x5f/0x70
 [c01735cf] do_unlinkat+0xdf/0x140
 [c0173640] sys_unlink+0x10/0x20
 [c01040a4] syscall_call+0x7/0xb
 ===
xfs_force_shutdown(dm-0,0x8) called from line 4258 of file fs/xfs/xfs_bmap.c. Return address = 0xc021101e
Filesystem dm-0: Corruption of in-memory data detected. Shutting down filesystem: dm-0
Please umount the filesystem, and rectify the problem(s)
so I cd'ed out of /scratch and umounted. I then tried the xfs_check.
haze:~# xfs_check /dev/video_vg/video_lv
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_check. If you are unable to mount the filesystem, then use the xfs_repair -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
haze:~# mount /scratch/
haze:~# umount /scratch/
haze:~# xfs_check /dev/video_vg/video_lv
Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ... 
haze kernel: Bad page state in process 'xfs_db'
haze kernel: page:c1767bc0 flags:0x80010008 mapping: mapcount:-64 count:0
haze kernel: Trying to fix it up, but a reboot is needed
haze kernel: Backtrace:
haze kernel: Bad page state in process 'syslogd'
haze kernel: page:c1767cc0 flags:0x80010008 mapping: mapcount:-64 count:0
haze kernel: Trying to fix it up, but a reboot is needed
haze kernel: Backtrace:
ugh. Try again:
haze:~# xfs_check /dev/video_vg/video_lv
haze:~#
whilst running, top reported this as roughly the peak memory usage: 8759 root
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
Rafael J. Wysocki wrote: This is on 2.6.22-rc5 Is Tejun's patch http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.22-rc5/patches/30-block-always-requeue-nonfs-requests-at-the-front.patch applied on top of that? 2.6.22-rc5 includes it. (but, when I was testing rc4, I did apply this patch) David
Re: limits on raid
On Tue, 19 Jun 2007, Lennart Sorensen wrote: On Mon, Jun 18, 2007 at 02:56:10PM -0700, [EMAIL PROTECTED] wrote: yes, I'm using promise drive shelves, I have them configured to export the 15 drives as 15 LUNs on a single ID. I'm going to be using this as a huge circular buffer that will just be overwritten eventually 99% of the time, but once in a while I will need to go back into the buffer and extract and process the data. I would guess that if you ran 15 drives per channel on 3 different channels, you would resync in 1/3 the time. Well unless you end up saturating the PCI bus instead. hardware raid of course has an advantage there in that it doesn't have to go across the bus to do the work (although if you put 45 drives on one scsi channel on hardware raid, it will still be limited). I fully realize that the channel will be the bottleneck, I just didn't understand what /proc/mdstat was telling me. I thought that it was telling me that the resync was processing 5M/sec, not that it was writing 5M/sec on each of the two parity locations. David Lang
Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume
David Greaves wrote: David Robinson wrote: David Greaves wrote: This isn't a regression. I was seeing these problems on 2.6.21 (but 22 was in -rc so I waited to try it). I tried 2.6.22-rc4 (with Tejun's patches) to see if it had improved - no. Note this is a different (desktop) machine to that involved in my recent bugs. The machine will work for days (continually powered up) without a problem and then exhibits a filesystem failure within minutes of a resume. snip OK, that gave me an idea:
Freeze the filesystem
md5sum the lvm
hibernate
resume
md5sum the lvm
snip So the lvm and below looks OK... I'll see how it behaves now the filesystem has been frozen/thawed over the hibernate... And it appears to behave well. (A few hours of compile/clean cycling kernel builds on that filesystem were OK). Historically I've done:
sync
echo platform > /sys/power/disk
echo disk > /sys/power/state
# resume
and had filesystem corruption (only on this machine; my other hibernating xfs machines don't have this problem). So doing:
xfs_freeze -f /scratch
sync
echo platform > /sys/power/disk
echo disk > /sys/power/state
# resume
xfs_freeze -u /scratch
Works (for now - more usage testing tonight) David
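The working sequence above wraps naturally into a small script. This is a sketch only: the mount point and the 'platform' mode are taken from the message, and the trap is an addition so the thaw runs even if the hibernate step fails partway:

```shell
#!/bin/sh
set -e
FS=/scratch                       # the XFS filesystem being protected

xfs_freeze -f "$FS"               # quiesce: flush and block new writes
trap 'xfs_freeze -u "$FS"' EXIT   # guarantee a thaw on any exit path
sync
echo platform > /sys/power/disk
echo disk > /sys/power/state      # machine hibernates here; on resume,
                                  # execution continues and the trap
                                  # thaws the filesystem
```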
Re: XFS Tunables for High Speed Linux SW RAID5 Systems?
David Chinner wrote: On Fri, Jun 15, 2007 at 04:36:07PM -0400, Justin Piszcz wrote: Hi, I was wondering if the XFS folks can recommend any optimizations for high speed disk arrays using RAID5? [sysctls snipped] None of those options will make much difference to performance. mkfs parameters are the big ticket item here. Is there anywhere you can point to that expands on this? Is there anything raid-specific that would be worth including in the Wiki? David
Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume
On Mon, Jun 18, 2007 at 08:49:34AM +0100, David Greaves wrote: David Greaves wrote: OK, that gave me an idea:
Freeze the filesystem
md5sum the lvm
hibernate
resume
md5sum the lvm
snip So the lvm and below looks OK... I'll see how it behaves now the filesystem has been frozen/thawed over the hibernate... And it appears to behave well. (A few hours of compile/clean cycling kernel builds on that filesystem were OK). Historically I've done:
sync
echo platform > /sys/power/disk
echo disk > /sys/power/state
# resume
and had filesystem corruption (only on this machine; my other hibernating xfs machines don't have this problem). So doing:
xfs_freeze -f /scratch
sync
echo platform > /sys/power/disk
echo disk > /sys/power/state
# resume
xfs_freeze -u /scratch
Works (for now - more usage testing tonight) Verrry interesting. What you were seeing was an XFS shutdown occurring because the free space btree was corrupted. IOWs, the process of suspend/resume has resulted in either bad data being written to disk, the correct data not being written to disk, or the cached block being corrupted in memory. If you run xfs_check on the filesystem after it has shut down after a resume, can you tell us if it reports on-disk corruption? Note: do not run xfs_repair to check this - it does not check the free space btrees; instead it simply rebuilds them from scratch. If xfs_check reports an error, then run xfs_repair to fix it up. FWIW, I'm on record stating that sync is not sufficient to quiesce an XFS filesystem for a suspend/resume to work safely and have argued that the only safe thing to do is freeze the filesystem before suspend and thaw it after resume. This is why I originally asked you to test that with the other problem that you reported. Up until this point in time, there's been no evidence to prove either side of the argument. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group
Re: resync to last 27h - usually 3. what's this?
Dexter Filmore wrote: 1661 minutes is *way* too long. It's a 4x250GiB sATA array and usually takes 3 hours to resync or check, for that matter. So, what's this? Kernel and mdadm versions? I seem to recall a long-fixed ETA calculation bug some time back... David
Re: limits on raid
On Mon, 18 Jun 2007, Brendan Conoboy wrote: [EMAIL PROTECTED] wrote: in my case it takes 2+ days to resync the array before I can do any performance testing with it. for some reason it's only doing the rebuild at ~5M/sec (even though I've increased the min and max rebuild speeds, and a dd to the array seems to be ~44M/sec, even during the rebuild) With performance like that, it sounds like you're saturating a bus somewhere along the line. If you're using scsi, for instance, it's very easy for a long chain of drives to overwhelm a channel. You might also want to consider some other RAID layouts like 1+0 or 5+0 depending upon your space vs. reliability needs. I plan to test the different configurations. However, if I was saturating the bus with the reconstruct, how could I fire off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the reconstruct to ~4M/sec? I'm putting 10x as much data through the bus at that point; it would seem to prove that it's not the bus that's saturated. David Lang
Re: limits on raid
On Mon, 18 Jun 2007, Lennart Sorensen wrote: On Mon, Jun 18, 2007 at 10:28:38AM -0700, [EMAIL PROTECTED] wrote: I plan to test the different configurations. However, if I was saturating the bus with the reconstruct, how can I fire off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the reconstruct to ~4M/sec? I'm putting 10x as much data through the bus at that point; it would seem to prove that it's not the bus that's saturated. dd 45MB/s from the raid sounds reasonable. If you have 45 drives, doing a resync of raid5 or raid6 should probably involve reading all the disks, and writing new parity data to one drive. So if you are writing 5MB/s, then you are reading 44*5MB/s from the other drives, which is 220MB/s. If your resync drops to 4MB/s when doing dd, then you have 44*4MB/s which is 176MB/s, or 44MB/s less read capacity, which surprisingly seems to match the dd speed you are getting. Seems like you are indeed very much saturating a bus somewhere. The numbers certainly agree with that theory. What kind of setup are the drives connected to? Simple ultra-wide SCSI to a single controller. I didn't realize that the rate reported by /proc/mdstat was the write speed that was taking place; I thought it was the total data rate (reads + writes). The next time this message gets changed it would be a good thing to clarify this. David Lang
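Lennart's arithmetic can be checked with a quick sketch. The drive count and rates below are the ones reported in the thread; the model simply assumes a parity resync must read from every other member at the same rate it writes parity:

```python
# Rough model of RAID-5/6 resync traffic: a resync writing parity at
# write_mb_s must read from each of the other (n_drives - 1) members
# at that same rate.

def resync_read_bw(n_drives, write_mb_s):
    """Read bandwidth (MB/s) implied by a resync writing at write_mb_s."""
    return (n_drives - 1) * write_mb_s

idle_reads = resync_read_bw(45, 5)   # resync alone: 44 * 5 = 220 MB/s
busy_reads = resync_read_bw(45, 4)   # resync + dd:  44 * 4 = 176 MB/s
freed = idle_reads - busy_reads      # 44 MB/s -- matches the ~45 MB/s dd
print(idle_reads, busy_reads, freed)
```

The 44 MB/s of read bandwidth the resync gives up when the dd starts is almost exactly the throughput the dd achieves, which is what points at a saturated shared bus.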
Re: limits on raid
On Mon, 18 Jun 2007, Brendan Conoboy wrote: [EMAIL PROTECTED] wrote: I plan to test the different configurations. However, if I was saturating the bus with the reconstruct, how can I fire off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the reconstruct to ~4M/sec? I'm putting 10x as much data through the bus at that point; it would seem to prove that it's not the bus that's saturated. I am unconvinced. If you take ~1MB/s for each active drive, add in SCSI overhead, 45M/sec seems reasonable. Have you looked at a running iostat while all this is going on? Try it out - add up the kb/s from each drive and see how close you are to your maximum theoretical IO. I didn't try iostat; I did look at vmstat, and there the numbers look even worse: the bo column is ~500 for the resync by itself, but with the dd it's ~50,000. When I get access to the box again I'll try iostat to get more details. Also, how's your CPU utilization? ~30% of one cpu for the raid6 thread, ~5% of one cpu for the resync thread. David Lang
Re: limits on raid
On Mon, 18 Jun 2007, Lennart Sorensen wrote: On Mon, Jun 18, 2007 at 11:12:45AM -0700, [EMAIL PROTECTED] wrote: simple ultra-wide SCSI to a single controller. Hmm, isn't ultra-wide limited to 40MB/s? Is it Ultra320 wide? That could do a lot more, and 220MB/s sounds plausible for 320 SCSI. Yes, sorry, Ultra320 wide. I didn't realize that the rate reported by /proc/mdstat was the write speed that was taking place; I thought it was the total data rate (reads + writes). The next time this message gets changed it would be a good thing to clarify this. Well, I suppose it could make sense to show the rate of rebuild, which you can then compare against the total size of the raid, or you can have the rate of write, which you then compare against the size of the drive being synced. Certainly I would expect much higher speeds if it was the overall raid size, while the numbers seem pretty reasonable as a write speed. 4MB/s would take forever if it was the overall raid resync speed. I usually see SATA raid1 resync at 50 to 60MB/s or so, which matches the read and write speeds of the drives in the raid. As I read it right now, what happens is the worst of the options: you show the total size of the array for the amount of work that needs to be done, but then show only the write speed for the rate of progress being made through the job. Total rebuild time was estimated at ~3200 min. David Lang
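The mismatch David Lang describes, sizing the job by the whole array while reporting progress at the parity write speed, can be made concrete with a sketch. The member size and write rate below are assumed for illustration (roughly matching the thread's 45-drive array and ~5MB/s resync rate); only the ~(N-1)x ratio between the two accountings is the point:

```python
# Hypothetical numbers: /proc/mdstat's ETA looks enormous if the work is
# counted as the whole array's capacity but progress advances at the
# parity *write* speed. The two readings differ by roughly (N-1)x.

N_DRIVES = 45
DRIVE_MB = 250_000            # assumed ~250 GB members
WRITE_MB_S = 5                # observed parity write rate

eta_per_drive_min = DRIVE_MB / WRITE_MB_S / 60
eta_whole_array_min = (N_DRIVES - 1) * DRIVE_MB / WRITE_MB_S / 60

print(round(eta_per_drive_min), round(eta_whole_array_min))
```

Interpreting the same 5MB/s as progress through one member gives an ETA 44x shorter than interpreting it as progress through the array, which is why the reported rebuild estimates look so wildly long.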
Re: limits on raid
On Mon, 18 Jun 2007, Brendan Conoboy wrote: [EMAIL PROTECTED] wrote: yes, sorry, Ultra320 wide. Exactly how many channels and drives? One channel, 2 OS drives plus the 45 drives in the array. Yes, I realize that there will be bottlenecks with this; the large capacity is to handle longer history (it's going to be a 30TB circular buffer being fed by a pair of OC-12 links). It appears that my big mistake was not understanding what /proc/mdstat is telling me. David Lang
Re: limits on raid
On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote: Combining these thoughts, it would make a lot of sense for the filesystem to be able to say to the block device "That block looks wrong - can you find me another copy to try?". That is an example of the sort of closer integration between filesystem and RAID that would make sense. I think that this would only be useful on devices that store discrete copies of the blocks on different devices, i.e. mirrors. If it's an XOR-based RAID, you don't have another copy you can retrieve. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
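Dave's point, that XOR parity gives you one reconstruction rather than an independent second copy to try, can be seen in a toy sketch of a RAID-5 stripe:

```python
# Toy RAID-5 stripe: parity is the XOR of the data blocks. A lost block
# can be rebuilt, but there is only ever one answer -- unlike a mirror,
# there is no independent second copy to "try" if a read looks wrong.

from functools import reduce
from operator import xor

data = [0b1010, 0b0110, 0b1111]          # three toy data blocks
parity = reduce(xor, data)               # 0b0011

# Lose data[1], then rebuild it from the parity and the survivors:
rebuilt = reduce(xor, [parity, data[0], data[2]])
print(rebuilt == data[1])                # True
```

A mirror can hand back the other copy and let the filesystem judge it; the XOR reconstruction above is fully determined by the surviving blocks, so if one of *those* is silently bad, the rebuilt block is bad too.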
Re: limits on raid
On Sat, 16 Jun 2007, Neil Brown wrote: It would be possible to have a 'this is not initialised' flag on the array, and if that is not set, always do a reconstruct-write rather than a read-modify-write. But the first time you have an unclean shutdown you are going to resync all the parity anyway (unless you have a bitmap), so you may as well resync at the start. And why is it such a big deal anyway? The initial resync doesn't stop you from using the array. I guess if you wanted to put an array into production instantly and couldn't afford any slowdown due to resync, then you might want to skip the initial resync, but is that really likely? In my case it takes 2+ days to resync the array before I can do any performance testing with it. For some reason it's only doing the rebuild at ~5M/sec (even though I've increased the min and max rebuild speeds, and a dd to the array seems to be ~44M/sec, even during the rebuild). I want to test several configurations, from a 45-disk raid6 to a 45-disk raid0. At 2-3 days per test (or longer, depending on the tests) this becomes a very slow process. Also, when a rebuild is slow enough (and has enough of a performance impact), it's not uncommon to want to operate in degraded mode just long enough to get to a maintenance window and then recreate the array and reload from backup. David Lang
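The reconstruct-write vs. read-modify-write trade-off Neil mentions can be counted directly. This is a simplified sketch of the I/O cost of updating a single chunk on an N-disk RAID-5 (the function names are mine, not md's):

```python
# I/O counts for updating one data chunk on an N-disk RAID-5 stripe.

def rmw_ios(n_disks):
    """Read-modify-write: read old data + old parity, write new data + parity."""
    return {"reads": 2, "writes": 2}

def reconstruct_write_ios(n_disks):
    """Reconstruct-write: read all the other data chunks, write data + parity."""
    return {"reads": n_disks - 2, "writes": 2}

# On a wide array RMW is far cheaper, but it only yields correct parity
# if the old parity was already correct -- which is why an uninitialised
# array must use reconstruct-writes (or do the initial resync first).
print(rmw_ios(45), reconstruct_write_ios(45))
```

On the 45-disk array from this thread, skipping the initial resync would mean every small write costs 43 reads instead of 2, which puts the "why is the resync such a big deal" question in context.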
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: On Thu, May 31 2007, David Chinner wrote: IOWs, there are two parts to the problem: 1 - guaranteeing I/O ordering 2 - guaranteeing blocks are on persistent storage. Right now, a single barrier I/O is used to provide both of these guarantees. In most cases, all we really need to provide is 1); the need for 2) is a much rarer condition but still needs to be provided. If I am understanding it correctly, the big win for barriers is that you do NOT have to stop and wait until the data is on persistent media before you can continue. Yes, if we define a barrier to only guarantee 1), then yes this would be a big win (esp. for XFS). But that requires all filesystems to handle sync writes differently, and sync_blockdev() needs to call blkdev_issue_flush() as well. So, what do we do here? Do we define a barrier I/O to only provide ordering, or do we define it to also provide persistent storage writeback? Whatever we decide, it needs to be documented. The block layer already has a notion of the two types of barriers; with a very small amount of tweaking we could expose that. There's absolutely zero reason we can't easily support both types of barriers. That sounds like a good idea - we can leave the existing WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED behaviour that only guarantees ordering. The filesystem can then choose which to use where appropriate. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, 31 May 2007, Jens Axboe wrote: On Thu, May 31 2007, Phillip Susi wrote: David Chinner wrote: That sounds like a good idea - we can leave the existing WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED behaviour that only guarantees ordering. The filesystem can then choose which to use where appropriate. So what if you want a synchronous write, but DON'T care about the order? They need to be two completely different flags which you can choose to combine, or use individually. If you have a use case for that, we can easily support it as well... Depending on the drive capabilities (FUA support or not), it may be nearly as slow as a real barrier write. True, but a real barrier write could have significant side effects on other writes that wouldn't happen with a synchronous write (a sync write can have other, unrelated writes re-ordered around it; a barrier write can't). David Lang
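The ordering distinction David Lang draws can be illustrated with a toy scheduler model. This is a hypothetical sketch of queue behaviour only, not the real block layer, and it deliberately ignores the persistence half of the problem: a sync/FUA write forces *itself* to media but does not fence its neighbours, whereas a barrier pins everything around it:

```python
# Toy I/O scheduler: requests may be reordered freely (modelled here by
# sorting each run), except that nothing may cross a "BARRIER" request.

def schedule(requests):
    out, segment = [], []
    for r in requests:
        if r == "BARRIER":
            out += sorted(segment) + [r]   # flush the segment, keep the fence
            segment = []
        else:
            segment.append(r)
    return out + sorted(segment)

# A sync write ("s2") is just another request for ordering purposes --
# unrelated writes drift around it:
print(schedule(["w3", "s2", "w1"]))        # ['s2', 'w1', 'w3'] -- reordered
# ...but a barrier pins the order of everything before and after it:
print(schedule(["w3", "BARRIER", "w1"]))   # ['w3', 'BARRIER', 'w1']
```

That fencing is exactly the "side effect on other writes" of a barrier: every unrelated request queued near it loses its freedom to be rescheduled, which a plain synchronous write never imposes.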