Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
On Wednesday July 15, dan.j.willi...@intel.com wrote:
> [ Cc: Neil ]
>
> On Tue, Jul 14, 2009 at 7:30 AM, David Zeuthen wrote:
> > On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote:
> >> Currently the udev rules use incremental assembly like this:
> >> mdadm -I /dev/mdraid-member
> >>
> >> There are 2 problems with this:
> >> 1) When doing this for native mdraid metadata arrays, if only
> >>    one disk is present the set never gets activated
> >> 2) When doing this for imsm metadata arrays, as soon as the
> >>    first disk is incrementally added, the set gets activated
> >>    in degraded mode and stays that way, the second disk
> >>    will get added to the container, but not to the actual
> >>    sets in the container
> >
> > FWIW, this incremental assembly business in mdadm is actually not a very
> > good idea. At least not the current implementation. I'm not sure whether
> > it's still a Fedora-ism or whether it's something that's in upstream
> > mdadm yet. I'm talking about this udev rule
> >
> > /lib/udev/rules.d/65-md-incremental.rules:
> > # This file causes block devices with Linux RAID (mdadm) signatures to
> > # automatically cause mdadm to be run.
> > # See udev(8) for syntax
> >
> > SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
> > IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \
> > RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I $env{DEVNAME}'"
> >
> > For example if the user plugs in a random old disk that happens to
> > contain half of a RAID1 mirror, then the incremental assembly bits sets
> > up an inert md-device and the user is now left to his own devices as to
> > sort this out when he's told by partitioning tools etc. that the disk
> > (or partition of) he just plugged in, is "busy" (it is claimed by the
> > inert md node).
> >
> > I actually had to add some extra code to the GNOME Disk Utility bits to
> > handle such things (stop inert md devices) - makes the user experience
> > quite a bit worse since there's now an extra state to worry about. And
> > most current users don't use the UI bits yet for this so they get extra
> > confused when trying to use e.g. parted(8) or fdisk(8) on the device.
> >
> > FWIW, I'd wish people would stop playing games like this. If you want to
> > do auto-assembly at the system-level, at the very least don't leave the
> > system in a state like this. For example, one way to do auto-assembly
> > without such bugs would be to use libudev to enumerate all md component
> > devices with the same MD_UUID. Then you count the number of components
> > and only start the array if the number of components equals MD_DEVICES.
> > That's much better than incrementally adding to an md device node that
> > might never get used.

Yes: auto-assembly is hard, and easy to get wrong.

While I don't claim that the current scheme is at all perfect, I don't
think your suggestion is a clear improvement. The whole point of RAID
is to survive drive failure, and that includes drives being missing.
So I don't think "completely ignore the array if not all expected
drives are present" is the correct answer.

It is very easy to remove unwanted raid metadata
(mdadm --zero-superblock), and making that easily accessible from a GUI
would probably be a good and useful thing, and might solve some
problems for some people.

One thing that I have contemplated is for md to not claim exclusive
ownership of drives until the array is activated and switched to
read-write. That would address the 'my drive was stolen by md'
problem, but it may well create other problems in its place.

My general goal at present is to make mdadm sufficiently flexible that
a distro can choose a suitable policy and implement it. If someone
comes up with a policy that works convincingly well, I could then make
that the default approach that mdadm takes.

There is certainly still room for improvement and I am happy to discuss
possibilities.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe initramfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
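[Editor's sketch, not part of Neil's message.] The clean-up Neil mentions could look roughly like the following for a stale member claimed by an inert array. The function name and both device names are hypothetical placeholders, and the function only prints the commands rather than running them, because --zero-superblock permanently erases the md metadata on the named device.

```sh
#!/bin/sh
# Sketch only: print (do not run) the commands that would release a stale
# RAID member. Device names below are illustrative placeholders.
release_stale_member() {
    md_node=$1   # the inert array node that claimed the disk
    member=$2    # the disk or partition whose md metadata is unwanted
    echo "mdadm --stop $md_node"
    echo "mdadm --zero-superblock $member"
}

# Example call with made-up device names.
release_stale_member /dev/md127 /dev/sdb1
```

Printing first and running by hand afterwards leaves room to check that the member really belongs to an array nobody wants any more.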
Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
On Wednesday July 15, dan.j.willi...@intel.com wrote:
>
> mdadm-3.0 has facilities to prevent assembly of certain metadata types
> [1] or arrays with certain uuids [2]. I wonder if we also need a
> facility to prevent auto-assembly of arrays *not* listed in
> mdadm.conf? So the mdadm.conf file installed in the initramfs would
> only identify the root array and all other randomly identified md
> devices would be ignored (rather than assembled with a foreign name).
>
> Thoughts?

This is exactly the functionality that the 'AUTO' line provides.

Arrays that are listed in mdadm.conf will always be assembled when they
are found. Arrays that are not listed may or may not be assembled
depending on their metadata type and the setting of 'AUTO'.

So
    AUTO -all
will disable all auto-assemble of arrays that are not listed in
mdadm.conf.

NeilBrown
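[Editor's sketch, not part of Neil's message.] Combining Dan's idea with the AUTO line, a host-specific initramfs could carry an mdadm.conf that names only the root array. The function name, the /dev/md0 node, the UUID and the output path below are all placeholders chosen for illustration; only the ARRAY and AUTO keywords come from mdadm.conf itself.

```sh
#!/bin/sh
# Sketch: write a minimal mdadm.conf that lists one array and disables
# auto-assembly of everything else. All values are illustrative.
write_initramfs_mdadm_conf() {
    conf=$1
    root_uuid=$2
    cat > "$conf" <<EOF
# Always assemble the root array when its members appear.
ARRAY /dev/md0 UUID=$root_uuid
# Do not auto-assemble any array that is not listed above.
AUTO -all
EOF
}

example_conf="${TMPDIR:-/tmp}/mdadm.conf.example"
write_initramfs_mdadm_conf "$example_conf" 01234567:89abcdef:01234567:89abcdef
cat "$example_conf"
```

As Jeremy points out elsewhere in the thread, this only makes sense when the initramfs is built for the machine it boots, since a generic image cannot know the root array's UUID.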
Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
On 07/14/2009 05:00 PM, David Zeuthen wrote: On Tue, 2009-07-14 at 10:14 -0400, Doug Ledford wrote: On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote: Hi, On 07/14/2009 03:39 PM, Doug Ledford wrote: On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote: Hi, As you probably know I'm working on making Fedora 12 use mdraid instead of dmraid for Intel BIOS-RAID setups. The installer (anaconda) part is mostly done (needs more testing) and now I'm looking at implementing support for this in dracut (the new mkinitrd for Fedora 12). So I've been testing how this works for both imsm mdraid sets and native mdraid metadata sets, in both cases using a 2 disk mirror, so that the set can also be brought up in degraded mode. Currently the udev rules use incremental assembly like this: mdadm -I /dev/mdraid-member Hmmm...does dracut use udev during initramfs time? Yes, it uses udev for everything, making discovery of / consistent with the discovery of other storage devices. I'm not sure I like or agree with that philosophy. I absolutely *don't* want my / filesystem or raid device treated like some plug in, temporary, roaming raid device. They *aren't* the same, not in terms of importance to the running of the machine and not in terms of reliability requirements. By using mdadm -A in the mkinitrd calls, I was able to put in an mdadm.conf file and limit what arrays get started to arrays found non-ambiguously in that mdadm.conf file and identified by UUID. When you switch to incremental assembly for root, you risk the possibility of name space collisions and non- deterministic bring up of your / array. I'm concerned about this too. To be more specific, I'm concerned about both automatically assembling things like RAID arrays / LVM logical volumes and also automounting devices [1]. Anyway, my point with all this is that maybe we are going about things wrong in the initramfs. My understanding is that dracut roughly works this way (please let me know if this is wrong) 1. 
when generating the initramfs image, we leave information in the kernel command-line about the root filesystem - typically the UUID - e.g. root=UUID=786263c4-5e28-4cdc-97b8-1ab6e221c344 2. when the initramfs starts, we trigger all uevents and wait for things to settle 3. Autoassembly / magic: - If we see e.g. md components, we activate them via udev rules - If we see e.g. LUKS devices, we unlock them (by interacting with the user asking for the passphrase) via udev rules. - Ditto for e.g. LVM 5. if we see the rootfs (matching on e.g. the UUID passed on the kernel command line) we create the /dev/root symlink 6. when the system has settled (e.g. no more uevents) we mount /dev/root and transition to non-early user space. If there is no /dev/root link, we bail out Now, my beef is 3. above. I think it is way too optimistic to just auto-assemble / unlock etc. everything. E.g. we end up doing a lot of work not related to the rootfs that is better done in non-early user space. Instead, just like we specify the UUID for rootfs on the command-line, we need to leave some instructions to the initramfs logic on _exactly_ what things should be autoassembled / unlocked / etc. in order to find the rootfs. So the kernel command-line wouldn't really be "just" the UUID of rootfs; it would be a whole recipe of actions to do. E.g. ROOTFS=UUID=1234 \ # this the UUID of my rootfs MD_ASSEMBLE=UUID=4567 \ # assemble MD array with UUID 4567 LUKS_UNLOCK=UUID=89ab # unlock LUKS device with UUID 89ab which would work for e.g. cases where rootfs is on a LUKS device which is on a MD array. In other words, we'd need a whole "recipe" passed to the initramfs (the mkinitrd tool would generate this recipe), not just the UUID of the rootfs. Coincidentally, if we had something like this and the format of the "recipe" was documented somewhere, it would be easy to e.g. 
implement "rescue" functionality as described here http://www.redhat.com/archives/fedora-desktop-list/2009-July/msg00019.html since graphical disk utilities would just find /etc/grub.conf (or similar), read the recipe and then start assembling/unlocking bits and mount them as appropriate in /mnt/rescue/. Actually this is very close to what Doug is asking for when he says (paraphrased) "just include mdadm.conf instead of this magic". The key difference, however, is that the user _won't_ have to use mdadm.conf or care about config files - it's all taken care of by the mkinitrd binary when building the recipe. This is a good thing as having one less config file to worry about is good. Thanks for considering, and sorry for the long mail, David [1] : As some background information, I've spent a good chunk of my life, five years or so, dealing with end users complaining about how plain block devices got automounted when they were plugged in. FWIW, the complaints ranges from both non-sensical (irritated users: "these desktop k
Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
On Wed, Jul 15, 2009 at 7:16 PM, Jeremy Katz wrote:
> On Wednesday, July 15 2009, Dan Williams said:
>> mdadm-3.0 has facilities to prevent assembly of certain metadata types
>> [1] or arrays with certain uuids [2]. I wonder if we also need a
>> facility to prevent auto-assembly of arrays *not* listed in
>> mdadm.conf? So the mdadm.conf file installed in the initramfs would
>> only identify the root array and all other randomly identified md
>> devices would be ignored (rather than assembled with a foreign name).
>>
>> Thoughts?
>
> There is no mdadm.conf in the initramfs -- in fact, the initramfs may
> not even be generated on the system that you're booting and instead be
> "generic" for the kernel in question
>
> Jeremy

Still, it sounds like a good feature to be added for a --hostonly
initramfs (especially on systems that are attaching to iSCSI and/or
Fibre Channel LUNs).
Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
On Wednesday, July 15 2009, Dan Williams said:
> mdadm-3.0 has facilities to prevent assembly of certain metadata types
> [1] or arrays with certain uuids [2]. I wonder if we also need a
> facility to prevent auto-assembly of arrays *not* listed in
> mdadm.conf? So the mdadm.conf file installed in the initramfs would
> only identify the root array and all other randomly identified md
> devices would be ignored (rather than assembled with a foreign name).
>
> Thoughts?

There is no mdadm.conf in the initramfs -- in fact, the initramfs may
not even be generated on the system that you're booting and instead be
"generic" for the kernel in question.

Jeremy
Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
[ Cc: Neil ] On Tue, Jul 14, 2009 at 7:30 AM, David Zeuthen wrote: > On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote: >> Currently the udev rules use incremental assembly like this: >> mdadm -I /dev/mdraid-member >> >> There are 2 problems with this: >> 1) When doing this for native mdraid metadata arrays, if only >> one disk is present the set never gets activated >> 2) When doing this for imsm metadata arrays, as soon as the >> first disk is incrementally added, the set gets activated >> in degraded mode and stays that way, the second disk >> will get added to the container, but not to the actual >> sets in the container > > FWIW, this incremental assembly business in mdadm is actually not a very > good idea. At least not the current implementation. I'm not sure whether > it's still a Fedora-ism or whether it's something that's in upstream > mdadm yet. I'm talking about this udev rule > > /lib/udev/rules.d/65-md-incremental.rules: > # This file causes block devices with Linux RAID (mdadm) signatures to > # automatically cause mdadm to be run. > # See udev(8) for syntax > > SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \ > IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \ > RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I > $env{DEVNAME}'" > > For example if the user plugs in a random old disk that happens to > contain half of a RAID1 mirror, then the incremental assembly bits sets > up an inert md-device and the user is now left to his own devices as to > sort this out when he's told by partitioning tools etc. that the disk > (or partition of) he just plugged in, is "busy" (it is claimed by the > inert md node). > > I actually had to add some extra code to the GNOME Disk Utility bits to > handle such things (stop inert md devices) - makes the user experience > quite a bit worse since there's now an extra state to worry about. 
And > most current users don't use the UI bits yet for this so they get extra > confused when trying to use e.g. parted(8) or fdisk(8) on the device. > > FWIW, I'd wish people would stop playing games like this. If you want to > do auto-assembly at the system-level, at the very least don't leave the > system in a state like this. For example, one way to do auto-assembly > without such bugs would be to use libudev to enumerate all md component > devices with the same MD_UUID. Then you count the number of components > and only start the array if the number of components equals MD_DEVICES. > That's much better than incrementally adding to an md device node that > might never get used. > > I've complained to Doug about this already for Fedora but, since it's > still broken and, AFAICT, up it's way to upstream mdadm, it's worth > reiterating the complaint. > > Thanks, > David > > [1] : And, except for booting, it's not clear to me that you want to > have policy like auto-assembling RAID arrays at the system. I'd leave > such policy to desktop bits where the user can control it and the > software can actually interact with the user. And where it's easy to > turn off features like this. > mdadm-3.0 has facilities to prevent assembly of certain metadata types [1] or arrays with certain uuids [2]. I wonder if we also need a facility to prevent auto-assembly of arrays *not* listed in mdadm.conf? So the mdadm.conf file installed in the initramfs would only identify the root array and all other randomly identified md devices would be ignored (rather than assembled with a foreign name). Thoughts? Thanks, Dan [1]: http://neil.brown.name/git?p=mdadm;a=commitdiff;h=31015d57 [2]: http://neil.brown.name/git?p=mdadm;a=commitdiff;h=112cace6 -- To unsubscribe from this list: send the line "unsubscribe initramfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote:
> Currently the udev rules use incremental assembly like this:
> mdadm -I /dev/mdraid-member
>
> There are 2 problems with this:
> 1) When doing this for native mdraid metadata arrays, if only
>    one disk is present the set never gets activated
> 2) When doing this for imsm metadata arrays, as soon as the
>    first disk is incrementally added, the set gets activated
>    in degraded mode and stays that way, the second disk
>    will get added to the container, but not to the actual
>    sets in the container

FWIW, this incremental assembly business in mdadm is actually not a very
good idea. At least not the current implementation. I'm not sure whether
it's still a Fedora-ism or whether it's something that's in upstream
mdadm yet. I'm talking about this udev rule

 /lib/udev/rules.d/65-md-incremental.rules:
 # This file causes block devices with Linux RAID (mdadm) signatures to
 # automatically cause mdadm to be run.
 # See udev(8) for syntax

 SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
   IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \
   RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I $env{DEVNAME}'"

For example, if the user plugs in a random old disk that happens to
contain half of a RAID1 mirror, then the incremental assembly bits set
up an inert md device and the user is now left to his own devices to
sort this out when he's told by partitioning tools etc. that the disk
(or a partition of it) he just plugged in is "busy" (it is claimed by
the inert md node).

I actually had to add some extra code to the GNOME Disk Utility bits to
handle such things (stop inert md devices) - this makes the user
experience quite a bit worse since there's now an extra state to worry
about. And most current users don't use the UI bits yet for this, so
they get extra confused when trying to use e.g. parted(8) or fdisk(8)
on the device.

FWIW, I wish people would stop playing games like this. If you want to
do auto-assembly at the system level, at the very least don't leave the
system in a state like this. For example, one way to do auto-assembly
without such bugs would be to use libudev to enumerate all md component
devices with the same MD_UUID. Then you count the number of components
and only start the array if the number of components equals MD_DEVICES.
That's much better than incrementally adding to an md device node that
might never get used.

I've complained to Doug about this already for Fedora but, since it's
still broken and, AFAICT, on its way to upstream mdadm, it's worth
reiterating the complaint.

Thanks,
David

[1] : And, except for booting, it's not clear to me that you want to
have policy like auto-assembling RAID arrays at the system level. I'd
leave such policy to desktop bits where the user can control it and the
software can actually interact with the user. And where it's easy to
turn off features like this.
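[Editor's sketch, not part of David's message.] The counting rule David describes can be expressed as a small filter. It assumes the caller has already collected one "MD_UUID MD_DEVICES" pair per member device (for instance from the `mdadm --examine --export` output that the udev rule above imports); the function name and the input layout are made up for illustration, and libudev is swapped for plain text on stdin.

```sh
#!/bin/sh
# complete_arrays: read one "UUID EXPECTED_MEMBER_COUNT" pair per line on
# stdin (one line per member device seen so far) and print the UUIDs whose
# number of seen members equals the expected count.
complete_arrays() {
    awk '
        { seen[$1]++; want[$1] = $2 }
        END { for (u in seen) if (seen[u] == want[u]) print u }
    ' | sort
}

# Two members of array 4567 are present; only one of two for array 89ab.
printf '%s\n' '4567 2' '89ab 2' '4567 2' | complete_arrays
```

By construction this never reports a degraded array, which is exactly the behaviour Neil Brown objects to in his reply in this thread.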
Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
On Tue, 2009-07-14 at 10:14 -0400, Doug Ledford wrote:
> On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote:
> > Hi,
> > On 07/14/2009 03:39 PM, Doug Ledford wrote:
> >> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
> >>> Hi,
> >>>
> >>> As you probably know I'm working on making Fedora 12 use mdraid
> >>> instead of dmraid for Intel BIOS-RAID setups.
> >>>
> >>> The installer (anaconda) part is mostly done (needs more testing)
> >>> and now I'm looking at implementing support for this in dracut
> >>> (the new mkinitrd for Fedora 12).
> >>>
> >>> So I've been testing how this works for both imsm mdraid sets
> >>> and native mdraid metadata sets, in both cases using a 2 disk
> >>> mirror, so that the set can also be brought up in degraded mode.
> >>>
> >>> Currently the udev rules use incremental assembly like this:
> >>> mdadm -I /dev/mdraid-member
> >>
> >> Hmmm...does dracut use udev during initramfs time?
> >
> > Yes, it uses udev for everything, making discovery of / consistent
> > with the discovery of other storage devices.
>
> I'm not sure I like or agree with that philosophy. I absolutely
> *don't* want my / filesystem or raid device treated like some plug in,
> temporary, roaming raid device. They *aren't* the same, not in terms
> of importance to the running of the machine and not in terms of
> reliability requirements. By using mdadm -A in the mkinitrd calls, I
> was able to put in an mdadm.conf file and limit what arrays get
> started to arrays found non-ambiguously in that mdadm.conf file and
> identified by UUID. When you switch to incremental assembly for root,
> you risk the possibility of name space collisions and non-
> deterministic bring up of your / array.

I'm concerned about this too. To be more specific, I'm concerned about
both automatically assembling things like RAID arrays / LVM logical
volumes and also automounting devices [1].

Anyway, my point with all this is that maybe we are going about things
wrong in the initramfs. My understanding is that dracut roughly works
this way (please let me know if this is wrong)

 1. when generating the initramfs image, we leave information in the
    kernel command-line about the root filesystem - typically the UUID -
    e.g. root=UUID=786263c4-5e28-4cdc-97b8-1ab6e221c344

 2. when the initramfs starts, we trigger all uevents and wait for
    things to settle

 3. Autoassembly / magic:
    - If we see e.g. md components, we activate them via udev rules
    - If we see e.g. LUKS devices, we unlock them (by interacting
      with the user asking for the passphrase) via udev rules.
    - Ditto for e.g. LVM

 4. if we see the rootfs (matching on e.g. the UUID passed on the
    kernel command line) we create the /dev/root symlink

 5. when the system has settled (e.g. no more uevents) we mount
    /dev/root and transition to non-early user space. If there is no
    /dev/root link, we bail out

Now, my beef is 3. above. I think it is way too optimistic to just
auto-assemble / unlock etc. everything. E.g. we end up doing a lot of
work not related to the rootfs that is better done in non-early user
space.

Instead, just like we specify the UUID for rootfs on the command-line,
we need to leave some instructions to the initramfs logic on _exactly_
what things should be autoassembled / unlocked / etc. in order to find
the rootfs. So the kernel command-line wouldn't really be "just" the
UUID of rootfs; it would be a whole recipe of actions to do. E.g.

 ROOTFS=UUID=1234 \       # this is the UUID of my rootfs
 MD_ASSEMBLE=UUID=4567 \  # assemble MD array with UUID 4567
 LUKS_UNLOCK=UUID=89ab    # unlock LUKS device with UUID 89ab

which would work for e.g. cases where rootfs is on a LUKS device which
is on a MD array. In other words, we'd need a whole "recipe" passed to
the initramfs (the mkinitrd tool would generate this recipe), not just
the UUID of the rootfs.

Coincidentally, if we had something like this and the format of the
"recipe" was documented somewhere, it would be easy to e.g. implement
"rescue" functionality as described here

 http://www.redhat.com/archives/fedora-desktop-list/2009-July/msg00019.html

since graphical disk utilities would just find /etc/grub.conf (or
similar), read the recipe and then start assembling/unlocking bits and
mount them as appropriate in /mnt/rescue/.

Actually this is very close to what Doug is asking for when he says
(paraphrased) "just include mdadm.conf instead of this magic". The key
difference, however, is that the user _won't_ have to use mdadm.conf or
care about config files - it's all taken care of by the mkinitrd binary
when building the recipe. This is a good thing as having one less
config file to worry about is good.

Thanks for considering, and sorry for the long mail,
David

[1] : As some background information, I've spent a good chunk of my
life, five years or so, dealing with end users complaining about how
plain block devices got automounted when they were plugged
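[Editor's sketch, not part of David's message.] The "recipe" idea in the message above could be consumed by the initramfs roughly as follows. The ROOTFS=, MD_ASSEMBLE= and LUKS_UNLOCK= keys are David's illustrative names, not options that dracut or the kernel is known to understand; the function name and the action words it prints are likewise invented for this sketch.

```sh
#!/bin/sh
# parse_recipe: take a kernel command line as one string and print one
# "action uuid" line per recipe entry, in the order the entries appear.
# Arguments that are not part of the recipe are ignored.
parse_recipe() {
    for arg in $1; do
        case "$arg" in
            ROOTFS=UUID=*)      echo "mount-root ${arg#ROOTFS=UUID=}" ;;
            MD_ASSEMBLE=UUID=*) echo "assemble-md ${arg#MD_ASSEMBLE=UUID=}" ;;
            LUKS_UNLOCK=UUID=*) echo "unlock-luks ${arg#LUKS_UNLOCK=UUID=}" ;;
        esac
    done
}

# In an initramfs the string would come from /proc/cmdline.
parse_recipe "ro ROOTFS=UUID=1234 MD_ASSEMBLE=UUID=4567 LUKS_UNLOCK=UUID=89ab quiet"
```

A real consumer would also have to order the actions by dependency (assemble, then unlock, then mount); this sketch only shows that the initramfs would act on an explicit list instead of on whatever devices happen to appear.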
Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote: Hi, On 07/14/2009 03:39 PM, Doug Ledford wrote: On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote: Hi, As you probably know I'm working on making Fedora 12 use mdraid instead of dmraid for Intel BIOS-RAID setups. The installer (anaconda) part is mostly done (needs more testing) and now I'm looking at implementing support for this in dracut (the new mkinitrd for Fedora 12). So I've been testing how this works for both imsm mdraid sets and native mdraid metadata sets, in both cases using a 2 disk mirror, so that the set can also be brought up in degraded mode. Currently the udev rules use incremental assembly like this: mdadm -I /dev/mdraid-member Hmmm...does dracut use udev during initramfs time? Yes, it uses udev for everything, making discovery of / consistent with the discovery of other storage devices. I'm not sure I like or agree with that philosophy. I absolutely *don't* want my / filesystem or raid device treated like some plug in, temporary, roaming raid device. They *aren't* the same, not in terms of importance to the running of the machine and not in terms of reliability requirements. By using mdadm -A in the mkinitrd calls, I was able to put in an mdadm.conf file and limit what arrays get started to arrays found non-ambiguously in that mdadm.conf file and identified by UUID. When you switch to incremental assembly for root, you risk the possibility of name space collisions and non- deterministic bring up of your / array. Are we going to be totally changing this with dracut and F12? This method very nicely resolves the issues you posted. Yes. Regards, Hans -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford InfiniBand Specific RPMS http://people.redhat.com/dledford/Infiniband PGP.sig Description: This is a digitally signed message part
Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
Hi,

On 07/14/2009 03:39 PM, Doug Ledford wrote:
> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
>> Hi,
>>
>> As you probably know I'm working on making Fedora 12 use mdraid
>> instead of dmraid for Intel BIOS-RAID setups.
>>
>> The installer (anaconda) part is mostly done (needs more testing)
>> and now I'm looking at implementing support for this in dracut
>> (the new mkinitrd for Fedora 12).
>>
>> So I've been testing how this works for both imsm mdraid sets
>> and native mdraid metadata sets, in both cases using a 2 disk
>> mirror, so that the set can also be brought up in degraded mode.
>>
>> Currently the udev rules use incremental assembly like this:
>> mdadm -I /dev/mdraid-member
>
> Hmmm...does dracut use udev during initramfs time?

Yes, it uses udev for everything, making discovery of / consistent
with the discovery of other storage devices.

> Are we going to be totally changing this with dracut and F12?
> This method very nicely resolves the issues you posted.

Yes.

Regards,

Hans
Re: RFC: mdadm and bringing up raid sets from initrd (dracut)
On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
> Hi,
>
> As you probably know I'm working on making Fedora 12 use mdraid
> instead of dmraid for Intel BIOS-RAID setups.
>
> The installer (anaconda) part is mostly done (needs more testing)
> and now I'm looking at implementing support for this in dracut
> (the new mkinitrd for Fedora 12).
>
> So I've been testing how this works for both imsm mdraid sets
> and native mdraid metadata sets, in both cases using a 2 disk
> mirror, so that the set can also be brought up in degraded mode.
>
> Currently the udev rules use incremental assembly like this:
> mdadm -I /dev/mdraid-member

Hmmm...does dracut use udev during initramfs time? mkinitrd didn't, so
this would be a change. In particular, I didn't have these problems
with mkinitrd because I didn't use udev rules in the initrd, I ran
mdadm -A instead. In fact, the F11 method of bringup of raid devices is
as such:

initrd: use mdadm -As --run in /etc/mdadm.conf>

rc.sysinit: use mdadm -As --run (no md device name, which means all
arrays listed in mdadm.conf will get brought up, plus extra arrays not
listed in mdadm.conf but which can be found and identified by metadata)

udev: in 65-md-incremental.rules use mdadm -I (but only if
/dev/.in.rcsysinit does not exist, so we don't run udev incremental
rules until after the system is up and running, which means for hot
plugged devices...in particular we will never run the udev rule on any
device that was present on boot, instead the previous two calls will
catch these devices, and those previous calls will run degraded arrays,
this allows me to safely refuse to run degraded arrays in the udev
rules file without risking failing to boot, instead a degraded hot
plugged array will need minor manual intervention, but the system will
be fully up and operational no matter what)

I find this setup to be a rather safe, conservative way of handling md
raid array hot plug. Are we going to be totally changing this with
dracut and F12? This method very nicely resolves the issues you posted.

> There are 2 problems with this:
> 1) When doing this for native mdraid metadata arrays, if only
>    one disk is present the set never gets activated
> 2) When doing this for imsm metadata arrays, as soon as the
>    first disk is incrementally added, the set gets activated
>    in degraded mode and stays that way, the second disk
>    will get added to the container, but not to the actual
>    sets in the container
>
> And these 2 problems have 2 different solutions:
> 1) An incomplete, but potentially activatable in degraded mode
>    set can be activated using mdadm --run /dev/md#
> 2) One can stop this problem by using:
>    mdadm -I --no-degraded /dev/mdraid-member
>    instead (this does not change anything for native mdraid
>    metadata format sets)
>    But if that is done, the sets in the container never get
>    activated, this can be fixed by running
>    mdadm -I /dev/md#
>    on the container device
>
> So my proposed solution for this is when udev is done scanning
> (when the event queue is empty, detected using the same mechanism
> as dracut is using for dmraid), do the following:
>
> For each /dev/md# run mdadm --export --detail, and get the MD_LEVEL
> if MD_LEVEL == "container":
>     mdadm -I /dev/md#
> else
>     mdadm --run /dev/md#
>
> This will:
> 1) Bring up raid sets inside containers (such as imsm raidsets)
> 2) Bring up incomplete raid sets in degraded mode where possible
>
> I'll post a patch implementing this later today.
>
> Regards,
>
> Hans

--
Doug Ledford
    GPG KeyID: CFBFF194
    http://people.redhat.com/dledford

InfiniBand Specific RPMS
    http://people.redhat.com/dledford/Infiniband
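[Editor's sketch, not part of Doug's message.] The gating Doug describes for the F11 udev rule (skip incremental assembly while a "still booting" flag file exists) can be pictured as below. The function name is hypothetical and the flag-file path is passed in as a parameter, because the thread spells it two ways (/dev/.in_sysinit in the quoted rule, /dev/.in.rcsysinit in Doug's prose). The function only prints what it would do.

```sh
#!/bin/sh
# maybe_incremental FLAG_FILE DEVICE
# While FLAG_FILE exists (early boot), leave the device to the explicit
# "mdadm -As --run" calls; afterwards, hand it to incremental assembly.
maybe_incremental() {
    flag_file=$1
    member=$2
    if [ -f "$flag_file" ]; then
        echo "skip $member"
    else
        echo "mdadm -I $member"
    fi
}

# Made-up paths: the flag file does not exist, so this prints the mdadm call.
maybe_incremental /nonexistent/.in_sysinit /dev/sdb1
```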
RFC: mdadm and bringing up raid sets from initrd (dracut)
Hi,

As you probably know I'm working on making Fedora 12 use mdraid
instead of dmraid for Intel BIOS-RAID setups.

The installer (anaconda) part is mostly done (needs more testing)
and now I'm looking at implementing support for this in dracut
(the new mkinitrd for Fedora 12).

So I've been testing how this works for both imsm mdraid sets
and native mdraid metadata sets, in both cases using a 2 disk
mirror, so that the set can also be brought up in degraded mode.

Currently the udev rules use incremental assembly like this:
mdadm -I /dev/mdraid-member

There are 2 problems with this:
1) When doing this for native mdraid metadata arrays, if only
   one disk is present the set never gets activated
2) When doing this for imsm metadata arrays, as soon as the
   first disk is incrementally added, the set gets activated
   in degraded mode and stays that way, the second disk
   will get added to the container, but not to the actual
   sets in the container

And these 2 problems have 2 different solutions:
1) An incomplete, but potentially activatable in degraded mode
   set can be activated using mdadm --run /dev/md#
2) One can stop this problem by using:
   mdadm -I --no-degraded /dev/mdraid-member
   instead (this does not change anything for native mdraid
   metadata format sets)
   But if that is done, the sets in the container never get
   activated, this can be fixed by running
   mdadm -I /dev/md#
   on the container device

So my proposed solution for this is when udev is done scanning
(when the event queue is empty, detected using the same mechanism
as dracut is using for dmraid), do the following:

For each /dev/md# run mdadm --export --detail, and get the MD_LEVEL
if MD_LEVEL == "container":
    mdadm -I /dev/md#
else
    mdadm --run /dev/md#

This will:
1) Bring up raid sets inside containers (such as imsm raidsets)
2) Bring up incomplete raid sets in degraded mode where possible

I'll post a patch implementing this later today.
Regards,

Hans
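[Editor's sketch, not the patch Hans refers to.] The per-device decision in the proposal above can be written as a small shell function. It reads the `mdadm --detail --export` output for one device on stdin and prints the command the proposal would run next; the function name is made up, and printing instead of executing keeps the sketch safe to try on a machine without any arrays.

```sh
#!/bin/sh
# finish_cmd DEVICE
# stdin: the KEY=value lines printed by "mdadm --detail --export DEVICE".
# stdout: the follow-up command from the proposal for that device.
finish_cmd() {
    md_dev=$1
    level=$(sed -n 's/^MD_LEVEL=//p')
    if [ "$level" = "container" ]; then
        # Container (e.g. imsm): assemble the sets inside it.
        echo "mdadm -I $md_dev"
    else
        # Native array: start it even if it is incomplete.
        echo "mdadm --run $md_dev"
    fi
}

# Canned input standing in for real mdadm output.
printf 'MD_LEVEL=container\nMD_DEVICES=2\n' | finish_cmd /dev/md0
printf 'MD_LEVEL=raid1\nMD_DEVICES=2\n' | finish_cmd /dev/md1
```

On a real system the input would come from something like `mdadm --detail --export "$md" | finish_cmd "$md"` inside a loop over the md device nodes, run once the udev event queue is empty.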